Blankfactor is seeking a Senior Site Reliability Engineer to ensure the reliability, availability, and performance of mission-critical platforms. The role involves designing scalable systems, building robust automation, and collaborating with various teams to deliver high-performing services.
Responsibilities:
- Design and implement solutions that enhance application reliability, performance, scalability, and resilience
- Build and maintain monitoring, alerting, observability, and telemetry to drive proactive detection and rapid incident response
- Lead incident management efforts, perform root cause analysis, and implement action-oriented post-mortem improvements
- Automate operational workflows using scripting, IaC, and configuration management tools
- Analyze capacity, performance, and usage trends to forecast demand and optimize cloud costs
- Collaborate with engineering teams to embed operability, resilience, and security into application and architecture designs
- Support safe, reliable deployments through CI/CD pipelines, release governance, and change control
- Maintain clear runbooks, architecture diagrams, and operational documentation that enable efficient production support
Requirements:
- Managing Kubernetes and containerized workloads (EKS, AKS, GKE), including scaling, networking, upgrades, and orchestration
- Experience with public cloud platforms (AWS, Azure, or GCP) across compute, storage, networking, IAM, and cost governance
- Using observability and APM tools such as Dynatrace, Splunk, Prometheus, Grafana, Datadog, or ExtraHop
- Implementing security and compliance controls in regulated environments (e.g., PCI DSS, SOC 2), including secrets management and vulnerability remediation
- Infrastructure as Code experience using Terraform, CloudFormation, Ansible, or similar tools
- Designing and maintaining CI/CD pipelines using Jenkins, GitLab CI, GitHub Actions, or Azure DevOps
- Scripting and automation using Bash, PowerShell, or Python
- Equivalent combination of education, experience, and/or military background
- Experience on high-volume transaction projects with zero-data-loss requirements is a must, primarily in banking and payments; candidates whose background is in insurance projects are not a fit for this role
- Certifications such as AWS SysOps Administrator, AWS DevOps Engineer, Google Cloud DevOps Engineer, or CKA
- Experience with Premier applications, IBM iSeries, and/or Unisys systems
- Hands-on database operations and performance tuning (Oracle, SQL Server, PostgreSQL)
- Proven experience in major incident command, stakeholder communication, and cross-team coordination
- Experience with ITIL and ServiceNow (change, problem, and configuration management)