BMC Helix is an innovative company focused on redefining enterprise IT through AI and cutting-edge technologies. They are seeking a highly skilled Performance SRE Engineer to ensure the reliability, scalability, and efficiency of their systems while driving proactive performance optimization and cost management initiatives.

Responsibilities:

Maintain uptime of systems and applications within agreed SLOs and reduce Mean Time to Recovery (MTTR)
Design and execute chaos experiments to validate system reliability and fault tolerance
Collaborate with R&D and Operations teams to identify weaknesses and improve system resilience
Analyze system performance metrics and identify bottlenecks across infrastructure and applications
Implement tuning strategies for Kubernetes clusters, workloads, and cloud resources
Drive cloud cost optimization strategies and implement FinOps best practices
Monitor and report on resource utilization and cost trends, ensuring alignment with business goals and objectives
Develop and improve observability across the environment, including proactive alerts derived from hyperscaler and Kubernetes metrics, logs, and traces
Enable dashboards for performance, reliability and FinOps KPIs
Manage and optimize Kubernetes clusters, including upgrades, scaling, and architecture improvements
Implement best practices for container orchestration and workload scheduling
Work closely with DevOps, engineering, and product teams to ensure performance and reliability goals are met
Document processes, standards, and findings for continuous improvement

Requirements:

Strong experience with Kubernetes administration and container orchestration
Hands-on experience with chaos engineering tools and implementation of best practices
Proficiency in cloud platforms (AWS, OCI, or GCP) and cost optimization strategies
Familiarity with performance testing tools (e.g., JMeter, Locust, k6)
Expertise with core DevOps and SRE technologies such as Ansible, Docker, Kubernetes, Helm, Jenkins, and Terraform, including infrastructure as code (IaC) via Terraform
Ability to review recurring incidents, identify improvement and automation opportunities, and collaborate with product feature development teams

Sr Site Reliability Engineer
