BMC Software is a leader in IT service and operations management, focusing on delivering innovative solutions. They are seeking a highly skilled Performance SRE Engineer to ensure the reliability, scalability, and efficiency of systems through proactive optimization and cost management initiatives.

Responsibilities:

Maintain uptime of systems and applications within agreed SLOs and reduce Mean Time to Recovery (MTTR)
Design and execute chaos experiments to validate system reliability and fault tolerance. Collaborate with R&D and Operations teams to identify weaknesses and improve system resilience
Analyze system performance metrics and identify bottlenecks across infrastructure and applications. Implement tuning strategies for Kubernetes clusters, workloads, and cloud resources
Drive cloud cost optimization strategies and implement FinOps best practices. Monitor and report on resource utilization and cost trends, ensuring alignment with business goals and objectives
Develop and improve observability across the environment, including proactive alerts derived from hyperscaler and Kubernetes metrics, logs, and traces. Enable dashboards for performance, reliability, and FinOps KPIs
Manage and optimize Kubernetes clusters, including upgrades, scaling, and architecture improvements. Implement best practices for container orchestration and workload scheduling
Work closely with DevOps, engineering, and product teams to ensure performance and reliability goals are met. Document processes, standards, and findings for continuous improvement

Requirements:

Strong experience with Kubernetes administration and container orchestration
Hands-on experience with chaos engineering tools and implementation of best practices
Proficiency in cloud platforms (AWS, OCI, or GCP) and cost optimization strategies
Familiarity with performance testing tools (e.g., JMeter, Locust, k6)
Expertise with core DevOps and SRE technologies such as Ansible, Docker, Kubernetes, Helm, Jenkins, and Infrastructure as Code (IaC) via Terraform
Ability to review recurring incidents, identify improvement and automation opportunities, and collaborate with product feature development teams

Sr Site Reliability Engineer