BMC Software is a leader in IT service and operations management, focusing on delivering innovative solutions. They are seeking a highly skilled Performance SRE Engineer to ensure the reliability, scalability, and efficiency of systems through proactive optimization and cost management initiatives.
Responsibilities:
- Maintain uptime of systems and applications within agreed SLOs and reduce Mean Time to Recovery (MTTR)
- Design and execute chaos experiments to validate system reliability and fault tolerance. Collaborate with R&D and Operations teams to identify weaknesses and improve system resilience
- Analyze system performance metrics and identify bottlenecks across infrastructure and applications. Implement tuning strategies for Kubernetes clusters, workloads, and cloud resources
- Drive cloud cost optimization strategies and implement FinOps best practices. Monitor and report on resource utilization and cost trends, ensuring alignment with business goals
- Develop and improve observability across the environment, including proactive alerts derived from hyperscaler and Kubernetes metrics, logs, and traces. Enable dashboards for performance, reliability and FinOps KPIs
- Manage and optimize Kubernetes clusters, including upgrades, scaling, and architecture improvements. Implement best practices for container orchestration and workload scheduling
- Work closely with DevOps, engineering, and product teams to ensure performance and reliability goals are met. Document processes, standards, and findings for continuous improvement
Requirements:
- Strong experience with Kubernetes administration and container orchestration
- Hands-on experience with chaos engineering tools and implementation of best practices
- Proficiency in cloud platforms (AWS, OCI, or GCP) and cost optimization strategies
- Familiarity with performance testing tools (e.g., JMeter, Locust, k6)
- Expertise with core DevOps and SRE technologies such as Ansible, Docker, Kubernetes, Helm, Jenkins, and Terraform (infrastructure as code)
- Ability to review recurring incidents, identify improvement and automation opportunities, and collaborate with product feature development teams