BMC Helix is the AI-native engine behind forward-thinking IT organizations, focusing on the economics of enterprise IT. The role of Performance SRE Engineer involves ensuring system reliability, scalability, and efficiency through proactive performance optimization, chaos engineering, and cost management initiatives.
Responsibilities:
- Maintain uptime of systems and applications within agreed SLOs and reduce Mean Time to Recovery (MTTR)
- Design and execute chaos experiments to validate system reliability and fault tolerance. Collaborate with R&D and Operations teams to identify weaknesses and improve system resilience
- Analyze system performance metrics and identify bottlenecks across infrastructure and applications. Implement tuning strategies for Kubernetes clusters, workloads, and cloud resources
- Drive cloud cost optimization strategies and implement FinOps best practices. Monitor and report on resource utilization and cost trends, ensuring alignment with business objectives
- Develop and improve observability across the environment, including proactive alerts derived from hyperscaler and Kubernetes metrics, logs, and traces. Enable dashboards for performance, reliability and FinOps KPIs
- Manage and optimize Kubernetes clusters, including upgrades, scaling, and architecture improvements. Implement best practices for container orchestration and workload scheduling
- Work closely with DevOps, engineering, and product teams to ensure performance and reliability goals are met. Document processes, standards, and findings for continuous improvement
Requirements:
- Strong experience with Kubernetes administration and container orchestration
- Hands-on experience with chaos engineering tools and with implementing chaos engineering best practices
- Proficiency in cloud platforms (AWS, OCI, or GCP) and cost optimization strategies
- Familiarity with performance testing tools (e.g., JMeter, Locust, k6)
- Expertise with core DevOps and SRE technologies such as Ansible, Docker, Kubernetes, Helm, Jenkins, and Terraform for Infrastructure as Code (IaC)
- Experience reviewing recurring incidents, identifying improvement and automation opportunities, and collaborating with product feature development teams