BMC Helix is an innovative company focused on redefining enterprise IT through AI and cutting-edge technologies. They are seeking a highly skilled Performance SRE Engineer to ensure the reliability, scalability, and efficiency of their systems while driving proactive performance optimization and cost management initiatives.
Responsibilities:
- Maintain uptime of systems and applications within agreed SLOs and reduce Mean Time to Recovery (MTTR)
- Design and execute chaos experiments to validate system reliability and fault tolerance
- Collaborate with R&D and Operations teams to identify weaknesses and improve system resilience
- Analyze system performance metrics and identify bottlenecks across infrastructure and applications
- Implement tuning strategies for Kubernetes clusters, workloads, and cloud resources
- Drive cloud cost optimization strategies and implement FinOps best practices
- Monitor and report on resource utilization and cost trends, ensuring alignment with business goals
- Develop and improve observability across the environment, including proactive alerts derived from hyperscaler and Kubernetes metrics, logs, and traces
- Build dashboards for performance, reliability, and FinOps KPIs
- Manage and optimize Kubernetes clusters, including upgrades, scaling, and architecture improvements
- Implement best practices for container orchestration and workload scheduling
- Work closely with DevOps, engineering, and product teams to ensure performance and reliability goals are met
- Document processes, standards, and findings for continuous improvement
Requirements:
- Strong experience with Kubernetes administration and container orchestration
- Hands-on experience with chaos engineering tools and with implementing chaos engineering best practices
- Proficiency in cloud platforms (AWS, OCI, or GCP) and cost optimization strategies
- Familiarity with performance testing tools (e.g., JMeter, Locust, k6)
- Expertise with core DevOps and SRE technologies such as Ansible, Docker, Kubernetes, Helm, Jenkins, and Terraform (infrastructure as code)
- Experience reviewing recurring incidents, identifying improvement and automation opportunities, and collaborating with product feature development teams