General Motors is seeking an experienced Staff Site Reliability Engineer to join its Vehicle Security Platforms team. The role involves shaping the reliability of next-generation vehicle security platforms and driving strategies for system hardening, high availability, and operational scalability.
Responsibilities:
- Implement and evolve secure, highly available, and globally distributed systems powering GM’s vehicle security platforms
- Own reliability roadmaps, establishing frameworks and strategies for system hardening, high availability, disaster recovery, and operational scalability
- Develop automation-first solutions to eliminate operational toil, with advanced use of languages such as Python, Go, and Java
- Lead incident response, driving systematic elimination of failure modes through blameless postmortems, PRRs, and cross-team preventative initiatives
- Drive observability strategies with best-in-class practices for metrics, logging, and distributed tracing, using Prometheus, Datadog, or similar stacks
- Partner with engineering, platform, and security teams to design for reliability from inception, influencing architecture reviews and CI/CD best practices
- Lead optimization, capacity planning, and performance-tuning strategies for large-scale, security-critical platforms
- Introduce modern SRE practices such as chaos engineering, resilience testing, and progressive delivery, along with SLOs, SLIs, and SLAs, to validate support teams and evolve system safety
- Mentor engineers across disciplines on SRE, platform resilience, secure operational practices, and architectural trade-offs
- Evaluate and adopt technologies (open-source, enterprise, homegrown) for security and reliability at scale
- Influence product strategy in partnership with engineering leads, ensuring operational reliability is prioritized alongside customer and business outcomes
Requirements:
- 7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure/platform roles supporting secure, scalable systems
- Proven expertise in designing and scaling cloud infrastructure (Azure) and container orchestration systems (Kubernetes, Docker)
- Demonstrated mastery of infrastructure-as-code frameworks (Terraform, Helm, CloudFormation, etc.)
- Proficiency in Python and one JVM language (Java or Kotlin), and working knowledge of Go
- Deep architectural understanding of distributed systems, networking, system design, and large-scale security practices
- Track record of architecting and running zero-downtime systems in production
- Experience with modern monitoring and reliability tooling and frameworks (Prometheus, Datadog, OpenTelemetry, etc.)
- Experience leading incident response, uptime SLO/SLA management, and operational excellence initiatives across multiple teams
- Capable of influencing architecture and product strategy while maintaining a hands-on approach to systems reliability
- Exceptional communication skills, able to present complex trade-offs and foster alignment across executive, product, and engineering stakeholders
- BS/MS/PhD in Computer Science, Engineering, or equivalent industry experience
- Deep understanding of encryption technologies, secure data handling practices, and identity management
- Experience designing and operating IoT or automotive-focused architectures with rigorous availability and safety requirements
- Direct experience in chaos engineering, game-day testing, disaster recovery orchestration, and production load testing
- Ability to grow and mentor engineers into leaders in their domain, building SRE teams that can operate independently at scale
- Demonstrated success in defining and executing reliability strategies with measurable business impact
- Strong product mindset with the ability to balance engineering excellence with speed and business priorities