General Motors is seeking an Engineering Manager for its AI Cloud and Developer Infrastructure organization, which focuses on enhancing development tools for engineers. The role involves leading a team, defining the technical strategy for observability, and ensuring comprehensive visibility across GM’s AV software stack.
Responsibilities:
- Manage and grow a team of engineers, conducting performance reviews, providing coaching, and supporting career development
- Define and execute the technical vision and roadmap for the observability platform, ensuring it provides actionable insights into complex systems
- Provide technical guidance on instrumentation, logging, metrics, and tracing to ensure comprehensive visibility across GM’s AV software stack
- Ensure the team's tools enable rapid detection, debugging, and resolution of unknown or unforeseen system failures to minimize downtime
- Work with other engineering teams—such as those developing AI/ML, firmware, and infrastructure—to implement observability practices and improve system reliability
- Lead the development of internal tools and data pipelines to collect, analyze, and visualize telemetry data at massive scale
- Manage relationships and costs associated with third-party observability software and platforms
Requirements:
- Leadership experience: 5+ years of experience leading software or site reliability engineering (SRE) teams and balancing the tradeoff between velocity and reliability
- Bachelor's degree in Computer Science or a related field, or equivalent work experience
- Observability expertise: Deep understanding of the core observability pillars (logs, metrics, and traces). Experience with technologies like Prometheus, Grafana, OpenTelemetry, and log management systems is crucial
- Software architecture: Strong background in designing, developing, and architecting distributed systems, cloud-native applications, and microservices
- Programming proficiency: Familiarity with Go, Python, TypeScript, or similar languages, along with software development practices, to inform code reviews and architectural decisions
- Cloud infrastructure: Experience with modern cloud offerings like GCP, AWS, or Azure and technologies like CI/CD pipelines, Kubernetes, and Docker
- Communication skills: Excellent interpersonal and communication skills to collaborate effectively with diverse teams and stakeholders
- Management experience: 3+ years of experience managing software engineering or site reliability engineering (SRE) teams
- Experience working with GCP, AWS, or Azure
- Familiarity with Kubernetes, Docker, Istio, Terraform, Prometheus, Grafana, TSDBs, and observability pipelines (e.g., for logging, metrics, or tracing)
- Skilled in defining and instrumenting SLIs and SLOs
- Ownership of or contributions to open source projects
- Passion for self-driving technology and its potential impact on the world