Ford Motor Company, a global leader in the automotive industry, is seeking an SRE Sr Engineer/Specialist to develop and enhance its global monitoring and observability platform. The role blends AI with software engineering to ensure the uptime and scalability of critical cloud services and to drive the adoption of monitoring capabilities.

Responsibilities:

Write, configure, and deploy code that improves service reliability for existing or new systems; set the standard for others with respect to code quality
Provide helpful and actionable feedback and review for code or production changes
Drive repair/optimization of complex systems with consideration towards a wide range of contributing factors
Lead debugging, troubleshooting, and analysis of service architecture and design
Participate in on-call rotation
Write documentation, including designs, system analyses, runbooks, and playbooks. Provide design feedback and help others improve their design skills
Implement and manage SRE monitoring applications using AI, Python, and Observability data
Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms
Work within GCP infrastructure, optimizing performance and cost and scaling resources to meet demand
Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks
Develop and maintain AI-enhanced automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery
Troubleshoot and resolve issues in our dev, test, and production environments
Participate in postmortem analysis and create preventative measures for future incidents
Implement and maintain security best practices across our infrastructure, ensuring compliance with industry standards and internal policies. Participate in security audits and vulnerability assessments
Participate in capacity planning and forecasting efforts to ensure our systems can handle future growth and demand. Analyze trends and make recommendations for resource allocation
Identify and address performance bottlenecks through code profiling, system analysis, and configuration tuning. Implement and monitor performance metrics to proactively identify and resolve issues
Develop, maintain, and test disaster recovery plans and procedures to ensure business continuity in the event of a major outage or disaster. Participate in regular disaster recovery exercises
Contribute to internal knowledge bases and documentation

Requirements:

Bachelor's degree in Computer Science, Engineering, Mathematics or equivalent work experience
3+ years of experience as an SRE, DevOps Engineer, Software Engineer, or in a similar role
Strong experience with Python development; familiarity with Terraform provider development is desired
Proficient with monitoring and observability tools
Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience
Solid programming skills in Python, with a good understanding of software development best practices
Experience with relational and document databases
Ability to debug, optimize code, and automate routine tasks
Strong problem-solving skills and the ability to work under pressure in a fast-paced environment
Excellent verbal and written communication skills
Agentic AI and MCP development experience
Experience with Dynatrace SaaS
