The College Board is dedicated to expanding educational opportunities and is seeking a Lead Engineer for its Enterprise Incident Management (EIM) team. This role involves minimizing the impact of incidents on business operations and ensuring effective change management through automation and collaboration.

Responsibilities:

Evaluate incident and change management frameworks using data-driven insights to identify opportunities for improvement that will provide value to the EIM team and engineering teams
Design and implement automation solutions for incident response and management, change management, and observability leveraging input and feedback from domain SMEs and end users
Develop and maintain scripts, tools, and integrations to reduce manual processes and operational overhead
Define key performance indicators (KPIs) and metrics to measure the success of automation and improvement efforts, and develop and enhance dashboards and reporting mechanisms that track those KPIs as well as incident and change management performance
Ensure compliance with governance, risk, and change control policies while promoting agility and innovation
Lead cross-functional initiatives and partner with domain SMEs (delivery team software engineers, security, infrastructure, network, observability, and operations) to analyze, design, and deliver powerful features, capabilities, and automation strategies that align with engineering best practices
Serve as a subject matter expert (SME) for cloud operations, infrastructure automation, and CI/CD pipelines
Collaborate with the EIM team’s director and other technology leaders to understand business objectives and team goals and to align solutions and process improvement efforts with those goals
Contribute to the long-term technology strategy by researching emerging trends, evaluating new tools (especially AI-driven tools that support observability), and recommending technologies or automations that improve cost-effectiveness, the delivery of performance metrics, and system and process efficiency
Participate in weekly on-call and incident response rotations, which involve monitoring alerts to identify potential issues, ensuring timely triage and escalation of incidents, collaborating with impacted teams, and supporting assessment, response, and communication to bring each incident to resolution
Play an active role in agile scrum ceremonies (e.g., sprint planning, grooming, daily scrum meetings) while contributing to high-quality team deliverables
Provide technical direction and guidance to team members, ensuring alignment with architectural standards, best practices, and organizational objectives
Review designs, automation scripts, and implementation plans, offering constructive feedback to improve quality, efficiency, and maintainability
Foster a culture of continuous learning and collaboration by mentoring engineers in modern automation, cloud infrastructure, and operational excellence

Requirements:

7+ years of software development experience with Infrastructure as Code (IaC), CI/CD frameworks, immutable infrastructure, automation, orchestration, and other modern DevOps patterns
Strong proficiency in IaC tools (e.g., Terraform, CloudFormation, Ansible); experience with CI/CD pipeline design and automation using platforms such as Jenkins, GitLab CI, or GitHub Actions is a plus
Strong knowledge and experience with distributed cloud infrastructure, including AWS resources such as Lambda, SNS, SQS, S3, Step Functions, EC2, ECS, VPC, IAM, CloudWatch, and DynamoDB
Experience building event-driven cloud-based serverless applications, with technical knowledge of cloud computing, DevOps, and microservices
Strong coding and scripting experience for automation and integration tasks using languages and frameworks such as JavaScript, TypeScript, React.js, and Node.js, plus proficiency in scripting languages (e.g., Python, Bash, PowerShell)
Familiarity with AI tools used for observability (e.g., AWS Resilience Hub)
Familiarity with incident and change management systems (e.g., Jira Service Management)
Deep understanding of ITIL frameworks, especially incident, change, and problem management
Experience integrating monitoring and alerting tools (e.g., Datadog, Prometheus, CloudWatch, Grafana)
Strong troubleshooting, analytical, and problem-solving skills
Proven ability to lead technical initiatives, influence cross-functional teams, and prioritize and execute tasks in a high-pressure environment
Excellent communication skills, with the ability to translate technical details into business outcomes
Ability to take a week-long on-call shift every month and a half
Authorization to work in the U.S.
A passion for expanding educational and career opportunities and mission-driven work
Curiosity and enthusiasm for emerging technologies, with a willingness to experiment with and adopt new AI-driven solutions and a comfort learning and applying new digital tools independently and proactively
Clear and concise communication skills, written and verbal
A learner's mindset and a commitment to growth: welcoming diverse perspectives, giving and receiving timely, respectful feedback, and continuously improving through iterative learning and user input
A drive for impact and excellence: solving complex problems, making data-informed decisions, prioritizing what matters most, and continuously improving through learning, user input, and external benchmarking
A collaborative and empathetic approach: working across differences, fostering trust, and contributing to a culture of shared success

Lead Engineer, Enterprise Incident & Change Management
