The College Board is dedicated to expanding educational opportunities and is seeking a Lead Engineer for its Enterprise Incident Management (EIM) team. This role involves minimizing the impact of incidents on business operations and ensuring effective change management through automation and collaboration.
Responsibilities:
- Evaluate incident and change management frameworks using data-driven insights to identify opportunities for improvement that will provide value to the EIM team and engineering teams
- Design and implement automation solutions for incident response and management, change management, and observability leveraging input and feedback from domain SMEs and end users
- Develop and maintain scripts, tools, and integrations to reduce manual processes and operational overhead
- Define key performance indicators (KPIs) and metrics to measure the success of automation and improvement efforts, and develop and enhance dashboards and reporting mechanisms that track those KPIs as well as incident and change management performance
- Ensure compliance with governance, risk, and change control policies while promoting agility and innovation
- Lead cross-functional initiatives and partner with domain SMEs (delivery team software engineers, security, infrastructure, network, observability, and operations) to analyze, design, and deliver powerful features, capabilities, and automation strategies that align with engineering best practices
- Serve as a subject matter expert (SME) for cloud operations, infrastructure automation, and CI/CD pipelines
- Collaborate with the EIM team’s director and other technology leaders to understand business objectives and team goals and to align solutions and process improvement efforts with those goals
- Contribute to the long-term technology strategy by researching emerging trends, evaluating new tools (especially AI-driven tools that support observability), and recommending technologies or automations that improve cost-effectiveness, the delivery of performance metrics, and system and process efficiency
- Participate in weekly on-call and incident response rotations, which involve monitoring alerts to identify potential issues, ensuring timely triage and escalation of incidents, collaborating with impacted teams, and supporting assessment, response, and communication to bring each incident to resolution
- Play an active role in agile scrum ceremonies (e.g., sprint planning, grooming, daily scrum meetings) while contributing to high-quality team deliverables
- Provide technical direction and guidance to team members, ensuring alignment with architectural standards, best practices, and organizational objectives
- Review designs, automation scripts, and implementation plans, offering constructive feedback to improve quality, efficiency, and maintainability
- Foster a culture of continuous learning and collaboration by mentoring engineers in modern automation, cloud infrastructure, and operational excellence
Requirements:
- 7+ years of software development experience with Infrastructure as Code (IaC), CI/CD frameworks, immutable infrastructure, automation, orchestration, and other modern DevOps patterns
- Strong proficiency in IaC tools (e.g., Terraform, CloudFormation, Ansible); experience with CI/CD pipeline design and automation using platforms such as Jenkins, GitLab CI, or GitHub Actions is a plus
- Strong knowledge and experience with distributed cloud infrastructure, including AWS resources such as Lambda, SNS, SQS, S3, Step Functions, EC2, ECS, VPC, IAM, CloudWatch, and DynamoDB
- Experience building event-driven cloud-based serverless applications, with technical knowledge of cloud computing, DevOps, and microservices
- Strong coding/scripting experience for automation and integration tasks using languages and frameworks such as JavaScript, TypeScript, React.js, and Node.js, and proficiency in scripting languages (Python, Bash, PowerShell, etc.)
- Familiarity with AI tools used for observability (e.g., AWS Resilience Hub)
- Familiarity with incident and change management systems (e.g., Jira Service Management)
- Deep understanding of ITIL frameworks, especially incident, change, and problem management
- Experience integrating monitoring and alerting tools (e.g., Datadog, Prometheus, CloudWatch, Grafana)
- Strong troubleshooting, analytical, and problem-solving skills
- Proven ability to lead technical initiatives, influence cross-functional teams, and prioritize and execute tasks in a high-pressure environment
- Excellent communication skills, with the ability to translate technical details into business outcomes
- Ability to take a week-long on-call shift roughly every month and a half
- Authorization to work in the U.S.
- A passion for mission-driven work and for expanding educational and career opportunities
- Curiosity and enthusiasm for emerging technologies, with a willingness to experiment with and adopt new AI-driven solutions and comfort with learning and applying new digital tools independently and proactively
- Clear and concise communication skills, written and verbal
- A learner's mindset and a commitment to growth: welcoming diverse perspectives, giving and receiving timely, respectful feedback, and continuously improving through iterative learning and user input
- A drive for impact and excellence: solving complex problems, making data-informed decisions, prioritizing what matters most, and continuously improving through learning, user input, and external benchmarking
- A collaborative and empathetic approach: working across differences, fostering trust, and contributing to a culture of shared success