DraftKings Inc. is a technology company at the forefront of integrating AI into its operations to enhance customer experiences and streamline processes. As a Lead Site Reliability Engineer, you will drive initiatives to improve infrastructure reliability and efficiency, mentor engineers, and collaborate across teams to implement automation and monitoring strategies.

Responsibilities:

Lead SRE initiatives across multiple projects and products, collaborating with cross-functional teams to shape platform and infrastructure engineering efforts across the organization
Drive technical excellence by mentoring and guiding engineers, fostering a culture of continuous learning and innovation
Architect and automate self-healing, fault-tolerant infrastructure using declarative configurations, GitOps, and event-driven automation for scalable deployments across public clouds and on-premises environments
Design, develop, and maintain software-driven infrastructure automation to build internal tools and eliminate repetitive operational tasks
Own and drive decisions on product deployment, performance tuning, monitoring, and alerting to ensure high availability and system efficiency in production
Define key metrics and SLAs for new web services built to support our rapid traffic growth
Design and implement monitoring and alerting strategies to enforce application SLAs

Requirements:

At least 6 years of experience managing distributed cloud environments (GCP, AWS, vSphere, Nutanix) and platform automation at scale
Deep expertise in container orchestration (Kubernetes) and container runtimes (Docker, containerd), with the ability to design, scale, and troubleshoot complex workloads
Expert-level understanding of networking and web concepts, with the ability to debug issues down to the packet level
Strong experience developing software for automation and infrastructure tooling (Go, Python)
Strong understanding of Linux-based operating systems, including performance tuning, bootloaders, storage, partitioning, kernel debugging, and low-level system optimizations
Experience with Infrastructure as Code (IaC) and configuration management tools (Terraform, Ansible, Chef, etc.), ensuring scalable and repeatable infrastructure provisioning
Understanding of applications written in various programming languages (C#/.NET, Java, Elixir, Ruby, etc.)
Experience with AWS IoT Greengrass device management and A/B booting
