DraftKings Inc. is a technology company at the forefront of integrating AI into its operations to enhance customer experiences and streamline processes. As a Lead Site Reliability Engineer, you will drive initiatives to improve infrastructure reliability and efficiency, mentor engineers, and collaborate across teams to implement automation and monitoring strategies.
Responsibilities:
- Lead SRE initiatives across multiple projects and products, collaborating with cross-functional teams to shape platform and infrastructure engineering efforts across the organization
- Drive technical excellence by mentoring and guiding engineers, fostering a culture of continuous learning and innovation
- Architect and automate self-healing, fault-tolerant infrastructure using declarative configurations, GitOps, and event-driven automation for scalable deployments across public clouds and on-premises environments
- Design, develop, and maintain software-driven infrastructure automation to build internal tools and eliminate repetitive operational tasks
- Own and drive decisions on product deployment, performance tuning, monitoring, and alerting to ensure high availability and system efficiency in production
- Define key metrics and SLAs for new web services built to support our rapid traffic growth
- Design and implement monitoring and alerting strategies to enforce application SLAs
Requirements:
- At least 6 years of experience managing distributed cloud environments (GCP, AWS, vSphere, Nutanix) and platform automation at scale
- Deep expertise in container orchestration (Kubernetes) and container runtimes (Docker, containerd), with the ability to design, scale, and troubleshoot complex workloads
- Expert-level understanding of networking and web concepts, with the ability to debug issues down to the packet level
- Strong experience developing software for automation and infrastructure tooling (Go, Python)
- Strong understanding of Linux-based operating systems, including performance tuning, bootloaders, storage, partitioning, kernel debugging, and low-level system optimizations
- Experience with Infrastructure as Code (IaC) and configuration management tools (Terraform, Ansible, Chef, etc.), ensuring scalable and repeatable infrastructure provisioning
- Understanding of applications written in various programming languages (C#/.NET, Java, Elixir, Ruby, etc.)
- Experience with AWS IoT Greengrass management and A/B booting