DataHub is an AI & Data Context Platform adopted by over 3,000 enterprises, including Apple and Netflix. The company is seeking a Senior Software Engineer to lead technical initiatives for DataHub Cloud and to enhance the reliability and scalability of its platform offerings.
Responsibilities:
- Design and implement robust, scalable infrastructure solutions for DataHub Cloud and enterprise deployments
- Lead the technical vision for multi-cloud deployment strategies and distributed system integrations
- Architect monitoring, observability, and alerting systems across diverse environments
- Drive best practices for infrastructure as code, configuration management, and deployment automation
- Partner with product and engineering teams to shape the development of advanced deployment capabilities
- Collaborate with cross-functional teams to build systems that enable seamless installation, upgrade, and rollback across diverse environments
- Shape the design and implementation of comprehensive monitoring and health-check systems for distributed deployments
- Work with engineering teams to develop self-healing and automated remediation capabilities
- Establish and maintain SLAs/SLOs for both cloud and enterprise offerings
- Lead incident response and post-mortem processes to drive continuous improvement
- Implement chaos engineering practices to proactively identify system weaknesses
- Optimize system performance, capacity planning, and cost efficiency
- Mentor and guide a team of site reliability engineers and collaborate with platform engineering teams
- Work closely with product, engineering, and customer success teams to ensure reliable product delivery
- Improve on-call practices, runbooks, and knowledge sharing processes
- Drive cross-functional initiatives to improve overall system reliability
Requirements:
- 8+ years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles
- 3+ years of technical leadership experience managing engineering teams
- Strong expertise with cloud platforms (AWS, GCP, Azure) and infrastructure automation tools
- Proficiency in containerization technologies (Docker, Kubernetes) and orchestration
- Experience with infrastructure as code tools (Terraform, CloudFormation, Pulumi)
- Strong programming skills in Python, Java, or similar languages
- Deep understanding of monitoring and observability tools (Prometheus, Grafana, Datadog, etc.)
- Experience with CI/CD pipelines and deployment automation
- Strong knowledge of networking, security, and database operations in cloud environments
- Experience building and operating multi-tenant SaaS platforms
- Background in developing customer-facing deployment and management tools
- Knowledge of data infrastructure and metadata management systems
- Experience with service mesh technologies and microservices architectures
- Previous experience in a customer-facing technical role or working with enterprise clients
- Experience with data governance or data catalog platforms