Caris Life Sciences is a company dedicated to transforming cancer care through precision medicine and innovative solutions. The company is seeking a Staff DevOps Engineer to design, implement, and optimize cloud-native infrastructure, focusing on AWS and Kubernetes, while providing technical leadership and driving operational excellence across the organization.
Responsibilities:
- Lead the design, implementation, and management of Kubernetes clusters on AWS EKS, ensuring high availability, scalability, and security
- Implement and manage advanced features including autoscaling, monitoring, logging, and security policies
- Spearhead proof-of-concept (PoC) initiatives for new tools and environments, evaluating their potential benefits for the organization
- Manage the full lifecycle of Kubernetes clusters, including regular upgrades, patch management, version control, and performance optimization
- Provide expert-level support and guidance to teams for deploying and optimizing applications on Kubernetes, including container orchestration and service mesh implementation
- Design and implement monitoring and alerting solutions for applications and infrastructure using CloudWatch, Prometheus, and Datadog
- Develop observability standards and dashboards, leveraging AI/AIOps approaches and SRE agents to enable anomaly detection, alert noise reduction, and automated root cause analysis
- Develop and maintain Infrastructure as Code (IaC) using tools such as Terraform or AWS CDK, and implement CI/CD pipelines for efficient application deployment and image management
- Design and implement security solutions, including the deployment and management of security tools, and translate SOX compliance requirements into actionable implementation plans for cloud environments
- Lead initiatives for cloud migration and modernization of legacy applications, collaborating with cross-functional teams to support their cloud and infrastructure needs
- Provide technical leadership and mentorship to junior engineers on cloud technologies and DevOps practices, implementing knowledge-sharing initiatives to ensure broad support capabilities across the team
- Stay current with emerging AWS services and features, evaluating their potential benefits and optimizing cloud resource utilization and cost-efficiency
- Develop and maintain comprehensive documentation, including a team knowledge base, runbooks, and process documentation to eliminate information silos
- Proactively identify areas of inefficiency and develop strategic plans for process improvements across the DevOps and cloud infrastructure landscape
- Participate in on-call rotations to support critical cloud infrastructure and respond to emergency issues as needed
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or related field
- 7+ years of experience in DevOps or Site Reliability Engineering roles
- 5+ years of hands-on experience with AWS services and cloud architecture
- 5+ years of hands-on experience with Kubernetes, including deep expertise in cluster management, troubleshooting, and optimization
- Strong proficiency in at least one programming language (e.g., Python, Go, Java)
- Extensive experience with Infrastructure as Code tools (e.g., Terraform, CloudFormation, AWS CDK)
- Deep understanding of containerization technologies (Docker) and orchestration platforms (Kubernetes), including security best practices
- Experience with CI/CD tools and methodologies, particularly GitLab CI and GitHub Actions
- Strong knowledge of networking concepts and implementation in cloud environments
- Excellent problem-solving skills and ability to troubleshoot complex systems
- Proven ability to lead PoC initiatives and evaluate new technologies
- Demonstrated experience in creating and maintaining technical documentation and knowledge bases
- Demonstrated ability to identify operational inefficiencies and develop strategic plans for process improvements in complex cloud and DevOps environments
- Strong analytical skills with the ability to translate technical insights into actionable business recommendations
- Strong communication and mentoring skills, with the ability to effectively transfer knowledge to team members of varying experience levels
- Proficiency in the Microsoft Office suite, specifically Word, Excel, and Outlook, and a general working knowledge of the internet for business use
- AWS Professional-level certifications (e.g., Solutions Architect Professional, DevOps Engineer Professional)
- Kubernetes certifications (e.g., CKA, CKAD, CKS)
- Experience with multiple cloud platforms (e.g., AWS, GCP) for multi-cloud architectures
- Knowledge of database technologies, including MySQL, PostgreSQL, and DynamoDB
- Proficiency with specific monitoring and observability tools such as Prometheus, Grafana, and the ELK Stack
- Familiarity with serverless architectures and microservices
- Hands-on experience with configuration management tools (e.g., Ansible, Chef, Puppet)
- Experience in implementing knowledge management systems or tools in a DevOps environment
- Contributions to open-source projects or personal projects demonstrating cloud expertise