Position: Site Reliability Engineer
Location: Santa Clara, CA
Duration: 12 Months
Responsibilities
Design, operate, and scale a multi-cluster Kubernetes platform that provisions machines, workloads, and cloud instances on demand
Build secure and resilient platform microservices, including continuous integration and delivery, single sign-on, role-based access control, secret encryption, and monitoring workflows
Integrate artificial intelligence tooling into workloads and develop agents and tools that help SRE teams operate at scale
Own the production release lifecycle, including Helm-based deployments, multi-architecture container builds, staged rollouts, and rollback strategies
Implement audit logging, usage analytics, and automation to support a large and growing internal user base
Partner closely with software engineering and infrastructure teams to deploy new products and manage supporting systems and processes
Required Qualifications
Six or more years of DevOps or Site Reliability Engineering experience supporting production Kubernetes environments in cloud or on-premises deployments
Strong experience with Kubernetes custom resource definitions, operators, ingress, and cluster networking
Experience integrating artificial intelligence tools into engineering or operational workflows
Strong programming skills in Python or Go, with working knowledge of TypeScript and React
Hands-on experience with cloud provisioning on AWS or an equivalent platform
Experience with identity federation protocols, including OIDC and SAML, and with secret management solutions
Solid understanding of relational databases, caching systems, and asynchronous networking patterns
Bachelor's or Master's degree in Computer Science, or equivalent professional experience
Demonstrated success delivering internal developer platforms and CI/CD pipelines
Preferred Qualifications
Experience working in Linux-based multi-tenant environments, including virtual machines and Kubernetes administration
Deep experience building agent-driven workflows, developer tools, command-line interfaces, and machine control platforms
Experience developing AI-assisted operational tooling, such as automated runbooks, anomaly detection, or large language model-driven workflows
Hands-on experience with CI tools such as Jenkins or GitLab CI and CD tools such as Argo or Flux
Experience with monitoring and observability tools, including Prometheus, Grafana, VictoriaMetrics, Datadog, Splunk, or Kibana
Strong documentation practices with a focus on root cause analysis and reusable runbooks