Position: Site Reliability Engineer

Location: Santa Clara, CA

Duration: 12 Months

Responsibilities

Design, operate, and scale a multi-cluster Kubernetes platform that provisions machines, workloads, and cloud instances on demand

Build secure and resilient platform microservices, including continuous integration and delivery, single sign-on, role-based access control, secret encryption, and monitoring workflows

Integrate artificial intelligence tooling into workloads, and develop agents and tools that help SRE teams operate at scale

Own the production release lifecycle, including Helm-based deployments, multi-architecture container builds, staged rollouts, and rollback strategies

Implement audit logging, usage analytics, and automation to support a large and growing internal user base

Partner closely with software engineering and infrastructure teams to deploy new products and manage supporting systems and processes

Required Qualifications

Six or more years of DevOps or Site Reliability Engineering experience supporting production Kubernetes environments in cloud or on-premises deployments

Strong experience with Kubernetes custom resource definitions, operators, ingress, and cluster networking

Experience integrating artificial intelligence tools into engineering or operational workflows

Strong programming skills in Python or Go, with working knowledge of TypeScript and React

Hands-on experience with cloud provisioning on AWS or an equivalent platform

Experience with identity federation protocols, including OIDC and SAML, and with secret management solutions

Solid understanding of relational databases, caching systems, and asynchronous networking patterns

Bachelor's or Master's degree in Computer Science, or equivalent professional experience

Demonstrated success delivering internal developer platforms and CI/CD pipelines

Preferred Qualifications

Experience working in Linux-based, multi-tenant environments, including virtual machines and Kubernetes administration

Deep experience building agent-driven workflows, developer tools, command-line interfaces, and machine control platforms

Experience developing AI-assisted operational tooling, such as automated runbooks, anomaly detection, or workflows driven by large language models

Hands-on experience with CI tools such as Jenkins or GitLab CI, and CD tools such as Argo or Flux

Experience with monitoring and observability tools, including Prometheus, Grafana, VictoriaMetrics, Datadog, Splunk, or Kibana

Strong documentation practices with a focus on root cause analysis and reusable runbooks
