Location: Austin, Texas

Hybrid

Job Description:

We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) / DevOps Engineer to join our team. The ideal candidate will have a strong background in systems engineering, cloud infrastructure, and distributed systems, with a focus on reliability, scalability, and performance of production environments.

Key Responsibilities:

Design, build, and maintain highly available, scalable, and distributed systems.
Manage and optimize cloud-based infrastructure (AWS or Google Cloud Platform).
Implement and maintain CI/CD pipelines and DevOps best practices.
Monitor system performance, availability, and reliability using modern observability tools.
Define and manage SLIs, SLOs, and error budgets to ensure service reliability.
Handle incident management, perform root cause analysis (RCA), and drive postmortems.
Automate infrastructure and operational processes using scripting and programming languages.
Work with containerization and orchestration tools like Docker and Kubernetes.
Integrate security and compliance into system architecture and workflows.
Develop and maintain documentation including runbooks, dashboards, and operational standards.

Required Qualifications:

8+ years of experience in Systems Engineering, DevOps, or Site Reliability Engineering (SRE) roles.
Strong expertise in Linux/Unix systems and system internals.
Proficiency in at least one programming/scripting language (Python, Go, Java, or Bash).
Experience designing and operating highly available distributed systems.
Hands-on experience with cloud platforms (AWS or Google Cloud Platform) and cloud-native services.
Experience with Docker and Kubernetes.
Strong understanding of monitoring, alerting, and logging frameworks.
Experience managing SLIs, SLOs, and error budgets.
Knowledge of incident management, RCA, and postmortem practices.
Experience incorporating security and compliance into DevOps workflows.
