Location: Austin, Texas
Work Arrangement: Hybrid
Job Description:
We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) / DevOps Engineer to join our team. The ideal candidate will have a strong background in systems engineering, cloud infrastructure, and distributed systems, with a focus on the reliability, scalability, and performance of production environments.
Key Responsibilities:
- Design, build, and maintain highly available, scalable, and distributed systems.
- Manage and optimize cloud-based infrastructure (AWS or Google Cloud Platform).
- Implement and maintain CI/CD pipelines and DevOps best practices.
- Monitor system performance, availability, and reliability using modern observability tools.
- Define and manage SLIs, SLOs, and error budgets to ensure service reliability.
- Handle incident management, perform root cause analysis (RCA), and drive postmortems.
- Automate infrastructure and operational processes using scripting and programming languages.
- Work with containerization and orchestration tools such as Docker and Kubernetes.
- Integrate security and compliance into system architecture and workflows.
- Develop and maintain documentation, including runbooks, dashboards, and operational standards.
Required Qualifications:
- 8+ years of experience in Systems Engineering, DevOps, or Site Reliability Engineering (SRE) roles.
- Strong expertise in Linux/Unix systems and system internals.
- Proficiency in at least one programming/scripting language (Python, Go, Java, or Bash).
- Experience designing and operating highly available distributed systems.
- Hands-on experience with cloud platforms (AWS or Google Cloud Platform) and cloud-native services.
- Experience with Docker and Kubernetes.
- Strong understanding of monitoring, alerting, and logging frameworks.
- Experience managing SLIs, SLOs, and error budgets.
- Knowledge of incident management, RCA, and postmortem practices.
- Experience incorporating security and compliance into DevOps workflows.