System Design & Maintenance: Design, implement, and maintain scalable, secure, and reliable systems, with Python-based services and microservices as the primary stack.
Monitoring & Observability: Build and manage monitoring, alerting, and logging systems using Python-native tooling (e.g. Prometheus clients, OpenTelemetry SDK, structlog); proactively identify and resolve performance issues.
Automation & Tooling: Develop and maintain automation tools and internal libraries in Python to streamline operations, reduce manual intervention, and support CI/CD pipelines.
Collaboration: Partner with Python development squads to ensure new features are designed with reliability in mind; conduct code reviews for reliability-critical paths; participate in Agile ceremonies.
Incident Management: Conduct root cause analysis for incidents and implement corrective actions to prevent recurrence; participate in on-call rotations for critical systems; maintain runbooks in version-controlled Python projects.
Continuous Improvement: Drive initiatives to improve system performance, reliability, and scalability through Python best practices, including profiling, benchmarking, and dependency management.

Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
At least 2 years of experience in SRE, DevOps, or similar roles.
Strong Python proficiency, including asynchronous programming (asyncio, FastAPI), web frameworks with an ORM (Django), testing (pytest), packaging (Poetry/pip), and scripting.
Experience with cloud platforms (AWS, GCP, or Azure), containers (Docker), and container orchestration (Kubernetes).
Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation.
Strong problem-solving skills and ability to work effectively under pressure.
Excellent communication and collaboration skills for cross-functional teamwork.

Site Reliability Engineer

Key skills