System Design & Maintenance: Design, implement, and maintain scalable, secure, and reliable systems, with Python-based services and microservices as the primary stack.
Monitoring & Observability: Build and manage monitoring, alerting, and logging systems using Python-native tooling (e.g. Prometheus clients, OpenTelemetry SDK, structlog); proactively identify and resolve performance issues.
Automation & Tooling: Develop and maintain automation tools and internal libraries in Python to streamline operations, reduce manual intervention, and support CI/CD pipelines.
Collaboration: Partner with Python development squads to ensure new features are designed with reliability in mind; conduct code reviews for reliability-critical paths; participate in Agile ceremonies.
Incident Management: Conduct root cause analysis for incidents and implement corrective actions to prevent recurrence; participate in on-call rotations for critical systems; maintain runbooks in version-controlled Python projects.
Continuous Improvement: Drive initiatives to improve system performance, reliability, and scalability through Python best practices, including profiling, benchmarking, and dependency management.
Requirements
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
Minimum 2 years of experience in SRE, DevOps, or similar roles.
Strong Python proficiency, including async programming (asyncio, FastAPI), the Django ORM, testing (pytest), packaging (Poetry/pip), and scripting.
Experience with cloud platforms (AWS, GCP, or Azure), containers (Docker), and container orchestration (Kubernetes).
Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation.
Strong problem-solving skills and ability to work effectively under pressure.
Excellent communication and collaboration skills for cross-functional teamwork.