Amtex Systems Inc is seeking a Site Reliability Engineer to strengthen its cloud operations. The role involves implementing observability (logs, metrics, tracing), standardizing disaster recovery (DR) practices, and setting up monitoring, logging, and alerting systems.
Responsibilities:
- Implement advanced observability (logs, metrics, tracing)
- Introduce automated incident response
- Standardize DR and reliability practices
- Set up monitoring, logging, and alerting
- Define incident workflows and runbooks
- Introduce basic SRE practices
- Monitoring & alerting baseline
- Logging framework
- Runbooks for critical services
- SRE fundamentals (SLIs, SLOs, error budgets)
- Automated compliance checks
- Backup & archival setup
- Reliability playbooks
- Error budget governance
- Automated incident response
- Backup & DR standardization
- Centralized logging
- Metrics standardization
- Distributed tracing
Requirements:
- Cloud Ops baseline model (incident workflows)