Site Reliability Engineer II at Backblaze | JobVerse
Site Reliability Engineer II
Backblaze
Remote
India
Full Time
2 hours ago
Visa Sponsorship
Key skills
Ansible
Docker
Grafana
Jenkins
Kubernetes
Linux
Microservices
Prometheus
Python
Terraform
Go
Bash
CI/CD
Remote Work
About this role
Role Overview
Support the availability and durability of critical services across production environments.
Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
Develop automation for common operational tasks, reducing manual intervention and toil.
Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
Partner with engineering, product, and operations teams to support resilient system design and operations.
Assist in capacity planning and disaster recovery exercises.
Work with vendors and service providers to troubleshoot service issues and track SLA performance.
Document systems, share learnings, and help grow a reliability-minded engineering culture.
Contribute to playbooks, runbooks, and operational documentation.
Identify recurring issues and propose long-term improvements.
Promote reliability-focused practices within development and operations teams.
Requirements
Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
2–4 years of experience in site reliability, systems engineering, or operations.
Exposure to large-scale, production-grade systems.
Solid Linux systems administration and troubleshooting skills.
Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
Proficiency in at least one scripting language (Python, Bash, or Go).
Understanding of containers (Kubernetes, Docker) and microservices concepts.
Knowledge of incident response and operational best practices.
Tech Stack
Ansible
Docker
Grafana
Jenkins
Kubernetes
Linux
Microservices
Prometheus
Python
Terraform
Go
Benefits
Paid time off
Professional development opportunities
Remote work options