Block MB is a fast-growing AI infrastructure company building a cutting-edge vector database platform used for AI search, recommendation systems, and large-scale data discovery. The company is looking for a Site Reliability Engineer to join its Cloud Operations team and help keep its cloud platform reliable, scalable, and secure as usage continues to grow.
Responsibilities:
- Operating and maintaining production cloud infrastructure at scale
- Managing Kubernetes clusters, networking, and deployment pipelines
- Improving monitoring, logging, and alerting systems
- Leading incident response and root cause analysis
- Automating operational tasks to reduce manual toil
- Improving security, reliability, and performance of production systems
- Working closely with platform and infrastructure teams
- Participating in on-call rotations
Requirements:
- 5+ years of experience in DevOps / SRE / Infrastructure roles
- Strong hands-on Kubernetes production experience
- Solid knowledge of Linux systems and networking
- Experience with AWS, GCP, or Azure
- Experience with monitoring, alerting, and incident management
- Familiarity with infrastructure-as-code and automation tools such as Terraform
- Prometheus / Grafana / Loki / OpenTelemetry
- Scripting with Python, Bash, or Go
- Experience working in SaaS or cloud infrastructure environments