Operate and improve platform tools so product teams can ship reliably: triaging tickets, fixing build issues, and handling routine service requests (access, secrets, environment setup).
Maintain and extend self-service workflows (templates, golden paths) by updating docs, examples, and guardrails under guidance from senior engineers.
Perform day-to-day Kubernetes operations: deploy/update Helm charts, manage namespaces, diagnose rollout issues, and follow runbooks for incident response.
Support CI/CD pipelines (e.g., GitLab CI): keep pipelines green, add/adjust jobs, implement basic quality gates, and help teams adopt safer deploy strategies (blue/green, canary).
Monitor and operate the observability stack using Prometheus, Alertmanager, and Thanos; maintain alert rules, dashboards, and SLO/SLA indicators; help reduce alert noise and improve signal quality.
Assist with service instrumentation across the core observability pillars—tracing, logging, and metrics—with hands-on OpenTelemetry usage (collectors/SDKs) and related telemetry tooling.
Contribute to and improve documentation: runbooks, FAQs, onboarding guides, and standard operating procedures.
Participate in an on-call rotation as needed with a well-defined escalation path; assist during incidents, post small fixes, and capture learnings in docs.
Help with cost- and performance-minded housekeeping: right-size workloads, prune unused resources, and automate routine tasks where appropriate.
Requirements
8+ years in a platform/SRE/DevOps or infrastructure role, with a strong bias toward automation and support.
Practical understanding of monitoring/observability (dashboards, logs, alerts) and how to use them for triage and remediation, including Prometheus/Alertmanager/Thanos and OpenTelemetry basics.
Comfortable working from tickets (Jira/ServiceNow), following change-management practices, and communicating clearly with stakeholders.
Highly preferred candidates also have: Terraform experience, API integration experience (Java, Python, or Go), deeper Linux fundamentals, and exposure to insurance/financial services environments.
Tech Stack
AWS
Docker
EC2
Java
Kubernetes
Linux
NGINX
Prometheus
Python
ServiceNow
Terraform
Go
Benefits
We help make an impact by solving real problems using innovation, improved customer experiences and the right technologies.