Lead SRE team initiatives focused on system reliability, automation, and operational excellence.
Architect and implement solutions to enhance availability, performance, and scalability of cloud and on-premises services.
Oversee incident management processes, ensuring timely response and thorough root cause analysis.
Develop and refine monitoring, alerting, and reporting frameworks; ensure actionable insights for service health.
Guide adoption of Infrastructure as Code (IaC) and CI/CD pipelines to streamline deployments and reduce risk.
Collaborate with software engineering and product teams to integrate reliability requirements into design and development.
Mentor engineers on SRE principles, fostering a culture of continuous improvement and operational rigor.
Establish service level objectives (SLOs), service level indicators (SLIs), and error budgets in partnership with stakeholders.
Manage on-call rotations, ensuring effective coverage and knowledge sharing.
Document architectural decisions, operational procedures, and incident retrospectives.
Operational Excellence for AI Systems: identify AI/ML use cases, and influence and implement SRE best practices, including SLIs/SLOs for ML workloads, automated remediation, and capacity modeling.
Observability & Monitoring for ML: define and implement monitoring strategies for model drift, data anomalies, pipeline failures, system performance, and user experience.
Proactively identify and mitigate risks during deployments to ensure system stability.
Ensure long-term stability by managing technical debt.
Maintain observability and performance of critical pharmacy applications.
Support timely restoration of services during outages, with 24/7 coverage to meet enterprise Service Level Agreements (SLAs).
Drive incident response and root cause analysis to prevent recurrence and improve system resilience.
Requirements
Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
7+ years of relevant experience in SRE, DevOps, or software engineering, including 2+ years in a technical leadership role.
Minimum 5 years' relevant experience with Python, PySpark, Azure Databricks, Snowflake, SQL, Oracle, Postgres, file transfer, REST APIs, and Kafka.
Proficiency with cloud platforms (AWS, Azure, GCP), container orchestration, and automation tools.
Strong scripting and programming skills (e.g., Python, Go, Bash).
Deep understanding of distributed systems, networking, and security principles.
Proven experience leading large-scale incident response and postmortem processes.
Excellent communication and stakeholder management skills.
Experience building automation around CI/CD (Azure DevOps (ADO) YAML pipelines), testing, and validation.
Tech Stack
AWS
Azure
Cloud
Distributed Systems
Google Cloud Platform
Kafka
Oracle
Postgres
PySpark
Python
SQL
Go
Benefits
Medical, dental, and vision benefits
401(k) retirement savings plan
Time off (including paid time off, company and personal holidays, volunteer time off, paid parental and caregiver leave)