Lirio is a technology/software company that provides expertise in behavioral science, data science, and machine learning. The Senior System Reliability Engineer will be responsible for the reliability, scalability, and performance of cloud-native applications and infrastructure, leading automation, monitoring, and incident response processes while mentoring other engineers.
Responsibilities:
- Architect, implement, and maintain automated solutions for deployment, monitoring, alerting, and incident response using Lirio’s technology stack (AWS, Azure, Kubernetes, Kafka, Java, TypeScript, Groovy, Databases/SQL)
- Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation)
- Build and optimize CI/CD pipelines for seamless, reliable delivery
- Define, implement, and continuously refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services
- Identify and reduce operational toil through automation, platform improvements, and architectural changes
- Analyze and optimize the performance of Lirio systems and services
- Ensure high availability and scalability of services through proactive engineering, load testing, and capacity planning across multi-tenant and client-specific environments
- Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness
- Collaborate with software engineers to embed reliability, security, and operational best practices into development workflows
- Partner with software engineering teams during design and architecture discussions to identify reliability risks early
- Monitor system health using modern observability tools (e.g., Prometheus, Grafana, Datadog)
- Participate in a defined on-call rotation supporting production systems, with clear escalation paths and expectations
- Contribute to and maintain incident severity definitions, response procedures, and no-blame postmortem practices
- Lead incident response, root cause analysis, and postmortems for production issues
- Triage and resolve issues, ensuring minimal downtime and rapid recovery
- Support client onboarding and production rollouts by ensuring reliability, observability, and operational readiness standards are met
- Mentor and coach engineers on reliability engineering principles, operational ownership, and incident response best practices
- Design processes to share operational knowledge and avoid single points of failure
- Advise colleagues on architecture and reliability strategies
- Help establish shared operational ownership across teams to reduce single points of failure and knowledge silos
- Stay current with industry trends in reliability engineering, cloud operations, and automation
- Bring innovation to operational practices and system design, evaluating and introducing new tools and technologies as appropriate for Lirio
- Evaluate new tooling with an emphasis on operational simplicity, security, and long-term maintainability
- Define and document operational processes, incident response playbooks, and reliability standards
- Contribute to operational planning, incident reviews, and reliability documentation
Requirements:
- 5-7 years of related experience
- Bachelor's degree in a related field
- Linux systems and networking fundamentals (DNS, TCP/IP, TLS)
- Distributed systems debugging and failure analysis
- Load, stress, and fault-injection testing
- CI/CD tools and processes
- Version control (e.g., Git)
- Cloud platforms (e.g., AWS, Azure)
- Containers and orchestration (Kubernetes)
- Kafka (messaging/streaming)
- Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)
- Agile methodologies (e.g., Scrum, XP, SAFe)
- Databases/SQL
- Observability/monitoring tools (Datadog)