Ensure the reliability, availability, performance, and scalability of production systems using software engineering practices.
Collaborate closely with development teams to design, build, and maintain resilient, observable, and automated platforms that meet defined service level objectives (SLOs).
Develop and implement automation tools to streamline manual and repetitive operational tasks.
Document processes, workflows, and system configurations to support ongoing operations and future enhancements.
Continuously monitor production systems, proactively addressing incidents and performance issues.
Participate in capacity planning and ongoing improvements to system resilience and scalability.
Maintain effective communication with executive management, business stakeholders, and cross-functional technical teams.
Stay current with emerging site reliability engineering practices, tools, and technologies.
Requirements
8 years of experience in systems engineering, DevOps, or site reliability engineering roles.
8 years of hands-on experience with Linux/Unix systems and system internals.
8 years of experience programming or scripting in one or more languages (e.g., Python, Go, Java, Bash).
8 years of experience designing and operating highly available, distributed systems.
8 years of experience with cloud platforms (such as AWS or GCP) and cloud-native services.
8 years of experience with containerization and orchestration (e.g., Docker, Kubernetes).
8 years of experience with monitoring, alerting, and logging.
8 years of experience defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
8 years of experience with incident management, root cause analysis (RCA), and postmortems.
8 years of experience integrating security and compliance into operational workflows.
Must be able to pass a background check.
Additional background checks may be required by projects and/or clients at any time during employment.
Tech Stack
AWS
Cloud
Distributed Systems
Docker
Google Cloud Platform
Java
Kubernetes
Linux
Python
Unix
Go
Benefits
Medical, Dental and Vision Insurance
Wellness Program
Flexible Spending Accounts (Healthcare, Dependent Care, Commuter)
Short-Term and Long-Term Disability options
Basic Life and AD&D Insurance (Company Provided)
Voluntary Life and AD&D options
401(k) Retirement Savings Plan with matching after one year