Design and manage robust infrastructure solutions to ensure the reliability and scalability of critical systems.
Proactively monitor system health using performance metrics and automated tools to detect and address potential issues.
Lead incident management efforts during service disruptions, ensuring swift resolution and minimal downtime.
Analyze root causes of system failures and implement long-term solutions to enhance system reliability.
Develop scripts and tools to automate repetitive tasks, improving operational efficiency and reducing manual intervention.
Collaborate with development teams to align on reliability goals and integrate best practices into software design and deployment.
Maintain comprehensive system documentation to support efficient troubleshooting and knowledge sharing.
Requirements
Proven experience managing and maintaining highly available systems, including cloud-based infrastructure.
Proficiency in programming to automate repetitive tasks and reduce manual effort.
Solid understanding of monitoring tools, incident management platforms, and metrics analysis.
Deep knowledge of system performance optimization, troubleshooting methodologies, cloud platforms, databases, CI/CD, distributed systems, and security best practices.
Strong written and verbal communication skills to collaborate effectively with cross-functional teams.
Analytical mindset for interpreting data, metrics, and patterns to make informed decisions and predict future issues.
Ability to view interconnected systems holistically, anticipating the broader impact of changes and designing for resilience.
Tech Stack
Cloud
Distributed Systems
Swift
Benefits
Comprehensive benefits package
Medical, dental, and vision plans
Participation in 401(k) (USA) and DCPP (Canada) plans with company matching