Build, deploy safely and incrementally, and operate critical production systems with a focus on scalability, reliability, observability, performance, and security.
Build automation to remove toil and proactively monitor, respond to, and enhance alerts with automated handling.
Create and maintain incident response runbooks, triage platform and infrastructural issues, and write postmortem documents to prevent recurring incidents.
Plan and communicate maintenance windows on production systems while engaging with third-party vendor support as needed.
Work with Arista's product development teams to identify infrastructural bottlenecks and design solutions to enhance developer experience and workflow efficiency.
Survey and adopt best practices in infrastructure and platform design to maintain secure, scalable, and fault-tolerant systems, including studying open-source (OSS) system implementations to improve triage and resolution.
Requirements
A BSc in Computer Science or Engineering with at least 3 years’ experience, an MS in Computer Science or Engineering with at least 3 years’ experience, or equivalent work experience.
Knowledge of one or more of Go, Python, or shell scripting, sufficient to implement medium-complexity automation workflows.
Knowledge of Linux (or UNIX) from an administration and debugging perspective.
Hands-on experience operating software systems (infrastructure, complex applications, etc.) at scale.
Experience in server provisioning (especially from a storage and networking perspective).
Strong problem-solving and software troubleshooting skills.