Key skills

Linux system administration, Windows system administration, PowerShell, Python, Bash, Object-oriented programming, Automation tools, Configuration management tools, Chef, Puppet, Ansible, Cloud infrastructure management, AWS, GCP, Container platforms, EKS, GKE, Infrastructure as Code, Terraform, CI/CD pipelines, Source control, Git, Deployment tools, Jenkins, Octopus, Rundeck, Networking concepts, DNS, Load balancing, Monitoring tools, Datadog, Splunk, New Relic, Database administration, MS SQL Server, Agile methodology, Scrum, JIRA, ITIL practices, Incident management, Change management, Root cause analysis, SQL, AI, Communication

About this role

Coupa provides an AI-powered total spend management platform. The Sr. Site Reliability Engineer ensures the availability and performance of critical services, builds automation to prevent issues, provides infrastructure support, and collaborates with cross-functional teams.

Responsibilities:

Own end-to-end availability and performance of critical services, including building automation to prevent recurring issues
Administer Linux and Windows systems across web, application, and database servers
Develop and automate solutions using various programming languages
Provide application and infrastructure support, including participating in on-call rotations for emergencies
Enhance monitoring, alerting, and observability to ensure reliability and performance
Collaborate with cross-functional teams on releases, infrastructure, and troubleshooting, and maintain documentation such as RCAs

Requirements:

Bachelor's degree in Computer Science, Information Systems, or a related field, with 5+ years of experience in system administration and large-scale web operations
Strong programming skills (PowerShell, Python, Bash, or OOP languages) and experience with automation and configuration management tools (Chef, Puppet, Ansible, etc.)
Hands-on experience managing cloud infrastructure (AWS, GCP) and container platforms (EKS, GKE), plus Infrastructure as Code tools like Terraform
Proficiency in CI/CD pipelines, source control (Git with complex branching), and deployment/automation tools (Jenkins, Octopus, Rundeck)
Solid understanding of networking and operations concepts (DNS, load balancing), monitoring tools (Datadog, Splunk, New Relic), and database administration (MS SQL Server)
Strong Agile/Scrum experience (JIRA), ITIL practices (incident/change management, RCA), and excellent communication, problem-solving, and ownership skills

Sr. Site Reliability Engineer - 11444
