EPAM Systems is seeking a Senior Site Reliability Engineer (SRE)/DevOps to optimize, maintain, and scale its IT infrastructure and operations. The role combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems.

Responsibilities:

Design, build and maintain the infrastructure and tools that enable rapid development and release of software
Ensure continuous availability, performance and scalability of production systems and services
Implement automation tools for efficient operations and response to system alerts and issues
Collaborate closely with the development team to improve the reliability and performance of the system
Develop and maintain operational documentation and specifications on system builds and operational processes
Monitor and report on service level objectives for each application's services
Establish key performance indicators in cooperation with business and product owners
Foster a culture of continuous improvement, testing and automation

Requirements:

Bachelor's or Master's degree in Computer Science, Information Technology or related field
3+ years of experience in an SRE/DevOps role with a proven track record of scaling and automating large-scale systems
Understanding of cloud computing services, preferably AWS, Azure or GCP
Proficiency in scripting languages such as Python and Bash along with infrastructure as code tools such as Terraform and CloudFormation
Skills in containerization and container orchestration tools such as Docker and Kubernetes
Knowledge of CI/CD pipelines and tools such as Jenkins and GitLab CI
Familiarity with monitoring and alerting tools such as Prometheus, Grafana and New Relic
Excellent leadership and communication skills
English proficiency at B2 level or higher
