EPAM Systems is seeking a Senior Site Reliability Engineer (SRE)/DevOps Engineer to optimize, maintain, and scale its IT infrastructure and operations. The role combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems.
Responsibilities:
- Design, build and maintain the infrastructure and tools that enable fast development and release of software
- Ensure continuous availability, performance and scalability of production systems and services
- Implement automation tools for efficient operations and for responding to system alerts and issues
- Collaborate closely with the development team to improve the reliability and performance of the system
- Develop and maintain operational documentation and specifications on system builds and operational processes
- Monitor and report on service level objectives for a given application's services
- Establish key performance indicators in cooperation with business and product owners
- Foster a culture of continuous improvement, testing and automation
Requirements:
- Bachelor's or Master's degree in Computer Science, Information Technology or related field
- 3+ years of experience in an SRE/DevOps role with a proven track record of scaling and automating large-scale systems
- Understanding of cloud computing services, preferably AWS, Azure or GCP
- Proficiency in scripting languages such as Python and Bash, along with infrastructure-as-code tools such as Terraform and CloudFormation
- Skills in containerization and container orchestration tools such as Docker and Kubernetes
- Knowledge of CI/CD pipelines and tools such as Jenkins and GitLab CI
- Familiarity with monitoring and alerting tools such as Prometheus, Grafana and New Relic
- Excellent leadership and communication skills
- English proficiency at B2 level or higher