Orange Logic has been solving complex content challenges for over two decades with its intelligent Digital Asset Management system. The Site Reliability Engineer ensures the availability, reliability, and performance of critical platform services and applications, working closely with infrastructure, development, and operations teams in a cloud-based environment.

Responsibilities:

Monitor, administer, and troubleshoot application performance and infrastructure health using observability tools
Analyze and resolve application issues, provide timely status updates, and perform thorough root cause investigations
Respond to alerts, outages, and system degradations, executing recovery procedures and supporting post-incident reviews
Deliver front-end and back-end application support, including stakeholder consultation for performance improvements
Implement and manage infrastructure using Infrastructure as Code (IaC) tools such as Terraform, Ansible, or Puppet
Administer cloud-native services (e.g., EC2, S3, RDS, and Kubernetes on AWS, or their equivalents on Azure or Google Cloud)
Develop and maintain automation scripts to streamline deployments, configuration management, and repetitive tasks
Ensure consistent code migration across environments to maintain application stability and functionality
Deploy and maintain application monitoring tools such as Prometheus, Grafana, and the ELK stack
Establish proactive alerting and visibility into system behavior to ensure rapid detection and resolution of issues
Plan and execute application and configuration change procedures with minimal disruption
Support scheduled maintenance activities including patching, updates, and server health checks
Participate in an on-call rotation to support incident resolution during evenings and weekends
Collaborate with Development, Infrastructure, and Production Support teams to optimize system performance and scalability
Identify and implement process improvements to enhance service reliability and deployment efficiency
Stay informed on the latest industry practices and tools related to site reliability, DevOps, and cloud infrastructure

Requirements:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field
8+ years of experience in site reliability, DevOps, or production engineering roles with increasing responsibilities
Strong knowledge of distributed systems, cloud platforms (AWS, Azure, GCP), and containerized environments (Docker, Kubernetes)
Proficiency in SQL and scripting languages (Python, Bash, PowerShell)
Extensive experience with observability stacks and automated alert systems
Familiarity with web protocols, networking fundamentals, and API performance optimization
Demonstrated ability to lead cross-functional initiatives and influence without authority
Excellent verbal and written communication skills with a focus on clarity and action
Experience mentoring engineering teams and conducting architectural or reliability reviews
