Orange Logic is a company focused on solving complex content challenges through its Digital Asset Management system. The Site Reliability Engineer (SRE) is responsible for ensuring the availability and performance of critical platform services and collaborates with Development, Infrastructure, and Production Support teams to maintain service excellence in a cloud-based environment.
Responsibilities:
- Monitor, administer, and troubleshoot application performance and infrastructure health using observability tools
- Analyze and resolve application issues, provide timely status updates, and perform thorough root cause investigations
- Respond to alerts, outages, and system degradations, executing recovery procedures and supporting post-incident reviews
- Deliver front-end and back-end application support, including stakeholder consultation for performance improvements
- Implement and manage infrastructure using Infrastructure as Code (IaC) tools such as Terraform, Ansible, or Puppet
- Administer cloud services on AWS, Azure, or Google Cloud (e.g., AWS EC2, S3, and RDS, or managed Kubernetes)
- Develop and maintain automation scripts to streamline deployments, configuration management, and repetitive tasks
- Ensure consistent code migration across environments to maintain application stability and functionality
- Deploy and maintain application monitoring tools such as Prometheus, Grafana, and the ELK stack
- Establish proactive alerting and visibility into system behavior to ensure rapid detection and resolution of issues
- Plan and execute application and configuration change procedures with minimal disruption
- Support scheduled maintenance activities including patching, updates, and server health checks
- Participate in an on-call rotation to support incident resolution during evenings and weekends
- Collaborate with Development, Infrastructure, and Production Support teams to optimize system performance and scalability
- Identify and implement process improvements to enhance service reliability and deployment efficiency
- Stay informed on the latest industry practices and tools related to site reliability, DevOps, and cloud infrastructure
Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- 8+ years of experience in site reliability, DevOps, or production engineering roles with increasing responsibilities
- Strong knowledge of distributed systems, cloud platforms (AWS, Azure, GCP), and containerized environments (Docker, Kubernetes)
- Proficient in SQL and scripting languages (Python, Bash, PowerShell)
- Extensive experience with observability stacks and automated alert systems
- Familiarity with web protocols, networking fundamentals, and API performance optimization
- Demonstrated ability to lead cross-functional initiatives and influence without authority
- Excellent verbal and written communication skills with a focus on clarity and action
- Experience mentoring engineering teams and conducting architectural or reliability reviews