Suna Solutions is seeking a Site Reliability Engineer (SRE) to join a global engineering team responsible for maintaining the reliability, scalability, and efficiency of large-scale digital platforms. The role involves collaborating with software development and product teams to design, build, and operate resilient systems while focusing on system performance and operational automation.

Responsibilities:

Design, implement, and operate fault-tolerant systems to ensure high availability and resiliency of digital products
Maintain production systems and ensure stable performance across distributed cloud environments
Design and maintain monitoring, alerting, logging, and tracing solutions that provide real-time insights into system performance and customer experience
Use operational metrics to monitor service health and proactively address potential issues
Analyze system performance, scalability, and capacity to identify bottlenecks and improvement opportunities
Implement optimizations to improve efficiency, stability, and cost-effectiveness in cloud environments
Develop automation tools to streamline deployments, scaling, incident response, and operational workflows
Support infrastructure and application deployment pipelines through automation and scripting
Participate in an on-call rotation within a globally distributed engineering team
Lead incident response efforts, troubleshoot production issues, and coordinate resolution during system outages
Conduct post-incident reviews and implement improvements to prevent future incidents
Partner with engineering teams to improve developer experience, operational maturity, and overall system reliability
Work with security and compliance teams to ensure systems follow security and privacy best practices

Requirements:

Professional experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
Experience managing and operating Kubernetes-based production systems
Hands-on experience with Amazon Web Services (AWS) and infrastructure-as-code tools
Experience building or maintaining CI/CD pipelines and automated deployment workflows
Proficiency in Python for scripting, automation, or backend development
Strong understanding of distributed systems architecture and networking fundamentals
Experience with monitoring and observability tools such as Datadog and AWS CloudWatch
Experience working in globally distributed engineering teams
Strong troubleshooting and root cause analysis skills
Experience implementing automation to improve operational efficiency
Familiarity with security and compliance best practices in cloud environments
