Suna Solutions is seeking a Site Reliability Engineer (SRE) to join a global engineering team responsible for maintaining the reliability, scalability, and efficiency of large-scale digital platforms. The role involves collaborating with software development and product teams to design, build, and operate resilient systems, with a focus on system performance and operational automation.
Responsibilities:
- Design, implement, and operate fault-tolerant systems to ensure high availability and resiliency of digital products
- Maintain production systems and ensure stable performance across distributed cloud environments
- Design and maintain monitoring, alerting, logging, and tracing solutions that provide real-time insights into system performance and customer experience
- Use operational metrics to monitor service health and proactively address potential issues
- Analyze system performance, scalability, and capacity to identify bottlenecks and improvement opportunities
- Implement optimizations to improve efficiency, stability, and cost-effectiveness in cloud environments
- Develop automation tools to streamline deployments, scaling, incident response, and operational workflows
- Support infrastructure and application deployment pipelines through automation and scripting
- Participate in an on-call rotation within a globally distributed engineering team
- Lead incident response efforts, troubleshoot production issues, and coordinate resolution during system outages
- Conduct post-incident reviews and implement improvements to prevent future incidents
- Partner with engineering teams to improve developer experience, operational maturity, and overall system reliability
- Work with security and compliance teams to ensure systems follow security and privacy best practices
Requirements:
- Professional experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
- Experience managing and operating Kubernetes-based production systems
- Hands-on experience with Amazon Web Services (AWS) and infrastructure-as-code tools
- Experience building or maintaining CI/CD pipelines and automated deployment workflows
- Proficiency in Python for scripting, automation, or backend development
- Strong understanding of distributed systems architecture and networking fundamentals
- Experience with monitoring and observability tools such as Datadog and AWS CloudWatch
- Experience working in globally distributed engineering teams
- Strong troubleshooting and root cause analysis skills
- Experience implementing automation to improve operational efficiency
- Familiarity with security and compliance best practices in cloud environments