Federal Express Corporation is seeking a Senior Site Reliability Engineering Analyst to enhance system reliability and performance. This role will lead initiatives to improve observability, automation, and operational best practices while collaborating with engineering teams.
Responsibilities:
- Lead reliability and performance improvements, including capacity planning, failover strategies, and MTTA/MTTR reduction
- Develop technical solutions for complex system issues and resilience gaps
- Assess reliability risks and recommend enhancements to ensure service continuity
- Refine and promote best practices for reliability, maintainability, and scalability
- Mentor team members and provide technical guidance
- Recommend engineering improvements that drive consistency and long-term stability
- Improve monitoring, alerting, and observability to strengthen system awareness
- Support incident response and root cause analysis (RCA) activities to ensure effective resolution
- Document incident learnings and share knowledge across teams supporting Agile Release Train(s)
- Partner with development, operations, and architecture teams to integrate reliability into system design and delivery
- Reduce operational toil through automation and process optimization
- Enhance engineering workflows, CI/CD pipelines, and readiness practices
- Perform additional responsibilities as required to support organizational goals
Requirements:
- Strong written and verbal communication skills
- Ability to analyze complex technical problems and implement effective solutions
- Solid understanding of distributed systems, cloud environments, and modern application architectures
- Hands-on experience with observability platforms (Dynatrace required)
- Experience with monitoring, incident management, and RCA practices
- Ability to lead initiatives independently and collaborate across teams
- Demonstrated focus on reliability, resiliency, automation, and continuous improvement
- Development experience (e.g., Python, Java, scripting for automation)
- Cloud expertise (e.g., Azure, GCP) including deployment, architecture, and operations
- Bachelor's degree in Computer Science, Engineering, Information Systems, or a related field, or equivalent
- Five (5) or more years of equivalent work experience in an information technology or engineering environment
- Experience with AI/ML-powered monitoring, automation, or incident prediction
- Familiarity with SRE-aligned frameworks such as SLIs/SLOs, error budgets, and reliability patterns