Donnelley Financial Solutions is a values-driven organization that empowers employees to build fulfilling careers. The Principal Site Reliability Engineer - Cloud is responsible for designing, building, securing, monitoring, and maintaining the SaaS product cloud infrastructure to ensure optimal performance for customers.
Responsibilities:
- Champion and implement a culture of maintaining performant, reliable, secure, and cost-effective cloud platform infrastructure for DFIN SaaS products, based on operationalized processes you define
- Champion the security of our cloud infrastructure by collaborating with Security and Governance teams and using static and dynamic tooling
- Champion and implement application and cloud infrastructure monitoring and alerting to prevent client-impacting issues, ensuring system availability, performance, and scalability in order to maintain SLOs and SLAs
- Optimize cloud infrastructure and application performance at scale while maintaining effective cost controls
- Automate cloud infrastructure buildout and maintenance including system operational runbooks
- Dive deep into technology and stay on the forefront of the latest tools, technologies, and strategies; help evaluate, prototype, and integrate them into operationalized work processes
- Perform with broad independence and deliver on project milestones and tasks you define on schedule while communicating progress regularly
- Build strong relationships with SRE team members and software engineering teams to hold each other accountable for quality expectations
- Learn continuously and apply lessons learned
- Evangelize best practices, eliminate bottlenecks, and improve processes
- Participate in 24/7/365 on-call duties and lead the triage and root cause analysis (RCA) of production incidents
Requirements:
- 8+ years experience designing, building, securing, monitoring and maintaining cloud infrastructure in Azure or AWS
- 5+ years experience creating, configuring, maintaining and monitoring Kubernetes clusters (AKS or EKS) in cloud infrastructure to optimize application performance and reliability
- 5+ years building and deploying Infrastructure as Code with Terraform or similar technology
- 5+ years experience with common cloud networking, firewall and load balancing configuration
- 5+ years experience writing software in a modern programming language such as C# (.NET) or Java
- 5+ years experience creating automated deployments with tools such as Harness, Azure DevOps, Ansible or Jenkins to manage Infrastructure as Code and software build and deployment in a continuous integration (CI) / continuous delivery (CD) environment
- 5+ years experience implementing production performance, availability, and scalability monitoring and alerting using a tool such as New Relic, Dynatrace, Datadog or AppDynamics
- 5+ years experience supporting public, client-facing, revenue-generating systems
- Experience monitoring and preventing issues with databases and database queries (SQL) using tools like SolarWinds Database Performance Analyzer, Idera SQL Diagnostic Manager, or Redgate SQL Monitor
- Experience planning, coordinating, developing and executing all stages of post-deployment verification test scripts
- Experience securing Windows or Linux systems in a 24x7 production environment
- BS in Computer Science or equivalent work experience