Donnelley Financial Solutions is seeking a Senior Site Reliability Engineer to ensure their SaaS products are fast, stable, and optimized for customers. The role involves championing a culture of SRE, leveraging AI tools for system reliability, and implementing monitoring and alerting to maintain service levels.

Responsibilities:

Champion and implement a culture of SRE to maintain a high-quality platform infrastructure in DFIN SaaS products
Leverage AI tools to enhance system reliability, including intelligent observability, incident prediction and automated remediation across cloud infrastructure
Evaluate and implement emerging AI-powered operations and observability solutions to proactively improve system performance, reliability and scalability
Champion and implement application and infrastructure monitoring and alerting to prevent client-impacting issues, ensuring system availability, performance and scalability so that SLOs and SLAs are met
Optimize application performance at scale
Automate everything including system operational runbooks
Define and support continuous integration and deployment pipelines (CI/CD) aligned to branching and quality assurance strategies
Dive deep into technology and stay on the forefront of the latest tools, technologies, and strategies; help evaluate, prototype, and integrate them into work processes
Perform with broad independence and deliver on project milestones and tasks on schedule while communicating progress regularly
Build strong relationships with SRE team members and software engineering teams to hold each other accountable for quality expectations
Learn continuously and apply lessons learned
Evangelize best practices, eliminate bottlenecks, and improve processes
Participate in 24/7/365 on-call duties and lead the triage and root cause analysis (RCA) of production incidents

Requirements:

5+ years experience designing, building, securing, monitoring and maintaining cloud infrastructure in Azure or AWS
Experience applying AI capabilities within CloudOps operations
5+ years experience writing software in a modern programming language such as C#/.NET or Java
5+ years experience creating automated deployments with tools such as Harness, Azure DevOps, Ansible or Jenkins to manage Infrastructure as Code and software build and deployment in a continuous integration (CI) / continuous delivery (CD) environment
5+ years experience implementing production performance, availability, and scalability monitoring and alerting using a tool such as New Relic, Dynatrace, Datadog or AppDynamics
5+ years experience writing scripts in PowerShell or Python/Bash to automate system operations as runbooks for Windows or Linux environments
5+ years experience supporting public, client-facing, revenue-generating systems
Strong DevOps focus and experience building and deploying Infrastructure as Code with Terraform or similar technology
Experience monitoring and preventing issues with databases and database queries (SQL, Cosmos) using tools like SolarWinds Database Performance Analyzer, Idera SQL Diagnostic Manager, or Redgate SQL Monitor
Experience planning, coordinating, developing and executing all stages of post deployment verification test scripts
Experience securing Windows or Linux systems in a 24x7 production environment
Experience with containerization and managing Kubernetes clusters (AKS or EKS)
Experience with common cloud networking, firewall and load balancing configuration
BS in Computer Science or equivalent work experience
Relevant certifications or training in AI, Cloud AI services or AIOps platforms are a plus

Senior Site Reliability Engineer - Cloud
