Donnelley Financial Solutions (DFIN) is seeking a Senior Site Reliability Engineer to ensure its SaaS products are fast, stable, and optimized for customers. The role involves championing a culture of SRE, leveraging AI tools for system reliability, and implementing monitoring and alerting to maintain service levels.
Responsibilities:
- Champion and implement a culture of SRE to maintain a high-quality platform infrastructure in DFIN SaaS products
- Leverage AI tools to enhance system reliability, including intelligent observability, incident prediction and automated remediation across cloud infrastructure
- Evaluate and implement emerging AI-powered operations and observability solutions to proactively improve system performance, reliability and scalability
- Champion and implement application and infrastructure monitoring and alerting that prevents client-impacting issues and keeps system availability, performance and scalability within SLOs and SLAs
- Optimize application performance at scale
- Automate everything including system operational runbooks
- Define and support continuous integration and deployment pipelines (CI/CD) aligned to branching and quality assurance strategies
- Dive deep into technology and stay on the forefront of the latest tools, technologies, and strategies; help evaluate, prototype, and integrate them into work processes
- Perform with broad independence and deliver on project milestones and tasks on schedule while communicating progress regularly
- Build strong relationships with SRE team members and software engineering teams to hold each other accountable for quality expectations
- Learn continuously and apply lessons learned
- Evangelize best practices, eliminate bottlenecks, and improve processes
- Participate in 24/7/365 on-call duties and lead the triage and root cause analysis (RCA) of production incidents
Requirements:
- 5+ years of experience designing, building, securing, monitoring and maintaining cloud infrastructure in Azure or AWS
- Experience applying AI capabilities within CloudOps operations
- 5+ years of experience writing software in a modern programming language such as C#/.NET or Java
- 5+ years of experience creating automated deployments with tools such as Harness, Azure DevOps, Ansible or Jenkins to manage Infrastructure as Code and software build and deployment in a continuous integration (CI) / continuous delivery (CD) environment
- 5+ years of experience implementing production performance, availability, and scalability monitoring and alerting using a tool such as New Relic, Dynatrace, Datadog or AppDynamics
- 5+ years of experience writing scripts in PowerShell or Python/Bash to automate system operations as runbooks for Windows or Linux environments
- 5+ years of experience supporting public, client-facing, revenue-generating systems
- Strong DevOps focus and experience building and deploying Infrastructure as Code with Terraform or similar technology
- Experience monitoring and preventing issues with databases and database queries (SQL, Cosmos) using tools like SolarWinds Database Performance Analyzer, Idera SQL Diagnostic Manager, or Redgate SQL Monitor
- Experience planning, coordinating, developing and executing all stages of post-deployment verification test scripts
- Experience securing Windows or Linux systems in a 24x7 production environment
- Experience with containerization and managing Kubernetes clusters (AKS or EKS)
- Experience with common cloud networking, firewall and load-balancing configuration
- BS in Computer Science or equivalent work experience
- Relevant certifications or training in AI, Cloud AI services or AIOps platforms are a plus