Calance is seeking a Lead SRE Platform Engineer to drive reliability engineering strategy across critical IT Business Solutions platforms. This role focuses on improving uptime, performance, and operational efficiency through software enhancements, observability, automation, and data-driven root cause analysis.

Responsibilities:

Define and mature SRE best practices across cloud and on-prem environments
Design and implement comprehensive monitoring strategies using tools such as:
Develop dashboards, alerts, synthetic testing, and proactive monitoring capabilities
Establish and evolve a MELT data strategy to improve service reliability
Provide data-driven RCA investigations and implement preventative solutions
Support and enhance reliability across:
Microsoft Azure (software, storage, Azure Local)
Hyper-V and legacy VMware environments
NetApp and Pure storage platforms
Azure Log Analytics
Infrastructure as Code using Terraform
Migration from Azure DevOps to GitHub (strong GitHub experience required)
Azure-based, internally developed .NET/C# applications
Internal message queuing systems
Logging, analytics, and synthetic testing post-patching
API-based integrations
Workday (Payroll)
ADP Vantage (Timekeeping)
Blue Yonder Warehouse Management System (WMS)
Vocollect handheld voice picking devices
Network analytics for identifying dead zones and connectivity issues
Barcode scanners and device connectivity troubleshooting
Lead CI/CD reliability improvements (Azure DevOps → GitHub transition critical)
Enhance pipeline automation with embedded security controls
Advance Infrastructure-as-Code standards (Terraform)
Improve configuration management and change governance
Drive automation to reduce manual intervention and operational risk
Work within the BMC ecosystem, including:
Optimize automated incident generation (SCOM → BMC workflows)
Improve triage, escalation, and impact modeling across services
Monitor vendor performance and escalate appropriately
Participate in off-hour escalation support when required
Develop predictive reliability models using statistical techniques
Identify systemic risk across production systems
Guide tooling decisions (e.g., Dynatrace vs. Datadog or other observability platforms)
Ensure regulatory and operational compliance standards are met
Facilitate cross-functional collaboration and document SRE procedures and planning artifacts

Requirements:

5–7+ years of Software Engineering and Infrastructure/Database Engineering experience
Deep expertise in DevSecOps practices
Deep expertise in Observability platforms
Deep expertise in API integrations
Deep expertise in Performance management tools
Deep expertise in ITIL principles
Deep expertise in ITSM data analytics
Deep expertise in MELT data collection and analysis
Experience in Azure cloud environments
Strong analytical and problem-solving skills
Demonstrated ability to influence technical direction
Excellent communication and cross-team collaboration skills
Continuous improvement mindset focused on reliability engineering
Strong programming experience in .NET / C#
Strong programming experience in Python
Strong programming experience in SQL
Experience with MSSQL (primary) and Oracle (limited)
Experience with GitHub (critical for upcoming transition)
Agile/Scrum experience
Knowledge of Reliability-Centered Engineering and maintenance strategies
Experience with synthetic testing and proactive validation post-deployment
Bachelor's degree in a related technical field

Site Reliability Engineer Lead - Dynatrace & Azure
