Calance is seeking a Lead SRE Platform Engineer to drive reliability engineering strategy across critical IT Business Solutions platforms. This role focuses on improving uptime, performance, and operational efficiency through software enhancements, observability, automation, and data-driven root cause analysis.
Responsibilities:
- Define and mature SRE best practices across cloud and on-prem environments
- Design and implement comprehensive monitoring strategies using tools such as:
- Develop dashboards, alerts, synthetic testing, and proactive monitoring capabilities
- Establish and evolve a MELT data strategy to improve service reliability
- Provide data-driven RCA investigations and implement preventative solutions
- Support and enhance reliability across:
  - Microsoft Azure (software, storage, Azure Local)
  - Hyper-V and legacy VMware environments
  - NetApp and Pure storage platforms
  - Azure Log Analytics
  - Infrastructure as Code using Terraform
  - Migration from Azure DevOps to GitHub (strong GitHub experience required)
  - Azure-based, internally developed .NET/C# applications
  - Internal message queuing systems
  - Logging, analytics, and synthetic testing post-patching
  - API-based integrations
  - Workday (Payroll)
  - ADP Vantage (Timekeeping)
  - Blue Yonder Warehouse Management System (WMS)
  - Vocollect handheld voice picking devices
  - Network analytics for identifying dead zones and connectivity issues
  - Barcode scanners and device connectivity troubleshooting
- Lead CI/CD reliability improvements (Azure DevOps → GitHub transition critical)
- Enhance pipeline automation with embedded security controls
- Advance Infrastructure-as-Code standards (Terraform)
- Improve configuration management and change governance
- Drive automation to reduce manual intervention and operational risk
- Work within BMC ecosystem including:
- Optimize automated incident generation (SCOM → BMC workflows)
- Improve triage, escalation, and impact modeling across services
- Monitor vendor performance and escalate appropriately
- Participate in off-hour escalation support when required
- Develop predictive reliability models using statistical techniques
- Identify systemic risk across production systems
- Guide tooling decisions (e.g., Dynatrace vs. Datadog or other observability platforms)
- Ensure regulatory and operational compliance standards are met
- Facilitate cross-functional collaboration and document SRE procedures and planning artifacts
Requirements:
- 5–7+ years of software engineering and infrastructure/database engineering experience
- Deep expertise in DevSecOps practices
- Deep expertise in observability platforms
- Deep expertise in API integrations
- Deep expertise in performance management tools
- Deep expertise in ITIL principles
- Deep expertise in ITSM data analytics
- Deep expertise in MELT data collection and analysis
- Experience in Azure cloud environments
- Strong analytical and problem-solving skills
- Demonstrated ability to influence technical direction
- Excellent communication and cross-team collaboration skills
- Continuous improvement mindset focused on reliability engineering
- Strong programming experience in .NET / C#
- Strong programming experience in Python
- Strong programming experience in SQL
- Experience with MSSQL (primary) and Oracle (limited)
- Experience with GitHub (critical for upcoming transition)
- Agile/Scrum experience
- Knowledge of Reliability-Centered Engineering and maintenance strategies
- Experience with synthetic testing and proactive validation post-deployment
- Bachelor's degree in a related technical field