Dayforce is a global human capital management company that offers a unified Cloud HCM platform. As a Lead Observability Engineer, you will provide senior technical leadership in the implementation and continuous improvement of Dayforce’s observability platform, ensuring reliable telemetry collection and operational workflows across distributed systems.
Responsibilities:
- Design, implement, and operate components of the Dayforce observability platform in alignment with architectural standards and platform strategy
- Lead implementation, tuning, and operational improvements across observability tooling including metrics, logs, traces, dashboards, alerting, and synthetic monitoring
- Apply best practices for telemetry collection and instrumentation across application and infrastructure workloads
- Build, maintain, and enhance dashboards and alerting mechanisms to support service ownership and incident response
- Enable and onboard engineering and infrastructure teams to drive consistent adoption and effective platform usage
- Design and optimize data pipelines for high-cardinality telemetry data, balancing performance, reliability, and cost
- Partner with platform and engineering teams to gather requirements and deliver solutions aligned to operational needs
- Provide mentorship through code reviews, documentation, and knowledge sharing
- Participate in on-call rotations and operational reviews to drive reliability improvements and post-incident learnings
Requirements:
- Must be a US citizen
- Ability to obtain US security clearance
- Strong communication and collaboration skills across engineering and infrastructure teams
- Ability to gather requirements, prioritize effectively, and deliver high-quality solutions within defined scope
- Significant experience operating and troubleshooting distributed systems in production environments
- Experience implementing and operating observability platforms including metrics, logging, tracing, and alerting systems
- Hands-on experience with OpenTelemetry, distributed tracing, and APM tooling
- Experience working with data pipelines, ETL processes, and high-cardinality telemetry datasets
- Proficiency in at least one object-oriented programming language and one scripting language
- Demonstrated ability to deliver scalable, reliable, and maintainable technical solutions
- Strong interest in learning and adopting emerging technologies within an established architectural framework
- Bachelor's degree plus 5–10 years of related experience, Master's degree plus 6 years of related experience, or equivalent combination of education and experience
- Experience operating and tuning observability storage systems such as ClickHouse
- Hands-on experience with Kubernetes observability and monitoring containerized workloads
- Experience extending or integrating Grafana dashboards, data sources, or plugins
- Familiarity with applying AI-assisted tooling to observability workflows
- Contributions to internal tooling, automation, or documentation that improved observability adoption and usability