Design and deploy observability platforms using industry-leading tools such as ELK Stack, New Relic, Datadog, and AlertSite
Develop and maintain monitoring strategies, dashboards, and alerting rules to ensure system reliability and performance
Collaborate with engineering teams to instrument applications and infrastructure for comprehensive observability
Troubleshoot complex system issues using observability data and provide actionable insights
Establish best practices for logging, metrics collection, and distributed tracing
Optimize observability infrastructure for cost-efficiency and performance
Conduct training and knowledge-sharing sessions with development and operations teams
Participate in on-call rotations and incident response activities
Continuously evaluate and recommend new observability tools and technologies
Requirements
5+ years of experience in platform engineering, DevOps, or systems engineering roles
Hands-on expertise with at least two of the following platforms: ELK Stack, New Relic, Datadog, or AlertSite
Strong understanding of monitoring, logging, metrics, and alerting concepts
Proven experience creating and maintaining monitoring dashboards and visualizations
Hands-on experience implementing synthetic monitoring, end-to-end transaction monitoring, Application Performance Monitoring (APM), and Real User Monitoring (RUM), including observability for browser and mobile applications
Knowledge of SLI/SLO definition and measurement methodologies
Familiarity with MTTA, MTTR, MTTD, and other incident metrics
Proficiency in scripting languages (Python, Bash, or similar)
Experience with cloud platforms (AWS, Azure, or GCP)
Knowledge of containerization and orchestration technologies (Docker, Kubernetes)