Implement, administer, maintain and evolve the observability platform.
Configure and operate observability tools, manage agents and collectors, enforce retention policies and tune performance.
Build dashboards, alerts and workflows, and perform incident troubleshooting.
Requirements
Strong experience with Datadog or Elasticsearch, including implementation, administration, maintenance and platform evolution.
Expertise in configuring and operating the chosen tool, including agent and collector management, retention policies, performance tuning, consumption and licensing management, and platform organization and governance.
Experience with application instrumentation.
Practical knowledge of OpenTelemetry, distributed telemetry and modern observability.
Ability to analyze and correlate metrics, logs and traces.
Experience with advanced troubleshooting, incident investigation, profiling, tracing and root cause analysis.
Experience building dashboards, alerts, queries, notebooks and workflows within the tool.
Knowledge of integrations via APIs, webhooks and native connectors, including integrations with ITSM/CMDB and monitoring tools.
Experience with cloud environments and distributed applications.
Experience with Kubernetes/EKS and monitoring/instrumentation of containerized workloads.
Familiarity with agile practices such as Scrum and Kanban.
Preferred
Advanced mastery of Datadog or Elasticsearch.
Experience in high-data-volume environments, with multiple services and distributed architectures.
Experience in 24/7 operations and scenarios with high availability and resilience requirements.
Experience supporting business-critical applications, preferably in sectors with high operational demands such as retail, finance, logistics or e-commerce.
Knowledge of observability applied to microservices, APIs, messaging and hybrid/cloud environments.
Experience integrating observability with incident management and problem management processes.