Maintain and support one of the largest logging deployments in the industry
Collaborate with SRE and Engineering/Product teams in driving critical initiatives
Ensure the highest level of uptime and Quality of Service (QoS) for Adobe’s customers through operational excellence
Define service level objectives (SLOs) and service level indicators (SLIs) to measure service quality
Design and maintain production monitoring systems
Solve performance and stability issues using a wide variety of tools
Requirements
7-10+ years of production-level experience with distributed applications at scale in public and/or private cloud environments
Experience architecting and implementing large-scale Observability platforms
B.S. degree in Computer Science or related technical field
Must have: experience with internally hosted logging systems such as Splunk, ClickHouse, Loki, and Elastic, including assisting clients and improving environment performance and stability
Experience developing AI agents and integrating AI workflows into large-scale deployments
Experience architecting distributed environments with thousands of users
Programming experience with languages such as Go and Python
Experience building integrations and applications for large-scale Observability environments
Experience designing and implementing systems for fault tolerance, scalability, and stability
Experience developing, deploying and running distributed applications on cloud platforms
Experience with container and orchestration technologies (Docker, Kubernetes)