DigitalOcean is a technology company that focuses on simplifying cloud services for builders. The company is seeking a highly skilled Staff Software Engineer to join its Customer Observability/Insights team and to architect, build, and maintain the large-scale distributed systems behind its customer-facing observability ecosystem.
Responsibilities:
- Architect, design, develop, and maintain scalable backend services and systems
- Drive technical initiatives and large cross-team projects from concept to production
- Collaborate with product managers, UX designers, and engineers across distributed teams to deliver end-to-end solutions
- Develop deep expertise in observability tools and technologies such as Prometheus, Grafana, time-series databases, and distributed tracing
- Build and maintain high-performance APIs and microservices using Go (Golang) and gRPC, integrating with systems like Kafka, Redis, and NoSQL databases
- Work with Terraform and Ansible to automate infrastructure deployment and configuration management
- Use SQL for data analysis, service integration, and operational insights
- Lead efforts in debugging, troubleshooting, and performance tuning of complex distributed systems
- Champion operational excellence by improving reliability, monitoring, and alerting practices
- Provide technical leadership, mentorship, and guidance to other engineers
Requirements:
- 15+ years of relevant industry experience building and operating large-scale cloud services or distributed systems in a fast-paced, high-growth environment
- Strong programming experience in Go (Golang) and deep understanding of distributed systems fundamentals
- Solid understanding of observability, monitoring, and alerting systems (e.g., Prometheus, Grafana)
- Experience working with the OpenTelemetry (OTel) Collector, including instrumentation, data pipelines, and telemetry ingestion for metrics, logs, and traces
- Proven experience designing and implementing scalable event-driven architectures using Kafka or similar technologies
- Experience with gRPC, Terraform, and Ansible for service communication and infrastructure automation
- Working knowledge of SQL, Redis, and NoSQL databases
- Demonstrated ability to drive operational excellence and improve system reliability
- Experience making pragmatic technical trade-offs while balancing short-term needs and long-term goals
- Excellent communication and collaboration skills, especially with geographically distributed teams
- Strong ownership mindset and the ability to independently deliver high-impact projects
- Experience with cloud-native environments (Kubernetes, Docker, microservices)
- Familiarity with time-series databases and distributed tracing frameworks
- Prior experience building or maintaining observability platforms