Harness, the AI Software Delivery Platform company, is seeking a Staff Software Engineer with expertise in distributed systems and cloud-native backend engineering. The role involves architecting scalable backend systems, ensuring operational excellence, and collaborating across teams to enhance platform capabilities.
Responsibilities:
- Architect and develop scalable, fault-tolerant backend systems that handle millions of requests per second
- Implement microservices using Go, Java, or Python, ensuring high availability and resilience
- Deploy and manage applications on AWS, GCP, or Azure with Kubernetes (EKS, GKE, AKS)
- Work with Kafka, Pulsar, or RabbitMQ for distributed messaging and streaming workloads
- Implement best practices for graceful degradation, retries, circuit breakers, and auto-scaling
- Define SLAs, SLIs, and SLOs, and set up robust alerting and escalation processes for incident handling
- Lead post-incident analysis, drive corrective actions, and improve system reliability
- Define and implement logging, monitoring, and distributed tracing using Prometheus, OpenTelemetry, Grafana, and Datadog
- Diagnose and optimize latency, throughput, and memory utilization for large-scale distributed systems
- Design and implement highly concurrent, multithreaded backend services for parallel processing
- Improve performance of SQL (PostgreSQL, MySQL) and NoSQL (Cassandra, DynamoDB, Redis, MongoDB) solutions
- Implement API security, authentication, and authorization, and ensure compliance with SOC 2, ISO 27001, and PCI DSS
- Guide engineers in best practices for platform engineering, microservices, and distributed systems
- Work with cloud engineering, security, and product engineering teams to align platform capabilities with business needs
Requirements:
- 10-14 years of experience in backend platform engineering, distributed systems, and microservices
- Strong programming expertise in Go, Java, or Python, with a focus on multithreading and concurrency
- Expertise in Kubernetes, service meshes (Istio, Linkerd), and cloud infrastructure
- Deep understanding of gRPC, REST APIs, GraphQL, and API performance tuning
- Hands-on experience with CI/CD and infrastructure automation (Terraform, Pulumi)
- Proven ability to manage production incidents and apply operational excellence practices
- Excellent debugging and problem-solving skills in complex, distributed environments