DriveWealth is a global B2B financial technology organization dedicated to democratizing access to financial independence. As a Senior Site Reliability Engineer, you will be responsible for the architecture, scalability, and self-healing capabilities of the Brokerage-as-a-Service platform, focusing on reducing toil through engineering and ensuring the platform's reliability.
Responsibilities:
- Design and develop internal tools and SRE platforms to eliminate repetitive tasks (toil) and improve developer velocity
- Architect and maintain modular, reusable IaC using Terraform and manage GitOps workflows via ArgoCD
- Implement OpenTelemetry standards and the Grafana stack (Alloy, Loki, Tempo, Mimir) to provide deep insight into system health; define and manage SLIs, SLOs, and error budgets
- Review software architecture and Kubernetes metrics to support high availability, capacity planning, and cost optimization across AWS regions
- Lead incident response, perform complex root-cause analysis (RCA), and champion a blameless post-mortem culture
- Partner with engineering teams to foster the adoption of new tools, security standards, and reliability best practices
Requirements:
- Proficient in Linux administration with a deep understanding of the TCP/IP stack, OSI model, DNS, and network troubleshooting
- Experience working in highly regulated financial environments or with FIX/API connectivity
- Hands-on experience managing production-grade Kubernetes clusters, including RBAC, autoscaling, Helm, and multi-cluster patterns
- Strong grasp of AWS core services, security, and high-availability patterns
- Proficiency with boto3 and AWS CLI for automation
- Experience building secure, automated delivery pipelines and operating GitOps workflows (ArgoCD)
- Strong scripting and development skills in Python or Golang, along with Bash and Ansible
- Experience with secrets management, vulnerability scanning, and securing the software supply chain
- Familiarity with using LLMs, public MCP servers, or Amazon Bedrock AgentCore to enhance SRE workflows
- Experience managing Kafka, MQ, SQS, or orchestration tools like Airflow and Rundeck
- Applicants must be authorized to work for any employer in the U.S.