Circle is one of the world’s leading internet financial platform companies, building the foundation of a more open, global economy through digital assets and payment applications. As a Site Reliability Engineer at Circle, you'll harness AI-powered insights to build and maintain production infrastructure, ensuring reliability and performance for a rapidly expanding global customer base.
Responsibilities:
- Empower agile development teams with a high-performance CI/CD pipeline, ensuring fast, high-quality releases with measurable performance and quality metrics
- Design, maintain, and secure cloud infrastructure using Infrastructure-as-Code tools like Terraform and Crossplane
- Automate operational tasks using Go, Python, and serverless solutions (AWS Lambda, Kubernetes Jobs)
- Manage and monitor Kubernetes clusters for multiple production workloads
- Develop and maintain blockchain infrastructure, managing nodes across Ethereum, Solana, Arbitrum, Base, Avalanche, and others
- Ensure system reliability and security by participating in on-call rotations, troubleshooting disruptions, conducting root cause analysis, and collaborating with Security teams on security-focused tools and frameworks
- Plan, test, and implement disaster recovery strategies for a highly available microservices architecture
- Leverage AI-powered solutions for infrastructure management, log analysis, anomaly detection, capacity planning, predictive maintenance, and performance optimization
- Mentor and support team growth, fostering collaboration and scalability
Requirements:
- 4+ years in DevOps or SRE roles
- 3+ years in CI/CD platform development and microservices support
- Strong observability, problem-solving, and performance optimization skills in complex, distributed systems
- Hands-on experience with Blue-Green, Canary, and A/B Testing deployment strategies for services and databases
- Understanding of multi-region and multi-cloud architectures
- Proficiency in Go, Python, and Shell
- Proficiency in using AI tools to support daily work
- Excellent communication skills, with the ability to break down technical concepts and foster collaboration
- Experience with Kubernetes clusters at scale, containerization, and Helm charts
- Experience with modern CI/CD platforms with complex gates and workflows
- Experience with distributed blockchain systems and blockchain full nodes
- Experience with networking (routing, DNS, load balancing, edge networking)
- Experience with APM, RUM, monitoring, and telemetry tools
- Experience with database technologies (PostgreSQL, Redis, OpenSearch)
- Experience with migrating and transforming large, complex datasets from diverse sources, structures, and formats
- Experience with data warehousing (Apache Airflow, AWS DMS, Snowflake)
- Experience with IaC using Terraform or Crossplane for cloud deployments
- Experience with AI tools (GitHub Copilot, Gemini, and ChatGPT) for productivity and code quality
- Experience with Large Language Models (LLMs) and AI applications in software development and operations
- 7+ years in DevOps or SRE roles (for Staff Site Reliability Engineer)
- 5+ years in CI/CD platform development and microservices support (for Staff Site Reliability Engineer)
- Ability to break large initiatives down into smaller tasks while maintaining focus (for Staff Site Reliability Engineer)
- Familiarity with multi-region and multi-cloud architectures and their implementations (for Staff Site Reliability Engineer)