Role Overview
- Define and drive the reliability strategy for our SaaS platform.
- Design, build, and maintain the shared infrastructure services and platforms that our product and application teams will depend on.
- Hold teams accountable for meeting customer-facing Service Level Agreements (SLAs).
- Design Continuous Delivery (CD) processes for government deployments that will eventually be used commercially.
- Develop robust internal-facing tools and automation for infrastructure provisioning and management, primarily using Go (Golang) or Python.
- Architect and optimize foundational solutions within Cloud environments (AWS, Azure, etc.).
- Design and implement shared Event-Driven Architecture components and messaging platforms using technologies like Kafka or Google Pub/Sub.
- Design and build resilient Distributed Systems components that serve as building blocks for other applications.
- Manage and optimize our shared infrastructure across Multi-Region Cloud Environments.
- Establish and enhance centralized Observability and Monitoring platforms and tools.
- Define and implement clear, well-documented RESTful API designs for the infrastructure services you build.
- Implement and manage Service Mesh (e.g., Envoy, Istio) capabilities.
- Design, implement, and optimize highly available Relational Database services or shared data platforms.
- Collaborate closely with product development teams to understand their infrastructure needs and pain points.
- Participate in on-call rotations to support the critical shared infrastructure you build.
Requirements
- 9+ years of experience in an Infrastructure Development, Platform Engineering, or Site Reliability Engineering role, with a strong focus on building tools and services for other engineers
- Deep expertise with Kubernetes in production environments, particularly in providing it as a platform (i.e., single-tenant and multi-tenant deployment architectures)
- Strong programming skills in Go (Golang) and Python, with experience building robust, maintainable backend services and automation
- Extensive hands-on experience with at least one major Cloud Provider (AWS, GCP, or Azure); multi-cloud experience is a strong plus, especially in building abstractions over them
- Proven experience designing and implementing Event-Driven Architecture and message queuing systems (e.g., Kafka, RabbitMQ, NATS) as shared services
- Solid understanding and practical experience with CI/CD pipeline tools (especially GitLab CI) and experience establishing automated delivery processes for other teams
- Demonstrable experience designing and operating Distributed Systems, with an understanding of patterns for creating reliable, shared components
- Familiarity with Multi-Region Cloud Environments and strategies for building globally distributed and highly available platforms
- Proficiency in establishing and utilizing comprehensive Observability and Monitoring platforms (e.g., Prometheus, Grafana, ELK stack, Datadog) for shared infrastructure
- Strong experience with RESTful API design principles and building well-documented, consumable APIs
- Knowledge of Service Mesh concepts and practical experience with solutions like Istio in a platform context
- Hands-on experience with Relational Databases (e.g., MySQL, PostgreSQL), ideally in managing them as a service
- Excellent communication skills and the ability to clearly articulate complex technical concepts to both technical and non-technical audiences
- A strong customer-centric mindset, treating internal development teams as your primary customers
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical or military experience
Nice-to-Have Qualifications
- Experience with FedRAMP compliance and government security requirements
- Track record of implementing secure CI/CD pipelines in restricted or regulated environments
Tech Stack
- AWS
- Azure
- Cloud
- Distributed Systems
- Google Cloud Platform
- Grafana
- Kafka
- Kubernetes
- MySQL
- Prometheus
- Python
- Go
Benefits
- Competitive compensation, benefits, and growth opportunities