Incode is the leading provider of world-class identity solutions and is reinventing the way people authenticate and verify their identities online. As a Senior Site Reliability Engineer, you will maintain and secure the AWS GovCloud infrastructure, ensuring high availability and compliance for government-related projects while collaborating closely with engineering and security teams.
Responsibilities:
- Apply Site Reliability Engineering principles to ensure high availability, performance, and operational resilience of GovCloud systems
- Participate in on-call rotations, responding to production incidents through monitoring, alerting, and defined escalation processes
- Lead incident mitigation, root-cause analysis, and postmortems to continuously improve system reliability
- Manage and maintain AWS GovCloud infrastructure, including EKS clusters, DocumentDB, RDS PostgreSQL, Kafka, OpenSearch, and S3
- Plan and execute infrastructure and application-level configuration changes following government change control procedures
- Support secure and compliant production environments with a strong focus on stability and operational excellence
- Manage Kubernetes workloads using Amazon EKS, Helm charts, and ArgoCD in production environments
- Implement and maintain GitOps-based deployment workflows with appropriate security controls
- Ensure Kubernetes resources and deployments follow compliance and security best practices
- Partner with Engineering and Security teams to implement and enforce secure system design and operational controls
- Support the onboarding and integration of security tools and technologies that meet GovCloud and federal compliance requirements
- Contribute to infrastructure and operational practices aligned with FedRAMP, NIST 800-53, FISMA, and related standards
- Use Infrastructure as Code and automation to improve consistency, scalability, and reliability across environments
- Continuously improve SRE processes, tooling, and monitoring while staying within compliance constraints
- Identify operational risks and drive improvements that reduce incident frequency and impact
- Create and maintain clear, high-quality documentation, including runbooks and operational guides, that explains both the what and the why
- Work closely with engineering, security, and infrastructure teams to support deployments and troubleshoot production issues
- Contribute to hiring efforts by participating in interviews and evaluating SRE candidates when needed
Requirements:
- US Citizenship required (mandatory for AWS GovCloud access)
- 5+ years of experience in SRE, DevOps, Platform Engineering, or Infrastructure-focused roles
- Strong hands-on experience operating AWS GovCloud environments and a solid understanding of GovCloud-specific requirements
- Deep experience with Kubernetes (EKS), Docker, Helm, and production containerized systems
- Expert-level experience with Terraform (modules, versioning, state management); experience with Ansible is a plus
- Experience with ArgoCD or similar GitOps tools for continuous deployment in secure environments
- Strong Linux production experience, including firewalls, access controls, and disk encryption
- Experience working with federal or industry security standards, such as FedRAMP, NIST 800-53, FISMA, SOC 2, or ISO 27001
- Working knowledge of Python and Bash, and familiarity with Go for Kubernetes-related operations
- Ability to read and troubleshoot application code; prior software engineering experience is a plus