Inframark is a leading company in Automation and Intelligence focused on delivering innovative solutions for water and wastewater plants. The DevOps Engineer will take ownership of infrastructure, modernizing and stabilizing the cloud-based platform while following best practices and proactively improving operations.
Responsibilities:
- Take ownership of production monitoring and alerting using Prometheus, Grafana, and CloudWatch—proactively identify issues before they become incidents
- Modernize the production EKS cluster with GitOps practices (ArgoCD), comprehensive monitoring, and proper deployment workflows that follow industry best practices
- Streamline the staging deployment process; eliminate branch-based workarounds and establish clean GitOps patterns
- Design infrastructure patterns that scale to hundreds of customers
- Own AWS infrastructure operations, including patching, maintenance, cost optimization, and security compliance, and stay ahead of requirements
- Expand into MLOps once DevOps operations are automated, building the infrastructure that enables data scientists to deploy models at scale across multiple utility customers
- Manage Kubernetes clusters (EKS) including pod migrations, resource optimization, troubleshooting, and security updates—proactively, not reactively
- Maintain infrastructure as code using Terraform and Ansible following best practices, with all changes tested in non-production environments before deployment
- Support engineering teams with infrastructure needs, unblock them quickly, and establish self-service patterns where possible; anticipate needs rather than waiting for requests
- Manage message queue infrastructure (Kafka/Redpanda) including retention policies, storage optimization, and performance tuning
- Document infrastructure, create runbooks, and automate operational tasks to move systems into maintenance mode
- Clean up technical debt—proactively identify infrastructure to decommission, resources to consolidate, and costs to optimize
Requirements:
- 5+ years of experience in DevOps, infrastructure, or site reliability engineering
- Demonstrated ability to take ownership and initiative—you see what needs to be done and do it without waiting for direction
- Deep knowledge of DevOps and infrastructure best practices—you know what good looks like and implement it proactively
- Strong Kubernetes experience (EKS preferred) including cluster management, deployments, services, and troubleshooting
- Hands-on AWS experience (EC2, EKS, ECS, RDS, VPC, IAM, CloudWatch, S3)
- Infrastructure as code proficiency (Terraform and Ansible)
- GitOps experience (ArgoCD, Flux, or similar)
- CI/CD pipeline experience (Bitbucket Pipelines, Jenkins, GitHub Actions, or similar)
- Monitoring and observability experience (Prometheus and Grafana preferred)
- Python scripting ability for automation and tooling
- US citizenship (required for AWS GovCloud access)
- Self-starter mentality—you identify problems and opportunities, then drive solutions to completion
- Proven track record of delivering tested, high-quality infrastructure changes on schedule
- Excellent communication skills—proactive about sharing status, raising blockers, and documenting decisions
- Curiosity about machine learning and interest in transitioning to MLOps as the platform matures
Preferred:
- Any MLOps or ML infrastructure experience (KServe, Kubeflow, SageMaker, model serving)
- Experience with data pipelines, feature engineering, or supporting data science teams
- AWS GovCloud experience and understanding of compliance requirements (FedRAMP)
- Experience with message queue systems (Kafka, Redpanda)
- Container security and vulnerability scanning (Snyk)
- Background in SaaS platforms, IoT, or critical infrastructure