Knotch is a growth-stage technology company that helps brands optimize content performance through AI. The DevOps Engineer will be responsible for building and scaling infrastructure, ensuring systems are fast, resilient, and cost-efficient while supporting AI-driven capabilities.
Responsibilities:
- Design, build, and maintain scalable, secure, and highly available infrastructure across pre-production and production environments
- Develop and manage CI/CD pipelines to enable fast, reliable, and repeatable deployments across multiple environments
- Own infrastructure as code (IaC) practices using tools like Terraform to ensure consistency and reproducibility
- Manage environment lifecycle (development, staging, production), including promotion workflows and configuration management
- Partner closely with Engineering, Data, and AI teams to support system performance, reliability, and scalability
- Implement and maintain monitoring, logging, and alerting systems to ensure high visibility into system health and performance
- Optimize infrastructure for cost, performance, and reliability, especially for compute- and data-intensive AI workloads
- Support Kubernetes-based deployments and container orchestration for distributed systems
- Contribute to security best practices across infrastructure, including IAM, networking, and application-level protections
- Create dashboards and reporting systems to provide visibility into system performance, uptime, and operational metrics
- Document architecture, operational processes, and infrastructure decisions to support knowledge sharing and onboarding
- Act as a DevOps/SRE partner across teams, helping troubleshoot issues and improve system reliability
Requirements:
- You have at least 5 years of experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering roles within SaaS, PaaS, or cloud-native environments
- Prior experience in a growth-stage and/or startup environment, scaling from $10M to $20M+ ARR with a lean team
- Strong experience with Google Cloud Platform (GCP), including IAM, networking, and data services
- Hands-on experience with Infrastructure as Code tools such as Terraform
- Experience building and maintaining CI/CD pipelines (GitHub Actions, Argo CD, or similar)
- Solid experience with Kubernetes, Docker, and containerized environments
- Familiarity with deployment tools such as Helm
- Experience with monitoring and observability tools like Prometheus and Grafana
- Strong understanding of system reliability, scalability, and performance optimization
- Ability to work across multiple systems and priorities in a dynamic environment
- Strong documentation and communication skills, with attention to clarity and detail
- Supplementary experience supporting AI/ML or data-intensive workloads in production environments
- Familiarity with workflow orchestration or data pipeline tools
- Experience with cost optimization strategies for cloud infrastructure
- Exposure to security frameworks and compliance best practices
- Experience working with distributed or globally deployed systems