ITMC Systems, Inc. is seeking a Senior DevOps Engineer to join the Salesforce AI Research Incubation Team. The role involves designing, implementing, and maintaining cloud infrastructure and CI/CD pipelines to support AI research and development, ensuring the reliability, scalability, and security of AI-driven applications.
Responsibilities:
- Design, implement, and manage cloud infrastructure (AWS, GCP) including networking, security, and compute resources
- Develop and maintain CI/CD pipelines to automate deployment and testing of AI models and applications
- Build, manage, and optimize Kubernetes clusters for deploying AI services and research applications
- Implement infrastructure as code (IaC) using Terraform and Helm to ensure repeatable and scalable deployments
- Automate system operations and monitoring using Python and shell scripting
- Ensure security best practices across cloud environments, including firewall and access control management
- Troubleshoot infrastructure issues and optimize system performance
- Collaborate with AI researchers and software engineers to streamline model deployment and integration
- Manage databases (SQL and NoSQL), including provisioning, performance tuning, and backup strategies
- Ensure database security, replication, and high availability across cloud environments
Requirements:
- Bachelor's degree in Computer Science, Software Engineering, or a related field
- Experience with AI/ML model deployment and pipeline automation
- 3+ years of experience in DevOps, cloud infrastructure, or site reliability engineering
- Strong experience with AWS and GCP, including DNS, VM management, networking, Kubernetes, and firewall security
- Proficiency in CI/CD pipeline development and automation (GitHub Actions, Jenkins, GitLab CI/CD, etc.)
- Expertise in Docker, Kubernetes, and Helm for container orchestration and deployment
- Hands-on experience with Terraform for infrastructure provisioning and management
- Strong Python and shell scripting skills for automation
- Solid understanding of networking, security best practices, and cloud monitoring tools
- Excellent troubleshooting and problem-solving skills
- Knowledge of logging and monitoring tools (Prometheus, Grafana, ELK stack, etc.)
- Familiarity with serverless computing and cloud-native application design
- Contributions to open-source DevOps tools or frameworks
- Experience with Salesforce Falcon is a plus