Zone 5 Technologies is redefining unmanned aircraft systems with a focus on innovative autonomous solutions. The company is seeking a Platform Engineer to design and operate the scalable compute infrastructure that powers its autonomous vehicle simulation and testing framework, enabling rapid iteration on autonomy algorithms through parallel simulation workloads.
Responsibilities:
- Design and implement auto-scaling compute infrastructure for simulation workloads using cloud platforms
- Build and maintain on-premises GPU and CPU clusters for simulation and machine learning training
- Architect hybrid cloud solutions that optimize cost and performance across cloud and local compute resources
- Implement job scheduling and orchestration systems using Kubernetes for thousands of concurrent simulations
- Design storage solutions for large-scale simulation data, logs, and artifacts using cloud and local storage systems
- Deploy and maintain robotics simulation environments at scale
- Build CI/CD pipelines for automated simulation testing of autonomy software
- Create infrastructure for distributed parameter sweeps, Monte Carlo testing, and regression suites
- Develop monitoring and observability systems for simulation fleet health and resource utilization
- Implement data pipelines for simulation results ingestion, analysis, and visualization
- Write and maintain infrastructure as code for reproducible deployments
- Build automation tools and CLI utilities to simplify developer access to compute resources
- Implement GitOps workflows for infrastructure changes and configuration management
- Create self-service interfaces for engineers to launch and manage simulation jobs
- Develop cost monitoring and optimization strategies for cloud and on-prem resources
- Monitor and optimize infrastructure performance, reliability, and cost efficiency
- Troubleshoot complex distributed systems issues across networking, storage, and compute layers
- Implement backup, disaster recovery, and business continuity strategies
- Maintain security best practices including IAM, secrets management, and network isolation
- Collaborate with autonomy, ML, and robotics teams to understand compute requirements and optimize workflows
- Design and implement network architectures for distributed simulation workloads across AWS and on-premises environments
- Configure VPCs, subnets, security groups, and routing for secure, high-performance compute clusters
- Establish hybrid cloud connectivity (VPN, Direct Connect, site-to-site tunnels) between on-premises and cloud resources
- Optimize network performance for large data transfers, multi-node communication, and distributed workloads
- Support internal infrastructure network design and provide technical guidance to engineering programs
- Troubleshoot network issues including latency, packet loss, and connectivity problems across distributed systems
Requirements:
- Bachelor's degree in Computer Science, Software Engineering, or a related technical field; equivalent industry experience also welcome
- 2-5+ years of experience in platform engineering, DevOps, SRE, or cloud infrastructure roles
- Strong hands-on experience with Kubernetes for container orchestration and workload management
- Experience with cloud computing platforms and services (compute, storage, networking)
- Deep understanding of Linux system administration and troubleshooting
- Strong networking fundamentals including TCP/IP, routing, DNS, VPNs, and security
- Understanding of infrastructure as code principles and configuration management
- Proficiency in scripting and automation (Python, Bash, or similar)
- Experience building and maintaining CI/CD pipelines
- Solid grasp of distributed systems concepts, job scheduling, and resource management
- Ability to design infrastructure from first principles and make architectural decisions
- Experience building infrastructure for simulation, robotics, or autonomous systems workloads
- Understanding of GPU computing and accelerated workload management
- Knowledge of job scheduling systems for batch and parallel workloads
- Experience managing on-premises clusters and hybrid cloud architectures
- Familiarity with robotics middleware (ROS/ROS2) or simulation platforms
- Understanding of cost optimization for compute-intensive workloads
- Experience with monitoring, logging, and observability systems
- Knowledge of containerization technologies and image management
- Background in data engineering, MLOps, or machine learning infrastructure
- Experience with network performance analysis and troubleshooting
- Understanding of software-defined networking and network automation
- Familiarity with security compliance requirements in aerospace/defense environments