Shyld AI builds safety-critical robotics and perception systems deployed on real devices. The company is seeking a Senior Infrastructure Engineer to own its cloud and edge infrastructure, including CI/CD, deployments, runtime reliability, and SOC 2 compliance.
Responsibilities:
- Own and operate cloud infrastructure: compute, networking, storage, messaging, CI runners
- Standardize environments with infrastructure-as-code, runbooks, and safer deploy practices
- Build and maintain CI/CD and release pipelines for containerized services and device components
- Manage deployments and runtime reliability (startup, recovery, watchdogs, rollbacks, staged rollouts)
- Create and maintain integration test infrastructure (service-to-service and end-to-end CI)
- Build device provisioning and automated setup for edge deployments
- Own observability across backend and device fleet: logging, metrics, dashboards, alerting
- Lead or strongly contribute to SOC 2 (Type I / Type II) readiness and ongoing compliance:
  - Implement and maintain controls (access, change management, logging, incident response, vendor risk, encryption)
  - Build auditable workflows and automation for evidence collection
  - Ensure traceability for changes (approvals, release notes, rollbacks, audit trails)
- Build and maintain secure device firmware deployment processes, including:
  - Firmware/code signing (adding signatures, managing keys/certificates securely)
  - Release integrity verification, staged rollouts, versioning, rollbacks, and auditability
  - Collaboration with embedded/robotics teams to ensure safe and reliable update strategies
- Implement secrets and authentication management (secure distribution, rotation, service auth)
- Maintain strong access control and identity practices across cloud and edge (IAM/RBAC, OAuth/OIDC/JWT, mTLS as applicable)
- Write SQL monitoring queries for operational health checks, anomaly detection, and reporting/dashboards
- Develop automation and services in Python for operational workflows, observability, and tooling
- Build and maintain internal/external APIs to support deployment orchestration, telemetry pipelines, and integrations
Requirements:
- 4+ years in DevOps / SRE / Platform / Infrastructure Engineering with production ownership
- Strong Linux, networking, and debugging skills across distributed systems
- Deep Docker/container experience and CI/CD ownership
- Cloud infrastructure experience (AWS/GCP/Azure), including IAM, networking, storage, compute
- Observability experience (logs/metrics/tracing), dashboards, and alerting
- Secrets management experience (Vault / cloud secret managers / KMS) and secure rotation practices
- Authentication and identity knowledge: IAM/RBAC, OAuth/OIDC/JWT, mTLS
- Experience building and maintaining integration test pipelines (service-to-service and end-to-end CI)
- Proven ability to support SOC 2 compliance in engineering practice (controls, evidence, audit readiness, change management)
- Experience delivering secure firmware/device updates, including signing and release integrity
- Edge/IoT/robotics production experience (ROS2 a plus)
- Infrastructure-as-code with Terraform/Pulumi
- Device identity/attestation and secure update pipelines (supply chain integrity, signed artifacts)
- HIL/simulation testing; MQTT/EMQX/Kafka/NATS
- SRE practices: SLIs/SLOs, incident response, postmortems, error budgets