Nscale is a GPU cloud provider engineered for AI, offering high-performance infrastructure for AI start-ups and enterprises. The Infrastructure Engineer (Ironic Specialist) will focus on OpenStack bare metal provisioning and lifecycle management, ensuring reliable operation of large-scale physical infrastructure and contributing to the OpenStack community.
Responsibilities:
- Designing, implementing, and operating scalable and resilient bare metal provisioning platforms with a strong focus on OpenStack Ironic
- Owning the lifecycle of physical infrastructure through automated discovery, enrolment, provisioning, cleaning, deprovisioning, and hardware state management
- Managing and improving integrations between Ironic and related OpenStack services such as Nova, Neutron, Glance, Keystone, Placement, and supporting automation tooling
- Building and maintaining robust provisioning workflows for a wide range of hardware profiles, including GPU-enabled and high-performance server platforms
- Driving automation for hardware onboarding, firmware and BIOS configuration, deployment workflows, validation, and recovery using infrastructure-as-code and configuration management tools
- Troubleshooting complex issues across provisioning pipelines, PXE/iPXE, BMC interfaces, out-of-band management, image deployment, network boot, and hardware compatibility
- Acting as a third- and fourth-line escalation point for advanced bare metal and provisioning incidents, carrying out root cause analysis and implementing long-term fixes
- Supporting platform upgrades, lifecycle management, and operational improvements across Ironic and its dependencies
- Collaborating closely with network, compute, data centre, and support teams to ensure efficient and reliable delivery of physical infrastructure services
- Contributing specialist input to infrastructure roadmap planning, capacity expansion, standard builds, and hardware platform qualification
- Supporting pre-sales and solution design efforts by providing expert guidance on bare metal capabilities, operational models, and deployment constraints
- Contributing to upstream OpenStack bare metal communities, particularly Ironic and related projects, through bug reports, code contributions, testing, reviews, and design discussions where appropriate
- Tracking upstream roadmaps, release changes, and community direction to help shape Nscale's bare metal strategy, upgrade planning, and platform standards
- Representing Nscale's operational requirements, hardware use cases, and scaling challenges in upstream discussions to help drive practical improvements for both the business and the wider community
- Ensuring provisioning platforms and operational processes adhere to security, compliance, and operational standards
- Participating in on-call rotations and incident response activities for critical infrastructure services
Requirements:
- Strong Linux systems administration and troubleshooting experience
- Deep hands-on experience deploying, operating, upgrading, and troubleshooting large-scale OpenStack environments
- Strong specialist knowledge of OpenStack Ironic and the surrounding provisioning ecosystem
- Strong understanding of bare metal provisioning concepts including PXE/iPXE, DHCP, TFTP/HTTP boot, BMC technologies, RAID configuration, firmware management, disk imaging, and node lifecycle states
- Strong experience with out-of-band management technologies such as Redfish, IPMI, or vendor management interfaces
- Strong experience designing and building automation for physical and virtual infrastructure using tools such as Ansible
- Strong Python and Bash skills
- Experience troubleshooting complex provisioning and hardware integration issues across server, network, and management layers
- Experience operating production infrastructure at scale with a strong focus on reliability, repeatability, and operational safety
- Ability to collaborate across infrastructure, support, and architecture teams to solve complex technical problems
- Experience contributing to or working closely with upstream open-source communities is highly desirable, particularly within OpenStack, Ironic, Metal3, or related infrastructure projects
- Ability to evaluate upstream changes, influence technical direction, and translate community developments into practical outcomes for production bare metal platforms
- Experience with GPU server platforms, hardware qualification, or large-scale bare metal cloud environments is highly desirable
- Knowledge of Neutron, provisioning networks, and the integration points between networking and bare metal deployment is beneficial