DigitalOcean is a cutting-edge technology company focused on simplifying cloud services for builders. The Systems Engineer II will lead sustaining engineering efforts for the hardware infrastructure of the DigitalOcean server fleet, ensuring optimal performance and support for hardware and firmware components.
Responsibilities:
- Act as technical lead for the Sustaining Engineering team in the Infra::Machines::Design Organization
- Support server hardware, cabling, and networking hardware throughout its operational lifecycle
- Monitor the #machines channel and MACHINES JIRA project for issues and drive them to resolution
- Participate in 24/7 on-call rotation with other members of the team
- Act as Tier 2 escalation for Datacenter Operations (DCOPS) and Cloud Operations (CloudOps) regarding hardware and firmware components
- Develop and maintain standards and practices for DigitalOcean hardware operations
- Work closely with the Qualification team, Firmware team, Fleet Lifecycle Engineering team (FLE), Foresight team, and Infrastructure Services team to resolve issues in tooling, firmware packages, hardware components, and other operational concerns
- Liaise with DCOPS teams to develop, deliver, and support hardware-related runbooks
- Liaise with vendor support teams regarding hardware and firmware issues and drive issues to resolution
- Identify gaps in tooling and operational processes and engage appropriate peer teams to close gaps
- Help with development of tooling and associated runbooks to address gaps in operational capabilities around hardware and firmware operations
- Coordinate with Ops teams on monitoring thresholds, failure modes, and alerting
- Assist in troubleshooting the causes of failures and work to prevent them in the future
- Raise the quality bar in the delivery of our cloud infrastructure by identifying industry best practices and working to adopt them
Requirements:
- Technical Degree (BS Computer Science/Engineering) or equivalent practical experience
- Hands-on experience operating a cloud infrastructure at mid-tier scale or better
- An in-depth understanding of server hardware, firmware, and infrastructure
- Strong knowledge of troubleshooting techniques, Python, and Bash
- Clear communication and collaboration across key stakeholders
- An insatiable passion for constant improvement