IBM Software is dedicated to transforming client challenges into innovative solutions through AI-powered, cloud-native products. The Staff Software Engineer II will lead the development of a multi-tenant, cloud-native compute platform, focusing on secure execution and operational excellence in a distributed systems environment.
Responsibilities:
- Build and evolve the secure, multi-tenant compute substrate and isolation primitives that safely execute customer and internal workloads in a shared environment
- Design and evolve APIs that provide clear, safe abstractions for polyglot workloads (containers, functions, and services) with diverse performance and isolation needs
- Integrate the Secure Compute platform with core data and application services so that teams can onboard new workloads with minimal friction
- Implement and harden workload isolation, network policies, identity and access, and secure execution environments required to safely run customer-supplied code
- Drive operational excellence through rich observability, automated health checks, self-healing workflows, and robust rollout and rollback practices
- Define and drive the technical direction for Secure Compute, including platform architecture, runtime, and security for running trusted and untrusted workloads at scale
- Design and implement platform APIs and Kubernetes controllers/operators (primarily in Go) that power workload lifecycle, autoscaling, placement, and isolation for containers and serverless-style functions
- Partner with product and platform teams to shape and deliver the roadmap for Secure Compute, enabling new customer-facing features and internal platforms to build on a common compute substrate
- Deliver high-impact initiatives in areas such as workload scheduling, failure and disruption handling, private and public networking patterns, rollout strategies, and fleet-level resource management
- Lead technical design reviews and influence architecture across teams, ensuring Secure Compute primitives are easy to adopt, safe by default, and aligned with broader platform strategy
- Mentor and grow engineers on the team through design guidance, code reviews, pair programming, and sharing best practices for secure, reliable, operable platform development
- Own operational excellence for key Secure Compute services, including availability, reliability, SLOs, performance, on-call response, incident management, and disaster recovery
Requirements:
- 10+ years of experience delivering scalable backend or infrastructure software in production
- Proven track record of leading the delivery of large-scale, highly available, low-latency distributed systems
- Deep expertise in Kubernetes, including controller development, operator patterns, and preferably multi-region or multi-cluster architectures
- Strong proficiency in Go with experience building production-grade services and control planes
- Experience with multi-tenant platform architectures and security/isolation patterns (for example, namespaces, network policies, sandboxing, secrets and identity management)
- Hands-on experience with secure container runtimes and low-level Linux internals (for example, Kata Containers, Cloud Hypervisor, cgroups, namespaces, seccomp)
- Experience with performance troubleshooting and tuning for containerized/virtualized workloads
- Familiarity with gRPC, Protobuf, and internal platform API design for service-to-service communication
- Hands-on experience with observability and operational practices (metrics, logs, traces, alerting, SLOs, rollout strategies, incident response)
- Experience with public cloud environments (such as AWS, GCP, Azure) and cloud-provider integrations
- Demonstrated technical leadership and mentorship, including driving cross-team alignment on architecture and execution
- Master's degree
- Strong collaboration skills and a history of working effectively with product, SRE/operations, security, and peer engineering teams
- A smart, humble, and empathetic attitude with a strong sense of ownership and teamwork
- Drive and excitement about building foundational cloud infrastructure in a fast-paced, innovative environment