PTC is a leading company transforming the physical and digital worlds through its software solutions. The company is seeking a Principal Software Engineer to ensure the reliability, scalability, and operational excellence of its platform, lead cross-organization reliability initiatives, and serve as a technical authority.
Responsibilities:
- Own Reliability at Scale
  - Lead design, implementation, and evolution of reliability, availability, and resiliency strategies for large‑scale distributed systems written primarily in Java
  - Apply deep experience operating complex, distributed systems to guide architectural decisions, reliability strategies, and long‑term system evolution
  - Identify systemic risks in application architecture, data flows, and infrastructure, and drive architectural improvements that measurably improve availability, performance, and scalability
  - Set and evolve reliability standards, best practices, and operational principles across R&D
- Drive Operational Excellence
  - Lead efforts to prevent, detect, and mitigate incidents through technical improvements and operational maturity
  - Serve as a senior coordination point during major incidents, helping manage response and guide long‑term remediation
  - Champion blameless post-incident reviews and ensure learnings translate into durable system improvements
- Reduce Toil Through Engineering
  - Apply advanced software engineering practices to eliminate manual work, reduce operational load, and improve system observability
  - Design and build internal platforms, automation, and tooling that support Java‑based services and their operational needs
  - Raise the bar on monitoring, alerting, and SLO/SLI adoption across systems
- Lead Through Influence and Collaboration
  - Partner deeply with product engineers, architects, and engineering leadership to ensure reliability and operability are first‑class concerns in system design
  - Review and influence designs for complex systems involving technologies such as datastores, messaging systems, and coordination services
  - Serve as a technical mentor and coach for SREs and other engineers, raising overall engineering and operational maturity
- Shape Strategy and Direction
  - Contribute to longer‑term reliability and infrastructure strategy aligned with business growth
  - Stay current with industry trends in SRE, distributed systems, and the Java ecosystem, turning insights into practical improvements
  - Help define what 'great reliability' looks like for the organization and how we measure it
Requirements:
- US citizens or permanent residents only, due to ITAR requirements
- Ability to work East Coast (EST) hours
- Availability for an on-call rotation once every 10 weeks
- 10+ years of experience in software engineering, site reliability engineering, or systems engineering roles
- Expert proficiency in the Java programming language and its ecosystem, including building, debugging, and operating production Java services
- Deep experience operating complex, distributed systems in production environments
- Strong software engineering background, with a track record of delivering high‑quality, maintainable code
- Expert understanding of incident management, service reliability, and performance engineering
- Strong hands‑on experience with observability (metrics, logs, traces), capacity planning, and SLO‑driven reliability
- Deep familiarity with modern cloud‑based infrastructure, CI/CD pipelines, and infrastructure‑as‑code practices
- Ability to reason about failure modes across application, data, and infrastructure layers
- Demonstrated ability to lead complex initiatives that span teams and organizational boundaries
- Comfortable making high‑impact technical decisions in ambiguous environments
- Strong communicator who can influence design and operational decisions across a wide range of stakeholders
- Systems thinker focused on root‑cause analysis and durable fixes
- Calm and effective under pressure, especially during high‑severity incidents
- Curious, data‑driven, and committed to continuous improvement
- Experience operating or supporting systems using technologies such as MongoDB, ZooKeeper, and RabbitMQ
- Background in performance tuning and scalability optimization of Java services
- Experience setting or influencing engineering standards at the organization level
- Prior involvement in evolving SRE or platform practices in a growing engineering organization
- Experience designing, operating, or scaling systems in cloud environments such as AWS (preferred), including familiarity with core services, networking models, and reliability features