RTX is a leading company in aviation software and is seeking a Principal Site Reliability Engineer to enhance the infrastructure of FlightAware, the world's largest flight tracking platform. The role involves automating processes, ensuring service availability, and collaborating with various teams to maintain and improve the reliability of systems.

Responsibilities:

Automate and improve reliability, and continue to push FlightAware's infrastructure forward, ensuring it is resilient and reproducible
Be responsible for service availability, performance, monitoring, incident response, and capacity planning
Create, improve, and manage environments to ensure decisions on resource allocation, problem identification, and capacity planning are based on accurate, data-driven insights
Maintain physical infrastructure using Kubernetes, Linux, and Ceph, and cloud infrastructure in AWS, as part of the Site Reliability Engineering team
Influence technology decisions and direction to grow and support the FlightAware platform
Collaborate closely with fellow SREs on your team and extend your collaboration across other FlightAware teams and disciplines to design dependable and scalable solutions and services
Identify, implement, and champion process improvements to enhance productivity, collaboration, and delivery efficiency, while ensuring alignment with company goals and industry best practices

Requirements:

Typically requires a degree in Science, Technology, Engineering, or Mathematics (STEM) and a minimum of 8 years of prior relevant experience; or an advanced degree in a related field and a minimum of 5 years of experience; or, in the absence of a degree, 12 years of relevant experience
Must be authorized to work in the U.S. without sponsorship now or in the future. RTX will not offer sponsorship for this position
Experience as an SRE, Platform Engineer, or related position within a Linux or UNIX environment, working on large, complex infrastructures and/or projects using Docker and Kubernetes solutions
Experience automating configuration and infrastructure with tools such as SaltStack, Ansible, Terraform, or other declarative tooling
Experience with hardware, including servers, network switches, and cabling
Experience managing Kubernetes clusters using GitOps with continuous delivery (CD) pipelines such as Flux or Argo
Experience deploying and maintaining large, distributed storage solutions, such as Ceph
Established proficiency in at least one (ideally more) of the following: Python, Go, Rust, or Shell (bash, awk, sed)
Experience with PostgreSQL, or equivalent RDBMS and SQL in general
Experience working with Nix or NixOS
Familiarity with Cloud infrastructure, ideally AWS
Understanding of SRE principles, including building observability solutions and exposing metrics to inform SLOs and KPIs
Understanding of how IT infrastructure services work, including DNS, DHCP, PXE, LDAP, and NFS
Understanding of network segmentation, routing and VPNs
You are a private pilot, are looking to pursue your private pilot license, or have a passion for aviation

Principal Site Reliability Engineer - FlightAware (Remote)
