Group 1001 is a consumer-centric, technology-driven family of insurance companies focused on delivering outstanding value and operational performance. They are seeking a Senior Network Reliability Engineer to build a Site Reliability Engineering practice with a network scope, applying SRE principles to enhance the firm's network platform and ensure reliability across multi-cloud environments.

Responsibilities:

Treat reliability as an engineered property. Define SLOs and error budgets for the network platform — DNS resolution, edge availability, mesh ingress success, cross-region path health — and use them to gate changes, not just to color dashboards. Lead postmortems with a focus on permanent remediation, not pattern-recognition. Alert on symptoms users feel, not on causes that may or may not produce impact
Move network state into code. Use Terraform (or Pulumi), Ansible, and Python to replace CLI-driven configuration with declarative, version-controlled, peer-reviewed change running through Infra CI/CD. This applies equally to the edge tier (Cloudflare), security platforms (Zscaler ZIA/ZPA, ZTNA policies, next-gen firewalls), the cloud network fabric (Transit Gateway, Cloud WAN, VPCs, Route53, IPAM), and increasingly the Kubernetes and service-mesh layer
Build network policy as intent, not rule lists. Express what flows are permitted, what segments are isolated, what egress is inspected, what zones share DNS — and engineer the compilers that turn that intent into per-vendor configuration. Use Policy as Code (OPA/Rego, Sentinel, Cilium NetworkPolicy) to catch invariant violations at plan time, not apply time
Infrastructure as Code (IaC): Design, deploy, and manage network infrastructure using Terraform or Ansible, moving the firm away from manual configuration to a code-first approach
Engineer the cloud network platform. Operate and extend our multi-account AWS Landing Zone — Cloud WAN segmentation, Transit Gateway peering, IPAM-driven CIDR allocation, shared private DNS, cross-account telemetry pipelines. Build the platform abstractions that make a new account or service land correctly with policy and connectivity composed from declarative inputs
Extend platform thinking into the container tier. Kubernetes networking, service mesh (Istio, Linkerd, Consul Connect), eBPF-based observability and policy (Cilium, Hubble), and the integration points where mesh-level authz meets cloud-tier identity. Recognize that an "internal" service is one logical hop on a chain of policy enforcement points and engineer for that explicitly
Improve telemetry and observability with intent. Build alerts as structured payloads with runbook links, suspected blast radius, and dependency-aware suppression. Author both system-health dashboards for operators and end-user monitoring dashboards that reflect actual user experience. Use Grafana, Elastic, and OpenTelemetry where each fits
Mentor and grow the team. Provide technical guidance to junior engineers, foster a culture of learning, and work out loud across Platform Engineering so the patterns you build cross-pollinate to adjacent domains
Handle hardware when required. Provide maintenance and configuration support for routers, switches, and firewalls at data centers and offices when needed — bringing code-first practices to physical hardware where possible (templating, change validation, zero-touch provisioning) and direct hands-on competence where it isn't
Incident Response: Serve as an escalation point for network issues, some complex and some basic but not yet covered by runbooks. Troubleshoot with a focus on root cause analysis and permanent remediation, and with a documentation-first mindset
Reduce toil and hand off cleanly. Repetitive operational tasks are scoped engineering problems with measurable payoff. Author runbooks and SOPs that the NOC can execute confidently; package routine work for L1/L2 handoff so engineering interrupt drops over time. Coordinate across Data Platforms, NOC/SOC, and Cyber Security so reliability practices spread instead of staying siloed

Requirements:

Deep understanding of TCP/IP, BGP, OSPF, VPNs, and SD-WAN architecture
Proven experience with Terraform (state management, modules) and Ansible (playbooks, roles) – or similar – in a production environment
Proficiency in Python for automation and API interaction, or similar
Hands-on experience with Cloudflare, Zscaler, and/or enterprise firewalls
Experience configuring monitoring tools (e.g., Datadog, Prometheus, Grafana) to create meaningful alerts and dashboards
Service mesh experience (Istio, Linkerd, Consul Connect, Cilium)
eBPF-based observability (Hubble, Pixie)
Experience with AWS multi-account landing zone tooling (AFT, Control Tower, or equivalent)
Policy as Code experience (OPA/Rego, Sentinel, Cilium NetworkPolicy)
A strong belief that a job isn't done until the documentation is written
A mindset that actively seeks to automate repetitive tasks
Willingness to handle physical hardware tasks when required while maintaining a software-centric engineering mindset
