Group 1001 is a consumer-centric, technology-driven family of insurance companies focused on delivering outstanding value and operational performance. They are seeking a Senior Network Reliability Engineer to build a Site Reliability Engineering practice with a network scope, applying SRE principles to enhance the firm's network platform and ensure reliability across multi-cloud environments.
Responsibilities:
- Treat reliability as an engineered property. Define SLOs and error budgets for the network platform — DNS resolution, edge availability, mesh ingress success, cross-region path health — and use them to gate changes, not just to color dashboards. Lead postmortems with a focus on permanent remediation, not pattern-recognition. Alert on symptoms users feel, not on causes that may or may not produce impact
- Move network state into code. Use Terraform (or Pulumi), Ansible, and Python to replace CLI-driven configuration with declarative, version-controlled, peer-reviewed change running through Infra CI/CD. This applies equally to the edge tier (Cloudflare), security platforms (Zscaler ZIA/ZPA, ZTNA policies, next-gen firewalls), the cloud network fabric (Transit Gateway, Cloud WAN, VPCs, Route53, IPAM), and increasingly the Kubernetes and service-mesh layer
- Build network policy as intent, not rule lists. Express what flows are permitted, what segments are isolated, what egress is inspected, what zones share DNS — and engineer the compilers that turn that intent into per-vendor configuration. Use Policy as Code (OPA/Rego, Sentinel, Cilium NetworkPolicy) to catch invariant violations at plan time, not apply time
- Infrastructure as Code (IaC): Design, deploy, and manage network infrastructure using Terraform or Ansible, moving the firm away from manual configuration to a code-first approach
- Engineer the cloud network platform. Operate and extend our multi-account AWS Landing Zone — Cloud WAN segmentation, Transit Gateway peering, IPAM-driven CIDR allocation, shared private DNS, cross-account telemetry pipelines. Build the platform abstractions that make a new account or service land correctly with policy and connectivity composed from declarative inputs
- Extend platform thinking into the container tier. Kubernetes networking, service mesh (Istio, Linkerd, Consul Connect), eBPF-based observability and policy (Cilium, Hubble), and the integration points where mesh-level authz meets cloud-tier identity. Recognize that an "internal" service is one logical hop on a chain of policy enforcement points and engineer for that explicitly
- Improve telemetry and observability with intent. Build alerts as structured payloads with runbook links, suspected blast radius, and dependency-aware suppression. Author both system-health dashboards for operators and end-user monitoring dashboards that reflect actual user experience. Use Grafana, Elastic, Open Telemetry where each fits
- Mentor and grow the team. Provide technical guidance to junior engineers, foster a culture of learning, and work out loud across Platform Engineering so the patterns you build cross-pollinate to adjacent domains
- Handle hardware when required. Provide maintenance and configuration support for routers, switches, and firewalls at data centers and offices when needed — bringing code-first practices to physical hardware where possible (templating, change validation, zero-touch provisioning) and direct hands-on competence where it isn't
- Incident Response: Serve as an escalation point for network issues, some complex and some basic but not yet covered by runbooks. Troubleshooting with a focus on root cause analysis and permanent remediation with a documentation-first mindset
- Reduce toil and hand off cleanly. Repetitive operational tasks are scoped engineering problems with measurable payoff. Author runbooks and SOPs that the NOC can execute confidently; package routine work for L1/L2 handoff so engineering interrupt drops over time. Coordinate across Data Platforms, NOC/SOC, and Cyber Security so reliability practices spread instead of staying siloed
Requirements:
- Deep understanding of TCP/IP, BGP, OSPF, VPNs, and SD-WAN architecture
- Proven experience with Terraform (state management, modules) and Ansible (playbooks, roles) – or similar – in a production environment
- Proficiency in Python for automation and API interaction, or similar
- Hands-on experience with Cloudflare, zScaler, and/or enterprise firewalls
- Experience configuring monitoring tools (e.g., Datadog, Prometheus, Grafana) to create meaningful alerts and dashboards
- Service mesh experience (Istio, Linkerd, Consul Connect, Cilium)
- eBPF-based observability (Hubble, Pixie)
- AWS Multi-account landing zone tooling experience (AFT, Control Tower, or equivalent)
- Policy as Code experience (OPA/Rego, Sentinel, Cilium NetworkPolicy)
- A strong belief that a job isn't done until the documentation in written
- A mindset that actively seeks to automate repetitive tasks
- Willingness to handle physical hardware tasks when required while maintaining a software-centric engineering mindset