Afresh is the leading AI company in fresh food, partnering with grocers to eliminate food waste and make fresh food accessible. As a Senior Software Engineer in Infrastructure, you will be responsible for building and improving the infrastructure that supports service teams, ensuring reliable and safe operations while driving performance improvements.
Responsibilities:
- Own and deliver infrastructure projects end-to-end, from problem definition and technical design through implementation, rollout, and iteration
- Build and improve platform primitives that make it easier for service teams to deploy, operate, and debug their services
- Improve observability and operational readiness so we can detect issues early, reduce time-to-recovery, and prevent repeat incidents
- Identify and implement cost and performance improvements across our cloud infrastructure and developer tooling
- Work closely with Security to implement practical security controls and protect sensitive data (for example, least-privilege access, secret management, and network controls)
- Participate in our on-call rotation and continuously improve monitoring and alerting to maintain a low page rate
- Stay current on infrastructure best practices and evaluate improvements with a pragmatic, impact-focused mindset
Requirements:
- 5+ years of relevant software engineering experience (or equivalent experience)
- Experience delivering complex technical work in production environments
- Ability to turn ambiguous problems into a plan and execute with a high level of ownership and good judgment
- Experience operating and maintaining mission-critical cloud infrastructure with high uptime
- Experience designing and implementing scalable infrastructure (Azure preferred, but AWS/GCP are also fine)
- Experience with core cloud networking (VPC/VNet design, routing, DNS, load balancing, and connectivity)
- Ability to build improvements that make it easier for service owners to manage their own systems
- Experience leading or playing a key role in the response to high-severity production incidents
- Ability to troubleshoot complex issues, restore service, and communicate clearly with stakeholders
- Experience writing and maintaining runbooks and playbooks to reduce MTTR
- Strong experience writing, maintaining, and operating production Terraform codebases
- Proficiency in at least one general-purpose programming language (Python preferred, but others are fine)
- Ability to operate and troubleshoot workloads in a Kubernetes cluster
- Experience with AI-assisted development and integrating LLM-based tooling into infrastructure workflows
- Startup mindset with effective prioritization and focus on impact
- Relentless delivery focus with clear communication of risks and tradeoffs
- Ability to build strong working relationships and provide mentorship
- Strong self-management skills and investment in personal growth
- Ability to drive a project or well-scoped initiative and coordinate execution through launch
- Strong communication skills with partner teams and ability to validate solutions early
- Experience implementing automation to reduce manual intervention