Afresh is an AI platform for the grocery industry that aims to reduce food waste and improve operational efficiency for grocers. As a Senior Software Engineer in Infrastructure, you will build and enhance the infrastructure that supports service teams, keeping operations reliable and safe while delivering cost and performance improvements.
Responsibilities:
- Own and deliver infrastructure projects end-to-end, from problem definition and technical design through implementation, rollout, and iteration
- Build and improve platform primitives that make it easier for service teams to deploy, operate, and debug their services
- Improve observability and operational readiness so we can detect issues early, reduce time-to-recovery, and prevent repeat incidents
- Identify and implement cost and performance improvements across our cloud infrastructure and developer tooling
- Work closely with Security to implement practical security controls and protect sensitive data (for example, least-privilege access, secret management, and network controls)
- Participate in our on-call rotation and continuously improve monitoring and alerting to maintain a low page rate
- Stay current on infrastructure best practices and evaluate improvements with a pragmatic, impact-focused mindset
Requirements:
- 5+ years of relevant software engineering experience (or equivalent experience)
- Experience delivering complex technical work in production environments
- Ability to turn ambiguous problems into a plan and execute with a high level of ownership and good judgment
- Experience operating and maintaining mission-critical cloud infrastructure with high uptime
- Experience designing and implementing scalable infrastructure (Azure preferred, but AWS/GCP are also fine)
- Experience with core cloud networking (VPC/VNet design, routing, DNS, load balancing, and connectivity)
- Experience leading or playing a key role in high-severity production incidents
- Experience troubleshooting complex issues and restoring service
- Experience writing and maintaining runbooks and playbooks to reduce MTTR
- Strong experience writing, maintaining, and operating production Terraform codebases
- Proficiency in at least one general-purpose programming language (Python preferred, but others are fine)
- Experience operating and troubleshooting workloads in a Kubernetes cluster
- Active use of AI coding assistants and experience integrating LLM-based tooling into infrastructure workflows
- Ability to prioritize effectively, stay focused on impact, and work comfortably with ambiguity
- Track record of making commitments and delivering, surfacing risks early, and communicating clearly when tradeoffs are needed
- Ability to build strong working relationships, incorporate feedback, and help unblock others
- Commitment to personal growth, healthy boundaries, and appropriate use of time off
- Experience driving a project or well-scoped initiative: aligning with partners on requirements and success criteria, writing or contributing to technical designs, and coordinating execution through launch
- Ability to communicate well with partner teams, understand their needs, validate solutions early, and push back constructively when tradeoffs are required
- Experience implementing automation to reduce manual intervention