Design, build, and maintain reliable, scalable, and secure infrastructure for Fable’s product services
Improve system observability, monitoring, and alerting to ensure high availability and fast incident response
Contribute to and evolve SRE practices, including SLIs/SLOs, incident management, and postmortems
Support and improve CI/CD pipelines and deployment processes
Identify and reduce operational complexity across systems and tooling
Work across infrastructure and application layers to diagnose and resolve reliability and performance issues, including making targeted improvements to application code when needed
Support infrastructure and platform capabilities required for AI/ML-powered features, including scaling, performance, and reliability considerations
Monitor and optimize infrastructure costs across cloud environments
Contribute to capacity planning and cost forecasting for infrastructure and services
Identify opportunities to improve performance and efficiency at the system level
Evaluate and optimize the cost and performance of compute-intensive workloads (e.g., AI/ML services), ensuring efficient resource usage and scalability
Work with third-party vendors and tools that support Fable’s infrastructure and operations
Help evaluate, select, and manage tools and services to support platform reliability and scalability
Support vendor-related troubleshooting and ongoing service improvements
Partner with Engineering teams to improve the reliability, performance, and operational readiness of new features
Partner with application engineering teams to improve service architecture, performance, and observability, and help define best practices for building reliable, scalable systems
Act as a point of support and escalation for production issues
Collaborate across teams to manage dependencies and ensure smooth system operations
Contribute to building strong SRE and operational practices across the organization
Share knowledge through documentation, pairing, and technical discussions
Help onboard and support more junior team members as the team grows
Contribute to improving ways of working within the team and across Engineering
Requirements
5–8+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Platform Engineering
Strong experience with cloud infrastructure (AWS, GCP, or Azure)
Experience building internal platforms, tooling, or shared services that improve developer productivity and system reliability
Experience designing systems that bridge infrastructure and application layers
Ability to work across the stack: comfortable reading, debugging, and making changes to application code (e.g., backend services, APIs) when needed to improve reliability, performance, or observability
Experience with at least one backend programming language (e.g., Node.js, Python, Go, Java)
Strong experience with monitoring, observability, and alerting tools (e.g., Datadog, Prometheus, Grafana)
Solid understanding of CI/CD systems and modern deployment practices
Experience managing infrastructure as code (e.g., Terraform, CloudFormation)
Experience optimizing system performance and infrastructure costs
Familiarity with security and compliance considerations in cloud environments
Experience working with third-party vendors and infrastructure tools
Familiarity with infrastructure considerations for AI/ML workloads (e.g., high-compute services, data pipelines, or third-party AI platforms) is a strong asset
Curiosity about emerging technologies and their impact on infrastructure, reliability, and cost at scale
Strong problem-solving skills and ability to navigate complex systems