Consultative Ownership: Work with autonomy to own problems and deliver solutions, acting as a bridge between development and operations.
Observability Architecture: Design and implement robust monitoring solutions using the LGTM stack to ensure system health and performance.
Reliability Strategy: Advise clients on defining meaningful SLOs/SLIs and managing error budgets to balance innovation with stability.
AI Assistance: Drive the use of AI agents and AI tools to enable intelligent automation and improve operational efficiency.
Incident Leadership: Lead post-incident reviews (Blameless Post-Mortems) to identify systemic improvements and reduce future toil.
Mentorship: Coach less experienced engineers within Fabric and our client teams on SRE principles and modern infrastructure patterns.
Advise our clients on the right technical decisions and advocate for the right practices.
Participate in interviewing and recruitment based on business needs.
Thought Leadership: Contribute to the SRE community through blog posts, meetups, or internal knowledge sharing.
Operational Support & Availability
Rotational Support Coverage: Participate in a sustainable team rotation to provide extended service coverage (including weekends) for business-critical systems.
Incident Response: Act as a primary responder for high-priority (P1/P2) incidents during your rostered shift, focusing on rapid restoration and clear stakeholder communication.
Requirements
Observability: Strong expertise with Grafana, including the LGTM stack (Loki, Grafana, Tempo, Mimir) or Grafana Cloud, and with OpenTelemetry.
Container Orchestration: Solid experience with Kubernetes management, configuration, and troubleshooting in production.
Good understanding of AI Agent frameworks and tools like Grafana AI Assistant.
Cloud Proficiency: Hands-on experience with GCP or AWS, including networking, security, and cloud-native services.
Modern Deployment: Proven experience implementing GitOps (ArgoCD) and CI/CD pipelines (GitLab CI, GitHub Actions, etc.).
Infrastructure as Code (IaC): Experience with tools like Terraform.
Automation & Scripting: Proficiency in at least one language (e.g., Python, Go, or Bash) for building tooling and automating operational tasks.
Incident Management: Experience with on-call rotation tools (Grafana OnCall, Opsgenie) and a strong commitment to a blameless culture.
Tech Stack
AWS
Cloud
Google Cloud Platform
Grafana
Kubernetes
Python
Terraform
Go
Benefits
Flexibility to support work-life balance while maintaining professional independence.
Contract duration is typically 12 months, with the possibility of renewal based on project needs and performance.
Payment is at a daily rate in Australian dollars, set to reflect your experience and the local Indian market.
Contractors are fully integrated into project teams and Fabric’s culture.