Lead strategic initiatives to advance observability maturity across the company, designing data pipelines for the three pillars (logs, metrics and traces).
Design and maintain telemetry pipelines that ensure end-to-end visibility, from application code through to the infrastructure layer.
Lead the definition of meaningful SLIs and SLOs, helping engineering teams manage their error budgets so business decisions are based on performance data.
Propose alternative technical solutions for the Observability roadmap, ensuring the technology stack remains up to date and efficient.
Reduce alert fatigue by building intelligent alerting and high-fidelity dashboards (Grafana/Datadog), focusing on the end-customer experience.
Promote a FinOps culture, using performance data (CPU/Memory vs. Latency) to identify cloud waste and propose cost and scaling optimizations (HPA/VPA).
Support and advise development teams on instrumenting applications (OpenTelemetry or APM SDKs), ensuring data is useful for troubleshooting complex incidents.
Coordinate joint efforts with other tribes when observability solutions exceed the platform scope, sharing reusable tools.
Anticipate technical risks and dependencies during business discussions, influencing the prioritization of technical debt and resilience improvements.
Promote SRE culture, focusing on reducing MTTR (Mean Time To Repair) and implementing evidence-based post-mortems.
Requirements
Experience in critical environments: Proven track record supporting high-scale platforms (thousands of requests per second).
Proficiency with APM tools: Advanced experience with leading tools such as Datadog, Dynatrace or New Relic.
Observability stack expertise: Deep knowledge of Prometheus (metrics), Grafana (visualization), OpenSearch/Elasticsearch (logs) and Kong (API Gateway).
Cloud-agnostic mindset: Strong AWS experience, with a focus on solutions that avoid vendor lock-in.
Instrumentation knowledge: Hands-on experience with OpenTelemetry (OTel) and distributed telemetry patterns.
Soft skills and influence: Ability to persuade developers to adopt observability by design and translate technical metrics clearly to product leadership.
Availability for hybrid work: On-site presence at the Morumbi (SP) office once a month for 4 consecutive days (Creditas in Person).
Tech Stack
AWS
Cloud
Elasticsearch
Grafana
Prometheus
Benefits
Health insurance (Alice)
Dental plan (SulAmérica)
Wellz: 100% free therapy sessions
Wellhub: access to gyms and studios
Creditas Endurance: incentive program for high-impact sports
Pharmacy partnership (Univers)
Life insurance (Porto Seguro)
Birthday day off
Extended parental leave: 6 months for birthing parents and 35 days for non-birthing parents
Family Care: support program for maternity and paternity
Childcare allowance
Assistance for dependents with disabilities (PWDs)
SESC: access to SESC units for you and your dependents