Reliability Leadership: Lead the technical planning of operations initiatives, ensuring that resilience and scalability are core requirements from solution design.
Platform Engineering (Self-Service): Act as an enabler for technology teams by creating abstractions and automations that remove obstacles in deployment and infrastructure management, enabling smooth, low-friction operations.
Go beyond day-to-day operations by identifying repetitive tasks and turning them into robust automations so the Ops team can focus on high-value work.
Innovation and Continuous Improvement: Rethink legacy operational processes, introducing improvements (even small script or workflow changes) that bring predictability and safety to the production environment.
Incident Response Lead: Serve as the technical escalation point in complex crises, leading resolution efforts and, importantly, conducting root cause analysis to prevent recurrence.
Review proposed changes and architectures, ensuring they meet the company's security, cost (FinOps), and operational excellence standards.
Mentorship and SRE Culture: Be a catalyst for SRE culture within Operations, raising the team's technical level and promoting knowledge-sharing about distributed systems.
Requirements
Infrastructure as Code (IaC): Advanced experience with Terraform for large-scale environments, creating standards that enable engineering teams to operate autonomously.
AWS Expertise: Deep knowledge of AWS services.
Observability: Define and implement strategies based on SLIs, SLOs and Error Budgets, using New Relic to anticipate incidents.
CI/CD Experience: Strong command of automation pipelines (preferably GitLab CI/CD) with a focus on security.
Infrastructure Security: Experience remediating infrastructure vulnerabilities (CVEs) and image hardening, ensuring a hardened and compliant environment.
Linux Operating System: Advanced OS-level troubleshooting.
Orchestration Leadership: Lead the container strategy with a primary focus on Amazon ECS (EC2 and Fargate), ensuring resilience and cost efficiency.
Data Governance: Serve as a reference for MySQL and PostgreSQL, ensuring our instances (RDS/Aurora) are optimized for high performance and security.
Tech Stack
AWS
EC2
Linux
MySQL
Postgres
Terraform
Benefits
Health insurance;
Dental insurance;
Meal or food allowance;
Childcare assistance;
Transportation allowance;
Profit-sharing (PPR) program;
Paid day off during your birthday month;
Life insurance;
Wellhub (employee wellness program);
Férias&Co (vacation/travel benefit);
6-month maternity leave and 20-day paternity leave;
Flexible working hours;
#Secuida
our Quality of Life program;
Partnerships with various establishments and institutions in education, health, leisure, entertainment, and more.