Lead 24/7 IT operations, taking responsibility for system stability, performance and the resolution of critical incidents.
Orchestrate war rooms and conduct root cause analysis (RCA), ensuring corrective and preventive actions are well defined and executed on schedule.
Coordinate and provide technical support to Infrastructure, Networks, SRE, Information Security, Helpdesk, DBA and Support teams.
Operate with both technical and product-oriented perspectives, working closely with agile squads in partnership with development and product teams.
Actively participate in agile ceremonies (e.g., planning, review), manage requests via Jira or similar tools, and ensure clear visibility of priorities.
Drive change and deployment processes, ensuring quality, security and stability of services.
Represent the department in audits, certification processes (ISO, SOC, etc.), RFPs and client meetings, ensuring policies and documentation are up to date.
Monitor KPIs and SLAs, and provide periodic reports and analyses to executive management.
Manage technology contracts and vendors, ensuring service levels, contractual compliance and effective performance in critical situations.
Oversee budgeting with a focus on resource optimization, capacity planning, cost forecasting (cloud and on-premises) and CAPEX/OPEX control, seeking operational efficiency without compromising quality. Own cost-benefit and ROI analyses for technology projects, and support the prioritization of investments aligned with strategic business value.
Implement and evolve Infrastructure as Code (IaC) practices using tools such as Terraform, Ansible and CloudFormation, integrated with CI/CD pipelines.
Requirements
Bachelor's degree in Computer Science, Engineering, Information Systems or related fields.
Minimum of 8 years in technical roles within infrastructure, networking, cloud or security, with at least 5 years in technical leadership.
Experience in high-criticality, high-availability environments, with a track record in incident management, service stability and running war rooms.
Hands-on knowledge of cloud environments (AWS, GCP or Azure), network security, observability tools (Prometheus, Grafana, ELK, etc.) and automation.
Experience with agile methodologies (Scrum, Kanban) and tools such as Jira, Confluence, Trello or similar.
Experience with information security standards, certifications and regulations such as ISO 27001, SOC 2 and LGPD (the Brazilian data protection law).
Experience with SRE practices, including defining and monitoring SLOs/SLIs/SLAs, full observability, pipeline automation and resilience testing. Strong command of scalability and high-availability strategies, incident management with learning-oriented post-mortems, and promotion of a reliability culture across development, operations and security.
Experience with ITIL, COBIT and service governance frameworks (e.g., ITIL for incident, problem and change management, organization and planning).
Ability to lead multiple teams and ensure continuous value delivery in dynamic, high-criticality environments.
Hands-on experience managing technology contracts and vendors, with a focus on performance, cost-effectiveness and strategic alignment.
Advanced English, for working with vendors based outside Brazil.
Tech Stack
Ansible
AWS
Azure
Cloud
Google Cloud Platform
Grafana
Prometheus
Terraform
Benefits
Prescription drug coverage plan
50% coverage for general prescription medications and 100% coverage for women's health medications.
360° well-being: Totalpass
Conexa Saúde: online therapy
Personalized meal plans
Wellness check-up: assess your physical and mental health
Physical and mental well-being content
Platform with over 3,500 video classes across multiple fitness modalities
Commitment to diversity
No dress code (feel free to wear shorts whenever you want!)
Bridge days for all holidays
Birthday day off
Lots of fun (we love a party!)
Hybrid model (3 days in the office in São Caetano do Sul).