Reliability and Availability: learn and apply reliability concepts (SLOs/SLIs) and the fundamental principles of Site Reliability Engineering (SRE) under guidance;
Incident Management: respond to first-line (Level 1) incidents by following established runbooks and promptly escalate medium- or high-complexity issues to senior team members;
Actively participate in Root Cause Analysis (Post-Mortem) meetings as an essential part of the learning and improvement process;
Automation: assist in developing and executing automation for simple operational tasks using basic scripts (e.g., Bash, Python);
Monitoring and Observability: actively monitor the health of systems and infrastructure using observability tools, perform initial analysis and triage of anomalies, and report deviations;
Capacity Management: monitor system and infrastructure capacity and help implement scalability strategies to ensure proper performance under varying workloads;
Operation and Maintenance: perform infrastructure operation and maintenance tasks, always strictly following the team’s procedures and runbooks;
Infrastructure as Code (IaC): support the team in applying existing IaC templates and modules for routine infrastructure tasks;
Collaboration and Troubleshooting: collaborate with the team to troubleshoot low-complexity issues and collect information for resolving larger incidents;
Documentation: assist in creating, reviewing, and maintaining basic technical documentation for processes and runbooks;
Security: support the team in applying basic security requirements and complying with established guidelines;
Continuous Improvement: participate in continuous improvement discussions and initiatives, proposing optimizations for processes under your responsibility.
Requirements
Bachelor’s degree in Computer Science, Computer Engineering, Information Systems, or a related field;
Up to 2 years of proven experience in internships or in IT/SRE support or operations roles;
Basic knowledge of Linux operating systems;
Initial experience with cloud platforms (AWS, Azure, or GCP);
Familiarity with container orchestration using Kubernetes and cluster management with Rancher and/or OpenShift;
Initial experience with monitoring and observability tools (Prometheus, Grafana, ELK Stack, etc.);
Knowledge of a programming language for automation and scripting (Python, Go, Bash, etc.);
Initial experience with Infrastructure as Code (IaC) tools for infrastructure automation;
Knowledge of networking concepts and communication protocols (TCP/IP, DNS, HTTP, etc.);
Familiarity with DevOps practices and agile methodologies.
Tech Stack
AWS
Azure
DNS
Google Cloud Platform
Grafana
Kubernetes
Linux
OpenShift
Prometheus
Python
TCP/IP
Go
Benefits
Meal allowance or food voucher to support balanced meals.
Group Bradesco Health Plan to support the well-being of you and your family.