Continuously monitor the production environment, tracking ticket queues and alerts via management tools;
Analyze and respond to Level 1 incidents, ensuring rapid identification and initial remediation of issues;
Execute operational procedures such as service restarts and environment recovery actions;
Perform log and metric analysis using observability tools to diagnose failures;
Act proactively to detect incidents and service degradation;
Set up and lead war rooms, coordinating communication and actions to resolve critical incidents;
Escalate incidents to internal teams and vendors when necessary;
Track incidents and ensure follow-up, keeping stakeholders, including executives, updated;
Support the stability and availability of microservices-based applications and distributed environments;
Collaborate with development and operations teams to resolve problems and drive continuous improvement of the environment;
Contribute to the evolution of SRE practices and the DevOps culture in day-to-day operations.
Requirements
Previous experience working as an SRE, NOC, or Command Center analyst;
Knowledge of microservices architecture;
Experience with CI/CD pipelines and practices;
Knowledge of Kubernetes;
Experience with AWS cloud;
Experience with monitoring and troubleshooting tools, such as Dynatrace;
Knowledge of Linux operating systems;
Experience with DevOps culture and SRE practices;
Experience in incident management and log analysis;
Strong analytical and problem-solving skills;
Clear communication skills for interacting with technical teams and stakeholders;
Completed bachelor's degree.
Desirable
Experience with ITSM tools (e.g., ServiceNow);
Experience in high-availability and mission-critical environments;
Experience automating operational routines;
Knowledge of advanced observability practices (metrics, logs, and traces);
Experience leading critical incidents and crisis management.