Analyze application reliability, performance, and availability.
Monitor deployment issues for applications, addressing performance or security problems as they arise and capturing lessons learned to prevent similar incidents in the future.
Proactively manage the task backlog, identify opportunities for improvement, and propose effective collaborative solutions.
Maintain effective communication with teams responsible for different application journeys, ensuring a clear understanding of needs and priorities.
Stay up to date with industry trends, best practices, and emerging technologies related to cloud computing and DevOps/SRE.
Requirements
Experience as a Site Reliability Engineer (SRE) and familiarity with SRE metrics.
Experience monitoring Java backend applications.
Strong experience with FinOps practices and cloud cost management.
Experience working with observability tools such as Datadog, Grafana, Prometheus, and Thanos.
Experience with AWS-based platforms (ECS, EKS) and/or Kubernetes and Docker.
Experience with Linux.
Technical knowledge of GitHub, Jenkins, and Splunk (desirable).
Experience with CI/CD pipelines (GitHub Actions, CodeBuild, CodePipeline).
Experience with Infrastructure as Code (Terraform).
Analytical skills and strong problem-solving ability, with a desire to learn and adapt in a dynamic environment.
Experience with performance testing and stress testing.
Understanding of chaos engineering concepts (what to test, what to validate, which failures to inject into the application, e.g., removing a database node and observing application behavior).
Ability to troubleshoot efficiently and propose continuous improvements using Splunk, dashboards, and tracing tools.
Familiarity with mobile application monitoring (Android and iOS).
Knowledge of Google Analytics and Firebase Crashlytics.
Familiarity with any of the following (if applicable):
Knowledge of programming languages such as Java, shell scripting, Go, and Python.
Tech Stack
Android
AWS
Docker
Firebase
Grafana
iOS
Java
Jenkins
Linux
Prometheus
Splunk
Terraform
Go
Benefits
Health and dental insurance;
Food and meal vouchers;
Childcare assistance;
Extended parental leave;
Partnerships with gyms and health & wellness professionals via Wellhub (Gympass) and TotalPass;
Profit-sharing/Performance bonus (PLR);
Life insurance;
Continuous learning platform (CI&T University);
Employee discounts club;
Free online platform dedicated to physical and mental health and wellness.