About this role
Role Overview
Design and implement end-to-end observability solutions across applications, infrastructure, and cloud environments.
Develop dashboards, alerts, and telemetry frameworks to provide real-time visibility into system health and performance.
Build automation solutions to eliminate repetitive operational tasks and improve efficiency.
Enable runbook automation, self-healing capabilities, and automated incident triage workflows.
Define and implement SLIs, SLOs, and alerting strategies to improve service reliability.
Drive improvements in MTTD and MTTR through actionable alerts and telemetry-driven insights.
Implement proactive monitoring, anomaly detection, and predictive alerting to identify issues before customer impact.
Leverage AIOps capabilities for alert correlation and intelligent incident response.
Integrate observability platforms with CI/CD pipelines, cloud services, and ITSM tools such as ServiceNow.
Collaborate with engineering, product, and operations teams to establish observability standards and operational readiness practices.
Mentor teams and drive adoption of observability best practices across the organization.
Requirements
4+ years of experience in Observability Engineering, Site Reliability Engineering, or related domains.
Hands-on experience with observability platforms such as Dynatrace, Splunk, Grafana, and OpenTelemetry.
Strong expertise in AWS and GCP, plus familiarity with cloud-native architectures.
Proficiency in Python for automation and operational tooling.
Experience implementing metrics, events, logs, and distributed traces (MELT) across distributed systems.
Hands-on experience with Terraform and Infrastructure as Code practices.
Strong understanding of SLIs, SLOs, alerting strategies, and incident response frameworks.
Excellent troubleshooting, communication, and collaboration skills.
Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).
Tech Stack
AWS
Cloud
Distributed Systems
Google Cloud Platform
Grafana
ITSM
Python
ServiceNow
Splunk
Terraform
Benefits
Culture of Relentless Performance: join an unstoppable technology development team with a 99% project success rate and more than 30% year-over-year revenue growth.
Competitive Pay and Benefits: enjoy a comprehensive compensation and benefits package, including health insurance, language courses, and a relocation program.
Work From Anywhere Culture: make the most of the flexibility that comes with remote work.
Growth Mindset: reap the benefits of a range of professional development opportunities, including certification programs, mentorship and talent investment programs, internal mobility, and internship opportunities.
Global Impact: collaborate on impactful projects for top global clients and shape the future of industries.
Welcoming Multicultural Environment: be a part of a dynamic, global team and thrive in an inclusive and supportive work environment with open communication and regular team-building and company social events.
Social Sustainability Values: take part in sustainable business practices built on five pillars: IT education, community empowerment, fair operating practices, environmental sustainability, and gender equality.