Drive observability strategy and implementation, including monitoring, logging, and performance optimization for complex distributed systems.
Deep understanding of AI technologies and how they work, to enable end-to-end observability of such systems and build governance around their upkeep, cost, and management.
Ability to implement Infrastructure as Code (IaC) and other DevOps tools/technologies to standardize infrastructure deployment, application installation, and consumer-specific customization.
Lead the design and architecture of scalable observability solutions that support AI/ML workloads and LLM implementations across enterprise systems.
Collaborate with cross-functional teams to translate business requirements into robust technical solutions and platform capabilities.
Mentor and guide engineering teams on architectural best practices, system design principles, and emerging AI technologies.
Lead platform modernization initiatives spanning observability and AI governance, containerization, and DevOps practices to improve system reliability and deployment velocity.
Requirements
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field, with 10+ years of experience in platform engineering, systems architecture, or related roles.
Proven experience in designing and implementing large-scale distributed systems with expertise in cloud platforms such as AWS, Azure, or Google Cloud Platform.
Strong background in AI/ML infrastructure and LLM technologies, including experience with model deployment, training pipelines, and inference optimization.
Expertise in observability and monitoring solutions such as Prometheus, Grafana, ELK Stack, or Datadog for complex distributed environments.
Proficiency in containerization and orchestration technologies such as Docker, Kubernetes, or OpenShift with experience in production deployments.
Strong leadership and communication skills with demonstrated ability to mentor technical teams and collaborate effectively with stakeholders across the organization.
Tech Stack
AWS
Azure
Cloud
Distributed Systems
Docker
Google Cloud Platform
Grafana
Kubernetes
OpenShift
Prometheus
Benefits
Health & Wellness: Health care coverage designed for the mind and body.
Flexible Downtime: Generous time off helps keep you energized for your time on.
Continuous Learning: Access a wealth of resources to grow your career and learn valuable new skills.
Invest in Your Future: Secure your financial future through competitive pay, retirement planning, a continuing education program with a company-matched student loan contribution, and financial wellness programs.
Family Friendly Perks: It’s not just about you. S&P Global has perks for your partners and little ones, too, with some best-in-class benefits for families.
Beyond the Basics: From retail discounts to referral incentive awards—small perks can make a big difference.