Drive observability strategy and implementation including monitoring, logging, and performance optimization for complex distributed systems
Deep understanding of AI technologies and how they work, to enable end-to-end observability of such systems and build governance around their upkeep, cost, and management
Ability to implement Infrastructure as Code (IaC) and other DevOps tools and technologies to standardize infrastructure deployment, application installation, and consumer-specific customization
Lead the design and architecture of scalable observability solutions that support AI/ML workloads and LLM implementations across enterprise systems
Collaborate with cross-functional teams to translate business requirements into robust technical solutions and platform capabilities
Mentor and guide engineering teams on architectural best practices, system design principles, and emerging AI technologies
Lead platform modernization initiatives spanning observability and AI governance, containerization, and DevOps practices to improve system reliability and deployment velocity
Requirements
Bachelor’s or master’s degree in computer science, engineering, or a related technical field
10+ years of experience in platform engineering, systems architecture, or related roles
Proven experience in designing and implementing large-scale distributed systems with expertise in cloud platforms such as AWS, Azure, or Google Cloud Platform
Strong background in AI/ML infrastructure and LLM technologies including experience with model deployment, training pipelines, and inference optimization
Expertise in observability and monitoring solutions such as Prometheus, Grafana, ELK Stack, or Datadog for complex distributed environments
Proficiency in containerization and orchestration technologies such as Docker, Kubernetes, or OpenShift with experience in production deployments
Strong leadership and communication skills with demonstrated ability to mentor technical teams and collaborate effectively with stakeholders across the organization
Tech Stack
AWS
Azure
Cloud
Distributed Systems
Docker
Google Cloud Platform
Grafana
Kubernetes
OpenShift
Prometheus
Benefits
Health & Wellness: Health care coverage designed for the mind and body.
Flexible Downtime: Generous time off helps keep you energized for your time on.
Continuous Learning: Access a wealth of resources to grow your career and learn valuable new skills.
Invest in Your Future: Secure your financial future through competitive pay, retirement planning, a continuing education program with a company-matched student loan contribution, and financial wellness programs.
Family Friendly Perks: It’s not just about you. S&P Global has perks for your partners and little ones, too, with some best-in-class benefits for families.
Beyond the Basics: From retail discounts to referral incentive awards—small perks can make a big difference.