Judi Health is an enterprise health technology company providing a comprehensive suite of solutions for employers and health plans. The Senior Scalability Engineer will architect and develop a custom observability platform to enhance engineering productivity and system monitoring.

Responsibilities:

Architect observability platform: Design, implement, and maintain the LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) as the primary observability platform across all engineering teams, making architectural decisions that balance cost, performance, and developer experience
Build internal observability products: Design and develop production-grade internal platform products with React/TypeScript frontends and Python/Rust backends that provide engineers with powerful log search, metrics visualization, and trace analysis capabilities
Develop custom log indexing systems: Architect and build high-performance log indexing solutions using Rust that process logs and provide sub-second search across billions of log lines at a fraction of the cost
Integrate SQL analytics for logs: Design and implement solutions leveraging AWS Athena or similar SQL query engines (DuckDB, ClickHouse) for ad-hoc log analysis and historical queries, enabling engineers to run complex SQL queries over S3-based log data for deep investigations and trend analysis
Create advanced query interfaces: Build sophisticated web interfaces that allow engineers to query logs, metrics, and traces with features like saved queries, query templates, correlation analysis, and pattern detection, supporting both full-text search and SQL-based analytics
Balance cloud-native and open-source: Architect solutions that thoughtfully leverage both AWS-managed services (CloudWatch, Athena, Kinesis) and open-source tooling (LGTM stack, Quickwit) to optimize for cost, performance, and operational flexibility based on use case requirements
Integrate AWS observability: Design seamless integration between AWS CloudWatch Logs/Metrics and our custom observability platform, providing unified visibility across managed and self-hosted infrastructure
Build intelligent alerting: Develop smart dashboards, monitors, and alerting systems that reduce noise, detect anomalies, and help teams respond to incidents quickly
Partner with engineering teams: Work directly with product teams to integrate observability into their services, establish logging and metrics standards, and instrument code effectively, serving as the observability subject matter expert
Enable performance optimization: Provide the observability foundation that allows the Scalability team to identify performance bottlenecks, track optimization impact, and measure platform stability with data-driven insights
Establish observability standards: Define and document comprehensive observability standards including structured logging patterns, metric naming conventions, trace instrumentation, dashboard design principles, and query best practices
Drive platform adoption: Lead workshops, create documentation, and build self-service tooling that democratizes observability across engineering, making it easy for teams to adopt best practices
Demonstrate technical leadership: Mentor engineers on observability practices, lead architecture reviews for instrumentation approaches, and represent the Scalability team in cross-functional planning
Work in an Agile/Scrum environment to continually deliver value to stakeholders and clients

Requirements:

10+ years of software engineering or infrastructure engineering experience with demonstrated progression into technical leadership roles
Several years of experience leading technical initiatives, building platform products, or serving as a subject matter expert on observability infrastructure
Strong experience with React/TypeScript for frontend development and Python (Flask/SQLAlchemy) for backend services
LGTM stack expertise: Deep production experience with Loki, Grafana, Tempo, and Prometheus/Mimir for logs, metrics, and distributed tracing at scale
AWS observability: Extensive experience with AWS CloudWatch Logs and Metrics, including custom metrics, CloudWatch Logs Insights, dashboard creation, and integration patterns
SQL analytics for logs: Production experience with SQL-based log analytics using AWS Athena, DuckDB, or similar query engines for analyzing structured and semi-structured data at scale
Cloud-native and open-source balance: Demonstrated ability to architect solutions leveraging both managed cloud services and open-source tooling, understanding trade-offs between operational overhead, cost, flexibility, and vendor lock-in
Search and indexing experience: Hands-on experience building or operating search systems using OpenSearch, Elasticsearch, Lucene, Tantivy, or similar search and analytics engines
Performance-critical systems: Experience building high-performance systems that process large volumes of data efficiently (millions of log lines, high-cardinality metrics)
Systems thinking: Deep understanding of distributed systems, microservices architectures, and the complex observability challenges they present
Data at scale: Proven track record handling high-volume structured and unstructured logging data, identifying patterns, and building efficient search/query solutions that perform well under load
Product mindset: Ability to build internal platform products that engineers love to use, with attention to UX, performance, and reliability
Rust development experience: Production experience with Rust for building high-performance data processing, indexing, or search systems. Strong interest in learning Rust is acceptable if combined with systems programming experience in C/C++/Go
Infrastructure as code: Experience with Terraform for managing observability infrastructure and AWS resources
Additional observability platforms: Experience architecting or operating Datadog, New Relic, Splunk, or other enterprise observability platforms
Advanced query languages: Deep expertise with PromQL, LogQL, SQL optimization, and query optimization for high-cardinality data
Columnar storage formats: Experience with Parquet, ORC, or other columnar storage formats for efficient log storage and analytics on S3
Incident management: Experience designing incident response workflows, postmortem processes, and SLO/SLI frameworks that drive reliability improvements
Cost optimization: Track record of reducing observability costs while maintaining or improving capabilities (e.g., CloudWatch → S3/custom indexing migration)
Data pipelines: Experience with streaming data pipelines, ETL processes, or real-time data processing
Distributed tracing: Deep knowledge of OpenTelemetry, Jaeger, Zipkin, or distributed tracing architectures
Git expertise and experience working in a monorepo
Previous Pharmacy Benefits Manager (PBM) or healthcare technology experience
Experience building developer tools or internal platforms that improve engineering productivity

Senior Scalability Engineer - Observability
