Judi Health is an enterprise health technology company providing a comprehensive suite of solutions for employers and health plans. The Senior Scalability Engineer will architect and develop a custom observability platform to enhance engineering productivity and system monitoring.
Responsibilities:
- Architect observability platform: Design, implement, and maintain the LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) as the primary observability platform across all engineering teams, making architectural decisions that balance cost, performance, and developer experience
- Build internal observability products: Design and develop production-grade internal platform products with React/TypeScript frontends and Python/Rust backends that provide engineers with powerful log search, metrics visualization, and trace analysis capabilities
- Develop custom log indexing systems: Architect and build high-performance log indexing solutions using Rust that process logs and provide sub-second search across billions of log lines at a fraction of the cost
- Integrate SQL analytics for logs: Design and implement solutions leveraging AWS Athena or similar SQL query engines (DuckDB, ClickHouse) for ad-hoc log analysis and historical queries, enabling engineers to run complex SQL queries over S3-based log data for deep investigations and trend analysis
- Create advanced query interfaces: Build sophisticated web interfaces that allow engineers to query logs, metrics, and traces with features like saved queries, query templates, correlation analysis, and pattern detection, supporting both full-text search and SQL-based analytics
- Balance cloud-native and open-source: Architect solutions that thoughtfully leverage both AWS-managed services (CloudWatch, Athena, Kinesis) and open-source tooling (LGTM stack, Quickwit) to optimize for cost, performance, and operational flexibility based on use case requirements
- Integrate AWS observability: Design seamless integration between AWS CloudWatch Logs/Metrics and our custom observability platform, providing unified visibility across managed and self-hosted infrastructure
- Build intelligent alerting: Develop smart dashboards, monitors, and alerting systems that reduce noise, detect anomalies, and help teams respond to incidents quickly
- Partner with engineering teams: Work directly with product teams to integrate observability into their services, establish logging and metrics standards, and instrument code effectively, serving as the observability subject matter expert
- Enable performance optimization: Provide the observability foundation that allows the Scalability team to identify performance bottlenecks, track optimization impact, and measure platform stability with data-driven insights
- Establish observability standards: Define and document comprehensive observability standards including structured logging patterns, metric naming conventions, trace instrumentation, dashboard design principles, and query best practices
- Drive platform adoption: Lead workshops, create documentation, and build self-service tooling that democratizes observability across engineering, making it easy for teams to adopt best practices
- Demonstrate technical leadership: Mentor engineers on observability practices, lead architecture reviews for instrumentation approaches, and represent the Scalability team in cross-functional planning
- Work in an Agile/Scrum environment to continually deliver value to stakeholders and clients
Requirements:
- 10+ years of software engineering or infrastructure engineering experience with demonstrated progression into technical leadership roles
- Several years of experience leading technical initiatives, building platform products, or serving as a subject matter expert on observability infrastructure
- Strong experience with React/TypeScript for frontend development and Python (Flask/SQLAlchemy) for backend services
- LGTM stack expertise: Deep production experience with Loki, Grafana, Tempo, and Prometheus/Mimir for logs, metrics, and distributed tracing at scale
- AWS observability: Extensive experience with AWS CloudWatch Logs and Metrics, including custom metrics, CloudWatch Logs Insights, dashboard creation, and integration patterns
- SQL analytics for logs: Production experience with SQL-based log analytics using AWS Athena, DuckDB, or similar query engines for analyzing structured and semi-structured data at scale
- Cloud-native and open-source balance: Demonstrated ability to architect solutions leveraging both managed cloud services and open-source tooling, understanding trade-offs between operational overhead, cost, flexibility, and vendor lock-in
- Search and indexing experience: Hands-on experience building or operating search systems using OpenSearch, Elasticsearch, Lucene, Tantivy, or similar search and analytics engines
- Performance-critical systems: Experience building high-performance systems that process large volumes of data efficiently (millions of log lines, high-cardinality metrics)
- Systems thinking: Deep understanding of distributed systems, microservices architectures, and the complex observability challenges they present
- Data at scale: Proven track record handling high-volume structured and unstructured logging data, identifying patterns, and building efficient search/query solutions that perform well under load
- Product mindset: Ability to build internal platform products that engineers love to use, with attention to UX, performance, and reliability
- Rust development experience: Production experience with Rust for building high-performance data processing, indexing, or search systems. Strong interest in learning Rust is acceptable if combined with systems programming experience in C/C++/Go
- Infrastructure as code: Experience with Terraform for managing observability infrastructure and AWS resources
- Additional observability platforms: Experience architecting or operating Datadog, New Relic, Splunk, or other enterprise observability platforms
- Advanced query languages: Deep expertise with PromQL, LogQL, and SQL, including query optimization for high-cardinality data
- Columnar storage formats: Experience with Parquet, ORC, or other columnar storage formats for efficient log storage and analytics on S3
- Incident management: Experience designing incident response workflows, postmortem processes, and SLO/SLI frameworks that drive reliability improvements
- Cost optimization: Track record of reducing observability costs while maintaining or improving capabilities (e.g., CloudWatch → S3/custom indexing migration)
- Data pipelines: Experience with streaming data pipelines, ETL processes, or real-time data processing
- Distributed tracing: Deep knowledge of OpenTelemetry, Jaeger, Zipkin, or distributed tracing architectures
- Git expertise and experience working in a monorepo
- Previous Pharmacy Benefits Manager (PBM) or healthcare technology experience
- Experience building developer tools or internal platforms that improve engineering productivity