Beacon Hill is seeking a Site Reliability Engineer with a strong ownership mentality to drive automation and improvement. The role is responsible for ensuring the stability, reliability, and performance of critical applications, with a focus on Application Performance Monitoring (APM) and continuous improvement of production systems.

Responsibilities:

Own the reliability, availability, and performance of key applications and services in production environments
Design, implement, and maintain SRE practices including SLIs/SLOs, error budgets, and incident postmortems
Participate in and help improve incident management processes (triage, root cause analysis, corrective actions, follow‑through)
Automate operational tasks (deployments, scaling, recovery, maintenance) using scripts, pipelines, and infrastructure‑as‑code where applicable
Lead the design and implementation of end‑to‑end observability for applications, including metrics, logs, traces, and synthetic monitoring
Configure, maintain, and extend AppDynamics dashboards, health rules, baselines, and alerts to proactively identify issues
Integrate OpenTelemetry and other observability tooling to standardize telemetry across services
Analyze performance data to find systemic issues (slow transactions, resource bottlenecks, noisy alerts) and drive remediation
Collaborate with application development teams to design performant, resilient architectures and patterns
Contribute to codebases (services, tools, pipelines) with a focus on reliability, observability, and operational excellence
Champion continuous improvement by identifying patterns in incidents and performance issues and driving engineering changes to prevent recurrence
Establish and socialize best practices for performance testing, load testing, and capacity planning
Serve as a technical leader and subject matter expert for APM, SRE, and observability within the team and broader organization
Mentor other engineers in SRE principles, performance debugging, and monitoring best practices
Partner with product, architecture, infrastructure, and support teams to align reliability efforts with business priorities

Requirements:

5+ years of professional experience in software engineering, site reliability engineering, or a closely related discipline
Strong hands‑on experience with AppDynamics in production environments (dashboards, health rules, transaction detection, alerting, baselining, war‑room usage)
Practical experience with SRE practices: SLIs/SLOs, error budgets, incident response, post‑incident reviews, and runbooks
Experience with observability tooling and standards, including OpenTelemetry (tracing, metrics, logging) and integration into APM/monitoring platforms
Solid programming skills in one or more languages commonly used in backend or distributed systems (e.g., Java, .NET, Python, Go)
Experience with CI/CD pipelines and modern deployment practices (e.g., Git-based workflows, automated testing and deployment)
Strong understanding of distributed systems, microservices, and cloud architectures (latency, resiliency, back‑pressure, timeouts, circuit breakers)
Demonstrated ability to troubleshoot complex production issues across application, infrastructure, and network layers
Experience with additional APM / monitoring stacks (e.g., Dynatrace, New Relic, Datadog, Prometheus, Grafana, Splunk, ELK)
Hands‑on experience with Kubernetes, containers, and service meshes
Experience designing or operating systems in public cloud environments (Azure, AWS, or GCP)
Experience mentoring or leading other engineers in an SRE/DevOps context
