Beacon Hill is seeking a Site Reliability Engineer with a strong ownership mentality to drive automation and improvement. The role is responsible for ensuring the stability, reliability, and performance of critical applications, with a focus on Application Performance Monitoring (APM) and the continuous improvement of production systems.
Responsibilities:
- Own the reliability, availability, and performance of key applications and services in production environments
- Design, implement, and maintain SRE practices including SLIs/SLOs, error budgets, and incident postmortems
- Participate in and help improve incident management processes (triage, root cause analysis, corrective actions, follow‑through)
- Automate operational tasks (deployments, scaling, recovery, maintenance) using scripts, pipelines, and infrastructure‑as‑code where applicable
- Lead the design and implementation of end‑to‑end observability for applications, including metrics, logs, traces, and synthetic monitoring
- Configure, maintain, and extend AppDynamics dashboards, health rules, baselines, and alerts to proactively identify issues
- Integrate OpenTelemetry and other observability tooling to standardize telemetry across services
- Analyze performance data to find systemic issues (slow transactions, resource bottlenecks, noisy alerts) and drive remediation
- Collaborate with application development teams to design performant, resilient architectures and patterns
- Contribute to codebases (services, tools, pipelines) with a focus on reliability, observability, and operational excellence
- Champion continuous improvement by identifying patterns in incidents and performance issues and driving engineering changes to prevent recurrence
- Establish and socialize best practices for performance testing, load testing, and capacity planning
- Serve as a technical leader and subject matter expert for APM, SRE, and observability within the team and broader organization
- Mentor other engineers in SRE principles, performance debugging, and monitoring best practices
- Partner with product, architecture, infrastructure, and support teams to align reliability efforts with business priorities
Requirements:
- 5+ years of professional experience in software engineering, site reliability engineering, or a closely related discipline
- Strong hands‑on experience with AppDynamics in production environments (dashboards, health rules, transaction detection, alerting, baselining, war‑room usage)
- Practical experience with SRE practices: SLIs/SLOs, error budgets, incident response, post‑incident reviews, and runbooks
- Experience with observability tooling and standards, including OpenTelemetry (tracing, metrics, logging) and integration into APM/monitoring platforms
- Solid programming skills in one or more languages commonly used in backend or distributed systems (e.g., Java, .NET, Python, or Go)
- Experience with CI/CD pipelines and modern deployment practices (e.g., Git-based workflows, automated testing and deployment)
- Strong understanding of distributed systems, microservices, and cloud architectures (latency, resiliency, back‑pressure, timeouts, circuit breakers)
- Demonstrated ability to troubleshoot complex production issues across application, infrastructure, and network layers
- Experience with additional APM/monitoring stacks (e.g., Dynatrace, New Relic, Datadog, Prometheus, Grafana, Splunk, ELK)
- Hands‑on experience with Kubernetes, containers, and service meshes
- Experience designing or operating systems in public cloud environments (Azure, AWS, or GCP)
- Experience mentoring or leading other engineers in an SRE/DevOps context