Diagnose, reproduce, and analyze highly technical product issues reported from the field across both the federal enclave and AWS production environments.
Lead root cause investigations, document findings, and partner with engineering teams to implement corrective actions.
Monitor and analyze production performance indicators, logs, telemetry, and alerts from multiple environments.
Provide technical summaries and incident reports to engineering leadership and escalation managers.
Work closely with development teams to understand new features, architectural changes, and potential failure points.
Partner with QA to validate fixes, verify edge-case scenarios, and enhance test coverage around recurring incidents.
Provide feedback based on real-world usage and cross-environment incident patterns to reduce regressions.
Build and maintain debugging tools, scripts, dashboards, and test harnesses that improve troubleshooting in both secure and cloud environments.
Identify recurring issues and propose enhancements to observability, stability, and supportability.
Improve internal processes for incident handling, cross-team communication, and reporting.
Serve as the technical escalation point for Support teams, delivering deep-dive explanations and guidance.
Translate complex technical findings into clear, actionable insights for both technical and non-technical stakeholders.
Participate in on-call rotations as needed.
Requirements
3+ years of experience in technical support, DevOps, SRE, QA, or an R&D-adjacent engineering role.
Strong troubleshooting skills across distributed systems, APIs, microservices, or cloud environments.
Hands-on experience with logs, debugging tools, and monitoring platforms (e.g., Kibana, Grafana, Datadog, Splunk).
Solid scripting/coding ability (Python, Bash, PowerShell, or similar).
Excellent communication skills; able to articulate complex problems clearly.
Ability to work across secure and cloud environments, including adapting workflows for federal enclave constraints.
Nice to have
Experience working in or supporting federal enclave / restricted-access environments.
Experience supporting a FedRAMP-certified product in production.
Experience with CI/CD pipelines and build/test automation.
Familiarity with Docker, Kubernetes, or cloud platforms (Azure, AWS, GCP).
Background in incident management, postmortems, or SRE best practices.
Understanding of networking, databases, and API debugging tools (Postman, Fiddler, etc.).