Ad Hoc is a technology company that empowers organizations to deliver scalable, impactful digital services. They are seeking a Staff Software Engineer - Full Stack/SRE to lead and monitor project delivery, improve software engineering processes, and contribute to the reliability and performance of the va.gov platform. This role involves troubleshooting production issues, collaborating with engineering teams, and enhancing observability and incident response practices.
Responsibilities:
- Plans and executes on roadmaps for new projects without explicit guidance and direction from technical supervisors
- Actively participates in conversations and planning sessions with partners and key stakeholders
- Periodically travels to work with and present to clients, partners, and stakeholders
- Elaborates on and evolves complex and ambiguous products to uncover constraints and new opportunities
- Reduces ambiguity in the systems they work with, including adding documentation, refactoring, and automated testing
- Effectively communicates about existing systems, design decisions, past performance, and the history of major projects they have been part of, in support of bid-writing, tech demos, and other potentially client-facing communications
- Participates in technical depth interviews with new candidates
- Presents on technical topics effectively, articulating implementation complexity and other costs to inform business decisions
- Troubleshoot and Resolve Production Issues: Diagnose and fix performance bottlenecks, errors, and other issues within the va.gov application (primarily a Ruby on Rails monolith, including Sidekiq background jobs, but familiarity with similar frameworks is valuable)
- Observability & Monitoring: Use Datadog (and potentially Dynatrace) to monitor application performance, identify anomalies, and proactively address potential problems. Develop and maintain relevant dashboards and alerts
- Incident Response and On-Call Rotation ("The Watch"): Participate in our on-call rotation approximately once per month. Unlike traditional pager-driven on-call, "The Watch" involves reviewing the previous day's alerts and ensuring no silent failures occurred (such as background jobs exhausting their retries without an alternate submission path). During your on-call week, expect to work 2-4 hours on each weekend day to maintain system reliability
- Code Contributions: Write and review code to improve observability, fix bugs, and implement improvements (Ruby on Rails), and maintain internal tools (JavaScript/SvelteKit and Python)
- Consulting & Collaboration: Work closely with other engineering teams to provide guidance on best practices for observability, reliability, and performance. Communicate technical issues clearly to both technical and non-technical audiences
- Process Improvement: Identify and implement improvements to our monitoring, alerting, and incident response processes. Contribute to documentation and runbooks
- Maintain Internal Tools: Contribute to the development and maintenance of a small SvelteKit application used for tracking team metrics and success
Requirements:
- Bachelor's degree and 9+ years of experience in engineering or as a Site Reliability Engineer
- 3+ years of experience with backend web application development in a production environment. Strong preference for Ruby on Rails experience, but candidates with demonstrable experience in other dynamic languages (e.g., Python/Django/Flask, Node.js/Express, PHP/Laravel) or compiled languages with web frameworks (e.g., Java/Spring, C#/.NET) will be considered
- Experience with Sidekiq or another background job processing framework. If not Sidekiq, experience must be with a comparable system in the candidate's chosen language/framework (e.g., Celery for Python)
- Proven experience with application performance monitoring (APM) tools, specifically Datadog and/or Dynatrace. Ability to interpret metrics and identify root causes of performance issues
- Demonstrated experience in incident response and troubleshooting complex production issues
- Experience with at least one modern JavaScript framework (React, Angular, Vue, Svelte, etc.)
- Excellent communication, collaboration, and consulting skills
- Ability to work effectively in a fast-paced, dynamic environment
- Experience working within an Agile environment
- Experience with vets-api
- Prior experience working within the VA/OCTO environment or any large government software deployment that integrates with multiple legacy services
- Experience with Python for scripting, API interactions, and ETL/data engineering tasks
- General understanding of DevOps concepts (containerization, virtualization, networking)
- Familiarity with GitHub Actions
- Experience with the U.S. Web Design System (USWDS)