Jensen Hughes is a leader in fire protection engineering and risk-based fields, dedicated to making the world safe and secure. They are seeking a Platform Engineer to build and operate their cloud platform on AWS, focusing on infrastructure CI/CD and observability while enabling AI capabilities.
Responsibilities:
- AWS platform engineering (multi-account)
- Design, build, and operate secure, reliable foundations across a multi-account AWS environment (AWS Organizations / Control Tower where applicable), including networking, IAM, KMS, secrets, tagging, and shared services
- Establish scalable patterns for compute, storage, and networking; enable repeatable environments across dev/stage/prod
- Improve developer experience through standards, templates, and clear platform documentation
- Own Terraform architecture end-to-end: module strategy, state design, environment separation, provider/version management
- Build and maintain a production-grade Terraform SDLC:
  - PR-driven workflows with plan previews, approvals, and promotion across environments
  - Controlled apply mechanisms with audit trails and rollback plans
  - Drift detection and safe reconciliation strategy
  - Import/migration/refactor patterns without downtime
- Implement baseline guardrails (tagging, encryption, access controls) as code wherever feasible
- Implement PR-driven infrastructure delivery using GitOps principles (not Kubernetes-only):
  - Git as the source of truth; PRs as change requests
  - Automated validation/testing/security checks on every change
  - Safe promotion model (dev → stage → prod) with appropriate gates
  - Controlled applies for production (approval gates / break-glass procedures), with full traceability
- Standardize pipelines in the team’s primary CI/CD platform (GitHub Actions) and integrate with existing enterprise tooling where needed
- Establish repo structure, branching strategy, and operational runbooks for the infrastructure delivery workflow
- Own the Splunk observability operating model: dashboards, alerting standards, SLOs/SLIs, runbooks, and on-call readiness
- Build/operate telemetry pipelines for reliability and cost efficiency (noise reduction, sampling/cardinality strategies, retention and routing)
- Partner with application teams to improve visibility, reduce MTTR, and drive incident learnings into platform improvements
- Partner with engineering teams to enable agentic AI use cases using Amazon Bedrock and AgentCore (tool integration patterns, secure operation, production readiness)
- Help establish foundational patterns for agent deployment and operations (environments, permissions, observability, and evaluation/reliability practices) aligned to enterprise controls
- Participate in incident response; lead postmortems and drive systemic, preventive fixes
- Measure and improve platform reliability, security posture, and cost efficiency over time
Requirements:
- 8–10 years of experience in Platform Engineering / SRE / DevOps (or equivalent experience delivering platform outcomes)
- AWS expertise, including multi-account patterns (AWS Organizations / Control Tower preferred), networking, IAM/security, and operations
- Terraform expert with proven ownership of org-scale infrastructure-as-code (modules, state, CI controls, large refactors)
- Proven experience designing Infrastructure CI/CD and PR-driven infrastructure delivery (GitOps principles) for Terraform and cloud configuration:
  - PR-based automation with plan previews and security/policy checks
  - Controlled apply processes with approvals and auditability
  - Environment promotion patterns and rollback strategies
- Strong production experience with observability platforms such as Splunk, Datadog, Grafana, or Dynatrace, including building and operating dashboards, alerting standards, and telemetry pipelines (logs/metrics/traces)
- Strong Linux and troubleshooting skills; proficiency in automation (Python or Go preferred)
- Experience building agentic AI solutions using Amazon Bedrock Agents and/or Amazon Bedrock AgentCore (deployment/operations, tool integration patterns)
- OpenTelemetry at scale (standards, collectors/gateways, sampling, correlation across logs/metrics/traces)
- Policy-as-code experience (Conftest/Sentinel or similar) applied to Terraform and platform guardrails
- Experience building an Internal Developer Platform (IDP) / self-service workflows (golden paths, templates, paved roads)
- Databricks on AWS platform support (workspace/cluster policies, reliability, cost controls; Unity Catalog familiarity a plus)