TEKsystems, a leading provider of business and technology services, is seeking a highly versatile DevOps Engineer to take charge of cloud engineering and reliability. The role involves making sound technical decisions, leading future team-building efforts, and ensuring scalability and cost optimization across diverse systems.
Responsibilities:
- Core focus: DevOps / Cloud Engineering
- Will be the sole DevOps engineer initially, so must be confident making sound technical decisions for the business
- Needs to be strong‑willed: able to push back when the business requests something that isn’t optimal, articulate risks, and propose the right solution
- Expected to both engineer and lead (future team-building potential)
- Communicate technical issues in business terms to non-technical stakeholders
- Challenge developers constructively on reliability, scalability, and release risk
- Partner with product teams to enforce reliability standards and guardrails
Requirements:
- AWS (primary cloud)
- Lambda + monitoring/observability
- Python (building automation, AI-driven reports)
- Linux background
- CI/CD experience
- Terraform (Infrastructure as Code)
- Datadog (nice‑to‑have)
- Strong IT fundamentals and a broad technical skillset spanning infrastructure, cloud, databases, and automation
- Ability to operate in ambiguity
- Ability to communicate effectively with both technical and non-technical stakeholders
- Proactively drives reliability, scalability, and cost optimization across diverse systems
- Site reliability, Azure, automation, cloud
- Azure: Resource groups, networking (VNets, NSGs), AKS, App Services, Functions, Storage, Key Vault, Monitor, Policy, Cost Management
- Systems & Networking: Linux/Windows internals, DNS, TLS, routing, load balancing, caching
- Datastores: SQL Server, PostgreSQL/MySQL, Cosmos DB, Redis—query performance, indexing, backup/restore, HA/DR patterns
- Observability: Metrics, logs, traces; SLOs/Error Budgets; using Azure Monitor/Log Analytics/Grafana/Prometheus
- Automation: Infrastructure-as-Code (Bicep/Terraform), CI/CD (GitHub Actions/Azure DevOps), scripting (Python/PowerShell), runbooks
- Security & Compliance: Secrets, identity (Azure AD), least-privilege, policy enforcement, vulnerability/TLS hygiene
- Ticket triage & backlog hygiene: “Why is this open?”; age/priority/impact; close/noise reduction; define clear exit criteria
- Incident management: Rapid diagnosis; comms to business; post-incident reviews that produce durable fixes (not blame)
- Capacity & performance: Can we scale back safely? Where do we need headroom? Evidence-based decisions
- Change management: Guardrails, pre-flight checks, safe deploys/rollbacks, feature flags
- Expert Level