Architect, build, and maintain internal infrastructure tools (e.g., CI/CD pipelines, observability platforms, self-service portals) with a focus on usability and developer productivity.
Develop full-stack solutions (frontend to backend) to automate service deployment, monitoring, and lifecycle management.
Write clean, maintainable code (Node.js/Python) and infrastructure-as-code (Terraform) with an emphasis on simplicity and reusability.
Design and implement frameworks for service health monitoring, logging, tracing, and incident management.
Build tools to enforce compliance with security, cost optimization (FinOps), and reliability best practices.
Collaborate with SRE and DevOps teams to refine service SLAs and error budgets.
Define and evangelize coding standards, API design patterns, and tooling conventions across engineering teams.
Explore AI/ML applications to enhance tooling capabilities (e.g., anomaly detection, automated root cause analysis, LLM-powered ChatOps).
Requirements
3+ years of full-stack development experience with a focus on infrastructure or platform tooling.
Proficiency in modern languages (e.g., Python, Java, Node.js, Go) and frameworks (e.g., NestJS, Spring).
Strong understanding of distributed systems, microservices, and cloud-native technologies (Kubernetes, Docker, AWS/GCP/Azure).
Demonstrated ability to design tools that balance flexibility with governance (e.g., policy-as-code, guardrails).
Aesthetic sensibility for clean abstractions, intuitive APIs, and frictionless user experiences.
Bachelor’s degree in Computer Science or equivalent practical experience.
Hands-on experience using LLM-based tools (e.g., GitHub Copilot, ChatGPT) to improve development efficiency and code quality.
Experience building or integrating AI-powered systems (e.g., LLM-based automation, ChatOps, anomaly detection) is strongly preferred.
Hands-on experience with at least one major cloud provider (GCP preferred), including infrastructure design and cost optimization.