Engaging with development teams on system design, deployment, and capacity planning; identifying and highlighting areas for improvement; enforcing DevOps and SRE practices; and supporting teams as they transition to production
Monitoring and logging the availability, performance, and health of production systems in support of meeting service level objectives
Enhancing and implementing automation and tooling to continuously improve the reliability, scalability, and velocity of services deployed on instances
Actively practicing DevOps culture and SRE practices to build stable product releases across environments
24/7/365 responsibility for assigned production applications, including on-call responsibilities, following defined procedures, owning and managing a runbook for each application, and maintaining system uptime to contractual SLAs
Participating in emergency incident response, on-call rosters, and practicing blameless post-mortems that lead to improvements in resiliency
Learning any appropriate tech stack needed for the organization
Building, maintaining, and improving CI/CD pipelines in support of all assigned applications
Supporting production and non-production environments where appropriate
Executing and maintaining industry best practices for security, compliance, and auditing, including a continuous improvement cycle
Handling planned work and support activities according to the priorities set
Requirements
At least 2 years of cloud-based Platform/DevOps engineering experience
Get-it-done mindset
Hands-on experience scripting in Python, Shell, and Groovy, and working with YAML
Experience with networking concepts, MySQL/DynamoDB, Lambda, backup and recovery strategies, and adhering to security policies
Demonstrated experience with production support in a cloud environment, including outage/incident management
Demonstrated collaboration across departments, teams, and partners in different geographic areas
Expertise in analyzing and troubleshooting large-scale, multi-region applications and their infrastructure in a public cloud (primarily AWS)
Experience with cloud deployment and management tools (e.g. Terraform, CDK, CodeDeploy)
Experience with containerization using Docker, Kubernetes, or OpenShift
Demonstrated experience with monitoring, logging, and alerting tools (e.g. CloudWatch, New Relic, Grafana, ELK, ChaosSearch)
Hands-on experience building CI/CD pipelines using Jenkins, Azure, or AWS (CodePipeline)
Expert-level troubleshooting skills using application and infrastructure logs
Ability to identify and enhance the metrics needed for product stability and reliability
Experience continuously identifying and implementing automation opportunities
Experience working with cross-functional teams to see activities through to completion
Self-driven, with the ability to lead objectives to completion
Excellent written and oral communication skills across different levels of a company