ArcheSys Inc is a technology firm specializing in innovative cloud solutions and services for clients across various industries. The company is seeking a highly motivated Fullstack Engineer to design, develop, and maintain Grafana dashboards and data pipelines, and to improve system visibility and reliability through AWS and DevOps practices.
Responsibilities:
- Design, develop, and maintain comprehensive, intuitive, and real-time Grafana dashboards that visualize key operational metrics, business KPIs, and application logs
- Collaborate with SRE, development, and product teams to gather requirements and translate complex data into clear, actionable visualizations
- Optimize Grafana dashboards for performance, scalability, and usability, ensuring quick loading times and effective data presentation
- Implement alerting rules within Grafana to proactively notify teams of anomalies and potential issues
- Design and implement robust ETL/ELT pipelines to extract, transform, and load data from various sources (e.g., Prometheus, Splunk, CloudWatch, RDS, OpenTelemetry, custom APIs) into data stores consumable by Grafana
- Write and optimize complex queries (SQL, PromQL, Splunk SPL, etc.) to ensure data accuracy and query efficiency
- Develop and maintain APIs to facilitate data exchange and integration between different system components and monitoring tools
- Implement data quality checks, performance tuning (indexing, partitioning), and backup/restore strategies for data sources
- Design, deploy, and manage scalable and resilient AWS infrastructure to support Grafana instances, data sources, and related services
- Utilize AWS services such as EC2, ECS/EKS, Lambda, S3, RDS, CloudWatch, Kinesis, DynamoDB, and others to build and optimize our observability platform
- Implement security best practices within the AWS environment, including IAM roles, security groups, and network configurations
- Design, implement, and maintain robust CI/CD pipelines for automating the build, testing, and deployment of Grafana dashboards, underlying data pipelines, and infrastructure as code
- Utilize tools like AWS CodePipeline, Jenkins, GitLab CI, or similar for continuous integration and continuous deployment
- Develop and maintain Infrastructure as Code (IaC) using Terraform, CloudFormation, or Ansible for managing all AWS resources
- Automate operational tasks, deployment monitoring, and testing processes to improve efficiency and reliability
- Apply SRE principles to ensure the reliability, scalability, and performance of our monitoring and observability infrastructure
- Participate in on-call rotations, responding to alerts and incidents related to dashboard functionality, data accuracy, and performance
- Conduct root cause analysis (RCA) for incidents and implement corrective actions to prevent recurrence
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services, ensuring dashboards reflect these metrics accurately
- Work closely with cross-functional teams (development, operations, product) to understand monitoring needs and provide expert guidance on observability best practices
- Create and maintain comprehensive documentation detailing dashboard designs, data sources, query logic, AWS architecture, and operational procedures
- Contribute to code reviews, promote best practices, and mentor junior team members
Requirements:
- Bachelor's degree in Computer Science, Software Engineering, or a related technical field, or equivalent practical experience
- 4 to 7 years of experience in a Fullstack Development, Data Engineering, or SRE role with a strong focus on monitoring, observability, and AWS infrastructure
- Proven hands-on experience designing, developing, and maintaining complex Grafana dashboards
- Strong proficiency in at least one backend programming language (e.g., Python, Go, Java, Node.js)
- Extensive experience with various data sources for Grafana (e.g., Prometheus, Loki, Splunk, SQL databases, CloudWatch)
- Deep hands-on experience with AWS cloud services, including but not limited to EC2, ECS/EKS, Lambda, S3, RDS, CloudWatch, Kinesis, DynamoDB
- Proven experience designing and implementing robust CI/CD pipelines and DevOps automation using tools like AWS CodePipeline, Jenkins, GitLab CI, or similar
- Strong experience with Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or Ansible
- Solid understanding of SRE principles, including SLOs, SLIs, error budgets, toil reduction, and incident management
- Experience with containerization technologies (Docker, Kubernetes)
- Excellent analytical and problem-solving skills with a keen eye for detail
- Strong communication and interpersonal skills, with the ability to articulate complex technical concepts clearly to diverse audiences
- Ability to work independently and collaboratively in a fast-paced, dynamic environment
Preferred qualifications:
- AWS Certifications (e.g., Solutions Architect, DevOps Engineer)
- Experience with other observability tools (e.g., Datadog, New Relic, OpenTelemetry)
- Knowledge of distributed tracing concepts and tools (e.g., Jaeger, Tempo)
- Experience with machine learning for anomaly detection in time-series data
- Contributions to open-source projects related to Grafana or observability