Develop and maintain web crawlers in Go to extract data from target websites.
Utilize headless browsing tools, such as the Chrome DevTools Protocol, to automate and optimize data collection processes.
Collaborate with cross-functional teams to identify, scrape, and integrate data from APIs and web pages to support business objectives.
Create and implement efficient parsing patterns using tokenizers, regular expressions, XPaths, and CSS selectors to ensure accurate data extraction.
Design and manage distributed job queues using technologies such as Redis, Aerospike, and Kubernetes to handle large-scale distributed crawling and processing tasks.
Develop strategies to monitor and ensure data quality, accuracy, and integrity throughout the crawling and indexing process.
Continuously improve and optimize existing web crawling infrastructure to maximize efficiency and adapt to new challenges.
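The parsing-pattern work described above (regular expressions, tokenizers, selectors) could be sketched in Go with the standard `regexp` package. This is a minimal, illustrative example, not production guidance: the pattern, sample HTML, and function names are assumptions, and a real crawler would normally use a proper HTML tokenizer to cope with malformed markup.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefRe is a deliberately simple pattern for illustration only; real-world
// HTML is often malformed, so a tokenizer-based parser is usually preferred.
var hrefRe = regexp.MustCompile(`<a[^>]+href="([^"]+)"`)

// extractLinks returns every href value found in the raw HTML body.
func extractLinks(body string) []string {
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
		links = append(links, m[1]) // m[1] is the captured href value
	}
	return links
}

func main() {
	sample := `<p><a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a></p>`
	for _, link := range extractLinks(sample) {
		fmt.Println(link)
	}
}
```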
Requirements
Proficiency in Go (Golang), Rust, or Zig for building scalable and efficient web crawlers.
Deep understanding of the TCP, UDP, and TLS protocols, of HTTP/1.1, HTTP/2, and HTTP/3, and of web communication in general.
Knowledge of HTML, CSS, and JavaScript for parsing and navigating web content.
Familiarity with cloud platforms (AWS, GCP), orchestration (Kubernetes, Nomad), and containerization (Docker) for deployment.
Mastery of queues, stacks, hash maps, and other data structures for efficient data handling.
Ability to design and optimize algorithms for large-scale web crawling.
Hands-on experience with networking and web scraping libraries.
Understanding of how search engines work and best practices for web crawling optimization.
Experience with SQL and/or NoSQL databases (Aerospike experience is a bonus) for storing and managing crawled data.
Familiarity with data warehousing and scalable storage solutions.
Knowledge of distributed systems (e.g., Hadoop, Spark) for processing large datasets.
Experience with web archiving projects and tooling; open-source archiving experience is a big plus!
Experience applying Machine Learning to improve crawling efficiency or accuracy.
Experience with low-level network programming and/or userspace TCP/IP stacks.
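The concurrency and data-structure requirements above come together in a typical crawl worker pool. The sketch below uses Go channels as an in-memory stand-in for the distributed queue (in production, workers would pull URLs from something like a Redis list instead); `fetch` is a stub standing in for a real HTTP request, and all names here are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fetch is a stub for a real HTTP fetch; it keeps the example deterministic.
func fetch(url string) string {
	return "fetched " + url
}

// crawl fans the URL list out to a fixed pool of workers and collects results.
func crawl(urls []string, workers int) []string {
	jobs := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	// Start the worker pool; each worker drains the jobs channel.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed the queue, then signal that no more work is coming.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // worker completion order is nondeterministic
	return out
}

func main() {
	urls := []string{"https://example.com/1", "https://example.com/2", "https://example.com/3"}
	for _, r := range crawl(urls, 2) {
		fmt.Println(r)
	}
}
```

Swapping the `jobs` channel for a shared queue service is what makes this pattern horizontally scalable across machines, which is the motivation behind the Redis/Kubernetes requirement above.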