Apryse is an industry-leading provider of document software development technology, committed to delivering cutting-edge solutions. They are seeking a Full Stack Data Discovery Engineer to design and implement systems that analyze technology usage across various ecosystems, focusing on building data pipelines and dashboards to transform raw data into actionable insights.
Responsibilities:
- Own the full stack: Design, build and optimize scalable data pipelines to discover OSINT and software usage across a wide public ecosystem
- Pipeline development: Develop APIs, microservices, crawlers, and document fingerprinting to gather data securely and efficiently. Implement backoff/caching and data normalization, and persist results to SQL/NoSQL indexes
- Data Discovery: Conduct systematic searches across the web, public databases, developer ecosystems and other platforms to identify potential external data repositories relevant to organizational objectives
- Metadata and Attribution Analysis: Programmatically uncover and analyze metadata associated with identified data sources to understand data structure, content, quality, and potential use cases
- Signals & scoring: Develop heuristics/ML‑lite ranking to identify relevant artifacts, deduplicate, and assign confidence scores
- Data Governance: Ensure data quality, security, compliance and governance
- Productize discovery: Build internal tools that let non‑engineers run searches, review candidates, and export leads quickly and safely
- Documentation and Reporting: Document data structures, origins (data lineage), and quality issues. Create clear, concise reports and presentations to communicate findings and recommendations to technical and non-technical stakeholders
- Collaboration: Work closely with data stewards, data architects, and internal business units to define data requirements and facilitate the integration of new data sources
- Innovation and Scale: Continuously explore new data sources, improve attribution logic and propose ML-based enhancements to finding and classifying data
Requirements:
- Bachelor's degree in Computer Science, Engineering, Library Science, Information Systems, Data Management, or a related field
- 1-5 years of proven experience as a full-stack developer and data engineer
- Back-end: Python, SQL, Java and Node.js
- Front-end: Modern JS/TS + React, component libraries, auth patterns, state management
- Data & search: schema design, dedup/near‑dup logic, Elasticsearch/OpenSearch; building usable search/triage UIs
- Acquisition: Scrapy/Playwright/Puppeteer; API design with rate‑limit/backoff; ethical crawling
- Experience with cloud-native architecture and containerization
- Familiarity with metadata standards (e.g., Dublin Core, XML) and data management tools
- Exceptional attention to detail and strong analytical thinking skills
- Excellent written and verbal communication skills, with the ability to translate technical findings into business insights
- Strong problem-solving aptitude and the ability to work independently and collaboratively in a fast-paced environment
Preferred:
- Master's degree
- Knowledge of data visualization tools (e.g. Power BI, Tableau) to present findings
- Experience building internal platforms/tools used by end users or GTM teams