Conduct exploratory phases (EDA, data quality assessment, completeness, analytical feasibility) on structured and unstructured datasets;
Define modeling approaches that balance fine-tuning Transformer models (BERTimbau and similar) against using LLMs for extraction and structuring, with clear criteria for reproducibility, cost and auditability;
Build embedding pipelines, RAG and semantic search using vector databases (Qdrant, Milvus, ChromaDB);
Calibrate prioritization scores and anomaly detection (Isolation Forest, Autoencoders, HDBSCAN) in collaboration with domain experts;
Version experiments and models to ensure traceability and governance;
Produce high-quality technical and scientific documentation (reports and, when applicable, papers);
Act as the technical point of contact for domain experts to validate criteria, thresholds and metrics.
Requirements
Degree in Data Science, Statistics, Computer Science or a related field;
5+ years working on NLP projects in production, preferably in Portuguese;
Strong proficiency in Python, pandas, scikit-learn and PyTorch (Transformers);
Hands-on experience with Transformer models (BERTimbau, multilingual BERT);