MC09 – Tokenization & Data Pre-processing for Fine-Tuning (with FineWeb integration)
Building high-quality LLMs starts long before training begins: it starts with clean, well-tokenized data. In this hands-on session you’ll learn how to transform raw text, including slices of FineWeb’s 15-trillion-token corpus, into robust datasets ready for both Retrieval-Augmented Generation (RAG) and parameter-efficient fine-tuning.
What we’ll cover
Tokenizer selection and vocabulary extension (BPE, WordPiece, SentencePiece); see the first sketch after this list
Advanced cleaning: de-duplication, PII masking, prompt–response alignment; see the second sketch after this list
FineWeb pipeline: filtering, sharding, streaming-friendly formats; see the third sketch after this list
Decision guide: when to deploy RAG versus when to fine-tune for cost, latency, and compliance
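To make the tokenizer step concrete, here is a minimal sketch using the Hugging Face tokenizers and transformers libraries. The corpus file, vocabulary size, special tokens, and the domain tokens <lab_result> and <icd10> are illustrative assumptions, not part of the session materials.

```python
# Sketch 1: train a BPE tokenizer from scratch, then extend a pretrained
# vocabulary. Assumptions: "corpus.txt" exists; added tokens are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Train a fresh BPE vocabulary from a plain-text corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32_000,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("domain_bpe.json")

# Alternatively, extend an existing tokenizer and resize the model's
# embedding matrix so the new rows can be learned during fine-tuning.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok.add_tokens(["<lab_result>", "<icd10>"])  # hypothetical domain tokens
model.resize_token_embeddings(len(tok))
```

Extending a pretrained vocabulary is usually cheaper than retraining a tokenizer, but the new embedding rows start untrained, which is one reason this step pairs naturally with fine-tuning.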
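The cleaning step can be sketched with exact-hash de-duplication and regex-based PII masking. The patterns and sample documents below are illustrative assumptions; production pipelines typically add MinHash/LSH for near-duplicates, NER-based PII detection, and prompt–response alignment checks on top of this.

```python
# Sketch 2: exact de-duplication plus simple PII masking.
import hashlib
import re

# Illustrative patterns only; real pipelines combine these with NER-based
# PII detection to catch names, addresses, and IDs the regexes miss.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace emails and phone-like strings with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def exact_dedupe(docs):
    """Drop exact duplicates after whitespace and case normalization."""
    seen = set()
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

docs = ["Contact me at jane@example.com.",
        "contact me at  jane@example.com."]  # duplicate up to case/spacing
print([mask_pii(d) for d in exact_dedupe(docs)])
# ['Contact me at [EMAIL].']
```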
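Finally, a sketch of streaming a FineWeb sample into JSONL shards with the Hugging Face datasets library. The config name sample-10BT, the length filter, and the shard size are assumptions; check the dataset card for current configs and recommended filters.

```python
# Sketch 3: stream a FineWeb sample, filter, and write JSONL shards.
import json
from datasets import load_dataset

# streaming=True avoids downloading the full corpus up front.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

kept, shard_size, shard_idx = 0, 10_000, 0
out = open(f"fineweb_shard_{shard_idx:05d}.jsonl", "w", encoding="utf-8")
for ex in ds:
    if len(ex["text"]) < 200:   # crude length filter (assumption)
        continue
    out.write(json.dumps({"text": ex["text"]}, ensure_ascii=False) + "\n")
    kept += 1
    if kept % shard_size == 0:  # roll over to the next shard
        out.close()
        shard_idx += 1
        out = open(f"fineweb_shard_{shard_idx:05d}.jsonl", "w", encoding="utf-8")
    if kept >= 30_000:          # stop early for the demo
        break
out.close()
```

JSONL shards of bounded size keep the downstream tokenization step streaming-friendly: workers can process shards in parallel without loading the corpus into memory.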
You will leave with
A production-ready preprocessing repo you can drop into any ML workflow
A practical checklist for choosing RAG or fine-tuning on future projects
Confidence to handle domain, multilingual, or regulated data at scale
Who should attend
Data scientists, ML engineers, and technical leads looking to reduce hallucinations, accelerate model updates, and establish a bulletproof data pipeline.
Hosted by Decoding Data Science.