MC09: When to Fine-Tune, Tokenization and Preprocessing Data
Hosted by
Mohammad Arshad

Sat, 21 Jun

07:00 AM - 09:00 AM

Online

Ticket Price: USD 19


About

MC09: Tokenization & Data Pre-processing for Fine-Tuning (with FineWeb integration)

Building high-value LLMs starts long before training begins: it starts with clean, well-tokenized data. In this hands-on session you'll learn how to transform raw text, including slices from FineWeb's 15-trillion-token corpus, into robust datasets ready for both Retrieval-Augmented Generation (RAG) and parameter-efficient fine-tuning.

What we'll cover
- Tokenizer selection and vocabulary extension (BPE, WordPiece, SentencePiece)
- Advanced cleaning: de-duplication, PII masking, prompt-response alignment
- FineWeb pipeline: filtering, sharding, streaming-friendly formats
- Decision guide: when to deploy RAG vs. when to fine-tune, considering cost, latency, and compliance

You will leave with
- A production-ready preprocessing repo you can drop into any ML workflow
- A practical checklist for choosing RAG or fine-tuning on future projects
- Confidence to handle domain, multilingual, or regulated data at scale

Who should attend
Data scientists, ML engineers, and technical leads looking to reduce hallucinations, accelerate model updates, and establish a bulletproof data pipeline.

Hosted by Decoding Data Science.
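
To give a flavor of the cleaning steps listed under "What we'll cover", here is a minimal Python sketch of exact de-duplication followed by email masking. It is not taken from the session material: the function names (`normalize`, `mask_emails`, `clean_corpus`), the `[EMAIL]` placeholder, and the simple email pattern are all illustrative assumptions, and a real pipeline would likely add near-duplicate detection and broader PII coverage.

```python
import hashlib
import re

# Deliberately simple email pattern (an assumption for illustration only;
# real PII masking usually covers many more entity types).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return " ".join(text.split()).lower()


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def clean_corpus(docs):
    """Yield each distinct document once, with email addresses masked.

    Duplicates are detected by hashing the normalized text, so only exact
    matches (up to case and whitespace) are dropped.
    """
    seen = set()
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # already emitted an equivalent document
        seen.add(key)
        yield mask_emails(doc)


sample = [
    "Contact me at jane.doe@example.com for details.",
    "contact me at  jane.doe@example.com for details.",  # duplicate after normalization
    "Tokenizers split text into subword units.",
]
cleaned = list(clean_corpus(sample))
```

Because `clean_corpus` is a generator, it can process documents one at a time, which suits streaming-style inputs. The set of seen hashes still grows with the number of distinct documents.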

Location

Online
This event is part of a community
Artificial Intelligence
10,845 Members