MC09 – Tokenization & Data Pre-processing for Fine-Tuning (with FineWeb integration)
Building high-quality LLMs starts long before training begins: it starts with clean, well-tokenized data. In this hands-on session you’ll learn how to transform raw text, including slices of FineWeb’s 15-trillion-token corpus, into robust datasets ready for both Retrieval-Augmented Generation (RAG) and parameter-efficient fine-tuning.
What we’ll cover
Tokenizer selection and vocabulary extension (BPE, WordPiece, SentencePiece); see the first sketch after this list
Advanced cleaning: de-duplication, PII masking, prompt–response alignment; see the second sketch after this list
FineWeb pipeline: filtering, sharding, streaming-friendly formats; see the third sketch after this list
Decision guide: when to deploy RAG versus when to fine-tune for cost, latency, and compliance
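To make the tokenizer step concrete, here is a minimal sketch using the Hugging Face tokenizers and transformers libraries. The corpus file, vocabulary size, special tokens, and the domain tokens <lab_result> and <icd10> are illustrative assumptions, not part of the session materials.

```python
# Sketch 1: train a BPE tokenizer from scratch, then extend a pretrained
# vocabulary. Assumptions: "corpus.txt" exists; added tokens are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Train a fresh BPE vocabulary from a plain-text corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32_000,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("domain_bpe.json")

# Alternatively, extend an existing tokenizer and resize the model's
# embedding matrix so the new rows can be learned during fine-tuning.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok.add_tokens(["<lab_result>", "<icd10>"])  # hypothetical domain tokens
model.resize_token_embeddings(len(tok))
```

Extending a pretrained vocabulary is usually cheaper than retraining a tokenizer, but the new embedding rows start untrained, which is one reason this step pairs naturally with fine-tuning.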
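The cleaning step can be sketched with exact-hash de-duplication and regex-based PII masking. The patterns and sample documents below are illustrative assumptions; production pipelines typically add MinHash/LSH for near-duplicates, NER-based PII detection, and prompt–response alignment checks on top of this.

```python
# Sketch 2: exact de-duplication plus simple PII masking.
import hashlib
import re

# Illustrative patterns only; real pipelines combine these with NER-based
# PII detection to catch names, addresses, and IDs the regexes miss.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace emails and phone-like strings with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def exact_dedupe(docs):
    """Drop exact duplicates after whitespace and case normalization."""
    seen = set()
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

docs = ["Contact me at jane@example.com.",
        "contact me at  jane@example.com."]  # duplicate up to case/spacing
print([mask_pii(d) for d in exact_dedupe(docs)])
# ['Contact me at [EMAIL].']
```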
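Finally, a sketch of streaming a FineWeb sample into JSONL shards with the Hugging Face datasets library. The config name sample-10BT, the length filter, and the shard size are assumptions; check the dataset card for current configs and recommended filters.

```python
# Sketch 3: stream a FineWeb sample, filter, and write JSONL shards.
import json
from datasets import load_dataset

# streaming=True avoids downloading the full corpus up front.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

kept, shard_size, shard_idx = 0, 10_000, 0
out = open(f"fineweb_shard_{shard_idx:05d}.jsonl", "w", encoding="utf-8")
for ex in ds:
    if len(ex["text"]) < 200:   # crude length filter (assumption)
        continue
    out.write(json.dumps({"text": ex["text"]}, ensure_ascii=False) + "\n")
    kept += 1
    if kept % shard_size == 0:  # roll over to the next shard
        out.close()
        shard_idx += 1
        out = open(f"fineweb_shard_{shard_idx:05d}.jsonl", "w", encoding="utf-8")
    if kept >= 30_000:          # stop early for the demo
        break
out.close()
```

JSONL shards of bounded size keep the downstream tokenization step streaming-friendly: workers can process shards in parallel without loading the corpus into memory.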
You will leave with
A production-ready preprocessing repo you can drop into any ML workflow
A practical checklist for choosing RAG or fine-tuning on future projects
Confidence to handle domain, multilingual, or regulated data at scale
Who should attend
Data scientists, ML engineers, and technical leads looking to reduce hallucinations, accelerate model updates, and establish a bulletproof data pipeline.
Hosted by Decoding Data Science.