In this beginner-friendly 40-minute workshop, you’ll learn a simple, repeatable way to evaluate Q&A answers from different LLMs using a tiny dataset and two complementary approaches: basic automatic scores (Exact Match/F1) and an “LLM-as-Judge” rubric for Correctness, Faithfulness, Relevance, and Conciseness. We’ll show how to compare models fairly (same prompt/settings, temperature=0, consistent context), interpret results, and turn findings into actions using a light Analyze → Measure → Open Coding → Axial Coding loop. You’ll leave with a plug-and-play rubric, a mini dataset template, and a beginner notebook that generates a clear side-by-side report—so you can pick the right model with confidence and iterate quickly.
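To give you a feel for what the notebook covers, here is a minimal, illustrative sketch of the Exact Match/F1 scoring and the judge rubric described above. The helper names, the tiny example row, and the exact rubric wording and scale are placeholders of ours, not the workshop materials themselves:

```python
from collections import Counter
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))

def f1(prediction: str, reference: str) -> float:
    """Token-level F1 between prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative mini-dataset row: one question, a reference answer,
# and each model's answer generated with the same prompt and temperature=0.
rows = [
    {"question": "What year was Python released?",
     "reference": "1991",
     "answers": {"model_a": "Python was released in 1991.", "model_b": "1989"}},
]

for row in rows:
    for model, answer in row["answers"].items():
        print(model,
              "EM:", exact_match(answer, row["reference"]),
              "F1:", round(f1(answer, row["reference"]), 2))

# Sketch of an LLM-as-Judge rubric prompt covering the four workshop criteria.
# Fill in the {placeholders} and send it to a judge model at temperature=0;
# the wording and 1-5 scale below are assumptions, not the workshop's rubric.
JUDGE_PROMPT = """Rate the answer from 1 (poor) to 5 (excellent) on each criterion:
- Correctness: does it agree with the reference answer?
- Faithfulness: is every claim supported by the provided context?
- Relevance: does it address the question asked?
- Conciseness: is it free of unnecessary content?

Question: {question}
Context: {context}
Reference: {reference}
Answer: {answer}

Return four integers labelled correctness, faithfulness, relevance, conciseness."""
```

Running the snippet prints a small per-model EM/F1 line for each row, which is the raw material for the side-by-side report the workshop builds up.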
+ AI Residency Q&A
Add the DDS Google Calendar link so you don't miss any events.