A hands-on workshop on operationalizing GenAI apps with LangSmith: instrument tracing, run quick evals (faithfulness & answer quality), compare prompts and agents, and add basic monitoring before shipping. Ideal for builders working with LangChain/LangGraph who want production confidence without heavy MLOps overhead.
Workshop: GenAI Ops with LangSmith (40 mins)
Who this is for
Data scientists, ML/AI engineers, and builders shipping LLM/RAG or agent workflows.
Comfortable with Python; LangChain/LangGraph basics are nice to have, not mandatory.
What you’ll learn (outcomes)
Wire tracing into your LangChain/LangGraph app and read runs like a pro.
Build a golden dataset and run evaluations (faithfulness, relevance, and LLM-as-judge).
Compare prompts/agents and track experiment results to pick the best variant.
Add lightweight monitoring (latency, token usage, error rates) and feedback loops.
Ship a simple RAG pipeline with guardrails you can trust.
Tools & setup (minimal)
Python 3.10+, LangChain/LangGraph, LangSmith account & API key.
Any LLM provider key (OpenAI/Anthropic/etc.).
A small set of PDFs/markdown files for the RAG demo (we’ll provide sample data).
Agenda (40 minutes, fast-paced)
Why GenAI Ops (3 min)
Risks in LLM apps: regressions, hallucinations, brittle prompts. How LangSmith closes the loop.
Instrumentation & Tracing (7 min)
Add the LangSmith tracing callback to a LangChain/LangGraph app (minimal setup sketch below).
Live view: spans, inputs/outputs, token & latency metrics, error drill-down.
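For reference, a minimal tracing setup, assuming the langsmith and langchain-openai Python packages; the project name and model are illustrative:

    import os

    os.environ["LANGSMITH_TRACING"] = "true"        # LANGCHAIN_TRACING_V2 on older SDK versions
    os.environ["LANGSMITH_API_KEY"] = "..."         # your LangSmith API key
    os.environ["LANGSMITH_PROJECT"] = "docs-qa-workshop"

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    # Any LangChain runnable invoked while tracing is enabled shows up as a run
    # in LangSmith, with inputs/outputs, latency, and token usage per span.
    chain = ChatPromptTemplate.from_template("Answer briefly: {question}") | ChatOpenAI(model="gpt-4o-mini")
    chain.invoke({"question": "What does LangSmith record for each run?"})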
RAG Mini-Build (8 min)
Quick ingestion → retrieve → generate.
Tag runs with metadata (dataset, prompt version, retriever params) — see the sketch below.
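A sketch of that flow, assuming langchain-openai and an in-memory vector store; the file path, chunk sizes, and the prompt_v1 / retriever_k labels are placeholders:

    from langchain_community.document_loaders import TextLoader
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.vectorstores import InMemoryVectorStore
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Ingest: load the sample docs, chunk them, and index the chunks.
    docs = TextLoader("sample_docs/langsmith_notes.md").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100).split_documents(docs)
    retriever = InMemoryVectorStore.from_documents(chunks, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 4})

    def format_docs(retrieved):
        return "\n\n".join(d.page_content for d in retrieved)

    # Retrieve → generate.
    prompt = ChatPromptTemplate.from_template(
        "Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model="gpt-4o-mini")
        | StrOutputParser()
    )

    # Tags and metadata make it easy to filter and compare runs later.
    rag_chain.invoke(
        "How do I enable tracing?",
        config={"tags": ["rag", "prompt_v1"], "metadata": {"retriever_k": 4, "dataset": "docs-qa"}},
    )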
Evaluations & Golden Sets (12 min)
Create a dataset (questions + references).
Run evals: faithfulness (groundedness), answer quality, optional toxicity.
Compare variants (prompt v1 vs v2, retriever k=4 vs k=8).
Pick a winner using experiment tables & charts (dataset + eval sketch below).
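A sketch of the dataset and eval step, assuming the langsmith SDK and the rag_chain from the RAG sketch above; the dataset name, the two seed examples, and the toy keyword-overlap evaluator are illustrative (in the workshop we also use LLM-as-judge evaluators for faithfulness):

    from langsmith import Client
    from langsmith.evaluation import evaluate

    client = Client()
    dataset = client.create_dataset("docs-qa-golden")
    client.create_examples(
        inputs=[{"question": "How do I enable tracing?"},
                {"question": "What does a run contain?"}],
        outputs=[{"answer": "Set LANGSMITH_TRACING=true and provide an API key."},
                 {"answer": "Inputs, outputs, latency, token usage, and child spans."}],
        dataset_id=dataset.id,
    )

    def keyword_overlap(run, example) -> dict:
        # Toy reference-based check: how many reference words appear in the prediction.
        prediction = (run.outputs or {}).get("output", "")
        reference = example.outputs["answer"]
        overlap = len(set(prediction.lower().split()) & set(reference.lower().split()))
        return {"key": "keyword_overlap", "score": min(overlap / 5, 1.0)}

    def target_v1(inputs: dict) -> dict:
        return {"output": rag_chain.invoke(inputs["question"])}

    evaluate(target_v1, data="docs-qa-golden", evaluators=[keyword_overlap],
             experiment_prefix="prompt-v1")
    # Run again with a second target (prompt v2, or retriever k=8) and compare the
    # two experiments side by side in the LangSmith experiments view.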
Monitoring & Guardrails (5 min)
Set alerts on error rate/latency.
Capture user feedback and feed it back into your golden datasets (feedback sketch below).
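For the feedback part, a sketch assuming the langsmith SDK and the rag_chain from the RAG sketch; the "user_score" key is an arbitrary name:

    from langchain_core.tracers.context import collect_runs
    from langsmith import Client

    client = Client()

    # Capture the run id of the traced call so the user's rating can be attached to it.
    with collect_runs() as cb:
        rag_chain.invoke("How do I enable tracing?")
        run_id = cb.traced_runs[0].id

    client.create_feedback(
        run_id=run_id,
        key="user_score",                  # arbitrary feedback key
        score=1,                           # e.g. 1 = thumbs up, 0 = thumbs down
        comment="Answer cited the right doc.",
    )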
Wrap-up & Next Steps (5 min)
CI idea: run evals on every PR (see the gate sketch below).
How to productionize with LangServe, and what to watch in week 1.
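One way to wire the CI idea, as a sketch: a small script that replays the golden dataset, scores answers, and fails the check below a threshold. The dataset name, scoring rule, and 0.8 threshold are illustrative, and rag_chain comes from the RAG sketch above.

    import sys
    from langsmith import Client

    client = Client()
    examples = list(client.list_examples(dataset_name="docs-qa-golden"))

    scores = []
    for ex in examples:
        prediction = rag_chain.invoke(ex.inputs["question"])
        reference = ex.outputs["answer"]
        overlap = len(set(prediction.lower().split()) & set(reference.lower().split()))
        scores.append(min(overlap / 5, 1.0))

    average = sum(scores) / len(scores)
    print(f"average score: {average:.2f}")
    if average < 0.8:      # illustrative quality gate
        sys.exit(1)        # fails the PR check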
Hands-on demo you’ll follow
Project: “Docs Q&A” RAG.
You’ll do:
Add LangSmith tracing to the chain/graph.
Build a golden dataset of 20–30 examples.
Run eval experiments and read scorecards.
Swap prompts/retriever params; choose the best config with evidence.
Enable basic dashboards & alerts.
Takeaways
A working template (app + eval script).
A repeatable method to test, compare, and monitor GenAI systems.
Checklists for pre-ship sanity: datasets, evals, tracing, thresholds, alerts.
Optional extensions (if time permits / resources provided)
Regression suite for agent tools (tool-call correctness; evaluator sketch after this list).
Prompt/version governance with tags & experiment lineage.
Cost profiling and latency budgets per route.
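For the tool-call regression idea, a sketch of a custom evaluator, assuming each golden example stores the tool the agent should call and the eval target returns the tools it actually called:

    def tool_call_correctness(run, example) -> dict:
        # Compare the expected tool name against the tools the agent actually invoked.
        expected = example.outputs.get("expected_tool")            # e.g. "search_docs"
        called = (run.outputs or {}).get("tool_calls", [])         # e.g. ["search_docs"]
        return {"key": "tool_call_correct", "score": float(expected in called)}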