You need to trace and eval; otherwise, you're not in production.
This AI workshop evaluated the performance of various search providers, using LangChain and a custom Python framework to compare their accuracy and efficiency across multiple queries. The workshop used LangSmith to track experiments, generate evaluation datasets, and visualize results, giving a comprehensive picture of LLM performance. A minimal sketch of that comparison harness follows below.
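The sketch below assumes hypothetical provider stubs and a tiny hand-written query set; swap in real search tools and your own dataset. The `@traceable` decorator sends each call to LangSmith when tracing environment variables are set.

```python
from langsmith import traceable

# Hypothetical provider stubs: replace the bodies with real search calls
# (e.g. the LangChain community search tools used in the workshop).
def provider_a(query: str) -> str:
    return "placeholder answer from provider A"

def provider_b(query: str) -> str:
    return "placeholder answer from provider B"

PROVIDERS = {"provider_a": provider_a, "provider_b": provider_b}

# Tiny illustrative query set; the workshop generated its own dataset in LangSmith.
QUERIES = [
    {"question": "Who founded Shopify?", "expected": "Tobias Lütke"},
]

@traceable  # traced to LangSmith when LANGSMITH_TRACING / LANGSMITH_API_KEY are set
def run_provider(name: str, query: str) -> str:
    return PROVIDERS[name](query)

def is_correct(answer: str, expected: str) -> bool:
    # Crude accuracy check: the reference string appears in the answer.
    return expected.lower() in answer.lower()

for name in PROVIDERS:
    hits = sum(
        is_correct(run_provider(name, q["question"]), q["expected"]) for q in QUERIES
    )
    print(f"{name}: {hits}/{len(QUERIES)} correct")
```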
This workshop demonstrates LangSmith's capabilities for tracing and analyzing LLM calls, using Google's Gemini model for cost-effective text summarization. The lesson highlights error tracking, token counting, and cost analysis through a practical example focused on Ottawa's bootstrapped tech ecosystem.
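A minimal tracing sketch, assuming the `langchain-google-genai` package, a LangSmith API key, and a Google API key; the project name `ottawa-summaries` and the article text are placeholders.

```python
import os

# LangSmith tracing is configured through environment variables
# (names per current docs; older SDKs use the LANGCHAIN_* variants).
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-key>"
os.environ["LANGSMITH_PROJECT"] = "ottawa-summaries"  # hypothetical project name

from langchain_google_genai import ChatGoogleGenerativeAI

# Gemini Flash as the budget-friendly model for summarization.
# Requires GOOGLE_API_KEY in the environment.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

article = "..."  # text about Ottawa's bootstrapped tech scene
summary = llm.invoke(f"Summarize this in three bullet points:\n\n{article}")
print(summary.content)
# Each call now appears in LangSmith with latency, token counts, errors, and cost.
```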
A real eval task! Set up a dataset, then an evaluator (LLM-as-a-Judge), and run an experiment on LangSmith.
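Here is one way that flow can look with a recent `langsmith` SDK; the dataset name, example content, target model, and judge prompt are all placeholders, and the evaluator uses the run/example callback signature.

```python
from langsmith import Client, traceable
from langsmith.evaluation import evaluate
from langchain_google_genai import ChatGoogleGenerativeAI

client = Client()

# 1. Dataset: a few question/reference pairs (hypothetical content).
dataset = client.create_dataset("ottawa-qa-demo")
client.create_examples(
    inputs=[{"question": "Where is Shopify headquartered?"}],
    outputs=[{"answer": "Ottawa"}],
    dataset_id=dataset.id,
)

# 2. Target: the system under test. Replace with your real chain or agent.
answer_llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

@traceable
def target(inputs: dict) -> dict:
    return {"answer": answer_llm.invoke(inputs["question"]).content}

# 3. Evaluator: LLM-as-a-Judge comparing the candidate answer to the reference.
judge_llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)

def correctness(run, example) -> dict:
    verdict = judge_llm.invoke(
        f"Question: {example.inputs['question']}\n"
        f"Reference answer: {example.outputs['answer']}\n"
        f"Candidate answer: {run.outputs['answer']}\n"
        "Does the candidate match the reference? Reply PASS or FAIL."
    )
    return {"key": "correctness", "score": 1 if "PASS" in verdict.content else 0}

# 4. Experiment: results appear under the experiment prefix in LangSmith.
evaluate(target, data=dataset.name, evaluators=[correctness], experiment_prefix="llm-judge")
```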
Why are evals your next superpower?