Learn how to systematically evaluate, improve, and iterate on AI agents using structured assessments.

- Learn how to add observability to your agent to gain insight into its steps and debug it.
- Learn how to set up evaluations for the agent components by preparing testing examples, choosing the appropriate evaluator (code-based or LLM-as-a-Judge), and identifying the right metrics.
- Learn how to structure your evaluation into experiments to iterate on and improve the output quality and the path taken by your agent.

Learn how to systematically assess and improve your AI agent’s performance in Evaluating AI Agents, a short course built in partnership with Arize AI and taught by John Gilhuly, Head of Developer Relations, and Aman Khan, Director of Product.

When you’re building an AI agent, an important part of the development process is evaluations, or evals. Whether you’re building a shopping assistant, coding agent, or research assistant, a structured evaluation process helps you refine its performance systematically rather than by trial and error.

With a systematic approach, you structure your evaluations to assess the performance of each component of the agent, as well as its end-to-end performance. For each component, you select the appropriate evaluators, testing examples, and metrics. This process helps you identify areas for improvement so you can iterate on your agent during development and in production.

In this course, you’ll build an AI agent, add observability to visualize and debug its steps, and evaluate its performance component by component. In detail, you’ll:

- Distinguish between evaluating LLM-based systems and traditional software testing.
- Explore the basic structure of AI agents (routers, skills, and memory) and implement an AI agent from scratch.
- Add observability to the agent by collecting traces of the steps it takes and visualizing those traces.
- Choose the appropriate evaluator (code-based, LLM-as-a-Judge, or human annotations) for each component of the agent.
- Set up evaluations for the skills and router decisions of the example agent using code-based and LLM-as-a-Judge evaluators, by creating testing examples from collected traces and preparing detailed prompts for the LLM-as-a-Judge.
- Compute a convergence score to evaluate whether the example agent can respond to a query in an efficient number of steps.
- Run structured experiments to improve the performance of the agent by exploring changes to the prompt, the LLM, or the agent’s logic.
- Understand how to deploy these evaluation techniques to monitor the agent’s performance in production.

By the end of this course, you’ll know how to trace AI agents, systematically evaluate them, and improve their performance.

This course is for anyone who has basic Python knowledge and wants to learn to evaluate, troubleshoot, and improve AI agents effectively, both during development and in production. Familiarity with prompting an LLM is helpful but not required.
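
To make two of the ideas above concrete, here is a minimal Python sketch of a code-based evaluator for router decisions and of a convergence score. It is an illustration under stated assumptions, not the course's implementation: the skill names, the test queries, and the keyword-based `toy_router` are hypothetical stand-ins, and the convergence score is assumed here to be the shortest observed run divided by the average run length, which is only one plausible definition.

```python
from statistics import mean

# Hypothetical test examples: each pairs a query with the skill we expect
# the router to pick. In practice these would come from collected traces.
ROUTER_EXAMPLES = [
    {"query": "What were total sales last month?", "expected_skill": "lookup_sales_data"},
    {"query": "Plot revenue by store", "expected_skill": "generate_visualization"},
    {"query": "Why did sales dip in November?", "expected_skill": "analyze_data"},
    {"query": "Explain the trend shown in this chart", "expected_skill": "analyze_data"},
]


def toy_router(query: str) -> str:
    """Stand-in for an agent's router (keyword rules, for illustration only)."""
    q = query.lower()
    if "chart" in q or "plot" in q:
        return "generate_visualization"
    if "trend" in q or "why" in q:
        return "analyze_data"
    return "lookup_sales_data"


def router_accuracy(router, examples) -> float:
    """Code-based evaluator: fraction of examples routed to the expected skill."""
    hits = sum(router(ex["query"]) == ex["expected_skill"] for ex in examples)
    return hits / len(examples)


def convergence_score(step_counts: list[int]) -> float:
    """Assumed definition: shortest run length divided by mean run length.

    A score of 1.0 means every run took the shortest observed path;
    lower scores mean the agent often takes extra steps.
    """
    if not step_counts or min(step_counts) <= 0:
        raise ValueError("step_counts must contain positive step counts")
    return min(step_counts) / mean(step_counts)


if __name__ == "__main__":
    print("router accuracy:", router_accuracy(toy_router, ROUTER_EXAMPLES))
    # Hypothetical step counts from four runs of the agent on similar queries.
    print("convergence:", convergence_score([4, 4, 5, 7]))
```

The last router example is deliberately one the toy router gets wrong, which is the kind of case a code-based check surfaces cheaply. Judgments that have no exact expected value, such as whether an analysis is clear, are where an LLM-as-a-Judge or human annotation would come in.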

Evaluating AI Agents

Learn MLOps tools for managing, versioning, debugging, and experimenting in your ML workflow.

- Learn to evaluate programs utilizing LLMs as well as generative image models using platform-independent tools.
- Instrument a training notebook, and add tracking, versioning, and logging.
- Implement monitoring and tracing of LLMs over time in complex interactions.

Machine learning and AI projects require managing diverse data sources, vast data volumes, model and parameter development, and conducting numerous test and evaluation experiments. Overseeing and tracking these aspects of a program can quickly become an overwhelming task.

This course will introduce you to Machine Learning Operations tools that manage this workload. You will learn to use the Weights & Biases platform, which makes it easy to track your experiments, run and version your data, and collaborate with your team.

This course will teach you to:

- Instrument a Jupyter notebook
- Manage hyperparameter config
- Log run metrics
- Collect artifacts for dataset and model versioning
- Log experiment results
- Trace prompts and responses to LLMs over time in complex interactions

When you complete this course, you will have a systematic workflow at your disposal to boost your productivity and accelerate your journey toward breakthrough results.

This course is for anyone who is familiar with Python and PyTorch or a similar framework and is interested in managing, versioning, and debugging their machine learning workflow.

Evaluating and Debugging Generative AI

Learn how to create an automated CI pipeline to evaluate your LLM applications on every change, for faster and safer development.

- Learn how LLM-based testing differs from traditional software testing and implement rules-based testing to assess your LLM application.
- Build model-graded evaluations to test your LLM application using an evaluation LLM.
- Automate your evals (rules-based and model-graded) using continuous integration tools from CircleCI.

In this course, you will learn how to create a continuous integration (CI) workflow to evaluate your LLM applications at every change for faster, safer, and more efficient application development.

When building applications with generative AI, model behavior is less predictable than traditional software. That’s why systematic testing can make an even bigger difference in saving you development time and cost.

Continuous integration, a key part of LLMOps, is the practice of making small changes to software in development and thoroughly testing them to catch issues early. With a robust automated testing pipeline, you’ll be able to isolate bugs before they accumulate, when they’re easier and less costly to fix. Automated testing lets your team focus on building new features, so that you can iterate and ship products faster.

After completing this course, you will be able to:

- Write robust LLM evaluations to cover common problems like hallucinations, data drift, and harmful or offensive output.
- Build a continuous integration (CI) workflow to automatically evaluate every change to your application.
- Orchestrate your CI workflow to run specific evaluations at different stages of development.

This course is for anyone with basic Python knowledge and familiarity with building LLM-based applications.
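
As a rough illustration of the rules-based testing described above, the Python sketch below checks an application's output against required terms and forbidden patterns. It is a minimal example under assumptions: `fake_quiz_app` is a hypothetical stand-in for a real LLM call, and the specific rules are invented for illustration. A function named `test_...` like the one here is the shape a test runner such as pytest would pick up when a CI job runs on each change.

```python
import re


def fake_quiz_app(topic: str) -> str:
    """Hypothetical stand-in for an LLM-backed app; returns canned text."""
    return (
        f"Question 1: Name one fact about {topic}.\n"
        f"Question 2: Why does {topic} matter?"
    )


def check_rules(output: str, required: list[str], forbidden: list[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes.

    required:  terms that must appear (case-insensitive substring match)
    forbidden: regular expressions that must not match anywhere in the output
    """
    failures = []
    lowered = output.lower()
    for term in required:
        if term.lower() not in lowered:
            failures.append(f"missing required term: {term}")
    for pattern in forbidden:
        if re.search(pattern, output, flags=re.IGNORECASE):
            failures.append(f"matched forbidden pattern: {pattern}")
    return failures


def test_quiz_mentions_topic_and_does_not_refuse():
    """Rules-based eval: the quiz must be on topic and must not be a refusal."""
    output = fake_quiz_app("photosynthesis")
    violations = check_rules(
        output,
        required=["photosynthesis", "Question"],
        forbidden=[r"\bI (can't|cannot)\b", r"as an AI"],
    )
    assert violations == [], violations


if __name__ == "__main__":
    test_quiz_mentions_topic_and_does_not_refuse()
    print("rules-based eval passed")
```

Rules like these are cheap and deterministic, so they suit running on every commit. Checks that need judgment, such as whether an answer is factually grounded, are better suited to a model-graded evaluation, which can be run less often.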

Automated Testing for LLMOps