Welcome to Evaluating and Debugging Generative AI. I'm here with Carey Phelps, founding product manager at Weights and Biases and instructor for this short course.

Hey Andrew, excited to be here.

When you're building a machine learning system, keeping track of all the data, model, and hyperparameter options can get complicated. I've been on a lot of projects where I train a model, then tune the architecture, then retrain the model, then decide to change the training settings, and so on. After iterating on the model a few times, you end up asking: the model I trained last week worked pretty well, but how do I replicate that result from a week ago, and did I remember to save not just the hyperparameter values, but also the exact datasets I used? More generally, when running a lot of models, how do you systematically keep track of everything you're trying, and use the results you're seeing to efficiently drive improvements? Even for a small team, managing and tracking machine learning model training and evaluation gets complicated, and the complexity grows with larger teams. I've seen that many teams can be much more efficient if this step of machine learning development is done more rigorously. So, this short course covers tools and best practices for systematically tracking and debugging generative AI models during the development process. We'll be using tools from Weights and Biases, which offers an easy and flexible set of tools that has become a bit of an industry standard for machine learning experiment tracking. The generative AI models we cover will include both large language models for text generation and diffusion models for image generation. Generative AI models add an additional layer of complexity compared to supervised learning, given that their output is complex and so can be harder to evaluate. So, Carey, you know these challenges really well. Can you share with learners what they'll be learning in this course?

Yeah, absolutely. Thanks, Andrew. Hello, everyone. I'm excited to be here with you. In this course, we'll be focused on evaluating and debugging generative AI. First, we'll show you how to track and visualize your experiments. Then, we'll teach you how to monitor diffusion models. And we'll discuss how to evaluate and fine-tune LLMs. Throughout the course, you'll learn about a range of debugging and evaluation tools, including Experiments to track your machine learning experiments; Artifacts to version and store datasets and models; Tables to visualize and examine predictions made by your models; Reports to collaborate on and share experimental results; the Model Registry to manage the lifecycle of your models; and finally, Prompts to evaluate large language model generation. These tools work with a wide range of frameworks and computing platforms, including Python, TensorFlow, and PyTorch.

So, a lot of good stuff there. Quite a few people have contributed to the development of this course. On the Weights and Biases side, we're grateful for the hard work of Darek Kleczek as well as Thomas Capelle, and from DeepLearning.AI, Geoff Ludwig and Tommy Nelson. By the end of this course, you'll understand best practices and also have a set of tools for systematically evaluating and debugging generative AI projects. I hope you enjoy the course.
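For a concrete sense of what the experiment tracking described above looks like in practice, here is a minimal sketch using the wandb Python library. The project name, hyperparameter values, placeholder loss, and the "train.csv" file are illustrative assumptions for this example, not values from the course.

```python
# Minimal sketch of experiment tracking with Weights & Biases.
# The project name, config values, and "train.csv" below are illustrative
# placeholders, not values taken from the course.
import wandb

config = {"learning_rate": 1e-3, "epochs": 3, "batch_size": 32}

# Start a run; the hyperparameters passed via `config` are stored with the run.
run = wandb.init(project="my-generative-ai-project", config=config)

for epoch in range(config["epochs"]):
    # In a real project this would be a training step; here we log a
    # placeholder metric so the example stays self-contained.
    fake_loss = 1.0 / (epoch + 1)
    wandb.log({"epoch": epoch, "loss": fake_loss})

# Version a dataset file as an Artifact so the exact data used is recorded.
# (Assumes a local file named "train.csv" exists.)
artifact = wandb.Artifact("training-data", type="dataset")
artifact.add_file("train.csv")
run.log_artifact(artifact)

run.finish()
```

Logging the config and the dataset artifact up front is what later lets you answer the question raised above: which hyperparameters and which exact dataset produced last week's result.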