Audio classification has many applications. For example, you may want to identify the language someone is speaking, or perhaps you want to know what birds are in your area. Let's build a sound classifier. But there's a catch. In traditional classification, a model predicts a label from a predefined set of classes it was trained on. If there is no model trained on your specific set of classes, you would have to collect a dataset and fine-tune a model. Here, you'll use an alternative approach that doesn't require fine-tuning.

For this classroom, the libraries have already been installed for you. If you're running this on your own machine, you can install transformers, datasets, and the other required libraries by running the following commands. I'll comment them out because they're already installed here.

We're going to need a sound to classify, so let's load an audio dataset from the Hugging Face Hub. The ESC-50 dataset is a labeled collection of five-second environmental sounds, such as sounds made by animals and humans, nature sounds, indoor sounds, and urban noises. You're not going to need the whole dataset, so we'll just load a few examples. In this case, you'll take a listen to the sound of a cute dog barking. Do you hear the dog barking? Sounds like a dog to me.

Now let's build the classification pipeline. For this kind of audio classification you'll need a pre-trained CLAP model. At the moment it is the only architecture of its kind available for this task, and you can find it on the Hugging Face Hub by filtering models by the feature extraction task in the multimodal category and then filtering by the name CLAP.

To classify your audio example you'll only need the array of audio data. However, the example has to have the sampling rate that the model expects. Let's step back and talk about the sampling rate. A sound wave is a continuous signal: it contains an infinite number of signal values in a given time span. But the audio your computer can work with is a series of discrete values, known as a digital representation. So, what is a digital representation? To get the digital representation of a continuous audio signal, we first capture the sound with a microphone, which converts the analog signal into an electrical signal. Then the electrical signal is sampled to get the digital representation. Sampling means measuring the value of a continuous signal at fixed time steps. As a result, the sampled waveform is discrete, with a finite number of values at uniform intervals.

A very important characteristic of digitized audio is the sampling rate. It is the number of samples taken in one second, and it is measured in hertz or kilohertz. For example, 8 kilohertz is the sampling rate of audio in a telephone or a walkie-talkie. 16 kilohertz is a sampling rate that is good enough to capture human speech without it sounding muffled. A sampling rate of 192 kilohertz is something you can expect from professional, high-definition audio recording equipment.

But why is the sampling rate important? It matters when working with AI models. Consider an example. A five-second sound at a sampling rate of 8 kilohertz will be represented as a series of 40,000 signal values. The same five-second sound sampled at 16 kilohertz will be represented as a series of 80,000 signal values. And at 192 kilohertz, it will be represented with almost a million values. For a transformer model, these three arrays are very different.
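Here is a minimal sketch of the steps so far: installing the libraries, loading a few ESC-50 examples, listening to one of them, and building the pipeline. The `ashraq/esc50` dataset mirror and the `laion/clap-htsat-unfused` checkpoint are assumptions on my part (the lesson just says "ESC-50" and "a CLAP model"); any CLAP checkpoint from the Hub works the same way.

```python
# Install the required libraries (already installed in this classroom, hence commented out):
# !pip install transformers datasets librosa soundfile

from datasets import load_dataset
from transformers import pipeline
from IPython.display import Audio as IPythonAudio

# Load just a few ESC-50 examples instead of the whole dataset.
# "ashraq/esc50" is an assumed Hub mirror of the ESC-50 dataset.
dataset = load_dataset("ashraq/esc50", split="train[0:10]")

# Take a listen to the first example: a dog barking.
audio_sample = dataset[0]
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

# Build a zero-shot audio classification pipeline around a pre-trained CLAP model.
# "laion/clap-htsat-unfused" is an assumed checkpoint name.
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)
```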
Transformer models treat their input as sequences and rely on attention mechanisms to learn audio representations. They are trained on datasets where all examples have the same sampling rate, and they do not generalize well to other sampling rates. So, for a transformer model trained on 16 kilohertz audio, a five-second high-quality recording expressed in nearly a million values will look like a 60-second recording.

If a transformer model has been trained with audio samples that were each recorded at a sampling rate of 16 kilohertz, it's going to treat any input as if it were recorded at that same sampling rate. Let's take a one-second sound recorded at a 192 kilohertz sampling rate. How many values will the array representing that sound have? 192,000 values. But if we take a model that has been trained on audio examples with a 16 kilohertz sampling rate, is it going to see one second, or more? Let's find out. The model sees 192,000 values, and it expects one second to contain 16,000 values. So, for this model, the recording is going to look like 12 seconds.

Now, what if we have a five-second recording in high definition, meaning 192 kilohertz, and the same model that was trained with audio samples at 16 kilohertz? How long will the five-second high-definition recording look to this model? We have 5 seconds times 192,000 values per second, so this sample will be an array of 960,000 values. The model expects each second to contain 16,000 values. Let's divide the number of values we have by 16,000 to see how many seconds the model will think this example is. As you can see, the original sound was only 5 seconds, just with a lot of samples per second, but for a model trained with a lower sampling rate, this exact audio will look like a 60-second recording.

Now, let's get back to our task and check what sampling rate the model in this lesson expects. We can get this information from the pipeline. This model expects audio sampled at 48 kilohertz. Let's check the sampling rate of our example. In this case, the difference in sampling rate is not large, and the model will likely do okay, but this is not always going to be the case, as you'll see in other examples. So, let's see how you can automatically cast the whole dataset to the correct sampling rate when loading it with the Datasets library. Let's check the first sample again. Now it has the same sampling rate as the model. When you load the dataset this way, all of the audio examples will have the correct sampling rate.

So, the audio sample is now ready for the model. However, you also need to provide the pipeline with the candidate labels. CLAP takes both audio and text as input and computes the similarity between the two. If you pass a text input that strongly correlates with an audio input, you'll get a high similarity score. Conversely, passing a text input that is completely unrelated to the audio input will return a low similarity score. So, let's define some candidate labels to compare the sample with. Pass the audio sample and the candidate labels to the pipeline and see which label is the most likely. Now, try more than two candidate labels.
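Continuing with the `zero_shot_classifier` and `dataset` defined above, here is a sketch of these steps: the worked sampling-rate arithmetic, checking the rate the model expects, resampling the dataset, and classifying with candidate labels. The two example labels are illustrative, not the ones from the lesson.

```python
from datasets import Audio

# Worked example: 5 seconds sampled at 192 kHz, as seen by a model trained on 16 kHz audio.
n_values = 5 * 192_000            # 960,000 signal values
print(n_values / 16_000)          # 60.0 -> looks like a 60-second recording to the model

# Sampling rate the CLAP feature extractor expects (48 kHz for this checkpoint).
print(zero_shot_classifier.feature_extractor.sampling_rate)

# Sampling rate of our example (ESC-50 ships at 44.1 kHz).
print(audio_sample["audio"]["sampling_rate"])

# Cast the whole dataset so every audio example is resampled to 48 kHz when accessed.
dataset = dataset.cast_column("audio", Audio(sampling_rate=48_000))
audio_sample = dataset[0]
print(audio_sample["audio"]["sampling_rate"])   # now 48000

# Define candidate labels and let CLAP pick the most similar one.
candidate_labels = ["Sound of a dog", "Sound of vacuum cleaner"]
print(zero_shot_classifier(audio_sample["audio"]["array"],
                           candidate_labels=candidate_labels))
```

The pipeline returns a list of dictionaries with a score and a label, sorted by score, so the first entry is the model's best guess.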
Then, try some completely unrelated labels, and see if you can gain some intuition for the limitations of this approach. We'll use the same pipeline with the same audio sample. Remember, that was a dog barking. As you can see, the candidate labels now have nothing to do with dogs or barking, yet the model still tries to find the most plausible label among the options it is given. Keep that limitation in mind before moving on to the next lesson, which covers speech recognition. A sketch of this last experiment is below.
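Again, the unrelated labels here are illustrative choices of my own, reusing the pipeline and resampled sample from above.

```python
# Labels that have nothing to do with a barking dog.
unrelated_labels = ["Sound of a bird singing", "Sound of an airplane"]
print(zero_shot_classifier(audio_sample["audio"]["array"],
                           candidate_labels=unrelated_labels))
# CLAP still ranks these options, so the top label is only the
# "least unrelated" of the choices you provided, not evidence that it is correct.
```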