Automatic speech recognition is a task that involves transcribing a speech audio recording into text. Think meeting notes or automatically generated video subtitles. For this task, you'll learn to work with the Whisper model by OpenAI. Just as before, all the necessary libraries have been installed for you, but if you're running this on your own machine, you will need the same set of libraries as before, plus Gradio.

Let's load a speech dataset. This time we'll take LibriSpeech. It is a corpus of approximately 1,000 hours of audio derived from narrated audiobooks. Oftentimes, audio datasets are very large, so it's useful to know how to load them in streaming mode. This way, the examples will be loaded only as needed. For a streamed dataset, you can access the examples one by one. By the way, if you want to access more than one example, let's say the first five, you can do it with the take function. This gives you a list, and you can access individual examples in it by their indices. You can pick whichever example you prefer, but for now, let's stick with the first one. Just like before, you will only need the audio part of this example. Let's listen to the narration. This is the first example.

There are thousands of pre-trained models for automatic speech recognition available on the Hugging Face Hub. You can find them by selecting the automatic-speech-recognition task. However, Whisper by OpenAI remains one of the best models for this task. Whisper was pre-trained on a vast quantity of labeled audio transcription data, 680,000 hours to be precise. What is more, 117,000 hours of this pre-training data is multilingual, or non-English. This results in checkpoints that can be applied to over 96 languages. Here, for the sake of efficiency, we will use a distilled version of the model that only works for the English language. By distilled, I mean a smaller model that was trained using the outputs of the full Whisper model. This checkpoint is over 10 times smaller, 5 times faster, and within 3% word error rate of the large model.

Just as before, let's check the sampling rate that Whisper expects. Now let's see what sampling rate our example has. Hooray! This time they're the same, so we can pass the audio as is to the pipeline. It worked! Now let's compare this to the transcription that came with the example. Notice that unlike the reference transcription, Whisper returns a transcription with proper capitalization and punctuation, which makes it much easier to read.
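Here is a minimal sketch of these steps, assuming the distilled English checkpoint distil-whisper/distil-small.en and the train.clean.100 split; the exact names may differ from the notebook you're using.

```python
from datasets import load_dataset
from transformers import pipeline

# Load LibriSpeech in streaming mode so examples are fetched only as needed.
# (Recent versions of the datasets library may also require trust_remote_code=True.)
dataset = load_dataset("librispeech_asr", split="train.clean.100", streaming=True)

# Take the first five examples from the stream and keep the first one.
examples = list(dataset.take(5))
example = examples[0]

# Build an ASR pipeline with a distilled, English-only Whisper checkpoint (assumed name).
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-small.en")

# Whisper expects 16 kHz audio; LibriSpeech is already sampled at 16 kHz.
print(asr.feature_extractor.sampling_rate)   # 16000
print(example["audio"]["sampling_rate"])     # 16000

# Transcribe the raw audio array and compare with the reference transcription.
print(asr(example["audio"]["array"])["text"])
print(example["text"])
```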
Now let's build a simple transcription demo, and we'll use Gradio for this. Let's create a transcribe_speech function that will be a wrapper around our pipeline. We'll give the demo two tabs: one for transcribing from the microphone, and one for transcribing uploaded audio files. For the microphone tab, we create a Gradio interface around the transcribe_speech function and specify where the audio input is going to come from, in this case the microphone, and what the output should look like, in this case a text box. If you would like to learn more about Gradio, there is a course about Gradio on DeepLearning.AI. We'll create the tab for uploading files in the same fashion. Now just bring everything together and launch the demo.
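Here is a sketch of what the demo code might look like, assuming the asr pipeline created above and a recent Gradio release (older releases spell the sources argument as source):

```python
import gradio as gr

def transcribe_speech(filepath):
    # Gradio passes the path of the recorded or uploaded audio file.
    if filepath is None:
        return "No audio found, please retry."
    return asr(filepath)["text"]

# One interface for microphone input, one for uploaded files; both show the text in a textbox.
mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(label="Transcription", lines=3),
)
file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources=["upload"], type="filepath"),
    outputs=gr.Textbox(label="Transcription", lines=3),
)

# Bring both together as tabs and launch the demo.
demo = gr.Blocks()
with demo:
    gr.TabbedInterface(
        [mic_transcribe, file_transcribe],
        ["Transcribe Microphone", "Transcribe Audio File"],
    )
demo.launch()
```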
Try the demo: try recording yourself speaking into the microphone, or try uploading audio files, and see if the transcription matches what you say. Let's test if Whisper can transcribe what I'm saying. Ta-da! Now try speaking for about a minute and see what happens to the transcription. You may notice that the demo only transcribed part of what you were saying if you were speaking for longer than 30 seconds. This is because Whisper expects audio samples to be under 30 seconds, and everything beyond that will be truncated. Realistically, you may want to transcribe longer recordings, say a whole meeting. You can still do that with this pipeline, but you will need to provide a few additional arguments. Let's illustrate this.

Let's get a longer audio example first. We're going to stop the demo, otherwise it's going to prevent us from running any more cells. To stop the Gradio demo, click on the square icon that interrupts the kernel. We're going to be using the same automatic speech recognition pipeline. Let's check again what sampling rate the model expects and compare it to this new example: there is a big difference in sampling rate this time, and we'll come back to that in a moment. First, let's listen to the example. Notice that this recording is in stereo. Stereo is great at adding a spatial component to audio, so when you're listening to music you get a better experience, but for transformer models it's usually not needed. Most transformer models work with mono-channel audio. This is because you don't really need the spatial information to identify whether a sound is a dog barking or a cat meowing, and you don't need to know where the speech is coming from, you just need to know what has been said. At the same time, stereo audio has two channels, so that's twice the amount of data, and it just increases the computational complexity without really providing any benefit.

Let's see how we can convert this audio to mono. Let's check the shape of the audio array. As you can see, there are two channels in this audio, but we need just one. We're going to use a library called librosa to convert this audio array from stereo to mono. Librosa expects the shape of the audio array to have the number of channels first and then the data, so one more step before we can do the conversion is to transpose this array. Let's check the shape again. Now let's convert this audio to mono, and once that's done, let's listen to the example. We'll pull these steps together in a code sketch at the end of this section.

Now let's pass this example to the pipeline and see what we get. The output doesn't match the audio at all: the model tries to guess what the narration is but fails to do so. Let's double-check why. Let's see what the sampling rate of this example is. The sampling rate of the example is 44,100 Hz, and let's see what the pipeline expects. The pipeline expects the audio to be sampled at 16 kHz. Let's fix this. We can resample an individual file using librosa. Now the audio example is ready for the pipeline.

For the pipeline to be able to transcribe the longer recording, we will need to pass a few arguments to it. But first, let's talk about how the automatic speech recognition pipeline handles longer recordings. Because Whisper can only take in up to 30 seconds at a time, to transcribe this longer example the pipeline will split the long file into chunks. We can specify the chunk length. For Whisper, 30-second chunks are optimal, since this matches the input that the model expects, and each segment will have a small amount of overlap with the previous one. This allows the pipeline to accurately stitch the segments back together at the boundaries, since it can find the overlap between segments and merge the transcriptions accordingly. Because the audio is split into chunks, the pipeline can transcribe the chunks independently of each other and then combine the results. For this reason, you can transcribe batches of chunks in parallel.

Let's see the arguments that will be passed into the pipeline. First, the length of the chunks, 30 seconds in this case. Next, we will specify how many chunks we want processed in parallel using batch_size. In this case, the original file is only 1 minute and 21 seconds, so there is no need to have a batch size larger than three or four. For a larger file, the right batch size will depend on your hardware and the memory available to you. If you try a large batch size and get an out-of-memory error, you know that you need to try a smaller one. In the first introduction lesson, I gave you a rule of thumb to estimate how much memory you need to run a model. To estimate the batch size, think of it as a multiplier for that memory amount, roughly as if you were running that many copies of the model in parallel. If you have the hardware for it, you can use a large batch size. In this case, four is plenty, and we don't need any more: with 30-second chunks and a bit of overlap, it's probably even more than necessary. Three would likely do, but three or four is fine here. Try experimenting with the batch size and see what your hardware can handle.

Finally, you can set return_timestamps to True, and this enables predicting segment-level timestamps for the audio data. These timestamps indicate the start and end time for a short passage of audio, and they can be particularly useful for aligning the transcription with the input audio. To output the transcription with the timestamps, you can print the chunks part of the output. Now we get the transcription for the full audio, and we get the timestamps. The sketches below pull the preprocessing and these pipeline arguments together.
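Here is a sketch of the preprocessing we just walked through, stereo to mono and then resampling to 16 kHz. The audio variable holding the longer example's array and the printed shapes are illustrative assumptions:

```python
import numpy as np
import librosa

# `audio` is assumed to hold the longer example as a NumPy array shaped (num_samples, 2).
print(audio.shape)              # e.g. (3600000, 2): samples first, two channels

# librosa.to_mono expects channels first, so transpose before converting.
audio_transposed = np.transpose(audio)
print(audio_transposed.shape)   # (2, 3600000)

audio_mono = librosa.to_mono(audio_transposed)
print(audio_mono.shape)         # (3600000,): a single mono channel

# The example is sampled at 44,100 Hz, but the model expects 16 kHz, so resample it.
audio_16k = librosa.resample(audio_mono, orig_sr=44100, target_sr=16000)
```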
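And here is a sketch of the long-form transcription call itself, using the chunking, batching, and timestamp arguments described above; it assumes the asr pipeline and the audio_16k array from the previous sketches:

```python
# Split the long recording into 30-second chunks, transcribe a few chunks in parallel,
# and return segment-level timestamps alongside the text.
output = asr(
    audio_16k,
    chunk_length_s=30,        # optimal chunk length for Whisper
    batch_size=4,             # chunks transcribed in parallel; lower this if you run out of memory
    return_timestamps=True,   # start and end times for each segment
)

print(output["text"])         # the stitched-together transcription
for chunk in output["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```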
Now let's see how we can modify the demo to accept longer audio recordings. First, let's copy the original demo, starting with the transcribe_speech function. Here we're going to modify that function: it still takes the current audio file, but now we pass the additional long-form arguments through to the pipeline call. That's the only function we need to update, and a sketch of the updated wrapper is included at the end of this lesson. The code snippet launching the demo doesn't change, so it's ready to go. Try uploading an audio file that is longer than 30 seconds, or record yourself speaking into the microphone, again for longer than 30 seconds, and see if it works. In the next lesson, you'll learn how to go in the opposite direction and go from text to speech. Let's go to the next lesson.
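Here is a sketch of the updated wrapper, under the same assumptions as the earlier demo sketch; the function name transcribe_long_form is illustrative, and the interface and launch code from before can be reused unchanged with this function plugged in:

```python
def transcribe_long_form(filepath):
    # Same wrapper as before, but with the long-form arguments passed through to the pipeline.
    if filepath is None:
        return "No audio found, please retry."
    output = asr(
        filepath,
        chunk_length_s=30,        # split the recording into 30-second chunks
        batch_size=4,             # transcribe a few chunks in parallel
        return_timestamps=True,   # needed for stitching long-form output
    )
    return output["text"]
```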