In this lesson, you will deploy your first on-device model with just a few lines of code. You will get an introduction to real-time image segmentation and the models that are popular for this particular task. You will pick a state-of-the-art model, deploy it on-device, validate its numerical correctness, test its performance, and see inference running on a real smartphone. All right. Let's get started.

There are several applications of on-device AI. These include real-time object detection, where you can detect people, faces, and QR codes; speech recognition, which translates your speech input into text; pose estimation, which predicts the human pose from a real-time image or video stream; image generation, where you type a text prompt and get a picture rendered by generative AI; super resolution, which takes a low-resolution image and creates higher-resolution content; and image segmentation, which will be the focus of today's lesson. You can explore models for all of these topics at github.qualcomm.com. There are recipes on GitHub as well as Hugging Face for getting all of these models running on-device.

So what is image segmentation? It's the task where, given an image, you break it down into all of its meaningful segments, so you can do analysis and easier object identification. To give you an example, let's say you have this particular image of a street. The image segmentation model consumes this image and predicts the various segments of the image. For example, pink refers to roads, green refers to trees, the lighter green refers to sidewalks, and red refers to people.

There are many types of image segmentation. The two most popular ones are semantic segmentation and instance segmentation. In semantic segmentation, as shown by the image on the left in blue, every pixel refers to a specific class. Here, yellow refers to the table and pink refers to people. In instance segmentation, every pixel refers not only to a class, but also to an individual instance within the same category. So in the image on the right, each person is marked in a different color and depicted as a separate instance of the person category.

There are many applications of image segmentation. It is deployed in advanced driver assistance systems to mark the roads, the cars, and the people separately. It's extremely popular in image editing software, to apply filters to your hair or your face, or to blur out backgrounds. It's used everywhere in video conferencing software to blur your background while you're speaking on calls. And finally, it's used in drones to map out landscapes, especially for agricultural and industrial applications.

So in this lesson, you will deploy a real-time segmentation model. Here, the processing and segmentation of the image must happen instantaneously, frame by frame. This particular video depicts real-time segmentation being performed on a street-view camera: the light blue areas are the cars, the gray is the road, the darker green is the trees, the pink is the people, and the blue is the sky. All of this has to be applied on a frame-by-frame basis, which makes it extremely challenging from a systems perspective. To give you an example, let's assume you have to process all of this data at 30 frames per second.
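To make that budget concrete, here's the arithmetic as a tiny self-contained Python check:

```python
# Frame-budget arithmetic: at 30 frames per second, how long can a
# single frame take before the next one arrives?
fps = 30
frame_budget_ms = 1000 / fps  # milliseconds available per frame
print(f"Per-frame budget at {fps} fps: {frame_budget_ms:.1f} ms")
# Prints: Per-frame budget at 30 fps: 33.3 ms
```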
Now, 30 frames per second implies that you have about 33 milliseconds to process each frame before the next frame arrives. This means that your entire AI model must run within that 33-millisecond budget for you to get all of the analysis needed before the next frame arrives.

There are several different models used for semantic segmentation. Here are four of the most popular ones. The first is ResNet, which uses residual connections to train very deep networks; it's quite popular for segmentation applications. High-Resolution Network, or HRNet, is another architecture that is popular for real-time segmentation; this network maintains high-resolution representations throughout the network to capture finer-grained details. The next popular one is FANet, or Feature Agglomeration Network, which focuses on agglomerating features from different scales, again to get the best detail possible for prediction. And another popular network is DDRNet, or Deep Dual-Resolution Network, which employs a dual-path architecture to balance efficiency and accuracy. All of these networks can run entirely locally on the device.

In this particular lesson, we will focus on a network called FFNet, or Fuss-Free Network, which has a simple encoder-decoder architecture with a ResNet-like backbone and a small multi-scale head. From an accuracy standpoint, this network performs just as well as more complex semantic segmentation networks like HRNet or FANet, which I introduced you to earlier. But it has one big advantage: it is computationally very efficient, making it perfect for deployment on-device. Another advantage of this network is that it's extremely configurable, so you can have various sizes of encoders and decoders depending on the needs of your environment as well as the accuracy needs of your application.

Here are a few variants of the FFNet architecture that you will get to explore in today's lesson. These include the FFNet 40S, 54S, and 78S architectures, which are all based on a ResNet backbone. They operate on a 1024x2048 resolution, so it's a pretty high resolution. These models range from about 55MB to about 100MB in size, have between 13 and 27 million parameters, and take about 62 to 96 gigaflops of computation to run on the device. There are also low-res variants of the FFNet architecture: the 78 low-res and the 122 low-res. They operate on a 512x1024 resolution. They're a little bit larger, between 100MB and 130MB in size, with somewhere around 26 to 32 million parameters. The lower resolution obviously requires fewer operations and, as a result, fewer gigaflops.

Now, let's see how you can deploy all of these networks on a device. In this notebook, you will deploy a real-time segmentation model on-device. The goal of this notebook is to give you a high-level overview of how to deploy models on-device. In the next lesson, you will learn the details of every concept covered in this particular notebook. This notebook deploys the FFNet model; I have shared a link to the FFNet paper for your reference. You will explore the various variants of the FFNet network, and you will do so with the Qualcomm AI Hub Models Python package, which contains many PyTorch models that are ready for you to try out right away.
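As a minimal sketch of what that looks like in code, here is how you might load the FFNet 40S variant and produce the computational summary discussed next. The module path `qai_hub_models.models.ffnet_40s` and the `Model.from_pretrained()` call follow the package's usual conventions, but check the package documentation in case the layout differs in your version:

```python
# Install the model zoo and the summary tool first:
#   pip install qai-hub-models torchinfo
from qai_hub_models.models.ffnet_40s import Model as FFNet40S
from torchinfo import summary

# Download the pre-trained FFNet 40S weights and build the PyTorch module.
model = FFNet40S.from_pretrained()

# Summarize parameter counts and multiply-adds for a 1024x2048 input
# (shape is batch, channels, height, width).
print(summary(model,
              input_size=(1, 3, 1024, 2048),
              col_names=["num_params", "mult_adds"]))
```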
In order to get a computational summary of a particular model, you will use the torchinfo package, which gives you a detailed description of the total number of parameters, the size of the model, and the computational complexity. Now let's run some code to produce a computational summary of the model. This takes the pre-trained weights of the model, with the input resolution of 1024x2048 that you learned about in the lesson, and uses the torchinfo package to summarize the model for this input resolution. The summary contains the number of parameters, the computational complexity, and the size of the network. If we scroll down, the total number of parameters is about 13.9 million, all of which are trainable. The total computational complexity of this network is about 62 gigaflops, and the input size for this particular model is about 25MB. This gives you a rough sense of how big this model is, how much compute it's going to take, and the total number of parameters required to deploy it.

I've created an exercise for you to recreate the table that was shown in the lecture. This exercise lets you explore the different variants of the FFNet architecture and the computational complexity associated with each variant. The top three variants highlighted here operate on the higher resolution of 1024x2048, while the next two are the low-resolution variants that operate on 512x1024. You can get the computational summary, including the number of parameters, the computational complexity, and the model size, for each of these FFNet variants.

The next step is to set up Qualcomm AI Hub for device-in-the-loop deployment. AI Hub is available as a Python package: you install it with pip and configure it with your API token.

Now, you will run a simple demo of the FFNet 40S network in the notebook. This particular line downloads the pre-trained weights of the FFNet 40S model directly from the source, runs sample inference on a simple image, and provides the predictions of the model on that image. Specifically, it runs inference on a sample image provided to the network and shows the predictions along with the annotations: red being people, green being trees, gray being the road, blue being the cars. Note that this demo runs entirely in the notebook, in the cloud environment. The goal of this lesson is for you to take this particular model and this particular example and run them on a smartphone.

Now let's run this model on a real smartphone. For that, we will use the FFNet export function and provide a device, for example the Samsung Galaxy S23. Running this specific line of code performs all four steps that were shown in the slides. First, the model is taken from PyTorch and converted and compiled for the Samsung Galaxy S23; that is done through a compilation job. Once the compilation is completed, a real physical Samsung Galaxy S23 is provisioned in the cloud. Once this real device is provisioned, a performance profile is run to get a sense of how long this specific model takes to run on the device. And finally, once the performance profiling is completed, we run inference on-device on the same sample image that was shown to you earlier, so you can see how accurate it is on the device.
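Here is a minimal sketch of that device-in-the-loop flow, assuming an `export_model` entry point under `qai_hub_models.models.ffnet_40s.export`; the exact function name and keyword arguments can vary between package versions, and you will need an AI Hub API token configured first:

```python
# One-time setup for device-in-the-loop runs:
#   pip install qai-hub
#   qai-hub configure --api_token <YOUR_API_TOKEN>
from qai_hub_models.models.ffnet_40s.export import export_model

# Compiles the PyTorch model for the target device, provisions a real
# Samsung Galaxy S23 in the cloud, profiles its latency, and runs
# sample inference on-device.
export_model(device="Samsung Galaxy S23")
```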
This particular process takes about 2 to 5 minutes, depending on the load on the server. The outcome of this script is a simple performance summary, highlighted here. It says that this model was run on a Samsung Galaxy S23, that it ran in about 22 milliseconds, and that the total number of operators is about 92. It ran entirely on the neural processing unit. You will explore each of these concepts in detail in the next lesson. The numerical correctness is also displayed here, using a specific measure called peak signal-to-noise ratio (PSNR), which compares the numerical correctness of the inference that ran on the device with the inference that ran locally in the notebook environment. Anything more than 30 is typically considered good. In this specific scenario it is 62, which means that the on-device inference matches the cloud almost exactly. Note that many of these links may not be accessible to you, so we've provided a simple set of explorable links for all of the variants of FFNet, with a detailed view describing the performance, the number of layers, and the memory consumption for each of the different variants of the FFNet architecture.

Now, in the last step of this notebook, you will run the image segmentation model on a physical device. This runs the same demo that you saw earlier in the notebook, but instead of running it in the local notebook environment, it runs on a Samsung Galaxy S23 that is provisioned in the cloud for you. Note that this particular line takes a few minutes: the PyTorch model is sent to the server and compiled for the Samsung Galaxy S23, a Samsung Galaxy S23 is provisioned in the cloud for you to access, the image is passed through the model compiled for the device, and the output predictions are returned to you. You will then see the displayed output of the results coming from the Samsung Galaxy S23. Okay, great. You see the result? It looks exactly the same as what I got in the notebook: the red being the people, the blue being the cars, the green being the trees, and the gray being the road. But this particular inference came from a real Samsung Galaxy S23.

So in this lesson, you learned about the various variants of FFNet. You explored the computational complexity of each of those variants. You ran a demo locally in the notebook environment. You exported the model for the Samsung Galaxy S23 and measured its performance at about 22 milliseconds for the FFNet 40S variant. You noticed that the PSNR between the cloud inference and the on-device inference was about 62, which means it produced the same results as the cloud. And finally, you saw an end-to-end demo of all of this working, where you provided an image and got results back that looked visually identical to what you got in the cloud. In the next lesson, we'll go into a lot more detail on each of these concepts, break them down, and understand what's happening behind the scenes in each of these areas, so you can fully understand what it takes to deploy models on-device. All right. See you there.