In this lesson, you will learn four key concepts for on-device deployment. The first is neural network graph capture, which captures the computation that you're going to deploy to the device. The second is on-device compilation. The third is accelerating the model with the hardware on the device. And the fourth is the importance of validating numerical correctness on-device. All right, let's go.

You will go over key concepts to enable you to take a model trained in the cloud and deploy it on-device. The first is graph capture, where you will learn how to capture the computation of a neural network into a portable format that you can take to the device. The second is compilation, where you will learn how to compile models for a target device. Third, you will learn how to validate the compiled model on-device so that your cloud and device produce the same results. And finally, you will run a performance profile on-device so you can be confident that your model can meet the constraints of the device.

So what is graph capture? You might be familiar with a block of code in PyTorch that looks a bit like this. Here you have two convolutions, conv1 and conv2, and each of those convolutions is followed by a ReLU activation. The key thing to know is that this representation of the model in code can be turned into a graph representation, and that is the concept called graph capture. As you can see, the same model is now represented as a graph where the input is x, and you have the two convolutions and the two ReLUs.

Now let's see how to do this in a notebook. Let's start by importing the FFNet-40S model using the Qualcomm AI Hub Models package. This loads the PyTorch model into memory in your notebook. In order to capture the computation of the FFNet-40S network, you will use PyTorch's jit trace functionality. First, you have to set up your PyTorch model with the input shapes required by the network. This particular network takes an RGB image of size 1024x2048, described in the input shape as a three-channel 1024x2048 tensor. Next, you generate some random input to provide to the network so you can trace it. And finally, you call the torch.jit.trace function, which takes the sample input and passes it through the model. This function records all the computation as it flows through the network, capturing the entire computation as a graph. The resulting output is a PyTorch traced model that looks a bit like this. As you can see, this traced model captures all the computations that describe the FFNet-40S network, except these computations are now entirely portable, so you can take them for deployment on-device.

In the next step, we're going to take this graph that captured the computation of the network and compile it for consumption on the device. Let's look at that in the notebook. Now, let's look at how you can compile this model for the device. First, you will set up Qualcomm AI Hub, and you will use the Get Devices API to get a list of all the devices that are available to you in the cloud. These are real, provisioned physical devices that you can call with the APIs in order to run models on them. As you can see, there are Samsung devices, Google devices, Xiaomi devices, and various robotics devices as well.
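For reference, a minimal sketch of the trace and device-listing steps just described might look like the following in Python. It assumes the FFNet-40S model is exposed by the qai_hub_models package under the import path shown (an assumption, so check the notebook for the exact names) and that your Qualcomm AI Hub API token is already configured.

```python
import torch
import qai_hub as hub
from qai_hub_models.models.ffnet_40s import Model as FFNet40S  # assumed import path

# Load the pretrained FFNet-40S PyTorch model into memory.
torch_model = FFNet40S.from_pretrained().eval()

# The network expects a single RGB image of size 1024x2048:
# (batch, channels, height, width) = (1, 3, 1024, 2048).
input_shape = (1, 3, 1024, 2048)

# Generate random input and trace the model; torch.jit.trace runs the sample
# input through the network and records the computation as a portable graph.
example_input = torch.rand(input_shape)
traced_model = torch.jit.trace(torch_model, example_input)

# List the cloud-hosted physical devices available through Qualcomm AI Hub.
for device in hub.get_devices():
    print(device.name)
```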
In order to compile the traced model for the device, all you need to do is use the Submit Compile Job API: pick the device that you would like to compile for, provide the traced model, and provide the input specification, in this case an RGB 1024x2048 image. When you run this block of code, the traced model gets sent over to the cloud and optimized for this particular Samsung Galaxy S23 device. The output of the compile job is a target model that's compiled for that specific device. You can get access to the target model using the Get Target Model API, which provides you with a downloadable artifact that is deployable on the device.

Now let's learn what these compatible artifacts are. You may have been wondering what the .tflite extension was on the compiled model that we downloaded. That is an artifact that is compatible with the TensorFlow Lite runtime. There are three popular runtimes for on-device deployment. One is TensorFlow Lite, which I recommend for Android applications. Another is ONNX Runtime, which I recommend for Windows-based applications. And the third is the Qualcomm AI Engine, which is suitable for fully embedded applications on Qualcomm hardware.

Let's learn a little more about the TensorFlow Lite runtime. It is specially designed for mobile platforms and embedded devices for extremely efficient performance. It provides fast response times by reducing computation overhead. It's flexible in deployment, so you can use it for smartphones and various IoT gadgets, making it very portable and ubiquitous. It's energy efficient because it traditionally uses less power. And it's also hardware-accelerated and fully compatible with neural processing units, or NPUs, using a mechanism called delegation.

Now let's explore this in an exercise that I have created for you in the notebook. This exercise allows you to try out different runtimes, including TensorFlow Lite, ONNX Runtime, as well as the Qualcomm AI Engine. This can be passed as an option to the Submit Compile Job API, and the option specifies the target runtime. You can explore more options in the URL provided here.
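As a rough sketch, the compile step described above could look like the following. The traced_model comes from the earlier capture sketch, and the `--target_runtime` option value is an assumption based on the options page mentioned above, so treat it as illustrative rather than exact.

```python
import qai_hub as hub

# Submit the traced graph for compilation against a specific cloud-hosted device.
compile_job = hub.submit_compile_job(
    model=traced_model,                         # traced PyTorch model from the capture step
    device=hub.Device("Samsung Galaxy S23"),    # target device provisioned in the cloud
    input_specs={"image": (1, 3, 1024, 2048)},  # RGB 1024x2048 input
    options="--target_runtime tflite",          # assumed flag; other runtimes (ONNX Runtime, Qualcomm AI Engine) can be selected here
)

# Retrieve the device-ready artifact (a .tflite file for the TensorFlow Lite
# runtime) and download it for deployment.
target_model = compile_job.get_target_model()
target_model.download("ffnet_40s.tflite")
```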
Now that you have compiled a model for the device, let's look at how you can execute it on the device. It's really important to understand the computational resources that are available to you on a device. There are three popular compute units available on a modern device: the CPU, the GPU, and the neural processing unit, or NPU. The CPU is the most flexible, general-purpose computation engine. You can do computations, branches, and for loops, and it's extremely easy to program against. Next is the GPU, which is designed for high-performance, complex parallel computation. It's slightly harder to program than a CPU, but still relatively easy. The third unit is the neural processing unit, which is extremely efficient, sometimes up to ten times more efficient than the CPU for neural networks. The downside of the NPU is that it's slightly less flexible than the CPU and the GPU. Each of the runtimes that you learned about earlier in this lesson has backends that allow you to tap into the CPU, the GPU, or the NPU, and you have control as an application developer to target each of these compute units separately. It's also important to understand that different devices have different capabilities. Devices that are more than seven or eight years old typically don't have a neural processing unit; they only have a CPU and GPU. Modern devices, especially Qualcomm-powered devices, have neural processing units that allow you to do computation extremely efficiently. Now let's see this in the notebook.

Now you will deploy the compiled model on a device and run a performance profile. In order to do so, you will use the Submit Profile Job API, selecting a specific device, in this case the Samsung Galaxy S23, and providing the target model that was previously compiled by the compile job. This block of code submits the target model to the cloud, provisions a Samsung Galaxy S23, and runs the profile. The profile is then downloaded using the Download Profile API, and you can visualize the performance profile using the print profile metrics from job function. This takes a couple of minutes, and you will see the results displayed below. The model ran using the TensorFlow Lite runtime in about 27.9 milliseconds. It ran entirely on the neural processing unit and consumed about 3 to 5 MB of application memory.

Next, you will explore in an exercise how you can try different compute units to get a sense of how the model performs with each of them. You can do so by passing options to the Submit Profile Job function and providing the compute unit, which can be CPU, GPU, or NPU. Note that each of these compute units has a fallback: if certain operations of the neural network graph are not supported by a compute unit, those operations fall back to the CPU by default.

Now you will learn how to check the accuracy of the model on-device. In order to do so, you will take a sample input; it can be a few samples or a single sample. You will run inference with the PyTorch runtime in your notebook environment. You will also run inference on the device, on the neural processing unit, on the same image. Then you will compare the results between the cloud environment and the device environment. To compare these two results, we will use the peak signal-to-noise ratio (PSNR) measurement, which measures the delta between the image produced here in the cloud and the image produced on the device. Let's see this in the notebook.

Now let's look at how to perform on-device inference. To do so, you've been provided with some sample inputs associated with the FFNet-40S network, obtained using the FFNet-40S network's sample inputs method. These sample inputs correspond to a NumPy array of size 3x1024x2048, which is an RGB image of size 1024x2048, which is what the model requires. Note that the values of these tensors are scaled between 0 and 1. Now take the sample inputs provided here, pass them through the FFNet-40S network defined in torch, and get the outputs. These are outputs from running locally in the notebook. The goal now is to compare this with what you get on-device. You can do that with the Submit Inference Job API: you pass in the target model, which is the compiled artifact, the sample inputs, which we just saw, and choose a device, in this case the Samsung Galaxy S23. When you run this command, inference is run on that cloud-hosted device, and you get back the results in NumPy format. You can access the results using the Download Output Data API. This job takes about a minute or so to complete, and the output is a NumPy array. Now you can compare the result between the local CPU inference that ran in your notebook and the one that ran on the device hosted in the cloud.
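Putting the profiling and on-device inference steps together, a hedged sketch might look like the following. It reuses torch_model and target_model from the earlier sketches, the `--compute_unit` option value and the output-data layout are assumptions, and the PSNR helper is a plain NumPy implementation rather than the utility used in the notebook.

```python
import numpy as np
import torch
import qai_hub as hub

device = hub.Device("Samsung Galaxy S23")

# Profile the compiled model on the device. The compute unit can be steered with
# an option such as "--compute_unit npu" (cpu / gpu / npu are assumed values);
# unsupported operations fall back to the CPU.
profile_job = hub.submit_profile_job(
    model=target_model,            # compiled artifact from the compile job
    device=device,
    options="--compute_unit npu",
)
profile = profile_job.download_profile()  # timing and memory metrics

# Run local PyTorch inference on a sample input (values scaled between 0 and 1).
sample = torch.rand(1, 3, 1024, 2048)
with torch.no_grad():
    local_output = torch_model(sample).numpy()

# Run the same input through the compiled model on the cloud-hosted device.
inference_job = hub.submit_inference_job(
    model=target_model,
    device=device,
    inputs={"image": [sample.numpy()]},
)
device_outputs = inference_job.download_output_data()  # dict: output name -> list of arrays
device_output = list(device_outputs.values())[0][0]

# Compare cloud and device results with peak signal-to-noise ratio (PSNR).
def psnr(reference, test, eps=1e-10):
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    peak = np.max(np.abs(reference))
    return 10.0 * np.log10(peak**2 / (mse + eps))

print(f"PSNR between local and on-device outputs: {psnr(local_output, device_output):.1f} dB")
```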
As you can see, the peak signal-to-noise ratio is about 60. Anything more than 30 is typically considered quite accurate, so this implies that the results you got on the device match up very well with what you get in the cloud, and you can be confident deploying this particular model. Now let's get ready for deployment. The model has been validated on-device for performance as well as numerics, and you can use the Get Target Model API to download the model, giving you an artifact that you can deploy on-device.

In this lesson, you learned four key concepts. The first was how to capture a PyTorch graph with a trace. The second was how to take that traced model and compile it for the device. The third was how to profile that model on-device, targeting the neural processing unit for maximum hardware efficiency. And finally, you learned how to validate the model on-device to make sure your cloud environment and your device environment produce the same results. In the next lesson, you will learn how to take this model, quantize it, and make it up to four times faster and four times smaller. See you there.