In this lesson, you will learn how to integrate an AI model into a smartphone application on-device. The model will take frames directly from the camera and give you results that you will see in a live demo of real-time segmentation running at 30 frames per second on a smartphone. Let's have some fun!

Now let's look at how you can integrate a model into an on-device application. In this particular application, integration requires an understanding of how data gets transformed all the way from the camera stream to the output stream. Device integration is typically the process of integrating your model into a processing pipeline. The pipeline starts with the camera stream, which gives you data in either RGB or YUV format. The GPU pre-processing phase converts the data from YUV to RGB, because the model requires RGB, and typically downsamples it to the resolution at which the model was trained. If your model was trained at, say, 224x224 resolution, downsampling means taking the resolution of the camera stream, which is 720p, and converting it to the resolution of the model, which is 224x224. The model is then run on the neural processing unit as efficiently as possible. In segmentation applications, you get an output mask, typically at the same resolution as the model, so in this case the output mask is 224x224. Once you get the output mask from the model, you perform GPU-based post-processing, which typically involves upsampling to the output resolution, thresholding, and blurring in order to nicely fuse the output mask on top of the image. The result is an output stream with the predictions overlaid on top of the camera stream.

So how is this application going to be implemented? There are five main parts. The first is the camera stream, where you have to extract the frames, say at 30 frames per second, and understand what the camera stream provides, which is typically either RGB or YUV data. Second, you have to implement the pre-processing. This is done using OpenCV on the GPU for faster pre-processing; it is extremely important to use the GPU to get the best performance. Third is model inference, which is done on the device using the runtime APIs, typically C++ or Java-based APIs. You should ensure that the model runs on the neural processing unit for the best performance. Fourth, you should implement post-processing, which takes the output of the model and again uses OpenCV on the GPU in order to overlay the outputs on the display. And the most important bit is packaging the runtime: you should make sure your application packages all the runtime dependencies needed to maximize hardware acceleration.

To give you a sense of how runtime dependencies are managed: a typical Android project has your Java source, your native source, and various dependencies for your source code. Applications that include AI models typically require the models to be packaged as part of the application; in the diagram, that's the box with the TFLite models. The runtimes also need to be packaged with your application, and these are bundled separately. You have the TensorFlow Lite runtime, which is the package containing only the CPU implementation, and the GPU delegate, which is an extra dependency that lets you use the GPU for processing on older devices.
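To make the pipeline described above a little more concrete, here is a minimal Kotlin sketch of the pre-processing, inference, and post-processing steps. It is not the course's sample code: it assumes OpenCV's Java bindings (running on the CPU rather than the GPU used in the real app), the TensorFlow Lite Interpreter API, NV21 camera frames, and a 224x224 model with a single-channel output mask.

```kotlin
import org.opencv.core.Core
import org.opencv.core.CvType
import org.opencv.core.Mat
import org.opencv.core.Size
import org.opencv.imgproc.Imgproc
import org.tensorflow.lite.Interpreter
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Simplified, CPU-only sketch of the camera-to-overlay pipeline.
// Assumptions (not taken from the course code): NV21 camera frames,
// a 224x224 float input, and a single-channel 224x224 float output mask.
class SegmentationPipeline(modelFile: File) {
    private val interpreter = Interpreter(modelFile)

    fun process(yuvFrame: Mat): Mat {
        // 1. Pre-processing: YUV -> RGB, then downsample to the model resolution.
        val frameRgb = Mat()
        Imgproc.cvtColor(yuvFrame, frameRgb, Imgproc.COLOR_YUV2RGB_NV21)
        val modelInput = Mat()
        Imgproc.resize(frameRgb, modelInput, Size(224.0, 224.0))

        // 2. Inference: pack pixels into a float buffer and run the model.
        val input = ByteBuffer.allocateDirect(224 * 224 * 3 * 4)
            .order(ByteOrder.nativeOrder())
        val pixels = ByteArray(224 * 224 * 3)
        modelInput.get(0, 0, pixels)
        for (p in pixels) input.putFloat((p.toInt() and 0xFF) / 255f)
        input.rewind()
        val mask = Array(1) { Array(224) { FloatArray(224) } }
        interpreter.run(input, mask)

        // 3. Post-processing: upsample the mask, threshold, blur, and overlay.
        val maskMat = Mat(224, 224, CvType.CV_32F)
        for (row in 0 until 224) maskMat.put(row, 0, mask[0][row])
        val maskFull = Mat()
        Imgproc.resize(maskMat, maskFull, frameRgb.size())
        Imgproc.threshold(maskFull, maskFull, 0.5, 1.0, Imgproc.THRESH_BINARY)
        Imgproc.GaussianBlur(maskFull, maskFull, Size(9.0, 9.0), 0.0)

        val overlay = Mat()
        maskFull.convertTo(overlay, CvType.CV_8U, 255.0)
        Imgproc.cvtColor(overlay, overlay, Imgproc.COLOR_GRAY2RGB)
        val blended = Mat()
        Core.addWeighted(frameRgb, 0.7, overlay, 0.3, 0.0, blended)
        return blended
    }
}
```

In the actual application, the same conversions run on GPU-accelerated OpenCV and the interpreter is configured with a hardware delegate, which is where the runtime packaging discussed here comes in.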
On top of the CPU runtime and the GPU delegate, you then have the NPU-based delegates, which are specialized libraries provided so that you can tap into the NPU. You have options to package your model either with the application or, if it is large, to download it over the air. In the same way, the libraries can also be shipped with your application or downloaded over the air to reduce the size of your application. These are all really important considerations for providing the best possible user experience to your application's users.

Now let's see this particular application in action in a real demo. In this demo you will see a 30-frames-per-second real-time segmentation model that we trained and quantized here. We will showcase the compute unit utilization on the NPU and see the difference between the NPU and the CPU. This application is compatible with all Android phones: it runs on the NPUs of Qualcomm-powered devices released after 2019, and on the GPU on all other phones. Let's see this in action.

Let's load up the real-time segmentation demo. I'm going to start off by running this on the CPU, so you get a sense of how the CPU performs for this model. I'm going to click the start camera button, and you can see the real-time segmentation of Ismaeil here, with dark blue being the background and red being Ismaeil. You only get about one frame per second; it takes about 800 milliseconds to run the model on the CPU. It's extremely slow. Now I'm going to switch over to the neural processing unit and press the start camera button again. As you can see, things are much snappier. You can run this at 30 frames per second, and as Ismaeil moves around, the tracking is much more accurate: real-time segmentation running on the device, on the neural processor.

So that was exciting. You saw real-time segmentation running on the neural processing unit at 30 frames per second, extremely efficiently, and detecting people accurately. Going into the details of how this particular application was built is slightly outside the scope of this course, so I've included some accompanying text and all the code samples that you need to build this application and try it out on your own.

So in this lesson, you learned how to deploy the model that you had trained and quantized onto the device for real-time segmentation. You learned how to deploy this model inside a camera pipeline that involves pre-processing as well as post-processing. And finally, you saw a demo of this particular model in action.
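As a starting point before diving into those code samples, here is a hedged Kotlin sketch of the delegate fallback behavior described above: try an NPU path first, then the GPU delegate, then the plain CPU runtime. The standard NNAPI delegate is used here as a generic stand-in for the vendor-specific NPU delegate that the actual app packages; the class names and fallback strategy are illustrative, not the course's implementation.

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.File

// Illustrative fallback: NPU (via the generic NNAPI delegate) -> GPU delegate
// -> CPU-only runtime. The real app bundles a vendor-specific NPU delegate,
// which is not reproduced here.
fun createInterpreter(modelFile: File): Interpreter {
    val optionFactories: List<() -> Interpreter.Options> = listOf(
        { Interpreter.Options().addDelegate(NnApiDelegate()) }, // NPU path
        { Interpreter.Options().addDelegate(GpuDelegate()) },   // GPU delegate
        { Interpreter.Options() }                               // CPU fallback
    )
    for (makeOptions in optionFactories) {
        try {
            return Interpreter(modelFile, makeOptions())
        } catch (e: Exception) {
            // This delegate could not be set up on this device; try the next one.
        }
    }
    throw IllegalStateException("Unable to create a TFLite interpreter")
}
```

If a delegate cannot be created on a given device, creation simply falls through to the next option, mirroring the "NPU on supported Qualcomm devices, GPU on all other phones" behavior shown in the demo.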