

From CPU to NPU: The Secret to ~15x Faster AI on Intel’s Latest Chips

Intel’s newest chips now come with a Neural Processing Unit (NPU), built to handle AI and machine learning tasks more efficiently than a regular CPU. Instead of struggling with AI workloads on the CPU, the NPU is designed to run them faster and with less power, which frees up the CPU for other general tasks. But I wanted to know how much better the NPU actually runs a model compared to the CPU. Based on my test, it’s roughly a 15x performance boost, which is great.

If you’re looking to buy an edge device with an NPU, I can recommend the Khadas Mind 2 Mini PC, as it’s really small and packs a lot of power, plus it has a small battery that serves as a UPS, so you can just move it around from one USB power supply to another without it losing power. It’s quite nice. OK, now let’s see how I got to that number from the title.

In real-time computer vision, throughput and latency are two fundamental performance metrics that impact the efficiency and responsiveness of a system. Throughput refers to the number of frames processed per second (FPS), determining how much data the system can handle over time. This is basically what you’re asking about when you wonder how long it will take to process a video. Latency, on the other hand, is the time it takes to process a single frame from input to output, affecting how quickly the system responds to new data. Low latency is crucial for real-time applications like augmented reality and autonomous driving. When you play around with a system and it feels “laggy”, it’s because it has high latency. You want to keep your latency low and your throughput high.
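To make those two numbers concrete, here is a minimal sketch of how you could measure both around any inference call. The run_inference and frames names are just placeholders for whatever your own pipeline uses:

import time

def measure(run_inference, frames):
    latencies = []
    start = time.perf_counter()
    for frame in frames:
        t0 = time.perf_counter()
        run_inference(frame)  # process a single frame
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    print(f"Average latency: {1000 * sum(latencies) / len(latencies):.2f} ms")
    print(f"Throughput: {len(frames) / total:.2f} FPS")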

I’m going to assume you already installed OpenVINO on your system, and that you have an Intel chip with an NPU in it. You can quickly check if those two things are true by running this short Python snippet:

import openvino as ov

core = ov.Core()
print(core.available_devices)

You should see something like ['CPU', 'GPU', 'NPU'] printed. Those are the devices available to OpenVINO. If you don’t see your device, make sure you installed the drivers correctly and troubleshoot that before continuing.
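If you are curious which physical device each of those entries maps to, you can also ask OpenVINO for the full device name. This is purely informational and not needed for the rest of the post:

import openvino as ov

core = ov.Core()
for device in core.available_devices:
    # prints a human-readable name for each available device
    print(device, core.get_property(device, "FULL_DEVICE_NAME"))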

Next, we need a model. I’ll be using ResNet-50, one of the most well-known Convolutional Neural Network architectures, introduced in Microsoft’s 2015 paper “Deep Residual Learning for Image Recognition”. It was trained on ImageNet-1K at a resolution of 224×224, meaning you can feed it an image of that size and it will predict the probabilities for 1,000 different object categories. You can get the object names here.

Luckily, ResNet-50, optimised for OpenVINO, is available for download here. Just grab those two files, resnet50_fp16.xml and resnet50_fp16.bin, and place them in your working folder. If you want to try another model, you can do that too; just make sure to run the OpenVINO optimiser on it to get the best performance (there’s a small sketch of that right after the install command below). I’m also going to use OpenCV for image loading and resizing, so let’s install it first, and make sure numpy is there as well:

pip install opencv-python numpy
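If you do decide to bring your own model instead of the pre-converted download, the optimisation step I mentioned can be done from Python. Here is a rough sketch, assuming you also have torch and torchvision installed, that converts torchvision’s ResNet-50 into an FP16 OpenVINO IR roughly equivalent to the downloaded pair:

import torch
import torchvision
import openvino as ov

# Load a pretrained ResNet-50 from torchvision as the source model
model_pt = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model_pt.eval()

# Convert to an OpenVINO model, using a dummy input to trace the shapes
ov_model = ov.convert_model(model_pt, example_input=torch.randn(1, 3, 224, 224))

# Save the IR; compress_to_fp16 produces the *_fp16.xml/.bin pair used below
ov.save_model(ov_model, "resnet50_fp16.xml", compress_to_fp16=True)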

Now let’s classify an image with this model. Write this into a file and save it as classify.py:

import openvino as ov
import numpy as np
import cv2

def classify_image():
    # Step 1: Load OpenVINO model
    core = ov.Core()
    model = core.read_model("resnet50_fp16.xml")
    compiled_model = core.compile_model(model, "CPU")  # Use "NPU" if available

    # Step 2: Get input tensor details
    input_layer = compiled_model.input(0)
    input_shape = input_layer.shape  # Should be (1, 3, 224, 224)

    # Step 3: Load and preprocess image
    image = cv2.imread("input.jpg")
    image = cv2.resize(image, (224, 224))  # Resize to match model input
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB (OpenCV loads as BGR)
    image = image.astype(np.float32) / 255.0  # Normalise to [0,1]
    image = np.transpose(image, (2, 0, 1))  # HWC to CHW
    image = np.expand_dims(image, axis=0)  # Add batch dimension

    # Step 4: Run the inference
    output = compiled_model(image)[compiled_model.output(0)]

    # Step 5: Process the results
    top_class = np.argmax(output)  # Get class index

    # Load ImageNet labels (remember to download the file)
    imagenet_labels = np.array([line.strip() for line in open("imagenet_classes.txt").readlines()])

    # Display result
    print(f"Predicted Class: {imagenet_labels[top_class]}")

if __name__ == "__main__":
    classify_image()
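One optional tweak: instead of hard-coding "CPU" in step 1, you can prefer the NPU when it shows up in available_devices and fall back to the CPU otherwise (passing "AUTO" as the device and letting OpenVINO decide is another option). A minimal version of that check looks like this:

core = ov.Core()
model = core.read_model("resnet50_fp16.xml")
device = "NPU" if "NPU" in core.available_devices else "CPU"  # fall back if no NPU
compiled_model = core.compile_model(model, device)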

Make sure that in the same folder you have these files: classify.py, imagenet_classes.txt, resnet50_fp16.xml, and resnet50_fp16.bin. Then drop any image in there and rename it to input.jpg. Now simply run the script:

python classify.py

You should see the correct predicted class printed to the terminal.

Now that we know that the model actually works correctly with OpenVINO, we can benchmark it with a convenient tool that comes with it. It’s called benchmark_app, and it allows you to quickly check the performance of your devices with different models. You can call it like this:

benchmark_app -m MODEL -d DEVICE -hint HINT

For this benchmark, I ran these four commands:

benchmark_app -m "resnet50_fp16.xml" -d CPU -hint latency
benchmark_app -m "resnet50_fp16.xml" -d CPU -hint throughput
benchmark_app -m "resnet50_fp16.xml" -d NPU -hint latency
benchmark_app -m "resnet50_fp16.xml" -d NPU -hint throughput

These are the results:

Device  Hint        Median latency (ms)  Average latency (ms)  Min latency (ms)  Max latency (ms)  Throughput (FPS)
CPU     Latency     25.31                24.73                 18.38             47.16             40.32
CPU     Throughput  47.38                63.65                 38.52             135.35            62.69
NPU     Latency     1.68                 1.7                   1.52              8.3               569.71
NPU     Throughput  9.15                 8.4                   3.21              86.6              936.05

Key Takeaways:

NPU (Latency mode) achieves 1.70ms average latency compared to 24.73ms on CPU (~15x improvement)
NPU (Throughput mode) reaches 936.05 FPS, which is ~15x higher than CPU throughput mode (62.69 FPS)

This confirms that Intel’s NPU significantly outperforms the CPU in both latency and throughput, with a roughly 15x performance boost for this particular model.
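If you want to apply the same latency/throughput hints inside your own Python code rather than via benchmark_app, the compile step accepts a performance-hint property. A small sketch, using the same model files as above:

import openvino as ov

core = ov.Core()
model = core.read_model("resnet50_fp16.xml")

# "LATENCY" favours single-frame response time, "THROUGHPUT" favours total FPS
compiled_model = core.compile_model(model, "NPU", {"PERFORMANCE_HINT": "LATENCY"})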

Posted in AI, Computer Vision, Open Source, OpenVINO.



How to run Llama 3.2 locally on Mac and serve it to a local Linux laptop to use with Zed

UPDATE: I wrote this post for Llama 3.1, but just after I published it, Llama 3.2 was released. It comes with similar output quality but faster inference, as it’s a distilled model (~2.6 times faster than Llama 3.1 in a quick test I did). I updated this post to use 3.2, but it should work with any other version as well.

Zed is a great editor that supports AI assistants. In this post I will explain how you can share one Llama model running on a Mac with other computers in your local network, for privacy and cost efficiency. Also, the fans might get loud if you run Llama directly on the same laptop you are using Zed on.

Since I’ve found that Apple silicon (M1, M2, etc.) is quite good at running these models, I will assume the model will run on that computer. The default llama3.2:3b works fine on a Mac Mini M1 with 16GB. If you have 8GB you might need to use smaller models.

The first step is to install Ollama on your Mac. Just follow the instructions on the website. You will end up with a little llama icon in the menu bar (top right). We now need to install a model, Llama 3.2 in particular. In the terminal, run this:

ollama run llama3.2

This will try to run llama3.2, and since it is not yet installed, it will fetch the latest model for you. After it downloads and runs the model, simply type something to test that it all works. It should look like this:

>>> Tell me a joke
Here's one:

What do you call a fake noodle?

An impasta.

>>> Send a message (/? for help)

Now that it works locally, we want to make it available to other computers on the local network. Open the terminal and run this command:

launchctl setenv OLLAMA_HOST 0.0.0.0:11434
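You can double-check that the variable took effect before restarting Ollama:

launchctl getenv OLLAMA_HOST

It should print 0.0.0.0:11434 back at you.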

Now click on the llama icon in the menu bar and quit Ollama, then start it again. It should now be ready to accept connections from other computers on your network. To check connectivity, go to a Linux computer on your network and open a terminal. Run the following:

curl http://your_mac.local:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a joke",
  "options": {
    "num_ctx": 4096
  }
}'
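If you prefer testing from Python instead of curl, a rough equivalent looks like this (assuming the requests package is installed; setting "stream" to false makes Ollama return a single JSON object instead of a stream of chunks):

import requests

resp = requests.post(
    "http://your_mac.local:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Tell me a joke", "stream": False},
)
print(resp.json()["response"])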

Make sure you see a correct response before continuing. Now let’s configure Zed to use it. Open the settings (CTRL-SHIFT-P and type “open settings”). Add these settings there, on top of any others you already have (it’s a JSON file):

{
  "language_models": {
    "ollama": {
      "api_url": "http://your_mac.local:11434",
      "low_speed_timeout_in_seconds": 120,
      "keep_alive": "120s",
      "available_models": [
        {
          "provider": "ollama",
          "name": "llama3.2:latest",
          "max_tokens": 16384
        }
      ]
    }
  },
  "assistant": {
    "default_model": {
      "provider": "ollama",
      "model": "llama3.2:latest"
    },
    "version": "2"
  }
}

It should now be configured. You can go to the Assistant Panel (CTRL+?) and ask whatever you want there. You can also add context with /tab and other commands.

That’s it: now you have a shared Llama 3.2 model running on one computer while using it from another computer on your network, privately and for free, integrated into a great text editor.

Posted in AI, Open Source, Programming, Zed.
