

Photogrammetry: From Images to a 3D Mesh

Photogrammetry is the art and science of reconstructing 3D geometry from 2D images. In this post, I’ll walk you through the full pipeline, from a simple set of photos to a detailed 3D mesh.

Pipeline Overview:

  • Input Images: Overlapping photos of a scene or object
  • Feature Extraction & Matching: Detect and match keypoints between views
  • Sparse Reconstruction: Estimate camera poses and triangulate a sparse 3D point cloud using Structure-from-Motion
  • Dense Reconstruction: Build a high-resolution point cloud using multi-view stereo
  • Normals Estimation: Compute surface normals from the dense cloud
  • Mesh Generation: Construct the final 3D mesh

Input Images

To start, you need multiple images of your subject taken from different viewpoints. It’s important to move around the object rather than just rotating the camera. Translation is key for recovering depth information; simply rotating the camera will only produce a panorama, not a 3D model. Also, ensure there’s significant overlap between images so the software can find enough matching features. For this example, I used the Gerrard Hall dataset from COLMAP, which contains 100 images of a building taken with the same camera.

Feature Extraction & Matching

To align images, we first extract keypoints (distinctive features in each photo) using algorithms like SIFT. Then, we match these features between image pairs using a matcher such as FLANN. To reduce false matches, we apply Lowe’s ratio test, which keeps only the matches where the closest descriptor is significantly better than the second-best (typically with a ratio threshold of 0.75). Next, we apply RANSAC to fit a two-view geometric model, typically the fundamental matrix (or a homography for planar scenes), and keep only the matches consistent with it, called inliers. This combined filtering ensures that only accurate, reliable matches are used for 3D reconstruction.
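
Here’s a minimal sketch of that matching stage with OpenCV, assuming two overlapping photos saved as view_01.jpg and view_02.jpg (the filenames and the RANSAC threshold are placeholders, not the exact settings used for this reconstruction):

import cv2
import numpy as np

# Load two overlapping photos in greyscale
img1 = cv2.imread("view_01.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_02.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute descriptors
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# FLANN matcher (KD-tree index, standard settings for SIFT), two nearest neighbours per feature
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = flann.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only if it is clearly better than the runner-up
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# RANSAC on the fundamental matrix keeps only geometrically consistent inliers
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
print(f"{int(inlier_mask.sum())} inliers out of {len(good)} ratio-test matches")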

Sparse Reconstruction

Once features are matched, we estimate the camera poses using Structure-from-Motion (SfM). This process incrementally recovers the position and orientation of each camera. With these poses and the matched features, we triangulate 3D points to build a sparse point cloud, which is a rough representation of the scene’s geometry.
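
To make the triangulation step concrete, here’s a hedged two-view sketch with OpenCV. It uses the inlier correspondences from the matching step and assumes a known 3x3 intrinsics matrix K; in practice a tool like COLMAP estimates the intrinsics itself, chains this over all views incrementally, and refines everything with bundle adjustment:

import cv2
import numpy as np

def two_view_reconstruction(pts1, pts2, K):
    # Essential matrix from calibrated correspondences, estimated robustly with RANSAC
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)

    # Recover the relative rotation R and translation t of camera 2 w.r.t. camera 1
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Projection matrices: camera 1 at the origin, camera 2 at [R | t]
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Triangulate the matches into homogeneous 3D points and dehomogenise
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (pts4d[:3] / pts4d[3]).T  # (N, 3) sparse point cloud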

Dense Reconstruction

Sparse reconstruction gives us a structural backbone, but to capture finer detail, we need dense data. Using techniques like multi-view stereo (MVS) or depth map fusion, we generate a dense point cloud by estimating depth for many or all image pixels. This dense representation reveals intricate surface features and prepares the data for meshing.
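
If you’re following along with COLMAP, its dense stage can be driven from Python roughly like this. It’s a sketch that assumes the colmap binary is on your PATH, a sparse model already sits in sparse/0, and you have a CUDA-capable GPU for PatchMatch stereo:

import subprocess

def dense_reconstruction(image_dir="images", sparse_dir="sparse/0", dense_dir="dense"):
    # Undistort the input images into the dense workspace
    subprocess.run(["colmap", "image_undistorter",
                    "--image_path", image_dir,
                    "--input_path", sparse_dir,
                    "--output_path", dense_dir], check=True)
    # Estimate a depth map per image with PatchMatch stereo
    subprocess.run(["colmap", "patch_match_stereo",
                    "--workspace_path", dense_dir], check=True)
    # Fuse the per-image depth maps into a single dense point cloud
    subprocess.run(["colmap", "stereo_fusion",
                    "--workspace_path", dense_dir,
                    "--output_path", f"{dense_dir}/fused.ply"], check=True)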

Normals Estimation

With a dense point cloud available, we can estimate surface normals, which are vectors that indicate the direction each surface is facing. Normals are typically computed by analysing a point’s local neighborhood using methods like PCA, or derived directly from mesh geometry. They are essential for realistic rendering, lighting, and further geometric processing.
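
Here’s what normal estimation looks like with Open3D (my choice for this sketch, not necessarily what your MVS tool uses). It fits a plane to each point’s local neighbourhood via PCA; the radius and neighbour counts are values you’ll want to tune for your scene’s scale:

import open3d as o3d

# Load the fused dense cloud produced by the dense reconstruction step
pcd = o3d.io.read_point_cloud("dense/fused.ply")

# Estimate one normal per point from its local neighbourhood (PCA on nearby points)
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))

# Flip normals so that neighbouring points agree on which side is "outside"
pcd.orient_normals_consistent_tangent_plane(k=30)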

Mesh Generation

From the dense point cloud and normals, we create a continuous 3D surface. Algorithms such as Poisson surface reconstruction, Delaunay triangulation, or ball-pivoting connect nearby points into a mesh of triangles. The resulting 3D mesh accurately models the object’s shape and can be used in a variety of applications including visualisation, simulation, 3D printing, and CAD.
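
And a short Poisson reconstruction sketch, again with Open3D; the depth parameter and the density-based trimming quantile are illustrative values, not the exact ones used for the model shown here:

import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("dense/fused.ply")
pcd.estimate_normals()  # Poisson needs per-point normals (see the previous step)

# Poisson surface reconstruction; higher depth gives finer detail but uses more memory
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=10)

# Remove poorly supported vertices (low Poisson density), then save the mesh
densities = np.asarray(densities)
mesh.remove_vertices_by_mask(densities < np.quantile(densities, 0.05))
o3d.io.write_triangle_mesh("mesh.ply", mesh)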

At this point, you have a full 3D mesh reconstructed from your photos. You can now refine it further by cleaning up noise, filling holes, simplifying the mesh, or even texturing it using the original images. The final model can be exported for rendering, editing, or integration into other workflows.

And that’s it! Starting from a simple photo set, you can produce a complete and accurate 3D model of a real-world object or scene.

Posted in Computer Vision, Photography.



From CPU to NPU: The Secret to ~15x Faster AI on Intel’s Latest Chips

Intel’s newest chips now come with a Neural Processing Unit (NPU), built to handle AI and machine learning tasks more efficiently than a regular CPU. Instead of struggling with AI workloads on the CPU, the NPU is designed to run them faster and with less power. This is great because it frees up the CPU for other general tasks, but I wanted to know how much faster the NPU can run a model compared to the CPU. Based on my test, it’s roughly a 15x performance boost, which is great.

If you’re looking to buy an edge device with an NPU, I can recommend the Khadas Mind 2 Mini PC, as it’s really small and packs a lot of power, plus it has a small battery that serves as a UPS, so you can just move it around from one USB power supply to another without it losing power. It’s quite nice. OK, now let’s see how I got to that number from the title.

In real-time computer vision, throughput and latency are two fundamental performance metrics that impact the efficiency and responsiveness of a system. Throughput refers to the number of frames processed per second (FPS), determining how much data the system can handle over time. This is basically what you’re referring to when you ask “how long it takes to process this video”. Latency, on the other hand, is the time it takes to process a single frame from input to output, affecting how quickly the system responds to new data. Low latency is crucial for real-time applications like augmented reality and autonomous driving. When you play around with a system and it feels “laggy”, it’s because it has high latency. You want to keep your latency low and your throughput high.
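
To make the difference concrete, here’s a tiny back-of-the-envelope sketch (the numbers are made up for illustration):

# Single stream: throughput is limited by per-frame latency
latency_ms = 25.0
print(1000 / latency_ms)             # ~40 FPS

# Running several inference streams in parallel raises throughput,
# but each individual frame still takes ~25 ms to come back
streams = 4
print(streams * 1000 / latency_ms)   # ~160 FPS, same per-frame latency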

I’m going to assume you already installed OpenVINO on your system, and that you have an Intel chip with an NPU in it. You can quickly check both by running this short Python snippet:

import openvino as ov

core = ov.Core()
print(core.available_devices)

You should see something like ['CPU', 'GPU', 'NPU'] in the output. Those are the devices available to OpenVINO. If you don’t see your device, make sure you installed the drivers correctly and troubleshoot that before continuing.

Next, we need a model. I’ll be using ResNet-50, one of the most well-known Convolutional Neural Network architectures, introduced in Microsoft’s 2015 paper “Deep Residual Learning for Image Recognition”. It was trained on ImageNet-1K at a resolution of 224×224, meaning you can feed it an image of that size, and it will predict the probabilities for 1,000 different object categories. You can get the object names here.

Luckily, ResNet-50, optimised for OpenVINO, is available for download here. Just grab the two files, resnet50_fp16.xml and resnet50_fp16.bin, and place them in your working folder. If you want to try another model, you can do that too; just make sure to run the OpenVINO optimiser on it to get the best performance. I’m also going to use OpenCV for image loading and resizing, so let’s install it first, and make sure numpy is there as well:

pip install opencv-python numpy

Now let’s classify an image with this model. Write this into a file and save it as classify.py:

import openvino as ov
import numpy as np
import cv2

def classify_image():
    # Step 1: Load OpenVINO model
    core = ov.Core()
    model = core.read_model("resnet50_fp16.xml")
    compiled_model = core.compile_model(model, "CPU")  # Use "NPU" if available

    # Step 2: Get input tensor details
    input_layer = compiled_model.input(0)
    input_shape = input_layer.shape  # Should be (1, 3, 224, 224)

    # Step 3: Load and preprocess image
    image = cv2.imread("input.jpg")
    image = cv2.resize(image, (224, 224))  # Resize to match model input
    image = image[:, :, ::-1]  # Convert BGR to RGB (OpenCV loads as BGR)
    image = image.astype(np.float32) / 255.0  # Normalise to [0,1]
    image = np.transpose(image, (2, 0, 1))  # HWC to CHW
    image = np.expand_dims(image, axis=0)  # Add batch dimension

    # Step 4: Run the inference
    output = compiled_model(image)[compiled_model.output(0)]

    # Step 5: Process the results
    top_class = np.argmax(output)  # Get class index

    # Load ImageNet labels (remember to download the file)
    imagenet_labels = np.array([line.strip() for line in open("imagenet_classes.txt").readlines()])

    # Display result
    print(f"Predicted Class: {imagenet_labels[top_class]}")

if __name__ == "__main__":
    classify_image()

Make sure that in the same folder you have these files: classify.py, imagenet_classes.txt, resnet50_fp16.xml, and resnet50_fp16.bin. Then add any image to the folder and rename it to input.jpg. Now simply call the script:

python classify.py

You should get the correct predicted class printed to the console.

Now that we know the model actually works correctly with OpenVINO, we can benchmark it with a convenient tool that ships with OpenVINO. It’s called benchmark_app, and it allows you to quickly check the performance of your devices with different models. You can call it like this:

benchmark_app -m MODEL -d DEVICE -hint HINT

For this benchmark, I ran these four commands:

benchmark_app -m "resnet50_fp16.xml" -d CPU -hint latency
benchmark_app -m "resnet50_fp16.xml" -d CPU -hint throughput
benchmark_app -m "resnet50_fp16.xml" -d NPU -hint latency
benchmark_app -m "resnet50_fp16.xml" -d NPU -hint throughput

These are the results:

Device   Hint         Median Latency (ms)   Average Latency (ms)   Min Latency (ms)   Max Latency (ms)   Throughput (FPS)
CPU      Latency      25.31                 24.73                  18.38              47.16              40.32
CPU      Throughput   47.38                 63.65                  38.52              135.35             62.69
NPU      Latency      1.68                  1.70                   1.52               8.30               569.71
NPU      Throughput   9.15                  8.40                   3.21               86.60              936.05

Key Takeaways:

  • NPU (latency mode) achieves 1.70 ms average latency compared to 24.73 ms on the CPU (~15x improvement)
  • NPU (throughput mode) reaches 936.05 FPS, which is ~15x higher than CPU throughput mode (62.69 FPS)

This confirms that Intel’s NPU significantly outperforms the CPU in both latency and throughput, with a roughly 15x performance boost for this particular model.

Posted in AI, Computer Vision, Open Source, OpenVINO.
