Easy Landmark Image Recognition with TensorFlow Hub DELF Module



Have you ever wondered how Google image search works behind the scenes? I will show you how to build a mini version of a landmark image recognition pipeline that leverages TensorFlow Hub's DELF (DEep Local Feature) module with minimal configuration.

Feel free to explore the Colab notebook while reading.

Intro to Image Recognition/Retrieval

Image retrieval is the task of searching for digital images in large databases. It comes in two types: text-based and content-based. In text-based image retrieval, we type a query (one or more relevant words) into the search field and get images as results. In content-based image retrieval, we provide a relevant sample image in the search field and get similar images as results.


This post focuses on content-based image retrieval, where images are automatically annotated with their visual content by a feature extraction process. The visual content includes colors, shapes, textures, or any other information that can be derived from the image itself. The extracted features representing visual content are indexed with high-dimensional indexing techniques to enable large-scale image retrieval.

If you have read my previous post on "building a travel recommendation engine", that is essentially an image retrieval model that depends on extracted global image features and therefore has difficulty dealing with partial visibility and extraneous image features. A more robust alternative is a local feature-based image retrieval system, which can handle background clutter, partial occlusion, multiple landmarks, objects at variable scales, and so on.

Take these two images of the horseshoe in the Grand Canyon as an example; they differ in lighting and scale.


What is the DELF (DEep Local Feature) module?

The pre-trained DELF (DEep Local Feature) module, available on TensorFlow Hub, can be used for image retrieval as a drop-in replacement for other keypoint detectors and descriptors. It describes each noteworthy point in a given image with a 40-dimensional vector known as a feature descriptor.

The image below shows the DELF correspondences of two images.


DELF was trained on the Google-Landmarks dataset, which contains 1,060,709 images from 12,894 landmarks and 111,036 additional query images, and is optimized for landmark recognition.

The DELF image retrieval system can be decomposed into four main blocks:

  1. Dense localized feature extraction,
  2. Keypoint selection,
  3. Dimensionality reduction,
  4. Indexing and retrieval.

The first three blocks are wrapped into the TensorFlow Hub DELF module. Even so, it's still interesting to crack open the black box and look inside.

The dense localized feature extraction block is formed from the feature-extraction layers of a ResNet50 CNN trained with a classification loss. The resulting feature maps are treated as a dense grid of local descriptors.

Features are localized based on their receptive fields, which can be computed from the configuration of the convolutional and pooling layers of the fully convolutional network (FCN). The authors use the pixel coordinates of the center of the receptive field as the feature location. The image below shows the aggregated unique locations for 256x256 resolution input images.
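
To make the receptive-field idea concrete, here is a minimal sketch that computes the receptive-field size, the stride between neighboring features, and the center of the first feature for a stack of layers. The layer list is a made-up toy configuration, not the actual ResNet50 layout used by DELF, and the function name is my own.

```python
def receptive_field(layers):
    """Compute receptive-field geometry for a stack of conv/pool layers.

    layers: list of (kernel, stride, padding) tuples, in forward order.
    Returns (rf_size, jump, start), where `jump` is the distance in input
    pixels between neighboring features and `start` is the input-pixel
    coordinate of the center of the first feature.
    """
    rf_size, jump, start = 1, 1, 0.0
    for kernel, stride, padding in layers:
        # Each update uses the jump of the layer's *input*.
        start += ((kernel - 1) / 2 - padding) * jump
        rf_size += (kernel - 1) * jump
        jump *= stride
    return rf_size, jump, start


# Toy network (an assumption for illustration only):
# 7x7 conv stride 2, 3x3 max-pool stride 2, 3x3 conv stride 1.
toy_layers = [(7, 2, 3), (3, 2, 1), (3, 1, 1)]
rf_size, jump, start = receptive_field(toy_layers)

# Centers of the first four features along one axis, in input pixels.
centers = [start + i * jump for i in range(4)]
print(rf_size, jump, centers)  # 19 4 [0.0, 4.0, 8.0, 12.0]
```

Each feature in this toy network sees a 19x19 patch of the input, and neighboring features are 4 pixels apart. The same recurrence applies to a deeper network; only the layer list changes.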


The authors employed a two-step training strategy: first fine-tuning the original ResNet50 layers to make the local descriptors more discriminative, then training the attention score function to assess the relevance of the features extracted by the model.
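
One way to picture how attention scores drive keypoint selection is to rank features by score and keep the best ones. The sketch below is a simplified NumPy illustration on random stand-in data, not the module's internal code. The function name `select_keypoints` is hypothetical, and its two parameters only mirror the `score_threshold` and `max_feature_num` inputs that the module accepts.

```python
import numpy as np


def select_keypoints(locations, descriptors, scores,
                     score_threshold, max_feature_num):
    """Keep features scoring above the threshold, best first."""
    # Strict ">" is an assumption; the module may treat ties differently.
    keep = np.flatnonzero(scores > score_threshold)
    # Sort the survivors by descending score and truncate.
    order = keep[np.argsort(-scores[keep])][:max_feature_num]
    return locations[order], descriptors[order], scores[order]


# Random stand-in data: 500 candidate features with 40-d descriptors.
rng = np.random.default_rng(0)
n = 500
locations = rng.uniform(0, 256, size=(n, 2))
descriptors = rng.normal(size=(n, 40))
scores = rng.uniform(0, 200, size=n)

loc, desc, sc = select_keypoints(
    locations, descriptors, scores, score_threshold=100.0, max_feature_num=50)
print(loc.shape, desc.shape)
```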


In the dimensionality reduction block, the feature dimension is reduced to 40 by PCA, a trade-off between compactness and discriminative power.
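
For readers who want to see what such a reduction looks like, here is a minimal PCA sketch in NumPy. It runs on random stand-in features and is not the module's implementation. The 1024-dimensional input size and the L2 normalization before and after the projection follow my reading of the DELF paper, so treat both as assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def pca_fit(features, n_components):
    """Return the feature mean and the top principal directions."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(features, mean, components):
    """Project features onto the principal directions."""
    return (features - mean) @ components.T


# Random stand-in for dense local features (200 features, 1024-d assumed).
rng = np.random.default_rng(0)
features = l2_normalize(rng.normal(size=(200, 1024)))

mean, components = pca_fit(features, n_components=40)
reduced = l2_normalize(pca_transform(features, mean, components))
print(reduced.shape)  # (200, 40)
```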
As for indexing and retrieval, we will build it in the next section.

Building the image recognition pipeline

For demonstration purposes, we will create a landmark image recognition system that takes one image as input and tells whether it matches one of 50 indexed world-famous buildings.

Indexing database images

We start by extracting feature descriptors from the 50 database images of landmark buildings and aggregating their descriptors and locations. In our case, there are 9,953 aggregated descriptor-location pairs in total.


Our image retrieval system is based on nearest neighbor search. To facilitate this, we build a KD-tree from all the aggregated descriptors.

The indexing is carried out offline and the index is built only once, unless we want to index more database images in the future. Note that the code snippet below also creates a lookup array of index boundaries, which lets us map an aggregated descriptor index back to its database image index.


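from itertools import accumulate

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, hub.Module)
import tensorflow_hub as hub
from scipy.spatial import cKDTree

m = hub.Module('https://tfhub.dev/google/delf/1')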

# The module operates on a single image at a time, so define a placeholder to
# feed an arbitrary image in.
image_placeholder = tf.placeholder(
    tf.float32, shape=(None, None, 3), name='input_image')

module_inputs = {
    'image': image_placeholder,
    'score_threshold': 100.0,
    'image_scales': [0.25, 0.3536, 0.5, 0.7071, 1.0, 1.4142, 2.0],
    'max_feature_num': 1000,
}

module_outputs = m(module_inputs, as_dict=True)

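# image_input_fn is a helper defined in the notebook (not shown here); each
# sess.run(image_tf) call below yields the next decoded database image.
image_tf = image_input_fn(db_images)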

with tf.train.MonitoredSession() as sess:
    results_dict = {}  # Stores the locations and their descriptors for each image
    for image_path in db_images:
        image = sess.run(image_tf)
        print('Extracting locations and descriptors from %s' % image_path)
        results_dict[image_path] = sess.run(
            [module_outputs['locations'], module_outputs['descriptors']],
            feed_dict={image_placeholder: image})

locations_agg = np.concatenate([results_dict[img][0] for img in db_images])
descriptors_agg = np.concatenate([results_dict[img][1] for img in db_images])
accumulated_indexes_boundaries = list(accumulate([results_dict[img][0].shape[0] for img in db_images]))

d_tree = cKDTree(descriptors_agg) # build the KD tree
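
To see how the boundaries array maps an aggregated descriptor index back to a database image, here is a small self-contained sketch with made-up descriptor counts. The helper name `descriptor_to_image_index` is my own, and using `np.searchsorted` for the lookup is one possible approach, not necessarily what the notebook does.

```python
from itertools import accumulate

import numpy as np

# Made-up example: three database images with 3, 2 and 4 descriptors.
descriptor_counts = [3, 2, 4]
boundaries = list(accumulate(descriptor_counts))  # [3, 5, 9]


def descriptor_to_image_index(descriptor_index, boundaries):
    """Return the index of the image that owns an aggregated descriptor."""
    # side='right' sends index 3 (the first descriptor of image 1) to 1.
    return int(np.searchsorted(boundaries, descriptor_index, side='right'))


print([descriptor_to_image_index(i, boundaries) for i in range(9)])
# [0, 0, 0, 1, 1, 2, 2, 2, 2]
```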

Query image at runtime

At runtime, the query image is first resized and cropped to 256x256 resolution, and then the DELF module computes its descriptors and locations. We then query the KD-tree to find the K nearest neighbors of each descriptor of the query image and aggregate all the matches per database image. Finally, we perform geometric verification using RANSAC and use the number of inliers as the score for each retrieved image.
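
The nearest neighbor search and per-image aggregation can be sketched on synthetic data as follows. The descriptors here are random stand-ins, and the values of `K` and `distance_threshold` are assumptions chosen for illustration rather than the settings used in the notebook.

```python
from itertools import accumulate

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Synthetic database: three images with 30, 40 and 50 descriptors (40-d).
counts = [30, 40, 50]
db_descriptors = rng.normal(size=(sum(counts), 40))
boundaries = np.array(list(accumulate(counts)))  # [30, 70, 120]
tree = cKDTree(db_descriptors)

# Fake query: 20 descriptors of image 1 (rows 30..49) with slight noise.
query = db_descriptors[30:50] + 0.01 * rng.normal(size=(20, 40))

K = 5                     # neighbors per query descriptor (assumed)
distance_threshold = 0.8  # maximum match distance (assumed)
distances, indices = tree.query(
    query, k=K, distance_upper_bound=distance_threshold)

# Missing neighbors are reported with index == tree.n and distance == inf.
valid = indices < tree.n
image_ids = np.searchsorted(boundaries, indices[valid], side='right')

# Number of tentative matches per database image.
votes = np.bincount(image_ids, minlength=len(counts))
print(votes)
```

Every database image with at least one vote becomes a tentative candidate that is passed on to geometric verification.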

The following graph illustrates the querying pipeline.


One thing worth mentioning is why we apply RANSAC for geometric verification. We want all matches to be consistent with a single global geometric transformation, but many of the tentative matches are incorrect. Take the following graph as an example: without geometric verification there are many inconsistent matches, whereas RANSAC lets us estimate the geometric transformation and the set of consistent matches simultaneously.
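
To show the idea in isolation, here is a stripped-down RANSAC loop for a 2D affine transform, written in plain NumPy on synthetic correspondences. It is a simplified stand-in for `skimage.measure.ransac`, not a replacement for it. The function names, threshold, and trial count are my own choices for illustration.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src to dst (3x2 matrix)."""
    src_h = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return params


def ransac_affine(src, dst, threshold=1.0, max_trials=200, seed=0):
    """Return a boolean inlier mask for the best affine model found."""
    rng = np.random.default_rng(seed)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(max_trials):
        # Three correspondences determine an affine transform.
        sample = rng.choice(len(src), size=3, replace=False)
        params = fit_affine(src[sample], dst[sample])
        residuals = np.linalg.norm(src_h @ params - dst, axis=1)
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers


# Synthetic matches: 30 follow one affine transform, 20 are random outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 256, size=(50, 2))
true_params = np.array([[0.9, 0.2], [-0.2, 0.9], [10.0, -5.0]])
dst = np.hstack([src, np.ones((50, 1))]) @ true_params
dst[30:] = rng.uniform(0, 256, size=(20, 2))

inliers = ransac_affine(src, dst)
print(inliers.sum(), inliers[:30].all())
```

The matches that agree with the dominant transformation survive, and the inlier count doubles as a similarity score between the two images.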


In our query demo, the K nearest neighbor search yielded a total of 23 tentative database images; after applying RANSAC to each tentative image against the query image, only 13 candidates are left.

The following code snippet performs the geometric verification using RANSAC as well as visualization.

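import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
from skimage.feature import plot_matches
from skimage.measure import ransac
from skimage.transform import AffineTransform

# Array to keep track of all candidates in database.
inliers_counts = []
# Read the resized query image for plotting.
img_1 = mpimg.imread(resized_image)
for index in unique_image_indexes:
    locations_2_use_query, locations_2_use_db = get_locations_2_use(
        index, indices, accumulated_indexes_boundaries)
    # Perform geometric verification using RANSAC.
    # NOTE: the model class and the three keyword arguments below are missing
    # from the snippet; these values follow the TensorFlow Hub DELF tutorial
    # and are an assumption.
    _, inliers = ransac(
        (locations_2_use_db, locations_2_use_query),  # source and destination coordinates
        AffineTransform,
        min_samples=3,
        residual_threshold=20,
        max_trials=1000)
    # If no inlier is found for a database candidate image, we continue on to the next one.
    if inliers is None or len(inliers) == 0:
        continue
    # Use the number of inliers as the score for retrieved images.
    inliers_counts.append({"index": index, "inliers": sum(inliers)})
    print('Found inliers for image {} -> {}'.format(index, sum(inliers)))
    # Visualize correspondences (img_1 is the query image, so its keypoints come first).
    _, ax = plt.subplots()
    img_2 = mpimg.imread(db_images[index])
    inlier_idxs = np.nonzero(inliers)[0]
    plot_matches(
        ax, img_1, img_2, locations_2_use_query, locations_2_use_db,
        np.column_stack((inlier_idxs, inlier_idxs)),
        matches_color='b')
    ax.set_title('DELF correspondences')
    plt.show()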

The pipeline finally generates the inlier count score for each database candidate by its index.

[{'index': 17, 'inliers': 29},
 {'index': 4, 'inliers': 11},
 {'index': 10, 'inliers': 10},
 {'index': 36, 'inliers': 10},
 {'index': 22, 'inliers': 8},
 {'index': 12, 'inliers': 7},
 {'index': 21, 'inliers': 7},
 {'index': 34, 'inliers': 6},
 {'index': 7, 'inliers': 5},
 {'index': 45, 'inliers': 5},
 {'index': 40, 'inliers': 4},
 {'index': 2, 'inliers': 3},
 {'index': 23, 'inliers': 3}]
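
The list above is ordered by inlier count, best candidate first. The scoring loop appends candidates in database order, so a sort is needed to get this ranking. A minimal sketch, using a shortened copy of the scores above:

```python
# A few of the scores from the output above, in arbitrary order.
inliers_counts = [
    {"index": 4, "inliers": 11},
    {"index": 17, "inliers": 29},
    {"index": 10, "inliers": 10},
]

# Rank candidates by inlier count, highest first.
ranked = sorted(inliers_counts, key=lambda c: c["inliers"], reverse=True)
best = ranked[0]
print(best["index"], best["inliers"])  # 17 29
```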

It is then trivial to print the description of the top-matched image using its index:

print('Best guess for this image:', building_descs[17])

Which outputs:

Best guess for this image: 18. The Colosseum — Rome, Italy

Summary and further reading

We started with a brief introduction to the image recognition/retrieval task and TensorFlow Hub's DELF module, then built a demo image recognition pipeline that retrieves 50 world-famous buildings.

The original DELF paper largely inspired me to write this post.

Here are some related resources you might find helpful.

Kaggle - Google Landmark Recognition Challenge

Kaggle - Google Landmark Retrieval Challenge

Module DELF on TensorFlow Hub

Coursera - RANSAC: Random Sample Consensus

Lastly, don't forget to check out the source code for the post on my GitHub and try the runnable Colab notebook.
