Using ML to "Understand" Images

We've become familiar with using Large Language Models (LLMs) to help us "understand" the contents of text documents, or to search for documents or pages that contain text relevant to a question we ask. This has gone beyond "keyword search" to more of a "semantic search" - searching for content that has the same meaning as what we are asking about, not just content that uses the same words.

What if we could do the same for images? What if we could take a picture and find other pictures that were "similar" to ours, the same way Google Lens does? Or what if we could type in a description and find images that match, the same way Google Photos does?

Is there a model that works with images in this way? Yes. Recently, Google introduced a multimodal embedding model as part of the Google Cloud Vertex AI suite of tools, which allows developers to take advantage of some of the same powerful algorithms that are used in tools such as Google Photos and Google Lens without having to learn advanced Machine Learning (ML) techniques.

While the model is straightforward for developers to use through a network API, even if you're not an ML expert, there are some concepts that you will need to understand.

We will review those concepts and see how to apply them with this model. Then, we will conduct a few experiments to see how well the model works, using the popular LangChain JS library to access the model and seeing what we can learn about how to apply it in our own code.

What do we mean by "multimodal"?

The term "multimodal" gets tossed around a lot, often without being clear about what is meant by it. Generally, multimodal is a term used to describe something that uses multiple "modes" or methods to communicate, such as written text, speech, still images, video, or combinations of these.

In our case, multimodal refers to the ability of the model to process both text and images. This allows the model to represent the "meaning" of both words and images, which is why it can help us find images and text that are "similar" to each other. This representation is done through the use of something called an "embedding".

What is an embedding?

In a nutshell, an embedding is a list of numbers between -1 and 1. Sometimes we'll refer to this list as a vector.

Each item in the list represents a category of some sort, with the number representing how well the image or text "fits" that category. When we create an embedding, we (or the model) evaluate how well the image or text "fits" each category in the list.

If the image or text fits the category well, then it is valued closer to 1. If it is the exact opposite of that category, it is valued closer to -1. For example, a picture of an egg might be valued at something like 0.90 if the category was "breakfast", but valued at -0.50 if the category was "flower". For a good embedding, there are many such categories.
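As a toy sketch of this idea (the categories and numbers here are invented purely for illustration - real model embeddings don't come with human-readable labels):

```javascript
// Invented categories and values, purely to illustrate the structure
const categories = [ "breakfast", "flower" ];
const eggEmbedding = [ 0.90, -0.50 ]; // fits "breakfast", opposite of "flower"

// Each position in the embedding pairs up with a category
categories.forEach( ( cat, i ) =>
  console.log( `${cat}: ${eggEmbedding[i]}` )
);
```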

For ML-based embeddings, we don't necessarily know (or need to know) what category each item represents. But when two different embeddings have numbers in the same spot in the list that are close to each other, they are "similar" in some way to each other.

Mathematicians may think of these vectors as representing a point in n-dimensional space, but most people don't think well in n-dimensions. Instead, we'll use an algorithm (that has already been written for us and is contained in a function) to determine how close two different sets of numbers are to each other. Those that are closer, by some measurement, are said to be "similar".
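The function we'll use later does this measuring for us, but to make "closeness" concrete, here is a minimal hand-rolled sketch of one such measurement, cosine similarity, applied to two plain arrays of numbers:

```javascript
// Cosine similarity: the cosine of the angle between two vectors.
// 1 means they point the same way, 0 means unrelated, -1 means opposite.
function cosine( a, b ){
  let dot = 0, magA = 0, magB = 0;
  for( let i = 0; i < a.length; i++ ){
    dot  += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / ( Math.sqrt( magA ) * Math.sqrt( magB ) );
}

console.log( cosine( [1, 0], [1, 0] ) );  // identical directions → 1
console.log( cosine( [1, 0], [-1, 0] ) ); // opposite directions → -1
```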

Why do we want embeddings? What do we do with them?

Embeddings are particularly useful when we want to be able to compare two things to see how similar they are. As an extension of this, it is useful if we want to compare one thing to a large number of other things, such as when we're searching for something.

In the tests we'll be doing to try out the multimodal embedding model, we're going to store our embedding vectors on disk using JSON to represent an object. The key of this object is a name we're going to use for reference, and the value is the vector. We have methods loadVector() and saveVector() that load and save this object respectively.

There are a variety of algorithms to compare two embeddings to see how similar they are. We're going to use a formula called "cosine similarity" to do so, although others are available. We're not going to go into the math, but this measures how similar two embeddings are by looking at the angle between the two points they represent in n-dimensional space. A similarity of 1 means they are identical, while -1 indicates they are exactly opposite. The LangChain JS library we're using has a function cleverly called cosineSimilarity() that takes two arrays of vectors and returns a two-dimensional array of similarity values.

When actually using embeddings, we will usually store them in something more reliable, such as a database. Unlike traditional databases, however, we aren't usually searching for a specific value, but rather for the "nearest neighbors" of an embedding that we provide. Vector databases are well tuned to handle this kind of request.

While vector databases such as the Vertex AI Matching Engine are out of the scope of this article, you might use them to do something like store the embeddings for a large number of pictures. Then, when you are searching for one, you might enter in a text query, get the embeddings for that query, and ask the vector database to search for the images that are most similar to it.
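A vector database is doing, at scale and with clever indexing, what this naive in-memory sketch does: score every stored embedding against the query and return the closest matches. (The three-number "embeddings" and key names here are toy stand-ins for the model's much longer vectors.)

```javascript
// Rank stored embeddings by cosine similarity to a query embedding
function nearestNeighbors( query, store, k ){
  const cosine = ( a, b ) => {
    let dot = 0, magA = 0, magB = 0;
    for( let i = 0; i < a.length; i++ ){
      dot  += a[i] * b[i];
      magA += a[i] * a[i];
      magB += b[i] * b[i];
    }
    return dot / ( Math.sqrt( magA ) * Math.sqrt( magB ) );
  };
  return Object.keys( store )
    .map( key => ({ key, score: cosine( query, store[key] ) }) )
    .sort( ( a, b ) => b.score - a.score )
    .slice( 0, k );
}

// Toy data - real embeddings from the model would each be 1024 numbers
const store = {
  img_shelf1: [ 0.9, 0.1, 0.0 ],
  img_sunset: [ 0.1, 0.8, 0.3 ],
  img_shelf2: [ 0.7, 0.3, 0.2 ],
};

console.log( nearestNeighbors( [ 0.85, 0.15, 0.05 ], store, 2 ) );
// → img_shelf1 first, then img_shelf2; the sunset is excluded
```

A real vector database avoids scoring every stored vector by using approximate nearest-neighbor indexes, but the inputs and outputs are conceptually the same.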

Well then, how do we access the model to create an embedding?

The multimodal embedding model gives us a REST-based API that will take either an image or some text and return an embedding that represents it. What is important is that the embeddings returned for images and for text are compatible with each other, so we can get an embedding for both an image and some text and see how closely they compare.

We'll use the LangChain JS library to create either a text or image embedding. In both cases, with the multimodal embeddings model, we'll get back an array of 1024 numbers. To do this, we'll create a new GoogleVertexAIMultimodalEmbeddings object. This object has two methods that we'll be using: embedImageQuery() takes an image and returns the embedding for it as an array of numbers. Similarly, embedQuery() takes text and returns the embedding for it as an array of numbers.

We use LangChain JS to do the work instead of calling the model's REST endpoint directly since it will handle some of the work of formatting the request correctly and retrying in case the call initially fails.

Yeah yeah, show me the code!

Our goal will be to create three different functions that do the bulk of the work. There are other assorted functions that help with these tasks, but they're not relevant to the embeddings, and you can see them in the full source code in the GitHub repository.

For starters, we'll need to import the classes and functions that we'll be using from LangChain JS. We've tested this with LangChain JS version 0.0.117, but the library is moving so fast you may have a more recent version, which should also work fine.

import fs from "fs";

import { GoogleVertexAIMultimodalEmbeddings } from
  "langchain/experimental/multimodal_embeddings/googlevertexai";

import { cosineSimilarity } from "langchain/util/math_utils";

To get the embedding of an image, we'll be passing in an object where the key is a reference name and the value is the path to the image on disk. (We also have a convenience function where we can just pass the name and path.) It will create the embedding class and then, for each of the image files, load it in, get the embedding, and then add this to our vector object.

export async function embedImages( imgPath ){
  const vector = loadVector();
  const e = new GoogleVertexAIMultimodalEmbeddings();
  for( const key of Object.keys(imgPath) ){
    const path = imgPath[key];
    const vkey = `img_${key}`;
    const imgFile = fs.readFileSync( path );
    const v = await e.embedImageQuery( imgFile );
    vector[vkey] = v;
  }
  saveVector( vector );
}

export async function embedImage( name, path ){
  await embedImages({ [name]: path });
}

For text, we have very similar code, although the exact call is slightly different. In this case, we'll take an array of strings (or make sure we have them) and get the text embedding for each one. The key we'll save it against is the text itself.

export async function embedText( text ){
  if( !Array.isArray( text ) ){
    text = [text];
  }
  const vector = loadVector();
  const e = new GoogleVertexAIMultimodalEmbeddings();
  for( const key of text ){
    const vkey = `txt_${key}`;
    const v = await e.embedQuery( key );
    vector[vkey] = v;
  }
  saveVector( vector );
}

Once we have all the embeddings we want, we can compute the similarity between each of them. We'll load in the vectors we have, compare them against themselves, and then write the similarities out as a table with tabs separating each column.

export function buildSimilarity(){
  const data = loadVector();
  const keys = Object.keys(data).sort();
  const vectors = keys.map( key => data[key] );
  const similarity = cosineSimilarity( vectors, vectors );

  // "similarity.tsv" is a placeholder name - the original path was lost in formatting
  const out = "similarity.tsv";

  // Header row: an empty corner cell, then one column name per key
  fs.writeFileSync( out,
    `\t${keys.join('\t')}\n`, {flag:"w"}
  );

  // One row per key: the key name, then its similarity to every other key
  for( let co=0; co<keys.length; co++ ){
    fs.writeFileSync( out,
      `${keys[co]}\t${similarity[co].join('\t')}\n`, {flag:"a"}
    );
  }
}

To help understand what we're looking at, we'll load these results into a Google Sheets spreadsheet (which reads tab-separated value files, or TSV files, natively), and then apply some formatting to help us identify how similar some of these pictures are to each other and to the text.

So?!? What did we try to embed and how well did it work?

For my initial testing, I used seven pictures that I took myself. Five of them are relatively similar - mostly variants of pictures of the shelves in my office. None of the images had any specific tags associated with them, so the embeddings were created based on the visual content alone.

I also tried 12 text embeddings, some of which tried to reference the images, while others were a little more abstract.

Similarity Values

If we look at the portion of the grid where we are seeing how similar images are to each other, representing something we might do with Google Lens, the results appear very useful.
  • Comparing the picture against itself gives a similarity value of 1 - which is expected.

  • Comparing two images of the same shelf (the shelf1 and shelf2 images), one with a box on it and one without, gives a similarity of 0.96. Very, very close, so we can interpret this as "nearly the same".

  • Comparing the images of two different shelves of books (shelf1 or shelf2 vs shelf3) still gives pretty good results. With similarity values of 0.77 and 0.79, we might choose to interpret these as "similar things, but not the same". Perhaps this is close enough in some cases, but perhaps not.

  • A picture of a person (shelfie) in front of the same shelves gives us an even lower score (between 0.60 and 0.71), indicating there are elements that are similar, but overall it is not very similar. Again, this may be a "good enough" match in some cases.

  • Other images (sunset and of an old computer) don't match the shelves very well at all - all their scores are below 0.50, tending to hover between 0.30 and 0.40. Perhaps this indicates that it can recognize that these are photographs of "things", but that there is little else similar about them.

  • What is most surprising is that the picture of just the yellow box on the shelf did not match the picture with the box on the shelf - the score was just 0.37, although that was still its best score against any of the shelf images. It actually matched better against the sunset photo, with a score of 0.43. I'm not entirely sure what this suggests, except, perhaps, that how the model judges "similarity" may take into account things like different colors, not just different recognizable objects.

When we compare the text prompts to the images, similar to doing a search in Google Photos, a few things stand out immediately:

  • The similarity scores are significantly lower.

  • Despite this, there is enough variation for us to detect which images are most similar, even if it doesn't seem that close. Scores lower than 0.05 seem to indicate not being a match.

  • A more descriptive phrase seems to yield a better number than just a word, at least in most cases.

    • The phrase "books and mugs on a bookshelf" got better scores (0.08 to 0.19) than just using "shelf" (0.05 to 0.10) or "books" (0.04 to 0.08).

    • Similarly, while "computer" matched with a score of 0.10, "an old computer" matched with 0.16.

  • Although the picture is of a sunset, the score for "sunrise" (0.09) isn't terribly bad, but not as good as the score for "sunset" itself (0.12). Interestingly, the more descriptive phrase "clouds in a sky at sunset" didn't do as well with just a 0.09 score.

Initial conclusions

I've only just begun to explore the multimodal embedding model, but I feel I can already draw a few good conclusions:

  • The model does work for both image and text similarity matching.

  • Generating embeddings is simple, straightforward, and usable in production.

  • We need to do extensive testing and tweaking to determine the similarity thresholds at which we should return results and how we should evaluate those results.

I hope this has illustrated how embeddings can work with media such as images, and I hope it has inspired you to think of ways you can use this model in your own projects. While its features are similar to those of Google Lens and Google Photos, it is not a complete replacement for either of them, but it does offer capabilities that can be used in a variety of ways.

For example, you may wish to test how well it works for things such as plant or animal identification. Or you may wish to see how it can be used on private caches of images. Or you may have other ideas that this now unlocks.

Where do we go from here?

These tests were fairly limited and of a very narrow scope. So there are many other questions that we
should explore:

  • How does it handle comparisons of specific text that may be in an image (such as on a sign or billboard) when searching for that text?

  • How would other images be handled? For example, line drawings or cartoons instead of photographs?

  • Is there a way to write the text to get better results? Do they need to be more specific, for example?

  • There are other similarity scoring methods available, such as the inner product or euclidean distance. Do these yield better results in some cases?

  • What is the performance of generating these embeddings? Are there more efficient or faster ways to do the same sorts of things?

We'll explore some of these in later posts. In the meantime, the code is available on GitHub, and I encourage you to use it to run your own tests and evaluate how well it might work to address problems you may have.









Programmers' Day!

It seems that every four years, I have a need to find a place to talk about programming.

That time has come, again, and this time I'll be sharing thoughts about AI, Machine Learning, and many of the Generative AI tools that are currently so popular. I'll be particularly focusing on how every developer can take advantage of ML - primarily as a new component in the set of tools we can take advantage of to get the job done. We don't need to be ML "experts" to use these tools any more than we need to be DB "experts" to put a database to good use. There are things we need to learn, to be sure, including the risks and tradeoffs of these tools - but they are now well within the reach of nearly every developer.

More on that soon. But first...