2024-02-08

Gemini Versus Gemini: Understanding Google's Latest... Thing

[Image: an artistic rendition of the constellation Gemini on the ceiling of Grand Central Terminal, New York City. The major stars of the constellation are shown with an illustration of twin boys superimposed over them, golden on a dark turquoise background.]
Gemini, the Constellation
Grand Central Terminal Ceiling
Photograph by Allen Firstenberg


In 2023, Google made a number of announcements about new Generative AI features, but two were particularly notable:

  • In February, they announced a conversational AI system called Bard would be available to the public to answer questions and help with creative tasks.
  • In May, they announced that an upcoming model known as Gemini would be powering many Google services in the future, and it would be available for outside developers to use.


Both products saw several updates during 2023, culminating in a recent announcement that the names of the two were being merged and both would be known as Gemini.

Which now raises the question: What, exactly, do we mean when we talk about Gemini?
Let's try to untangle all the terminology.


Gemini, the Model

Julius: "My name is Julius Benedict and I'm your twin brother."
Vincent: "Oh, obviously!"
-- Arnold Schwarzenegger and Danny DeVito, Twins (1988)

At the heart of all this discussion is a multimodal machine learning model family known as Gemini.

Although it was announced at Google I/O in May 2023, details of what it could do weren't made public until December 2023. At that point, we learned that it was a machine learning model specifically trained on multimodal content. This means it was trained to handle words, pictures, videos, and other media "modes" natively.

Gemini was divided into three sizes, with the understanding that the larger versions were more capable or could handle more complex tasks. All three, however, were multimodal. From smallest to largest, the three sizes were:

  • Nano
  • Pro
  • Ultra

When released in December 2023, there were also several announcements about how the Gemini model would be used:
  • Google was switching all of its products that used the previous generations of models (in the LaMDA or PaLM families) to use Gemini.
  • The first product to make this switch would be the Google Bard chatbot, where Gemini Pro would be the underlying model for most regions in the world.
  • Developers would have access to the Gemini Pro model through a cloud-based API.
  • Some early testers would have access to Gemini Nano on select Android devices through a library.

 

Gemini, the API

Well... the APIs.
"The best part of working with your twin? You always have someone to blame if things go wrong"
-- Unknown
Shortly after the Gemini announcement in December 2023, the model was made available to developers through an Application Programming Interface (API) and a set of libraries for a variety of different programming languages. The API provided access to two different variants of the model:
  • gemini-pro
  • gemini-pro-vision

Both are similar, but the gemini-pro-vision version was trained to take images (and sometimes videos) along with text as input, while the gemini-pro version was tuned to be more conversational. Both could return only text.

Both of these models were available using two different developer platforms:
  • The Google Generative AI platform, sometimes known as the MakerSuite platform or the Google AI Studio platform
  • The Google Cloud Vertex AI platform

The two platforms were substantially the same, but there were slight differences:
  • The MakerSuite platform was simpler to get up and running with, since developers could use a simple authentication scheme known as an API Key.
  • The Vertex AI platform had a few more features, including video support, since it built on other Google Cloud features, including authentication.


Importantly, however, the underlying model used by both is the same: Gemini Pro.
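
To make this concrete, here is a minimal sketch of calling the gemini-pro model through the Google AI JavaScript SDK. Treat it as an illustration rather than official documentation: the @google/generative-ai package and GOOGLE_API_KEY environment variable reflect the SDK at the time of writing, and the details may change.

// Minimal sketch: call gemini-pro via the Google AI (AI Studio) platform.
// Assumes the @google/generative-ai package is installed and an API Key is
// available in the GOOGLE_API_KEY environment variable.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY);

// "gemini-pro" handles text; "gemini-pro-vision" also accepts image parts.
const model = genAI.getGenerativeModel({ model: "gemini-pro" });

const result = await model.generateContent("In one sentence, what is the Gemini model?");
console.log(result.response.text());

The Vertex AI platform exposes the same underlying model through its own client libraries and Google Cloud authentication, so the equivalent code there looks different even though the responses come from Gemini Pro.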


Gemini, the Application

What's in a name? That which we call a rose,
By any other word would smell as sweet.
-- Romeo and Juliet, Act II, Scene 2, by William "The Bard" Shakespeare
In February 2024, Google announced several major developments with the Bard chatbot, the most surprising of which was that it was being renamed to Gemini. Google also indicated that its entire suite of professional assistance tools, formerly known as Duet AI, would come under the Gemini brand.

Other changes and updates included:
  • A split in features:
    • The basic Gemini chat would use the Gemini Pro model for text-based work in all countries that can access the chatbot
    • The introduction of a premium level called Gemini Advanced which uses the Gemini Ultra model.
  • New features, including the ability to generate images using the Imagen 2 model.
  • The initial launch of an app for Android and iOS

So the natural question is: how does Gemini, the chat application or assistant, differ from Gemini, the API, or Gemini, the model?

Gemini chat is a consumer-level application that provides a way for people to ask conversational questions that are handled by the Gemini model. It also has features that go beyond what the Gemini model or API handle, including:
  • Generating images using the Imagen 2 model
  • Accessing a user's personal email or files in Google Drive
  • Having access to up-to-date information from the internet
Meanwhile, the Gemini tools in Google Workspace and Google Cloud provide specialized assistance for those products, such as code assistance.

With this change, it is important to understand two things about the Gemini API:
  • It does not provide the same features that Gemini chat does.
  • It does not let you access Gemini chat through an API.

Developers can write programs that use the Gemini API and offer features similar to Gemini chat or the other Gemini assistants, but they must write the code to implement those features themselves.

The Gemini model is used by all of these products, along with several other products from Google. It may have more features and capabilities than any of them use or make available at this time.


Gemini, the Conclusion

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”
“The question is,” said Alice, “whether you can make words mean so many different things.”
“The question is,” said Humpty Dumpty, “which is to be master—that’s all.”
-- Through the Looking Glass, Lewis Carroll
While sometimes it is fine to use the term "Gemini" generically, we should make sure that it is clear what we're talking about.

If we're talking about the model, we should specify "the Gemini model" or a particular size such as "Gemini Pro".

If we're talking about the chat application, we should say "Gemini chat" or "the Gemini app", or talk about "Gemini Advanced chat". If we're talking about other Google products under the Gemini name, we should be clear which one we mean (such as "Gemini for Code Assistance").

If we're talking about developing, we'll probably talk about the "Gemini API" and possibly say which platform (Google AI Studio or Vertex AI) we're on. We may even talk about a particular model such as "gemini-ultra" or "gemini-pro-vision".

By following this guidance, we should make sure we are clearly understood. By human and AI model alike.



2024-01-04

Gemini, MakerSuite, API Keys, and "The caller does not have permission"

On the Google Developer Community #gemini-api Discord channel, there has been a rise in the number of people having problems generating keys. Folks would say something like:

"I'm using MakerSuite with Gemini and I deleted an API Key. I went to create a new one, but I'm getting an error saying the caller does not have permission. What does that mean and how can I get a key?"


It took a few days to figure out what was going on, but we think we have a good solution. Let's take a look at what's happening, why, and what you can do about it.

API Keys and What Is Causing This

API Keys are a basic authorization scheme that Google uses to grant developers access to the Generative AI platform API, including API access to the Gemini model. Getting a key should be fairly easy: you select "Get API Key" on the Google AI Studio website and can then copy the key. You can then use this key in your code when you access the API.
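
For reference, here is a minimal sketch of what "using the key in your code" can look like as a plain REST call. It assumes Node 18+ (for the global fetch) and an environment variable named GEMINI_API_KEY; the v1beta endpoint and request shape reflect the API at the time of writing and may change.

// Hedged sketch: call the Generative Language REST API with an API Key.
// GEMINI_API_KEY is an assumed environment variable for this example.
const key = process.env.GEMINI_API_KEY;
const url = `https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key=${key}`;

const res = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    contents: [{ parts: [{ text: "Say hello in one short sentence." }] }],
  }),
});

const data = await res.json();
console.log(data.candidates?.[0]?.content?.parts?.[0]?.text);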

Google uses this key to get a good idea of how many different projects are accessing Gemini and to make sure it isn't being abused. Abuse prevention is important because the free tier for Gemini is limited to 60 queries per minute. It is also what leads to the problem people are now encountering.

Previously, you were able to create an unlimited number of keys. However, Google has apparently limited this to one key per project, probably as part of a plan to make sure people don't use multiple keys to get around the rate limit.

There appears to be a bug, however, where you can delete an API Key... but you won't be able to create a new one because Google thinks the key still exists.

In a way - it does.

The API Keys are actually associated with a Google Cloud project that gets created when you create a new key (via the "Create API key in new project" button). The MakerSuite console, however, hides this behind the scenes to make it easier for developers to get started. When you delete a key, Google Cloud makes it so the key can't be used, but it also allows you to "undelete" it within 30 days in case there was a mistake. We'll use this fact to get your key back so you can use it.

Getting the Key Back

To get the key back, we'll go into the Google Cloud Console credentials page and restore the deleted key.

Go directly to the credentials page in Google Cloud Console at https://console.cloud.google.com/apis/credentials

  • Make sure the account is the same as the one you're using for MakerSuite. You should be able to see the account in the upper right hand corner.
  • If you have more than one project, make sure you're using the right one. The default keys are created in a project named "Generative Language Client", but you may have done it in a different project.

Select the "Restore Deleted Credentials" link.

For the key with the name "Generative Language API Key", select the "RESTORE" link.

On the pop-up, click the "RESTORE" button again.

Then click the back arrow to leave the "Deleted credentials" page.


On the Credentials page, you'll see that the credential has been restored.

And if you go back to the MakerSuite Google AI Studio, you'll see that the key now shows up there as well.

Creating a new API Key

In some cases, however, you actually want to keep the key deleted and need a new one. For example, you may have accidentally included the API Key in code that you posted on GitHub, and now need to invalidate it so nobody else can use it. But that means you'll need a new one.

You won't be able to use the Google AI Studio page to do this, but you will be able to do it through the Google Cloud Console page.

As above, you would go directly to the credentials page in Google Cloud Console at https://console.cloud.google.com/apis/credentials

    This time, however, you would select the "Create Credentials" link along the top and in the drop-down menu select "API key"

    The system will create the key and pop up a message saying it has done so, and let you copy the key at this time. More importantly, however, you'll see a warning that the key is unrestricted. This is a bad idea from a security perspective, so you should click on "Edit API key" to restrict how the key will be used.

    We want to restrict this so it can only use the Generative Language API, so we'll select the radio button to "Restrict key" and then make sure we locate and check the box next to "Generative Language API" and click on OK.


    We can also do other things from this page, such as change the name so it will be more obvious what it is used for, but that isn't necessary.

    Instead, we'll just save these settings.


    While the key will show up in the Google Cloud Console, it won't appear in the MakerSuite Google AI Studio key page. Instead, if you want to manage this key in the future, you'll need to do it from the Cloud Console Credentials page.

    Conclusion

    As I hope you've seen, while the MakerSuite Google AI Studio page simplifies managing your API Key for Gemini, you may sometimes need to use the Google Cloud Console Credentials page to address some issues you may encounter. Hopefully, this guide has made it relatively easy to navigate these tasks.

    If you have found this useful, please let me know. You can find my contact information on my website, prisoner.com. Or feel free to join the #gemini-api channel on the Google Developer Community Discord server.

    My thanks to the Googlers who have assisted in helping diagnose the problem and all the members of the #gemini-api channel who reported the problem and helped test this solution.











    2023-12-14

    Gemini Gives Us Retrieval Augmented Generation with Confidence

    Image by Imagen2

    Digging into the recently released API for Gemini in Google AI Studio, I discovered what I think might be a hidden gem. They have tools they group under "Semantic retrieval" for building Retrieval Augmented Generation (RAG) systems. This includes a tailored model and API that assist in response generation. But there are some distinctly different elements to it that piqued my interest. This is just a preliminary exploration, and I look forward to digging into these features further.


    What is Retrieval Augmented Generation?


    This will merit another article to go into depth, but RAG is a way of allowing users to ask questions through an AI model and ground the answer in an external authoritative resource. It usually involves three phases:

    • Turning the user question into a concrete, and conventional, digital representation. This could be something like an SQL query, a data structure containing matching keywords, or an embedding.

    • Using conventional (non-AI methods) to find results, such as documents, that match the request. This could be done by running the SQL query, for example, or doing a nearest neighbor search with the embedding vector. The results are a set of data that, somehow, answer the question asked.

    • Prompting another language model with the original question and these sample answers to produce a result that gets sent to the user.


    Our current tools can do this quite adequately, and there are several products (such as Google's Vertex AI Search and Conversation) that can do this out of the box and still provide API access. However, Gemini and Google AI Studio offer two interesting elements that make this offering different. The first is the Attributed Question Answering API (the AQA API), which offers help on the "Generation" part, and the second is the Semantic Retriever API, which helps with the "Retrieval" part (as the name suggests).


    The Attributed Question and Answering API


    In short, the AQA API provides access to a generative AI model based on Gemini Pro. This model is similar to other Gemini models in that it takes a prompt, temperature and other model parameters, safety restrictions, and so on. It differs, however, by also allowing you to provide a set of passages that you think adequately answer the question.


    The results are what make this different from using typical LLMs to generate results with RAG. While it does give an answer, as you would expect, it also gives two additional pieces of information:

    • Which passages, provided through an attribute called inlinePassages, were used to generate the result (the "attribution" part of the name). This allows you and the user to verify the information and where it came from.

    • A confidence score called answerableProbability, giving an indication about how well the answer was actually grounded in the question and provided passages.


    As developers, we can use this confidence score to determine how to present the results. For example, with a confidence score of 90% we would present the results clearly and confidently. However, with a score under 50%, we might include a warning that the results may be inaccurate, or we might not present the results at all and instead say that we didn't know.
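
    A small sketch of that idea in JavaScript is below. The thresholds mirror the ones above, but the shape of the aqaResponse object is just an illustrative stand-in, not the API's actual contract:

    // Illustrative only: decide how to present an AQA result based on its
    // answerableProbability. The aqaResponse shape is a simplified stand-in
    // for whatever your client library actually returns.
    function presentAnswer(aqaResponse) {
      const { answer, answerableProbability } = aqaResponse;

      if (answerableProbability >= 0.9) {
        // Well grounded in the provided passages: present it directly.
        return answer;
      }
      if (answerableProbability >= 0.5) {
        // Plausible, but less certain: present it with a caveat.
        return `${answer}\n\n(Note: this answer may not be fully supported by the source material.)`;
      }
      // Poorly grounded: don't pass the answer along at all.
      return "Sorry, I couldn't find a well-supported answer to that question.";
    }

    // Example usage with a made-up response object.
    console.log(presentAnswer({ answer: "The warranty lasts two years.", answerableProbability: 0.93 }));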


    The Semantic Retriever API


    But the AQA API also goes a step further. In addition to letting you provide the passages inline, you can also create a corpus of documents and point to this corpus when asking it to generate a response to the question. These documents must be broken up into chunks of about 2k, and each document and chunk can contain additional metadata.


    Then, when you ask the model to answer a question, you can point it at this corpus. It will identify several of the "best" answers for the question, and provide them to the model as if they were inline passages.


    In this case, the attributed results include any identifiers you may have included, so you can fetch the metadata and include it in what you show the user. And, as always, it includes the answer probability so you can determine how you wish to handle replying to the user with these results.


    But why?


    If you're familiar with other RAG methods, you're probably asking yourself:  "Self, why would I use this method over other solutions - either pre-made products or putting it together yourself via something like LangChain?" The answer depends on your goals and how you use the API.


    Using it just as a generator gives you the confidence score, which allows you to avoid passing off replies that aren't actually backed by the grounding information.


    Using it as a full RAG gives you a bit more control over the LLM parameters than many pre-built systems. When compared to putting all the pieces together yourself, it should also be faster, since you're only making one API call rather than three or more.


    As with most solutions, there are tradeoffs involved in every approach. Still, I think this solution fits a really nice middle ground that people should explore.


    What's next?


    There is a lot more to test here - from code that actually uploads and manages data in the corpus to effectively using the model. While all of this looks straightforward, I'm anticipating some "gotchas" that will make this trickier. (For example, how can we express other contextual information? Can we do that in the prompt? And many more questions!)


    And then, of course, all the other new features that Gemini brings!


    More of all of these in a later post.



    2023-09-15

    Using ML to "Understand" Images

    We've become familiar with using Large Language Models (LLMs) to help us "understand" the contents of text documents, or to search for documents or pages containing text relevant to a question we ask. This has gone beyond "keyword search" to more of a "semantic search": searching for content that has the same meaning as what we are asking about, not just content that uses the same words.


    What if we could do the same for images? What if we could take a picture and find other pictures that were "similar" to ours, the same way Google Lens does? Or what if we could type in a description and find images that match, the same way Google Photos does?


    Is there a model that works with images in this way? Yes. Recently, Google introduced a multimodal embedding model as part of the Google Cloud Vertex AI suite of tools, which allows developers to take advantage of some of the same powerful algorithms that are used in tools such as Google Photos and Google Lens without having to learn advanced Machine Learning (ML) techniques.


    While the model is straightforward for developers to use through a network API, even if you're not an ML expert, there are some concepts you will need to understand.


    We will review those concepts and see how to apply them with this model. Then, we will conduct a few experiments to see how well the model works. This includes using the popular LangChain JS library to help us access the model and seeing what we can learn about how to apply the model in our own code.


    What do we mean by "multimodal"?


    The term "multimodal" gets tossed around a lot, often without being clear about what is meant by it. Generally, multimodal is a term used to describe something that uses multiple "modes" or methods to communicate, such as written text, speech, still images, video, or combinations of these.


    In our case, multimodal refers to the ability of the model to process both text and images. This allows the model to represent the "meaning" of both words and images, and because of that, it can help us find images and text that are "similar" to each other. This representation is done through the use of something called an "embedding".


    What is an embedding?


    In a nutshell, an embedding is a list of numbers between -1 and 1. Sometimes we'll refer to this list as a vector.


    Each item in the list represents a category of some sort, with the number representing how much the image or text "fits" that category. When we create an embedding, the model evaluates how well the image or text "fits" each category in the list.


    If the image or text fits the category well, then it is valued closer to 1. If it is the exact opposite of that category, it is valued closer to -1. For example, a picture of an egg might be valued at something like 0.90 if the category was "breakfast", but valued at -0.50 if the category was "flower". For a good embedding, there are many such categories.


    For ML-based embeddings, we don't necessarily know (or need to know) what category each item represents. But when two different embeddings have numbers in the same spot in the list that are close to each other, they are "similar" in some way to each other.


    Mathematicians may think of these vectors as representing a point in n-dimensional space, but most people don't think well in n-dimensions. Instead, we'll use an algorithm (that has already been written for us and is contained in a function) to determine how close two different sets of numbers are to each other. Those that are closer, by some measurement, are said to be "similar".


    Why do we want embeddings? What do we do with them?


    Embeddings are particularly useful when we want to be able to compare two things to see how similar they are. As an extension of this, it is useful if we want to compare one thing to a large number of other things, such as when we're searching for something.


    In the tests we'll be doing to try out the multimodal embedding model, we're going to store our embedding vectors on disk using JSON to represent an object. Each key of this object is a name we'll use for reference, and the value is the vector. We have functions loadVector() and saveVector() that load and save this object, respectively.
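
    The real helpers live in the repository, but a minimal sketch of what loadVector() and saveVector() might look like (assuming a vector.json file next to the code) is:

    // A minimal sketch of the loadVector()/saveVector() helpers referenced above.
    // The real versions are in the repository; the vector.json filename is an
    // assumption for this illustration.
    import fs from "fs";

    const vectorFile = "vector.json";

    export function loadVector() {
      // Return an empty object if we haven't saved anything yet.
      if (!fs.existsSync(vectorFile)) {
        return {};
      }
      return JSON.parse(fs.readFileSync(vectorFile, "utf8"));
    }

    export function saveVector(vector) {
      fs.writeFileSync(vectorFile, JSON.stringify(vector));
    }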


    There are a variety of algorithms to compare two embeddings to see how similar they are. We're going to use a formula called "cosine similarity" to do so, although others are available. We're not going to go into the math, but this measures how similar two embeddings are by looking at the angle between the two vectors in n-dimensional space. A similarity of 1 means they are identical, while -1 indicates they are exactly opposite. The LangChain JS library we're using has a function cleverly called cosineSimilarity() that takes two arrays of vectors and returns a two-dimensional array of similarity values.
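
    For the curious, here is what that measurement boils down to for a single pair of vectors. This is only an illustration; the code in this article uses LangChain's cosineSimilarity(), which does the equivalent across whole arrays of vectors:

    // Cosine similarity for a single pair of vectors, shown only to illustrate
    // the idea behind the measurement.
    function cosine(a, b) {
      let dot = 0;
      let magA = 0;
      let magB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
      }
      return dot / (Math.sqrt(magA) * Math.sqrt(magB));
    }

    // Identical directions score 1, opposite directions score -1.
    console.log(cosine([0.9, -0.5], [0.9, -0.5]));   // 1
    console.log(cosine([0.9, -0.5], [-0.9, 0.5]));   // -1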


    When actually using embeddings, we will usually store them in something more reliable, such as a database. Unlike traditional databases, however, we aren't usually searching for a specific value, but rather for the "nearest neighbors" of an embedding that we provide. Vector databases are well tuned to handle this kind of request.


    While vector databases such as the Vertex AI Matching Engine are out of the scope of this article, you might use them to do something like store the embeddings for a large number of pictures. Then, when you are searching for one, you might enter a text query, get the embedding for that query, and ask the vector database to search for the images that are most similar to it.
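
    At the small scale of the experiments below, we can approximate that "nearest neighbor" lookup with a brute-force scan. This is just a sketch building on the cosine() illustration above; the vectors argument is the name-to-embedding object described earlier:

    // Brute-force nearest-neighbor search over a set of saved embeddings: a
    // small-scale stand-in for what a vector database does for us at scale.
    // Relies on the cosine() sketch above; vectors maps names to embeddings.
    function nearestNeighbors(queryEmbedding, vectors, topK = 3) {
      return Object.entries(vectors)
        .map(([key, embedding]) => ({ key, score: cosine(queryEmbedding, embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, topK);
    }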


    Well then, how do we access the model to create an embedding?


    The multimodal embedding model gives us a REST-based API that will take either an image or some text and return an embedding that represents it. What is important is that the embeddings returned for images and for text are compatible with each other, so we can get an embedding for both an image and some text and see how closely they compare.


    We'll use the LangChain JS library to create either a text or image embedding. In both cases, with the multimodal embeddings model, we'll get back an array of 1024 numbers. To do this, we'll create a new GoogleVertexAIMultimodalEmbeddings object. This object has two methods that we'll be using: embedImageQuery() takes an image and returns the embedding for it as an array of numbers. Similarly, embedQuery() takes text and returns the embedding for it as an array of numbers.


    We use LangChain JS to do the work instead of calling the model's REST endpoint directly since it will handle some of the work of formatting the request correctly and retrying in case the call initially fails.


    Yeah yeah, show me the code!


    Our goal will be to create three different functions that do the bulk of the work. There are other assorted functions that help with these tasks, but they're not relevant to the embeddings, and you can see them in the full source code in the GitHub repository.


    For starters, we'll need to import the classes and functions that we'll be using from LangChain JS. We've tested this with LangChain JS version 0.0.117, but the library is moving so fast you may have a more recent version, which should also work fine.


    import fs from "fs";
    import { GoogleVertexAIMultimodalEmbeddings } from "langchain/experimental/multimodal_embeddings/googlevertexai";
    import { cosineSimilarity } from "langchain/util/math_utils";


    To get the embedding of an image, we'll be passing in an object where each key is a reference name and the value is the path to the image on disk. (We also have a convenience function where we can just pass a single name and path.) The function creates the embedding class and then, for each of the image files, loads it in, gets the embedding, and adds it to our vector object.


    export async function embedImages( imgPath ){
      const vector = loadVector();
      const e = new GoogleVertexAIMultimodalEmbeddings();

      for( const key of Object.keys(imgPath) ){
        const path = imgPath[key];
        const vkey = `img_${key}`;
        const imgFile = fs.readFileSync( path );
        const v = await e.embedImageQuery( imgFile );
        vector[vkey] = v;
      }

      saveVector( vector );
    }


    export async function embedImage( name, path ){
      await embedImages({[name]: path});
    }


    For text, we have very similar code, although the exact call is slightly different. In this case, we'll take an array of strings (or turn a single string into one) and get the text embedding for each. The key we'll save each embedding against is the text itself.


    export async function embedText( text ){
      if( !Array.isArray( text ) ){
        text = [text];
      }

      const vector = loadVector();
      const e = new GoogleVertexAIMultimodalEmbeddings();

      for( const key of text ){
        const vkey = `txt_${key}`;
        const v = await e.embedQuery( key );
        vector[vkey] = v;
      }

      saveVector( vector );
    }


    Once we have all the embeddings we want, we can compute the similarity between each of them. We'll load in the vectors we have, compare them against themselves, and then write the similarities out as a table with tabs separating each column.


    export function buildSimilarity(){
      const data = loadVector();
      const keys = Object.keys(data).sort();
      const vectors = keys.map( key => data[key] );

      const similarity = cosineSimilarity( vectors, vectors );

      fs.writeFileSync(
        similarityFile,
        `\t${keys.join('\t')}\n`, {flag:"w"}
      );
      for( let co=0; co<keys.length; co++ ){
        fs.writeFileSync(
          similarityFile,
          `${keys[co]}\t${similarity[co].join('\t')}\n`,
          {flag:"a"}
        );
      }
    }


    To help understand what we're looking at, we'll load these results into a Google Sheets spreadsheet (which reads tab-separated value, or TSV, files natively), and then apply some formatting to help us identify how similar some of these pictures are to each other and to the text.


    So?!? What did we try to embed and how well did it work?


    For my initial testing, I used seven pictures that I took myself. Five of them are relatively similar - mostly variants of pictures of the shelves in my office. None of the images had any specific tags associated with them, so the embeddings were created based on the visual content alone.


    I also tried 12 text embeddings, some of which tried to reference the images, while others were a little more abstract.


    Similarity Values

    If we look at the portion of the grid where we are seeing how similar images are to each other, representing something we might do with Google Lens, the results appear very useful.
    • Comparing the picture against itself gives a similarity value of 1 - which is expected.

    • Comparing two images of the same shelf (the shelf1 and shelf2 images), one with a box on it and one without, gives a similarity of 0.96. Very, very close, so we can interpret this as "nearly the same".

    • Comparing the images of two different shelves of books (shelf1 or shelf2 vs shelf3) still scores pretty well. With similarity values of 0.77 and 0.79, we might choose to interpret these as "similar things, but not the same". Perhaps this is close enough in some cases, but perhaps not.

    • A picture of a person (shelfie) in front of the same shelves gives us an even lower score (between 0.60 and 0.71), indicating there are elements that are similar, but overall it is not very similar. Again, this may be a "good enough" match in some cases.

    • Other images (sunset and of an old computer) don't match the shelves very well at all - all their scores are below 0.50, tending to hover between 0.30 and 0.40. Perhaps this indicates that it can recognize that these are photographs of "things", but that there is little else similar about them.

    • What is most surprising is that the picture of just the yellow box on the shelf did not match the picture with the box on the shelf - the score was just 0.37, although that was its best score against all the shelves. It matched better against the sunset photo, with a score of 0.43. I'm not entirely sure what this suggests, except, perhaps, that how the model judges "similarity" may take into account things like different colors, not just different recognizable objects.


    When we compare the text prompts to the images, similar to doing a search in Google Photos, a few things stand out immediately:

    • The similarity scores are significantly lower.

    • Despite this, there is enough variation for us to detect which images are most similar, even if the scores don't seem that close. Scores lower than 0.05 seem to indicate no match.

    • A more descriptive phrase seems to yield a better number than just a word, at least in most cases.

      • The phrase "books and mugs on a bookshelf" got better scores (0.08 to 0.19) than just using "shelf" (0.05 to 0.10) or "books" (0.04 to 0.08).

      • Similarly, while "computer" matched with a score of 0.10, "an old computer" matched with 0.16.

    • Although the picture is of a sunset, the score for "sunrise" (0.09) isn't terribly bad, but not as good as the score for "sunset" itself (0.12). Interestingly, the more descriptive phrase "clouds in a sky at sunset" didn't do as well with just a 0.09 score.


    Initial conclusions


    I've only just begun to explore the multimodal embedding model, but I feel I can already draw a few good conclusions:

    • The model does work for both image and text similarity matching.

    • Generating embeddings is simple, straightforward, and usable for production.

    • We need to do extensive testing and tweaking to determine the thresholds at which we should return results and how we should evaluate those results.


    I hope this has illustrated how embeddings can work with media such as images, and I hope it has inspired you to think of ways you can use this model in your own projects. While the features it offers are similar to those of Google Lens and Google Photos, it is not a complete replacement for either of them, but it does offer capabilities that can be used in a variety of ways.


    For example, you may wish to test how well it works for things such as plant or animal identification. Or you may wish to see how it can be used on private caches of images. Or you may have other ideas that this now unlocks.


    Where do we go from here?


    These tests were fairly limited and of a very narrow scope. So there are many other questions that we should explore:

    • How does it handle comparisons of specific text that may be in an image (such as on a sign or billboard) when searching for that text?

    • How would other images be handled? For example, line drawings or cartoons instead of photographs?

    • Is there a way to write the text to get better results? Do they need to be more specific, for example?

    • There are other similarity scoring methods available, such as the inner product or Euclidean distance. Do these yield better results in some cases?

    • What is the performance of generating these embeddings? Are there more efficient or faster ways to do the same sorts of things?


    We'll explore some of these in later posts. In the meantime, the code is available on GitHub, and I encourage you to use it to generate your own tests and evaluate how well it might work to address problems you may have.


    [Similarity grid image omitted; row and column keys: img_box, img_pet, img_shelf1, img_shelf2, img_shelf3, img_shelfie, img_sunset]