2025-12-21

What’s Coming with LangChainJS and Gemini?

 

Image by Gemini 3 Pro Image Preview (Nano Banana Pro)
The past few months have been huge for both Gemini and LangChainJS! I’ve been busy trying to keep up with this (and a lot more), but as the year comes to a close, I wanted to take a moment and let folks know about the exciting developments going on at the junction of the two and let you know what’s coming “real soon now!”

LangChainJS finally hit its 1.0 milestone, and with it came a host of new features. At the same time, the API has stabilized, so we know what we’re working with and what to expect going forward.


Gemini also hit a milestone with Gemini 3 coming out and Gemini 2 being shut down in a few months. There have also been some fantastic multimodal models in the past few months (can we say Nano Banana enough?), and the LangChainJS Gemini libraries have barely been able to keep up with some of these developments.

I want to take a quick look at how we got here with the LangChainJS Gemini libraries to understand where we’re going. But if you’re impatient and just want to see what’s coming, skip ahead a section. I don’t think you’ll be disappointed.

How we got here

Currently, there are a dizzying number of packages available to use Gemini with LangChainJS:

  • @langchain/google-genai was based on the previous version of Google’s Generative AI package. It was designed to work just with the AI Studio API (often confusingly called the Gemini API) and not with Vertex AI’s Gemini API. It has not been maintained in roughly a year and the library it uses is not designed to work with modern versions of the Gemini model.
  • @langchain/google-gauth is a REST-based library, used if you’re running in a Google-hosted environment or another Node-like system to access either the AI Studio API or the Vertex AI API.
  • @langchain/google-webauth was similar to @langchain/google-gauth, but was designed to work in environments where there was no access to a file system.
  • @langchain/google-vertexai and @langchain/google-vertexai-web were similar to the above, but defaulted to using Vertex AI, although they could also use the AI Studio API.

There was also another package, @langchain/google-common which was the package that all the REST-based versions relied on to do the actual work.

As the person maintaining the REST-based packages, I always saw this as somewhat frustrating. The original goal was to have just one package. It was meant to use REST, since the libraries at the time (late 2023 and early 2024) only supported either the Vertex AI or AI Studio APIs. I wanted one library to make it easier. Well… best laid plans…

That one library became four. First, because we couldn’t find an easy solution to support both node-based and web-based platforms and still support Google Cloud’s Application Default Credentials (ADC). And then because we wanted a clear “Vertex” labeled package to match what was on the Python side (where both packages were written by Google).

By the time I had that working in January of 2024, someone at Google had already written a version that just worked with the AI Studio API side. And thus began the confusion.

I’ve been proud of the google-common based libraries. We tried many things that the community wanted — we had cross-platform compatibility for over a year before Google offered a library that did the same thing, and when Gemini 2.0 launched, we had compatibility within days, while Google took over a month to get its new JavaScript library out. We experimented with features such as a Security Manager and a Media Manager. We supported other models besides Gemini, with the first two being Gemma on the AI Studio platform and those from Anthropic on Vertex AI.

But the packages were confusing, and one was outdated. So it was time to find a better solution.

Simplifying the packaging

I’m thrilled that, going forward, we’ll be supporting one package: @langchain/google

pause for cheers and applause

As always, you’ll be able to install it with the package manager of your choice:

yarn add @langchain/google

Using the library should feel familiar. If you’re currently using the ChatGoogle class, you'll continue to do so, just from a new library:

import { ChatGoogle } from "@langchain/google";

const llm = new ChatGoogle({
  model: "gemini-3-pro-preview",
});

You’ll also be able to use the new LangChainJS 1 way of creating agents with the createAgent() function by just specifying the model, and it will use this new library.

import { createAgent } from "langchain";

const agent = createAgent({
  model: "gemini-3-flash-preview",
  tools: [],
});

(Until these changes are in place, this may not do what you expect. So beware.)

Just like the old libraries, the new library will continue to support API keys from both AI Studio and Vertex AI Express mode, as well as Google Cloud credentials for service accounts and individuals. Credentials can be provided explicitly in the code, loaded through environment variables where available, or resolved through ADC.
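Here’s a minimal sketch of what those options might look like, assuming the apiKey option and these patterns carry over from the current packages:

import { ChatGoogle } from "@langchain/google";

// Explicit API key (AI Studio, or Vertex AI Express mode)
const keyedModel = new ChatGoogle({
  model: "gemini-3-flash-preview",
  apiKey: process.env.GOOGLE_API_KEY,
});

// Or omit explicit credentials entirely and rely on environment variables
// or Application Default Credentials in a Google Cloud environment.
const adcModel = new ChatGoogle({
  model: "gemini-3-flash-preview",
});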

This new library uses REST behind the scenes, so it doesn’t depend on any Google library to communicate with Gemini. I learned a lot while building the original REST version, and worked closely with LangChainJS engineers to try and avoid some of the worst mistakes we made back then. Our hope is that this new library becomes a model for how REST-based libraries can look and work for other integrations.

Despite the rewrite, our goal was largely to keep things that used to work continuing to work, so you won’t need to make big code changes.

But LangChainJS 1 brings with it a lot of new features. And this new library is ready to use them.

Improved (and standard!) text and multimodal support

While there are many great features with both LangChainJS 1 and Gemini 3, I want to highlight one of the biggest new features that this library will be supporting.

LangChainJS 0 was mostly oriented around text — which all models supported when it was created. As models began to support multimodal input, and eventually output, the implementation was a bit haphazard and different for each model. LangChainJS 1 sought to standardize that.

Better ways to handle replies — text and multimodal

Previously, the response.content field would be either a string or an array of MessageContentComplex objects. Most tasks assumed it was a string, but if you needed multimodal support, this started getting messy.

LangChainJS 1 keeps response.content for backwards compatibility, and we've tried to respect that. So if you want text, you can still look here.

But the better way to get the text parts from the response is to use response.text, which now guarantees you will get a string. Something like this:

import { AIMessage } from "@langchain/core/messages";

const llm = new ChatGoogle({
  model: "gemini-3-flash-preview",
});
const result: AIMessage = await llm.invoke("Why is the sky blue?");
const answer: string = result.text;

If you need to differentiate between the “thinking” or “reasoning” parts of the response and the final response, or if you get multi-modal responses back, you can use the new response.contentBlocks field. This field is guaranteed to be an array of the new, consistent, ContentBlock.Standard objects.

For example:

const llm = new ChatGoogle({
  model: "gemini-3-pro-image-preview",
});
const prompt = "Draw a parrot sitting on a chain-link fence.";
const result: AIMessage = await llm.invoke(prompt);
result.contentBlocks.forEach((block: ContentBlock.Standard) => {
  if (!("text" in block)) {
    // Non-text blocks (such as generated images) get handled here;
    // saveToFile is a placeholder for your own handling.
    saveToFile(block);
  }
});

Sending multimodal input to Gemini

This ContentBlock.Standard also works for sending data to Gemini. For example:

import * as fs from "node:fs/promises";
import { HumanMessage } from "@langchain/core/messages";

const llm = new ChatGoogle({
  model: "gemini-3-flash-preview",
});
const dataPath = "src/chat_models/tests/data/blue-square.png";
const dataType = "image/png";
const data = await fs.readFile(dataPath);
const data64 = data.toString("base64");

const content: ContentBlock.Standard[] = [
  {
    type: "text",
    text: "What is in this image?",
  },
  {
    type: "image",
    data: data64,
    mimeType: dataType,
  },
];
const message = new HumanMessage({
  contentBlocks: content,
});
const result: AIMessage = await llm.invoke([message]);
console.log(result.text);

Similar tasks work for audio and video input as well.
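For example, here is a minimal sketch of sending an audio file, reusing the imports and model from the image example above. It assumes the standard "audio" block mirrors the shape of the "image" block, and the file path is hypothetical:

const audioPath = "src/chat_models/tests/data/sample.wav"; // hypothetical path
const audioData = await fs.readFile(audioPath);

const audioContent: ContentBlock.Standard[] = [
  {
    type: "text",
    text: "Transcribe this recording.",
  },
  {
    type: "audio",
    data: audioData.toString("base64"),
    mimeType: "audio/wav",
  },
];
const audioMessage = new HumanMessage({
  contentBlocks: audioContent,
});
const audioResult: AIMessage = await llm.invoke([audioMessage]);
console.log(audioResult.text);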

What’s missing, what’s next, and what do you want to see?

We plan to release an alpha version of this in early January 2026, with a final version within a month after.

There is still a lot of discussion around what will happen with the old versions of the library, and the LangChain team and I welcome your thoughts. My current thinking is:

  • They will receive a version bump when the new @langchain/google is released.
  • Both the older versions and this final bumped release will be marked as deprecated, with @langchain/google as the target replacement package.
  • This final release will actually delegate all functionality to @langchain/google - the old libraries will just be a thin veneer.
  • This will give you a little more time to migrate to the newer features without having to do extensive code changes.
  • I can’t guarantee full backwards compatibility, but the hope is that such issues will be minimal.

This first release of @langchain/google is also sure to be missing some features. We'd like to hear your feedback about what is most important to you. For example, here are some features that may not be available on day 1:

  • Embedding support
  • Batch support
  • Media manager
  • Security manager
  • Support for non-Gemini models (which are most important to you?)
  • Support for Veo and Imagen (how would you like to see these?)
  • Google’s Gemini Deep Thinking model and the Interactions API

You may have other features that you think are important — if so, we’d love to hear which ones. (And if you are willing to help integrate them — let’s talk.)

I, personally, want many of these features. But I want to get your feedback about what my priorities should be.

Some Personal Thanks

The past few months have been hectic for me, which is part of why this update has been delayed. I appreciate the support from the team at LangChain, from the community, and from my fellow GDEs. It means a lot to me when people tell me they’re using Gemini with LangChainJS.

Thanks to my employer for encouraging open source work, to LangChain for providing staff to assist with technical questions, and to Google for providing cloud credits to help make testing these updates possible and for sponsoring the #AISprintH2.

Very special thanks to Denis, Linda, Steven, Noble, and Mark who have always been there with technical and editorial advice, as well as a friendly voice when times got rough.

Very very special thanks to my family, who have always been there for me.

As many of you know, although I am both a Google Developer Expert and a LangChain Champion, I work for neither company. My work for the past two years on this project has been a labor of love because I appreciate the products that both Google and LangChain have delivered, and I want to make both better. I plan to continue that work — and I hope you are also out there, trying to make the world a better place in your own way.

2025-06-29

Gemini’s URL Context Tool and LangChainJS

 

Image by Imagen 4

Content and Context are the King and Queen of LLM applications. While LLMs themselves can “understand” what people are asking, it is only by providing additional content and context that we can build systems that avoid problems, such as hallucinations, and provide accurate and useful responses. Strategies such as Retrieval Augmented Generation (RAG) and using tools to include additional content are common ways of addressing this issue.

Google provides several specific tools that can provide relevant content directly through the Gemini API calls. This includes the Grounding with Google Search tool, which has extensive support in LangChainJS.

Google has now built on this by providing a URL Context tool which allows you to develop prompts that access specific information from and about publicly available web pages. Why is this important?

Can you see the problem?

As you may remember, Gemini can be accessed on two different platforms, AI Studio and Vertex AI, with several different auth methods involved. If you’re not familiar with developing for Gemini with LangChainJS, you should check out LangChain.js and Gemini: Getting Started.

For simplicity, we’ll be using the ChatGoogle class from the google-gauth package, but you can use whichever one meets your needs.

There are some cases where we might ask an LLM specific questions about a web page, or ask for a summary, or to compare two different web pages. So we might address this with something like this code:

import { ChatGoogle } from "@langchain/google-gauth";

const url = "https://js.langchain.com/";
const question = `Describe the contents of this web page: ${url}`;
const modelName = "gemini-2.0-flash-001";

const model = new ChatGoogle({
  modelName,
  temperature: 0,
});

const result = await model.invoke(question);
console.log(result.content);

The exact answer we get, of course, will vary from run to run, but here is an example of a pretty typical result:

I have analyzed the content of the webpage you provided: https://js.langchain.com/. Here's a breakdown of what it contains:

**Overall Purpose:**
The webpage serves as the **official documentation and entry point for Langchain.js**, the JavaScript/TypeScript version of the popular LangChain framework. LangChain is a framework designed to simplify the development of applications powered by large language models (LLMs).

**Key Content Areas:**

* **Introduction and Overview:**
* Explains what LangChain is and its purpose: building applications with LLMs.
* Highlights the core components and capabilities of LangChain.js.
* Provides a high-level overview of the framework's architecture.

* **Getting Started:**
* Guides users through the initial setup and installation process.
* Includes instructions on installing the necessary packages (likely via npm or yarn).
* Provides basic code examples to demonstrate how to use LangChain.js for simple tasks.

* **Modules/Components:**
* **Models:** Documentation on different LLMs that can be used with LangChain.js (e.g., OpenAI, Cohere, Hugging Face). Includes how to connect to these models and configure them.
* **Prompts:** Explains how to create and manage prompts for LLMs. Covers prompt templates, prompt engineering techniques, and how to optimize prompts for specific tasks.
* **Chains:** Details how to create chains of operations that link together different LLM calls and other components. This is a core concept in LangChain for building more complex workflows.
* **Indexes:** Covers how to index and retrieve data for use with LLMs. This is important for tasks like question answering over documents.
* **Memory:** Explains how to add memory to LLM applications, allowing them to remember previous interactions and maintain context.
* **Agents:** Describes how to create agents that can use LLMs to make decisions and take actions in the real world. This is a more advanced topic that involves planning and tool use.
* **Callbacks:** Explains how to use callbacks to monitor and control the execution of LangChain components.

* **Use Cases:**
* Provides examples of how LangChain.js can be used to build various types of applications, such as:
* Question answering
* Chatbots
* Text summarization
* Code generation
* Data extraction

* **Guides and Tutorials:**
* Offers more in-depth guides and tutorials on specific topics and use cases.
* Provides step-by-step instructions and code examples to help users learn how to use LangChain.js effectively.

* **API Reference:**
* Detailed documentation of all the classes, functions, and methods available in the LangChain.js library.
* Provides information on the parameters, return values, and usage of each API element.

* **Community and Support:**
* Links to the LangChain community forums, Discord server, and other resources for getting help and connecting with other users.
* Information on how to contribute to the LangChain.js project.

* **Blog:**
* Articles and updates on the latest developments in LangChain.js, including new features, bug fixes, and use cases.

**In summary, the Langchain.js website is a comprehensive resource for developers who want to use JavaScript or TypeScript to build applications powered by large language models. It provides everything from basic tutorials to advanced API documentation, making it easy to get started and build sophisticated LLM-powered applications.**

Reading through this, it looks pretty good, right? Did you notice any problems?

Did you notice, perhaps, that it hallucinated nearly all of it?

Finding the solution

While we can address this manually (such as by loading the URL ourselves and including it in the context), Gemini also provides the URL Context Tool that will handle this for us and use Google’s high speed network and cache to provide the context.

We can change the model configuration in LangChainJS to bind to the URL Context Tool with something like this:

import { GeminiTool } from "@langchain/google-common";

const urlContextTool: GeminiTool = {
  urlContext: {},
};
const tools = [urlContextTool];
const model = new ChatGoogle({
  modelName,
  temperature: 0,
}).bindTools(tools);

If we run this, we’ll get a different answer:

The webpage is the documentation site for LangChain, a framework for developing applications powered by large language models (LLMs). It highlights the JavaScript version of LangChain.

Here's a breakdown of the content:
* **Introduction:** LangChain simplifies the LLM application lifecycle, covering development, productionization, and deployment. It mentions LangGraph.js for building stateful agents.
* **Core Components:** It lists the open-source libraries that make up the framework: `@langchain/core`, `@langchain/community`, partner packages (like `@langchain/openai`), `langchain`, and LangGraph.js. It also mentions LangSmith for debugging, testing, and monitoring LLM applications.
* **Tutorials:** Offers tutorials for building various applications like simple LLM apps, chatbots, and agents.
* **How-To Guides:** Provides short answers to common questions and tasks.
* **Conceptual Guide:** Introduces key LangChain concepts.
* **API Reference:** Links to the full documentation of classes and methods.
* **Ecosystem:** Mentions LangSmith and LangGraph.
* **Additional Resources:** Includes information on security, integrations, and contributing.
* **Versions:** Allows you to select documentation for different versions of the library.
* **Migration Guides:** Provides guides for migrating between versions.

On the surface, you might think this is pretty similar to what we saw above.

The biggest difference, however, is that the page contains all of the elements that are addressed in the reply. Clearly, Gemini actually used the page content when computing an answer.

But how do we know that it has actually accessed the page to do so?

The Hidden Links

As part of the response, Gemini will tell us what URLs it tried to read and if there were any problems doing so. LangChainJS stores this as part of the response_metadata attribute in the result.

We can examine the results with something like this:

const result = await model.invoke(question);
const urlContextMetadata = result?.response_metadata?.url_context_metadata;
console.log( urlContextMetadata );

For our prompt above, we should get something like this. If we had asked about several pages, each would be listed here.

{
  urlMetadata: [
    {
      retrievedUrl: 'https://js.langchain.com/',
      urlRetrievalStatus: 'URL_RETRIEVAL_STATUS_SUCCESS'
    }
  ]
}

Conclusion

As we’ve seen, the Gemini URL Context tool provides a powerful and straightforward way to ground LLM responses with the context of specific web pages. By simply binding the tool to our LangChainJS model, we can ensure that Gemini has direct access to the content, hopefully preventing hallucinations and providing more accurate responses.

The ability to verify which URLs were accessed through the response_metadata gives us an extra layer of confidence that the model is using the information we provided.

You can even combine this with other tools, such as the Grounding with Google Search tool, to further enhance the quality of results. These tools, when combined with our own knowledge and business logic systems, let us move from simply prompting a model to building robust, context-aware AI systems.
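As a rough sketch of what combining these tools might look like, here is the earlier model set up with both the URL Context and Grounding with Google Search tools bound. The googleSearch field follows the Gemini API’s tool naming; check the documentation for your model version, and note that this reuses the modelName from the examples above:

const combinedTools: GeminiTool[] = [
  { urlContext: {} },
  { googleSearch: {} },
];
const groundedModel = new ChatGoogle({
  modelName,
  temperature: 0,
}).bindTools(combinedTools);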

Acknowledgements

The development of LangChainJS support for Gemini’s URL Context Tool and this documentation were all supported by Google Cloud Platform Credits provided by Google. My thanks to the teams at Google for their support.

Special thanks to Linda Lawton, Denis V., Steven Gray, and Noble Ackerson for their support and friendship.

2025-06-28

Text-to-Speech with Gemini and LangChainJS

Image by Imagen 4

One of the hottest features of Google's NotebookLM tool is the ability to turn a summary of your notes into a podcast, with two speakers chatting back and forth about the subject. What is particularly notable about these conversations is how natural they sound - not just because of the voices, but also because of the disfluencies and other noises that the two "hosts" make.

The same technology that backs NotebookLM's audio feature is now available through Gemini. Currently in preview, this lets you use current LLM tools, such as LangChainJS, to get Gemini to create high-quality audio output.

We'll take a look at how LangChainJS can be used to generate audio with a single voice, a conversation between two voices, and what additional tools are needed to use this audio.

Talking to myself

Gemini can be accessed on two different platforms, AI Studio and Vertex AI, with several different auth methods involved. If you're not familiar with developing for Gemini with LangChainJS, you should check out LangChain.js and Gemini: Getting Started.

For simplicity, we'll be using the ChatGoogle class from the google-gauth package, but you can use whichever one meets your needs. We do need to make sure we're configured to use the AI Studio API, since the preview models aren't available on Vertex AI yet, but this just means setting the "GOOGLE_API_KEY" environment variable before we test our code.

We'll get to the code shortly, but first there are a few things about configuration with the speech models that are different from how we'll use traditional models that we need to keep in mind.

Configuring for speech

First is that, at least currently, Gemini only supports two preview models:

  • gemini-2.5-flash-preview-tts
  • gemini-2.5-pro-preview-tts 

These models don't have all the features that are in the generally available Gemini 2.5 models, such as tools and function calling or the large context window, and they can only accept text input and generate audio output. While speech generation may eventually be folded into the general models, for now the TTS models stand alone. We'll explore some ways to handle this later.

Next, during model setup we need to specify both that we expect audio output (even though it is the only allowed output) and the name of the voice we will use. Google defines a fairly large data structure for specifying the voice, and we can pass that in the model configuration, but LangChainJS also allows the voice to be passed as just a string with its name.
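For comparison, here is roughly what the fuller Google-style structure looks like. The field names follow the Gemini API's speechConfig; treat this as an illustrative shape rather than the exact LangChainJS type:

// The verbose, Google API-style voice configuration.
const fullSpeechConfig = {
  voiceConfig: {
    prebuiltVoiceConfig: {
      voiceName: "Zubenelgenubi",
    },
  },
};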

Using the simpler string form, we can create and configure our model with something like this:

const modelName = "gemini-2.5-flash-preview-tts";
const responseModalities = ["AUDIO"];
const speechConfig = "Zubenelgenubi"; // The name of the voice to use
const model = new ChatGoogle({
  modelName,
  responseModalities,
  speechConfig,
});

Once we've configured the model, we need to know a little bit about how to prompt it to create our audio.

A simple prompt

In some ways, this part is the easiest. The prompt consists of directions to the model about what to do. This should include what we want the model to say, and possibly how we want it to say it.

Although the speech models are built on top of an LLM, and so have some foundational knowledge, we shouldn't expect them to use it terribly well. The model has been trained to produce audio output, so it may not handle other kinds of instructions well.

Tip: If you're familiar with older text-to-speech models that used the Speech Synthesis Markup Language (SSML), this isn't available with the Gemini TTS models. Instead, you can just use a more human language to describe what you want to hear.

So creating the prompt and having the model run it can look something like this:

const prompt = "Say cheerfully: Have a wonderful day!";
const result = await model.invoke(prompt);

Simple, right?

Getting the audio out of the result, however, does take a bit more work.

Listening to the results

Typically, when we have the model evaluate a prompt in LangChainJS, we get back an AIMessage object with the result as a string in the "content" field. Because we are getting back audio rather than text, however, we get a more complicated array of generated output, including a Record with "media" information.

This media Record contains a "data" attribute, whose value is the audio in PCM format that has been base64 encoded.

The question now becomes - how can you actually hear this audio?

If you're writing a node server, you might wrap it in a WAV file container and send this as a "data:" URI for the browser to play in an <audio> tag.
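Here is a minimal sketch of that server-side approach: wrapping the decoded PCM in a standard 44-byte WAV header and returning it as a data URI. The function name is hypothetical, it assumes 16-bit mono PCM at 24 kHz, and note that WAV expects little-endian samples, so you may need to swap bytes depending on the source format:

import { Buffer } from "node:buffer";

function pcmToWavDataUri(pcmBase64: string, sampleRate = 24000): string {
  const pcm = Buffer.from(pcmBase64, "base64");
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4);  // total chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);              // fmt sub-chunk size
  header.writeUInt16LE(1, 20);               // audio format: PCM
  header.writeUInt16LE(1, 22);               // channels: mono
  header.writeUInt32LE(sampleRate, 24);      // sample rate
  header.writeUInt32LE(sampleRate * 2, 28);  // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);               // block align
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);      // data sub-chunk size
  const wav = Buffer.concat([header, pcm]);
  return `data:audio/wav;base64,${wav.toString("base64")}`;
}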

We're going to use the "audio-buffer-from" and "audio-play" packages to create an audio buffer, with the correct formats, and then play it in a (relatively) device independent way. The audio buffer needs to be built using 16-bit big endian format with a sample rate of 24000.

With this, we can use code like this to get the base64 data from the content, create the audio buffer with this data and the format, and then play the audio buffer:

const audioContent = result?.content?.[0] as Record<string, any>;
const audioData64 = audioContent.data;
const audioDataBuffer = Buffer.from(audioData64, "base64");
const audioFormat = {
  format: {
    endianness: "be", // Big-Endian / network format
    type: "int16",    // 16 bit
    sampleRate: 24000,
  },
};
const audioBuffer = createBuffer(audioDataBuffer, audioFormat);
await play(audioBuffer);

A Gemini duet

Having Gemini generate audio is pretty neat, but it isn't quite the studio-host conversation experience that NotebookLM offers. We could certainly set up two models, each with a different voice, and pass each one its lines to generate the audio. Fortunately, Gemini offers a simpler solution that just requires some changes to the audio configuration and our prompt.

Configuring the cast of characters

As with the simpler audio, we need to configure the voices that will be used when Gemini generates the output. In this case, however, we need to specify not just the voices available, but also the name of the speaker that is attached to that voice. When we create our prompt later, we will use these speaker names to indicate which voice is used for each utterance.

Again, although you can use Google's speech configuration object definition, LangChainJS provides a simplified version. So we might define our two speakers, "Brian" and "Sarah", with something like this:

const speechConfig = [
  {
    speaker: "Brian",
    name: "Puck",
  },
  {
    speaker: "Sarah",
    name: "Kore",
  },
];


Scripting the audio

As usual, we'll invoke our model with a prompt. What is slightly different in this case is that the prompt should be formatted as a script, including the lines each role ("Sarah" and "Brian" in our example) will read, along with any instructions about how to read them and some overall guidance about what we expect.

For example, we might have a prompt with this conversation:

const prompt = `
  TTS the following conversation between Brian and Sarah.
  Pay attention to instructions about how each person speaks,
  and other sounds they may make.
  Brian: How's it going today, Sarah?
  Sarah: Not too bad, how about you?
  Brian: [Sighs and sounds tired] It has been a rough day.
  Brian: [Perks up] But the week should improve!
`;

Roll the audio

The other parts of the example, creating the model, running it, and getting the audio from it, are the same as with our single-speaker example.

For completeness, here is the full example:

import { ChatGoogle } from "@langchain/google-gauth";
import createBuffer from "audio-buffer-from";
import play from "audio-play";

const modelName = "gemini-2.5-flash-preview-tts";
const responseModalities = ["AUDIO"];
const speechConfig = [
  {
    speaker: "Brian",
    name: "Kore",
  },
  {
    speaker: "Sarah",
    name: "Puck",
  },
];
const model = new ChatGoogle({
  modelName,
  responseModalities,
  speechConfig,
});

const prompt = `
  TTS the following conversation between Brian and Sarah.
  Pay attention to instructions about how each person speaks,
  and other sounds they may make.
  Brian: How's it going today, Sarah?
  Sarah: Not too bad, how about you?
  Brian: [Sighs and sounds tired] It has been a rough day.
  Brian: [Perks up] But the week should improve!
`;

const result = await model.invoke(prompt);

const audioContent = result?.content?.[0] as Record<string, any>;
const audioData64 = audioContent.data;
const audioDataBuffer = Buffer.from(audioData64, "base64");
const audioFormat = {
  format: {
    endianness: "be", // Big-Endian / network format
    type: "int16",    // 16 bit
    sampleRate: 24000,
  },
};
const audioBuffer = createBuffer(audioDataBuffer, audioFormat);
await play(audioBuffer);

Combining with other LangChainJS components

One thing to keep in mind is that the TTS model, at least in this preview version, isn't a fully capable model. It does not provide, for example, access to function calls, Grounding with Google Search, or other useful tools.

How can we use components like these to create a nice audio segment, such as what NotebookLM provides?

The broad solution would be to use more traditional models, such as Gemini 2.0 Pro or Gemini 2.5 Flash, to research the topic and create the script, and then pass this script over to the TTS model to create the audio output. Remember - there is no need to use a single model or a single model call to do all the work.

One straightforward way to do this might be with a couple of functions - one that creates the script and another that turns that script into the audio.

The first of these might be written with something like this:

import { ChatGoogle } from "@langchain/google-gauth";
import { HumanMessage } from "@langchain/core/messages";

async function makeScript(
  topic: string,
  name1: string = "Brian",
  name2: string = "Sarah"
): Promise<string> {
  const modelName = "gemini-2.5-flash";
  const model = new ChatGoogle({
    modelName,
  });
  const history = [];

  const prompt1 = `
    Provide me a short, one paragraph, summary on the following topic:
    ${topic}
  `;
  history.push(new HumanMessage(prompt1));
  const result1 = await model.invoke(history);
  history.push(result1);

  const prompt2 = `
    Now turn this paragraph into a conversation between two people
    named ${name1} and ${name2}. It should be written as a script
    with the two people as the speakers and any optional notes about
    how they are responding (for example, their tone of voice) in
    square brackets at the beginning of the line. Each line in the
    script should be short and brief.

    Example script:
    ${name1}: Hello ${name2}. [Excited] It is good to see you!
    ${name2}: [Surprised] Oh! Hi there. Good to see you too.

    Script:
  `;
  history.push(new HumanMessage(prompt2));
  const result2 = await model.invoke(history);
  return result2.content as string;
}

This is a fairly straightforward LangChainJS function, although it has prompts and models hard-coded in, so it wouldn't be very good in production. We have two prompts that we use in our conversation, the first asks about information on a topic, and the second prompts the model to turn it into the script. This is a good division of tasks - using one prompt to collect information, and then another to format it the way we need it.

Once we have the script, we'll pass it to another function:

async function readScript(
  script: string,
  name1: string = "Brian",
  name2: string = "Sarah"
): Promise<void> {
  const modelName = "gemini-2.5-pro-preview-tts";
  const responseModalities = ["AUDIO"];
  const speechConfig = [
    {
      speaker: name1,
      name: "Puck",
    },
    {
      speaker: name2,
      name: "Kore",
    },
  ];
  const model = new ChatGoogle({
    modelName,
    responseModalities,
    speechConfig,
  });

  const prompt = `
    TTS the following conversation between ${name1} and ${name2}.
    Pay attention to instructions about how each person speaks,
    and other sounds they may make.
    ${script}
  `;
  const result = await model.invoke(prompt);

  const audioContent = result?.content?.[0] as Record<string, any>;
  const audioData64 = audioContent.data;
  const audioDataBuffer = Buffer.from(audioData64, "base64");
  const audioFormat = {
    format: {
      endianness: "be", // Big-Endian / network format
      type: "int16",    // 16 bit
      sampleRate: 24000,
    },
  };
  const audioBuffer = createBuffer(audioDataBuffer, audioFormat);
  await play(audioBuffer);
}

This function should look fairly familiar by now. The biggest difference is that the script comes in as a parameter, and the function adds some additional instructions before sending it to the model to be read aloud.

We combine them together and call them with a topic:

async function talkAbout(topic: string): Promise<void> {
  const script = await makeScript(topic);
  console.log(script);
  await readScript(script);
}

await talkAbout("Sharks");

This is a fairly simple example. It won't work for very long topics or conversations, for example, but is meant to illustrate the concept.

A more robust implementation would make use of something like LangGraph to handle the various phases of this task (generating the conversation vs playing the audio) and include portions that validate the output at each step. Done correctly, it could also break up each part to allow for a larger generated script and still play back each audio segment more quickly. If there is interest, I may write about this in the future.
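As a rough sketch of that direction, here is what a simple two-node LangGraph might look like, reusing the makeScript and readScript functions from above and leaving out the validation and chunking steps:

import { StateGraph, Annotation, START, END } from "@langchain/langgraph";

// The state carries the topic in and the generated script between nodes.
const PodcastState = Annotation.Root({
  topic: Annotation<string>,
  script: Annotation<string>,
});

const podcastGraph = new StateGraph(PodcastState)
  .addNode("writeScript", async (state) => ({
    script: await makeScript(state.topic),
  }))
  .addNode("performScript", async (state) => {
    await readScript(state.script);
    return {};
  })
  .addEdge(START, "writeScript")
  .addEdge("writeScript", "performScript")
  .addEdge("performScript", END)
  .compile();

await podcastGraph.invoke({ topic: "Sharks" });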

Conclusion

The new text-to-speech models in Gemini, accessible through LangChainJS, open up a world of possibilities for creating rich, audio-based applications. As we've seen, generating both single-voice and multi-voice audio requires just a few configuration changes.

But the real power comes from combining the specialized TTS models with the broader capabilities of other Gemini models, allowing you to build complex workflows that can research a topic, generate a script, and then perform it as a natural-sounding conversation.

While this technology is still in its preview phase, it already provides a powerful and easy-to-use toolset. The examples here provide a starting point for your own explorations.

Go ahead and experiment with different voices, prompts, and chains to see what you can create. I can't wait to see... or hear... what you create.

Acknowledgements

The development of LangChainJS support for Gemini 2.5 TTS models and this documentation were all supported by Google Cloud Platform Credits provided by Google. My thanks to the teams at Google for their support.

Special thanks to Linda Lawton, Denis V., Steven Gray, and Noble Ackerson for their support and friendship.