The Quild Future Unicorn is a weekly product-focused note highlighting early-stage startups with statistically significant signals of becoming unicorns.
LangChain is an open-source project that helps developers create applications on top of large language models (LLMs). LLMs require a unique set of tooling to maximize their usefulness and to put guardrails around them. I used LangChain to build GPT-VC 3.0 (Streamlit cloud is having stability issues so it may not load).
LangChain is currently an open-source project; the company is still in stealth.
Founder: Harrison Chase (and maybe others, to be confirmed)
Signals:
Venture-backed experience
Harrison was a machine learning engineer at Robust Intelligence (3+ yrs)
Traction
Grew to 7.2K GitHub stars in 4 months (started with a tweet in October)
Packed community events (see picture below)
Top investor
Benchmark invested (to be confirmed)
Top university alumni
Harrison graduated from Harvard University
The Future Unicorn series is powered by Specter, a data intelligence provider for the world's leading investors like Accel and Bessemer. I have been working with data-driven tools for venture capital for a long time, and Specter's is the best one.
Project Notes
Since the company is in stealth, there is no commercial product yet. This section will be about the LangChain project.
Pain point
A key challenge of working with large language models is coaxing them to produce the desired outputs within a limited context window, commonly referred to as a prompt. OpenAI’s latest GPT-3 model (text-davinci-003) can process up to ~3,000 words (4,000 tokens) per prompt.
For everyday use cases, like writing a blog post, a story, a tabloid, etc., <100 words is enough. But for more complex use cases, like a customer service chatbot, ~3,000 words is not enough to fit a company’s customer service policies and processes (and it's probably not a good idea to try). LLMs have no built-in knowledge of any company’s policies and databases. Developers have to figure out how to bring external knowledge into the model by chaining together different tools or software modules. In the chatbot example above, a document-fetching tool that extracts the most relevant customer support policies for each interaction would help the LLM understand the context without exceeding the context window limit.
Product
LangChain helps developers chain together different primitives. A primitive can be a prompt template, a tool, an LLM, or even another chain. It's like LEGO. We’ll discuss the main primitives.
Prompt template
An LLM takes a prompt as input to produce an output. A prompt template is a reproducible way of creating prompts with input variables that can be determined by the user or other tools. Here’s a prompt template that generates a company name based on a product, which is the input variable.
from langchain import PromptTemplate

template = """
I want you to act as a naming consultant for new companies.

Here are some examples of good company names:
- search engine, Google
- social media, Facebook
- video sharing, YouTube

The name should be short, catchy and easy to remember.

What is a good name for a company that makes {product}?
"""

prompt = PromptTemplate(
    input_variables=["product"],
    template=template,
)
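To make the mechanics concrete, here is the same idea in plain Python, with no LangChain required: a reusable string with a named input variable that gets filled in at request time, which is essentially what PromptTemplate's format method does. (The template text here is a trimmed-down stand-in, not LangChain code.)

```python
# Plain-Python sketch of what a prompt template does:
# a reusable string with a named input variable.
template = (
    "I want you to act as a naming consultant for new companies.\n"
    "What is a good name for a company that makes {product}?"
)

# Fill in the input variable at request time.
filled_prompt = template.format(product="colorful socks")
print(filled_prompt)
```

The filled-in string is what actually gets sent to the LLM.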
Large language model
LangChain provides a common API/syntax to use the LLMs from different providers like OpenAI, Cohere, Hugging Face, and AI21 Labs. This makes it easy for developers to use and test different LLMs.
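A hedged sketch of what that common interface buys you: if every provider's model is exposed as a callable that takes a prompt string and returns a completion string, application code can stay provider-agnostic. The generate_name helper and the stub LLM below are illustrative, not LangChain's actual classes.

```python
from typing import Callable

# Any LLM wrapper that maps a prompt string to a completion string.
# LangChain's provider wrappers (OpenAI, Cohere, etc.) share this call shape.
LLM = Callable[[str], str]

def generate_name(llm: LLM, product: str) -> str:
    """Application code written once, usable with any provider's LLM."""
    prompt = f"What is a good name for a company that makes {product}?"
    return llm(prompt)

# A stub LLM for demonstration; in practice you would pass a real wrapper.
stub_llm: LLM = lambda prompt: "SockCo"
print(generate_name(stub_llm, "colorful socks"))
```

Swapping providers then means changing only the object you pass in, not the surrounding application code.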
Document loaders
LangChain has a lot of pre-built integrations to load documents of different types (e.g. PDF, HTML) from different sources (e.g. Notion, Google Drive, s3, websites). Loading documents makes them easily accessible, either for direct use or for processing and storage inside a vector database.
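A hypothetical sketch of the second path: once a loader has pulled a document's text, it is typically split into overlapping chunks before being embedded and stored in a vector database. The splitter below illustrates the idea and is not LangChain's own implementation.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a loaded document into overlapping chunks for embedding.

    Overlap keeps a sentence that straddles a boundary visible in both
    neighboring chunks, so retrieval doesn't miss it.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then embedded and indexed, so that later only the most relevant chunks need to be pulled into the prompt.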
Utilities
Utilities are a catch-all set of integrations for accessing third-party services, aside from document loaders. One of the most common utilities is the Google Search API, which lets developers fetch Google search results to feed into the prompt.
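To illustrate the pattern with a stubbed search function (not the real Google Search API), fetching results and splicing them into the prompt looks roughly like this:

```python
from typing import Callable

def augment_prompt(question: str, search: Callable[[str], str]) -> str:
    """Prepend fresh search results so the LLM can answer with context."""
    results = search(question)
    return (
        "Answer using the context below.\n\n"
        f"Context:\n{results}\n\n"
        f"Question: {question}"
    )

# Stub standing in for a Google Search API wrapper.
fake_search = lambda q: "Result 1: ...\nResult 2: ..."
print(augment_prompt("Who won the 2022 World Cup?", fake_search))
```

This is how a chain can give an LLM access to information that postdates its training data.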
From these primitives, developers can create a chain that takes an end user's input, searches for relevant documents, feeds those into the prompt template, and generates an output from the selected LLM. LangChain also has some complex utility chains available off the shelf, like a Moderation chain that detects inappropriate text and a Math chain that translates a query into Python code and runs it as a calculator (LLMs are not good at math).
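The end-to-end flow described above can be sketched with stubs. The retriever and LLM here are placeholders, not LangChain components; the point is the shape of the chain: retrieve, fill the template, call the model.

```python
from typing import Callable

def qa_chain(question: str,
             retrieve: Callable[[str], str],
             llm: Callable[[str], str]) -> str:
    """Chain: retrieve relevant documents -> fill prompt template -> call LLM."""
    context = retrieve(question)          # document-fetching tool
    filled = (                            # prompt template
        "Use the policy excerpts below to answer.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
    return llm(filled)                    # selected LLM

# Stubs standing in for a vector-store retriever and a provider's LLM.
answer = qa_chain(
    "What is the refund window?",
    retrieve=lambda q: "Policy: refunds accepted within 30 days.",
    llm=lambda p: "Refunds are accepted within 30 days.",
)
print(answer)
```

Each stage is swappable, which is exactly the LEGO-like composability the project is built around.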