The intelligent modern data stack
Briefings highlight generational AI scaleups, startups, and projects that are defining new categories and changing how we live & work. For this issue, I am fortunate to have chatted with two of the four co-founders of Numbers Station, Christopher Aberger and Ines Chami. Read more to learn about their story and where they are taking Numbers Station next.
Numbers Station is building an intelligent data stack tool powered by foundation models so data workers spend less time on mundane data tasks and more time generating insights. They are taking a layered approach to the problem space, focusing first on the data transformation stage.
Why Numbers Station is a generational startup:
Founded by top-tier researchers and experienced operators
Chris Aberger was one of the first engineers at SambaNova Systems, an integrated hardware & software unicorn, where he led the machine learning team
Ines Chami co-authored the seminal paper on using foundation models on structured data
Sen Wu co-authored some of the most widely cited AI papers, including the research that led to now-AI unicorn Snorkel
Chris Re leads the Hazy Research group at Stanford which is, in my opinion, producing some of the most productizable AI research. Re is also a serial entrepreneur, having started two companies sold to Apple (Lattice and Inductiv) and two AI unicorns (SambaNova, Snorkel), and he advises many more
Differentiated approach to the data lifecycle journey
They are starting with the first step of any data process: data preparation. A large portion of generative-AI-for-data startups focus on text-to-SQL using off-the-shelf models
They build customized models for customers that are as accurate as OpenAI’s models yet are 700x smaller and more than 2000x cheaper to run
Large growing market
Estimates show that current spend on data preparation & data analytics tools is $25-30 billion, growing 10% a year. If the trend continues, the incremental spend of $2-3 billion a year is a large opportunity on its own.
Telling the CEO of your first employer that you’d consider yourself a failure if you didn’t resign & start your own company in four years is very unusual. But that is what Chris Aberger told Rodrigo Liang, co-founder of SambaNova Systems, when he joined the company as one of the first engineers. Four years later, Chris co-founded Numbers Station along with Ines Chami, Sen Wu, and Chris Re.
Ines decided to become a founder in order to learn. She could have joined the AI team of a big tech company to continue her research, but she thought that nothing beats learning from building a company. “Zero regrets. I’m learning more than I expected,” she says, reflecting on her founder journey so far. “My biggest learning is learning how to execute a research vision from ideation to production. In research, we have to prove concepts and develop prototypes. But in startups, it is solving last mile problems and delivering a working product.”
Their vision is that foundation models can automate complex data-intensive workflows. The seminal paper on applying foundation models to data tasks, co-authored by Ines, Chris Re, and others, found that large foundation models generalize and achieve state-of-the-art performance on data cleaning and integration tasks, even though they are not trained for these tasks. While the paper was first published in May 2022, the team had already been building foundation models for years. They all saw the potential of the new technology well before the ChatGPT-induced generative AI hype.
If you had tried to explain foundation models to business executives last year, they’d have been annoyed at you for wasting their time. Today, they’d be annoyed that you’re assuming they don’t know what foundation models are. That is great, according to Ines. “The tune has changed so much since we started Numbers Station. Now we can focus on talking about the value of using these models to their data.”
The top three use cases of foundation models were helping writers write, coders code, and data analysts analyze. Writing assistants were quickly commoditized. GitHub Copilot seems to have won the mindshare for coding assistants. But the story for the data space is still being written. The data stack is complicated, with multiple vendors for each layer. Because the market is large, the space is hot, with startups looking for a wedge and incumbents defending their positions. Dozens of “generative-AI-for-data” startups were launched in the past six months. Incumbents Microsoft, Google, Tableau, and ThoughtSpot quickly launched their own products as well.
Amidst this frenzied pace of new product launches, Numbers Station is instead focusing on how to execute better internally. Ines’ view is that “The external competition is there but our pace remains [very fast]. If anything, the market is being educated faster so we can now tell users how we’re differentiated.” Aberger adds that they’re not dismissing competition. “There are a lot of really smart people and companies looking at related problems and we definitely study them to see what is going on, if there are things to learn, and how to get better ourselves. In general, I like to worry about things we can control, so [that’s] focusing internally on our execution and how to build a generational product.”
What differentiates Numbers Station
What makes the team stand out is their complementary experience at the forefront of large data systems & cutting-edge AI. Aberger led machine learning at SambaNova, which has been training LLMs for enterprises since 2020, while the other co-founders have been pioneering AI research at Stanford. Their experience is reflected in their distinctive approach to the problem and in their product’s technical edge.
The process of going from raw data to BI dashboards is complex. Enterprises hire a dedicated team and purchase several specialized tools to manage the process. A large proportion of generative-AI-for-data startups use foundation models to generate SQL statements to run on data warehouses. Numbers Station instead started from first principles: where does the problem start? It starts with data preparation, so their first product is focused on data preparation. In this step, SQL is just one of many tools. Cleaning typos, matching records, and un-SQL-able data transformations require different tools, so they’ve built a suite of tools for data preparation. One of these is AI transformation, a single tool to freely transform data. It can be used to judge the sentiment of a text, summarize other data entries, and correct typos. Users don’t have to train separate machine learning models, learn regex, or slog through transforming data by hand.
That may sound like a simple application of foundation models. Anyone can sign up for ChatGPT and instruct it to transform each row in a data table. This is where the team’s technical prowess shines. Applying off-the-shelf general foundation models to large data sets in an enterprise setting would be prohibitively expensive and slow. In an experiment evaluating Numbers Station’s customized models, their team showed that they can build a model 700x smaller than OpenAI’s models but with similar accuracy. A smaller model runs much faster and cheaper. To illustrate the affordability and scalability of their customized models, they compared the cost of running sentiment analysis on 1M rows: Numbers Station’s inference costs $1.7 compared to OpenAI’s $3.7K for GPT-4, a 2275x cost reduction.
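The cost comparison above can be checked with a few lines of arithmetic. This is a minimal sketch using only the rounded figures quoted in the text; the function names are illustrative. Note that the rounded figures give a ratio of roughly 2,176x, so the 2275x quoted above presumably comes from unrounded costs, which is an assumption on my part.

```python
# Figures as quoted in the article (rounded), for sentiment analysis on 1M rows.
ROWS = 1_000_000
NUMBERS_STATION_COST_USD = 1.7
GPT4_COST_USD = 3_700.0


def cost_per_row(total_cost_usd: float, rows: int) -> float:
    """Average inference cost for a single row."""
    return total_cost_usd / rows


def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times more expensive the first option is than the second."""
    return expensive_usd / cheap_usd


ratio = cost_ratio(GPT4_COST_USD, NUMBERS_STATION_COST_USD)
print(f"GPT-4 per row:           ${cost_per_row(GPT4_COST_USD, ROWS):.6f}")
print(f"Numbers Station per row: ${cost_per_row(NUMBERS_STATION_COST_USD, ROWS):.7f}")
print(f"Cost ratio from rounded figures: about {ratio:.0f}x")
```

Per row, that is well under a cent either way; the gap only becomes decisive at enterprise scale, where tables run to millions or billions of rows.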
What’s next for Numbers Station
Data prep is a large market on its own, ~$5 billion by some estimates. But Numbers Station’s goal is much bigger: automating the entire data stack. Data prep is step one. Automating the semantic layer is next, according to Aberger. Data prep creates clean datasets. But understanding clean data is also a problem.
In a large organization, different teams will have different definitions of what an “active user” is and even how to calculate “revenue”. Should an active user be one who has logged in within the past 7 days, or one who has done a set of activities in the past 30 days? Should the foreign exchange rate at the end of the month, on the date of billing, or on the date of wire transfer be used to aggregate global revenue into a single currency? These may seem trivial, but teams do spend days & weeks resolving such inconsistencies. The semantic layer provides a common understanding of an organization's data, ensures that the data is consistent and trustworthy, and helps avoid duplicative work.
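To make the “active user” disagreement concrete, here is a minimal sketch in which two teams apply the two definitions above to the same event log and get different answers. The event data, the function names, and the choice of "purchase" as the qualifying activity are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, event_type, event_date).
EVENTS = [
    ("alice", "login", date(2023, 6, 28)),
    ("alice", "purchase", date(2023, 6, 28)),
    ("bob", "login", date(2023, 6, 10)),
    ("bob", "purchase", date(2023, 6, 10)),
    ("carol", "login", date(2023, 6, 29)),
]
TODAY = date(2023, 6, 30)


def active_by_login(events, today, days=7):
    """Team A: a user is active if they logged in within the last `days` days."""
    cutoff = today - timedelta(days=days)
    return {user for user, kind, day in events if kind == "login" and day >= cutoff}


def active_by_activity(events, today, days=30, qualifying=("purchase",)):
    """Team B: a user is active if they did a qualifying activity in the last `days` days."""
    cutoff = today - timedelta(days=days)
    return {user for user, kind, day in events if kind in qualifying and day >= cutoff}


team_a = active_by_login(EVENTS, TODAY)
team_b = active_by_activity(EVENTS, TODAY)
print(sorted(team_a))  # ['alice', 'carol']
print(sorted(team_b))  # ['alice', 'bob']
```

Both teams report two active users, but they only agree on one of them. A semantic layer would pin down a single definition so that every dashboard counts the same people.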
The product is still in private beta, but they told me to expect general availability in a couple of months. This is exciting news if you’re someone who has had to spend days correcting typos and standardizing zip codes (like me).
Product Problem Space Notes
Their product is still in private preview, so instead of writing about it, this section describes the problem space based on the primer Generative AI for Modern Business Intelligence.
Modern BI problems & generative AI opportunities
The modularity of the modern data stack (MDS) and modern business intelligence (MBI), while beneficial, has introduced new problems: disconnected tools and unmanageable data swamps. An MDS often consists of disconnected tools that need specialized knowledge to integrate. The scalability of storage & compute also leads to a store-everything mentality, creating unmanageable data swamps in SQL-centric data stores. With data being ingested from different sources, understanding the context becomes difficult. Tracing entities and tables back to their origin becomes increasingly perplexing because each step loses context, sometimes to the point of requiring another tool to decipher it. Consequently, teams struggle to identify the source of truth, leading to ad-hoc, bespoke tables for answering specific questions. This creates "data debt" as these one-off solutions accumulate over time.
These problems present opportunities for generative AI to unlock the value of MBI.
Data munging / transformation - where Numbers Station is today
Pain point: Data preparation tasks such as classification, transformation, and cleaning are time-consuming and tedious.
Status quo: Data analysts and engineers spend a significant amount of time on manual data wrangling, which slows down the analysis process.
Use case solution: Generative AI automates data preparation tasks. For example, suppose an organization has inconsistent date formats across multiple data sources. The AI identifies the discrepancies, standardizes the date formats, and cleans the data.
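For the date-format example, the sketch below shows the kind of output such a cleaning step should produce. It is a rule-based stand-in written for illustration, not a generative model and not Numbers Station's method; the list of candidate formats and the function name are assumptions.

```python
from datetime import datetime
from typing import Optional

# Assumed set of formats seen across sources; a real pipeline would discover these.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]


def standardize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # Try the next candidate format.
    return None


raw_values = ["2023-05-01", "05/01/2023", "1 May 2023", "May 1, 2023", "not a date"]
cleaned = [standardize_date(value) for value in raw_values]
print(cleaned)
```

The hand-written version also shows why the task is tedious: a value like 05/01/2023 is read here as month/day/year, which is a guess that only context (the source system, the neighboring rows) can confirm. That contextual judgment is the part a model is expected to help with.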
Data documentation (table-to-text and SQL-to-text) - where Numbers Station is going next
Pain point: Understanding and navigating complex data structures is challenging, especially when documentation is lacking or outdated.
Status quo: Data documentation is often created manually, which is time-consuming, error-prone, and difficult to maintain.
Use case solution: Generative AI generates dataset documentation, including descriptions of fields, data types, and relationships between tables. For example, when given a database schema for an e-commerce platform, the AI can create a document explaining each table (e.g., orders, customers, products) and their relationships.
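As a rough illustration of the e-commerce example, the sketch below pulls tables, columns, types, and foreign-key relationships out of a SQLite database. These are the structured facts a generative model would be given to turn into readable documentation; the schema and function name are hypothetical, and the prose-writing step itself is not shown.

```python
import sqlite3

# Hypothetical e-commerce schema.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    product_id INTEGER REFERENCES products(id)
);
"""


def describe_schema(conn: sqlite3.Connection) -> str:
    """List every table with its columns, types, and foreign-key relationships."""
    lines = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        lines.append(f"Table {table}")
        # table_info rows: cid, name, type, notnull, dflt_value, pk
        for _cid, name, col_type, _notnull, _default, _pk in conn.execute(
                f"PRAGMA table_info({table})"):
            lines.append(f"  column {name} ({col_type})")
        # foreign_key_list rows: id, seq, table, from, to, on_update, on_delete, match
        for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
            lines.append(f"  {table}.{fk[3]} references {fk[2]}.{fk[4]}")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
doc = describe_schema(conn)
print(doc)
```

Extracting structure is the easy half. The hard half, and the one the pain point above is about, is explaining what each field means to the business, which no schema records.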
Natural language querying (usually but not limited to text-to-SQL)
Pain point: Non-technical users struggle to extract insights from data stored in databases due to the learning curve associated with SQL or other query languages.
Status quo: Business users often rely on data analysts or engineers to write SQL queries, which is time-consuming and creates bottlenecks in decision-making.
Use case solution: Generative AI converts natural language queries into SQL code. For example, a product manager asks, "What is the average revenue per user for the past three months?" The AI generates the SQL query, retrieves the information, and presents the result to the user.
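For the product manager's question, the generated query might look something like the sketch below. The payments table, its columns, and the hand-written SQL are all hypothetical; the SQL simply stands in for what a text-to-SQL model could return, run here against an in-memory SQLite database with a fixed "today" so the result is reproducible.

```python
import sqlite3

# Hypothetical payments table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (user_id TEXT, amount REAL, paid_on TEXT)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("u1", 30.0, "2023-06-15"),
        ("u1", 20.0, "2023-05-01"),
        ("u2", 50.0, "2023-04-20"),
        ("u3", 99.0, "2023-01-10"),  # Older than three months, so excluded.
    ],
)

# Stand-in for model-generated SQL: total revenue divided by distinct paying users.
ARPU_SQL = """
SELECT SUM(amount) / COUNT(DISTINCT user_id)
FROM payments
WHERE paid_on >= date(:today, '-3 months')
"""


def average_revenue_per_user(conn: sqlite3.Connection, today: str):
    """Run the query for the three months ending on `today` (ISO date string)."""
    return conn.execute(ARPU_SQL, {"today": today}).fetchone()[0]


arpu = average_revenue_per_user(conn, "2023-06-30")
print(arpu)  # 50.0
```

Even this small query bakes in a definition: "per user" here means users who paid in the window, not all registered users. A model that picks a definition silently is exactly why natural language querying depends on the shared definitions a semantic layer provides.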