Tuesday, October 7, 2025

DataChat Delivers Data Exploration with a Dose of GenAI

DataChat turns data into insights with natural language! No SQL/Python is needed. Business users and analysts are empowered!

What if you tell the computer how you want to explore a data set, and the computer automatically executes the analysis and delivers the results? That’s the idea behind DataChat, a generative AI-based data exploration and analytics tool that spun out of a University of Wisconsin-Madison research project and is now a commercial product.

Jignesh Patel, who is currently a computer science professor at Carnegie Mellon University and a co-founder of DataChat, recently sat down virtually with Datanami to chat about the nature of data exploration in the generative AI era and the new DataChat offering, which formally launched earlier this month at the Gartner Data & Analytics Summit.

The impetus for creating DataChat started in 2016 when Patel was working as a computer science professor at the University of Wisconsin-Madison and the CTO of Pivotal (now a part of VMware Tanzu and parent company Broadcom). The big data explosion was in full swing, Hadoop was the rallying point for new distributed frameworks, and data scientists were in big demand.

While the technology was evolving quickly, too many companies were spinning their wheels on data analytics and exploration, and Patel sensed that something was missing from the equation.

“Every CTO their first objective was to hire an army of data scientists. They couldn’t get enough of data scientists,” Patel said. “And I started to observe how data scientists work in the very early days. It’s all ad-hoc analytics. It’s unscripted, unlike the BI world, and you’re trying to get something from data in a non-linear path.”

Much of this data exploration work was done manually, using tools like Jupyter’s data science notebooks. Data scientists would explore a particular data set until something interesting popped out, then figure out a way to extract that particular piece of data, transform it into a more useful form, and then pipe it into a machine learning algorithm, which could be used in an application.

Patel recognized the pattern lent itself to some form of automation, one that was preferably more approachable by non-experts.

“They were doing this by breaking the problem down, step by step, then trying to find code somewhere on the Web and retrofit it inside. And that’s how many cells get constructed in notebooks,” he said. “So we wrote a paper in 2017 to say, what if we could have this data science cell be filled up by the user just expressing that in natural language?”

This was pre-ChatGPT days, of course, and the state-of-the-art in natural language processing (NLP) was nowhere near what it is today. While the NLP tech would improve, Patel and his University of Wisconsin PhD graduate student, Rogers Jeffrey Leo John, did the hard work of constructing a compact control language that could sit between the user and the underlying SQL and Python code that would query data and call machine learning algorithms, respectively.

“The intermediate [language]… was great because now we could take any arbitrary language, convert that into that intermediate language, and now convert that into SQL and Python,” Patel said. “Because that’s what you must do if you’re talking to a SQL database, doing ETL. If you want to build machine learning models, you have to cross the two main languages of data science, which are SQL and Python.”
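To make the idea of an intermediate language concrete, here is a minimal sketch in Python. It is not DataChat's actual language or code: the `CountBy` step, the `to_sql` and `run_in_python` functions, and the `customers` table are all hypothetical names invented for illustration. The point is only that one small, structured step can be compiled to SQL for a database or executed directly in Python, and both paths give the same answer.

```python
import sqlite3
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CountBy:
    """Hypothetical intermediate step: count rows per value of one column."""
    table: str
    column: str


def to_sql(step: CountBy) -> str:
    # Compile the step to SQL. Identifiers are double-quoted here; a real
    # system would also validate them against the known schema.
    return (
        f'SELECT "{step.column}", COUNT(*) FROM "{step.table}" '
        f'GROUP BY "{step.column}" ORDER BY "{step.column}"'
    )


def run_in_python(step: CountBy, rows: list) -> list:
    # Execute the same step on in-memory rows (a list of dicts).
    counts = Counter(row[step.column] for row in rows)
    return sorted(counts.items())


rows = [{"contract": "monthly"}, {"contract": "annual"}, {"contract": "monthly"}]
step = CountBy(table="customers", column="contract")

# Path 1: compile to SQL and run it in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (contract TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)", [(r["contract"],) for r in rows]
)
sql_result = conn.execute(to_sql(step)).fetchall()

# Path 2: run the same step directly in Python.
python_result = run_in_python(step, rows)
```

Because the step is plain data rather than free text, the natural-language front end can change without touching either back end.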

A Natural Language for Data Science

The goal of DataChat was to create a data analytics and exploration tool that could follow simple English instructions, reducing the need for users to know SQL or Python to be productive with data. Users can type in simple commands such as “create a visualization for customer churn,” and the product will automatically produce a visualization based on the data.

Patel said the idea is for DataChat to be interactive and have a natural flow. Sitting behind a spreadsheet-like interface, users can fire off questions about the data. Not every question posed to DataChat will generate a reliable answer immediately. However, the give and take allows the product and the user to move forward predictably.

“You ask, and you get,” Patel said. “And we also tell you the steps when you get something back. There’s a give and take. I’m going to ask you something, it didn’t make sense, and you ask in a slightly different way, but I’m making progress at every step.”

Business users, data analysts, and data scientists are the targeted users for DataChat. For business users and data analysts, the goal is to elevate their skills in the data science realm without a lot of training. Data scientists often use DataChat to give them an idea of what’s in a new data set.

“They might just be poking at it [in] DataChat and saying, ‘Hey, how many null values do I have in three of my critical columns?’” Patel said. “Instead of writing a SQL query, they just point, click, or ask and get that answer, and it’s just much faster. They could write it, but they’re getting time benefit from using this.”
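For a sense of what that question looks like when written by hand, here is a small sketch using Python's built-in `sqlite3` module. The `customers` table, its three columns, and the `null_counts` helper are assumptions made up for this example; they are not taken from DataChat. The query relies on the fact that `COUNT(*)` counts every row while `COUNT(column)` skips nulls, so the difference is the number of nulls.

```python
import sqlite3


def null_counts(conn: sqlite3.Connection, table: str, columns: list) -> dict:
    # COUNT(*) counts all rows; COUNT("col") ignores NULLs.
    # The difference is the number of NULL values in that column.
    exprs = ", ".join(f'COUNT(*) - COUNT("{c}")' for c in columns)
    row = conn.execute(f'SELECT {exprs} FROM "{table}"').fetchone()
    return dict(zip(columns, row))


# Hypothetical sample data with a few missing values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, contract TEXT, insured INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (34, "monthly", 1),
        (None, "annual", None),
        (51, None, 0),
        (None, "monthly", 1),
    ],
)

result = null_counts(conn, "customers", ["age", "contract", "insured"])
```

A data scientist could certainly write this, which is Patel's point: the saving is in time, not in capability.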

A DataChat workflow can generate three artifacts from data sitting in anything from an Excel workbook to a data warehouse in Databricks or Snowflake: a report, a chart, or a machine learning model, including regression, classification, and time series. Each workflow will be accompanied by an explanation of how and why it generated the answer that it did, which is an important feature of the product, Patel said.

For a model on churn, DataChat won’t generate “some crazy technical answer,” he said. “But it’s going to say, ‘Okay, these three things–the age of the person, the contract type, and whether they have bought insurance or not. And this is 60% of the influence or 20% and 10%, and here are the things that it’s not influencing based on the data.’”

That level of transparency is critical in data science, Patel said. “From day one, we’ve been thinking about solving data science, and science requires transparency, so that’s built into the philosophy of the product,” he said.

The Shifting Grounds of NLP

DataChat was first registered as a company in 2017 and raised $4 million in a seed round in 2020 (it has since raised another $25 million). In 2017, Patel and John slogged their way forward with the NLP technology of the day, which wasn’t nearly as powerful or as easy to use as today’s large language models (LLMs).

The DataChat interface lets users explore data using natural language (Image courtesy DataChat)

They built language parsers and delved into semantic understanding, “all of that crazy stuff,” Patel said. “But as part of doing that, we built the rest of the bottom of the stack,” he continued. “So important layers were all ready. They were scalable, they were cost-optimized, especially for cloud databases.”

When the LLM revolution exploded onto the scene a few years later, Patel and John quickly realized the superiority of the new approach. They jettisoned the top of the stack, which was built on now-outdated NLP techniques, and replaced it with OpenAI’s Codex. When OpenAI killed Codex a year ago, they pivoted again to make the LLM component swappable in their stack.

“So obviously, that was hell for us, but as part of doing that, we redid our engineering framework in the LLM piece to make sure that next time that happens to us, we can plug and play LLMs out and make it as painless as possible,” Patel said.
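One common way to get that kind of plug-and-play behavior is to hide the model behind a small interface, so the rest of the stack never names a specific vendor. The sketch below illustrates the pattern in Python; it is not DataChat's framework. The `LanguageModel` protocol, the `CannedModel` stand-in, and the `Translator` class are hypothetical, and no real LLM API is called.

```python
from typing import Protocol


class LanguageModel(Protocol):
    """Hypothetical interface: anything with a complete() method qualifies."""

    def complete(self, prompt: str) -> str: ...


class CannedModel:
    """Stand-in backend that returns a fixed reply; no real API is called."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


class Translator:
    """Turns a user request into intermediate steps using whichever model it is given."""

    def __init__(self, model: LanguageModel) -> None:
        self.model = model

    def to_intermediate(self, request: str) -> str:
        prompt = f"Translate to intermediate steps: {request}"
        return self.model.complete(prompt).strip()


# Swapping backends means passing a different object; Translator is unchanged.
first = Translator(CannedModel("count_by contract\n")).to_intermediate(
    "how many customers per contract type?"
)
second = Translator(CannedModel("  load customers")).to_intermediate(
    "open the customers table"
)
```

If a provider retires a model, only the class that wraps that provider has to be replaced.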

The company now relies primarily on OpenAI’s GPT-4, which is generally considered the most powerful and well-read LLM on the market today. DataChat employs GPT-4 to learn and generate DataChat’s intermediate language. GPT-4 is told about the type of data the user wants to analyze in general terms, but customers’ actual data never touches GPT-4, Patel said.

“We will construct summaries of the structure of the schema, so we say, ‘Here are the elements,’” Patel said. “I don’t need to give [GPT-4] the actual data values.”
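A schema-only summary of this kind can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not DataChat's implementation: the `schema_summary` function and the sample `customers` table are hypothetical. The function reads column names and declared types from SQLite's `PRAGMA table_info` and never selects any rows, so no data values can end up in the text.

```python
import sqlite3


def schema_summary(conn: sqlite3.Connection, table: str) -> str:
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk). No table rows are read.
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    lines = [f"- {name} ({ctype})" for _, name, ctype, *_ in columns]
    return f"Table {table} has columns:\n" + "\n".join(lines)


# Hypothetical table containing a value that must not reach the model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, contract TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ada Lovelace', 36, 'annual')")

summary = schema_summary(conn, "customers")
```

Only the resulting description of the table's shape would be placed in a prompt.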

LLMs are non-deterministic machines that can’t be fully trusted, Patel said, which is why DataChat uses LLMs only as “guides.” “They hallucinate; they do wrong stuff,” he said. “So they just give us stuff, we will convert that query to an intermediate language…and what we will generate for you is completely deterministic.”

A user can take a workflow generated by DataChat from one piece of data and run it on another piece of data, and it would run in the same way, he said. “So there’s no ambiguity.”
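The “guide, then verify” pattern Patel describes can be sketched as a strict parser sitting between the model's free-text suggestion and anything that actually runs. The Python below is a hypothetical illustration, not DataChat's parser: the three-command vocabulary in `ALLOWED_COMMANDS` and the `parse_step` and `parse_workflow` functions are invented for this example. Anything outside the vocabulary is rejected, and parsing the same text always yields the same workflow.

```python
# Hypothetical command vocabulary: command name -> required argument count.
ALLOWED_COMMANDS = {"load": 1, "count_by": 1, "filter": 3}


def parse_step(line: str) -> tuple:
    # Accept a line only if it is a known command with the right arity.
    parts = line.split()
    if not parts:
        raise ValueError("empty step")
    command, args = parts[0], tuple(parts[1:])
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    expected = ALLOWED_COMMANDS[command]
    if len(args) != expected:
        raise ValueError(f"{command} expects {expected} argument(s)")
    return command, args


def parse_workflow(text: str) -> list:
    # Blank lines are skipped; every other line must parse or the whole
    # suggestion is rejected with a ValueError.
    return [parse_step(line) for line in text.splitlines() if line.strip()]


# Text of the kind a model might suggest (made up for this example).
suggestion = "load customers\nfilter age > 40\ncount_by contract\n"
workflow = parse_workflow(suggestion)
```

The parsed workflow is ordinary data, so it can be stored and replayed against a different table without consulting the model again.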

It’s been a long road for Patel and John, but the Madison, Wisconsin-based company is finally accepting orders for DataChat. With the product formally launched at the Gartner show, Patel is ready to see what the next chapter in his fourth startup will bring.

“When we started and wrote that initial paper, everyone thought it was crazy in the database world,” Patel said. “But we got, in some sense, lucky that the GenAI piece landed where it was now a lot more usable. But that’s the fun thing about technology: It moves around, and if you’re willing to move around, good things can happen.”
