DataChat turns data into insights through natural language, with no SQL or Python required, so business users and analysts can explore data on their own.
What if you could tell the computer how you want to explore a data set, and the computer automatically executed the analysis and delivered the results? That’s the idea behind DataChat, a generative AI-based data exploration and analytics tool that spun out of a University of Wisconsin-Madison research project and is now a commercial product.
Jignesh Patel, who is currently a computer science professor at Carnegie Mellon University and a co-founder of DataChat, recently sat down virtually with Datanami to chat about the nature of data exploration in the generative AI era and the new DataChat offering, which formally launched earlier this month at the Gartner Data & Analytics Summit.
The impetus for DataChat came in 2016, when Patel was working as a computer science professor at the University of Wisconsin-Madison and the CTO of Pivotal (now a part of VMware Tanzu and parent company Broadcom). The big data explosion was in full swing, Hadoop was the rallying point for new distributed frameworks, and data scientists were in big demand.
While the technology was evolving quickly, too many companies were spinning their wheels on data analytics and exploration, and Patel sensed that something was missing from the equation.
“Every CTO, their first objective was to hire an army of data scientists. They couldn’t get enough of data scientists,” Patel said. “And I started to observe how data scientists work in the very early days. It’s all ad-hoc analytics. It’s unscripted, unlike the BI world, and you’re trying to get something from data in a non-linear path.”
Much of this data exploration work was done manually, using tools like Jupyter’s data science notebooks. Data scientists would explore a particular data set until something interesting popped out, then figure out a way to extract that particular piece of data, transform it into a more useful form, and then pipe it into a machine learning algorithm, which could be used in an application.
Patel recognized the pattern lent itself to some form of automation, one that was preferably more approachable by non-experts.
“They were doing this by breaking the problem down, step by step, then trying to find code somewhere on the Web and retrofit it inside. And that’s how many cells get constructed in notebooks,” he said. “So we wrote a paper in 2017 to say, what if we could have this data science cell be filled up by the user just expressing that in natural language?”
This was pre-ChatGPT days, of course, and the state-of-the-art in natural language processing (NLP) was nowhere near what it is today. While the NLP tech would improve, Patel and his University of Wisconsin PhD graduate student, Rogers Jeffrey Leo John, did the hard work of constructing a compact control language that could sit between the user and the underlying SQL and Python code that would query data and call machine learning algorithms, respectively.
“The intermediate [language]… was great because now we could take any arbitrary language, convert that into that intermediate language, and now convert that into SQL and Python,” Patel said. “Because that’s what you must do if you’re talking to a SQL database, doing ETL. If you want to build machine learning models, you have to cross the two main languages of data science, which are SQL and Python.”
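The idea of a compact intermediate language that compiles to both SQL and Python can be illustrated with a minimal sketch. Everything here is hypothetical: the `CountBy` step, its fields, and the two compiler functions are invented for illustration and are not DataChat's actual control language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountBy:
    """Hypothetical intermediate-language step: count rows per group."""
    table: str
    group_column: str


def to_sql(step: CountBy) -> str:
    """Compile the step to SQL text (identifiers are quoted, not validated)."""
    return (
        f'SELECT "{step.group_column}", COUNT(*) AS n '
        f'FROM "{step.table}" GROUP BY "{step.group_column}"'
    )


def to_python(step: CountBy) -> str:
    """Compile the same step to pandas-style Python source text."""
    return f'{step.table}.groupby("{step.group_column}").size()'


# One step, two targets: the natural-language front end only has to
# produce the step, never the SQL or Python directly.
step = CountBy(table="customers", group_column="contract_type")
sql_text = to_sql(step)
python_text = to_python(step)
```

The point of the design is that the front end (hand-built parser or LLM) and the back ends (SQL and Python generators) only need to agree on the step definition, so either side can change independently.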
A Natural Language for Data Science
The goal of DataChat was to create a data analytics and exploration tool that could follow simple English instructions, reducing the need for users to know SQL or Python to be productive with data. Users can type in simple commands such as “create a visualization for customer churn,” and the product will automatically produce a visualization based on the data.
Patel said the idea is for DataChat to be interactive and have a natural flow. Sitting behind a spreadsheet-like interface, users can fire off questions about the data. Not every question posed to DataChat will generate a reliable answer immediately. However, the give and take allows the product and the user to move forward predictably.
“You ask, and you get,” Patel said. “And we also tell you the steps when you get something back. There’s a give and take. I’m going to ask you something, it didn’t make sense, and you ask in a slightly different way, but I’m making progress at every step.”
Business users, data analysts, and data scientists are the targeted users for DataChat. For business users and data analysts, the goal is to elevate their skills in the data science realm without a lot of training. Data scientists often use DataChat to get an idea of what’s in a new data set.
“They might just be poking at it in DataChat and saying, ‘Hey, how many null values do I have in three of my critical columns?’” Patel said. “Instead of writing a SQL query, they just point, click, or ask and get that answer, and it’s just much faster. They could write it, but they’re getting time benefit from using this.”
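For a sense of what that null-count question looks like when written by hand, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` table, its columns, and the `count_nulls` helper are assumptions made up for illustration; they do not come from DataChat.

```python
import sqlite3


def count_nulls(conn, table, columns):
    """Return {column: number of NULLs} using a single aggregate query."""
    # COUNT(*) counts every row; COUNT(col) skips NULLs,
    # so the difference is the number of NULLs in that column.
    select_list = ", ".join(f'COUNT(*) - COUNT("{c}")' for c in columns)
    row = conn.execute(f'SELECT {select_list} FROM "{table}"').fetchone()
    return dict(zip(columns, row))


# Hypothetical sample data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (age INTEGER, contract_type TEXT, insured INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(34, "monthly", 1), (None, "annual", None), (51, None, 0), (None, "monthly", 1)],
)

null_counts = count_nulls(conn, "customers", ["age", "contract_type", "insured"])
```

None of this is hard for a data scientist to write, which matches Patel's point: the benefit of asking in plain language is time saved, not capability gained.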
A DataChat workflow can generate three artifacts from data sitting in anything from an Excel workbook to a data warehouse in Databricks or Snowflake: a report, a chart, or a machine learning model (regression, classification, or time series). Each workflow will be accompanied by an explanation of how and why it generated the answer that it did, which is an important feature of the product, Patel said.
For a model on churn, DataChat won’t generate “some crazy technical answer,” he said. “But it’s going to say, ‘Okay, these three things: the age of the person, the contract type, and whether they have bought insurance or not. And this is 60% of the influence or 20% and 10%, and here are the things that it’s not influencing based on the data.’”
That level of transparency is critical in data science, Patel said. “From day one, we’ve been thinking about solving data science, and science requires transparency, so that’s built into the philosophy of the product,” he said.
The Shifting Grounds of NLP
DataChat was first registered as a company in 2017 and raised $4 million in a seed round in 2020 (it has since raised another $25 million). In 2017, Patel and John slogged their way forward with the NLP technology of the day, which wasn’t nearly as powerful or as easy to use as today’s large language models (LLMs).

They built language parsers and delved into semantic understanding, “all of that crazy stuff,” Patel said. “But as part of doing that, we built the rest of the bottom of the stack,” he continued. “So important layers were all ready. They were scalable, they were cost-optimized, especially for cloud databases.”
When the LLM revolution exploded onto the scene a few years later, Patel and John quickly realized the superiority of the new approach. They jettisoned the top of the stack, which was built on now-outdated NLP techniques, and replaced it with OpenAI’s Codex. When OpenAI killed Codex a year ago, they pivoted again to make the LLM component swappable in their stack.
“So obviously, that was hell for us, but as part of doing that, we redid our engineering framework in the LLM piece to make sure that next time that happens to us, we can plug and play LLMs out and make it as painless as possible,” Patel said.
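One common way to make an LLM component swappable is to hide it behind a small interface. The sketch below assumes a hypothetical `LLMBackend` protocol and stand-in backends; the class names, the prompt wording, and the `COUNT_BY` reply format are all invented for illustration and say nothing about how DataChat's framework is actually built.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Hypothetical interface: anything with complete() can be plugged in."""

    def complete(self, prompt: str) -> str: ...


class CannedBackend:
    """Stand-in backend returning a fixed reply; a real one would call a vendor API."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


class Translator:
    """Depends only on the interface, never on a specific vendor."""

    def __init__(self, backend: LLMBackend) -> None:
        self.backend = backend

    def suggest(self, request: str) -> str:
        prompt = f"Translate to the intermediate language: {request}"
        return self.backend.complete(prompt).strip()


translator = Translator(CannedBackend(" COUNT_BY contract_type "))
first = translator.suggest("how many customers per contract type?")

# Swapping the model is a one-line change; Translator itself is untouched.
translator.backend = CannedBackend("COUNT_BY age")
second = translator.suggest("how many customers per age?")
```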
The company now relies primarily on OpenAI’s GPT-4, which is generally considered the most powerful and well-read LLM on the market. DataChat employs GPT-4 to learn and generate DataChat’s intermediate language. GPT-4 is told about the type of data the user wants to analyze in general terms, but customers’ actual data never touches GPT-4, Patel said.
“We will construct summaries of the structure of the schema, so we say, ‘Here are the elements,’” Patel said. “I don’t need to give [GPT-4] the actual data values.”
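A schema-only summary of this kind can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration: the `summarize_schema` helper, the table, and the summary wording are hypothetical, and the sketch reads only column names and declared types from the database catalog, never the rows.

```python
import sqlite3


def summarize_schema(conn, table):
    """Describe column names and declared types only; no rows are read."""
    # PRAGMA table_info yields one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    parts = [f"{name} ({col_type})" for _, name, col_type, *_ in columns]
    return f"Table {table} has columns: " + ", ".join(parts)


# Hypothetical table containing a value that should never leave the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, contract_type TEXT)")
conn.execute("INSERT INTO customers VALUES (42, 'SECRET-VALUE')")

summary = summarize_schema(conn, "customers")
```

Only `summary` would be placed in a prompt under this approach, so the model sees the shape of the data but not its contents.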
LLMs are non-deterministic machines that can’t be fully trusted, Patel said, which is why DataChat uses LLMs only as “guides.” “They hallucinate; they do wrong stuff,” he said. “So they just give us stuff, we will convert that query to an intermediate language… and what we will generate for you is completely deterministic.”
A user can take a workflow generated by DataChat from one piece of data and run it on another piece of data, and it would run in the same way, he said. “So there’s no ambiguity.”
It’s been a long road for Patel and John, but the Madison, Wisconsin-based company is finally accepting orders for DataChat. With the product formally launched at the Gartner show, Patel is ready to see what the next chapter in his fourth startup will bring.
“When we started and wrote that initial paper, everyone thought it was crazy in the database world,” Patel said. “But we got, in some sense, lucky that the GenAI piece landed where it was now a lot more usable. But that’s the fun thing about technology: It moves around, and if you’re willing to move around, good things can happen.”
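The pattern of treating the LLM as an untrusted guide and keeping execution deterministic can be sketched as follows. The JSON step format, the `count_by` operation, and both helper functions are hypothetical and chosen only to illustrate the idea: the model's reply is checked against a fixed vocabulary before anything runs, and the accepted step behaves identically on any data set it is applied to.

```python
import json

ALLOWED_OPS = {"count_by"}  # hypothetical vocabulary of the intermediate language


def parse_suggestion(raw: str, known_columns: set) -> dict:
    """Accept an untrusted LLM reply only if it names a known op and column."""
    step = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(step, dict):
        raise ValueError("suggestion must be a JSON object")
    if step.get("op") not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {step.get('op')!r}")
    if step.get("column") not in known_columns:
        raise ValueError(f"unknown column: {step.get('column')!r}")
    return {"op": step["op"], "column": step["column"]}


def run(step: dict, rows: list) -> dict:
    """Deterministically execute a validated count_by step on any data set."""
    counts = {}
    for row in rows:
        key = row[step["column"]]
        counts[key] = counts.get(key, 0) + 1
    return counts


columns = {"age", "contract_type"}
step = parse_suggestion('{"op": "count_by", "column": "contract_type"}', columns)

# The same validated step runs unchanged on two different data sets.
first_quarter = [{"contract_type": "monthly"}, {"contract_type": "annual"},
                 {"contract_type": "monthly"}]
second_quarter = [{"contract_type": "annual"}]
result_q1 = run(step, first_quarter)
result_q2 = run(step, second_quarter)
```

A reply that names a column the schema does not contain, such as a hallucinated `favorite_color`, is rejected by `parse_suggestion` with a `ValueError` instead of being executed.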