Tuesday, May 28, 2024

DataChat Delivers Data Exploration with a Dose of GenAI


DataChat turns data into insights with natural language! No SQL/Python is needed. Business users and analysts are empowered!

What if you tell the computer how you want to explore a data set, and the computer automatically executes the analysis and delivers the results? That’s the idea behind DataChat, a generative AI-based data exploration and analytics tool that spun out of a University of Wisconsin-Madison research project and is now a commercial product.

Jignesh Patel, who is currently a computer science professor at Carnegie Mellon University and a co-founder of DataChat, recently sat down virtually with Datanami to chat about the nature of data exploration in the generative AI era and the new DataChat offering, which formally launched earlier this month at the Gartner Data & Analytics Summit.

The impetus for creating DataChat started in 2016 when Patel was working as a computer science professor at the University of Wisconsin-Madison and the CTO of Pivotal (now a part of VMware Tanzu and parent company Broadcom). The big data explosion was in full swing, Hadoop was the rallying point for new distributed frameworks, and data scientists were in big demand.

While the technology was evolving quickly, too many companies were spinning their wheels on data analytics and exploration, and Patel sensed that something was missing from the equation.

“Every CTO their first objective was to hire an army of data scientists. They couldn’t get enough of data scientists,” Patel said. “And I started to observe how data scientists work in the very early days. It’s all ad-hoc analytics. It’s unscripted, unlike the BI world, and you’re trying to get something from data in a non-linear path.”

Much of this data exploration work was done manually, using tools like Jupyter’s data science notebooks. Data scientists would explore a particular data set until something interesting popped out, then figure out a way to extract that particular piece of data, transform it into a more useful form, and then pipe it into a machine learning algorithm, which could be used in an application.

Patel recognized the pattern lent itself to some form of automation, one that was preferably more approachable by non-experts.

“They were doing this by breaking the problem down, step by step, then trying to find code somewhere on the Web and retrofit it inside. And that’s how many cells get constructed in notebooks,” he said. “So we wrote a paper in 2017 to say, what if we could have this data science cell be filled up by the user just expressing that in natural language?”

These were pre-ChatGPT days, of course, and the state of the art in natural language processing (NLP) was nowhere near what it is today. While the NLP tech would improve, Patel and his University of Wisconsin PhD student, Rogers Jeffrey Leo John, did the hard work of constructing a compact control language that could sit between the user and the underlying SQL and Python code that would query data and call machine learning algorithms, respectively.

“The intermediate [language]… was great because now we could take any arbitrary language, convert that into that intermediate language, and now convert that into SQL and Python,” Patel said. “Because that’s what you must do if you’re talking to a SQL database, doing ETL. If you want to build machine learning models, you have to cross the two main languages of data science, which are SQL and Python.”
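To make the idea of an intermediate language concrete, here is a minimal sketch in Python. It is not DataChat's actual grammar or code: the `Aggregate` step, its fields, and the `to_sql` and `to_pandas` helpers are all hypothetical names invented for illustration. The point is only that one structured step can be rendered into either of the two target languages.

```python
from dataclasses import dataclass


# Hypothetical intermediate-language step (an assumption for illustration,
# not DataChat's real control language).
@dataclass(frozen=True)
class Aggregate:
    table: str
    group_by: str
    column: str
    func: str  # one of "avg", "sum", "min", "max"


# Each function name maps to (SQL aggregate, pandas method name).
_FUNCS = {
    "avg": ("AVG", "mean"),
    "sum": ("SUM", "sum"),
    "min": ("MIN", "min"),
    "max": ("MAX", "max"),
}


def to_sql(step: Aggregate) -> str:
    """Render the step as a SQL query string."""
    sql_func, _ = _FUNCS[step.func]
    return (
        f"SELECT {step.group_by}, {sql_func}({step.column}) "
        f"AS {step.func}_{step.column} "
        f"FROM {step.table} GROUP BY {step.group_by}"
    )


def to_pandas(step: Aggregate) -> str:
    """Render the same step as pandas source text."""
    _, method = _FUNCS[step.func]
    return f"{step.table}.groupby({step.group_by!r})[{step.column!r}].{method}()"


step = Aggregate(
    table="customers",
    group_by="contract_type",
    column="monthly_charge",
    func="avg",
)
sql_text = to_sql(step)
pandas_text = to_pandas(step)
```

Because the step is plain data, whatever produces it (a parser or an LLM) never has to emit SQL or Python directly.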

A Natural Language for Data Science

The goal of DataChat was to create a data analytics and exploration tool that could follow simple English instructions, reducing the need for users to know SQL or Python to be productive with data. Users can type in simple commands such as “create a visualization for customer churn,” and the product will automatically produce a visualization based on the data.

Patel said the idea is for DataChat to be interactive and have a natural flow. Sitting behind a spreadsheet-like interface, users can fire off questions about the data. Not every question posed to DataChat will generate a reliable answer immediately. However, the give and take allows the product and the user to move forward predictably.


“You ask, and you get,” Patel said. “And we also tell you the steps when you get something back. There’s a give and take. I’m going to ask you something, it didn’t make sense, and you ask in a slightly different way, but I’m making progress at every step.”

Business users, data analysts, and data scientists are the targeted users for DataChat. For business users and data analysts, the goal is to elevate their skills in the data science realm without a lot of training. Data scientists often use DataChat to give them an idea of what’s in a new data set.

“They might just be poking at it [with] DataChat and saying, ‘Hey, how many null values do I have in three of my critical columns?’” Patel said. “Instead of writing a SQL query, they just point, click, or ask and get that answer, and it’s just much faster. They could write it, but they’re getting time benefit from using this.”
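For readers curious what that hand-written query looks like, the sketch below counts null values in three columns using Python's built-in `sqlite3` module. The `customers` table, its columns, and the `count_nulls` helper are made-up examples, not anything from DataChat.

```python
import sqlite3


def count_nulls(conn, table, columns):
    """Return {column: number of NULLs} for the given columns.

    Table and column names are interpolated directly, so this sketch
    assumes they come from a trusted source.
    """
    # COUNT(*) counts all rows; COUNT(col) counts only non-NULL values.
    select = ", ".join(f"COUNT(*) - COUNT({c})" for c in columns)
    row = conn.execute(f"SELECT {select} FROM {table}").fetchone()
    return dict(zip(columns, row))


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (age INTEGER, contract_type TEXT, has_insurance INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (34, "monthly", 1),
        (None, "annual", None),
        (51, None, 0),
        (None, "monthly", 1),
    ],
)

null_counts = count_nulls(conn, "customers", ["age", "contract_type", "has_insurance"])
```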

A DataChat workflow can generate three artifacts from data sitting in anything from an Excel workbook to a data warehouse in Databricks or Snowflake: a report, a chart, or a machine learning model, including regression, classification, and time series. Each workflow will be accompanied by an explanation of how and why it generated the answer that it did, which is an important feature of the product, Patel said.

For a model on churn, DataChat won’t generate “some crazy technical answer,” he said. “But it’s going to say, ‘Okay, these three things–the age of the person, the contract type, and whether they have bought insurance or not. And this is 60% of the influence or 20% and 10%, and here are the things that it’s not influencing based on the data.’”

That level of transparency is critical in data science, Patel said. “From day one, we’ve been thinking about solving data science, and science requires transparency, so that’s built into the philosophy of the product,” he said.

The Shifting Grounds of NLP

DataChat was first registered as a company in 2017 and raised $4 million in a seed round in 2020 (it has since raised another $25 million). In 2017, Patel and John slogged their way forward with the NLP technology of the day, which wasn’t nearly as powerful or as easy to use as today’s large language models (LLMs).

The DataChat interface lets users explore data using natural language (Image courtesy DataChat)

They built language parsers and delved into semantic understanding, “all of that crazy stuff,” Patel said. “But as part of doing that, we built the rest of the bottom of the stack,” he continued. “So important layers were all ready. They were scalable, they were cost-optimized, especially for cloud databases.”

When the LLM revolution exploded onto the scene a few years later, Patel and John quickly realized the superiority of the new approach. They jettisoned the top of the stack built on now outdated NLP techniques. They replaced it with OpenAI’s Codex. When OpenAI killed Codex a year ago, they pivoted again to make the LLM component swappable in their stack.

“So obviously, that was hell for us, but as part of doing that, we redid our engineering framework in the LLM piece to make sure that next time that happens to us, we can plug and play LLMs out and make it as painless as possible,” Patel said.
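One common way to make a model component swappable is to hide it behind a small interface. The sketch below illustrates that general pattern only; the `LanguageModel` protocol, the two stub vendor classes, and the `Translator` class are hypothetical and do not describe DataChat's engineering framework or any real vendor API.

```python
from typing import Protocol


class LanguageModel(Protocol):
    """Anything with a complete() method can be plugged in."""

    def complete(self, prompt: str) -> str:
        ...


class VendorAModel:
    """Stub standing in for one provider; returns a canned reply."""

    def complete(self, prompt: str) -> str:
        return "count rows"


class VendorBModel:
    """Stub for a second provider with messier output formatting."""

    def complete(self, prompt: str) -> str:
        return "  COUNT   ROWS\n"


class Translator:
    """Depends only on the interface, never on a specific vendor."""

    def __init__(self, model: LanguageModel) -> None:
        self.model = model

    def to_intermediate(self, request: str) -> str:
        raw = self.model.complete(f"Translate to a command: {request}")
        # Normalize case and whitespace so downstream code sees one shape.
        return " ".join(raw.lower().split())


command_a = Translator(VendorAModel()).to_intermediate("How many rows are there?")
command_b = Translator(VendorBModel()).to_intermediate("How many rows are there?")
```

Swapping providers then means writing one new class rather than touching the rest of the stack.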

The company now relies primarily on OpenAI’s GPT-4, which is generally considered the most powerful and well-read LLM on the market. DataChat employs GPT-4 to learn and generate its intermediate language. GPT-4 is told about the type of data the user wants to analyze in general terms, but customers’ actual data never touches GPT-4, Patel said.

“We will construct summaries of the structure of the schema, so we say, ‘Here are the elements,’” Patel said. “I don’t need to give [GPT-4] the actual data values.”
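A schema-only summary of this kind can be sketched with Python's built-in `sqlite3` module. This is an assumed illustration, not DataChat's implementation: the `summarize_schema` helper and the sample table are hypothetical. It reads column names and declared types from the database catalog and never selects any rows.

```python
import sqlite3


def summarize_schema(conn, table):
    """Describe a table's structure without reading its rows.

    The table name is interpolated directly, so this sketch assumes it
    comes from a trusted source.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    lines = [f"- {name} ({ctype or 'UNKNOWN'})" for _, name, ctype, *_ in rows]
    return f"Table {table} has columns:\n" + "\n".join(lines)


# Hypothetical sample table containing a sensitive value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, churned INTEGER)")
conn.execute("INSERT INTO customers VALUES ('Alice Example', 42, 0)")

summary = summarize_schema(conn, "customers")
```

The resulting text could be placed in a prompt, while the row values stay in the database.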

LLMs are non-deterministic machines that can’t be fully trusted, Patel said, which is why DataChat uses LLMs only as “guides.” “They hallucinate; they do wrong stuff,” he said. “So they just give us stuff, we will convert that query to an intermediate language…and what we will generate for you is completely deterministic.”

A user can take a workflow generated by DataChat from one piece of data and run it on another piece of data, and it would run in the same way, he said. “So there’s no ambiguity.”
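The split between an unreliable suggester and a deterministic executor can be sketched as follows. This is an illustration under stated assumptions, not DataChat's design: the operation names in `ALLOWED_OPS` and the `validate` and `run` helpers are hypothetical. A suggested workflow is checked against a fixed set of operations, then executed by ordinary code, so the same workflow behaves the same way on any data set.

```python
# Hypothetical operation set; anything outside it is rejected.
ALLOWED_OPS = {"filter_equals", "count"}


def validate(workflow):
    """Reject steps that are not known operations (e.g. hallucinated ones)."""
    for step in workflow:
        if step[0] not in ALLOWED_OPS:
            raise ValueError(f"unknown operation: {step[0]!r}")
    return workflow


def run(workflow, rows):
    """Execute a validated workflow; no model is involved at this stage."""
    result = list(rows)
    for op, *args in validate(workflow):
        if op == "filter_equals":
            column, value = args
            result = [r for r in result if r.get(column) == value]
        elif op == "count":
            # In this sketch, count is meant to be the final step.
            result = len(result)
    return result


workflow = [("filter_equals", "contract", "monthly"), ("count",)]

first_quarter = [
    {"contract": "monthly"},
    {"contract": "annual"},
    {"contract": "monthly"},
]
second_quarter = [{"contract": "annual"}, {"contract": "monthly"}]

count_q1 = run(workflow, first_quarter)
count_q2 = run(workflow, second_quarter)
```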

It’s been a long road for Patel and John, but the Madison, Wisconsin-based company is finally accepting orders for DataChat. With the product formally launched at the Gartner show, Patel is ready to see what the next chapter in his fourth startup will bring.

“When we started and wrote that initial paper, everyone thought it was crazy in the database world,” Patel said. “But we got, in some sense, lucky that the GenAI piece landed where it was now a lot more usable. But that’s the fun thing about technology: It moves around, and if you’re willing to move around, good things can happen.”
