Article

Data Harmonization in the era of AI and ChatGPT-4

By Soren Altmann, Partner at Redslim and Ralf Scharnetzki, Product Management Consultant at Redslim

4 minute read | June 2023

 

Over the past few years, we have seen huge progress in the field of artificial intelligence (AI). From image recognition and natural language processing to robotics and self-driving cars, AI has made significant advancements in almost every industry. The progress in AI has been nothing short of remarkable, and we are only just scratching the surface of what is possible with this technology.

But there’s one development in particular that has captured the attention of almost everyone over the past six months: Large Language Models (LLMs) like ChatGPT-4 and Bard. ChatGPT is an LLM developed by OpenAI, backed by Microsoft, that can generate human-like responses to a wide range of prompts. In the short span of time since we started discussing this article, ChatGPT has advanced considerably, and Microsoft Fabric has been announced to the market as a statement of Microsoft fully embracing AI in the data analytics space.

As technology experts and data enthusiasts, we regularly invest in developing algorithms to help us harmonize and enrich disconnected data sets: what we call SMART Coding at Redslim. Not surprisingly, we now get a lot of questions about how we’re looking at LLMs. Across the board, players in the CPG arena have more data, more detail and more sources of information to analyze than ever before. And everyone, including us at Redslim, is on the lookout for optimal ways to handle the plethora of data they’re collecting, managing, or buying.

Here are some initial observations we’ve made when looking at the most recent developments of ChatGPT and AI, and how we feel these developments can impact the data management and harmonization industry as we know it.

Exciting Opportunities of Using LLMs within Data Harmonization

Although our tech teams have been following these new developments closely and testing their functionality as part of our development strategy, even we have been surprised by the progress we’ve seen since the launch of ChatGPT in Q4 2022. And the speed of the developments over the past months is unlike anything we’ve seen before.

As the capabilities of LLMs continue to grow, there is an enormous potential for our industry to leverage this technology in various ways. Today, we see the main benefits of LLMs to our industry as aiding in:

1. Coding products – Helping identify product attributes and break these out

As LLMs continue to improve, they have the potential to make mass customization for specific enrichment needs more feasible than ever before. Some reference data enrichment, especially at a highly bespoke level of granularity, takes weeks or months to develop, is painful to maintain, and is thus only available to large enterprises with substantial budgets. With LLMs, however, organizations with smaller budgets may be able to benefit from customizations that were previously out of reach (a minimal illustration of this kind of attribute extraction follows after this list).

2. Developing business applications – Helping write better software, faster

One of the most exciting applications of LLMs is as virtual assistants to industry experts in coding and in the development of business applications. With LLMs, developers can ask the tool for a piece of code and then quickly search through what is returned to double-check and correct any errors. This can save developers a significant amount of time and increase their productivity, resulting in faster development cycles.

3. Getting to insights faster – Helping end-users analyze data by identifying patterns and calling these out

Similarly, LLMs can be a valuable tool for data analysis. By using LLMs to analyze data, experts can quickly and accurately identify patterns and trends that might otherwise be missed. This can be particularly useful in industries such as ours, where even small changes in data can have significant implications.
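
To make the first point above concrete, here is a minimal sketch of how an LLM could be prompted to break a raw item description out into structured attributes. It assumes the openai Python package (the 0.x-era ChatCompletion API), a "gpt-4" model and an invented attribute list; it is an illustration, not our production SMART Coding pipeline.

```python
# A minimal sketch (not our production SMART Coding pipeline): asking an LLM to
# break a raw product description out into structured attributes.
# Assumes the openai Python package (0.x-era ChatCompletion API) and an API key
# in the environment; the attribute list and model name are illustrative.
import json
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = (
    "Extract the following attributes from the product description and answer "
    "only with a JSON object: brand, category, pack_size, variant.\n\n"
    "Description: {description}"
)

def extract_attributes(description: str) -> dict:
    """Ask the model to return product attributes as a JSON object."""
    response = openai.ChatCompletion.create(
        model="gpt-4",     # illustrative model choice
        temperature=0,     # keep the output as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(description=description)}],
    )
    # The model may still return malformed JSON; in practice this needs validation.
    return json.loads(response["choices"][0]["message"]["content"])

if __name__ == "__main__":
    print(extract_attributes("COLGATE TOTAL WHITENING TOOTHPASTE 75ML 2-PACK"))
```

Even in a toy example like this, the returned attributes still have to be validated against a controlled reference dictionary before they can enter a harmonized database, which is exactly where the quality concerns discussed below come in.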


Data Privacy as an Area of Concern

As we explore the potential use of LLMs in our daily operations, we must state that the risks of inputting proprietary and confidential data into AI tools and LLMs are significant and should not be overlooked. With LLMs, there is always a risk that sensitive data may be leaked, either intentionally or accidentally. As more businesses begin to use LLMs, there is an increasing need to ensure that user data is protected and kept confidential.

Real-world examples of LLM data privacy issues have already come to light, such as the recent Samsung ChatGPT leak, where trade secrets were accidentally leaked to the AI chatbot. Similarly, Amazon has reported instances where text generated by ChatGPT closely resembled internal company data.

Fortunately, there are efforts underway to address these concerns. Microsoft is currently working on privacy-enhanced versions of its OpenAI-powered AI for businesses, which could cost up to ten times more than the shared service. The new chatbot will be designed with a "privacy-first" approach and will be targeted at businesses that need to keep their secrets safe.

In addition, OpenAI is now offering new settings to ensure that user data is not used for training. This option has become available in Italy, and it is hoped that it will soon be available in other parts of the world. This is a welcome development, as it gives users greater control over their data and ensures that it is not being used in ways they are not comfortable with.

While it is reassuring to see that steps are being taken to address data privacy concerns with LLMs, it is crucial to remain vigilant, especially when dealing with third-party or retailer data. It is vital to emphasize the importance of data privacy within organizations and to prioritize it in any LLM implementation. That’s a major driver of our caution at Redslim. Our tech team and data engineers, while huge fans of cutting-edge developments, have to be incredibly diligent in protecting what’s most important: the privacy and quality of partners’ and customers’ data.

Quality Remains Central

As impressive as the developments in AI are, one principle remains central in our industry: we cannot, and will never, compromise on the quality of the data management services we deliver.

There has been quite a lot of press recently about the potential for LLMs to return false information when queried by a human. LLMs such as Bard and ChatGPT-4 have a significant problem in that they confidently assert inaccurate information as truth. This occurs because, instead of searching a verified database to provide responses to queries, they are trained on extensive text data to anticipate the next word, somewhat akin to an autocomplete feature.

When testing the usability of LLMs as virtual assistants in our operations, we’ve encountered some notable examples where they returned false information, which we, as experts, could quickly spot in the data.

One of our tech developers, for example, was using an LLM to explore its capabilities in writing code. While assistance in coding is an incredibly useful way to use LLMs, our team of tech experts was able to spot the errors in the code the LLM returned only because of their extensive IT knowledge. We saw a similar example when testing LLMs’ potential for data harmonization. We recently tested ChatGPT on matching descriptions of make-up UPCs to the corresponding color shade. While the result was good, it wasn’t good enough. We determined that the LLM returned the correct data in 80% of the cases; the only problem is that, because LLMs state their outputs as fact, you don’t know which 20% are wrong and which are correct. And that simply isn’t good enough for data quality.
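
For context, a figure like that 80% only exists because the model’s answers were compared against a manually verified reference. A heavily simplified version of that kind of scoring is sketched below; the field values and the idea of a pre-collected set of LLM answers are illustrative, not our actual test harness.

```python
# A simplified sketch of scoring LLM output against a manually verified
# reference, in the spirit of the UPC / color-shade test described above.
# The UPCs, shades and column roles are invented for illustration.
from typing import Dict

def accuracy(llm_answers: Dict[str, str], reference: Dict[str, str]) -> float:
    """Share of UPCs where the LLM's shade matches the verified shade."""
    matches = sum(
        1 for upc, shade in reference.items()
        if llm_answers.get(upc, "").strip().lower() == shade.strip().lower()
    )
    return matches / len(reference)

reference = {"012345678905": "ruby red", "012345678912": "nude beige"}
llm_answers = {"012345678905": "Ruby Red", "012345678912": "rose beige"}

print(f"accuracy: {accuracy(llm_answers, reference):.0%}")  # -> 50% in this toy case
```

The catch, of course, is that a check like this presupposes a verified reference; without one, the confidently wrong 20% looks exactly like the correct 80%.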

These issues do not speak against the potential of these tools, which remains huge. But they do speak to the limitations of relying on these tools extensively in an industry where data teams pursue 100% accuracy and full traceability of every manipulation of their data.

The Way Forward: Humans + Technology

The development of AI has simply been breathtaking over the past few years. It's truly remarkable how far we've come in the field of AI, and LLMs like ChatGPT-4 are a prime example of how cutting-edge technology can change the game. We have no doubt that technology will continue to advance to a point where only minimal human intervention is needed. But we’re not there yet.

Pure AI approaches today fall well below the quality and traceability service levels clients are demanding in our industry. Market data mobilizes millions in innovation investments, M&As, and budget allocations. Who would rely on data whose accuracy is not traceable by definition for these types of pivotal decisions?

To achieve 100% data quality, a hybrid approach combining AI with human intervention remains necessary. That’s because AI is currently only useful to the extent that the data it’s based on is of high quality and comparable – and that’s where human intervention is important. We still need humans to verify, modify, and recheck data sources to ensure Quality-in-Quality-Out: to ensure that the information you render is trusted, accurate, and can help you make smarter decisions.
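
As a conceptual sketch of that hybrid flow (the data model and function names are assumptions, not our actual workflow): a machine-proposed coding is only auto-accepted when verified reference data confirms it, and everything else is queued for an expert to review.

```python
# Conceptual sketch of the "humans + technology" flow: auto-accept only what
# verified reference data confirms; route everything else to a human expert.
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ProposedCoding:
    upc: str
    attribute: str   # e.g. "color shade"
    value: str       # value suggested by the LLM

def route(coding: ProposedCoding,
          verified: Dict[Tuple[str, str], str]) -> str:
    """Decide whether a machine-proposed coding can be accepted automatically."""
    if verified.get((coding.upc, coding.attribute)) == coding.value:
        return "auto-accept"   # confirmed by verified reference data
    return "human review"      # unconfirmed or conflicting: an expert checks it

verified = {("012345678905", "color shade"): "ruby red"}
print(route(ProposedCoding("012345678905", "color shade", "ruby red"), verified))    # auto-accept
print(route(ProposedCoding("012345678912", "color shade", "nude beige"), verified))  # human review
```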

In conclusion, the potential benefits of using LLMs are enormous. With immediate value in development and reference data management, as well as in data analysis, LLMs are transforming the way we work. But when operating in a context where service reliability is the first thing customers ask for, one still has to stick to the best-tested method: human experts aided by machines acting as virtual assistants.