Empowering AI advancements through data-centric strategies: A triumph of large language models and tools like Cleanlab

August 16, 2023

The AI revolution was decades in the making. It was a field filled with excitement, yet often punctuated by disappointments and “AI winters.” But recently, something shifted. Large Language Models (LLMs) like ChatGPT, Claude, and Bard catapulted AI from laboratory curiosity to the mainstream.

This shift wasn’t solely a triumph of AI but also a victory over the intricacies of large and messy data. As the saying goes, “garbage in, garbage out.” New tools are emerging that focus on improving the underlying data and, in turn, the LLMs trained on it.

The Double Challenge of LLMs

The term “Large Language Models” holds within it two great challenges. First, the sheer volume of data. We’re talking upwards of a petabyte (a million gigabytes) of data for GPT-4, encompassing millions of books, blogs, social media posts, video transcripts, and more. This colossal scale offers vast potential but also poses significant logistical considerations.

Second, the complexity of natural language. Context-dependent, ambiguous, and diverse, language data is a wild beast that even the best algorithms struggle to tame. It’s impossible to accurately label all this data, which inevitably means that even state-of-the-art LLMs are trained on tons of incorrectly-labeled data.

In facing these challenges, new data-centric tools and methodologies emerged, enabling a true leap in what AI is capable of. Solutions like Cleanlab and others began to offer ways to collect diverse data, automate quality control, and process language into a form suitable for AI models.

These tools did not merely offer incremental improvements; they fundamentally reshaped the approach to AI data handling. They transformed the task of handling large-scale language data from a manual, error-prone process into an automated, precise one, democratizing the field and enabling advancements at an unprecedented pace.

Why Data-Centric AI is Needed (With a Python Demo)

Real-world datasets used in AI have annotation error rates ranging from 7% to 50%. These errors significantly hamper both training and evaluation. Data-centric AI responds by improving the quality of the dataset itself rather than the model.

OpenAI’s strategy, for instance, illustrates this emphasis: “We prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.”

An approach of manually filtering data, however, is time-consuming and expensive. The Cleanlab package is an open-source framework popular for practicing data-centric AI today. It allows you to run data quality algorithms on your trained ML model’s outputs to detect common dataset issues like label errors, outliers, drift, and more.

With just a few lines of code, you can automatically find and identify problems in various types of data, such as image, text, tabular, and audio. By using the Cleanlab package, you can decide how to improve your dataset and model, re-train your ML model, and see its performance improve without any changes to your existing code.

Cleanlab Studio, on the other hand, is more than just an extension of the Cleanlab package; it’s a no-code platform designed to find and fix problems in real-world datasets. It doesn’t just stop at detecting issues but goes further in handling data curation and correction, and even automates almost all the hard parts of turning raw data into reliable ML or Analytics.

Let’s use the Cleanlab package to demonstrate the power of data-centric AI.

1. Preparing data and fine-tuning

We start with the Stanford Politeness Dataset. Ensure you have the train and test sets loaded. In this demo, we’ll fine-tune the Davinci LLM for 3-class classification, first without Cleanlab, and then see how we can improve accuracy with data-centricity. We can run a simple bash command to train a model.

!openai api fine_tunes.create -t "train_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "baseline"

When that’s done, we can query a fine_tunes.results endpoint to see the test accuracy.

!openai api fine_tunes.results -i ft-9800F2gcVNzyMdTLKcMqAtJ5 > baseline.csv

import pandas as pd

df = pd.read_csv('baseline.csv')
baseline_acc = df.iloc[-1]['classification/accuracy']
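Reading the last row works when the final step of the results file carries the metric. As a minimal sketch, assuming the `classification/accuracy` column may be blank on some rows, the following standard-library version returns the last value that was actually reported. The function name `last_reported_metric` and the sample numbers are illustrative and do not come from the original run.

```python
import csv
import io


def last_reported_metric(csv_text, column="classification/accuracy"):
    """Return the last non-empty value in `column` as a float, or None."""
    value = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = (row.get(column) or "").strip()
        if cell:
            value = float(cell)
    return value


# Hypothetical results file with made-up numbers, for illustration only.
sample = """step,classification/accuracy
1,
2,0.5
3,
"""

print(last_reported_metric(sample))
```

With a real results file, the same function could be called as `last_reported_metric(open("baseline.csv").read())`.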

We get a result of 63% accuracy. Let’s see if we can improve this.

2. Obtain Predicted Class Probabilities

Now, let’s use OpenAI’s API to compute embeddings and fit a logistic regression model to obtain out-of-sample predicted class probabilities.

# Get embeddings from OpenAI.
import numpy as np
from openai.embeddings_utils import get_embedding

embedding_model = "text-similarity-davinci-001"
train["embedding"] = train.prompt.apply(lambda x: get_embedding(x, engine=embedding_model))
# Stack the per-row embedding lists into a 2-D array, as scikit-learn expects.
embeddings = np.array(train["embedding"].tolist())

# Get out-of-sample predicted class probabilities via cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression()
labels = train["completion"].values
pred_probs = cross_val_predict(estimator=model, X=embeddings, y=labels, cv=10, method="predict_proba")
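"Out-of-sample" here means that each example's probabilities come from a model that never saw that example during training. The sketch below illustrates the idea in plain Python, assuming a toy nearest-centroid classifier and contiguous, unshuffled folds. It is not how scikit-learn implements `cross_val_predict`, and every function name and data point in it is made up for illustration.

```python
import math


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_centroids(xs, ys, num_classes):
    """Return the mean vector of each class (None if a class is absent)."""
    dim = len(xs[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(xs, ys):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += x[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def predict_proba(centroids, x):
    """Turn squared distances to the centroids into normalized scores."""
    scores = []
    for centroid in centroids:
        if centroid is None:
            scores.append(0.0)
        else:
            dist_sq = sum((a - b) ** 2 for a, b in zip(x, centroid))
            scores.append(math.exp(-dist_sq))
    total = sum(scores)
    if total == 0.0:
        # Every score underflowed: fall back to a uniform distribution.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def cross_val_pred_probs(xs, ys, num_classes, k):
    """Predict each example with a model trained on the other folds only."""
    n = len(xs)
    pred_probs = [None] * n
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        centroids = fit_centroids(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx], num_classes
        )
        for i in held_out:
            pred_probs[i] = predict_proba(centroids, xs[i])
    return pred_probs


# Made-up one-dimensional "embeddings" and integer labels.
toy_x = [[0.0], [5.0], [0.2], [5.1], [0.1], [4.9]]
toy_y = [0, 1, 0, 1, 0, 1]
toy_pred_probs = cross_val_pred_probs(toy_x, toy_y, num_classes=2, k=3)
```

Because no example influences its own prediction, a confidently wrong prediction is evidence about the label rather than a sign that the model memorized it.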

With just one line of code, Cleanlab estimates which examples have label issues in our training dataset.

from cleanlab.filter import find_label_issues

Now we can get indices of examples estimated to have label issues:

issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence') # sort indices by likelihood of label error
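To give a feel for what "estimated to have label issues" can mean, here is a simplified sketch loosely based on the confident-learning idea: compute a per-class confidence threshold, flag examples that are confidently predicted as a class other than their given label, and rank them by self-confidence (the predicted probability of the given label). It assumes integer labels `0..K-1`. It is an illustration only, not Cleanlab's implementation, and the function names and toy numbers are invented.

```python
def per_class_thresholds(labels, pred_probs, num_classes):
    """Average predicted probability of class c over examples labeled c."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    return [sums[c] / counts[c] if counts[c] else 1.0 for c in range(num_classes)]


def flag_label_issues(labels, pred_probs, num_classes):
    """Return indices of suspect examples, least self-confident first."""
    thresholds = per_class_thresholds(labels, pred_probs, num_classes)
    suspects = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [c for c in range(num_classes) if probs[c] >= thresholds[c]]
        if not confident:
            continue
        best = max(confident, key=lambda c: probs[c])
        if best != label:
            suspects.append(i)
    return sorted(suspects, key=lambda i: pred_probs[i][labels[i]])


# Made-up labels and out-of-sample probabilities for three classes.
toy_labels = [0, 0, 1, 1, 2, 2, 0]
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.05, 0.05, 0.90],
    [0.10, 0.10, 0.80],
    [0.05, 0.90, 0.05],  # labeled 0, but the model is confident it is class 1
]

print(flag_label_issues(toy_labels, toy_probs, num_classes=3))
```

In this toy data only the last example is flagged: its given label is 0, yet class 1 clears its threshold by a wide margin.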

Now, we’ve automatically extracted the indices of potentially mislabeled examples, so we can remove them and train a new classifier.

# Remove the label errors

train_cl = train.drop(issue_idx).reset_index(drop=True)
format_data(train_cl, "train_cl.jsonl")
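The `format_data` helper is called here but not defined in this article. One plausible version, offered purely as a hypothetical sketch, writes the `prompt` and `completion` columns as JSON Lines, one object per line, and ignores extra columns such as `embedding`. Converting both fields to strings is an assumption made so that non-string labels serialize cleanly.

```python
import json


def format_data(rows, path):
    """Write prompt/completion pairs to `path`, one JSON object per line.

    HYPOTHETICAL helper: the article calls `format_data` without defining it.
    `rows` may be a pandas DataFrame or any iterable of dicts.
    """
    records = rows.to_dict("records") if hasattr(rows, "to_dict") else rows
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            line = {
                "prompt": str(record["prompt"]),
                "completion": str(record["completion"]),
            }
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
```

The fine-tuning command that follows reads `train_cl_prepared.jsonl`, so the file written here presumably goes through a further preparation step before training.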

Now let’s train a more robust classifier with better data.

!openai api fine_tunes.create -t "train_cl_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "dropped"

# Evaluate model on test data

!openai api fine_tunes.results -i ft-InhTRQGu11gIDlVJUt0LYbEx > cleanlab.csv
df = pd.read_csv('cleanlab.csv')
dropped_acc = df.iloc[-1]['classification/accuracy']

We get an accuracy of over 66%, up from the 63% baseline. That improves a state-of-the-art fine-tunable model (GPT-3, as you can’t fine-tune GPT-4) merely by automatically cleaning the dataset, without any change to the model.

With Cleanlab Studio, it’s also possible to automatically fix the incorrect labels instead of just removing them outright, improving accuracy even further. A guide by Cleanlab shows that this takes accuracy up to 77%.

Takeaways

Using data-centric tools like Cleanlab, you can efficiently find and fix data and label issues, leading to significant improvements in the performance of LLMs like Davinci. This approach does not alter the model architecture or hyperparameters and focuses only on enhancing the quality of the training data.

The approach outlined in this guide could be the key to unlocking even greater accuracy and robustness in AI models, even with future advanced LLMs like GPT-5.

This article was originally published by Frederik Bussler on Hackernoon.
