
Empowering AI advancements through data-centric strategies: A triumph of large language models and tools like Cleanlab

August 16, 2023

The AI revolution was decades in the making. It was a field filled with excitement, yet often punctuated by disappointments and “AI winters.” But recently, something shifted. Large Language Models (LLMs) like ChatGPT, Claude, and Bard catapulted AI from laboratory curiosity to the mainstream.

This shift wasn’t solely a triumph of AI but also a victory over the intricacies of large and messy data. As the saying goes, “garbage in, garbage out.” New tools are emerging that focus on improving the underlying data, and thereby the LLMs trained on it.

The Double Challenge of LLMs

The term “Large Language Models” holds within it two great challenges. First, the sheer volume of data. We’re talking upwards of a petabyte (a million gigabytes) of data for GPT-4, encompassing millions of books, blogs, social media posts, video transcripts, and more. This colossal scale offers vast potential but also poses significant logistical considerations.

Second, the complexity of natural language. Context-dependent, ambiguous, and diverse, language data is a wild beast that even the best algorithms struggle to tame. It’s impossible to accurately label all this data, which inevitably means that even state-of-the-art LLMs are trained on large amounts of incorrectly labeled data.

In facing these challenges, new data-centric tools and methodologies emerged, enabling a true leap in what AI is capable of. Solutions like Cleanlab and others began to offer ways to collect diverse data, automate quality control, and process language into a form suitable for AI models.

These tools did not merely offer incremental improvements; they fundamentally reshaped the approach to AI data handling. They transformed the task of handling large-scale language data from a manual, error-prone process into an automated, precise one, democratizing the field and enabling advancements at an unprecedented pace.

Why Data-Centric AI is Needed (With a Python Demo)

Real-world datasets used in AI can have annotation error rates ranging from 7% to 50%. These imperfections significantly hamper both training and evaluation. Data-centric AI emphasizes improving the quality of the dataset itself.
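To make the evaluation side of this concrete, here is a minimal sketch on synthetic labels (not the dataset used later in this article). It assumes a hypothetical classifier that is always right about the true label and shows that its measured accuracy still drops once 10% of the test labels are wrong.

```python
# Illustrative only: synthetic labels and a hypothetical perfect classifier.
def measured_accuracy(predictions, labels):
    """Fraction of predictions that match the (possibly noisy) labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

# 100 examples over 3 classes; treat these as the true labels.
true_labels = [i % 3 for i in range(100)]

# Corrupt every 10th label by shifting it to the next class (10% noise).
noisy_labels = [(y + 1) % 3 if i % 10 == 0 else y
                for i, y in enumerate(true_labels)]

# A classifier that always predicts the true label.
perfect_predictions = list(true_labels)

acc_vs_truth = measured_accuracy(perfect_predictions, true_labels)
acc_vs_noisy = measured_accuracy(perfect_predictions, noisy_labels)
print(acc_vs_truth, acc_vs_noisy)  # 1.0 0.9
```

Even a flawless model looks 10 points worse here, purely because of the labels it is scored against. That is one reason data-centric work targets the labels themselves.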

OpenAI’s strategy, for instance, illustrates this emphasis: “We prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.”

An approach of manually filtering data, however, is time-consuming and expensive. The Cleanlab package is an open-source framework popular for practicing data-centric AI today. It allows you to run data quality algorithms on your trained ML model’s outputs to detect common dataset issues like label errors, outliers, drift, and more.

With just a few lines of code, you can automatically find and identify problems in various types of data, such as image, text, tabular, and audio. By using the Cleanlab package, you can decide how to improve your dataset and model, re-train your ML model, and see its performance improve without any changes to your existing code.

Cleanlab Studio, on the other hand, is more than just an extension of the Cleanlab package; it’s a no-code platform designed to find and fix problems in real-world datasets. It doesn’t just stop at detecting issues but goes further in handling data curation and correction, and even automates almost all the hard parts of turning raw data into reliable ML or Analytics.

Let’s use the Cleanlab package to demonstrate the power of data-centric AI.

1. Preparing data and fine-tuning

We start with the Stanford Politeness Dataset. Ensure you have the train and test sets loaded. In this demo, we’ll fine-tune the Davinci LLM for 3-class classification, first without Cleanlab, and then see how we can improve accuracy with data-centricity. We can run a simple bash command to train a model.

!openai api fine_tunes.create -t "train_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "baseline"

When that’s done, we can query a fine_tunes.results endpoint to see the test accuracy.

!openai api fine_tunes.results -i ft-9800F2gcVNzyMdTLKcMqAtJ5 > baseline.csv

import pandas as pd

df = pd.read_csv('baseline.csv')
baseline_acc = df.iloc[-1]['classification/accuracy']

We get a result of 63% accuracy. Let’s see if we can improve this.

2. Obtain Predicted Class Probabilities

Now, let’s use OpenAI’s API to compute embeddings and fit a logistic regression model to obtain out-of-sample predicted class probabilities.

# Get embeddings from OpenAI.
import numpy as np
from openai.embeddings_utils import get_embedding

embedding_model = "text-similarity-davinci-001"
train["embedding"] = train.prompt.apply(lambda x: get_embedding(x, engine=embedding_model))
# Stack the per-row embedding lists into a 2D array that scikit-learn accepts.
embeddings = np.array(train["embedding"].tolist())

# Get out-of-sample predicted class probabilities via cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression()
labels = train["completion"].values
pred_probs = cross_val_predict(estimator=model, X=embeddings, y=labels, cv=10, method="predict_proba")
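Why "out-of-sample"? Each example's probabilities should come from a model that never saw that example during training. The sketch below shows only those mechanics in plain Python. The helper names (`kfold_indices`, `fit_prior`, `out_of_sample_probs`) are hypothetical, and the "model" is a toy stand-in that simply returns training-set class frequencies, not the logistic regression used above.

```python
from collections import Counter

def kfold_indices(n_examples, n_folds):
    """Yield (train_idx, heldout_idx) pairs; each example is held out once."""
    indices = list(range(n_examples))
    for fold in range(n_folds):
        heldout = indices[fold::n_folds]  # every n_folds-th example
        heldout_set = set(heldout)
        train = [i for i in indices if i not in heldout_set]
        yield train, heldout

def fit_prior(train_labels, n_classes):
    """Toy 'model': the class frequencies of the training labels."""
    counts = Counter(train_labels)
    return [counts[c] / len(train_labels) for c in range(n_classes)]

def out_of_sample_probs(labels, n_classes, n_folds):
    """Fill one probability row per example from a model that never saw it."""
    pred_probs = [None] * len(labels)
    for train_idx, heldout_idx in kfold_indices(len(labels), n_folds):
        model = fit_prior([labels[i] for i in train_idx], n_classes)
        for i in heldout_idx:
            pred_probs[i] = list(model)
    return pred_probs

toy_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
toy_pred_probs = out_of_sample_probs(toy_labels, n_classes=3, n_folds=5)
print(toy_pred_probs[0])
```

In practice a library routine such as scikit-learn's `cross_val_predict` handles this loop for us, with a real classifier in place of the toy prior.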

With just one line of code, Cleanlab estimates which examples have label issues in our training dataset.

from cleanlab.filter import find_label_issues

Now we can get indices of examples estimated to have label issues:

issue_idx = find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence') # sort indices by likelihood of label error
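To build some intuition for ranking by self-confidence, here is a simplified sketch. It is an assumption for illustration, not Cleanlab's actual algorithm: it flags an example when the model's most likely class disagrees with the given label, then orders the flagged examples by self-confidence (the predicted probability of the given label), lowest first. The function name `rank_label_issues` and the toy data are hypothetical.

```python
def rank_label_issues(labels, pred_probs):
    """Return indices of suspected label errors, most suspicious first.

    Simplified rule for illustration only: flag examples whose most likely
    class differs from the given label, then sort by the probability the
    model assigns to the given label (self-confidence), ascending.
    """
    flagged = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label:
            flagged.append((probs[label], i))
    return [i for _, i in sorted(flagged)]

toy_labels = [0, 1, 2, 1]
toy_pred_probs = [
    [0.90, 0.05, 0.05],  # agrees with label 0
    [0.60, 0.30, 0.10],  # label 1, model prefers 0: self-confidence 0.30
    [0.10, 0.10, 0.80],  # agrees with label 2
    [0.05, 0.10, 0.85],  # label 1, model prefers 2: self-confidence 0.10
]
toy_issue_idx = rank_label_issues(toy_labels, toy_pred_probs)
print(toy_issue_idx)  # [3, 1]
```

Example 3 comes first because the model gives its assigned label the least support.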

Now, we’ve automatically extracted the indices of potentially mislabeled examples, so we can remove them and train a new classifier.

# Remove the label errors

train_cl = train.drop(issue_idx).reset_index(drop=True)
format_data(train_cl, "train_cl.jsonl")
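The `format_data` helper is not shown in this article. As a rough idea of what the drop-and-export step involves, here is a minimal sketch using plain Python lists and the standard `json` module. The names `drop_flagged` and `to_jsonl`, the toy records, and the prompt/completion layout are all assumptions for illustration.

```python
import json

def drop_flagged(records, issue_idx):
    """Return the records whose positions are not in issue_idx."""
    flagged = set(issue_idx)
    return [rec for i, rec in enumerate(records) if i not in flagged]

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(rec) + "\n" for rec in records)

# Hypothetical toy records; completions are class ids stored as strings.
toy_records = [
    {"prompt": "Could you take a look when you have time?", "completion": "2"},
    {"prompt": "Fix this now.", "completion": "2"},  # suspicious label
    {"prompt": "The build finished at noon.", "completion": "1"},
]
toy_issue_idx = [1]  # hypothetical output of the label-issue step

toy_train_cl = drop_flagged(toy_records, toy_issue_idx)
jsonl_text = to_jsonl(toy_train_cl)
print(jsonl_text, end="")
```

With a pandas DataFrame, `train.drop(issue_idx)` plays the role of `drop_flagged` here.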

Now let’s train a more robust classifier with better data.

!openai api fine_tunes.create -t "train_cl_prepared.jsonl" -v "test_prepared.jsonl" --compute_classification_metrics --classification_n_classes 3 -m davinci --suffix "dropped"

# Evaluate model on test data

!openai api fine_tunes.results -i ft-InhTRQGu11gIDlVJUt0LYbEx > cleanlab.csv
df = pd.read_csv('cleanlab.csv')
dropped_acc = df.iloc[-1]['classification/accuracy']

We get an accuracy of over 66%, up from the 63% baseline. That improves a state-of-the-art fine-tunable model (GPT-3, since GPT-4 cannot be fine-tuned at the time of writing) merely by automatically improving the dataset, without any change to the model.

With Cleanlab Studio, it’s also possible to automatically fix the incorrect labels instead of just removing them outright, improving accuracy even further. A guide by Cleanlab shows that this takes accuracy up to 77%.
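To put these figures side by side, here is a small back-of-the-envelope calculation. It treats the rounded accuracies quoted in this article (63%, 66%, 77%) as exact, which is an assumption, and reports how much of the baseline's error each step removes.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Share of the baseline's errors that the new model eliminates."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err

# Rounded accuracies as quoted in the article.
quoted_baseline, quoted_dropped, quoted_fixed = 0.63, 0.66, 0.77

drop_gain = relative_error_reduction(quoted_baseline, quoted_dropped)
fix_gain = relative_error_reduction(quoted_baseline, quoted_fixed)
print(f"dropping flagged examples: {drop_gain:.0%} fewer errors")
print(f"fixing flagged labels:     {fix_gain:.0%} fewer errors")
```

Under these rounded numbers, dropping the flagged examples removes roughly 8% of the baseline's errors, and fixing the labels removes roughly 38%.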

Takeaways

Using data-centric tools like Cleanlab, you can efficiently find and fix data and label issues, leading to significant improvements in the performance of LLMs like Davinci. This approach does not alter the model architecture or hyperparameters and focuses only on enhancing the quality of the training data.

The approach outlined in this guide could be the key to unlocking even greater accuracy and robustness in AI models, even with future advanced LLMs like GPT-5.

This article was originally published by Frederik Bussler on Hackernoon.
