Abstract
The use of Large Language Models (LLMs) in data cleaning tasks has demonstrated impressive capabilities. However, the high inference costs of LLMs pose significant challenges, particularly when processing large-scale datasets under constrained budgets. While many studies focus on directly reducing inference costs, we propose a novel framework that reframes budget-constrained data cleaning as a multi-objective optimization problem. The framework first decomposes the complex data cleaning task into smaller, well-defined sub-tasks. For each sub-task, the most appropriate method is selected from a range of options, such as rule-based tools, code generation methods, smaller pretrained language models, or LLMs, depending on the trade-off between cost and effectiveness. This enables a systematic balance between cost and quality, allowing high-quality data cleaning to be completed within budget constraints. Experimental results validate the effectiveness of this approach: the framework significantly reduces inference costs while maintaining high-quality data processing, offering a practical pathway to optimizing LLM-based data cleaning methods by balancing computational efficiency and data processing quality. Future work could explore dynamic adaptation to evolving sub-tasks or deeper integration with explainable AI and human-in-the-loop approaches to enhance trust and interpretability in data cleaning pipelines.
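The per-sub-task method selection described above can be sketched as a budget-constrained assignment problem. The following is a minimal illustrative sketch, not the paper's actual algorithm: the sub-task names, method names, and (cost, quality) estimates are hypothetical placeholders, and the abstract does not specify how such estimates are obtained or which optimizer is used.

```python
from itertools import product

# Hypothetical candidate methods per sub-task, each with an assumed
# (name, cost, expected quality) tuple; real values would come from
# profiling or validation data, which the abstract does not detail.
candidates = {
    "missing_values":  [("rule_based", 0.0, 0.60), ("small_lm", 1.0, 0.75), ("llm", 5.0, 0.90)],
    "deduplication":   [("rule_based", 0.0, 0.70), ("code_gen", 2.0, 0.85), ("llm", 5.0, 0.92)],
    "entity_matching": [("small_lm", 1.0, 0.65), ("llm", 5.0, 0.95)],
}

def select_methods(candidates, budget):
    """Pick one method per sub-task to maximize total quality while
    keeping total cost within the budget (exhaustive search, feasible
    only for small numbers of sub-tasks)."""
    tasks = list(candidates)
    best_assignment, best_quality = None, -1.0
    for combo in product(*(candidates[t] for t in tasks)):
        cost = sum(method[1] for method in combo)
        if cost > budget:
            continue  # infeasible under the budget constraint
        quality = sum(method[2] for method in combo)
        if quality > best_quality:
            best_assignment = {t: m[0] for t, m in zip(tasks, combo)}
            best_quality = quality
    return best_assignment, best_quality

assignment, quality = select_methods(candidates, budget=8.0)
# With the toy numbers above, the cheap rule-based and small-model options
# free up budget so the LLM is reserved for the hardest sub-task.
```

In practice, exhaustive search would be replaced by a scalable optimizer (e.g. integer programming or a greedy heuristic), but the sketch captures the core trade-off: spending expensive LLM calls only where cheaper methods fall short.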