Abstract
The use of Large Language Models (LLMs) in data cleaning tasks has demonstrated impressive capabilities. However, the high inference costs of LLMs pose significant challenges, particularly when processing large-scale datasets under constrained budgets. While many studies focus on directly reducing inference costs, we propose a novel framework that reframes budget-constrained data cleaning as a multi-objective optimization problem. The framework first decomposes the complex data cleaning task into smaller, well-defined sub-tasks. For each sub-task, the most appropriate method is selected from a range of options, such as rule-based tools, code generation methods, smaller pretrained language models, or LLMs, depending on the trade-off between cost and effectiveness. This enables a systematic balance between cost and quality, allowing high-quality data cleaning to be completed within budget constraints. Experimental results validate the effectiveness of this approach: the framework significantly reduces inference costs while maintaining high-quality data processing, offering a practical pathway to optimizing LLM-based data cleaning methods by balancing computational efficiency and data processing quality. Future work could explore dynamic adaptation to evolving sub-tasks or deeper integration with explainable AI and human-in-the-loop approaches to enhance trust and interpretability in data cleaning pipelines.
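The per-sub-task method selection described above can be sketched as a budget-constrained assignment problem. The following is a minimal illustrative sketch, not the paper's actual algorithm: the sub-task names, method names, and (cost, quality) estimates are hypothetical placeholders, and the abstract does not specify how such estimates are obtained or which optimizer is used.

```python
from itertools import product

# Hypothetical candidate methods per sub-task, each with an assumed
# (name, cost, expected quality) tuple; real values would come from
# profiling or validation data, which the abstract does not detail.
candidates = {
    "missing_values":  [("rule_based", 0.0, 0.60), ("small_lm", 1.0, 0.75), ("llm", 5.0, 0.90)],
    "deduplication":   [("rule_based", 0.0, 0.70), ("code_gen", 2.0, 0.85), ("llm", 5.0, 0.92)],
    "entity_matching": [("small_lm", 1.0, 0.65), ("llm", 5.0, 0.95)],
}

def select_methods(candidates, budget):
    """Pick one method per sub-task to maximize total quality while
    keeping total cost within the budget (exhaustive search, feasible
    only for small numbers of sub-tasks)."""
    tasks = list(candidates)
    best_assignment, best_quality = None, -1.0
    for combo in product(*(candidates[t] for t in tasks)):
        cost = sum(method[1] for method in combo)
        if cost > budget:
            continue  # infeasible under the budget constraint
        quality = sum(method[2] for method in combo)
        if quality > best_quality:
            best_assignment = {t: m[0] for t, m in zip(tasks, combo)}
            best_quality = quality
    return best_assignment, best_quality

assignment, quality = select_methods(candidates, budget=8.0)
# With the toy numbers above, the cheap rule-based and small-model options
# free up budget so the LLM is reserved for the hardest sub-task.
```

In practice, exhaustive search would be replaced by a scalable optimizer (e.g. integer programming or a greedy heuristic), but the sketch captures the core trade-off: spending expensive LLM calls only where cheaper methods fall short.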