Open Access | Just Accepted

Optimizing Multimodal Data Queries in Data Lakes

Runqun Xiong1, Shiyuan Zhao2, Ciyuan Chen1, Zhuqing Xu3

1 School of Computer Science and Engineering, Southeast University, Nanjing 211189, China.

2 School of Computer Software Engineering, Southeast University, Nanjing 211189, China.

3 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China.


Abstract

This paper addresses the challenge of efficiently querying related multimodal data in data lakes, which are large-scale storage and management systems that support heterogeneous data formats, including structured, semi-structured, and unstructured data. Multimodal data queries are crucial because they retrieve related data across modalities, such as tables, images, and text, with applications in fields like e-commerce, healthcare, and education. However, existing methods focus primarily on single-modality queries, such as joinable or unionable table discovery, and struggle to handle the heterogeneity and missing metadata of data lakes while balancing accuracy and efficiency. To tackle these challenges, we propose MQDL, a Multimodal data Query mechanism for Data Lakes, which combines a modality-adaptive indexing mechanism with contrastive learning-based embeddings to unify representations across modalities. Additionally, we introduce product quantization to optimize candidate verification during queries, reducing computational overhead while maintaining precision. We evaluate MQDL on a table-image dataset across multiple business scenarios, measuring precision, recall, and F1-score. Results show that MQDL achieves an accuracy of approximately 90% while demonstrating strong scalability and shorter query response times than traditional methods. These findings highlight MQDL’s potential to enhance multimodal data retrieval in complex data lake environments.
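The abstract does not say how MQDL applies product quantization, so the following is only a generic sketch of the technique, not the authors' implementation. It assumes embeddings are already available as NumPy vectors, and the function names (`kmeans`, `pq_train`, `pq_encode`, `pq_distances`, `pq_search`) and parameters (`m` sub-spaces, `k` centroids per sub-space) are hypothetical. Each vector is split into `m` sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and query-to-candidate distances are then approximated with table lookups instead of full-vector arithmetic.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def pq_train(x, m, k):
    """Split the d dimensions into m sub-spaces and learn k centroids in each.

    Assumes d is divisible by m and k <= 256 (codes are stored as uint8).
    """
    return [kmeans(sub, k, seed=i) for i, sub in enumerate(np.split(x, m, axis=1))]


def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    m = len(codebooks)
    codes = np.empty((len(x), m), dtype=np.uint8)
    for i, (sub, cb) in enumerate(zip(np.split(x, m, axis=1), codebooks)):
        dist = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dist.argmin(axis=1)
    return codes


def pq_distances(query, codes, codebooks):
    """Approximate squared distances from an exact query to quantized vectors."""
    m = len(codebooks)
    # table[i, j] = squared distance from query sub-vector i to centroid j.
    table = np.stack(
        [((cb - q) ** 2).sum(axis=1) for q, cb in zip(np.split(query, m), codebooks)]
    )
    # Look up one table entry per sub-space for every code and add them up.
    return table[np.arange(m), codes].sum(axis=1)


def pq_search(query, codes, codebooks, topk):
    """Return indices of the topk candidates with the smallest approximate distance."""
    return np.argsort(pq_distances(query, codes, codebooks))[:topk]
```

With `m = 4` and `k = 16`, for example, each stored vector shrinks to four one-byte codes, and scoring a candidate costs four table lookups. In a verification setting, the shortlist returned by `pq_search` could then be re-checked against the original embeddings to recover precision lost to quantization; whether MQDL does this is not stated in the abstract.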

Tsinghua Science and Technology
Cite this article:
Xiong R, Zhao S, Chen C, et al. Optimizing Multimodal Data Queries in Data Lakes. Tsinghua Science and Technology, 2025, https://doi.org/10.26599/TST.2025.9010022


Received: 05 December 2024
Revised: 27 January 2025
Accepted: 19 February 2025
Available online: 27 February 2025

© The author(s) 2025

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
