Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge, leading to significant improvements in both factual accuracy and task performance. However, existing dense retrievers face considerable challenges when handling numerical constraints, particularly in queries requiring precise filtering conditions. To systematically explore these issues, we introduce Numerical Constraint Question (NumConQ), a comprehensive multi-domain benchmark dataset that contains more than 6,500 queries covering healthcare, finance, education, sports, and movies. Empirical analysis reveals that state-of-the-art dense retrievers achieve only 16.3% accuracy in numerical constraint satisfaction, significantly underperforming relative to their semantic matching capabilities. To address these limitations, we propose the Numerical Constraint-aware Retriever (NC-Retriever), which features: (1) a two-phase contrastive learning framework that combines in-batch negative sampling with progressively introduced hard negatives, and (2) a hybrid numerical representation scheme for consistent tokenization. Extensive experiments show that NC-Retriever achieves a relative improvement of 65.84% in recall@10 and a 78.28% increase in precision@10 compared to current state-of-the-art methods. The code and benchmark dataset are available at https://github.com/Tongji-KGLLM/NumConQ.
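The in-batch negative sampling mentioned in the abstract can be illustrated with a minimal InfoNCE-style sketch: each query in a batch treats its paired document as the positive and every other document in the batch as a negative. This is a generic illustration of the technique, not NC-Retriever's actual training code; the function name, temperature value, and plain-Python embedding lists are assumptions for the sketch.

```python
import math


def in_batch_contrastive_loss(q_embs, d_embs, temperature=0.05):
    """Illustrative InfoNCE loss with in-batch negatives.

    q_embs[i] and d_embs[i] form a positive (query, document) pair;
    for query i, every d_embs[j] with j != i acts as a negative.
    This is a generic sketch, not the NC-Retriever implementation.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for i, q in enumerate(q_embs):
        # Similarity of query i against every document in the batch.
        logits = [dot(q, d) / temperature for d in d_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive (diagonal) document.
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)
```

With identical, indistinguishable embeddings the loss reduces to log(batch_size); well-separated positive pairs drive it toward zero, which is what a second training phase with harder negatives would then tighten further.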
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).