This study presents the text-visual-prompt segment anything model (TV-SAM), a novel zero-shot segmentation algorithm for multimodal medical images that requires no manual annotations. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and SAM to autonomously generate descriptive text prompts and visual bounding box prompts from medical images, thereby enhancing SAM’s capability for zero-shot segmentation. Comprehensive evaluations on seven public datasets covering eight imaging modalities demonstrate that TV-SAM can effectively segment unseen targets across modalities without additional training. TV-SAM significantly outperforms SAM AUTO (p < 0.01) and GSAM (p < 0.05), closely matches the performance of SAM BBOX with gold-standard bounding box prompts (p = 0.07), and surpasses state-of-the-art methods on specific datasets such as ISIC (0.853 versus 0.802) and WBC (0.968 versus 0.883). These results indicate that TV-SAM is an effective zero-shot segmentation algorithm for multimodal medical images and highlight the significant contribution of GPT-4 to zero-shot segmentation. Integrating foundation models such as GPT-4, GLIP, and SAM enhances the ability to address complex problems in specialized domains.
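The abstract describes a three-stage prompt cascade (GPT-4 → GLIP → SAM). The sketch below illustrates that flow under stated assumptions; the function names (`gpt4_describe`, `glip_detect`, `sam_predict`) and their signatures are hypothetical wrappers, not the authors' actual API.

```python
# Minimal sketch of the TV-SAM prompt cascade (illustrative only):
# GPT-4 produces a descriptive text prompt, GLIP grounds it to bounding
# boxes, and SAM segments inside those boxes -- no manual annotation.

def tv_sam_segment(image, modality, target_hint,
                   gpt4_describe, glip_detect, sam_predict):
    """Zero-shot segmentation sketch.

    gpt4_describe, glip_detect, and sam_predict are assumed callables
    wrapping GPT-4, GLIP, and SAM; their exact interfaces are not given
    in the abstract and are stubbed here for illustration.
    """
    # 1. Text prompt: GPT-4 turns modality/target metadata into a
    #    descriptive phrase, e.g. "pigmented skin lesion in a dermoscopy image".
    text_prompt = gpt4_describe(modality=modality, target=target_hint)

    # 2. Visual prompt: GLIP grounds the phrase to candidate bounding boxes.
    boxes = glip_detect(image, text_prompt)

    # 3. Segmentation: SAM takes the boxes as visual prompts and returns masks.
    masks = [sam_predict(image, box=b) for b in boxes]
    return masks
```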

Ultrasound (US) imaging is clinically used to guide needle insertions because it is safe, real-time, and low-cost. Localizing the needle in the ultrasound image, however, remains challenging because of specular reflection off the smooth needle surface, speckle noise, and similar line-like anatomical features. This study presents a novel, robust needle localization and enhancement algorithm based on deep learning and beam steering, with three key innovations. First, we employ beam steering to maximize the reflection intensity of the needle, which helps detect and localize the needle precisely. Second, we modify the U-Net, an end-to-end network commonly used in biomedical segmentation, by using two branches instead of one in the last up-sampling layer and adding three layers after the last down-sampling layer. The modified U-Net can thus segment the needle shaft region, detect the needle tip landmark location, and determine whether an image frame contains a needle in a single pass, in real time. Third, we develop a needle fusion framework that uses the outputs of the multi-task deep learning (MTL) framework to precisely locate the needle tip and enhance needle shaft visualization. The proposed algorithm therefore not only greatly reduces processing time but also significantly increases needle localization accuracy and enhances needle visualization for real-time clinical intervention applications.
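The modified U-Net is described only at a high level (two output branches in the last up-sampling stage plus a small classifier after the deepest features). The PyTorch sketch below shows one plausible multi-task layout under those assumptions; channel sizes, head designs, and the `MultiTaskUNet` name are illustrative, not the published architecture.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with ReLU, the usual U-Net building block.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class MultiTaskUNet(nn.Module):
    """Sketch of a multi-task U-Net variant: shaft segmentation, tip
    landmark heatmap, and frame-level needle-presence classification."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(1, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = conv_block(64, 32)
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # Two branches in the last up-sampling stage: shaft mask and tip heatmap.
        self.shaft_branch = nn.Sequential(conv_block(32, 16), nn.Conv2d(16, 1, 1))
        self.tip_branch = nn.Sequential(conv_block(32, 16), nn.Conv2d(16, 1, 1))
        # Small head after the deepest features: "does this frame contain a needle?"
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = torch.cat([self.up1(d2), e1], dim=1)
        shaft_mask = self.shaft_branch(d1)   # needle shaft segmentation logits
        tip_heatmap = self.tip_branch(d1)    # needle tip landmark heatmap
        has_needle = self.classifier(b)      # frame-level needle presence logit
        return shaft_mask, tip_heatmap, has_needle

# Example: one forward pass on a single-channel ultrasound frame.
if __name__ == "__main__":
    net = MultiTaskUNet()
    shaft, tip, present = net(torch.randn(1, 1, 128, 128))
```

Sharing one encoder-decoder across the three tasks is what lets a single forward pass produce all outputs, which is the source of the real-time claim in the abstract.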