This study presents TV-SAM, a novel text-visual-prompt segment anything model for multimodal medical image zero-shot segmentation that requires no manual annotations. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and SAM to autonomously generate descriptive text prompts and visual bounding-box prompts from medical images, thereby enhancing SAM’s capability for zero-shot segmentation. Comprehensive evaluations on seven public datasets spanning eight imaging modalities demonstrate that TV-SAM can effectively segment unseen targets across modalities without additional training. TV-SAM significantly outperforms SAM AUTO (p < 0.01) and GSAM (p < 0.05), closely matches the performance of SAM BBOX with gold-standard bounding-box prompts (p = 0.07), and surpasses state-of-the-art methods on specific datasets such as ISIC (0.853 versus 0.802) and WBC (0.968 versus 0.883). The study indicates that TV-SAM is an effective multimodal medical image zero-shot segmentation algorithm and highlights the significant contribution of GPT-4 to zero-shot segmentation. Integrating foundation models such as GPT-4, GLIP, and SAM enhances the ability to address complex problems in specialized domains.
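The following is a minimal sketch of the prompt-generation pipeline the abstract describes (GPT-4 produces a text prompt, GLIP grounds it into bounding boxes, SAM segments inside the boxes). The wrapper functions `gpt4_describe_target`, `glip_ground`, and `sam_segment_with_box`, along with the prompt wording, are illustrative assumptions rather than the paper's actual interfaces.

```python
# Hypothetical wrappers around GPT-4, GLIP, and SAM; names, signatures,
# and prompt wording are assumptions for illustration only.

def tv_sam_segment(image, modality, target,
                   gpt4_describe_target, glip_ground, sam_segment_with_box):
    """Zero-shot segmentation from text and visual prompts, without manual annotation."""
    # 1. GPT-4 generates a descriptive text prompt for the target in this modality.
    text_prompt = gpt4_describe_target(
        f"Describe the visual appearance and location of the {target} "
        f"in a {modality} image for object grounding."
    )

    # 2. GLIP grounds the text prompt in the image, proposing bounding boxes.
    boxes = glip_ground(image, text_prompt)            # list of (x0, y0, x1, y1)
    if not boxes:
        return []                                      # nothing grounded in this image

    # 3. SAM segments the region inside each bounding-box prompt.
    return [sam_segment_with_box(image, box) for box in boxes]
```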


With the development of computers, artificial intelligence, and cognitive science, deep communication between humans and computers has become increasingly important, making affective computing an active research topic. This study therefore constructs a Physiological-signal-based, Mean-threshold, and Decision-level fusion algorithm (PMD) to identify human emotional states. First, we select key features from electroencephalogram and peripheral physiological signals and use the mean-value method to obtain a classification threshold for each participant, accounting for individual differences. Then, we employ Gaussian Naive Bayes (GNB), Linear Regression (LR), Support Vector Machine (SVM), and other classification methods to perform emotion recognition. Finally, we improve the classification accuracy by developing an ensemble model. The experimental results reveal that physiological signals are more suitable for emotion recognition than classical facial and speech signals. The proposed mean-threshold method alleviates the problem of individual differences to a certain extent, and the ensemble learning model we developed significantly outperforms other classification models such as GNB and LR.
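Below is a minimal sketch of the per-participant mean-threshold labeling and a decision-level ensemble, assuming scikit-learn, a feature matrix of selected EEG/peripheral features, and a continuous self-report rating per trial; representing the LR classifier with `LogisticRegression` and using soft voting are assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier


def mean_threshold_labels(ratings: np.ndarray) -> np.ndarray:
    """Binarize one participant's emotion ratings at that participant's own mean,
    so the positive/negative boundary adapts to individual differences."""
    return (ratings > ratings.mean()).astype(int)


def train_pmd_ensemble(features: np.ndarray, ratings: np.ndarray) -> VotingClassifier:
    """Train a soft-voting ensemble (GNB + LR + SVM) on mean-threshold labels.
    `features` holds the selected physiological features for one participant."""
    labels = mean_threshold_labels(ratings)
    ensemble = VotingClassifier(
        estimators=[
            ("gnb", GaussianNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("svm", SVC(probability=True)),  # probability=True enables soft voting
        ],
        voting="soft",
    )
    ensemble.fit(features, labels)
    return ensemble
```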
Ultrasound (US) imaging is clinically used to guide needle insertions because it is safe, real-time, and low cost. Localizing the needle in the ultrasound image, however, remains challenging due to specular reflection off the smooth needle surface, speckle noise, and similar line-like anatomical features. This study presents a novel, robust needle localization and enhancement algorithm based on deep learning and beam steering, with three key innovations. First, we employ beam steering to maximize the reflection intensity of the needle, which helps detect and localize the needle precisely. Second, we modify U-Net, an end-to-end network commonly used in biomedical segmentation, by using two branches instead of one in the last up-sampling layer and adding three layers after the last down-sampling layer. The modified U-Net can thus segment the needle-shaft region, detect the needle-tip landmark location, and determine whether an image frame contains a needle in real time and in a single pass. Third, we develop a needle fusion framework that uses the outputs of the multi-task deep learning (MTL) framework to precisely locate the needle tip and enhance needle-shaft visualization. The proposed algorithm therefore not only greatly reduces processing time but also significantly increases needle localization accuracy and enhances needle visualization for real-time clinical intervention applications.
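The sketch below illustrates, in PyTorch, the multi-task head layout the abstract describes: the last up-sampling stage splits into two branches (shaft segmentation and tip localization), and a small classifier attached after the deepest down-sampling stage flags whether the frame contains a needle. Channel sizes, the heat-map form of the tip branch, and the exact layers of the classifier are assumptions, not the paper's reported architecture.

```python
import torch.nn as nn


class MultiTaskNeedleHead(nn.Module):
    """Illustrative multi-task head for a modified U-Net (assumed layout, not the
    paper's exact architecture)."""

    def __init__(self, up_channels: int = 64, bottleneck_channels: int = 512):
        super().__init__()
        # Branch 1: needle-shaft segmentation map from the last up-sampled features.
        self.seg_branch = nn.Conv2d(up_channels, 1, kernel_size=1)
        # Branch 2: needle-tip landmark predicted as a heat map.
        self.tip_branch = nn.Conv2d(up_channels, 1, kernel_size=1)
        # Classifier after the last down-sampling layer: does the frame contain a needle?
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(bottleneck_channels, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 1),
        )

    def forward(self, last_upsampled, bottleneck):
        seg_logits = self.seg_branch(last_upsampled)   # shaft segmentation logits
        tip_logits = self.tip_branch(last_upsampled)   # tip heat-map logits
        has_needle = self.cls_head(bottleneck)         # frame-level presence logit
        return seg_logits, tip_logits, has_needle
```

All three outputs are produced in one forward pass, which is what allows segmentation, tip detection, and frame classification to share computation in real time.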