Open Access

JudPriNet: Video transition detection based on semantic relationship and Monte Carlo sampling

School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, Auckland 1024, New Zealand
School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin 540004, China, and also with the Department of Electrical Engineering, University of Chile, Santiago 8010037, Chile

Abstract

Video understanding and content boundary detection are vital stages in video recommendation. However, previous content boundary detection methods require collecting information such as location, cast, action, and audio, and if any of these elements are missing, the results may be adversely affected. To address this issue and effectively detect transitions in video content, this paper introduces a video classification and boundary detection method named JudPriNet. The method focuses on objects in videos and their labels, enabling automatic scene detection in video clips and establishing semantic connections among local objects in the frames. As a significant contribution, JudPriNet presents a framework that maps labels to a Continuous Bag of Visual Words model to cluster them and generate new standardized labels as video-type tags, which facilitates automatic classification of video clips. Furthermore, JudPriNet employs a Monte Carlo sampling method to classify video clips, treating the features of video clips as elements within the framework. The proposed method seamlessly integrates video and textual components without compromising training or inference speed. Through experiments, we demonstrate that JudPriNet, with its semantic connections, effectively classifies videos alongside textual content. Our results indicate that, compared with several other detection approaches, JudPriNet excels at high-level content detection without disrupting the integrity of the video content, outperforming existing methods.
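To make the abstract's two core ideas concrete, below is a minimal, hypothetical Python sketch (not the authors' code): object labels detected in each frame are embedded and averaged into clip vectors, the clip vectors are clustered into standardized "video-type tags", and a Monte Carlo sampling step compares randomly sampled frames from adjacent clips to score a possible transition. The embedding table, toy clips, and scoring rule are illustrative assumptions only.

```python
# Minimal sketch (not the authors' implementation) of the two ideas in the
# abstract: clustering object labels into clip-level "video-type tags" via
# label embeddings, and Monte Carlo sampling of frames to flag a transition.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical embedding lookup; in the paper this role is played by a
# Continuous Bag of Visual Words / word2vec-style label embedding.
VOCAB = ["person", "car", "dog", "tree", "building", "boat", "sky"]
EMB = {w: rng.normal(size=16) for w in VOCAB}

def clip_vector(frame_labels):
    """Average the embeddings of all object labels detected in a clip."""
    vecs = [EMB[lbl] for frame in frame_labels for lbl in frame]
    return np.mean(vecs, axis=0)

# Toy clips: each clip is a list of frames, each frame a list of detected labels.
clips = [
    [["person", "car"], ["person", "building"]],   # street-like clip
    [["boat", "sky"], ["boat", "person"]],          # water-like clip
    [["dog", "tree"], ["person", "tree"]],          # park-like clip
]

# Cluster clip vectors to obtain standardized "video-type tags".
X = np.stack([clip_vector(c) for c in clips])
tags = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def transition_score(clip_a, clip_b, n_samples=50):
    """Monte Carlo estimate of label overlap between two clips: repeatedly
    sample one frame from each clip and check whether their label sets
    intersect; low overlap suggests a content transition."""
    hits = 0
    for _ in range(n_samples):
        fa = clip_a[rng.integers(len(clip_a))]
        fb = clip_b[rng.integers(len(clip_b))]
        hits += bool(set(fa) & set(fb))
    return 1.0 - hits / n_samples  # higher score = more likely a transition

for i in range(len(clips) - 1):
    print(f"clip {i}->{i+1}: tag {tags[i]}->{tags[i+1]}, "
          f"transition score {transition_score(clips[i], clips[i+1]):.2f}")
```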

Intelligent and Converged Networks
Pages 134-146
Cite this article:
Ma B, Wu J, Yan WQ. JudPriNet: Video transition detection based on semantic relationship and Monte Carlo sampling. Intelligent and Converged Networks, 2024, 5(2): 134-146. https://doi.org/10.23919/ICN.2024.0010

Received: 09 October 2023
Revised: 26 October 2023
Accepted: 09 February 2024
Published: 30 June 2024
© All articles included in the journal are copyrighted to the ITU and TUP.

This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/
