
LSTM-in-LSTM for generating long descriptions of images

Jun Song¹, Siliang Tang¹, Jun Xiao¹, Fei Wu¹ (corresponding author), Zhongfei (Mark) Zhang²
¹ College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China.
² Department of Computer Science, Watson School of Engineering and Applied Sciences, Binghamton University, Binghamton, NY, USA.

Abstract

In this paper, we propose an approach for generating rich, fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interactions between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence between sentences and images). This architecture produces a long description by predicting one word at each time step, conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, and MSCOCO) for generating long, rich, fine-grained descriptions of images, as measured by four metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
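
To make the decoding procedure concrete, the following is a minimal sketch of an LSTM-in-LSTM decoding step in PyTorch. It is an illustration under our own assumptions rather than the authors' implementation: the class and method names (LSTMinLSTMDecoder, encode_regions, step), the layer sizes, and the choice to run the inner LSTM once over the detected visual cues to form the context vector are ours.

```python
# Minimal sketch (not the authors' code) of an LSTM-in-LSTM caption decoder.
# Assumes region features for the visual cues, e.g., from a CNN-based detector.
import torch
import torch.nn as nn

class LSTMinLSTMDecoder(nn.Module):          # hypothetical name
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, region_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Inner LSTM: scans the spatially concurrent visual cues and encodes
        # their long-range contextual interaction into a context vector.
        self.inner = nn.LSTMCell(region_dim, hidden_dim)
        # Outer LSTM: models the sentence, one word per time step, conditioned
        # on the previous word and the visual context vector.
        self.outer = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def encode_regions(self, regions):
        # regions: (batch, num_regions, region_dim) features of the visual cues
        b, n, _ = regions.shape
        h = regions.new_zeros(b, self.inner.hidden_size)
        c = regions.new_zeros(b, self.inner.hidden_size)
        for i in range(n):                    # feed the cues one by one
            h, c = self.inner(regions[:, i], (h, c))
        return h                              # context vector of visual cues

    def step(self, prev_word, context, state):
        # One time step: predict the next word from the previously generated
        # word, the outer LSTM's hidden state, and the visual context vector.
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.outer(x, state)
        return self.out(h), (h, c)            # word logits, updated state
```

A greedy caption can then be produced by calling step repeatedly, feeding the argmax word back in as prev_word, until an end-of-sentence token is emitted or a length limit is reached.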

Keywords: computer vision, long short-term memory (LSTM), neural network, image description generation


Publication history

Revised: 25 July 2016
Accepted: 19 August 2016
Published: 15 November 2016
Issue date: December 2016

Copyright

© The Author(s) 2016

Acknowledgements

This work was supported in part by the National Basic Research Program of China (No. 2012CB316400), the National Natural Science Foundation of China (Nos. 61472353 and 61572431), the China Knowledge Centre for Engineering Sciences and Technology, the Fundamental Research Funds for the Central Universities, and the 2015 Qianjiang Talents Program of Zhejiang Province. Z. Zhang was supported in part by the US NSF (No. CCF-1017828) and the Zhejiang Provincial Engineering Center on Media Data Cloud Processing and Analysis.

Rights and permissions

This article is published with open access at Springerlink.com

The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
