References
[1] Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; Forsyth, D. Every picture tells a story: Generating sentences from images. In: Computer Vision—ECCV 2010. Daniilidis, K.; Maragos, P.; Paragios, N. Eds. Springer Berlin Heidelberg, 15-29, 2010.
[2] Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A. C.; Berg, T. L. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 35, No. 12, 2891-2903, 2013.
[3] Li, S.; Kulkarni, G.; Berg, T. L.; Berg, A. C.; Choi, Y. Composing simple image descriptions using web-scale n-grams. In: Proceedings of the 15th Conference on Computational Natural Language Learning, 220-228, 2011.
[4] Gong, Y.; Wang, L.; Hodosh, M.; Hockenmaier, J.; Lazebnik, S. Improving image-sentence embeddings using large weakly annotated photo collections. In: Computer Vision—ECCV 2014. Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T. Eds. Springer International Publishing, 529-545, 2014.
[5] Hodosh, M.; Young, P.; Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research Vol. 47, 853-899, 2013.
[6] Ordonez, V.; Kulkarni, G.; Berg, T. L. Im2Text: Describing images using 1 million captioned photographs. In: Proceedings of Advances in Neural Information Processing Systems, 1143-1151, 2011.
[7] Socher, R.; Karpathy, A.; Le, Q. V.; Manning, C. D.; Ng, A. Y. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics Vol. 2, 207-218, 2014.
[8] Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128-3137, 2015.
[9] Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huang, Z.; Yuille, A. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
[10] Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156-3164, 2015.
[11] Jin, J.; Fu, K.; Cui, R.; Sha, F.; Zhang, C. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv preprint arXiv:1506.06272, 2015.
[12] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R. S.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, 2048-2057, 2015.
[13] Bengio, Y.; Schwenk, H.; Senécal, J.-S.; Morin, F.; Gauvain, J.-L. Neural probabilistic language models. In: Innovations in Machine Learning. Holmes, D. E.; Jain, L. C. Eds. Springer Berlin Heidelberg, 137-186, 2006.
[14] Palangi, H.; Deng, L.; Shen, Y.; Gao, J.; He, X.; Chen, J.; Song, X.; Ward, R. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing Vol. 24, No. 4, 694-707, 2016.
[15] Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[16] Krizhevsky, A.; Sutskever, I.; Hinton, G. E. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, 1097-1105, 2012.
[17] Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[18] Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580-587, 2014.
[19] He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Computer Vision—ECCV 2014. Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T. Eds. Springer International Publishing, 346-361, 2014.
[20] Girshick, R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 1440-1448, 2015.
[21] Karpathy, A.; Joulin, A.; Fei-Fei, L. Deep fragment embeddings for bidirectional image sentence mapping. In: Proceedings of Advances in Neural Information Processing Systems, 1889-1897, 2014.
[22] Elliott, D.; Keller, F. Image description using visual dependency representations. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1292-1302, 2013.
[23] Sutton, R. S.; Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.
[24] Sutskever, I.; Vinyals, O.; Le, Q. V. Sequence to sequence learning with neural networks. In: Proceedings of Advances in Neural Information Processing Systems, 3104-3112, 2014.
[25] Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[26] Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics Vol. 2, 67-78, 2014.
[27] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C. L. Microsoft COCO: Common objects in context. In: Computer Vision—ECCV 2014. Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T. Eds. Springer International Publishing, 740-755, 2014.
[28] Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318, 2002.
[29] Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 74-81, 2004.
[30] Kuznetsova, P.; Ordonez, V.; Berg, A. C.; Berg, T. L.; Choi, Y. Collective generation of natural image descriptions. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Vol. 1, 359-368, 2012.
[31] Vedantam, R.; Zitnick, C. L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4566-4575, 2015.
[32] Denkowski, M.; Lavie, A. Meteor Universal: Language specific translation evaluation for any target language. In: Proceedings of the 9th Workshop on Statistical Machine Translation, 376-380, 2014.
[33] De Marneffe, M.-C.; Manning, C. D. The Stanford typed dependencies representation. In: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, 1-8, 2008.