Brain-inspired multimodal learning based on neural networks

Chang Liu, Fuchun Sun (corresponding author), Bo Zhang
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

Modern computational models have drawn on advances in human brain research. This study addresses the problem of multimodal learning with the help of brain-inspired models. Specifically, a unified multimodal learning architecture is proposed, based on deep neural networks inspired by the biology of the human visual cortex. The unified framework is validated on two practical multimodal learning tasks: image captioning, which combines visual and natural-language signals, and visual-haptic fusion, which combines visual and haptic signals. Extensive experiments are conducted under the framework, and competitive results are achieved.
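Although the abstract only outlines the approach, the core idea of a unified multimodal architecture, modality-specific encoders feeding a shared fusion stage, can be made concrete with a short sketch. The following PyTorch example is a hypothetical illustration, not the authors' implementation; all module names, layer sizes, and input shapes are assumptions chosen for brevity.

```python
# Hypothetical sketch (not the paper's code): modality-specific encoders map
# visual and haptic inputs into a shared embedding space; a fusion head
# operates on the concatenated embeddings. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Small CNN standing in for a cortex-inspired visual feature hierarchy."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):                # x: (B, 3, H, W)
        h = self.conv(x).flatten(1)      # (B, 64)
        return self.fc(h)                # (B, out_dim)

class HapticEncoder(nn.Module):
    """1-D conv encoder for haptic time series (illustrative)."""
    def __init__(self, in_ch=4, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, x):                # x: (B, in_ch, T)
        h = self.conv(x).flatten(1)      # (B, 32)
        return self.fc(h)                # (B, out_dim)

class FusionHead(nn.Module):
    """Concatenate modality embeddings and map them to task outputs."""
    def __init__(self, dim=256, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, n_classes))

    def forward(self, v, h):
        return self.net(torch.cat([v, h], dim=1))

# Example forward pass on dummy data
v = VisualEncoder()(torch.randn(2, 3, 64, 64))   # visual embedding
h = HapticEncoder()(torch.randn(2, 4, 100))      # haptic embedding
logits = FusionHead()(v, h)                      # (2, 10)
```

For the image-captioning task, the same pattern would apply with a sequence decoder in place of the classification head: the visual embedding conditions a recurrent decoder that generates the caption token by token.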

Keywords: deep learning, multimodal learning, brain-inspired learning, neural networks


Publication history

Received: 15 July 2018
Revised: 06 August 2018
Accepted: 10 August 2018
Published: 25 November 2018
Issue date: September 2018

Copyright

© The authors 2018

Rights and permissions

This article is published with open access at journals.sagepub.com/home/BSA

Creative Commons Non Commercial CC BY-NC: This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (http://www.creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
