
An In-Depth Analysis of Multimodal Representation Learning with Respect to the Applications and Linked Challenges in Multiple Sectors

Arnav Goenka

Vellore Institute of Technology, Vellore, Tamil Nadu, India

Pages: 50-57

Vol: 12, Issue: 3, 2022

Receiving Date: 2022-08-31
Acceptance Date: 2022-09-27
Publication Date: 2022-10-03


http://doi.org/10.37648/ijrst.v12i03.009

Abstract

Representation learning is a branch of machine learning in which a system automatically extracts features from raw data, typically using deep models. It is essential for tasks such as classification, regression, and identification. Multimodal representation learning is a subset of representation learning that focuses on extracting features from several interconnected modalities, such as text, images, audio, or video. Although these modalities are frequently heterogeneous, they exhibit correlations and relationships. This intrinsic complexity gives rise to several difficulties, including combining multimodal data from various sources, precisely characterizing the relationships and correlations between modalities, and jointly deriving features from multimodal data. Researchers are becoming increasingly interested in these problems, particularly as deep learning gains momentum, and many deep multimodal learning techniques have been developed in recent years. In this study, we present an overview of deep multimodal learning, focusing on techniques proposed in the past decade. We aim to provide readers, especially researchers working on multimodal deep learning, with insights into the latest developments, trends, and open difficulties in this field.
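To make the idea of jointly deriving features from heterogeneous modalities concrete, the sketch below shows one common pattern covered by surveys of this area: encode each modality separately, then fuse the encodings into a joint representation. It is a minimal illustration only, not a method from this paper; the feature dimensions, layer sizes, and class count are assumed placeholder values.

import torch
import torch.nn as nn

class JointFusionClassifier(nn.Module):
    """Toy multimodal fusion model: per-modality encoders whose outputs are
    concatenated into a joint representation used for classification.
    All dimensions below are illustrative assumptions."""

    def __init__(self, image_dim=2048, text_dim=300, hidden_dim=256, num_classes=10):
        super().__init__()
        # Modality-specific encoders (stand-ins for a CNN / text-embedding backbone).
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Classifier head over the fused (concatenated) representation.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, image_feats, text_feats):
        z_img = self.image_encoder(image_feats)   # (batch, hidden_dim)
        z_txt = self.text_encoder(text_feats)     # (batch, hidden_dim)
        fused = torch.cat([z_img, z_txt], dim=1)  # joint multimodal representation
        return self.classifier(fused)

# Usage with random stand-in features for a batch of 4 samples.
model = JointFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 10])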

Keywords: machine learning; multimodal representation learning; Multimodality Robust Line Segment (MRLS)

References

  1. Y. Bengio, A. Courville, P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2013, 35(8):1798-1828.
  2. D. Lahat, T. Adali, C. Jutten. Multimodal Data Fusion: An Overview of Methods, Challenges, and Prospects. Proceedings of the IEEE, 2015, 103(9):1449-1477.
  3. Y. Zheng. Methodologies for Cross-Domain Data Fusion: An Overview. IEEE Transactions on Big Data, 2015, 1(1):1-14.
  4. D. Ramachandram, G. W. Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 2017, 34(6):96-108.
  5. C. Zhao, H. Zhao, J. Lv, et al. Multimodal image matching based on Multimodality Robust Line Segment Descriptor. Neurocomputing, 2016, 177:290-303.
  6. Y. Zhang, Y. Gu, X. Gu. Two-Stream Convolutional Neural Network for Multimodal Matching. International Conference on Artificial Neural Networks (ICANN18), 2018, Pages:14-21.
  7. B. Pitts, S. L. Riggs, N. Sarter. Crossmodal Matching: A Critical but Neglected Step in Multimodal Research, IEEE Transactions on Human-Machine Systems, 2016, 46(3):445-450.
  8. J. H. Choi, J. S. Lee. EmbraceNet: A robust deep learning architecture for multimodal classification. Information Fusion, 2019, 51:259-270.
  9. S. Bahrampour, N. M. Nasrabadi, A. Ray, et al. Multimodal Task-Driven Dictionary Learning for Image Classification. IEEE Transactions on Image Processing, 2015, 25(1):24-38.
  10. L. Gomez-Chova, D. Tuia, G. Moser, et al. Multimodal Classification of Remote Sensing Images: A Review and Future Directions. Proceedings of the IEEE, 2015, 103(9):1-25.
  11. M. Turk. Multimodal interaction: A review. Pattern Recognition Letters, 2014, 36:189-195.
  12. N. Vidakis, K. Konstantinos, G. Triantafyllidis. A Multimodal Interaction Framework for Blended Learning. International Conference on Interactivity, Game Creation, Design, Learning, and Innovation, 2016, Pages: 205-211.
  13. J. Mi, S. Tang, Z. Deng, et al. Object affordance based multimodal fusion for natural Human-Robot interaction. Cognitive Systems Research, 2019, 54:128-137.
  14. P. Hu, D. Peng, X. Wang, et al. Multimodal adversarial network for cross-modal retrieval. Knowledge-Based Systems, 2019, online first, https://doi.org/10.1016/j.knosys.2019.05.017.
  15. F. Shang, H. Zhang, L. Zhu, et al. Adversarial cross-modal retrieval based on dictionary learning. Neurocomputing, 2019, online first, https://doi.org/10.1016/j.neucom.2019.04.041.
  16. I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al. Generative adversarial nets. In: Proceedings of the 2014 Conference on Advances in Neural Information Processing Systems 27. Montreal, Canada: Curran Associates, Inc., 2014. 2672-2680.
  17. W. Cao, Q. Lin, Z. He, et al. Hybrid representation learning for cross-modal retrieval. Neurocomputing, 2019, 345:45-57.
  18. D. Rafailidis, S. Manolopoulou, P. Daras. A unified framework for multimodal retrieval. Pattern Recognition, 2013, 46:3358-3370.
  19. C. C. Park, B. Kim, G. Kim. Towards Personalized Image Captioning via Multimodal Memory Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4):999-1012.
  20. Y. Niu, Z. Lu, J. R. Wen, et al. Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation. IEEE Transactions on Image Processing, 2019, 28(4):1720-1731.
  21. D. Zhao, Z. Chang, S. Guo. A multimodal fusion approach for image captioning. Neurocomputing, 2019, 329:476-485.
  22. C. Wu, Y. Wei, X. Chu, et al. Hierarchical attention-based multimodal fusion for video captioning. Neurocomputing, 2018, 315:362-370.
  23. C. L. Chou, H. T. Chen, S. Y. Lee. Multimodal Video-to-Near-Scene Annotation. IEEE Transactions on Multimedia, 2017, 19(2):354-366.
  25. S. Shekhar, V. M. Patel, N. M. Nasrabadi, et al. Joint Sparse Representation for Robust Multimodal Biometrics Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(1):113-126.
  26. Z. Gu, B. Lang, T. Yue, et al. Learning Joint Multimodal Representation Based on Multi-fusion Deep Neural Networks. International Conference on Neural Information Processing, 2017, pp. 276-285.
  27. N. Srivastava, R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. in Proc. Advances in Neural Inform. Processing Syst., 2012, pp. 2222-2230.
  28. K. Sohn, W. Shang, H. Lee. Improved multimodal deep learning with variation of information. in Proc. Advances in Neural Information Processing Systems., 2014, pp. 2141-2149.
  29. M. R. Amer, T. Shields, B. Siddiquie, et al. Deep Multimodal Fusion: A Hybrid Approach. International Journal of Computer Vision, 2018, 126(2-4):440-456.
  30. S. S. Rajagopalan, L. P. Morency, T. Baltrusaitis, et al. Extending long short-term memory for multi-view structured learning. in Proc. Eur. Conf. Comput. Vis., 2016, pp. 338-353.
  31. W. Feng, N. Guan, Y. Li, et al. Audio visual speech recognition with multimodal recurrent neural networks. 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14-19 May 2017, 681-688.
  32. A. H. Abdulnabi, B. Shuai, Z. Zuo, et al. Multimodal Recurrent Neural Networks with Information Transfer Layers for Indoor Scene Labeling. IEEE Transactions on Multimedia, 2018, 20(7):1656-1671.
  33. J. Weston, S. Bengio, N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2011, pp. 2764-2770.
  34. A. Frome, G. Corrado, J. Shlens. DeViSE: A deep visual-semantic embedding model. In Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 2121-2129.
  35. R. Kiros, R. Salakhutdinov, R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. Trans. Assoc. Comput. Linguistics, 2015, pp. 1-13.
  36. Y. Pan, T. Mei, T. Yao, et al. Jointly modeling embedding and translation to bridge video and language. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4594-4602.
  37. M. M. Bronstein, A. M. Bronstein, F. Michel, et al. Data fusion through cross-modality metric learning using similarity-sensitive hashing. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3594-3601.
  38. S. Kumar, R. Udupa. Learning hash functions for cross-view similarity search. In Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2011, pp. 1360-1365.
  39. Q. Y. Jiang, W. J. Li. Deep cross-modal hashing. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3270-3278.
  40. A. Mandal, P. Maji. FaRoC: Fast and Robust Supervised Canonical Correlation Analysis for Multimodal Omics Data. IEEE Transactions on Cybernetics, 2018, 48(4):1229-1241.
  41. Y. Yu, S. Tang, K. Aizawa, et al. Category-Based Deep CCA for Fine-Grained Venue Discovery from Multimodal Data. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(4):1250-1258.
  42. N. E. D. Elmadany, Y. He, L. Guan. Multimodal Learning for Human Action Recognition Via Bimodal/Multimodal Hybrid Centroid Canonical Correlation Analysis. IEEE Transactions on Multimedia, 2019, 21(5):1317-1331.

