Multimodal
Multimodal NLP involves combining different types of information, such as text, speech, images, and video, to enhance natural language processing tasks. This allows machines to better comprehend human communication by taking into account contextual information beyond text alone. For instance, multimodal NLP can improve machine translation by integrating visual information from images or video, or improve sentiment analysis by incorporating non-textual cues such as facial expressions or tone of voice. Multimodal NLP is a growing field of study and is expected to become increasingly significant as more data becomes available across multiple modalities.
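As a rough illustration of the idea, the sketch below shows a minimal late-fusion model in PyTorch: features from separate text, image, and audio encoders are concatenated and passed to a small classifier. The feature dimensions, the three-way output, and the encoder choices are illustrative assumptions, not tied to any specific benchmark or paper in this page.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Minimal late-fusion sketch: concatenate per-modality features,
    then classify. Dimensions and the 3-class output are placeholders."""

    def __init__(self, text_dim=300, image_dim=2048, audio_dim=74, num_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + image_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feat, image_feat, audio_feat):
        # Each *_feat is a (batch, dim) tensor produced by some
        # modality-specific encoder (e.g. a sentence encoder or a CNN).
        fused = torch.cat([text_feat, image_feat, audio_feat], dim=-1)
        return self.fuse(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 300), torch.randn(4, 2048), torch.randn(4, 74))
print(logits.shape)  # torch.Size([4, 3])
```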
Multimodal Emotion Recognition
IEMOCAP
The IEMOCAP database (Busso et al., 2008) contains recordings of 10 speakers in two-way conversations, segmented into utterances. All conversations are in English. The database provides the following categorical labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other.
Monologue:
Model | Accuracy | Paper / Source |
---|---|---|
CHFusion (Poria et al., 2017) | 76.5% | Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling |
bc-LSTM (Poria et al., 2017) | 74.10% | Context-Dependent Sentiment Analysis in User-Generated Videos |
Conversational: The conversational setting lets models capture the emotions expressed by speakers over the course of a conversation, taking inter-speaker dependencies into account (a minimal context-modeling sketch follows the table below).
Model | Weighted Accuracy (WAA) | Paper / Source |
---|---|---|
CMN (Hazarika et al., 2018) | 77.62% | Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos |
Memn2n | 75.08% | Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos |
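The sketch below is not the CMN architecture itself, just a bc-LSTM-style illustration of context modeling: per-utterance features are run through a bidirectional LSTM over the whole dialogue so that each utterance's emotion prediction can draw on surrounding context. The feature dimension and the six-way label set are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextualUtteranceClassifier(nn.Module):
    """bc-LSTM-style sketch: a bidirectional LSTM over the sequence of
    utterance features in a conversation, followed by a per-utterance
    emotion classifier. Dimensions and label count are placeholders."""

    def __init__(self, utt_dim=100, hidden=64, num_emotions=6):
        super().__init__()
        self.context_rnn = nn.LSTM(utt_dim, hidden, batch_first=True,
                                   bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_emotions)

    def forward(self, utterances):
        # utterances: (batch, num_utterances, utt_dim) features from any
        # multimodal encoder; context flows across the dialogue.
        context, _ = self.context_rnn(utterances)
        return self.classifier(context)  # (batch, num_utterances, num_emotions)

model = ContextualUtteranceClassifier()
scores = model(torch.randn(2, 10, 100))  # 2 dialogues, 10 utterances each
print(scores.shape)                      # torch.Size([2, 10, 6])
```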
Multimodal Metaphor Recognition
Mohammad et al. (2016) created a dataset of verb-noun pairs from WordNet that have multiple senses and annotated these pairs for metaphoricity (metaphor or not a metaphor). The dataset is in English.
Model | F1 Score | Paper / Source | Code |
---|---|---|---|
5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.75 | Shutova et al., 2016 | Unavailable |
Tsvetkov et al. (2014) created a dataset of adjective-noun pairs that they annotated for metaphoricity. The dataset is in English (a minimal embedding-based classification sketch follows the table below).
Model | F1 Score | Paper / Source | Code |
---|---|---|---|
5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.79 | Shutova et al., 2016 | Unavailable |
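The sketch below is not the Shutova et al. (2016) model (which fuses visual features from a convolutional network with word embeddings); it only illustrates the basic setup of classifying a word pair for metaphoricity from embedding features. The vocabulary, the random "embeddings", and the toy labels are all made-up placeholders so the example is self-contained and runnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical embedding lookup: in practice these would be pretrained
# word2vec vectors (plus image-derived vectors in the multimodal variant).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=300) for w in
       ["devour", "book", "eat", "apple", "kill", "time", "boil", "egg"]}

def pair_features(head, arg):
    v, n = emb[head], emb[arg]
    cos = v @ n / (np.linalg.norm(v) * np.linalg.norm(n))
    # Concatenate both word vectors and their cosine similarity.
    return np.concatenate([v, n, [cos]])

# Toy labels: 1 = metaphorical, 0 = literal (illustrative only).
pairs = [("devour", "book", 1), ("eat", "apple", 0),
         ("kill", "time", 1), ("boil", "egg", 0)]
X = np.stack([pair_features(h, a) for h, a, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```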
Multimodal Sentiment Analysis
MOSI
The MOSI dataset (Zadeh et al., 2016) is rich in sentiment expressions: 93 people review topics in English. The videos are segmented, and each segment's sentiment is scored from -3 (strongly negative) to +3 (strongly positive) by 5 annotators (a minimal score-binarization sketch follows the table below).
Model | Accuracy | Paper / Source |
---|---|---|
bc-LSTM (Poria et al., 2017) | 80.3% | Context-Dependent Sentiment Analysis in User-Generated Videos |
MARN (Zadeh et al., 2018) | 77.1% | Multi-attention Recurrent Network for Human Communication Comprehension |
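The accuracy figures above are commonly reported on the binary (positive vs. negative) formulation of MOSI. The sketch below shows one simple way a segment's annotator scores on the -3..+3 scale could be reduced to a binary label; the tie-handling choice is an assumption, and individual papers may differ in exactly how they binarize.

```python
import numpy as np

def segment_label(annotator_scores):
    """Average the per-annotator sentiment scores (on the -3..+3 MOSI
    scale) and binarize: True = positive, False = negative.
    Handling of an exact 0 average varies between papers; here we
    arbitrarily treat 0 as non-positive."""
    return float(np.mean(annotator_scores)) > 0

print(segment_label([+2, +3, +1, +2, +2]))  # True  (clearly positive)
print(segment_label([-1, -2, 0, -1, -3]))   # False (negative)
```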
Visual Question Answering
VQAv2
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Reported accuracies typically follow the standard VQA consensus metric (sketched after the table below).
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
UNITER (Chen et al., 2019) | 73.4 | UNITER: Learning Universal Image-Text Representations | Link |
LXMERT (Tan et al., 2019) | 72.54 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Link |
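A small sketch of the consensus-style VQA accuracy computation, under the simplifying assumptions noted in the comments (the official evaluation additionally normalizes answer strings and averages over annotator subsets):

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus-style VQA accuracy: an answer counts as fully correct
    if at least 3 of the (typically 10) human annotators gave it,
    i.e. roughly min(#matches / 3, 1)."""
    matches = sum(answer == predicted for answer in human_answers)
    return min(matches / 3.0, 1.0)

humans = ["2", "2", "two", "2", "2", "3", "2", "2", "2", "2"]
print(vqa_accuracy("2", humans))  # 1.0
print(vqa_accuracy("3", humans))  # 0.333...
```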
GQA - Visual Reasoning in the Real World
GQA focuses on real-world compositional reasoning.
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
Kakao Brain | 73.24 | GQA Challenge | Unavailable |
LXMERT (Tan et al., 2019) | 60.3 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Link |
TextVQA
TextVQA requires models to read and reason about text in an image in order to answer questions about it.
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
M4C (Hu et al., 2020) | 40.46 | Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA | Link |
VizWiz dataset
This task focuses on answering visual questions that originate from a real use case in which blind people submitted images along with recorded spoken questions in order to learn about their physical surroundings.
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
Pythia | 54.22 | FB’s Pythia repository | Link |
BUTD Vizwiz (Gurari et al., 2018) | 46.9 | VizWiz Grand Challenge: Answering Visual Questions from Blind People | Unavailable |