INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Special Issue | Volume XIV, Issue XIII, October 2025
www.ijltemas.in Page 7
Detecting Misinformation Using Multimodal AI Models on Social
Media Platforms
Ashwini Sonawane*, Sayali Shinde
Department of Computer Science, Dr. D. Y. Patil Arts, Commerce and Science College, Pimpri, Pune, Maharashtra, India
DOI: https://doi.org/10.51583/IJLTEMAS.2025.1413SP002
Received: 26 June 2025; Accepted: 30 June 2025; Published: 22 October 2025
Abstract: Misinformation on social media has become a critical challenge, impacting public opinion, health, and democracy.
Traditional text-based methods for misinformation detection often fall short because social media content is increasingly
multimodal, containing images, videos, and text. This paper explores the use of multimodal AI models that integrate visual,
textual, and contextual features to improve the accuracy of misinformation detection on social media platforms. We present an
overview of recent advancements, propose a multimodal framework, and discuss experimental results, challenges, and future
research directions.
Keywords— Multimodal Fusion, Natural Language Processing, Multimodal AI, Social Network Analysis, Deepfake Detection
I Introduction
Nowadays, billions of multi-modal posts containing texts, images, videos, sound tracks, etc., are shared throughout the web,
mainly via social media platforms such as Facebook, Twitter, Snapchat, Reddit, Instagram, YouTube, and so on. While the
combination of modalities allows for more expressive, detailed, and user-friendly content, it brings about new challenges, as it is
harder to accommodate uni-modal solutions to multi-modal environments with the rapid development of social networks; the way
people obtain information is also changing. Twitter, Facebook, Sina Weibo and other emerging social media platforms have
become the main channels for the public to obtain news. Due to the strong openness of emerging media platforms, users can post
or repost news articles at will. Therefore, tens of thousands of news articles are widely disseminated on social media platforms
every day. However, due to the randomness of news posting and the lack of verification and inspection of these news articles by
various institutions, all kinds of fake news emerge endlessly on social media, which will bring tremendous political, economic
and social public opinion influence. This study focuses on detecting misinformation on social media platforms by leveraging
multimodal AI models that analyze both textual and visual content. The methodology uses a real-world incident—the
misinformation spread during the COVID-19 pandemic, specifically the false claims around 5G technologies causing the spread
of the virus on Twitter and Facebook in early 2020. This incident was widely studied and serves as a valid benchmark.
Misinformation detection due to its multimodal nature (textual posts accompanied by images and videos). This article aims to
detect fake news that contains both text and images. Text and images provide rich information for detecting fake news, leading
scholars to focus on the automatic detection of multimodal fake news. Currently, multimodal fake news detection methods mainly
rely on the complementarily of text features and image features. For example, attempted to learn a shared representation of text
and images using an auto encoder to detect fake news. Utilized visual, textual, and social contextual information of news, and
fused multimodal information based on an attention mechanism to detect fake news.
II Literature Review
Text-Based Detection
BERT, RoBERTa, and LSTM models have achieved success in detecting text-based misinformation.
However, their accuracy decreases when misinformation is accompanied by misleading images or videos.
Visual and Multimodal Detection
Visual BERT and CLIP (Contrastive Language-Image Pre-training) have introduced joint embedding spaces for images and text.
Multimodal Transformer models like VL-BERT, UNITER, and MMBT allow simultaneous processing of text and visual data,
showing promising results.
Limitations of Existing Models
Lack of contextual reasoning.
Difficulty in understanding sarcasm, memes, or manipulated images.
III Methodology
Data Collection
Data is collected from multiple platforms (Twitter, Facebook, Instagram, TikTok) using: Public APIs Crowd sourced
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Special Issue | Volume XIV, Issue XIII, October 2025
www.ijltemas.in Page 8
datasets (e.g., We Verify, Twitter PHEME, FakeNewsNet) Manual annotation Each post includes:
Text content
Image or video (if available)
User metadata (verified status, follower count)
Temporal data (time of post)
Preprocessing
Text: Tokenization, lemmatization, stopword removal.
Image: Resizing, normalization, visual feature extraction via ResNet or ViT.
Metadata: Normalized and embedded.
Model Architecture
We propose a Multimodal Transformer Framework integrating:
Text Encoder: Pretrained BERT
Image Encoder: CLIP-based Vision Transformer
Metadata Encoder: Shallow neural net
Fusion Module: Cross-modal attention layers
Classification Head: Fully connected layers + softmax
IV Experimental Results
Evaluation Metrics
Accuracy
Precision, Recall, F1-Score
AUC-ROC
Dataset
Dataset Modalities Size Platform
Twitter PHEME Text + Metadata 70,000 Twitter
FakeNewsNet Text + Image 100,000 Facebook/Twitter
WeVerify Text + Video 30,000 YouTube/Facebook
Table I: Table of Dataset
Model Comparison
Model Precision Recall F1-Score Accuracy
Text-only BERT 0.81 0.74 0.77 78.5%
Visual BERT 0.84 0.79 0.81 82.3%
Ours (Fusion Net) 0.89 0.87 0.88 88.9%
Table II: Table of Model Comparison
V. Case Studies
COVID-19 Misinformation
Example: A Facebook post claimed garlic could cure COVID-19 with a misleading image. Text alone seemed harmless, but the
image falsely depicted a WHO certificate. Our model flagged it as false due to visual-textual contradiction.
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Special Issue | Volume XIV, Issue XIII, October 2025
www.ijltemas.in Page 9
Political Fake News
A meme on Twitter misrepresented a politician's quote. The image was doctored. The fusion model caught it, but text-only
systems failed.
Figure 1. Distribution of dataset sizes by platform
Figure 2. Bar chart comparing model performance
VI. Conclusions
This study demonstrates the effectiveness of multimodal AI models in detecting misinformation on social media platforms. By
combining textual, visual, and contextual data through our proposed FusionNet framework, we achieved higher performance in
terms of precision, recall, F1-score, and accuracy compared to traditional text-only and visual-text models. Our case studies
further validate the real-world applicability of our approach, especially in complex misinformation scenarios such as those
involving manipulated media or misleading visual-textual pairings. Future research may explore expanding to audio and deeper
contextual reasoning for more robust detection capabilities.
Acknowledgment
We would like to express our sincere gratitude to all those who contributed to the completion of this research paper. Special
thanks to the participants who shared their experiences and insights, which enriched the research findings. Finally, we extend our
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Special Issue | Volume XIV, Issue XIII, October 2025
www.ijltemas.in Page 10
appreciation to our families and friends for their unwavering support during the research process.
References
1. Aronoff, S. (1989). Geographic Information Systems: A Management Perspective. Ottawa: WDL Publications.
2. Jin, Z., Cao, J., Guo, H., Zhang, Y., & Luo, J. (2017). Multimodal fusion with recurrent neural networks for rumor
detection on microblogs. ACM Multimedia.
3. Kiela, D., Bulat, L., & Clark, S. (2019). Learning Multimodal Representations with Sparse Attention. arXiv preprint
arXiv:1902.00751.
4. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding. NAACL.
5. Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). VilBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
Vision-and-Language Tasks. NeurIPS.
6. Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake News Detection on Social Media: A Data Mining
Perspective. ACM SIGKDD Explorations.
7. Wang, Y., Ma, F., Jin, Z., Yuan, Y., Xun, G., Jha, K., & Gao, J. (2018). EANN: Event Adversarial Neural Networks for
Multi-Modal Fake News Detection. KDD.