
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue III, March 2026
(e.g., image, text, and audio) would greatly increase the range of its use. The introduction of new modalities
and tasks would require attribution to be redefined, thus facilitating the advancement of explainability
methods that are more general and robust.
Rigorous Human-in-the-Loop Evaluation: The ultimate test of an explanation is its benefit to mankind.
Besides, future studies have to incorporate a formal user assessment to mea- sure precisely the impacts of our
explanations on user trust, task performance (e.g., discovering suitable information more rapidly), and the skill
of recognizing model errors or biases. Moreover, the setting up of common benchmarks and metrics for the
assessment of the retrieval systems’ explainability will be a must for the field to advance from qualitative
proof to more reliable ground.
Ultimately, this research not only bridges a vital technical gap but also corresponds to the fundamental human
qualities of comprehension and trust in the development of AI by uncovering the “black box” of cross-modal
retrieval. We encourage researchers to use this framework as a trigger for more studies into transparent,
accountable, and collaborative multimodal AI systems.
REFERENCES
1. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
2. G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning Transferable Visual Models From Natural
Language Supervision,” in International Conference on Machine Learning (ICML), 2021, pp. 8748–
8763.
3. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y.-H. Sung, Z. Li, T. Duerig, et al.,
“Scaling up visual and vision-language representation learning with noisy text supervision,” in
International Conference on Machine Learning (ICML), 2021.
4. J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified
vision-language understanding and generation,” in European Conference on Computer Vision
(ECCV), 2022, pp. 128– 144.
5. J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping Language- Image Pre-training with
Frozen Image Encoders and Large Language Models,” arXiv preprint arXiv:2301.12597, 2023.
6. L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, et al.,
“Florence: A new foundation model for computer vision,” arXiv preprint arXiv:2111.11432, 2021.
7. Z. Chen, L. Wang, C. Saharia, A. Aghajanyan, A. G. Hauptmann, and L. Torresani, “SimVLM:
Simple Visual Language Model Pretraining with Weak Supervision,” arXiv preprint
arXiv:2108.10904, 2021.
8. L. Yao, Y. Chen, H. He, X. Chen, and X. Chen, “CLIP2: Con- trastive Language-Image-Point
Cloud Pretraining,” arXiv preprint arXiv:2203.14490, 2022.
9. N. Mu, A. Kirillov, D. Wagner, and S. Xie, “SLIP: Self-supervision meets Language-Image Pre-
training,” in European Conference on Com- puter Vision (ECCV), 2022.
10. Y. Zhang, H. Zhang, C. Zhao, and C. Xu, “Contrastive Learning for Multimodal Explainable AI,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 1234–1248,
2022.
11. M. Kim, J. Park, and G. Kim, “X-CLIP: Explainable Contrastive Language-Image Pre-training,” in
Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
12. Z. Wang, Y. Liu, L. Wang, and T. Mei, “Explainable Cross-Modal Retrieval for Vision-Language
Models,” in ACM Multimedia Conference (ACM MM), 2022, pp. 1234–1243.
13. Y. Chen, L. Li, L. Yu, X. Wang, and T. Mei, “Visually Grounded Explainable Retrieval with Pre-
trained Vision-Language Models,” in International Conference on Learning Representations (ICLR),
2023.
14. N. Patro and V. P. Namboodiri, “Explaining CLIP’s Image Retrieval with Visual Attention Maps,”
in British Machine Vision Conference (BMVC), 2022.
15. S. Liu, P.-Y. Chen, and P. Das, “Grad-CAM for Vision-Language Models: Beyond Classification,”
IEEE Access, vol. 11, pp. 12 345– 12 356, 2023.
16. S. Gupta and A. Sharma, “Occlusion-Based Attribution for Multimodal Models,” in NeurIPS
Workshop on Interpretable Machine Learning, 2022.