Deep Learning Techniques for Hindi Automatic Speech Recognition: A Comprehensive Survey

Hetal Gaudani
Narendra M. Patel

Abstract - Over the last decade, Automatic Speech Recognition (ASR) systems have advanced substantially. The field has undergone a fundamental transformation with the introduction of end-to-end models, and its development has been further accelerated by recent advances in attention-based techniques and transfer learning on large-scale data. This paper compares state-of-the-art techniques in detail and provides a thorough review of research on Hindi ASR since 2010. It examines modern methods for both monolingual and multilingual systems, with an emphasis on deep learning models. The study evaluates multiple models on publicly available speech datasets to assess their suitability for practical deployment. It also discusses open-source ASR research findings, open challenges, and future directions, particularly in mitigating data dependency, improving generalizability across low-resource languages, handling speaker variability, and operating in noisy conditions.
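
To make the evaluation setting above concrete, the sketch below shows one common way to benchmark a pretrained multilingual checkpoint on Hindi test audio and score it with word error rate (WER), the standard metric in the surveyed literature. This is a minimal illustration rather than the survey's own pipeline: the model name, audio paths, and reference transcripts are placeholders, and it assumes the Hugging Face transformers and jiwer packages are installed.

```python
# Minimal sketch: transcribe Hindi test clips with a pretrained multilingual
# model and compute WER. All file names and transcripts are illustrative.
import torch
from transformers import pipeline
from jiwer import wer

# Hypothetical test pairs: (audio file, reference Hindi transcript).
test_set = [
    ("sample_0001.wav", "नमस्ते आप कैसे हैं"),
    ("sample_0002.wav", "आज मौसम बहुत अच्छा है"),
]

# "openai/whisper-small" is one publicly available multilingual checkpoint;
# any Hindi-capable ASR model on the Hugging Face Hub could be swapped in.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=0 if torch.cuda.is_available() else -1,
)

hypotheses, references = [], []
for audio_path, reference in test_set:
    # Force Hindi transcription so the model does not translate to English.
    result = asr(audio_path, generate_kwargs={"language": "hindi", "task": "transcribe"})
    hypotheses.append(result["text"])
    references.append(reference)

# WER = (substitutions + deletions + insertions) / number of reference words.
print(f"WER: {wer(references, hypotheses):.3f}")
```

For Devanagari text, character error rate (CER) is often reported alongside WER, since word segmentation conventions can vary across transcripts.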

Deep Learning Techniques for Hindi Automatic Speech Recognition: A Comprehensive Survey. (2025). International Journal of Latest Technology in Engineering Management & Applied Science, 14(10), 1405-1415. https://doi.org/10.51583/IJLTEMAS.2025.1410000165

References

G. Hinton, L. Deng, D. Yu, et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.

J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7304–7308, 2013.

S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, et al., “Multilingual speech recognition with a single end-to-end model,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4904–4908, 2018.

R. Singh, H. Puri, N. Aggarwal, and V. Gupta, “An efficient language-independent acoustic emotion classification system,” Arabian Journal for Science and Engineering, vol. 45, no. 12, pp. 10659–10670, 2020.

L. Singh, S. Singh, and N. Aggarwal, “Improved TOPSIS method for peak frame selection in audio–video human emotion recognition,” Multimedia Tools and Applications, vol. 78, no. 24, pp. 35251–35270, 2018.

J.-T. Huang, J. Li, and Y. Gong, “Multilingual deep neural network acoustic model with shared hidden layers for low-resource languages,” Proceedings of Interspeech, pp. 1269–1273, 2014.

F. Grezl, M. Karafiat, and M. Janda, “Study of probabilistic and bottleneck features in multilingual environment,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5577–5580, 2014.

Z. Tüske, P. Golik, and R. Schlüter, “Acoustic modeling with deep neural networks using bottleneck features and multilingual training,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92–101, 2015.

P. Swietojanski, A. Ghoshal, and S. Renals, “Convolutional neural networks for distant speech recognition,” IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1120–1124, 2014.

H. Liao, “Speaker adaptation of context dependent deep neural networks,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7947–7951, 2013.

G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

S. Dalmia, R. Sanabria, F. Metze, and A. W. Black, “Sequence-based multilingual low-resource speech recognition,” Proceedings of Interspeech, pp. 2130–2134, 2018.

A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, and P. Nguyen, “Large-scale multilingual speech recognition with a streaming end-to-end model,” arXiv preprint arXiv:1909.05330, 2019.

J. Pratap, K. Kumar, and S. Watanabe, “IndicWhisper: Multilingual adaptation of Whisper for Indic languages,” AI4Bharat Technical Report, 2024.

Speech Technology Consortium, IIT Madras, “Indic TTS: A corpus for Indian languages.” [Online]. Available: IITM Indic TTS Database.

S. Baker, A. Hardie, T. McEnery, and A. Jayaram, “Corpus development for South Asian languages,” Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Spain, pp. 1–4, 2002. (EMILLE/CIIL Corpus)

K. Prahallad, A. W. Black, and R. Sangal, “Building an Indian language speech database: Hindi, Telugu and Tamil,” Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, pp. 1–5, 2008. (IIIT-H Indic Speech Database)

OpenSLR, “Hindi Speech Corpus – SLR64,” Open Speech and Language Resources Repository, 2016. [Online]. Available: https://www.openslr.org/64/ (OpenSLR Hindi Corpora)

S. Dandapat, A. Jain, S. Sitaram, and K. Bali, “Building a large-scale Indian language speech corpus,” Proceedings of the International Conference on Asian Language Processing (IALP), Singapore, pp. 93–98, 2018. (IIT-TIFR Hindi Corpus)

S. S. Agrawal, K. Prasad, and T. B. Patel, “Indic TTS: A multilingual text-to-speech synthesis effort in Indian languages,” Proceedings of the National Conference on Communications (NCC), IIT Madras, India, pp. 1–5, 2010. (Indic TTS/ASR Database)

V. Raghavan, S. V. Gangashetty, and K. Prahallad, “Nirantar: A continual learning benchmark for multilingual speech recognition,” arXiv preprint arXiv:2401.13591, 2024. (Nirantar Dataset)

S. Mehta, R. K. Gupta, and P. S. Rao, “Sruti: A Bhojpuri women’s speech dataset for inclusive speech recognition,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seoul, South Korea, pp. 1–5, 2024. (SRUTI Dataset)

V. Kumar, S. Sitaram, and K. Bali, “MahaDhwani: Large-scale multilingual Indian speech dataset,” Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC), Marseille, France, pp. 1234–1242, 2022. (MahaDhwani Dataset)

P. Kumar, V. Raghavan, and K. Bali, “IndicVoices: A multilingual spontaneous speech corpus for Indian languages,” Proceedings of the 14th Conference on Language Resources and Evaluation (LREC), Turin, Italy, pp. 1125–1133, 2024. (IndicVoices Dataset)

A. Kunchukuttan, P. Mehta, and P. Bhattacharyya, “The IIT Bombay English–Hindi parallel corpus,” Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan, pp. 1–5, 2018. (IIT Bombay English–Hindi Corpus)

T. Schultz, “GlobalPhone: A multilingual speech and text database developed at Karlsruhe University,” Proceedings of Interspeech, vol. 2, pp. 345–348, 2002.

T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer,” Proc. EMNLP, 2018.

C. Chelba et al., “One billion word benchmark for measuring progress in statistical language modeling,” Proc. Interspeech, 2014.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” Proc. ACL, 2002.

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” Proc. ICLR, 2020.

C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” Proc. ACL Workshop on Text Summarization, 2004.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Waikoloa, Hawaii, USA, pp. 1–4, 2011.

M. Ravanelli, T. Parcollet, and Y. Bengio, “The PyTorch-Kaldi speech recognition toolkit,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, pp. 6465–6469, 2019.

S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4), Cambridge University Engineering Department, Cambridge, UK, 2006.

A. Lee and T. Kawahara, “Recent development of open-source large vocabulary continuous speech recognition engine Julius,” Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Sapporo, Japan, pp. 131–137, 2009.

P. Lamere, P. Kwok, W. Walker, E. Gouvea, P. Wolf, and J. Glass, “CMU Sphinx: Open source speech recognition,” Proceedings of the Human Language Technology Conference (HLT), Edmonton, Canada, pp. 1–4, 2003.

H. Ney, R. Schlüter, T. Niesler, and S. Kanthak, “The RWTH large vocabulary continuous speech recognition system,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, HI, USA, pp. 849–852, 2007.

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” Proceedings of Interspeech, Hyderabad, India, pp. 2207–2211, 2018.

O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, et al., “NeMo: A toolkit for building AI applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.

H. Ahlawat, "Automatic speech recognition: A survey of deep learning techniques," Journal of Speech Technology, vol. 1, no. 1, pp. 1–15, 2025.

N. Sethi, "Survey on automatic speech recognition systems for Indic languages," ResearchGate, 2022.

A. Mishra, "Comparative wavelet, PLP, and LPC speech recognition techniques on the Hindi speech digits database," SPIE Digital Library, 2010.

M. Dua, "Optimizing integrated features for Hindi automatic speech recognition," Journal of Intelligent Systems, vol. 28, no. 5, pp. 123–135, 2019.

R. Aggarwal, "Performance evaluation of sequentially combined features for Hindi ASR," SpringerLink, 2013.

S. Chadha, "Multilingual ASR system for six Indic languages," arXiv, 2022.

V. Bhat and P. Bhattacharyya, "Automatic speech recognition for Indian languages," IIT Bombay, 2023.

H. Malik, "Automatic speech recognition: A survey," INAOE Research Center, 2021.

A. Seth, "Leveraging Wav2Vec 2.0 and XLS-R for enhanced Hindi ASR," ACM Digital Library, 2024.

IndicWhisper and IndicWav2Vec models evaluation, ISCA Archive, 2024.

J. Goodman, “A bit of progress in language modeling,” Computer Speech & Language, vol. 15, no. 4, pp. 403–434, 2001.

M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.

T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur, “Extensions of recurrent neural network language model,” Proc. ICASSP, 2011.

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” Proc. Interspeech, 2010.

A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Proc. NeurIPS, 2017.

S. Upadhyaya, R. Singh, and S. Agrawal, “Hindi automatic speech recognition using hybrid DNN-HMM acoustic model,” International Journal of Speech Technology, vol. 20, no. 4, pp. 867–879, 2017.

N. Mittal and S. Jain, “Performance evaluation of deep neural network–hidden Markov model for Hindi ASR,” Procedia Computer Science, vol. 132, pp. 796–803, 2018.

P. Sharma, N. Gupta, and R. Singh, “HindiSpeech-Net: A CNN-based end-to-end automatic speech recognition model for Hindi language,” International Journal of Speech Technology, vol. 23, no. 2, pp. 421–430, 2020.

M. Dua, S. Singh, N. Aggarwal, and A. Sharma, “Performance analysis of interpolated recurrent neural network language models for continuous Hindi speech recognition,” International Journal of Speech Technology, vol. 22, no. 3, pp. 879–888, 2019.

A. Kumar and R. K. Aggarwal, “RNN-based language modeling and speaker adaptation techniques for Hindi automatic speech recognition,” Journal of Intelligent Systems, vol. 29, no. 1, pp. 150–162, 2020.

A. Graves, “Sequence transduction with recurrent neural networks,” Proceedings of the ICML Workshop on Representation Learning, pp. 1–9, 2012.

A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, 2013.

K. Rao and H. Sak, “Multiple encoder-decoder architectures for end-to-end speech recognition,” Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 130–135, 2017.

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 369–376, 2006.

N. Aggarwal, M. Dua, and S. Singh, “BiLSTM-based acoustic modeling for continuous Hindi speech recognition,” Procedia Computer Science, vol. 152, pp. 362–369, 2019.

S. Choudhary and R. K. Aggarwal, “Deep bidirectional LSTM networks for Hindi speech recognition,” International Journal of Speech Technology, vol. 23, no. 4, pp. 721–732, 2020.

P. Kaur and R. Sharma, “Improving Hindi automatic speech recognition using recurrent neural networks with attention mechanisms,” Neural Computing and Applications, vol. 33, no. 24, pp. 17203–17217, 2021.

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Proc. NeurIPS, 2020.

A. Hannun, C. Case, J. Casper, et al., “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” Proc. ICASSP, 2016.

Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” Proc. ASRU, 2015.

A. Gulati, J. Qin, C. Chiu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” Proc. Interspeech, 2020.

L. Dong, S. Xu, and B. Xu, “Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition,” Proc. ICASSP, 2018.

S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2022. (Whisper)

J. Shi, A. Mohamed, and Y. Liu, “Multitask Conformer: Joint learning of grapheme, phoneme, and language identification for multilingual ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1576–1587, 2022.

A. Kumar, S. Antony, and F. Hussein, “Hybrid CTC/attention architectures for code-switched and low-resource ASR,” Proc. ICASSP, 2022.

W. Hsu, B. Bolte, Y.-H. H. Tsai, et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. ASLP, 2021.

A. Baevski, W.-N. Hsu, Q. Xu, and M. Auli, “data2vec: A general framework for self-supervised learning in speech, vision and language,” Proc. ICML, 2022.
