A Hidden Markov Model Framework for POS Tagging of English–Punjabi Code-Mixed Social Media Text
Part-of-Speech (POS) tagging of code-mixed text is difficult because of frequent language switching, non-standard orthography, inconsistent transliteration, and the informal syntax of user-generated content. This study presents a Hidden Markov Model (HMM)-based tagger for English–Punjabi code-mixed text in which Punjabi is written in Romanized script. A code-mixed corpus was compiled from platforms such as YouTube, Facebook, and WhatsApp and annotated at the token level. The resulting dataset comprises 900 sentences totalling 10,117 words and covers diverse mixing patterns as well as typical social media artefacts such as abbreviations, emojis, and irregular punctuation. The framework treats POS tagging as a sequence labelling problem: emission and transition probabilities are estimated from the annotated corpus, and the Viterbi algorithm decodes the most probable tag sequence for each sentence. Experiments yield an overall tagging accuracy of 71.52%, which establishes a probabilistic baseline for English–Punjabi code-mixed POS tagging. Future work can build on this baseline by integrating richer feature sets and neural architectures.
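The estimation and decoding steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the first-order (bigram) model, the add-one smoothing, the lowercasing of tokens, the tag names, and the three-sentence toy corpus are all assumptions made for illustration, since the abstract does not specify the tagset, model order, or smoothing scheme.

```python
import math
from collections import defaultdict

# Invented toy data (NOT from the paper's corpus); tag names are assumed.
TOY_CORPUS = [
    [("movie", "NOUN"), ("bahut", "ADV"), ("vadiya", "ADJ"), ("hai", "VERB")],
    [("song", "NOUN"), ("vadiya", "ADJ"), ("hai", "VERB")],
    [("i", "PRON"), ("love", "VERB"), ("this", "DET"), ("song", "NOUN")],
]

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(lambda: defaultdict(int))
    emissions = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            w = word.lower()
            transitions[prev][tag] += 1
            emissions[tag][w] += 1
            vocab.add(w)
            prev = tag
        transitions[prev]["</s>"] += 1  # sentence-end marker
    return transitions, emissions, sorted(emissions), vocab

def log_prob(counts, context, event, num_outcomes):
    """Add-one smoothed log P(event | context) (smoothing is an assumption)."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + num_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence under the bigram HMM."""
    transitions, emissions, tags, vocab = model
    if not words:
        return []
    n_tags = len(tags) + 1    # all tags plus the end marker
    n_words = len(vocab) + 1  # vocabulary plus one slot for unknown words
    words = [w.lower() for w in words]
    # Each column maps tag -> (best log score ending in that tag, previous tag).
    best = [{
        t: (log_prob(transitions, "<s>", t, n_tags)
            + log_prob(emissions, t, words[0], n_words), None)
        for t in tags
    }]
    for w in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(transitions, p, t, n_tags), p)
                for p in tags
            )
            column[t] = (score + log_prob(emissions, t, w, n_words), prev)
        best.append(column)
    # Pick the final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0]
               + log_prob(transitions, t, "</s>", n_tags))
    path = [last]
    for column in reversed(best[1:]):  # follow back-pointers
        path.append(column[path[-1]][1])
    return path[::-1]

model = train_hmm(TOY_CORPUS)
print(viterbi(["song", "bahut", "vadiya", "hai"], model))
```

Working in log space avoids numerical underflow on longer sentences, and reserving one smoothed slot for unseen words lets the decoder handle spelling variants that never occurred in training, which is a common situation with Romanized Punjabi.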

This work is licensed under a Creative Commons Attribution 4.0 International License.