Digital Diglossia: A Study of the Possibility of Reproducing Linguistic Hegemony in Hindi LLM Development

डिजिटल डायग्लॉसिया : हिन्दी एलएलएम(LLM) निर्माण में भाषिक हेजेमनी प्रजनन संभावना का अध्ययन

Authors

DOI:

https://doi.org/10.31305/rrjss.2025.v05.n01.009

Keywords:

Diglossia, Linguistic Capital, Artificial Intelligence, Algorithmic Bias, Hindi LLM

Abstract

This study critically examines the interplay between the diglossic structure of Hindi-speaking society and the development of AI-based language models, particularly Large Language Models (LLMs). It investigates the pre-existing linguistic imbalance between ‘high’ and ‘low’ forms of Hindi (standard versus colloquial and local variants), as well as between English and indigenous languages. The research argues that the processes involved in LLM development, such as corpus selection, tokenization, and training, do not merely reflect these disparities but also reinforce them technically. The main research questions are: (1) At which stages does the construction of Hindi LLMs reproduce linguistic hegemony? (2) How does the quality of LLM outputs vary across English, standard Hindi, and colloquial Hindi? (3) How do these disparities contribute to the marginalization of Dalit, feminist, and regional knowledge systems? The theoretical framework is grounded in critical applied linguistics, Bourdieu’s theory of linguistic capital, and recent debates around algorithmic bias. Methodologically, the study employs comparative corpus analysis, evaluation of LLM responses, and expert interviews. Ultimately, this research calls for a reimagination of ‘linguistic justice’ within AI.

Abstract (translated from Hindi): This study critically examines the relationship between the diglossic structure of Hindi society and artificial intelligence (AI) based language models, particularly LLMs (Large Language Models). A linguistic imbalance already exists in Hindi between 'high' and 'low' language forms (pure Hindi versus colloquial folk Hindi), and between English and indigenous languages. The research establishes that the LLM development process, including corpus selection, tokenization, and training, does not merely reflect these inequalities but further reinforces them in technical form. The main research questions are: (1) At which levels does the construction of Hindi LLMs re-establish linguistic dominance? (2) How does the quality of LLM responses differ across English, standard Hindi, and colloquial Hindi? (3) What erosion of Dalit, women's, and regional knowledge traditions results from these contradictions? The theoretical basis of the study is critical applied linguistics, Bourdieu's theory of 'linguistic capital', and recent discourse on algorithmic bias. The method comprises comparative corpus analysis, evaluation of LLM responses, and expert interviews. In conclusion, the research calls for a reimagining of 'linguistic justice' in AI.

Keywords: डाएग्लॉसिया, भाषिक पूंजी, कृत्रिम बुद्धिमत्ता, एल्गोरिद्मिक पूर्वग्रह, हिंदी LLM

Author Biography

  • Vijendra Singh Chauhan, Associate Professor, Department of Hindi, Zakir Husain Delhi College, University of Delhi

    Dr. Vijender Singh Chauhan is an Associate Professor at Zakir Husain Delhi College, University of Delhi, and a prominent voice in Hindi literary studies and interdisciplinary research. With over two decades of teaching and research experience, he has presented more than 100 invited lectures and keynote addresses at national and international platforms including IITs, NITs, and global academic summits. His scholarly interests span feminist literary theory, digital humanities, socio-linguistics, and policy discourse on education and inclusion. Dr. Chauhan has authored numerous peer-reviewed articles, and his work increasingly engages with issues at the intersection of caste, gender, and governance. Known for integrating academic insight with public intellectual engagement, he is also a TEDx speaker, social media influencer, and mentor to thousands of civil service aspirants. His recent works critically explore the informalization of labour, youth aspirations, and the linguistic politics of AI systems.

References

Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258

OpenAI. (2023). GPT-4 Technical Report. OpenAI Documentation

Google DeepMind. (2024). Gemini: The next step in AI language understanding. Google AI Blog

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems

Joshi, P., Kumar, S., & Srivastava, R. (2024). Since lawyers are males: Examining implicit gender bias in Hindi language generation by LLMs. ACL Anthology.

Bhatt, A., Dev, S., & Tripathi, A. (2023). Cultural re-contextualization of fairness research in language technologies in India. Findings of ACL 2023

Bourdieu, P. (1991). Language and symbolic power (J. B. Thompson, Ed.; G. Raymond & M. Adamson, Trans.). Harvard University Press

Kachru, B. B. (1986). The alchemy of English: The spread, functions, and models of non-native Englishes. University of Illinois Press

Andersen, R. (2024). The AI revolution is crushing thousands of languages. The Atlantic

Chandran, R. (2023). India's scaling up of AI could reproduce casteist bias, discrimination against women and minorities. Reuters Foundation/Scroll.in

Pennycook, A. (2001). Critical applied linguistics: A critical introduction. Routledge DOI: https://doi.org/10.4324/9781410600790

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. NYU Press.

Benjamin, R. (2019). Race after technology: Abolitionist tools for the new Jim code. Polity Press.

Malik, A., Gupta, S., & Rani, P. (2022). Socially aware bias measurements for Hindi language representations. Proceedings of NAACL 2022 DOI: https://doi.org/10.18653/v1/2022.naacl-main.76

Khandelwal, S., Sharma, M., & Rani, K. (2024). Indian-BhED: A dataset for measuring India-centric biases in LLMs. EMNLP 2024 DOI: https://doi.org/10.1145/3677525.3678666

Ferguson, C. A. (1959). Diglossia. Word, 15(2), 325–340 DOI: https://doi.org/10.1080/00437956.1959.11659702

Helm, L., Salminen, J., & Jung, S. G. (2024). Diversity and language technology: How language modeling bias causes epistemic injustice. CHI 2024 DOI: https://doi.org/10.1007/s10676-023-09742-6

Goldshtein, A., Pechenizkiy, M., & Markov, I. (2024). The social consequences of language technologies and their underlying language ideologies. HCII 2024 DOI: https://doi.org/10.1007/978-3-031-60875-9_18

Floridi, L., & Chiriatti, M. (2020). GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30(4), 681–694 DOI: https://doi.org/10.1007/s11023-020-09548-1

Published

2025-06-30

How to Cite

Chauhan, V. S. (2025). Digital Diglossia: A Study of the Possibility of Reproducing Linguistic Hegemony in Hindi LLM Development: डिजिटल डायग्लॉसिया : हिन्दी एलएलएम(LLM) निर्माण में भाषिक हेजेमनी प्रजनन संभावना का अध्ययन. Research Review Journal of Social Science, 5(1), 75-83. https://doi.org/10.31305/rrjss.2025.v05.n01.009