Digital Diglossia: A Study of the Possibility of Reproducing Linguistic Hegemony in Hindi LLM Development
डिजिटल डायग्लॉसिया: हिन्दी एलएलएम (LLM) निर्माण में भाषिक हेजेमनी प्रजनन संभावना का अध्ययन
DOI: https://doi.org/10.31305/rrjss.2025.v05.n01.009
Keywords: Diglossia, Linguistic Capital, Artificial Intelligence, Algorithmic Bias, Hindi LLM
Abstract
This study critically examines the interplay between the diglossic structure of Hindi-speaking society and the development of AI-based language models, particularly Large Language Models (LLMs). It investigates the pre-existing linguistic imbalance between ‘high’ and ‘low’ forms of Hindi (standard vs. colloquial/local variants), as well as between English and indigenous languages. The research argues that the processes involved in LLM development, such as corpus selection, tokenization, and training, do not merely reflect these disparities but also reinforce them technically. The main research questions are: (1) At which stages does the construction of Hindi LLMs reproduce linguistic hegemony? (2) How does the quality of LLM outputs vary across English, standard Hindi, and colloquial Hindi? (3) How do these disparities contribute to the marginalization of Dalit, feminist, and regional knowledge systems? The theoretical framework is grounded in critical applied linguistics, Bourdieu’s theory of linguistic capital, and recent debates around algorithmic bias. Methodologically, the study employs comparative corpus analysis, evaluation of LLM responses, and expert interviews. Ultimately, this research calls for a reimagining of ‘linguistic justice’ within AI.
Abstract in Hindi Language: यह अध्ययन हिन्दी समाज की द्वैत्तभाषिक (डाएग्लॉसिक) संरचना और कृत्रिम बुद्धिमत्ता (AI) आधारित भाषा मॉडल, विशेषकर LLM (Large Language Models), के आपसी संबंधों की आलोचनात्मक पड़ताल करता है। हिन्दी में 'उच्च' बनाम 'निम्न' भाषा रूप (शुद्ध हिन्दी बनाम बोलचाल की लोक हिन्दी), और अंग्रेज़ी बनाम देसी भाषाओं के मध्य जो भाषिक असंतुलन पहले से मौजूद है, यह शोध यह स्थापित करता है कि LLM विकास की प्रक्रिया, जैसे कोर्पस चयन, टोकनाइज़ेशन व प्रशिक्षण, इन असमानताओं को मात्र प्रतिबिंबित नहीं करती, बल्कि तकनीकी रूप में और अधिक सुदृढ़ करती है। मुख्य अनुसंधान प्रश्नों में यह विश्लेषण शामिल है कि: (1) हिन्दी LLM निर्माण किन स्तरों पर भाषिक वर्चस्व को पुनःस्थापित करता है? (2) अंग्रेज़ी, मानक हिन्दी और लोक हिन्दी में LLM प्रतिक्रियाओं की गुणवत्ता में क्या भिन्नता है? (3) इन अंतरविरोधों के कारण दलित-स्त्री-क्षेत्रीय ज्ञान परंपराओं का क्या ह्रास हो रहा है? अध्ययन का सैद्धांतिक आधार आलोचनात्मक अनुप्रयुक्त भाषाविज्ञान, बोरदियू का 'भाषाई पूंजी' सिद्धांत और एल्गोरिद्मिक पूर्वग्रह संबंधी नवीन विमर्श है। पद्धति में तुलनात्मक कोर्पस विश्लेषण, LLM प्रतिक्रियाओं का मूल्यांकन, तथा विशेषज्ञ साक्षात्कार सम्मिलित हैं। निष्कर्षतः, यह शोध AI में 'भाषिक न्याय' की पुनःकल्पना का आह्वान करता है।
Keywords: डाएग्लॉसिया, भाषिक पूंजी, कृत्रिम बुद्धिमत्ता, एल्गोरिद्मिक पूर्वग्रह, हिंदी LLM
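The abstract's claim that tokenization can technically reinforce existing disparities can be illustrated at the lowest level of text encoding. The sketch below is not the study's method; it assumes UTF-8 byte length as a crude proxy for the starting sequence length a byte-level tokenizer would see, and the function name `bytes_per_char` and the two sample phrases are hypothetical choices made only for illustration. How far this gap carries over to the subword tokenizer of any particular LLM is not reported in the abstract.

```python
# Illustrative sketch only: compares UTF-8 byte cost per character for
# English (ASCII) and Hindi (Devanagari) text. Byte length is an assumed,
# crude proxy for byte-level tokenizer input length, not the paper's metric.

def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per code point, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        raise ValueError("text must contain at least one non-space character")
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)

english = "linguistic justice"  # ASCII letters: 1 byte each in UTF-8
hindi = "भाषिक न्याय"            # Devanagari code points: 3 bytes each in UTF-8

print(bytes_per_char(english))  # 1.0
print(bytes_per_char(hindi))    # 3.0
```

Under this assumption, the same phrase starts out three times longer in bytes when written in Devanagari, before any learned vocabulary is applied.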
References
Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
OpenAI. (2023). GPT-4 Technical Report. OpenAI Documentation.
Google DeepMind. (2024). Gemini: The next step in AI language understanding. Google AI Blog.
Floridi, L., & Chiriatti, M. (2020). GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30(4), 681–694. DOI: https://doi.org/10.1007/s11023-020-09548-1
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
Malik, A., Gupta, S., & Rani, P. (2022). Socially aware bias measurements for Hindi language representations. Proceedings of NAACL 2022. DOI: https://doi.org/10.18653/v1/2022.naacl-main.76
Joshi, P., Kumar, S., & Srivastava, R. (2024). Since lawyers are males: Examining implicit gender bias in Hindi language generation by LLMs. ACL Anthology.
Khandelwal, S., Sharma, M., & Rani, K. (2024). Indian-BhED: A dataset for measuring India-centric biases in LLMs. EMNLP 2024. DOI: https://doi.org/10.1145/3677525.3678666
Helm, L., Salminen, J., & Jung, S. G. (2024). Diversity and language technology: How language modeling bias causes epistemic injustice. CHI 2024. DOI: https://doi.org/10.1007/s10676-023-09742-6
Bhatt, A., Dev, S., & Tripathi, A. (2023). Cultural re-contextualization of fairness research in language technologies in India. Findings of ACL 2023.
Goldshtein, A., Pechenizkiy, M., & Markov, I. (2024). The social consequences of language technologies and their underlying language ideologies. HCII 2024. DOI: https://doi.org/10.1007/978-3-031-60875-9_18
Ferguson, C. A. (1959). Diglossia. Word, 15(2), 325–340. DOI: https://doi.org/10.1080/00437956.1959.11659702
Bourdieu, P. (1991). Language and symbolic power (J. B. Thompson, Ed.; G. Raymond & M. Adamson, Trans.). Harvard University Press.
Kachru, B. B. (1986). The alchemy of English: The spread, functions, and models of non-native Englishes. University of Illinois Press.
Andersen, R. (2024). The AI revolution is crushing thousands of languages. The Atlantic.
Chandran, R. (2023). India's scaling up of AI could reproduce casteist bias, discrimination against women and minorities. Reuters Foundation/Scroll.in.
Pennycook, A. (2001). Critical applied linguistics: A critical introduction. Routledge. DOI: https://doi.org/10.4324/9781410600790
Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. NYU Press.
Benjamin, R. (2019). Race after technology: Abolitionist tools for the new Jim code. Polity Press.