Aakash Kharb
Maharshi Dayanand University, Rohtak
http://doi.org/10.37648/ijrst.v15i01.007
The convergence of big data, language technologies and computing science has reshaped how we model human languages and software systems. Massive text corpora, multimodal datasets and “big code” repositories enable data-driven accounts of language formation and change, while also powering advances in software analytics, code intelligence and large language models (LLMs) for natural and programming languages. This paper surveys and synthesizes applications of big data across two tightly coupled domains: (1) language formation in the sense of acquisition, usage and diachronic change, and (2) computing science, particularly natural language processing (NLP), LLMs and code-centric analytics. We review foundational big-data NLP and corpus-linguistic work, recent LLM surveys, and research on mining software repositories and big code. We then propose a conceptual framework connecting human language corpora and software repositories as parallel manifestations of “languages in use,” both amenable to large-scale statistical modelling. Comparative analysis contrasts traditional small-data, rule-based methodologies with big-data, representation-learning approaches across linguistic and software-engineering tasks. Case studies include social-media NLP, corpus-based language change modelling, AI-assisted programming and software analytics pipelines. Finally, we discuss open challenges, including data bias, interpretability, governance, privacy and the risk of over-fitting linguistic and programming norms to dominant platforms.
Keywords: Big data; language formation; corpus linguistics; large language models; big code; mining software repositories; software analytics; natural language processing
Disclaimer: Indexing of published papers is subject to the evaluation and acceptance criteria of the respective indexing agencies. While we strive to maintain high academic and editorial standards, International Journal of Research in Science and Technology does not guarantee the indexing of any published paper. Acceptance and inclusion in indexing databases are determined by the quality, originality, and relevance of the paper, and are at the sole discretion of the indexing bodies.