
A Survey of Big Data Approaches in Natural Language Formation and Computing Science

Aakash Kharb

Maharshi Dayanand University, Rohtak

Pages: 85-93

Vol: 15, Issue: 1, 2025

Receiving Date: 2025-01-29

Acceptance Date: 2025-02-25

Publication Date: 2025-03-10


http://doi.org/10.37648/ijrst.v15i01.007

Abstract

The convergence of big data, language technologies and computing science has reshaped how we model human languages and software systems. Massive text corpora, multimodal datasets and “big code” repositories enable data-driven accounts of language formation and change, while also powering advances in software analytics, code intelligence and large language models (LLMs) for natural and programming languages. This paper surveys and synthesizes applications of big data across two tightly coupled domains: (1) language formation in the sense of acquisition, usage and diachronic change, and (2) computing science, particularly natural language processing (NLP), LLMs and code-centric analytics. We review foundational big-data NLP and corpus-linguistic work, recent LLM surveys, and research on mining software repositories and big code. We then propose a conceptual framework connecting human language corpora and software repositories as parallel manifestations of “languages in use,” both amenable to large-scale statistical modelling. Comparative analysis contrasts traditional small-data, rule-based methodologies with big-data, representation-learning approaches across linguistic and software-engineering tasks. Case studies include social-media NLP, corpus-based language change modelling, AI-assisted programming and software analytics pipelines. Finally, we discuss open challenges, including data bias, interpretability, governance, privacy and the risk of over-fitting linguistic and programming norms to dominant platforms.

Keywords: Big data; language formation; corpus linguistics; large language models; big code; mining software repositories; software analytics; natural language processing


Disclaimer: Indexing of published papers is subject to the evaluation and acceptance criteria of the respective indexing agencies. While we strive to maintain high academic and editorial standards, International Journal of Research in Science and Technology does not guarantee the indexing of any published paper. Acceptance and inclusion in indexing databases are determined by the quality, originality, and relevance of the paper, and are at the sole discretion of the indexing bodies.