Benchmarking Apache Spark vs. Hadoop: Evaluating In-Memory and Disk-Based Processing Models for Big Data Analytics
Ahmed Elgalb
Independent Researcher, Iowa, United States.
George Samaan
Independent Researcher, Iowa, United States.
Pages: 43-52
Vol: 12, Issue: 4, 2022
Received Date:
2022-09-12
Acceptance Date:
2022-11-13
Publication Date:
2022-12-13
http://doi.org/10.37648/ijrst.v12i04.008
Abstract
Apache Spark and Hadoop MapReduce are two of the most widely used data processing paradigms for large-scale computing, and each has its own execution model and philosophy. Spark’s in-memory model promises faster execution for iterative, interactive, and streaming workloads, while Hadoop MapReduce’s disk-based approach remains a staple for massive one-pass batch jobs. This paper presents an in-depth discussion of both frameworks based on studies and benchmarks published prior to 2022. By examining their architectures, performance, fault tolerance, and compatibility with the broader analytics stack, it shows where each framework excels and where the two can work together. It explains the cost of large-scale in-memory caching, why iterative machine learning algorithms benefit most from Spark’s DAG architecture, and why some tasks still fare better with Hadoop’s stable batch structure. Through numerous tables and examples, the paper illustrates a subtle point: Spark and Hadoop are not competing monoliths but complementary tools for distinct workload profiles, each relevant to a range of real-world data engineering and analytics situations.
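To make the iterative-caching argument concrete, the sketch below shows the Spark (Scala) pattern the abstract attributes the speedups to: a dataset is parsed once, pinned in executor memory with cache(), and then rescanned from memory on every pass of a toy iterative update, whereas a MapReduce equivalent would re-read the input from disk on each pass. The input path and the simple mean-seeking update are hypothetical stand-ins for the k-means and gradient-descent style workloads the paper discusses; this is an illustrative sketch under those assumptions, not code from the paper.

import org.apache.spark.sql.SparkSession

object IterativeCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iterative-caching-sketch")
      .getOrCreate()

    // Hypothetical input: one numeric value per line on HDFS (assumed path).
    val points = spark.sparkContext
      .textFile("hdfs:///data/points.txt")
      .map(_.trim.toDouble)
      .cache() // keep the parsed RDD in executor memory across iterations

    // Toy iterative refinement (stand-in for k-means / gradient descent):
    // every pass reuses the cached RDD instead of rescanning the files on disk.
    var center = 0.0
    for (_ <- 1 to 10) {
      val meanDelta = points.map(p => p - center).mean()
      center += 0.5 * meanDelta
    }

    println(s"Estimated center after 10 in-memory passes: $center")
    spark.stop()
  }
}

Expressed in Hadoop MapReduce, the same loop would become ten chained jobs, each writing its intermediate result to HDFS and re-reading the full input on the next pass; that repeated disk I/O is the overhead the in-memory comparison in the paper centers on.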
Keywords:
Apache Spark; Hadoop MapReduce; data processing paradigm
References
- J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012, pp. 15–28.
- M. Zaharia, R. S. Xin, P. Wendell, T. Das, et al., “Apache Spark: A unified engine for big data processing,” Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
- V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, et al., “Apache Hadoop YARN: Yet another resource negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC), ACM, 2013, p. 5.
- P. Faris, N. Yusof, and M. Othman, “Performance analysis of big data in Hadoop,” in Proceedings of the 2018 Third International Conference on Informatics and Computing (ICIC), IEEE, 2018, pp. 1–5.
- A. K. Tiwari, “Comparative study of big data computing and storage tools: A review,” in 2016 IEEE International Conference on Emerging Technologies and Innovative Business Practices for the Transformation of Societies (EmergiTech), IEEE, 2016, pp. 106–110.
- M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” Journal of Information Security and Applications, vol. 50, p. 102419, 2020.
- V. Behzadan and W. Hsu, “Adversarial exploitation of policy learning in autonomous systems,” in Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2017, pp. 281–288.
- S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, “Spinning fast iterative data flows,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1268–1279, 2012.
- L. A. Maglaras, J. Jiang, M. A. Ferrag, Z. Xia, and H. Janicke, “Cybersecurity solutions for the Internet of Things: A survey,” Transactions on Internet and Information Systems, vol. 13, no. 1, pp. 1–20, 2019.