Leveraging In-Memory Computing for Speeding up Apache Spark and Hadoop Distributed Data Processing

Sarbaree Mishra; Vineela Komandla; Srikanth Bandi

doi:10.63282/3050-922X.IJERET-V3I3P108

Authors

Sarbaree Mishra, Program Manager at Molina Healthcare Inc., USA
Vineela Komandla, Vice President Product Manager, JP Morgan, USA
Srikanth Bandi, Software Engineer, JP Morgan Chase, USA

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V3I3P108

Keywords:

Real-time analytics, data caching, distributed systems, cluster computing, data parallelism, computational efficiency, fault tolerance, data pipelines, iterative processing, RDD (Resilient Distributed Datasets), DAG (Directed Acyclic Graph), machine learning integration, big data analytics, performance tuning, scalability, high-speed processing, low-latency systems, memory optimization

Abstract

In-memory computing has become a leading approach to distributed data processing, and it has shaped frameworks such as Apache Spark and Hadoop, which have added features that overcome the limitations of earlier disk-based methods. Traditional disk-based methods, although reliable, suffer from long delays caused by disk I/O bottlenecks, especially as the data to be processed grows larger and more complex. In-memory computing avoids these inefficiencies by using random access memory (RAM) for data storage and processing, which results in much lower latency and faster computation. Apache Spark applies this idea through its Resilient Distributed Dataset (RDD) model, which keeps data in memory so that repeated tasks can reuse it and fewer disk operations are needed. Likewise, Hadoop has evolved to boost performance by adding in-memory features such as YARN’s memory-based caching. This approach is vital for workloads that ingest data continuously and quickly, perform real-time analytics, or run iterative machine learning procedures. Besides shortening execution time, in-memory computing also increases scalability and improves resource utilization through more efficient partitioning, caching, and task execution. It also benefits from advances in hardware, such as faster RAM and solid-state drives, which enable even better performance. Optimized data partitioning, compression, and fast memory management strategies relieve pressure on resources, allowing systems to maintain low latency and high throughput even on larger datasets.
Together, these techniques reduce processing overhead, so organizations become more agile in decision-making because their insights are current and they can respond more quickly.
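
The benefit of keeping a dataset in memory across repeated passes can be illustrated with a small sketch. This is a toy model in plain Python, not Spark's API: the `ToyDataset` class, its `loads` counter, and the `read_from_disk` loader are hypothetical names introduced only for illustration. In Spark itself, the comparable step would be calling `cache()` or `persist()` on an RDD before an iterative job.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, loader: Callable[[], List[float]]) -> None:
        self._loader = loader  # simulates an expensive read from disk
        self._cached: Optional[List[float]] = None
        self._use_cache = False
        self.loads = 0  # how many times the loader actually ran

    def cache(self) -> "ToyDataset":
        """Request that the data stay in memory after it is first loaded."""
        self._use_cache = True
        return self

    def collect(self) -> List[float]:
        """Return the data, reloading it unless a cached copy is available."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._loader()
        if self._use_cache:
            self._cached = data
        return data


def iterative_mean_estimate(ds: ToyDataset, iterations: int) -> float:
    """Toy iterative job: every pass reads the whole dataset again."""
    estimate = 0.0
    for _ in range(iterations):
        data = ds.collect()
        # Move the estimate halfway toward the dataset mean on each pass.
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def read_from_disk() -> List[float]:
    """Assumed placeholder for a slow disk read."""
    return [1.0, 2.0, 3.0, 4.0]


uncached = ToyDataset(read_from_disk)
cached = ToyDataset(read_from_disk).cache()

result_uncached = iterative_mean_estimate(uncached, iterations=10)
result_cached = iterative_mean_estimate(cached, iterations=10)

print(f"uncached loads: {uncached.loads}, cached loads: {cached.loads}")
```

Both runs compute the same estimate, but the uncached dataset invokes the loader once per iteration while the cached one invokes it only on the first pass. The sketch ignores memory limits, eviction, and fault recovery, all of which a real framework has to handle.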
