Efficient Data Partitioning Algorithms for Distributed Storage Systems

Authors

  • Arjun Patel, Machine Learning Specialist, Healthcare AI, Philips Healthcare, Netherlands

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V2I2P101

Keywords:

Distributed Storage Systems, Data Partitioning, Adaptive Load Balancing, Hash-Based Partitioning, Range-Based Partitioning, Consistent Hashing, Query Performance, Data Movement Optimization, Scalability, Computational Complexity

Abstract

Distributed storage systems are essential for managing large-scale data in modern computing environments. These systems rely on efficient data partitioning algorithms to distribute data evenly, minimize data movement, and optimize query performance. This paper reviews existing data partitioning algorithms, including their strengths and limitations, and proposes a novel algorithm that improves partitioning efficiency and load balancing. The proposed algorithm is evaluated through extensive simulations and real-world experiments, which demonstrate significant improvements in performance metrics such as query latency and data locality. The paper also discusses the implications of these findings for the design and implementation of distributed storage systems.
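The data-movement goal named in the abstract can be illustrated with two of the techniques listed in the keywords. The sketch below is not the algorithm proposed in the paper. It is a minimal, generic comparison of modulo hash-based partitioning with consistent hashing, and every name in it (stable_hash, ConsistentHashRing, the node labels, the 100 virtual nodes per server) is an assumption made for illustration only.

```python
import bisect
import hashlib
from typing import Callable, Sequence


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def modulo_owner(key: str, nodes: Sequence[str]) -> str:
    """Hash-based partitioning: the owner is hash(key) mod the node count."""
    return nodes[stable_hash(key) % len(nodes)]


class ConsistentHashRing:
    """Consistent hashing with virtual nodes (illustrative, not from the paper)."""

    def __init__(self, nodes: Sequence[str], vnodes: int = 100) -> None:
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash,
        # wrapping around to index 0 past the last point.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._points)
        return self._ring[idx][1]


def moved_fraction(
    before: Callable[[str], str], after: Callable[[str], str], keys: Sequence[str]
) -> float:
    """Fraction of keys whose owner differs between two placements."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
old_nodes = ["node-a", "node-b", "node-c", "node-d"]
new_nodes = old_nodes + ["node-e"]  # scale out by one node

modulo_moved = moved_fraction(
    lambda k: modulo_owner(k, old_nodes),
    lambda k: modulo_owner(k, new_nodes),
    keys,
)

old_ring = ConsistentHashRing(old_nodes)
new_ring = ConsistentHashRing(new_nodes)
ring_moved = moved_fraction(old_ring.owner, new_ring.owner, keys)

print(f"modulo hashing moved     {modulo_moved:.1%} of keys")
print(f"consistent hashing moved {ring_moved:.1%} of keys")
```

With modulo placement, a key keeps its owner only when its hash has the same remainder modulo 4 and modulo 5, which holds for 4 of every 20 residues, so about 80% of keys are expected to move when a fifth node joins. On the ring, only keys that fall into the new node's arcs move, about one fifth in expectation, and they all move to the new node.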

Published

2021-04-15

Section

Articles

How to Cite

Patel A. Efficient Data Partitioning Algorithms for Distributed Storage Systems. IJERET [Internet]. 2021 Apr. 15 [cited 2025 Sep. 12];2(2):1-10. Available from: https://ijeret.org/index.php/ijeret/article/view/29