Scalable Data Pipelines for Real-Time Analytics: Innovations in Streaming Data Architectures

Nissi Joy

doi:10.63282/3050-922X.IJERET-V5I1P102

Authors

Nissi Joy, Data Analyst, Conflowence, USA

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V5I1P102

Keywords:

Real-Time Analytics, Streaming Data, Apache Flink, Spark Streaming, Apache Kafka, Data Pipelines, Low Latency, Big Data Processing, State Management, Data Ingestion

Abstract

Real-time analytics has become critical in industries ranging from finance to healthcare, enabling organizations to make data-driven decisions with minimal latency. However, rapid growth in data volume and velocity poses significant challenges for traditional data processing systems. This paper explores recent innovations in streaming data architectures designed to address these challenges. We discuss the evolution of data pipelines, the key components of scalable real-time data processing systems, and the algorithms that enable efficient data streaming. We also present case studies and empirical evaluations to demonstrate the effectiveness of these architectures in real-world scenarios.
