High-Throughput Data Processing with Spark and Ray
Apache Spark Documentation, The Apache Software Foundation, 2024 - Provides comprehensive guidance on Spark's architecture, DataFrame API, PySpark, and distributed processing capabilities.
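To ground the DataFrame API that the Spark documentation covers, here is a minimal PySpark sketch. It assumes a local session and a hypothetical events.parquet file with status and timestamp columns; the input path and column names are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local SparkSession; in a cluster deployment the master URL and
# resource settings would come from the cluster manager instead.
spark = SparkSession.builder.appName("events-rollup").master("local[*]").getOrCreate()

# "events.parquet" is a hypothetical input file used only for illustration.
events = spark.read.parquet("events.parquet")

# Declarative DataFrame transformations: Spark builds a logical plan and
# optimizes it before any data is actually read.
daily_counts = (
    events
    .filter(F.col("status") == "ok")
    .groupBy(F.to_date("timestamp").alias("day"))
    .agg(F.count("*").alias("n_events"))
    .orderBy("day")
)

daily_counts.show()
spark.stop()
```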
Ray Documentation, Anyscale, 2024 - Offers detailed information on Ray Core, Ray Data, task and actor primitives, and building distributed Python applications.
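As a concrete illustration of the task and actor primitives covered in the Ray documentation, the sketch below runs a stateless remote function and a stateful actor on a local Ray runtime; the square and Counter names are made up for this example.

```python
import ray

ray.init()  # starts a local Ray runtime; on a cluster this would connect instead

# A task: a stateless function executed remotely, returning a future (ObjectRef).
@ray.remote
def square(x):
    return x * x

# An actor: a stateful worker process whose methods run as remote calls.
@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

# Launch tasks in parallel and collect their results.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]

# Drive the actor with the task results.
counter = Counter.remote()
ray.get([counter.add.remote(v) for v in ray.get(futures)])
print(ray.get(counter.add.remote(0)))  # running total

ray.shutdown()
```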
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, 2012, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12), USENIX - Introduces the RDD abstraction, which is fundamental to Spark's design for fault-tolerant and efficient distributed data processing.
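For readers who want to see the RDD abstraction directly, a minimal PySpark sketch follows; it builds an RDD with lazy transformations and triggers execution with an action. A local session is assumed, and the workload itself is arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# An RDD is an immutable, partitioned collection; transformations such as
# filter and map are recorded as lineage rather than executed immediately.
numbers = sc.parallelize(range(1, 1_000_001), numSlices=8)
squares_of_evens = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# Actions trigger execution; if a partition is lost, Spark recomputes it
# from the recorded lineage instead of relying on data replication.
total = squares_of_evens.reduce(lambda a, b: a + b)
print(total)

spark.stop()
```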
Ray: A Distributed Framework for Emerging AI Applications, Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica, 2018, Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), USENIX - Describes Ray's flexible architecture for general-purpose distributed Python computation, designed to support diverse AI workloads.
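The paper's motivating workloads mix stateless tasks with stateful actors. The simplified sketch below imitates that pattern with a hypothetical ParameterServer actor and compute_grad task; the names and the fake gradient are illustrative and not taken from the paper's code.

```python
import numpy as np
import ray

ray.init()

# Hypothetical actor holding shared model weights (name is illustrative).
@ray.remote
class ParameterServer:
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def apply_gradient(self, grad, lr=0.1):
        self.weights -= lr * grad
        return self.weights

    def get_weights(self):
        return self.weights

# Hypothetical stateless task computing a fake "gradient" from current weights.
@ray.remote
def compute_grad(weights):
    return weights + np.random.normal(size=weights.shape)

ps = ParameterServer.remote(dim=4)
for _ in range(3):
    weights = ps.get_weights.remote()  # a future; Ray resolves it when passed to tasks
    grads = [compute_grad.remote(weights) for _ in range(4)]
    for g in ray.get(grads):
        ray.get(ps.apply_gradient.remote(g))

print(ray.get(ps.get_weights.remote()))
ray.shutdown()
```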