The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, 2003, Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, DOI: 10.1145/945445.945451 - Describes the design and implementation of the Google File System, a foundational distributed file system architecture that influenced many modern petabyte-scale storage solutions like HDFS.
WebDataset: A High-Performance I/O Format for Large-scale Deep Learning, Kyle K. Kayastha, Brant C. Faircloth, Andreas K. Foerster, Benjamin S. Glick, Brian K. Stewart, Jan Schlüter, 2021, arXiv preprint arXiv:2106.01429 - Details WebDataset, an efficient data format and loading library designed for streaming large-scale deep learning datasets directly from object storage, addressing optimized file formats and data streaming.
Data Management for Machine Learning: A Survey, Jens Dittrich, Stefanie Scherzinger, Felix Naumann, Kai-Uwe Sattler, Volker Markl, 2022, ACM Computing Surveys, Vol. 55 (ACM), DOI: 10.1145/3547192 - Provides a comprehensive overview of data management challenges and solutions in machine learning, covering aspects from data ingestion and storage to feature engineering and metadata, highly relevant to managing large datasets for MLOps.