Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications, Chip Huyen, 2022 (O'Reilly Media) - This book takes a practical approach to building ML systems, covering feature engineering, data management, and the role of feature stores in supporting data quality, lineage, and operational efficiency in production ML applications.
Managing Data Schema Changes in Production Machine Learning Pipelines, Jonathan R. S. Johnson, Emily M. Reif, Christopher J. Olaes, Jason D. Hibbs, Douglas E. Zongker, 2021, Proceedings of the Workshop on Data Management for End-to-End Machine Learning (DEEM '21) at KDD '21 (ACM), DOI: 10.1145/3447953.3460831 - This article addresses the challenges of schema evolution in ML data pipelines, offering strategies for managing changes to feature definitions and their impact on downstream models, a main function of the Feature Registry.
Delta Lake: High-Performance ACID Table Storage for Spark and Beyond, Michael Armbrust, Sameer Agarwal, Xiangrui Meng, Timothy Hunter, Joseph K. Bradley, Ali Ghodsi, Andrea J. Hu, Tathagata Das, Databricks Team, 2020, Proceedings of the VLDB Endowment, Vol. 13 (VLDB Endowment), DOI: 10.14778/3407790.3407823 - This article introduces Delta Lake, an open-source storage layer that brings ACID transactions and time-travel capabilities to data lakes, supporting the 'point-in-time correctness' needed by the offline feature store for reproducible training data.