Apache Parquet, The Apache Software Foundation, 2024 - Explains the columnar storage format, its benefits, and how it enables efficient reads for analytical queries through features like column projection and vectorized reads.
Spark SQL Performance Tuning, The Apache Software Foundation, 2024 - Official guide for optimizing Spark SQL queries, including generating and interpreting execution plans, using partitioning for improved performance, and managing shuffle operations.
Building a Data Lakehouse with Delta Lake, Vini J. Varghese, Andy Feng, Deniz Ilkbasaran, 2023 (O'Reilly Media) - Discusses modern data lake architectures and optimization strategies for large-scale data, including effective partitioning and file management techniques that enhance query performance.