Greedy Function Approximation: A Gradient Boosting Machine, Jerome H. Friedman, 2001, The Annals of Statistics, Vol. 29, DOI: 10.1214/aos/1013203451 - Introduces the Gradient Boosting Machine algorithm, detailing how the negative gradient of various loss functions (including deviance for classification) forms the pseudo-residuals for subsequent tree fitting.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Provides a comprehensive treatment of machine learning fundamentals, with a clear explanation of cross-entropy loss (Log Loss) and its connection to maximum likelihood estimation, relevant for classification.