Axiomatic Attribution for Deep Networks, Mukund Sundararajan, Ankur Taly, Qiqi Yan, 2017. Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70. DOI: 10.5555/3305890.3305942 - Introduces Integrated Gradients, an axiomatic attribution method for deep networks that explains feature importance by accumulating gradients along a path from a baseline input to the actual input.
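As a rough illustration of the gradient accumulation this entry describes, the sketch below approximates the Integrated Gradients path integral with a Riemann sum. It assumes a PyTorch `model` that maps a batch of inputs to one scalar score per example; the function name and step count are illustrative choices, not from the paper.

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    # Interpolate along the straight-line path from the baseline to the input.
    alphas = torch.linspace(0.0, 1.0, steps + 1).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    # Gradient of the scalar-per-example model output at each path point.
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    # Riemann-sum approximation of the path integral of gradients,
    # scaled by the input-baseline difference (Sundararajan et al., 2017).
    return (x - baseline) * grads.mean(dim=0)
```

A useful sanity check from the paper's completeness axiom: the attributions should sum approximately to `model(x) - model(baseline)`; if they do not, increase `steps`.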
A Unified Approach to Interpreting Model Predictions, Scott M. Lundberg, Su-In Lee, 2017. Advances in Neural Information Processing Systems (NeurIPS), Vol. 30 (Curran Associates, Inc.). DOI: 10.5555/3295222.3295230 - Presents SHAP, a unified framework for interpreting model predictions based on Shapley values, providing theoretically sound feature attributions.
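For intuition about the Shapley-value foundation behind SHAP, here is a brute-force computation of exact Shapley values for a small set function. The hypothetical `value_fn` stands in for the value function SHAP defines over feature subsets (the model's expected output given those features); the exponential enumeration is only feasible for a handful of features, and making this tractable is precisely SHAP's contribution.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n):
    # Exact Shapley values: phi_i is the weighted average, over all subsets S
    # not containing i, of the marginal contribution value_fn(S + {i}) - value_fn(S),
    # with weight |S|! * (n - |S| - 1)! / n!.
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi += weight * (value_fn(frozenset(S) | {i}) - value_fn(frozenset(S)))
        phis.append(phi)
    return phis
```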
"Why Should I Trust You?": Explaining the Predictions of Any Classifier, Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, 2016Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Association for Computing Machinery)DOI: 10.1145/2939672.2939778 - Introduces LIME, a model-agnostic local interpretability technique that explains predictions of any classifier by fitting simple, interpretable models to perturbed data.
Transformer Interpretability Beyond Attention Visualization, Hila Chefer, Shir Gur, Lior Wolf, 2021. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR46437.2021.00062 - Introduces a relevance-propagation method for Transformer models that combines attention maps with gradient information across layers, providing more faithful explanations than raw attention weights alone.
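To make the "beyond attention weights" idea concrete, the following simplified rollout combines each layer's attention map with its gradient, in the spirit of Chefer et al.; the actual method additionally propagates LRP-style relevance through all Transformer components, so treat this as a sketch, with `attn_maps` and `attn_grads` as assumed per-layer inputs.

```python
import torch

def gradient_weighted_rollout(attn_maps, attn_grads):
    # attn_maps, attn_grads: per-layer tensors of shape (heads, tokens, tokens):
    # the attention probabilities and the gradient of the target score w.r.t. them.
    tokens = attn_maps[0].shape[-1]
    R = torch.eye(tokens)  # start from identity: each token explains itself
    for A, G in zip(attn_maps, attn_grads):
        # Head-averaged, gradient-weighted attention, positive part only.
        cam = (G * A).clamp(min=0).mean(dim=0)
        # Accumulate across layers, keeping the residual (skip-connection) term.
        R = R + cam @ R
    return R  # the row for the classification token gives per-token relevance
```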