Retrieval-Augmented Generation for Large Language Models: A Survey, Yuan-Fang Li, Genggeng Hao, Chunyang Li, Jingyang Ding, Yutong Zhou, Yanmin An, Gang Chen, Jianxin Li, Jun Liu, Xiang Li, Huaijun Li, Yu Han, Haoran Chen, Weizhao Li, Guodong Long, Ruoyu Chen, Cheng Chen, Jie Xu, Chunjing Gan, Quan Z. Sheng, Lei Pan, Kun Xu, Chen Wang, Wei Luo, Shirui Pan, Lei Wang, Xiaohui Tao, Minjuan Zhu, Jie Hu, Faliang Huang, Yonghong Kang, Yi Hu, Jingjing Xu, Tongtong Li, Yuxin Li, Zaiyu Li, Jiawen Lin, Wei Chen, Xifeng Yan, Xiangliang Zhang, Hongzhi Yin, Kai Chen, Bo Li, Guanghua Wang, Quan Li, Zhicheng Dou, Yanyan Shen, Yiming Li, Feifei Li, Chuan Zhou, Pengfei Wang, Peng Zhang, Jinyang Li, Xiangyu Fan, Ruimao Zhang, Dong Guo, Wei Xu, Linzhang Wang, Zhenyu Wang, Yi Wu, Jiajin Li, Qiang Wei, Yang Yang, Xindong Wu, Jianshe Zhou, Zhaoyu Wang, Hao Wang, Xinzhi Gao, Yanchun Zhang, 2024, ACM Computing Surveys, Vol. 56 (ACM), DOI: 10.1145/3639089 - A comprehensive survey of Retrieval-Augmented Generation (RAG) in LLMs, covering architectures, challenges, and optimization strategies; relevant for understanding the performance considerations of RAG applications.
Best practices for API usage, OpenAI, 2023 (OpenAI) - Official OpenAI guide offering strategies to reduce latency and cost when calling their APIs, including techniques such as caching and batching that address common latency and cost bottlenecks.
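The caching technique referenced in the guide above can be sketched as a thin memoization layer in front of the model call. This is a minimal illustration, not OpenAI's implementation: `call_llm_uncached` is a hypothetical stand-in for a real network request, and the cache here is a simple in-process `lru_cache`, so identical prompts are answered from memory instead of incurring repeated API latency and cost.

```python
from functools import lru_cache

def call_llm_uncached(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API request.

    In a real application this function would perform the (slow,
    billable) network call to the model provider.
    """
    return f"response to: {prompt}"

@lru_cache(maxsize=1024)
def call_llm(prompt: str) -> str:
    """Memoized wrapper: repeated identical prompts hit the cache."""
    return call_llm_uncached(prompt)

def batch_prompts(prompts: list[str], batch_size: int) -> list[list[str]]:
    """Group prompts into fixed-size batches so several requests can be
    sent in one round trip, amortizing per-request overhead."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
```

In practice the cache key would need to include model name and sampling parameters, and a shared store (e.g. Redis) would replace the per-process cache; this sketch only shows the shape of the idea.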