TensorFlow PluggableDevice, TensorFlow Authors, 2024 - Official documentation describing TensorFlow's interface for integrating custom hardware devices and alternative computation backends, a key mechanism for deeper runtime interoperability.
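A minimal sketch of how an installed PluggableDevice surfaces on the Python side, assuming a plug-in package has already been installed; the "APU" device type is a placeholder for whatever name the plug-in actually registers:

```python
import tensorflow as tf

# With a PluggableDevice plug-in installed, its devices appear alongside
# CPU/GPU without rebuilding TensorFlow.
print(tf.config.list_physical_devices())

# "APU" is a hypothetical device type used only for illustration; ops placed
# on it are dispatched to the plug-in's kernel implementations.
with tf.device("/APU:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)
print(b)
```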
Extending PyTorch with C++ and CUDA, PyTorch Authors, 2018 (PyTorch Foundation) - PyTorch's official guide on creating custom C++ and CUDA operations, detailing how to extend the framework with specialized, high-performance kernels or subgraphs handled by external runtimes.
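As a rough sketch of the extension workflow this guide covers, the snippet below JIT-compiles a tiny C++ operator with torch.utils.cpp_extension.load_inline; it assumes a local C++ toolchain, and the scaled_add function is invented here purely for illustration:

```python
import torch
from torch.utils.cpp_extension import load_inline

# A toy C++ operator; load_inline JIT-compiles it and auto-generates Python
# bindings for the functions listed below.
cpp_source = """
#include <torch/extension.h>

torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double alpha) {
  return a + alpha * b;
}
"""

ext = load_inline(
    name="scaled_add_ext",     # name of the generated extension module
    cpp_sources=cpp_source,
    functions=["scaled_add"],  # functions to expose to Python
)

x, y = torch.randn(4), torch.randn(4)
print(ext.scaled_add(x, y, 0.5))
```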
DLPack: A Standard for Tensor Exchange, DMLC Community, 2024 - The official GitHub repository and specification for DLPack, a cross-framework standard for zero-copy tensor exchange, which is fundamental to moving data efficiently between ML frameworks and specialized runtimes.
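A small sketch of the zero-copy exchange DLPack enables, assuming PyTorch ≥ 1.10 and NumPy ≥ 1.22 (both implement the protocol for CPU tensors):

```python
import numpy as np
import torch

# A tensor created by one framework...
t = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# ...consumed by another via the DLPack protocol: np.from_dlpack calls
# t.__dlpack__() and wraps the exporter's buffer instead of copying it.
a = np.from_dlpack(t)

# The two views share storage, so a write through one is visible in the other.
t[0, 0] = 42.0
print(a[0, 0])  # 42.0
```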
TVM: An End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (USENIX Association), DOI: 10.5555/3295222.3295267 - A foundational paper on TVM, an open-source deep learning compiler and runtime that often integrates with high-level frameworks as a specialized backend, illustrating the principles of JIT compiler integration and optimized execution.
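To illustrate the framework-to-specialized-backend flow the paper describes, here is a hedged sketch using TVM's Relay front end; the calls follow TVM's 0.x Python API, and newer releases organized around Relax differ:

```python
import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Trace a small PyTorch model so the Relay frontend can import it.
model = torch.nn.Linear(8, 4).eval()
example = torch.randn(1, 8)
scripted = torch.jit.trace(model, example)

# Import the traced graph into Relay and compile it for the local CPU.
mod, params = relay.frontend.from_pytorch(scripted, [("input0", (1, 8))])
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

# Execute the compiled module through TVM's graph executor runtime.
dev = tvm.cpu()
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("input0", tvm.nd.array(example.numpy()))
rt.run()
print(rt.get_output(0).numpy())
```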