Performance Profiling and Debugging in Distributed Environments
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag, 2010, Google Technical Report (Google, Inc.) - Foundational paper introducing the concepts of distributed tracing, essential for understanding modern observability tools.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, 2017 (O'Reilly Media) - Comprehensive guide to building and operating reliable distributed systems, including chapters on monitoring, debugging, and incident response.
OpenTelemetry Documentation, OpenTelemetry Authors, 2024 - Official guide for implementing and utilizing OpenTelemetry for distributed tracing, metrics, and log collection across various services.
NVIDIA Nsight Systems Documentation, NVIDIA Corporation, 2024 (NVIDIA Corporation) - Provides detailed instructions and best practices for profiling GPU-accelerated applications, crucial for optimizing LLM and retriever inference.