Alternatives: Direct Preference Optimization (DPO)
Deep Reinforcement Learning from Human Preferences, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, 2017. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). DOI: 10.48550/arXiv.1706.03741 - A foundational paper that proposes learning reward functions from human preferences to train reinforcement learning agents.
DPO (Direct Preference Optimization), Hugging Face, 2024 - Provides practical information and code examples for implementing DPO using the Hugging Face TRL library.
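To make the alternative concrete, here is a minimal sketch of the DPO objective computed on per-sequence log-probabilities. This is an illustrative scalar version, not the TRL implementation: the function name, arguments, and the default `beta=0.1` are assumptions for the example; in practice the log-probabilities come from the policy and frozen reference models over batches of chosen/rejected completions.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid of the scaled reward margin.

    Each log-ratio (policy vs. frozen reference) acts as an implicit
    reward; the loss pushes the chosen completion's reward above the
    rejected completion's reward without training a separate reward model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin (numerically fine for small margins)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference on both completions the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen completion, the margin grows and the loss falls toward zero.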