This hands-on exercise walks you through fine-tuning a smaller LLM specifically for RAG-related tasks. Fine-tuning a smaller model offers a good balance between performance, cost-effectiveness, and speed, making it an attractive option for production environments. By adapting a model to the specific generation needs of your RAG pipeline, you can often obtain better results than with a larger, general-purpose model, especially when it comes to maintaining factual consistency with the retrieved context.
Our goal is to take a pre-trained smaller LLM and adapt it to generate answers from a provided context and a user query, a common task in RAG systems. We will focus on parameter-efficient fine-tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) to keep the process computationally feasible and efficient.
Before you begin, make sure you have the following:
1. Required libraries: transformers (for accessing pre-trained models and tokenizers), datasets (for processing and preparing data), peft (for applying LoRA or other PEFT methods), accelerate (to simplify running PyTorch training on various hardware), torch (the PyTorch library), and evaluate with rouge_score (for evaluation metrics).
You can install these libraries with pip:
pip install transformers datasets peft accelerate torch evaluate rouge_score
2. Fine-tuning dataset: a set of (context, question, answer) examples, in the JSON format shown just after this list.
3. Base model: we will use a relatively small but capable model from Hugging Face, t5-small. It is well suited for conditional generation tasks.
[
{
"context": "Kuala Lumpur is the capital city of Malaysia, known for its iconic Petronas Twin Towers. It is also a major economic and cultural hub in Southeast Asia.",
"question": "What is the capital of Malaysia?",
"answer": "Kuala Lumpur is the capital city of Malaysia."
},
{
"context": "The Petronas Twin Towers were once the tallest buildings and remain an architectural marvel in Kuala Lumpur, Malaysia. They were completed in 1998.",
"question": "When were the Petronas Twin Towers completed?",
"answer": "The Petronas Twin Towers were completed in 1998."
},
// ... more examples
]
We have chosen t5-small as the base model. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, which makes it flexible across a range of generation tasks. The task here is: given a context and a question, generate an answer that is factually grounded in the context. We need to format the inputs appropriately for the T5 model, typically by prepending a task-specific instruction to the input string, such as "answer the question based on the context:".
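As a quick illustration (a sketch, separate from the training script; the actual formatting is done later in preprocess_function), here is how one example from the dataset above is rendered into T5's text-to-text format:

```python
# Hypothetical illustration of the input/target format used for fine-tuning.
prefix = "answer the question based on the context: "
example = {
    "context": "Kuala Lumpur is the capital city of Malaysia, known for its iconic Petronas Twin Towers.",
    "question": "What is the capital of Malaysia?",
    "answer": "Kuala Lumpur is the capital city of Malaysia.",
}

model_input = prefix + "question: " + example["question"] + " context: " + example["context"]
target = example["answer"]

print(model_input)
# answer the question based on the context: question: What is the capital of Malaysia? context: Kuala Lumpur is ...
print(target)
# Kuala Lumpur is the capital city of Malaysia.
```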
Dataset preparation is an important step: the quality of your fine-tuning data directly determines the performance of the specialized LLM.
Load your dataset: if you have a JSON file like the one described above, you can load it with the datasets library.
from datasets import load_dataset
# Assuming your data is in 'rag_finetuning_data.jsonl'
# For demonstration, let's create a dummy dataset
data_files = {"train": "path_to_your_train_data.jsonl", "validation": "path_to_your_validation_data.jsonl"}
# raw_datasets = load_dataset("json", data_files=data_files)
# For this example, let's use a small dummy dataset
# In a real scenario, you'd load your actual dataset
from datasets import Dataset
dummy_data = {
"train": Dataset.from_list([
{"id": "1", "context": "Kuala Lumpur is the capital city of Malaysia, known for its iconic Petronas Twin Towers.", "question": "What is the capital of Malaysia?", "answer": "Kuala Lumpur is the capital city of Malaysia."},
{"id": "2", "context": "The Petronas Twin Towers were once the tallest buildings and remain an architectural marvel in Kuala Lumpur, Malaysia. They were completed in 1998.", "question": "When were the Petronas Twin Towers completed?", "answer": "The Petronas Twin Towers were completed in 1998."}
]),
"validation": Dataset.from_list([
{"id": "3", "context": "Langkawi is an archipelago of 99 islands in the Andaman Sea, off the west coast of Malaysia.", "question": "Where is Langkawi located?", "answer": "Langkawi is located off the west coast of Malaysia."}
])
}
raw_datasets = dummy_data
Tokenization and formatting: we need to tokenize both the inputs (context and question) and the targets (answers). For T5, the usual approach is to concatenate the context and the question, typically with a task prefix.
from transformers import AutoTokenizer
model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
max_input_length = 512 # Adjust based on your context lengths
max_target_length = 64 # Adjust based on your answer lengths
prefix = "answer the question based on the context: "
def preprocess_function(examples):
    inputs = [prefix + "question: " + q + " context: " + c for q, c in zip(examples["question"], examples["context"])]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")
    # Tokenize targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["answer"], max_length=max_target_length, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    # Ensure -100 is used for padding tokens in labels so they are ignored in the loss function
    for i, label_ids in enumerate(model_inputs["labels"]):
        model_inputs["labels"][i] = [lid if lid != tokenizer.pad_token_id else -100 for lid in label_ids]
    return model_inputs
tokenized_datasets = {
"train": raw_datasets["train"].map(preprocess_function, batched=True, remove_columns=raw_datasets["train"].column_names),
"validation": raw_datasets["validation"].map(preprocess_function, batched=True, remove_columns=raw_datasets["validation"].column_names)
}
The prefix helps the model understand the task. Concatenating "question: " and "context: " explicitly marks these parts for the model. When tokenizing the labels, tokenizer.as_target_tokenizer() is used because this is a sequence-to-sequence model. Padded label tokens are set to -100 so that the loss function ignores them.
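As an optional sanity check (a sketch using the tokenized_datasets dict created above), you can decode one preprocessed example to confirm that the input string and labels look as intended:

```python
# Decode one training example to verify preprocessing. Label ids of -100 only
# mark padding for the loss, so they are filtered out before decoding.
sample = tokenized_datasets["train"][0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True))

label_ids = [lid for lid in sample["labels"] if lid != -100]
print(tokenizer.decode(label_ids, skip_special_tokens=True))
```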
Fully fine-tuning even a "small" LLM can be resource-intensive. Parameter-efficient fine-tuning (PEFT) methods such as LoRA let us train only a small fraction of the model's parameters, dramatically reducing compute and storage requirements.
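To see why, here is a back-of-the-envelope sketch (it assumes t5-small's hidden size of 512 and the rank r=16 used in the configuration below):

```python
# Rough illustration of why LoRA is cheap (assumes d_model = 512 for t5-small).
d_model = 512   # hidden size of t5-small
r = 16          # LoRA rank used in the configuration below

full_update = d_model * d_model          # weights touched by fully fine-tuning one attention projection
lora_update = d_model * r + r * d_model  # weights in the low-rank matrices A and B

print(full_update, lora_update, lora_update / full_update)
# 262144 16384 0.0625 -> the LoRA update is about 6% of one projection's weights
```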
Load the base model:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
Configure LoRA:
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
r=16, # Rank of the update matrices. Higher rank means more parameters.
lora_alpha=32, # Alpha scaling factor.
target_modules=["q", "v"], # Apply LoRA to query and value weights in attention
lora_dropout=0.05,
bias="none",
task_type=TaskType.SEQ_2_SEQ_LM # For sequence-to-sequence models like T5
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Example output: trainable params: XXXX || all params: YYYYYY || trainable%: Z.ZZZZ
This configuration applies LoRA to the query (q) and value (v) projection matrices in the attention layers. The print_trainable_parameters method shows how sharply LoRA reduces the number of parameters that need to be updated.
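If you want to verify those figures yourself, a small cross-check (a sketch over the peft_model created above) sums the trainable and total parameter counts directly:

```python
# Manual cross-check of the numbers reported by print_trainable_parameters().
trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f"trainable: {trainable:,} / total: {total:,} ({100 * trainable / total:.2f}%)")
```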
Set up the training arguments and the trainer:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
data_collator = DataCollatorForSeq2Seq(
tokenizer,
model=peft_model,
label_pad_token_id=-100, # Important for ignoring padding in loss calculation
pad_to_multiple_of=8 # Optional: pad to a multiple of 8 for fp16 tensor-core efficiency
)
output_dir = "t5-small-rag-finetuned-lora"
training_args = Seq2SeqTrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=4, # Adjust based on your GPU memory
per_device_eval_batch_size=4, # Adjust based on your GPU memory
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
learning_rate=1e-4, # Common learning rate for LoRA
num_train_epochs=3, # Adjust as needed
logging_dir=f"{output_dir}/logs",
logging_steps=10,
evaluation_strategy="epoch", # Evaluate at the end of each epoch
save_strategy="epoch", # Save model at the end of each epoch
load_best_model_at_end=True,
metric_for_best_model="eval_loss", # Or a ROUGE score if you implement compute_metrics
predict_with_generate=True, # Necessary for generating text during evaluation
fp16=True, # Enable mixed-precision training if GPU supports it
)
trainer = Seq2SeqTrainer(
model=peft_model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
tokenizer=tokenizer,
data_collator=data_collator,
)
Start training:
trainer.train()
This starts the fine-tuning process. Monitor the training and validation losses. After training completes, save the LoRA adapter:
peft_model.save_pretrained(f"{output_dir}/best_lora_adapter")
# You can also save the tokenizer for convenience
tokenizer.save_pretrained(f"{output_dir}/best_lora_adapter")
Evaluating the fine-tuned model is important. We need to assess its ability to generate accurate, relevant answers grounded in the provided context.
Define a compute_metrics function for the trainer (optional but recommended): you can use metrics such as ROUGE, which measures the overlap between generated answers and reference answers.
import numpy as np
import evaluate
rouge_metric = evaluate.load("rouge")
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace any -100 padding in the generated ids before decoding
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # rougeLsum expects newline-separated sentences; the answers here are single
    # sentences, so the decoded strings can be scored directly.
    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Express the ROUGE scores as percentages
    result = {key: value * 100 for key, value in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result
# Re-initialize trainer with compute_metrics if you define it
# trainer = Seq2SeqTrainer(
# ...
# compute_metrics=compute_metrics,
# ...
# )
# Then call trainer.evaluate() or it will be called during training if evaluation_strategy is set.
If you pass compute_metrics to the Seq2SeqTrainer, these metrics are computed automatically during evaluation. For this walkthrough we rely on eval_loss to select the best model.
Qualitative evaluation: inspect the generated outputs by hand.
Load the fine-tuned adapter:
import torch
from peft import PeftModel
# Load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
# Load the LoRA adapter
fine_tuned_model = PeftModel.from_pretrained(base_model, f"{output_dir}/best_lora_adapter")
fine_tuned_model = fine_tuned_model.to("cuda" if torch.cuda.is_available() else "cpu") # Move to device
fine_tuned_model.eval() # Set to evaluation mode
# If you saved the tokenizer along with the adapter:
# fine_tuned_tokenizer = AutoTokenizer.from_pretrained(f"{output_dir}/best_lora_adapter")
# else, use the original tokenizer
fine_tuned_tokenizer = tokenizer
Generate answers for a few test examples:
test_context = "The Atacama Desert is a desert plateau in South America covering a 1,600 km strip of land on the Pacific coast, west of the Andes Mountains. It is the driest nonpolar desert."
test_question = "What is the Atacama Desert?"
input_text = f"{prefix}question: {test_question} context: {test_context}"
input_ids = fine_tuned_tokenizer(input_text, return_tensors="pt", max_length=max_input_length, truncation=True).input_ids.to(fine_tuned_model.device)
outputs = fine_tuned_model.generate(input_ids, max_length=max_target_length, num_beams=4, early_stopping=True)
generated_answer = fine_tuned_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Context: {test_context}")
print(f"Question: {test_question}")
print(f"Generated Answer: {generated_answer}")
Check whether the answer is grounded in the provided context, actually addresses the question, and is concise and fluent. A rough automated signal is sketched below.
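As a heuristic sketch (not a formal faithfulness metric; it reuses generated_answer and test_context from the snippet above), you can measure how much of the generated answer's vocabulary also appears in the context; very low overlap can flag ungrounded answers:

```python
# Heuristic groundedness check: lexical overlap between answer and context.
# This is only an illustration; proper faithfulness evaluation needs stronger methods.
answer_tokens = set(generated_answer.lower().split())
context_tokens = set(test_context.lower().split())
overlap = len(answer_tokens & context_tokens) / max(len(answer_tokens), 1)
print(f"Lexical overlap with context: {overlap:.2f}")
```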
Comparison: compare these outputs with those from the base t5-small model (without fine-tuning). If you have a labeled test set, a simple quantitative comparison can use ROUGE scores, as sketched below:
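Here is a hedged sketch of such a comparison. It assumes a small hypothetical test_examples list of (context, question, answer) dictionaries and reuses the prefix, tokenizer, and generation settings from above; the baseline is a freshly loaded t5-small checkpoint (the copy loaded earlier now carries the LoRA layers):

```python
import evaluate
from transformers import AutoModelForSeq2SeqLM

rouge = evaluate.load("rouge")
device = next(fine_tuned_model.parameters()).device

# Hypothetical labeled test set; replace with your own held-out examples.
test_examples = [
    {
        "context": test_context,
        "question": test_question,
        "answer": "The Atacama Desert is a desert plateau in South America and the driest nonpolar desert.",
    },
]

# Baseline: a fresh copy of the original t5-small without any fine-tuning.
baseline_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)
baseline_model.eval()

def generate_answers(model):
    answers = []
    for ex in test_examples:
        text = f"{prefix}question: {ex['question']} context: {ex['context']}"
        ids = fine_tuned_tokenizer(
            text, return_tensors="pt", max_length=max_input_length, truncation=True
        ).input_ids.to(device)
        out = model.generate(ids, max_length=max_target_length, num_beams=4, early_stopping=True)
        answers.append(fine_tuned_tokenizer.decode(out[0], skip_special_tokens=True))
    return answers

references = [ex["answer"] for ex in test_examples]
for name, model in [("base t5-small", baseline_model), ("fine-tuned", fine_tuned_model)]:
    scores = rouge.compute(predictions=generate_answers(model), references=references)
    print(name, {k: round(v, 4) for k, v in scores.items()})
```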
ROUGE-L comparisons of this kind indicate the potential improvement from RAG-specific fine-tuning; actual results will vary with your data and training setup.
Once you are satisfied with the fine-tuned smaller LLM, you can integrate it as the generator in your RAG pipeline; a sketch of merging the LoRA adapter for deployment follows below. If the results fall short, experiment with the LoRA hyperparameters (r, lora_alpha), the learning rate, and the other training arguments.
This hands-on exercise demonstrated one technique for optimizing the generation component of a RAG system. By fine-tuning a smaller LLM such as t5-small with a PEFT method, you can build a specialized, efficient, and effective generator tailored to your production needs, yielding higher-quality outputs and better resource usage. This approach is one step toward building maintainable RAG solutions.
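When you are ready to deploy, one common option (a sketch; it assumes the adapter path saved above and a hypothetical merged_model output directory) is to merge the LoRA weights back into the base model so inference no longer depends on peft:

```python
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM

# Load a clean base checkpoint and merge the saved LoRA adapter into its weights.
base = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
merged = PeftModel.from_pretrained(base, f"{output_dir}/best_lora_adapter").merge_and_unload()

# The merged model behaves like a regular transformers model and can be saved
# or loaded directly as the generator in your RAG pipeline.
merged.save_pretrained(f"{output_dir}/merged_model")
tokenizer.save_pretrained(f"{output_dir}/merged_model")
```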