How to Fine-Tune an Embedding Model for RAG 🎯
Friends, embedding models play a very important role in Retrieval Augmented Generation (RAG) systems.
We often use general-purpose models such as text-embedding-ada-002 or bge-large-en.
But if you need better performance on your own specific data, fine-tuning the embedding model is a powerful technique.
In this tutorial we will learn how to fine-tune an embedding model using Hugging Face's sentence-transformers library.
Step 1: Building a Synthetic Dataset 🧪
For fine-tuning we need a dataset of questions paired with their relevant contexts (answers). If you don't have such a dataset, you can use an LLM (such as GPT-4) to generate a synthetic (artificial) dataset from your own documents.
We can do this easily with the llama-index library.
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.llms.openai import OpenAI

# Load your documents
documents = SimpleDirectoryReader("./documents").load_data()

# Split them into nodes/chunks
parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = parser.get_nodes_from_documents(documents)

# Generate question-context pairs with an LLM
# (generate_qa_embedding_pairs expects an LLM object, not a model-name string)
llm = OpenAI(model="gpt-4-1106-preview")
qa_dataset = generate_qa_embedding_pairs(nodes=nodes, llm=llm)

# Save the dataset
qa_dataset.save_json("train_dataset.json")
```
Explanation: This code loads the files from your documents folder, splits them into small chunks, and then uses GPT-4 to generate relevant questions for each chunk. This gives us the data we need for training.
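For intuition, the saved train_dataset.json holds three maps: queries, corpus, and relevant_docs, which link each generated question back to the chunk it came from. A minimal sketch of that structure (the IDs and texts below are illustrative placeholders, not real generated values):

```python
# Rough shape of the JSON written by qa_dataset.save_json(...);
# real files use generated UUIDs as keys, not "q1"/"n1".
example = {
    "queries": {"q1": "What is the refund window?"},
    "corpus": {"n1": "Refunds are accepted within 30 days of purchase."},
    "relevant_docs": {"q1": ["n1"]},
}

# Each query maps to the node (chunk) it was generated from
for qid, question in example["queries"].items():
    node_id = example["relevant_docs"][qid][0]
    print(question, "->", example["corpus"][node_id])
```

This (query, relevant chunk) pairing is exactly what the training step consumes.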
Step 2: Training the Model 🏋️‍♂️
Now we will train our base embedding model using the sentence-transformers library. We will use bge-large-en-v1.5 as the base model.
```python
import json

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer

# Load the dataset and convert it into (anchor, positive) pairs,
# which is the column format MultipleNegativesRankingLoss expects
with open("train_dataset.json") as f:
    qa = json.load(f)

train_dataset = Dataset.from_dict({
    "anchor": [qa["queries"][qid] for qid in qa["queries"]],
    "positive": [
        qa["corpus"][qa["relevant_docs"][qid][0]] for qid in qa["queries"]
    ],
})

# Initialize the base model
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Define the training loss (MultipleNegativesRankingLoss works well for RAG)
loss = losses.MultipleNegativesRankingLoss(model)

# Set the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-large-en-v1.5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)

# Create the trainer and train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

# Save the model
model.save("./bge-finetuned-model")
```
Explanation: In this code we loaded the synthetic dataset as (question, context) training pairs, chose the base model (bge-large-en-v1.5), set the loss function and training parameters, and then trained the model. The trained model is saved to a local directory.
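Once saved, the fine-tuned model is used like any other sentence-transformers model: embed the query and the chunks with model.encode(...) and rank chunks by cosine similarity. A pure-Python sketch of that ranking step, with toy 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings"; in practice these come from model.encode(...)
query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {
    "chunk_a": [0.8, 0.2, 0.1],  # similar direction -> high score
    "chunk_b": [0.0, 0.1, 0.9],  # different direction -> low score
}

# Rank chunks by similarity to the query, best first
ranked = sorted(chunk_vecs, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)
print(ranked)  # chunk_a ranks first
```

A better fine-tuned model changes these scores so that the truly relevant chunk lands at the top more often, which is what the next step measures.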
Step 3: Evaluating the Model 📊
After fine-tuning, it is important to check whether our model has actually improved. For this we can use InformationRetrievalEvaluator, which calculates metrics such as MRR (Mean Reciprocal Rank).
For this you need to build a separate evaluation dataset (corpus, queries, relevant_docs).
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# queries: Dict[query_id, query_text], corpus: Dict[doc_id, doc_text],
# relevant_docs: Dict[query_id, Set[doc_id]] -- built from your held-out eval set
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

# Evaluate the base model
base_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
print("Base model performance:")
print(evaluator(base_model))

# Evaluate the fine-tuned model
finetuned_model = SentenceTransformer("./bge-finetuned-model")
print("Fine-tuned model performance:")
print(evaluator(finetuned_model))
```
You should see the fine-tuned model score a higher MRR than the base model, which means it is returning more relevant results for your data.
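To see what MRR actually measures, here is a pure-Python sketch: for each query, find the rank of the first relevant document in the model's ranked results, then average the reciprocals of those ranks. The ranked lists below are made up for illustration:

```python
# Ranked doc IDs a model returned for each query (illustrative values)
rankings = {
    "q1": ["n3", "n1", "n2"],  # relevant doc n1 is at rank 2 -> 1/2
    "q2": ["n2", "n4", "n5"],  # relevant doc n2 is at rank 1 -> 1/1
}
relevant = {"q1": "n1", "q2": "n2"}

# Reciprocal rank of the first relevant hit, per query
reciprocal_ranks = []
for qid, ranked_ids in rankings.items():
    rank = ranked_ids.index(relevant[qid]) + 1  # 1-based rank
    reciprocal_ranks.append(1.0 / rank)

# Mean Reciprocal Rank across all queries
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(mrr)  # (1/2 + 1/1) / 2 = 0.75
```

An MRR of 1.0 would mean the relevant chunk is ranked first for every query, so a higher MRR after fine-tuning means better retrieval.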
💡 Pro Tips
- Dataset quality: The quality of your synthetic dataset matters a lot. Use a powerful LLM (like GPT-4) to generate good questions.
- Base model: Choose the right base model for your use case. The bge series of models is quite popular for RAG.
- Hyperparameters: You can get even better results by tuning learning_rate, num_train_epochs, and per_device_train_batch_size.
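The hyperparameter tip can be made concrete with a simple grid search: train one model per combination and keep the one with the best MRR on your evaluation set. A sketch of enumerating such a grid (the values below are illustrative starting points, not recommendations):

```python
from itertools import product

# Candidate values to try (illustrative, not tuned recommendations)
learning_rates = [1e-5, 2e-5, 5e-5]
epochs = [1, 3]
batch_sizes = [8, 16]

# Each combination would become one SentenceTransformerTrainingArguments run,
# scored with the Step 3 evaluator; here we only enumerate the grid.
configs = [
    {"learning_rate": lr, "num_train_epochs": ep, "per_device_train_batch_size": bs}
    for lr, ep, bs in product(learning_rates, epochs, batch_sizes)
]
print(len(configs))  # 3 * 2 * 2 = 12 runs
```

Since each run fine-tunes a large model, keep the grid small or evaluate on a subset of your data first.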