A Magical Way to Chat with Your PDFs 📄✨
Have you ever wished there were an easy way to interact with your PDF documents? With LangChain, we can build a powerful PDF chatbot that talks directly to your documents! Let's walk through how, step by step.
🔧 Prerequisites
First, we need to install a few libraries:
pip install langchain langchain-community langchain-openai pypdf python-dotenv chromadb
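The OpenAI components used below read your API key from the `OPENAI_API_KEY` environment variable. A common pattern with python-dotenv (installed above) is to keep the key in a `.env` file next to your script and call `load_dotenv()` at the top. A minimal `.env` might look like this (placeholder value, use your own key):

```
# .env — keep this file out of version control
OPENAI_API_KEY=your-openai-api-key
```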
Step 1: PDF Loader Setup 📚
To load the PDF, we'll use PyPDFLoader:
from langchain_community.document_loaders import PyPDFLoader

# Path to the PDF file
file_path = "your_document.pdf"

# Initialize the PDF loader
loader = PyPDFLoader(file_path)

# Load the documents
documents = loader.load()
Explanation:
- `PyPDFLoader` loads the PDF file.
- `.load()` converts every page into a document object.
- Each page becomes a separate document.
Step 2: Text Splitting ✂️
It's important to split the PDF text into chunks:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Split the documents into chunks
texts = text_splitter.split_documents(documents)
Explanation:
- `chunk_size=1000` → each chunk holds up to 1000 characters.
- `chunk_overlap=200` → consecutive chunks share 200 characters, so context isn't cut off at the boundaries.
- This approach helps preserve semantic meaning.
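To build intuition for these two numbers, here is a plain-Python sketch (not LangChain code) of how 1000-character chunks with a 200-character overlap tile a document — each new chunk starts `chunk_size - chunk_overlap` characters after the previous one:

```python
# Rough sketch of chunk placement; the real RecursiveCharacterTextSplitter
# also prefers natural boundaries (paragraphs, sentences) over hard cuts.
def chunk_starts(total_len, chunk_size=1000, chunk_overlap=200):
    step = chunk_size - chunk_overlap  # each chunk starts 800 chars after the last
    return list(range(0, max(total_len - chunk_overlap, 1), step))

# A 2600-character document gets chunks starting at 0, 800, and 1600;
# the chunk at 1600 runs to 2600, so the whole document is covered.
print(chunk_starts(2600))  # → [0, 800, 1600]
```

Larger overlaps cost more tokens but make it less likely that an answer straddles a chunk boundary.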
Step 3: Generating Embeddings 🧠
Embeddings convert a document's meaning into vector form:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Create the vector store
vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings
)
Explanation:
- `OpenAIEmbeddings` converts text into high-dimensional vectors.
- The `Chroma` vector store lets us store and retrieve those documents efficiently.
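Under the hood, "relevant" chunks are the ones whose vectors lie closest to the query's vector. A minimal sketch of cosine similarity, the kind of measure vector stores typically rank by (illustrative code, not Chroma's internals):

```python
import math

def cosine_similarity(a, b):
    """Score how similar two embedding vectors are (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # → 1.0 (same direction)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # → 0.0 (unrelated)
```

Real embedding vectors have hundreds or thousands of dimensions, but the ranking idea is the same.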
Step 4: Retrieval Chain Setup 🔗
The retrieval chain pulls the relevant information out of the document:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Initialize the chat model
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

# Retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
Explanation:
- `ChatOpenAI` initializes the language model.
- The `RetrievalQA` chain retrieves the relevant context from the documents.
- `chain_type="stuff"` → the retrieved context is passed to the model directly in the prompt.
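Conceptually, the "stuff" strategy just concatenates every retrieved chunk into one prompt. A simplified sketch (hypothetical helper, not LangChain's actual prompt template):

```python
def stuff_prompt(question, chunks):
    # "Stuff" all retrieved chunks into a single context block
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

prompt = stuff_prompt("What is the main point?", ["Chunk A text.", "Chunk B text."])
print(prompt)
```

This also shows the strategy's limit: every retrieved chunk must fit in the model's context window at once, which is why "stuff" works best with a modest number of chunks.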
Step 5: Chatbot Interaction 💬
Now you can chat with your PDF!
# Ask a question
query = "What is the main point of this document?"
result = qa_chain.invoke({"query": query})
print(result['result'])
This prints an answer drawn straight from your PDF's content 🎯
Advanced Features 🚀
1. Conversation Memory
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# The chain expects the history under the "chat_history" memory key
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chat_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=vectorstore.as_retriever(), memory=memory)
The memory feature helps the chatbot remember past questions and context, so the conversation feels natural and the user experience improves.
💡 Pro Tips
- Keep your documents a manageable size (avoid hundreds of MBs at once).
- Experiment with the chunking and embedding parameters to find the best accuracy.
- For confidential PDFs, use a local vector store (Chroma) instead of a cloud-hosted one.