LangChain with the InterSystems PDF documentation
Today I bring you another example of applying LangChain.
Initially I wanted to build a "chain" to run dynamic searches over the HTML documentation, but in the end it turned out to be simpler to use the PDF version of the documentation.
Create a new virtual environment
mkdir chainpdf
cd chainpdf
python -m venv .
scripts\activate
pip install openai
pip install langchain
pip install wget
pip install lancedb
pip install tiktoken
pip install pypdf
set OPENAI_API_KEY=[ Your OpenAI Key ]
python
Prepare the documents
import glob
import wget
url='https://docs.intersystems.com/irisforhealth20231/csp/docbook/pdfs.zip'
wget.download(url)
# extract docs
import zipfile
with zipfile.ZipFile('pdfs.zip','r') as zip_ref:
    zip_ref.extractall('.')
# get a list of files
pdfFiles=[file for file in glob.glob("./pdfs/pdfs/*")]
Load the documents into the Vector Store
import lancedb
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import LanceDB
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts.prompt import PromptTemplate
from langchain import OpenAI
from langchain.chains import LLMChain
embeddings = OpenAIEmbeddings()
db = lancedb.connect('lancedb')
table = db.create_table("my_table", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")
documentsAll=[]
pdfFiles=[file for file in glob.glob("./pdfs/pdfs/*")]
for file_name in pdfFiles:
    loader = PyPDFLoader(file_name)
    pages = loader.load_and_split()
    # Strip unwanted padding
    for page in pages:
        del page.lc_kwargs
        page.page_content = "".join(page.page_content.split('\xa0'))
    documents = CharacterTextSplitter().split_documents(pages)
    # Ignore the cover pages
    for document in documents[2:]:
        documentsAll.append(document)
# This will take a couple of minutes to complete
docsearch = LanceDB.from_documents(documentsAll, embeddings, connection=table)
Prepare the search template
_GetDocWords_TEMPLATE = """Answer the Question: {question}
By considering the following documents:
{docs}
"""
PROMPT = PromptTemplate(
input_variables=["docs","question"], template=_GetDocWords_TEMPLATE
)
llm = OpenAI(temperature=0, verbose=True)
chain = LLMChain(llm=llm, prompt=PROMPT)
Are you sitting down?... Let's talk to the documentation
"Qué es un adaptador de ficheros?"
# Ask the question
# First query the vector store for matching content
query = "What is a File adapter"
docs = docsearch.similarity_search(query)
# Only using the first two documents to reduce token search size on openai
chain.run(docs=docs[:2], question=query)
Answer:
'\nA file adapter is a type of software that enables the transfer of data between two different systems. It is typically used to move data from one system to another, such as from a database to a file system, or from a file system to a database. It can also be used to move data between different types of systems, such as from a web server to a database.'
"¿Qué es una tabla de bloqueo?"
# Ask the question
# First query the vector store for matching content
query = "What is a lock table"
docs = docsearch.similarity_search(query)
# Only using the first two documents to reduce token search size on openai
chain.run(docs=docs[:2], question=query)
Answer:
'\nA lock table is a system-wide, in-memory table maintained by InterSystems IRIS that records all current locks and the processes that have owned them. It is accessible via the Management Portal, where you can view the locks and (in rare cases, if needed) remove them.'
I will leave building a user interface on top of this functionality as a future exercise.
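As a starting point for that exercise, here is a minimal sketch of an interactive command-line loop built on the docsearch and chain objects created above; the ask helper and the two-document limit are illustrative choices of mine, not part of the original walkthrough:
# Minimal sketch: interactive loop over the objects built above.
# Assumes `docsearch` (the LanceDB vector store) and `chain` (the LLMChain) already exist.
def ask(question):
    # Retrieve the most relevant pages from the vector store
    docs = docsearch.similarity_search(question)
    # Only pass the first two documents to keep the token count down, as in the examples above
    return chain.run(docs=docs[:2], question=question)

while True:
    question = input("Ask the documentation (empty line to quit): ")
    if not question.strip():
        break
    print(ask(question))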