Ask an AI chat bot about local private files – How to implement that?

Christoph Dähne, 22.04.2024

I have been trying to solve the following task for some time now: I want a chat bot that answers questions about our internal wiki. Of course, the implementation is based on LLMs, or Large Language Models.

The task has two particular challenges that I kept failing at:

  1. Firstly, our internal wiki is rather large. I can't just put everything into one big prompt with the question underneath.
  2. The data is confidential. It must not leave our servers. Hence, we cannot use third-party APIs such as OpenAI. Instead, everything has to run locally.

Finally, I built a prototype that yields decent results. I present the code and the thoughts behind it in this article. It is meant as an introduction to the topic, so I may oversimplify in some places.

[Image: tiny android standing on a desk] Your personal AI assistant, which you can feed your documents for analysis. (AI generated)

How does the chat bot work in principle?

First, let's see how the whole thing works. Training our own LLM on the data from our wiki is off the table: it is too complex and expensive.

Instead, we take already trained LLMs and somehow feed our wiki data to them. For ease of understanding, I start with the naive approach, and then refine it into a functional system. Here comes the naive approach:

We take the question from the user and prepend all our wiki content. Then we also add the entire chat history and send the resulting mega prompt to our LLM to get an answer. Unfortunately, it doesn't work like that.

Firstly, the prompt is far too big: LLMs have an input limit that depends on the model. And the answer would probably be bad anyway, because the prompt includes a lot of irrelevant information, which usually has a negative effect on the quality of the answer (so I heard, no link at hand).

Interestingly, shrinking the prompt solves both issues. Let's stick to the naive idea, but:

  1. Only include a summary of the chat history.
  2. Only include parts of the wiki relevant to the current question.

Then, we end up with a prompt that contains a summary of the chat history, the relevant wiki parts that contain the information sought in the question, and the question itself. And yes: this works!

However, the prompt to the LLM now has to include the answer already: the relevant parts of the wiki. Sounds like we are back where we started, having to find the answer in the wiki ourselves? Thankfully not.
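To make this more concrete, here is a rough sketch of what such a composed prompt could look like. The template and the placeholder names are made up for illustration; LangChain assembles its actual prompt internally.

PROMPT_TEMPLATE = """Use the following wiki excerpts and the chat summary to answer the question.
If the answer is not contained in them, say that you do not know.

Chat summary:
{chat_summary}

Wiki excerpts:
{relevant_wiki_parts}

Question: {question}
Answer:"""

prompt = PROMPT_TEMPLATE.format(
    chat_summary="...",          # condensed chat history
    relevant_wiki_parts="...",   # retrieved text blocks
    question="...")              # the current user question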

Summarizing the chat history

The short answer first: we use our LLM to summarize the chat history. To be more precise, we generate a new, standalone question from the chat history and the original question.
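As a sketch, this condensing step can be imagined as a prompt roughly like the following. The wording is illustrative; LangChain's ConversationalRetrievalChain ships its own default template for this step.

CONDENSE_QUESTION_TEMPLATE = """Given the following conversation and a follow-up question,
rephrase the follow-up question to be a standalone question.

Chat history:
{chat_history}
Follow-up question: {question}
Standalone question:"""

# Made-up example of what the LLM returns:
#   Chat history: "Q: Who maintains the backup server? A: The infrastructure team."
#   Follow-up:    "How do I contact them?"
#   Standalone:   "How do I contact the infrastructure team that maintains the backup server?"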

Overview: From question to answer

How to find the relevant parts of the wiki is explained below. With such a system in place, we can answer a chat question as follows:

  1. Submit question about information in the wiki
  2. Use LLM X (eg llava) to create a new question from the original question and the chat history
  3. Use LLM Y (eg instructor) to find related wiki parts (explained below)
  4. Create prompt including the new question and the related wiki parts
  5. Use LLM X to generate an answer
  6. Append the original question and the generated answer to the chat history

This scheme is so common that LangChain provides a ready-to-use implementation: the ConversationalRetrievalChain. See their documentation or the code in this article for more details.

Finding relevant wiki parts

Now to the interesting part: embeddings. Embeddings are another type of output from LLMs. We use them to build a large search database in which we store lots and lots of wiki parts as text blocks, so that we can find the relevant content when a question is asked.

We don't train a model with the knowledge from our wiki; instead, we use a model to sort the text blocks by similarity, so to speak, such that similar content usually ends up "close together". When a question is asked, we sort this question into the same similarity space and then find relevant text blocks nearby. That is the basic idea behind it, anyway.

The embeddings are high-dimensional vectors, ie lists of floats. The dimension depends on the model, and the training of the model determines the embeddings. Now we have a lot of vectors for a lot of text blocks, with the property that text blocks with similar content are usually close to each other in this vector space. Depending on which LLM you use for this, the results are of varying quality. Looking back, finding an open, local LLM that provides embeddings fitting my use case has been the hardest part of this project.
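To get a feeling for what "close together" means, here is a minimal sketch that embeds three made-up sentences and compares them by cosine similarity. It assumes a locally installed Ollama with a pulled model (the model name is just an example); any embedding model supported by LangChain works the same way.

import numpy as np
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="llava")

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_backup = embeddings.embed_query("How do we restore a database backup?")
vec_restore = embeddings.embed_query("Steps to recover the production database")
vec_lunch = embeddings.embed_query("The canteen serves pasta on Fridays")

# Related texts should score noticeably higher than unrelated ones.
print(cosine_similarity(vec_backup, vec_restore))
print(cosine_similarity(vec_backup, vec_lunch))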

Nonetheless, the embeddings are absolutely crucial: if I'm not able to find the relevant text blocks in my wiki, then I can't generate a useful answer, no matter how good my chat LLM is. The information just isn't available.

Overview: From wiki content to embeddings

  1. extract textual content from source data (wiki)
  2. split text into overlapping text blocks of suitable size (depends on your wiki)
  3. use LLM Y (eg instructor) to generate embeddings
  4. store those embeddings for later use

The Vector Store

We store the embeddings along with their text blocks and metadata, ie the source of the text block, in a specialized database called a vector store. In the code example I use ChromaDB, but there are other implementations as well. For the code below, I assume that we have a folder containing the wiki content. Your file types may differ; in our case it is a bunch of HTML files.

The first thing we're going to do is go over this directory and extract the textual content from the documents. The content is, so to speak, a text block without formatting. Now we have one long text block per file along with some metadata, ie the source file. The metadata becomes important if we want to include sources in the chat answers. However, the text blocks are usually much too long.

Hence, we split them up again. You can experiment a bit to see what block length and what overlap delivers the best results for your own use case.

Then we use the large language model of our choice. I use instructor, which specializes in embeddings; among the open local models I tried, it gives the best results. We use this model to create the points in our embedding vector space, one point for each text block. The points are represented by vectors. Once we have them, we store them in our vector store.
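Jumping ahead briefly to the retrieval side: once the vector store is filled, finding the relevant wiki parts boils down to a similarity search. A minimal sketch of querying the store directly, with an example path and an example question (adjust both to your own setup):

from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.vectorstores import Chroma

embedding_function = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-xl",
    # mps for Apple M devices, cuda for Nvidia
    model_kwargs={"device": "mps"})
db = Chroma(persist_directory="./db/instructor",
            embedding_function=embedding_function)

# Return the 4 text blocks closest to the question in the embedding space,
# together with their metadata, eg the source file.
for doc in db.similarity_search("How do I request a new laptop?", k=4):
    print(doc.metadata.get("source"), doc.page_content[:80])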

So much for the full import of our wiki data. Note that we process each document individually. This means that if we are able to track changes, which is of course no problem in our wiki, it is conceivable to only carry out partial imports of updated or new wiki pages. Not only does this limit resource usage, but users also no longer need to wait for nightly/hourly/… batch imports to see the latest data in the chat. That is not implemented in the prototype – just a side note.
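Just to illustrate the idea, here is a hypothetical sketch of such a partial import. It is not part of the prototype; it assumes that each text block carries its source page in its metadata and that the Chroma wrapper's get and delete methods can be used to replace the chunks of a changed page.

def reimport_pages(db, text_splitter, changed_docs):
    """Re-embed only the chunks of changed wiki pages (hypothetical helper)."""
    # Remove all previously stored chunks of the changed pages...
    for source in {doc.metadata["source"] for doc in changed_docs}:
        existing = db.get(where={"source": source})
        if existing["ids"]:
            db.delete(ids=existing["ids"])
    # ...then split the fresh content and add it to the store.
    db.add_documents(text_splitter.split_documents(changed_docs))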

How to write it in Python?

So much for the ideas behind the implementation. Now, of course, the big question: how do you write it? I have a few code examples from the prototype below. They are of course not suitable for production use, but you can clearly see which components are called and how they interact. Feel free to take inspiration from them.

The implementation is based on LangChain, a Python framework. It helps to implement LLM specific features with support for various models. For example: the ConversationalRetrievalChain implements summarizing a chat history and question, finding relevant information in a vector store and generating an answer.

To import file content into a vector store, you can use the DirectoryLoader. Under the hood it uses unstructured.io.

I have added comments and links to the documentation of the individual components almost everywhere in the code, so you can take a closer look.

So finally, here comes the code. The first file queries the vector store in a chat loop, the second one imports the wiki data into it; the list of direct dependencies follows at the end. I built the code such that it is easy to switch models for testing and comparing the results.

import argparse
import os

import langchain
from langchain.cache import InMemoryCache
from langchain.chains import ConversationalRetrievalChain
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain_core.vectorstores import VectorStoreRetriever
from langchain_openai import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceInstructEmbeddings


def load_vector_store(model, persist_directory):
    """Loads the vector store from the given directory."""
    if model == "instructor":
        # model is fetched automatically on first run
        embedding_function = HuggingFaceInstructEmbeddings(
            model_name="hkunlp/instructor-xl",
            # mps for Apple M devices, cuda for Nvidia
            model_kwargs={"device": "mps"})
    elif model == 'openai':
        # https://python.langchain.com/docs/integrations/text_embedding/openai
        embedding_function = OpenAIEmbeddings(
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            # https://platform.openai.com/docs/models/embeddings
            model='text-embedding-3-large')
    else:
        # Ollama needs to be installed separately, see https://ollama.com
        # https://python.langchain.com/docs/integrations/text_embedding/ollama
        embedding_function = OllamaEmbeddings(
            # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
            model=model)
    # https://python.langchain.com/docs/integrations/vectorstores/chroma
    return Chroma(persist_directory=persist_directory,
                  embedding_function=embedding_function)


def create_llm(model):
    """Creates a new LLM instance."""
    langchain.llm_cache = InMemoryCache()
    if model == 'openai':
        # https://python.langchain.com/docs/integrations/llms/openai
        return OpenAI(
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            model='gpt-3.5-turbo-instruct')
    else:
        # Ollama needs to be installed separately, see https://ollama.com
        # https://python.langchain.com/docs/integrations/llms/ollama
        return Ollama(
            # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
            model=model)


def create_qa_chain(vector_store, llm, verbose):
    """Creates a new Question-Answer chain."""
    # https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        # https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStoreRetriever.html
        retriever=VectorStoreRetriever(vectorstore=vector_store),
        verbose=verbose)


history = []


def answer_question(qa_chain, question, say):
    """Answers the given question and adds it to the history."""
    answer = qa_chain.invoke(
        {"question": question, "chat_history": history})['answer']
    say(f"{answer}")
    history.append((question, answer))


def question_and_answer(qa_chain):
    print("\nPlease enter question:")
    question = input("")
    print("\nAnswer:")
    answer_question(qa_chain, question, print)


def main():
    parser = argparse.ArgumentParser(
        description='CLI tool for interactive or scripted LLM chat')
    # https://docs.python.org/3/library/argparse.html#quick-links-for-add-argument
    parser.add_argument('-m', '--model', required=True,
                        help='Name of the model to use')
    parser.add_argument('-e', '--embeddings-model',
                        help='Model to use for the embeddings, defaults to <model>')
    parser.add_argument('-d', '--database',
                        help='Source for the database files, defaults to ../import/db/<model>')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='Enable verbose mode')
    parser.add_argument('-p', '--prompt', action='append',
                        help='One or more prompts to use instead of interactive mode')
    args = parser.parse_args()

    model = args.model
    embeddings_model = args.embeddings_model if args.embeddings_model else model
    persist_directory = args.database if args.database else os.path.join(
        os.getenv("DB_DIR", default="../import/db"), embeddings_model)
    assert os.path.exists(persist_directory)
    verbose = args.verbose
    prompts = args.prompt

    vector_store = load_vector_store(embeddings_model, persist_directory)
    llm = create_llm(model)
    qa_chain = create_qa_chain(vector_store, llm, verbose)

    if prompts:
        for prompt in prompts:
            answer_question(qa_chain, prompt, print)
    else:
        while True:
            question_and_answer(qa_chain)


if __name__ == "__main__":
    main()

import argparse
import os

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_documents(data_dir, verbose):
    """Loads all documents from the data directory."""
    print(f"Loading documents from {data_dir}") if verbose else None
    loader = DirectoryLoader(data_dir)
    docs = loader.load()
    print(f"Loaded {len(docs)} documents, eg") if verbose else None
    print(docs[0]) if verbose else None

    # Split the documents into chunks with an overlap
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50)
    docs = text_splitter.split_documents(docs)
    print(f"Split into {len(docs)} chunks") if verbose else None

    # ChromaDB cannot handle lists in metadata,
    # but only str, int, float or bool
    for doc in docs:
        metadata = doc.metadata
        for key in metadata:
            if isinstance(metadata[key], list):
                if key == "source":
                    metadata[key] = metadata[key][0]
                else:
                    metadata[key] = str(metadata[key])
    return docs


def create_vector_store(docs, model, persist_directory):
    """Creates and persists a vector store."""
    if model == "instructor":
        # model is fetched automatically on first run
        embedding_function = HuggingFaceInstructEmbeddings(
            model_name="hkunlp/instructor-xl",
            # mps for Apple M devices, cuda for Nvidia
            model_kwargs={"device": "mps"})
    elif model == 'openai':
        # https://python.langchain.com/docs/integrations/text_embedding/openai
        embedding_function = OpenAIEmbeddings(
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            # https://platform.openai.com/docs/models/embeddings
            model='text-embedding-3-large')
    else:
        # Ollama needs to be installed separately, see https://ollama.com
        # https://python.langchain.com/docs/integrations/text_embedding/ollama
        embedding_function = OllamaEmbeddings(
            # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
            model=model)
    # https://python.langchain.com/docs/integrations/vectorstores/chroma
    db = Chroma.from_documents(docs, embedding_function,
                               persist_directory=persist_directory)
    db.persist()


def dump_documents(documents):
    """Prints the documents to the console."""
    for doc in documents:
        print(doc)
        print("---")


def main():
    parser = argparse.ArgumentParser(
        description='CLI tool to import documents into a vector store.')
    # https://docs.python.org/3/library/argparse.html#quick-links-for-add-argument
    parser.add_argument('-m', '--model', required=True,
                        help='Name of the model to use')
    parser.add_argument('-s', '--source-data',
                        help='Path to the folder containing the source data files, defaults to ./data')
    parser.add_argument('-d', '--database',
                        help='Path to the database files, defaults to ./db/<model>')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='Enable verbose mode')
    parser.add_argument('-dry_run', '--dry-run', action='store_true',
                        help='Just create and print the document chunks instead of the embeddings. Implies -v.')
    args = parser.parse_args()

    model = args.model
    source_data = args.source_data if args.source_data else "./data"
    assert os.path.exists(source_data), "Source data directory does not exist"
    dry_run = args.dry_run
    verbose = args.verbose or dry_run
    db_dir = args.database if args.database else "./db"
    persist_directory = os.path.join(db_dir, model)
    print("Persisting to", persist_directory) if verbose else None

    documents = load_documents(source_data, verbose)
    if dry_run:
        dump_documents(documents)
    else:
        create_vector_store(documents, model, persist_directory)


if __name__ == "__main__":
    main()

argparse
chromadb
InstructorEmbedding
langchain
langchain_community
langchain_openai
sentence_transformers==2.2.2
torch

Which LLM do I use, and why?

As you probably noticed, the implementation allows the use of OpenAI models. Note that those do not run locally; I use them for quality comparison only. Over the last months I periodically tried out local models. So far, the limiting factor has been the model for the embeddings. Recently, I got quite good results with the following LLMs:

  1. instructor (hkunlp/instructor-xl) for the embeddings
  2. llava (run via Ollama) for the chat

Both are free models. Instructor specializes in embeddings. There is a YouTube video using instructor and mistral, which seems to work as well. However, no other local embedding model I tried could compete with instructor. Feel free to make your own measurements for your data and use case – and feel free to share them.

Alternative: PrivateGPT

I stumbled across a project called PrivateGPT. It seems to implement exactly this use case as open source: chatting with your own local documents. It uses different models, I think. Take a look if you are interested and feel free to share your experience. I have not tested it yet.

 

Thanks for reading this far. Have fun building your own chatbot.

Appendix

I sometimes have trouble figuring out the library versions when re-implementing examples locally. Here are all direct and indirect dependencies used including the version numbers.

argparse==1.4.0
chromadb==0.4.24
InstructorEmbedding==1.0.1
langchain==0.1.12
langchain-community==0.0.28
langchain-openai==0.1.0
sentence-transformers==2.2.2
torch==2.2.1
## The following requirements were added by pip freeze:
aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.3.0
asgiref==3.7.2
attrs==23.2.0
backoff==2.2.1
bcrypt==4.1.2
build==1.1.1
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
chroma-hnswlib==0.7.3
click==8.1.7
coloredlogs==15.0.1
dataclasses-json==0.6.4
Deprecated==1.2.14
distro==1.9.0
fastapi==0.110.0
filelock==3.13.1
flatbuffers==24.3.7
frozenlist==1.4.1
fsspec==2024.3.0
google-auth==2.28.2
googleapis-common-protos==1.63.0
grpcio==1.62.1
h11==0.14.0
httpcore==1.0.4
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
importlib-metadata==6.11.0
importlib_resources==6.3.1
Jinja2==3.1.3
joblib==1.3.2
jsonpatch==1.33
jsonpointer==2.4
kubernetes==29.0.0
langchain-core==0.1.33
langchain-text-splitters==0.0.1
langsmith==0.1.27
MarkupSafe==2.1.5
marshmallow==3.21.1
mmh3==4.1.0
monotonic==1.6
mpmath==1.3.0
multidict==6.0.5
mypy-extensions==1.0.0
networkx==3.2.1
nltk==3.8.1
numpy==1.26.4
oauthlib==3.2.2
onnxruntime==1.17.1
openai==1.14.2
opentelemetry-api==1.23.0
opentelemetry-exporter-otlp-proto-common==1.23.0
opentelemetry-exporter-otlp-proto-grpc==1.23.0
opentelemetry-instrumentation==0.44b0
opentelemetry-instrumentation-asgi==0.44b0
opentelemetry-instrumentation-fastapi==0.44b0
opentelemetry-proto==1.23.0
opentelemetry-sdk==1.23.0
opentelemetry-semantic-conventions==0.44b0
opentelemetry-util-http==0.44b0
orjson==3.9.15
overrides==7.7.0
packaging==23.2
pillow==10.2.0
posthog==3.5.0
protobuf==4.25.3
pulsar-client==3.4.0
pyasn1==0.5.1
pyasn1-modules==0.3.0
pydantic==2.6.4
pydantic_core==2.16.3
PyPika==0.48.9
pyproject_hooks==1.0.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
PyYAML==6.0.1
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.4.0
rsa==4.9
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.12.0
sentencepiece==0.2.0
setuptools==69.2.0
six==1.16.0
sniffio==1.3.1
SQLAlchemy==2.0.28
starlette==0.36.3
sympy==1.12
tenacity==8.2.3
threadpoolctl==3.4.0
tiktoken==0.6.0
tokenizers==0.15.2
torchvision==0.17.1
tqdm==4.66.2
transformers==4.39.1
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.10.0
urllib3==2.2.1
uvicorn==0.28.0
uvloop==0.19.0
watchfiles==0.21.0
websocket-client==1.7.0
websockets==12.0
wrapt==1.16.0
yarl==1.9.4
zipp==3.18.1
