Ask an AI chat bot about local private files – How to implement that?

Christoph Dähne, 22.04.2024

I have been trying to solve the following task for some time now: I want a chat bot that answers questions about our internal wiki. Of course, the implementation is based on Large Language Models (LLMs).

The task has two peculiarities that kept tripping me up:

  1. Firstly, our internal wiki is rather large. I can't just put everything into one big prompt with the question underneath.
  2. The data is confidential. It must not leave our servers. Hence, we cannot use third-party APIs, such as OpenAI. Instead everything has to run locally.

Finally, I built a prototype yielding decent results. In this article I present the code and the thoughts behind it. It is meant as an introduction to the topic, so I may oversimplify in some places.

[Image: Tiny android standing on a desk] Your personal AI assistant you can feed your documents for analysis. (AI generated)

How does the chat bot work in principle?

First, let's see how the whole thing works. Training our own LLM on the data from our wiki is off the table: it is too complex and too expensive.

Instead, we take already trained LLMs and somehow feed our wiki data to them. For ease of understanding, I start with the naive approach, and then refine it into a functional system. Here comes the naive approach:

We take the question from the user and prepend all our wiki content. Then we also add the entire chat history and send the resulting mega prompt to our LLM and get an answer. It doesn't work like that.

Firstly, the prompt is far too big: LLMs have an input limit that depends on the model. But even if it fit, the answer would probably be bad, because the prompt includes a lot of irrelevant information. This usually has a negative effect on the quality of the answer (so I have heard; I have no link at hand).

Interestingly, shrinking the prompt solves the issues. Let's stick to the naive idea, but

  1. Only include a summary of the chat history.
  2. Only include parts of the wiki relevant to the current question.

Then, we end up with a prompt that contains a summary of the chat history, the relevant wiki parts that contain the information sought in the question, and the question itself. And yes: this works!
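To make the shape of such a reduced prompt concrete, here is a minimal sketch in plain Python. The wording of the instructions, the function name `build_prompt` and the example wiki excerpts are assumptions for illustration only; they are not taken from the prototype.

```python
def build_prompt(history_summary: str, wiki_parts: list[str], question: str) -> str:
    """Assemble the reduced prompt: history summary, relevant wiki parts, question."""
    # Number the excerpts so an answer could refer back to them.
    context = "\n\n".join(
        f"[{i}] {part}" for i, part in enumerate(wiki_parts, start=1)
    )
    return (
        "Answer the question using only the wiki excerpts below.\n\n"
        f"Summary of the conversation so far:\n{history_summary}\n\n"
        f"Wiki excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical example data
example_prompt = build_prompt(
    history_summary="The user asked how to request vacation.",
    wiki_parts=[
        "Vacation requests are filed in the HR tool.",
        "Requests need approval by the team lead.",
    ],
    question="Who has to approve the request?",
)
print(example_prompt)
```

The prompt stays small because only the summary and a handful of excerpts are included, not the whole wiki.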

However, the prompt to the LLM already has to include the answer, namely the relevant parts of the wiki. That sounds like we are back at the start: finding the answer to the question in the wiki. Thankfully not.

Summarizing the chat history

The short answer first: we use our LLM to summarize the chat history. To be more precise, we generate a new, standalone question from the chat history and the original question.
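A minimal sketch of this rewriting step, assuming the LLM is available as a plain function from prompt string to answer string. The template text, `condense_question` and the canned `fake_llm` are hypothetical stand-ins so that the example runs without a model.

```python
from typing import Callable

# Hypothetical template; real frameworks ship their own wording.
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow-up question, "
    "rephrase the follow-up question to be a standalone question.\n\n"
    "Conversation:\n{chat_history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)


def condense_question(
    llm: Callable[[str], str],
    history: list[tuple[str, str]],
    question: str,
) -> str:
    """Turn a follow-up question into a standalone one using the chat history."""
    if not history:
        # Nothing to condense: the first question already stands on its own.
        return question
    transcript = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    prompt = CONDENSE_TEMPLATE.format(chat_history=transcript, question=question)
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned rewrite."""
    return " Who approves vacation requests? "


standalone = condense_question(
    fake_llm,
    [("How do I request vacation?", "Use the HR tool.")],
    "Who approves it?",
)
print(standalone)
```

The standalone question no longer contains "it", so it can be used to search the wiki without the rest of the conversation.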

Overview: From question to answer

I explain below how to find the relevant parts of the wiki. With such a system in place, we can answer a chat question as follows:

  1. Submit question about information in the wiki
  2. Use LLM X (eg llava) to create a new question from the original question and the chat history
  3. Use LLM Y (eg instructor) to find related wiki parts (explained below)
  4. Create prompt including the new question and the related wiki parts
  5. Use LLM X to generate an answer
  6. Append original question and answer to the history

This scheme is so common that LangChain provides a ready-to-use implementation: the ConversationalRetrievalChain. See their documentation or the code in this article for more details.
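Stripped of any framework, the six steps can be sketched as one function. This is not LangChain's implementation, only an illustration of the control flow; `condense`, `retrieve` and `generate` are hypothetical callables that stand in for LLM X and LLM Y, and the stubs below return canned values.

```python
from typing import Callable

History = list[tuple[str, str]]


def answer_with_history(
    question: str,
    history: History,
    condense: Callable[[History, str], str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Run one chat turn: condense, retrieve, prompt, answer, remember."""
    standalone = condense(history, question)      # step 2: new question
    wiki_parts = retrieve(standalone)             # step 3: related wiki parts
    prompt = (                                    # step 4: build the prompt
        "Context:\n" + "\n".join(wiki_parts) + f"\n\nQuestion: {standalone}"
    )
    answer = generate(prompt)                     # step 5: generate the answer
    history.append((question, answer))            # step 6: extend the history
    return answer


# Stubs so the sketch runs without any model (all values are made up).
def stub_condense(history: History, question: str) -> str:
    return question


def stub_retrieve(question: str) -> list[str]:
    return ["Vacation requests are approved by the team lead."]


def stub_generate(prompt: str) -> str:
    return "The team lead."


chat_history: History = []
reply = answer_with_history(
    "Who approves vacation?", chat_history,
    stub_condense, stub_retrieve, stub_generate,
)
print(reply)
```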

Finding relevant wiki parts

Now to the interesting part: embeddings. Embeddings are another kind of output that LLMs can produce. We use them to build a large search database that stores lots and lots of wiki parts as text blocks, so that we can find the relevant content when a question is asked.

We don't train a model with the knowledge from our wiki, but we use a model to sort the text blocks by similarity, so to speak, such that similar content is usually "close together". If a question is asked, we also sort this question into this similarity space and then find relevant text blocks nearby. That is the basic idea, anyway.

The embeddings are high-dimensional vectors, that is, lists of floats. The dimension depends on the model. The training of the model determines the embeddings. Now, we have a lot of vectors for a lot of text blocks with the property that text blocks with similar content are usually close to each other in this vector space. Depending on which LLM you use for this, the results are of varying quality. Looking back, finding an open, local LLM providing embeddings that fit my use case has been the hardest part of this project.
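What "close to each other" can mean is easy to show with a few made-up vectors. The sketch below ranks text blocks by cosine similarity, which is one common closeness measure; whether a given vector store uses cosine similarity or another distance depends on its configuration. The three-dimensional vectors are invented for illustration, real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(
    query: list[float],
    blocks: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[str]:
    """Return the texts of the k blocks whose vectors are closest to the query."""
    ranked = sorted(
        blocks,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up embeddings: the first two dimensions loosely mean "HR topics".
toy_blocks = [
    ("Vacation policy", [0.9, 0.1, 0.0]),
    ("Server setup", [0.0, 0.2, 0.9]),
    ("Sick leave", [0.8, 0.3, 0.1]),
]
nearest = most_similar([1.0, 0.2, 0.0], toy_blocks, k=2)
print(nearest)
```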

Nonetheless, the embeddings are absolutely crucial: if I'm not able to find the relevant text blocks in my wiki, then I can't generate a useful answer, no matter how good my chat LLM is. The information just isn't available.

Overview: From wiki content to embeddings

  1. extract textual content from source data (wiki)
  2. split text into overlapping text blocks of suitable size (depends on your wiki)
  3. use LLM Y (eg instructor) to generate embeddings
  4. store those embeddings for later use

The Vector Store

We store the embeddings along with their text blocks and metadata, ie the source of the text block, in a specialized database called a vector store. In the code example I use ChromaDB, but there are other implementations as well. In the code example below, I assume that we have a folder containing the wiki content. Your file types may differ. In our case it is a bunch of HTML files.

The first thing we're going to do is go over this directory and extract the textual content from the documents. The content is, so to speak, a text block without formatting. Now we have one long text block per file along with some metadata, ie the source file. The metadata becomes important if we want to include sources in the chat answers. However, the text blocks are usually much too long.

Hence, we split them up again. You can experiment a bit to see what block length and what overlap delivers the best results for your own use case.
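The idea of overlapping blocks can be sketched with a simple fixed-size window. Note that this is a simplification: the import script in this article uses LangChain's RecursiveCharacterTextSplitter, which additionally tries to break at natural separators. The function name `split_with_overlap` is an assumption for illustration.

```python
def split_with_overlap(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> list[str]:
    """Cut text into chunks of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # The current chunk already reaches the end of the text.
            break
        start += step
    return chunks


small_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(small_chunks)
```

Each chunk starts with the last character of the previous one, so a sentence cut at a chunk border still appears in full in at least one neighbouring chunk when the overlap is large enough.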

Then we use the large language model of our choice. I use instructor, which specializes in embeddings. At least among the open local models I tried, it gives the best results. We use this model to create the points in our embedding vector space: one point for each text block. The points are represented by vectors. Once we have them, we store them in our vector store.
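To illustrate what a vector store keeps per text block, here is a toy in-memory version. It is in no way how ChromaDB works internally; `ToyVectorStore`, `toy_embed` and the tiny vocabulary are hypothetical, and the word-counting "embedding" only stands in for a real model such as instructor.

```python
import math
from dataclasses import dataclass
from typing import Callable

# Hypothetical vocabulary for the toy embedding; a real model learns its own features.
VOCABULARY = ["vacation", "server", "password"]


def toy_embed(text: str) -> list[float]:
    """Counts vocabulary words; a stand-in for a real embedding model."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCABULARY]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0  # zero vectors match nothing


@dataclass
class Entry:
    text: str
    metadata: dict
    vector: list[float]


class ToyVectorStore:
    """Keeps text block, metadata and embedding together and searches by similarity."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.entries: list[Entry] = []

    def add(self, text: str, metadata: dict) -> None:
        self.entries.append(Entry(text, metadata, self.embed(text)))

    def query(self, question: str, k: int = 1) -> list[Entry]:
        vector = self.embed(question)
        ranked = sorted(
            self.entries, key=lambda e: cosine(vector, e.vector), reverse=True
        )
        return ranked[:k]


store = ToyVectorStore(toy_embed)
store.add("Vacation requests go through the HR tool.", {"source": "hr/vacation.html"})
store.add("Reset your server password via the admin panel.", {"source": "it/passwords.html"})
best = store.query("How do I change my password?")[0]
print(best.metadata["source"])
```

Because the metadata travels with the text block, the chat can later name the source file of each retrieved passage.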

This is the full import of our wiki data. At this point, it is noticeable that we process each document individually. This means that if we are able to track changes, which is of course no problem in our wiki, it is conceivable to only carry out partial imports of updated or new wiki pages. Not only does this limit the usage of resources, but also, users no longer need to wait for nightly/hourly/… batch imports to see the latest data in the chat. But that is not implemented in the prototype – just a side note.
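One possible way to detect which pages need a partial import, assuming the wiki is exported as files and no change feed is at hand, is to compare content hashes between runs. This is not part of the prototype; `find_changed_files` and the manifest format are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file content, used as a cheap change detector."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed_files(
    data_dir: Path, known: dict[str, str]
) -> tuple[list[Path], dict[str, str]]:
    """Return new or modified files plus the updated manifest of digests."""
    current = {
        str(path.relative_to(data_dir)): file_digest(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }
    changed = [
        data_dir / name
        for name, digest in current.items()
        if known.get(name) != digest
    ]
    return changed, current


# Demo with throwaway files (hypothetical wiki export).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "a.html").write_text("<p>old</p>")
    (demo_dir / "b.html").write_text("<p>same</p>")
    first_run, manifest = find_changed_files(demo_dir, {})
    first_names = [p.name for p in first_run]
    (demo_dir / "a.html").write_text("<p>new</p>")
    second_run, manifest = find_changed_files(demo_dir, manifest)
    changed_names = [p.name for p in second_run]
print(first_names, changed_names)
```

Only the files returned as changed would then be split, embedded and written to the vector store again. Removing blocks of deleted or outdated pages from the store is a separate step that this sketch does not cover.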

How to write it in Python?

So much for the ideas behind the implementation. Now of course the big question: how do you write it? Below are a few code examples from a prototype. They are of course not suitable for production use, but you can clearly see which components are called and how they interact. Feel free to draw inspiration from them.

The implementation is based on LangChain, a Python framework. It helps to implement LLM-specific features with support for various models. For example: the ConversationalRetrievalChain implements summarizing a chat history and question, finding relevant information in a vector store and generating an answer.

To import file content into a vector store, you can use the DirectoryLoader. Under the hood it uses unstructured.io.

I have added comments and links to the documentation of the individual components almost everywhere in the code. So you can take a closer look.

So finally, here comes the code. One file is for the data import into the vector store, the other one for querying it. I built the code such that it is easy to switch models for testing and comparing the results.

chat.py

    import argparse
    import os

    import langchain
    from langchain.cache import InMemoryCache
    from langchain.chains import ConversationalRetrievalChain
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.llms import Ollama
    from langchain_community.vectorstores import Chroma
    from langchain_core.vectorstores import VectorStoreRetriever
    from langchain_openai import OpenAI
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.embeddings import HuggingFaceInstructEmbeddings


    def load_vector_store(model, persist_directory):
        """Loads the vector store from the given directory."""
        if model == "instructor":
            # model is fetched automatically on first run
            embedding_function = HuggingFaceInstructEmbeddings(
                model_name="hkunlp/instructor-xl",
                # mps for Apple M devices, cuda for Nvidia
                model_kwargs={"device": "mps"})
        elif model == 'openai':
            # https://python.langchain.com/docs/integrations/text_embedding/openai
            embedding_function = OpenAIEmbeddings(
                openai_api_key=os.getenv("OPENAI_API_KEY"),
                # https://platform.openai.com/docs/models/embeddings
                model='text-embedding-3-large')
        else:
            # Ollama needs to be installed separately, see https://ollama.com
            # https://python.langchain.com/docs/integrations/text_embedding/ollama
            embedding_function = OllamaEmbeddings(
                # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
                model=model)
        # https://python.langchain.com/docs/integrations/vectorstores/chroma
        return Chroma(persist_directory=persist_directory, embedding_function=embedding_function)


    def create_llm(model):
        """Creates a new LLM instance."""
        langchain.llm_cache = InMemoryCache()
        if model == 'openai':
            # https://python.langchain.com/docs/integrations/llms/openai
            return OpenAI(
                openai_api_key=os.getenv("OPENAI_API_KEY"),
                model='gpt-3.5-turbo-instruct')
        else:
            # Ollama needs to be installed separately, see https://ollama.com
            # https://python.langchain.com/docs/integrations/llms/ollama
            return Ollama(
                # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
                model=model)


    def create_qa_chain(vector_store, llm, verbose):
        """Creates a new Question-Answer chain."""
        # https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html
        return ConversationalRetrievalChain.from_llm(
            llm=llm,
            # https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStoreRetriever.html
            retriever=VectorStoreRetriever(vectorstore=vector_store),
            verbose=verbose)


    history = []


    def answer_question(qa_chain, question, say):
        """Answers the given question and adds it to the history."""
        answer = qa_chain.invoke({"question": question, "chat_history": history})['answer']
        say(f"{answer}")
        history.append((question, answer))


    def question_and_answer(qa_chain):
        print("\nPlease enter question:")
        question = input("")
        print("\nAnswer:")
        answer_question(qa_chain, question, print)


    def main():
        parser = argparse.ArgumentParser(
            description='CLI tool for interactive or scripted LLM chat')
        # https://docs.python.org/3/library/argparse.html#quick-links-for-add-argument
        parser.add_argument('-m', '--model', required=True,
                            help='Name of the model to use')
        parser.add_argument('-e', '--embeddings-model',
                            help='Model to use for the embeddings, defaults to <model>')
        parser.add_argument('-d', '--database',
                            help='Source for the database files, defaults to ../import/db/<model>')
        parser.add_argument('-v', '--verbose', action='store_true',
                            help='Enable verbose mode')
        parser.add_argument('-p', '--prompt', action='append',
                            help='One or more prompts to use instead of interactive mode')
        args = parser.parse_args()

        model = args.model
        embeddings_model = args.embeddings_model if args.embeddings_model else model
        persist_directory = args.database if args.database else os.path.join(
            os.getenv("DB_DIR", default="../import/db"),
            embeddings_model)
        assert os.path.exists(persist_directory)
        verbose = args.verbose
        prompts = args.prompt

        vector_store = load_vector_store(embeddings_model, persist_directory)
        llm = create_llm(model)
        qa_chain = create_qa_chain(vector_store, llm, verbose)
        if prompts:
            for prompt in prompts:
                answer_question(qa_chain, prompt, print)
        else:
            while (True):
                question_and_answer(qa_chain)

    if __name__ == "__main__":
        main()

import.py

    import argparse
    import os

    from langchain_community.document_loaders import DirectoryLoader
    from langchain_community.embeddings import HuggingFaceInstructEmbeddings
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter


    def load_documents(data_dir, verbose):
        """Loads all documents from the data directory."""
        print(f"Loading documents from {data_dir}") if verbose else None
        loader = DirectoryLoader(data_dir)
        docs = loader.load()
        print(f"Loaded {len(docs)} documents, eg") if verbose else None
        print(docs[0]) if verbose else None

        # Split the documents into chunks with an overlap
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50)
        docs = text_splitter.split_documents(docs)
        print(f"Split into {len(docs)} chunks") if verbose else None

        # ChromaDB cannot handle lists in metadata,
        # but only str, int, float or bool
        for doc in docs:
            metadata = doc.metadata
            for key in metadata:
                if isinstance(metadata[key], list):
                    if key == "source":
                        metadata[key] = metadata[key][0]
                    else:
                        metadata[key] = str(metadata[key])
        return docs


    def create_vector_store(docs, model, persist_directory):
        """Creates and persists a vector store."""
        if model == "instructor":
            # model is fetched automatically on first run
            embedding_function = HuggingFaceInstructEmbeddings(
                model_name="hkunlp/instructor-xl",
                # mps for Apple M devices, cuda for Nvidia
                model_kwargs={"device": "mps"})
        elif model == 'openai':
            # https://python.langchain.com/docs/integrations/text_embedding/openai
            embedding_function = OpenAIEmbeddings(
                openai_api_key=os.getenv("OPENAI_API_KEY"),
                # https://platform.openai.com/docs/models/embeddings
                model='text-embedding-3-large')
        else:
            # Ollama needs to be installed separately, see https://ollama.com
            # https://python.langchain.com/docs/integrations/text_embedding/ollama
            embedding_function = OllamaEmbeddings(
                # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
                model=model)
        # https://python.langchain.com/docs/integrations/vectorstores/chroma
        db = Chroma.from_documents(docs, embedding_function, persist_directory=persist_directory)
        db.persist()


    def dump_documents(documents):
        """Prints the documents to the console."""
        for doc in documents:
            print(doc)
            print("---")


    def main():
        parser = argparse.ArgumentParser(
            description='CLI tool to import documents into a vector store.')
        # https://docs.python.org/3/library/argparse.html#quick-links-for-add-argument
        parser.add_argument('-m', '--model', required=True,
                            help='Name of the model to use')
        parser.add_argument('-s', '--source-data',
                            help='Path to the folder containing the source data files, defaults to ./data')
        parser.add_argument('-d', '--database',
                            help='Path to the database files, defaults to ./db/<model>')
        parser.add_argument('-v', '--verbose', action='store_true',
                            help='Enable verbose mode')
        parser.add_argument('-dry_run', '--dry-run', action='store_true',
                            help='Just create and print the document chunks instead of the embeddings. Implies -v.')
        args = parser.parse_args()

        model = args.model
        source_data = args.source_data if args.source_data else "./data"
        assert os.path.exists(source_data), "Source data directory does not exist"
        dry_run = args.dry_run
        verbose = args.verbose or dry_run
        db_dir = args.database if args.database else "./db"
        persist_directory = os.path.join(db_dir, model)
        print("Persisting to", persist_directory) if verbose else None

        documents = load_documents(source_data, verbose)
        if dry_run:
            dump_documents(documents)
        else:
            create_vector_store(documents, model, persist_directory)


    if __name__ == "__main__":
        main()

requirements.txt

    argparse
    chromadb
    InstructorEmbedding
    langchain
    langchain_community
    langchain_openai
    sentence_transformers==2.2.2
    torch

Why do I use which LLM?

As you probably noticed, the implementation allows the use of OpenAI models. Note that those do not run locally. I use them for quality comparison only. Over the last months I periodically tried out local models. So far, the limiting factor has been the model for the embeddings. Recently, I got quite good results with the two LLMs named in the overview above: llava for the chat and instructor for the embeddings.

Both are free models. Instructor specializes in embeddings. There is a YouTube video using instructor and mistral which seems to work as well. However, no local model I tried could compete with instructor. Feel free to make your own measurements for your data and use case, and feel free to share them.
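If you want to compare embedding models on your own data, one simple measurement is the hit rate: for a handful of questions whose source page you know, check how often that page shows up among the top results. The sketch below is an assumption about how such a measurement could look, not something from the prototype; `fake_retrieve` and the test questions are made up, and in practice `retrieve` would query your vector store.

```python
from typing import Callable


def hit_rate(
    retrieve: Callable[[str, int], list[str]],
    labelled: list[tuple[str, str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose expected source is among the top-k results."""
    if not labelled:
        return 0.0
    hits = sum(
        1 for question, expected in labelled if expected in retrieve(question, k)
    )
    return hits / len(labelled)


def fake_retrieve(question: str, k: int) -> list[str]:
    """Stand-in retriever returning canned source rankings (made-up data)."""
    rankings = {
        "How do I request vacation?": ["hr/vacation.html", "hr/sick.html"],
        "How do I reset my password?": ["hr/vacation.html", "it/vpn.html"],
    }
    return rankings.get(question, [])[:k]


test_set = [
    ("How do I request vacation?", "hr/vacation.html"),
    ("How do I reset my password?", "it/passwords.html"),
]
score = hit_rate(fake_retrieve, test_set, k=2)
print(score)
```

Running the same test set against stores built with different embedding models gives a rough but comparable number per model.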

Alternative: PrivateGPT

I stumbled across a project called PrivateGPT. It seems to implement exactly this use case as open source: chatting with your own local documents. I think it uses different models. Take a look if you are interested, and feel free to share your experience. I have not tested it yet.

 

Thanks for reading this far. Have fun building your own chatbot.

Appendix

I sometimes have trouble figuring out the library versions when re-implementing examples locally. Here are all direct and indirect dependencies used including the version numbers.

pip freeze -r requirements.txt

    argparse==1.4.0
    chromadb==0.4.24
    InstructorEmbedding==1.0.1
    langchain==0.1.12
    langchain-community==0.0.28
    langchain-openai==0.1.0
    sentence-transformers==2.2.2
    torch==2.2.1
    ## The following requirements were added by pip freeze:
    aiohttp==3.9.3
    aiosignal==1.3.1
    annotated-types==0.6.0
    anyio==4.3.0
    asgiref==3.7.2
    attrs==23.2.0
    backoff==2.2.1
    bcrypt==4.1.2
    build==1.1.1
    cachetools==5.3.3
    certifi==2024.2.2
    charset-normalizer==3.3.2
    chroma-hnswlib==0.7.3
    click==8.1.7
    coloredlogs==15.0.1
    dataclasses-json==0.6.4
    Deprecated==1.2.14
    distro==1.9.0
    fastapi==0.110.0
    filelock==3.13.1
    flatbuffers==24.3.7
    frozenlist==1.4.1
    fsspec==2024.3.0
    google-auth==2.28.2
    googleapis-common-protos==1.63.0
    grpcio==1.62.1
    h11==0.14.0
    httpcore==1.0.4
    httptools==0.6.1
    httpx==0.27.0
    huggingface-hub==0.21.4
    humanfriendly==10.0
    idna==3.6
    importlib-metadata==6.11.0
    importlib_resources==6.3.1
    Jinja2==3.1.3
    joblib==1.3.2
    jsonpatch==1.33
    jsonpointer==2.4
    kubernetes==29.0.0
    langchain-core==0.1.33
    langchain-text-splitters==0.0.1
    langsmith==0.1.27
    MarkupSafe==2.1.5
    marshmallow==3.21.1
    mmh3==4.1.0
    monotonic==1.6
    mpmath==1.3.0
    multidict==6.0.5
    mypy-extensions==1.0.0
    networkx==3.2.1
    nltk==3.8.1
    numpy==1.26.4
    oauthlib==3.2.2
    onnxruntime==1.17.1
    openai==1.14.2
    opentelemetry-api==1.23.0
    opentelemetry-exporter-otlp-proto-common==1.23.0
    opentelemetry-exporter-otlp-proto-grpc==1.23.0
    opentelemetry-instrumentation==0.44b0
    opentelemetry-instrumentation-asgi==0.44b0
    opentelemetry-instrumentation-fastapi==0.44b0
    opentelemetry-proto==1.23.0
    opentelemetry-sdk==1.23.0
    opentelemetry-semantic-conventions==0.44b0
    opentelemetry-util-http==0.44b0
    orjson==3.9.15
    overrides==7.7.0
    packaging==23.2
    pillow==10.2.0
    posthog==3.5.0
    protobuf==4.25.3
    pulsar-client==3.4.0
    pyasn1==0.5.1
    pyasn1-modules==0.3.0
    pydantic==2.6.4
    pydantic_core==2.16.3
    PyPika==0.48.9
    pyproject_hooks==1.0.0
    python-dateutil==2.9.0.post0
    python-dotenv==1.0.1
    PyYAML==6.0.1
    regex==2023.12.25
    requests==2.31.0
    requests-oauthlib==1.4.0
    rsa==4.9
    safetensors==0.4.2
    scikit-learn==1.4.1.post1
    scipy==1.12.0
    sentencepiece==0.2.0
    setuptools==69.2.0
    six==1.16.0
    sniffio==1.3.1
    SQLAlchemy==2.0.28
    starlette==0.36.3
    sympy==1.12
    tenacity==8.2.3
    threadpoolctl==3.4.0
    tiktoken==0.6.0
    tokenizers==0.15.2
    torchvision==0.17.1
    tqdm==4.66.2
    transformers==4.39.1
    typer==0.9.0
    typing-inspect==0.9.0
    typing_extensions==4.10.0
    urllib3==2.2.1
    uvicorn==0.28.0
    uvloop==0.19.0
    watchfiles==0.21.0
    websocket-client==1.7.0
    websockets==12.0
    wrapt==1.16.0
    yarl==1.9.4
    zipp==3.18.1
