I have been trying to solve the following task for some time now: I want a chat bot that answers questions about our internal wiki. Of course, the implementation is based on LLMs (Large Language Models).
The task comes with two constraints that I kept failing at:
First, our internal wiki is rather large. I can't just put everything into one big prompt with the question underneath.
Second, the data is confidential. It must not leave our servers. Hence, we cannot use third-party APIs such as OpenAI. Instead, everything has to run locally.
Finally, I built a prototype that yields decent results. In this article, I present the code and the thoughts behind it. It is meant as an introduction to the topic, so I may oversimplify in some places.
[Image: Your personal AI assistant that you can feed your documents for analysis. (AI generated)]
How does the chat bot work in principle?
First, let's see how the whole thing works. Training our own LLM on the data from our wiki is out of the question. It is too complex and expensive.
Instead, we take already trained LLMs and somehow feed our wiki data to them. For ease of understanding, I start with the naive approach, and then refine it into a functional system. Here comes the naive approach:
We take the question from the user and prepend all our wiki content. Then we also add the entire chat history, send the resulting mega prompt to our LLM, and get an answer. It doesn't work like that.
First, the prompt is far too big. LLMs have an input limit (the context window) that depends on the model. But the answer would probably be bad anyway, because the prompt includes a lot of irrelevant information. This usually has a negative effect on the quality of the answer (so I heard, no link at hand).
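To get a feeling for the size problem, here is a minimal sketch that estimates how large the naive prompt would be. The rule of thumb of roughly four characters per token, the limit of 4096 tokens, and the made-up wiki pages are all assumptions for illustration; real tokenizers and models differ.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def fits_into_context(prompt: str, context_limit: int) -> bool:
    """Returns True if the estimated prompt size stays within the input limit."""
    return estimate_tokens(prompt) <= context_limit


# Hypothetical numbers, for illustration only: 500 pages of ~4800 characters
wiki_pages = ["Some wiki page content. " * 200] * 500
question = "How do I request a new laptop?"
naive_prompt = "\n\n".join(wiki_pages) + "\n\n" + question

print("Estimated tokens:", estimate_tokens(naive_prompt))
print("Fits into 4096 tokens:", fits_into_context(naive_prompt, context_limit=4096))
```

Even this small, made-up wiki overshoots the assumed limit by orders of magnitude, while the question alone fits easily.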
Interestingly, shrinking the prompt solves both issues. Let's stick to the naive idea, but with two changes:
Only include a summary of the chat history.
Only include parts of the wiki relevant to the current question.
Then we end up with a prompt that contains a summary of the chat history, the relevant wiki parts that contain the information sought in the question, and the question itself. And yes: this works!
However, the prompt to the LLM now has to include the answer already, namely the relevant parts of the wiki. It sounds like we are back at the start: finding the answer to the question in the wiki. Thankfully not.
Summarizing the chat history
The short answer first: we use our LLM to summarize the chat history. To be more precise, we let it generate a new, standalone question from the chat history and the original question.
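A minimal sketch of this step could look as follows. The wording of the template and the function name condense_question are my own illustration, not LangChain's exact prompt, and llm stands for any callable that takes a prompt string and returns the generated text.

```python
from typing import Callable

# Illustrative template (assumption), not the exact prompt LangChain uses
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow-up question, "
    "rephrase the follow-up question to be a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)


def condense_question(llm: Callable[[str], str],
                      history: list[tuple[str, str]],
                      question: str) -> str:
    """Asks the LLM to merge chat history and question into one question."""
    if not history:
        # Nothing to summarize: the question already stands on its own
        return question
    lines = [f"Human: {q}\nAssistant: {a}" for q, a in history]
    prompt = CONDENSE_TEMPLATE.format(history="\n".join(lines), question=question)
    return llm(prompt).strip()
```

With a history like [("Do we have a VPN?", "Yes.")] and the follow-up "And the address?", a capable model should return something like "What is the address of the VPN?", which can then be used for the search and for the final prompt.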
Overview: From question to answer
I explain how to find the relevant parts of the wiki below. With such a system in place, we can answer a chat question as follows:
Submit a question about information in the wiki
Use LLM X (e.g. llava) to create a new question from the original question and the chat history
Use LLM Y (e.g. instructor) to find related wiki parts (explained below)
Create a prompt including the new question and the related wiki parts
Use LLM X to generate an answer
Append the original question and the answer to the history
This scheme is so common that LangChain provides a ready-to-use implementation: the ConversationalRetrievalChain. See their documentation or the code in this article for more details.
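To make the flow tangible without any framework, the steps above can be sketched as one function. This is not how LangChain implements it internally; condense, retrieve and generate are hypothetical callables standing in for the two LLMs and the vector store.

```python
from typing import Callable

History = list[tuple[str, str]]


def answer(question: str,
           history: History,
           condense: Callable[[History, str], str],
           retrieve: Callable[[str], list[str]],
           generate: Callable[[str], str]) -> str:
    """Runs one chat turn: condense, retrieve, build prompt, generate, remember."""
    # New question from the original question and the chat history (LLM X)
    standalone = condense(history, question)
    # Related wiki parts for the new question (LLM Y plus vector store)
    wiki_parts = retrieve(standalone)
    # Prompt made of the related wiki parts and the new question
    prompt = (
        "Use the following context to answer the question.\n\n"
        + "\n\n".join(wiki_parts)
        + f"\n\nQuestion: {standalone}\nAnswer:"
    )
    # Answer generated by LLM X
    result = generate(prompt)
    # Remember this turn for the next question
    history.append((question, result))
    return result
```

Passing the three steps in as functions keeps the sketch independent of any particular model, which is also what makes swapping models for comparison easy.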
Finding relevant wiki parts
Now to the interesting part: embeddings. Embeddings are another kind of output that LLMs can produce. We use them to build a large search database in which we store lots and lots of wiki parts as text blocks, so that we can find the relevant content when a question is asked.
We don't train a model with the knowledge from our wiki, but we use a model to sort the text blocks by similarity, so to speak, such that similar content is usually "close together". When a question is asked, we sort this question into the same similarity space and then look for relevant text blocks nearby. That is the basic idea, anyway.
The embeddings are high-dimensional vectors, i.e. lists of floats. The dimension depends on the model, and the training of the model determines the embeddings. Now we have a lot of vectors for a lot of text blocks, with the property that text blocks with similar content are usually close to each other in this vector space. Depending on which LLM you use for this, the results are of varying quality. Looking back, finding an open, local LLM that provides embeddings fitting my use case has been the hardest part of this project.
Nonetheless, the embeddings are absolutely crucial: if I'm not able to find the relevant text blocks in my wiki, then I can't generate a useful answer, no matter how good my chat LLM is. The information just isn't available.
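What "close together" means can be shown with a few lines of plain Python. The sketch below uses cosine similarity, which is one common choice (a vector store may use a different distance measure), and made-up three-dimensional vectors instead of real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: list[float],
                 blocks: dict[str, list[float]],
                 k: int = 2) -> list[str]:
    """Returns the k text blocks whose vectors are most similar to the query."""
    ranked = sorted(blocks,
                    key=lambda text: cosine_similarity(query, blocks[text]),
                    reverse=True)
    return ranked[:k]


# Made-up vectors; real embeddings have hundreds of dimensions
blocks = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Password policy": [0.8, 0.3, 0.1],
    "Cafeteria opening hours": [0.0, 0.2, 0.9],
}
question_vector = [0.85, 0.15, 0.05]  # pretend embedding of a password question
print(most_similar(question_vector, blocks))
```

The two password-related blocks rank first, while the cafeteria block points in a very different direction and is dropped.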
Overview: From wiki content to embeddings
Extract the textual content from the source data (wiki)
Split the text into overlapping text blocks of suitable size (depends on your wiki)
Use LLM Y (e.g. instructor) to generate embeddings
Store those embeddings for later use
The Vector Store
We store the embeddings along with their text blocks and metadata, i.e. the source of the text block, in a specialized database called a vector store. In the code example I use ChromaDB, but there are other implementations as well. In the code example below, I assume that we have a folder containing the wiki content. Your file types may differ. In our case it is a bunch of HTML files.
The first thing we're going to do is go over this directory and extract the textual content from the documents. The content is, so to speak, a text block without formatting. Now we have one long text block per file along with some metadata, i.e. the source file. The metadata becomes important if we want to include sources in the chat answers. However, the text blocks are usually much too long.
Hence, we split them up again. You can experiment a bit to see what block length and what overlap delivers the best results for your own use case.
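The prototype leaves both preparation steps to LangChain (DirectoryLoader and RecursiveCharacterTextSplitter). To show what happens in principle, here is a simplified sketch using only the standard library. The class and function names are my own, and the splitter cuts at fixed character positions, whereas LangChain's splitter tries to cut at paragraph and sentence boundaries.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the content of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_document(html: str, source: str) -> dict:
    """Turns one HTML page into a text block plus metadata."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.parts), "metadata": {"source": source}}


def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cuts text into chunks of at most chunk_size characters that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence cut in half at a chunk border still appears complete in one of the two neighboring chunks. Each chunk should keep the metadata of the page it came from.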
Then we use the large language model of our choice. I use instructor, which specializes in embeddings. At least among the open local models I tried, it gives the best results. We use this model to create the points in our embedding vector space: one point for each text block. The points are represented by vectors. Once we have them, we store them in our vector store.
This is the full import of our wiki data. Note that we process each document individually. This means that if we are able to track changes, which is of course no problem in our wiki, we could carry out partial imports of only the updated or new wiki pages. Not only does this limit the usage of resources, but users also no longer need to wait for nightly/hourly/… batch imports to see the latest data in the chat. But that is not implemented in the prototype – just a side note.
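As a rough sketch of how such change tracking could work on a folder of exported files: remember a hash per file and re-import only files whose hash differs. This is not part of the prototype. The state file and the function names are hypothetical, and removing the outdated text blocks from the vector store, as well as handling deleted pages, is left out.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """SHA-256 of the file content, used to detect changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed_files(data_dir: Path, state_file: Path) -> list[Path]:
    """Returns new or changed HTML files and updates the state file."""
    old_state = json.loads(state_file.read_text()) if state_file.exists() else {}
    new_state = {}
    changed = []
    for path in sorted(data_dir.rglob("*.html")):
        digest = file_hash(path)
        new_state[str(path)] = digest
        if old_state.get(str(path)) != digest:
            changed.append(path)
    # Remember the hashes for the next run
    state_file.write_text(json.dumps(new_state, indent=2))
    return changed
```

Only the returned files would then need to go through extraction, splitting and embedding again.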
How to write it in Python?
So much for the ideas behind the implementation. Now of course the big question: how do you write it? Below are a few code examples from a prototype. They are of course not suitable for production use, but you can clearly see which components are called and how they interact. Feel free to get inspiration from them.
The implementation is based on LangChain, a Python framework. It helps to implement LLM-specific features and supports various models. For example, the ConversationalRetrievalChain implements summarizing a chat history and question, finding relevant information in a vector store, and generating an answer.
I have added comments and links to the documentation of the individual components almost everywhere in the code. So you can take a closer look.
So finally, here comes the code. One file is for the data import into the vector store, the other one for querying it. I built the code such that it is easy to switch models for testing and comparing the results.
chat.py
import argparse
import os

import langchain
from langchain.cache import InMemoryCache
from langchain.chains import ConversationalRetrievalChain
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain_core.vectorstores import VectorStoreRetriever
from langchain_openai import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceInstructEmbeddings


def load_vector_store(model, persist_directory):
    """Loads the vector store from the given directory."""
    if model == "instructor":
        # model is fetched automatically on first run
        embedding_function = HuggingFaceInstructEmbeddings(
            model_name="hkunlp/instructor-xl",
            # mps for Apple M devices, cuda for Nvidia
            model_kwargs={"device": "mps"})
    elif model == 'openai':
        # https://python.langchain.com/docs/integrations/text_embedding/openai
        embedding_function = OpenAIEmbeddings(
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            # https://platform.openai.com/docs/models/embeddings
            model='text-embedding-3-large')
    else:
        # Ollama needs to be installed separately, see https://ollama.com
        # https://python.langchain.com/docs/integrations/text_embedding/ollama
        embedding_function = OllamaEmbeddings(
            # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
            model=model)
    # https://python.langchain.com/docs/integrations/vectorstores/chroma
    return Chroma(persist_directory=persist_directory, embedding_function=embedding_function)


def create_llm(model):
    """Creates a new LLM instance."""
    langchain.llm_cache = InMemoryCache()
    if model == 'openai':
        # https://python.langchain.com/docs/integrations/llms/openai
        return OpenAI(
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            model='gpt-3.5-turbo-instruct')
    else:
        # Ollama needs to be installed separately, see https://ollama.com
        # https://python.langchain.com/docs/integrations/llms/ollama
        return Ollama(
            # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
            model=model)


def create_qa_chain(vector_store, llm, verbose):
    """Creates a new Question-Answer chain."""
    # https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html
    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        # https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStoreRetriever.html
        retriever=VectorStoreRetriever(vectorstore=vector_store),
        verbose=verbose)


history = []


def answer_question(qa_chain, question, say):
    """Answers the given question and adds it to the history."""
    answer = qa_chain.invoke({"question": question, "chat_history": history})['answer']
    say(f"{answer}")
    history.append((question, answer))


def question_and_answer(qa_chain):
    print("\nPlease enter question:")
    question = input("")
    print("\nAnswer:")
    answer_question(qa_chain, question, print)


def main():
    parser = argparse.ArgumentParser(
        description='CLI tool for interactive or scripted LLM chat')
    # https://docs.python.org/3/library/argparse.html#quick-links-for-add-argument
    parser.add_argument('-m', '--model', required=True,
                        help='Name of the model to use')
    parser.add_argument('-e', '--embeddings-model',
                        help='Model to use for the embeddings, defaults to <model>')
    parser.add_argument('-d', '--database',
                        help='Source for the database files, defaults to ../import/db/<model>')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='Enable verbose mode')
    parser.add_argument('-p', '--prompt', action='append',
                        help='One or more prompts to use instead of interactive mode')
    args = parser.parse_args()

    model = args.model
    embeddings_model = args.embeddings_model if args.embeddings_model else model
    persist_directory = args.database if args.database else os.path.join(
        os.getenv("DB_DIR", default="../import/db"),
        embeddings_model)
    assert os.path.exists(persist_directory)
    verbose = args.verbose
    prompts = args.prompt

    vector_store = load_vector_store(embeddings_model, persist_directory)
    llm = create_llm(model)
    qa_chain = create_qa_chain(vector_store, llm, verbose)
    if prompts:
        for prompt in prompts:
            answer_question(qa_chain, prompt, print)
    else:
        while (True):
            question_and_answer(qa_chain)


if __name__ == "__main__":
    main()
import.py
import argparse
import os

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_documents(data_dir, verbose):
    """Loads all documents from the data directory."""
    print(f"Loading documents from {data_dir}") if verbose else None
    loader = DirectoryLoader(data_dir)
    docs = loader.load()
    print(f"Loaded {len(docs)} documents, eg") if verbose else None
    print(docs[0]) if verbose else None

    # Split the documents into chunks with an overlap
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50)
    docs = text_splitter.split_documents(docs)
    print(f"Split into {len(docs)} chunks") if verbose else None

    # ChromaDB cannot handle lists in metadata,
    # but only str, int, float or bool
    for doc in docs:
        metadata = doc.metadata
        for key in metadata:
            if isinstance(metadata[key], list):
                if key == "source":
                    metadata[key] = metadata[key][0]
                else:
                    metadata[key] = str(metadata[key])
    return docs


def create_vector_store(docs, model, persist_directory):
    """Creates and persists a vector store."""
    if model == "instructor":
        # model is fetched automatically on first run
        embedding_function = HuggingFaceInstructEmbeddings(
            model_name="hkunlp/instructor-xl",
            # mps for Apple M devices, cuda for Nvidia
            model_kwargs={"device": "mps"})
    elif model == 'openai':
        # https://python.langchain.com/docs/integrations/text_embedding/openai
        embedding_function = OpenAIEmbeddings(
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            # https://platform.openai.com/docs/models/embeddings
            model='text-embedding-3-large')
    else:
        # Ollama needs to be installed separately, see https://ollama.com
        # https://python.langchain.com/docs/integrations/text_embedding/ollama
        embedding_function = OllamaEmbeddings(
            # https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.ollama.OllamaEmbeddings.html
            model=model)
    # https://python.langchain.com/docs/integrations/vectorstores/chroma
    db = Chroma.from_documents(docs, embedding_function, persist_directory=persist_directory)
    db.persist()


def dump_documents(documents):
    """Prints the documents to the console."""
    for doc in documents:
        print(doc)
        print("---")


def main():
    parser = argparse.ArgumentParser(
        description='CLI tool to import documents into a vector store.')
    # https://docs.python.org/3/library/argparse.html#quick-links-for-add-argument
    parser.add_argument('-m', '--model', required=True,
                        help='Name of the model to use')
    parser.add_argument('-s', '--source-data',
                        help='Path to the folder containing the source data files, defaults to ./data')
    parser.add_argument('-d', '--database',
                        help='Path to the database files, defaults to ./db/<model>')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='Enable verbose mode')
    parser.add_argument('-dry_run', '--dry-run', action='store_true',
                        help='Just create and print the document chunks instead of the embeddings. Implies -v.')
    args = parser.parse_args()

    model = args.model
    source_data = args.source_data if args.source_data else "./data"
    assert os.path.exists(source_data), "Source data directory does not exist"
    dry_run = args.dry_run
    verbose = args.verbose or dry_run
    db_dir = args.database if args.database else "./db"
    persist_directory = os.path.join(db_dir, model)
    print("Persisting to", persist_directory) if verbose else None

    documents = load_documents(source_data, verbose)
    if dry_run:
        dump_documents(documents)
    else:
        create_vector_store(documents, model, persist_directory)


if __name__ == "__main__":
    main()
As you probably noticed, the implementation allows the use of OpenAI models. Note that those do not run locally. I use them for quality comparison only. Over the last few months, I periodically tried out local models. So far, the limiting factor has been the model for the embeddings. Recently, I got quite good results with the following LLMs:
Both are free models. Instructor specializes in embeddings. There is a YouTube video using instructor and mistral, which seems to work as well. However, no other local embeddings model I tried could compete with instructor. Feel free to make your own measurements for your data and use case – and feel free to share them.
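If you want to compare embedding models on your own data, one simple measurement is the hit rate: for a handful of questions where you know the wiki page that contains the answer, check how often that page shows up among the retrieved sources. The sketch below is an assumption of how this could be set up, not what I used for my comparison; retrieve is a hypothetical callable that returns the sources of the text blocks found for a question.

```python
from typing import Callable


def hit_rate(retrieve: Callable[[str], list[str]],
             test_cases: list[tuple[str, str]]) -> float:
    """Share of questions whose expected source is among the retrieved sources."""
    if not test_cases:
        return 0.0
    hits = sum(1 for question, expected in test_cases
               if expected in retrieve(question))
    return hits / len(test_cases)
```

Running the same test cases against vector stores built with different embedding models gives one number per model that can be compared directly.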
Alternative: PrivateGPT
I stumbled across a project called PrivateGPT. They seem to implement exactly this use case as open source: chatting with your own local documents. They use different models, I think. Take a look if you are interested and feel free to share your experience. I have not tested it yet.
Resources
In my search for an implementation and fitting models, I naturally also looked for blog posts and watched or listened to videos. Here are some of them.
Thanks for reading this far. Have fun building your own chatbot.
Appendix
I sometimes have trouble figuring out the library versions when re-implementing examples locally. Here are all direct and indirect dependencies used including the version numbers.