RAG Evaluation Series: Validating the RAG Performance of OpenAI vs LlamaIndex
This is the second in a multi-part series I am doing to evaluate various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform, and the open source tool tvalmetrics. All the code and data used in this article are available here. I’ll be back shortly with another comparison of more RAG tools!
Introduction
Last week, we tested out OpenAI’s Assistants API and discovered there are some major problems with its ability to handle multiple documents. However, to better assess its performance, I am going to compare OpenAI’s Assistants RAG to another popular open source RAG library, LlamaIndex. Let’s get started!
Testing OpenAI’s Assistants RAG
In the previous article, we already set up OpenAI’s Assistants RAG. You can view the original setup here. Here’s a refresher:
Our testing set utilized 212 Paul Graham essays. Initially, we tried to upload all 212 essays to the RAG system, but discovered that OpenAI’s RAG caps uploads at 20 documents. To get around this, we split the 212 essays into five groups and created a single file for each group, giving us five files. We used five files instead of 20 because we ran into reliability issues at higher document counts that prevented us from running any tests on a 20-file set. With the five-file setup, we got the following results using our open source RAG benchmarking library, tvalmetrics:
In our package, a score of 0 indicates no similarity and 5 indicates perfect similarity. The results above are less than desirable, with a mean similarity score of 2.41 and a median of 2. The low scores stem from OpenAI’s RAG system failing to find the relevant text in the documents, in which case it returns no answer at all.
However, we then tried combining all 212 essays into a single document and the performance dramatically improved.
Not only did the mean similarity score jump to 4.16, but the median rose to a perfect 5.0. The response speed also dramatically improved: instead of taking an hour or more, the test ran in minutes. This shows that OpenAI’s RAG system is capable of good scores under specific conditions, such as working with a single document. While we did achieve a high multi-document score once, it was a fleeting success that we were unable to replicate.
Testing LlamaIndex
Now that we’ve recapped the OpenAI results, let’s assess LlamaIndex.
Preparing the Experiment
To keep things fair for OpenAI, we will still use the Assistants API with GPT-4 Turbo as the underlying LLM, in case the Assistants API performs differently than the regular API. However, I will turn off its built-in file retrieval, so retrieval is handled entirely by LlamaIndex. I will also use the same experimental conditions as in my OpenAI evaluation: first, I will test the tool using five combined documents to get a baseline for how LlamaIndex performs with multiple documents, then repeat the test using a single combined document. To set up LlamaIndex for this test, I ran the following code:
from typing import Any

from llama_index.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback
from openai.types.beta.assistant import Assistant


# Wraps an OpenAI assistant (with file retrieval turned off) as a LlamaIndex
# CustomLLM, so LlamaIndex handles retrieval and the assistant handles generation.
class AssistantLLM(CustomLLM):
    llama_assistant: Assistant

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=128000,
            num_output=4096,
            model_name="gpt-4-1106-preview",
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # get_response is the helper from the previous article: it sends the
        # prompt to the assistant in a thread and returns the reply text.
        response = get_response(prompt, self.llama_assistant)
        return CompletionResponse(text=response)

    @llm_completion_callback()
    def stream_complete(
        self, prompt: str, **kwargs: Any
    ) -> CompletionResponseGen:
        # Streaming isn't needed for this benchmark.
        raise NotImplementedError()
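The get_response helper comes from the previous article’s setup. For readers who haven’t seen that post, here is a rough sketch of what such a helper might look like with the openai Python SDK; this is an illustration rather than the original code, it assumes the same module-level client used elsewhere, and it omits error handling:

import time


def get_response(prompt: str, assistant: Assistant) -> str:
    # Create a fresh thread for the prompt and add the user's message to it.
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(thread_id=thread.id, role="user", content=prompt)
    # Run the assistant on the thread and poll until the run finishes.
    run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
    while run.status not in ("completed", "failed", "cancelled", "expired"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    # The most recent message in the thread is the assistant's reply.
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value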
To ingest the documents, I ran this code:
from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader


def setup_assistant(file_count):
    # combine_essays (from the previous article's setup) writes the essays into
    # `file_count` combined files inside the combined_essays directory.
    combined_essays = combine_essays(file_count)
    # LlamaIndex already injects into the prompt that the chatbot should not use
    # external knowledge, so we don't need to specify it here. Note that no tools
    # are passed, so the assistant's built-in file retrieval stays off.
    llama_assistant = client.beta.assistants.create(
        name=f"LLamaIndex Rag Test {file_count} Files",
        instructions=(
            "You are a chatbot that answers questions about Paul Graham's essays. "
        ),
        model="gpt-4-1106-preview",
    )
    service_context = ServiceContext.from_defaults(
        llm=AssistantLLM(llama_assistant=llama_assistant)
    )
    documents = SimpleDirectoryReader("combined_essays").load_data()
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context, show_progress=True
    )
    query_engine = index.as_query_engine(service_context=service_context)
    return query_engine
To see how I created the single- and multiple-document sets and the question-answer pairs used as the test set, take a look at my write-up in the last blog post, here.
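The combine_essays helper used in the ingestion code above also comes from that post; conceptually, it just concatenates the 212 essays into file_count combined files inside the combined_essays directory. A hypothetical sketch, assuming the individual essays live as .txt files in an essays/ directory:

import os


def combine_essays(file_count: int) -> list[str]:
    # Hypothetical reimplementation: split the essays into `file_count` groups
    # and write each group out as a single combined file.
    # (In practice you would also clear out stale combined files between runs.)
    essay_files = sorted(os.listdir("essays"))
    os.makedirs("combined_essays", exist_ok=True)
    group_size = -(-len(essay_files) // file_count)  # ceiling division
    combined_paths = []
    for i in range(file_count):
        group = essay_files[i * group_size:(i + 1) * group_size]
        combined_path = os.path.join("combined_essays", f"combined_{i}.txt")
        with open(combined_path, "w") as out:
            for name in group:
                with open(os.path.join("essays", name)) as f:
                    out.write(f.read() + "\n\n")
        combined_paths.append(combined_path)
    return combined_paths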
Spot Checking the Setup
Finally, we can do a spot check to see how LlamaIndex performs.
# setup_assistant(1) builds a query engine over the single combined document; pass 5 for the five-document setup.
query_engine = setup_assistant(1)


def get_llama_response(prompt):
    response = query_engine.query(prompt)
    # Return the answer along with the retrieved chunks so we can inspect the context.
    context = [x.text for x in response.source_nodes]
    return (response.response, context)


llama_response = get_llama_response("What was Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator?")
print(llama_response[0])
print(llama_response[1])
For both the five-document setup and the single-document setup, the LLM returned the same correct response:
Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator was $4000 a month.
Unlike OpenAI’s system, LlamaIndex responded correctly to my query while using multiple documents, and the answer stayed consistent each time I asked the question. Notably, when using a single document, the performance was about the same between the two. OpenAI’s answer is a bit more descriptive, but it also still hallucinates the source count. So, I lean towards preferring LlamaIndex myself.
Evaluating the RAG system
Now, let's run the scoring using tvalmetrics, an open source library we created at Tonic.ai for measuring the quality of LLM responses. I used the following code to benchmark LlamaIndex’s responses:
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

# question_list, answer_list, and score_calculator come from the test setup
# described in the previous article.
# Use a ThreadPoolExecutor to answer the questions in parallel.
with ThreadPoolExecutor(max_workers=10) as executor:
    # Map get_llama_response over each question in the list
    results = list(tqdm(executor.map(get_llama_response, question_list), total=len(question_list)))

# Unpack the (answer, context) pairs into separate lists
llama_responses, llama_context = zip(*results)

llama_batch_scores = score_calculator.score_batch(
    question_list=question_list,
    reference_answer_list=answer_list,
    llm_answer_list=list(llama_responses),
)
llama_scores_df = llama_batch_scores.to_dataframe()
# Drop the overall_score column since we are only using the answer similarity score
llama_scores_df = llama_scores_df.drop(columns=['overall_score'])
llama_scores_df.describe()
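The score_calculator here is the same one used in the previous article’s evaluation. In tvalmetrics it is built from an evaluator LLM, roughly like the following; treat the exact constructor as an assumption and check the tvalmetrics README for your installed version:

from tvalmetrics import RagScoresCalculator

# GPT-4 acts as the evaluator LLM that grades answer similarity on the 0-5 scale.
score_calculator = RagScoresCalculator("gpt-4")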
Using the five-document setup, I got the following results:
Using multiple documents, LlamaIndex performed much better than OpenAI. Its mean similarity score was ~3.8, a solid if not perfect result, and its median was 5.0, which is excellent. The run time was only seven minutes for the five documents, compared with almost an hour for OpenAI’s system using the same setup. I also noticed that the LlamaIndex system was dramatically less prone to crashing than OpenAI’s system, which suggests that OpenAI’s reliability problems come from its retrieval system rather than from the Assistants API itself (which we still used as the LLM here).
However, for the single-document setup, LlamaIndex performs slightly worse than OpenAI’s RAG system, with a mean similarity score of 3.7 (vs OpenAI’s 4.16) and a median of 4.0 (vs 5.0).
I should note that I got slightly better results by tweaking some of the LlamaIndex parameters: changing the chunk size to 80 tokens, the chunk overlap to 60 tokens, and the number of retrieved chunks to 12, and using LlamaIndex’s hybrid search option. Doing so yielded results that were closer to OpenAI’s, but not quite as good:
While the performance between the two systems is now closer, OpenAI still edges ahead on single documents. Keep in mind that the settings I used are tuned towards the type of questions I am asking (i.e., short questions whose answers are stated plainly in the text). These settings won’t work in all scenarios, whereas OpenAI managed to achieve good performance with settings that should, in theory, work in any situation. That said, because OpenAI doesn’t allow much customization of its settings, you could be leaving performance on the table. In the end, it’s up to you whether you want more customization, with the ability to get better results in certain scenarios, or a general out-of-the-box tool that can achieve decent performance (...on single documents only, that is).
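For reference, here is roughly how those chunking and retrieval tweaks map onto the setup_assistant code above. This is a sketch using the same legacy ServiceContext API; the hybrid search option additionally requires a vector store that supports it, so it isn’t shown here:

service_context = ServiceContext.from_defaults(
    llm=AssistantLLM(llama_assistant=llama_assistant),
    chunk_size=80,     # smaller chunks than the default
    chunk_overlap=60,  # heavy overlap between neighboring chunks
)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)
# Retrieve 12 chunks per query instead of the default 2.
query_engine = index.as_query_engine(
    service_context=service_context, similarity_top_k=12
)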
Conclusion
OpenAI's RAG system does seem promising, but its performance issues with multiple documents significantly decrease its usefulness. The system can function well on single documents; however, most people will probably want to operate their RAG system on a corpus of different documents. While they could stuff all of their documents into a single, monolithic file, that’s a hack and shouldn’t be required for a well-functioning, approachable RAG system. That, coupled with the 20-file limit, makes me hesitant to recommend anyone replace their existing RAG pipeline with OpenAI's RAG anytime soon. However, as I’ve said before, there is potential for improvement for OpenAI. While running some spot checks on their RAG system for GPTs, I noticed that performance on multiple documents was much better there; the poor multi-document performance appears to be limited to the Assistants API itself. If OpenAI brought the Assistants API’s quality up to that of the GPTs and removed the file limit, then I could see companies considering migrating from LlamaIndex, provided they are willing to give up some of the customizability that LlamaIndex provides. Until that day comes, however, I recommend using LlamaIndex.
All the code and data used in this article are available here.