In this article, we will explore the SummaryIndex class in LlamaIndex. The SummaryIndex is a data structure that lets you perform RAG on small datasets. It is an alternative to other kinds of indexes, such as the VectorStoreIndex, which is better suited to larger knowledge bases. The advantage of the SummaryIndex is that it does not require a vector database.
Note that this is also a rather old index implementation. It was (unless I am mistaken) the first index implemented in LlamaIndex by Jerry Liu. It is not as efficient as the VectorStoreIndex for large datasets, but it is useful for small ones. Plus, looking at this implementation will help you understand what an index is, and that not all indexes are vector databases.
What is a SummaryIndex?
A good way to think of it is as an array of nodes that represent the text in your document. When you query the SummaryIndex, it goes over each node in the array and returns the ones most relevant to your query.
Here is what a SummaryIndex with 4 nodes looks like:
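As a conceptual sketch (not LlamaIndex's actual internals, which keep node IDs in an index struct and node contents in a docstore), you can picture it as little more than a Python list:

# Conceptual sketch only.
nodes = [
    "Node 1: first chunk of the document...",
    "Node 2: second chunk...",
    "Node 3: third chunk...",
    "Node 4: fourth chunk...",
]

def retrieve(query: str) -> list[str]:
    # The default retriever ignores the query and returns every node.
    return list(nodes)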
Creating a SummaryIndex without embeddings
Querying a SummaryIndex that was created without an embeddings transformation will just return all the nodes in the list. This is useful if all your data fits into the context window of the model you are using.
Let’s install LlamaIndex and Hugging Face’s embeddings:
pip install -Uq llama-index llama-index-embeddings-huggingface
To create a SummaryIndex, you first need to load your data.
from llama_index.core import SimpleDirectoryReader
docs = SimpleDirectoryReader('./data').load_data()
len(docs) # one Document object per page
Now let’s create the SummaryIndex:
from llama_index.core import SummaryIndex
from llama_index.core.retrievers import SummaryIndexRetriever
# Create the index
index = SummaryIndex.from_documents(docs)
# Create a retriever
retriever = SummaryIndexRetriever(index=index)
# Query the index
results = retriever.retrieve("whatever")
Regardless of your query, this will return all the nodes in the SummaryIndex. This is probably not what you want, so let’s add embeddings.
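As a quick sanity check (assuming the index and results from above), the number of retrieved nodes should match the total number of nodes stored in the index:

# Every node comes back, whatever the query says.
print(len(results))
print(len(index.docstore.docs))  # total number of nodes in the index (same value)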
Creating a SummaryIndex with embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

index_w_embeddings = SummaryIndex.from_documents(
    docs,
    transformations=[
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    ],
)
Now each node contains an embedding alongside its text, and we can initialize a SummaryIndexEmbeddingRetriever:
from llama_index.core.indices.list.retrievers import SummaryIndexEmbeddingRetriever

retriever_w_embeddings = SummaryIndexEmbeddingRetriever(
    index=index_w_embeddings,
    embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    similarity_top_k=3,  # default is 1
)
This class requires you to pass the index and the embed_model. The embed_model should be the same model you used to create the SummaryIndex. As you can see, we now have more parameters to play with: the similarity_top_k parameter returns the top k most similar nodes to your query.
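For example, querying with this retriever now returns only the three most similar nodes (the query string below is just an illustration):

# Only the top-3 nodes by embedding similarity come back.
results = retriever_w_embeddings.retrieve("what are the punishments")
for r in results:
    print(r.node.text[:100])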
Why not use a Vector Database
Notice that we are using embeddings, but we are not putting them into a vector database. This is because the SummaryIndex is not designed to be used with a vector database; it is designed for small document collections.
The SummaryIndex data structure simply runs a for loop over all the nodes in the list, comparing each node's embedding to the embedding of the query, and returning the k most similar nodes. This is not efficient for large datasets, but it is perfectly adequate for small ones.
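Here is a minimal sketch of that brute-force search using NumPy and cosine similarity; it mirrors the idea, not LlamaIndex's exact code:

import numpy as np

def top_k_nodes(query_emb: np.ndarray, node_embs: np.ndarray, k: int = 3) -> np.ndarray:
    # Normalize so that dot products become cosine similarities.
    query_emb = query_emb / np.linalg.norm(query_emb)
    node_embs = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    scores = node_embs @ query_emb  # one dot product per node, O(n) overall
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar nodes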
Using an LLM as a retriever
You can also use an LLM as a retriever. This was one of the early ideas for augmenting the knowledge of an LLM without retraining or fine-tuning it, even before the common RAG pipeline was introduced.
This system involves looping over all the nodes in the SummaryIndex, passing them to the LLM together with the query, and having the LLM return a relevance score for each node.
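To make that concrete, the prompt sent to the LLM looks roughly like the sketch below; the wording and the node_texts/query placeholders are mine, not the library's exact template:

# Paraphrased sketch of the node-scoring prompt; placeholders are hypothetical.
node_texts = ["...text of node 1...", "...text of node 2..."]
query = "what are the punishments"

numbered = "\n\n".join(f"Document {i + 1}:\n{t}" for i, t in enumerate(node_texts))
prompt = (
    "A list of documents is shown below, each with a number and its text.\n"
    "Answer with the numbers of the documents relevant to the question, "
    "along with a relevance score from 1-10.\n\n"
    f"{numbered}\n\nQuestion: {query}\n"
)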
Here is what the implementation looks like:
I’ll use Groq.
pip install -Uq llama-index-llms-groq
import os
import getpass
os.environ['GROQ_API_KEY'] = getpass.getpass('Groq API Key:')
I am assuming you already have the index_w_embeddings variable from the previous example.
from llama_index.llms.groq import Groq

retriever_llm = index_w_embeddings.as_retriever(
    retriever_mode="llm",
    llm=Groq(model="deepseek-r1-distill-qwen-32b", api_key=os.environ["GROQ_API_KEY"]),
    verbose=True,
)

results = retriever_llm.retrieve("what are the punishments")
Let’s take a look at the results:
for r in results:
    print("Node contents (excerpt):\n", r.node.text[:200].replace("\t", " "))
    print("Score: ", r.score)
    print("\n" + "-" * 80 + "\n")
Node contents (excerpt):
Company-Sanctioned Social Media Accounts
The Umbrella Corporation maintains official social media accounts to share company news, research updates, and engage with the
public. These accounts are manag
Score: 7.0
--------------------------------------------------------------------------------
Node contents (excerpt):
The Umbrella Corporation's rich history is a testament to its commitment to innovation, discovery, and pushing the boundaries of human
knowledge. From humble beginnings to global recognition, our comp
Score: 4.0
--------------------------------------------------------------------------------
Node contents (excerpt):
* Subsection 2.2: Workplace Etiquette and Professionalism
Subsection 2.2: Workplace Etiquette and Professionalism
As a leading player in the pharmaceutical, biotechnology, and genetic engineering indu
Score: 3.0
--------------------------------------------------------------------------------
Node contents (excerpt):
 1. Confidential: This includes sensitive information that could cause harm to the company or its stakeholders if disclosed, such as trade secrets, proprietary research data, and confidential busine
Score: 9.0
--------------------------------------------------------------------------------
Node contents (excerpt):
passwords that are difficult to guess or crack. The following guidelines must be adhered to:
Passwords must be a minimum of 12 characters in length.
Passwords must contain a mix of uppercase and lower
Score: 8.0
As you can see, the LLM scores each node and returns the ones most relevant to the query. This is a very powerful way to augment the knowledge of an LLM without having to retrain it.
Conclusion
In this article, we explored the SummaryIndex class in LlamaIndex. We learned how to create a SummaryIndex with and without embeddings, and how to query it. We also learned why the SummaryIndex is not suitable for large datasets.