
LlamaIndex: What is SummaryIndex?

·5 mins
Alejandro AO
I’m a software engineer building AI applications. I publish weekly video tutorials where I show you how to build real-world projects. Feel free to visit my YouTube channel or Discord and join the community.

In this article, we will explore the SummaryIndex class in LlamaIndex. The SummaryIndex class is a data structure that allows you to perform RAG on small datasets. It is an alternative to other kinds of indexes, such as the VectorStoreIndex, which is more suitable for larger knowledge bases. The advantage of the SummaryIndex is that it does not require a vector database.

Note that this is also a rather old index implementation. It was (unless I am mistaken) the first index implemented in LlamaIndex by Jerry. It is not as efficient as the VectorStoreIndex for large datasets, but it is useful for small ones. Plus, looking at this implementation will help you understand what an index is, and see that not all indexes are vector databases.

What is a SummaryIndex?
#

A good way to think of it is as an array of nodes that represent the text in your document. When you query the SummaryIndex, it will go over each node in the array and return the ones that are most relevant to your query.

Here is what a SummaryIndex looks like if it contained 4 nodes:

Node 1 → Node 2 → Node 3 → Node 4
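To make that mental model concrete, here is a minimal pure-Python sketch of the idea. This is illustrative code, not LlamaIndex's actual classes: the index really is just an ordered list of nodes, and without embeddings the retriever has nothing to rank by, so it returns everything.

```python
class Node:
    def __init__(self, text: str):
        self.text = text


class TinySummaryIndex:
    """Toy stand-in for SummaryIndex: an array of nodes, nothing more."""

    def __init__(self, nodes):
        self.nodes = nodes  # the index is just an ordered list

    def retrieve(self, query: str):
        # With no embeddings there is nothing to score against,
        # so every node comes back regardless of the query.
        return list(self.nodes)


index = TinySummaryIndex([Node(f"Node {i}") for i in range(1, 5)])
print(len(index.retrieve("anything")))  # 4
```

Keep this picture in mind for the rest of the article: everything we add (embeddings, an LLM scorer) is a way of ranking the items in this list.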

Creating a SummaryIndex without embeddings
#

If you create a SummaryIndex without passing an embedding transformation, querying it will simply return all the nodes in the list. This is useful if all your data fits into the context window of the model you are using.

Let’s install LlamaIndex and Hugging Face’s embeddings:

pip install -Uq llama-index llama-index-embeddings-huggingface

To create a SummaryIndex, you need to first load your data.

from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader('./data').load_data()

len(docs) # one Document object per page

Now let’s create the SummaryIndex:

from llama_index.core import SummaryIndex
from llama_index.core.retrievers import SummaryIndexRetriever

# Create the index
index = SummaryIndex.from_documents(docs)

# Create a retriever
retriever = SummaryIndexRetriever(index=index)

# Query the index
results = retriever.retrieve("whatever")

Regardless of your query, this will return all the nodes in the SummaryIndex. This is probably not what you want, so let’s add embeddings.

Creating a SummaryIndex with embeddings
#

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

index_w_embeddings = SummaryIndex.from_documents(
  docs,
  transformations=[
    HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
  ])

Each node now contains an embedding alongside its text, so we can initialize a SummaryIndexEmbeddingRetriever:

from llama_index.core.indices.list.retrievers import SummaryIndexEmbeddingRetriever

retriever_w_embeddings = SummaryIndexEmbeddingRetriever(
  index=index_w_embeddings,
  embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
  similarity_top_k=3,  # default is 1
)

This class requires that you pass the index and the embed_model. The embed_model should be the same model you used to create the SummaryIndex. As you can see, we now have more parameters to play with: similarity_top_k controls how many of the most similar nodes are returned for your query.

Why not use a Vector Database?
#

Notice that we are using embeddings, but we are not putting them into a vector database. This is because the SummaryIndex is not designed to be used with a vector database. It is designed for small datasets.

The SummaryIndex data structure is simply running a for loop over all the nodes in the list, comparing their embeddings to the embeddings of the query, and returning the K most similar nodes. This is not efficient for large datasets, but it is very efficient for small datasets.
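That linear scan is easy to sketch in plain Python. The snippet below is an illustration of the idea, not LlamaIndex's actual code: score every node embedding against the query embedding with cosine similarity, sort, and keep the top k.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_emb, node_embs, k=1):
    # The "index" is just a list: score every node, sort, slice.
    scored = [(i, cosine(query_emb, emb)) for i, emb in enumerate(node_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Three toy 2-dimensional "node embeddings"
nodes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], nodes, k=2))  # node 0 first, then node 2
```

An O(n) scan like this is perfectly fine for a few dozen or a few hundred nodes; a vector database only starts paying off when n gets large enough that approximate nearest-neighbor search matters.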

Using an LLM as a retriever
#

You can also use an LLM as a retriever. This was one of the early ideas for augmenting an LLM's knowledge without retraining or fine-tuning it, even before the now-standard RAG pipeline was introduced.

This approach loops over all the nodes in the SummaryIndex, concatenates the query with each node, and passes the pair to the LLM, which returns a relevance score for each node.
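Conceptually, the loop looks like the sketch below. To keep it runnable without an API key, a trivial keyword-overlap function stands in for the real LLM call; in the actual retriever, that function would be a prompt asking the model to rate the node's relevance.

```python
def llm_score(query: str, node_text: str) -> float:
    # Stand-in for an LLM call: a real implementation would prompt the
    # model with the query and the node text and parse out a score.
    query_words = set(query.lower().split())
    node_words = set(node_text.lower().split())
    return float(len(query_words & node_words))


def llm_retrieve(query, nodes, k=2):
    # Loop over every node, score it, keep the best k.
    scored = sorted(nodes, key=lambda n: llm_score(query, n), reverse=True)
    return scored[:k]


nodes = [
    "punishments for violating the confidentiality policy",
    "official social media accounts of the company",
    "history of the corporation",
]
print(llm_retrieve("what are the punishments", nodes, k=1))
```

Note that this costs one LLM call per node (or per batch of nodes), which is another reason the technique only makes sense for small collections.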

Here is what the implementation looks like:

I’ll use Groq:

pip install -Uq llama-index-llms-groq

Then set your API key:

import os
import getpass

os.environ['GROQ_API_KEY'] = getpass.getpass('Groq API Key:')

I am assuming you already have the index_w_embeddings variable from the previous example.

from llama_index.llms.groq import Groq

retriever_llm = index_w_embeddings.as_retriever(
  retriever_mode="llm",
  llm=Groq(model="deepseek-r1-distill-qwen-32b", api_key=os.environ["GROQ_API_KEY"]),
  verbose=True
)

results = retriever_llm.retrieve("what are the punishments")

Let’s take a look at the results:

for r in results:
  print("Node contents (excerpt):\n", r.node.text[:200].replace("\t", " "))
  print("Score: ", r.score)
  print("\n" + "-" * 80 +"\n")
Node contents (excerpt):
 Company-Sanctioned Social Media Accounts
The Umbrella Corporation maintains official social media accounts to share company news, research updates, and engage with the
public. These accounts are manag
Score:  7.0

--------------------------------------------------------------------------------

Node contents (excerpt):
 The Umbrella Corporation's rich history is a testament to its commitment to innovation, discovery, and pushing the boundaries of human
knowledge. From humble beginnings to global recognition, our comp
Score:  4.0

--------------------------------------------------------------------------------

Node contents (excerpt):
 * Subsection 2.2: Workplace Etiquette and Professionalism
Subsection 2.2: Workplace Etiquette and Professionalism
As a leading player in the pharmaceutical, biotechnology, and genetic engineering indu
Score:  3.0

--------------------------------------------------------------------------------

Node contents (excerpt):
 1
. 
Confidential
: This includes sensitive information that could cause harm to the company or its stakeholders if disclosed, such as
trade secrets, proprietary research data, and confidential busine
Score:  9.0

--------------------------------------------------------------------------------

Node contents (excerpt):
 passwords that are difficult to guess or crack. The following guidelines must be adhered to:
Passwords must be a minimum of 12 characters in length.
Passwords must contain a mix of uppercase and lower
Score:  8.0

As you can see, the LLM is returning the most relevant nodes to the query. This is a very powerful way to augment the knowledge of an LLM without having to retrain it.

Conclusion
#

In this article, we explored the SummaryIndex class in LlamaIndex. We learned how to create a SummaryIndex with and without embeddings, and how to query it. We also learned why the SummaryIndex is not suitable for large datasets.
