Default LlamaIndex settings are great for demos. Here are the five changes that make retrieval good enough for production.
The Gap
LlamaIndex's defaults are optimised for getting started quickly, not for production retrieval quality. The default chunk size, default similarity_top_k, and default response synthesiser work well in demos. They break down when your knowledge base grows beyond a handful of documents, when users ask precise questions, or when documents contain mixed content types.
These are the five highest-impact changes you can make.
Fix 1: Custom Text Splitter
The default splitter uses a fixed chunk size with no awareness of document structure. For technical docs, this splits code blocks in half. For articles, it cuts sentences mid-thought. A sentence-aware or semantic splitter dramatically improves chunk quality.
```python
from llama_index.core.node_parser import (
    SentenceSplitter,            # sentence-aware chunking
    SemanticSplitterNodeParser,  # semantic boundary detection (slower, better)
    CodeSplitter,                # for code-heavy documents
)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs").load_data()

# For most text content:
splitter = SentenceSplitter(
    chunk_size=512,  # tokens, not characters
    chunk_overlap=64,
)

# For code-heavy technical docs:
# splitter = CodeSplitter(language="python", chunk_lines=40)

# For highest quality (uses an embedding model to find semantic boundaries):
# splitter = SemanticSplitterNodeParser(
#     buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
# )

nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
```

| Splitter | Best for | Speed |
|---|---|---|
| SentenceSplitter | General text, articles, documentation | Fast |
| SemanticSplitterNodeParser | Dense mixed-content documents | Slow (calls embeddings) |
| CodeSplitter | Code files, technical references | Fast |
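To build intuition for what sentence-aware chunking buys you, here is a minimal standalone sketch of the idea: pack whole sentences into chunks of roughly `chunk_size` tokens and carry about `overlap` tokens into the next chunk. This is an illustration only, not LlamaIndex's actual implementation, and it approximates token counts with whitespace words.

```python
import re


def sentence_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Greedy sentence-aware chunking: never splits a sentence in half."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences into the next chunk until ~overlap tokens
            carried, carried_len = [], 0
            for prev in reversed(current):
                if carried_len + len(prev.split()) > overlap:
                    break
                carried.insert(0, prev)
                carried_len += len(prev.split())
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunk boundaries always fall at sentence ends, every chunk reads as coherent text, which is exactly the property the default fixed-size splitter lacks.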
Fix 2: Metadata Extraction and Filtering
Attaching metadata to nodes at index time and filtering at query time is the single biggest precision improvement for multi-category knowledge bases. Without filtering, a question about billing might surface chunks from engineering docs.
```python
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Add metadata during ingestion
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(),    # extracts document title into metadata
        KeywordExtractor(),  # extracts keywords for each chunk
    ]
)
nodes = pipeline.run(documents=documents)

# Add custom metadata manually
for node in nodes:
    node.metadata["category"] = "billing"
    node.metadata["product_version"] = "3.2"

index = VectorStoreIndex(nodes)

# Filter at query time
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

filters = MetadataFilters(filters=[
    MetadataFilter(key="category", value="billing"),
])

query_engine = index.as_query_engine(
    similarity_top_k=5,
    filters=filters,
)
response = query_engine.query("What are the billing options?")
```

Fix 3: Reranking
Vector similarity retrieves the most embedding-similar chunks, not necessarily the most relevant ones for the specific question. A reranker re-scores the top-k retrieved chunks using a dedicated model that understands query-document relevance more precisely than embedding similarity alone.
```python
from llama_index.postprocessor.cohere_rerank import CohereRerank
# Or, for a local open-source reranker:
# from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = CohereRerank(
    api_key="your-cohere-key",
    top_n=3,  # keep the best 3 after reranking
)

query_engine = index.as_query_engine(
    similarity_top_k=10,             # retrieve more for the reranker to work with
    node_postprocessors=[reranker],  # reranker narrows to top 3
)
```

A common pattern: retrieve the top 10-15 with vector search, then rerank down to the top 3-5. The extra retrieval cost is small, and the quality improvement from reranking is significant; in practice this is often the single highest-leverage change for answer relevance.

Fix 4: Hybrid Search
Pure vector search excels at semantic similarity but misses exact keyword matches. Hybrid search combines vector scores with BM25 keyword scores to handle both semantic queries ('tell me about the cancellation process') and exact queries ('what is the SLA for P1 incidents?').
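The fusion step itself is simple. The "reciprocal_rerank" mode used in the code that follows is Reciprocal Rank Fusion (RRF): each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in. Here is a minimal standalone sketch of that scoring (an illustration, not LlamaIndex's internal code):

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists; rank is 1-based, k=60 is the conventional constant."""
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both retrievers beats one ranked highly by only one, which is why RRF works well without any score normalisation between vector and BM25 scales.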
```python
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

# Vector retriever
vector_retriever = index.as_retriever(similarity_top_k=5)

# BM25 keyword retriever (over the same nodes)
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=5,
)

# Fuse both retrievers
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,             # set >1 to generate query variations (improves recall)
    mode="reciprocal_rerank",  # RRF fusion scoring
)

# Use as a query engine
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
```

Fix 5: Response Synthesiser Mode
The default response mode (compact) works well for short answers. For longer documents and more nuanced questions, changing the synthesis mode can dramatically improve answer quality.
| Mode | What it does | Best for |
|---|---|---|
| compact | Stuffs all retrieved chunks into one prompt | Short answers, fast responses |
| refine | Iteratively refines answer chunk by chunk | Long-form, comprehensive answers |
| tree_summarize | Builds a tree of summaries | Very long documents, summarisation tasks |
| accumulate | Generates a response per chunk, then combines | Research-style answers citing multiple sources |
```python
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="refine",  # better quality for complex questions
)
```

Production RAG Checklist for LlamaIndex
- Use SentenceSplitter or SemanticSplitterNodeParser instead of default splitter
- Add metadata (category, source, date) during ingestion and filter at query time
- Add a reranker (Cohere or FlagEmbedding): retrieve the top 10, rerank to the top 3-5
- Use hybrid retrieval (vector + BM25) via QueryFusionRetriever
- Choose response_mode based on answer style: compact for speed, refine for quality
- Test retrieval quality in isolation before testing end-to-end (check what chunks are returned, not just the final answer)
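That last checklist item is worth automating. A minimal sketch of a retrieval-only evaluation, assuming you have a small set of labelled queries (query mapped to the chunk ids that should be retrieved) and a `retrieve_fn` you would build on something like `index.as_retriever(similarity_top_k=top_k).retrieve` — both the labels and the function name here are hypothetical:

```python
def retrieval_hit_rate(retrieve_fn, labeled_queries: dict[str, set[str]], top_k: int = 5) -> float:
    """Fraction of queries where at least one expected chunk id appears
    in the top_k retrieved ids. retrieve_fn(query) returns chunk ids in
    rank order."""
    hits = 0
    for query, expected_ids in labeled_queries.items():
        retrieved = set(retrieve_fn(query)[:top_k])
        if retrieved & expected_ids:
            hits += 1
    return hits / len(labeled_queries)
```

Run this before and after each of the five fixes: if hit rate does not move, the final answers will not improve either, and you have isolated the problem to retrieval rather than synthesis.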