Default LlamaIndex settings are great for demos. Here are the five changes that make retrieval good enough for production.
The Gap
LlamaIndex's defaults are optimised for getting started quickly, not for production retrieval quality. The default chunk size, default similarity_top_k, and default response synthesiser work well in demos. They break down when your knowledge base grows beyond a handful of documents, when users ask precise questions, or when documents contain mixed content types.
These are the five highest-impact changes you can make.
Fix 1: Custom Text Splitter
The default splitter uses a fixed chunk size with no awareness of document structure. For technical docs, this splits code blocks in half. For articles, it cuts sentences mid-thought. A sentence-aware or semantic splitter dramatically improves chunk quality.
```python
from llama_index.core.node_parser import (
    SentenceSplitter,            # sentence-aware chunking
    SemanticSplitterNodeParser,  # semantic boundary detection (slower, better)
    CodeSplitter,                # for code-heavy documents
)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs").load_data()

# For most text content:
splitter = SentenceSplitter(
    chunk_size=512,  # tokens, not characters
    chunk_overlap=64,
)

# For code-heavy technical docs:
# splitter = CodeSplitter(language="python", chunk_lines=40)

# For highest quality (uses an embedding model to find semantic boundaries):
# splitter = SemanticSplitterNodeParser(
#     buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
# )

nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
```

| Splitter | Best for | Speed |
|---|---|---|
| SentenceSplitter | General text, articles, documentation | Fast |
| SemanticSplitterNodeParser | Dense mixed-content documents | Slow (calls embeddings) |
| CodeSplitter | Code files, technical references | Fast |
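To build intuition for what sentence-aware chunking buys you, here is a minimal standalone sketch of the idea: pack whole sentences into chunks of roughly `chunk_size` tokens and carry about `overlap` tokens into the next chunk. This is an illustration only, not LlamaIndex's actual implementation, and it approximates token counts with whitespace words.

```python
import re


def sentence_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Greedy sentence-aware chunking: never splits a sentence in half."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences into the next chunk until ~overlap tokens
            carried, carried_len = [], 0
            for prev in reversed(current):
                if carried_len + len(prev.split()) > overlap:
                    break
                carried.insert(0, prev)
                carried_len += len(prev.split())
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunk boundaries always fall at sentence ends, every chunk reads as coherent text, which is exactly the property the default fixed-size splitter lacks.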
Fix 2: Metadata Extraction and Filtering
Attaching metadata to nodes at index time and filtering at query time is the single biggest precision improvement for multi-category knowledge bases. Without filtering, a question about billing might surface chunks from engineering docs.
```python
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Add metadata during ingestion
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(),    # extracts document title into metadata
        KeywordExtractor(),  # extracts keywords for each chunk
    ]
)
nodes = pipeline.run(documents=documents)

# Add custom metadata manually
for node in nodes:
    node.metadata["category"] = "billing"
    node.metadata["product_version"] = "3.2"

index = VectorStoreIndex(nodes)

# Filter at query time
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

filters = MetadataFilters(filters=[
    MetadataFilter(key="category", value="billing"),
])

query_engine = index.as_query_engine(
    similarity_top_k=5,
    filters=filters,
)
response = query_engine.query("What are the billing options?")
```

Fix 3: Reranking
Vector similarity retrieves the most embedding-similar chunks, not necessarily the most relevant ones for the specific question. A reranker re-scores the top-k retrieved chunks using a dedicated model that understands query-document relevance more precisely than embedding similarity alone.
```python
from llama_index.postprocessor.cohere_rerank import CohereRerank
# Or, for a local open-source reranker:
# from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = CohereRerank(
    api_key="your-cohere-key",
    top_n=3,  # keep the best 3 after reranking
)

query_engine = index.as_query_engine(
    similarity_top_k=10,             # retrieve more for the reranker to work with
    node_postprocessors=[reranker],  # reranker narrows to top 3
)
```

A common pattern: retrieve the top 10-15 with vector search, then rerank down to the top 3-5. The extra retrieval cost is small, and the quality improvement from reranking is significant; in practice this is often the single highest-leverage change for answer relevance.

Fix 4: Hybrid Search
Pure vector search excels at semantic similarity but misses exact keyword matches. Hybrid search combines vector scores with BM25 keyword scores to handle both semantic queries ('tell me about the cancellation process') and exact queries ('what is the SLA for P1 incidents?').
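The fusion step itself is simple. The "reciprocal_rerank" mode used in the code that follows is Reciprocal Rank Fusion (RRF): each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in. Here is a minimal standalone sketch of that scoring (an illustration, not LlamaIndex's internal code):

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists; rank is 1-based, k=60 is the conventional constant."""
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both retrievers beats one ranked highly by only one, which is why RRF works well without any score normalisation between vector and BM25 scales.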
```python
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

# Vector retriever
vector_retriever = index.as_retriever(similarity_top_k=5)

# BM25 keyword retriever (over the same nodes)
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=5,
)

# Fuse both retrievers
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,             # set >1 to generate query variations (improves recall)
    mode="reciprocal_rerank",  # RRF fusion scoring
)

# Use as a query engine
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
```

Fix 5: Response Synthesiser Mode
The default response mode (compact) works well for short answers. For longer documents and more nuanced questions, changing the synthesis mode can dramatically improve answer quality.
| Mode | What it does | Best for |
|---|---|---|
| compact | Stuffs all retrieved chunks into one prompt | Short answers, fast responses |
| refine | Iteratively refines answer chunk by chunk | Long-form, comprehensive answers |
| tree_summarize | Builds a tree of summaries | Very long documents, summarisation tasks |
| accumulate | Generates a response per chunk, then combines | Research-style answers citing multiple sources |
```python
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="refine",  # better quality for complex questions
)
```

Production RAG Checklist for LlamaIndex
- Use SentenceSplitter or SemanticSplitterNodeParser instead of default splitter
- Add metadata (category, source, date) during ingestion and filter at query time
- Add a reranker (Cohere or FlagEmbedding): retrieve the top 10, rerank to the top 3-5
- Use hybrid retrieval (vector + BM25) via QueryFusionRetriever
- Choose response_mode based on answer style: compact for speed, refine for quality
- Test retrieval quality in isolation before testing end-to-end (check what chunks are returned, not just the final answer)
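That last checklist item is worth automating. A minimal sketch of a retrieval-only evaluation, assuming you have a small set of labelled queries (query mapped to the chunk ids that should be retrieved) and a `retrieve_fn` you would build on something like `index.as_retriever(similarity_top_k=top_k).retrieve` — both the labels and the function name here are hypothetical:

```python
def retrieval_hit_rate(retrieve_fn, labeled_queries: dict[str, set[str]], top_k: int = 5) -> float:
    """Fraction of queries where at least one expected chunk id appears
    in the top_k retrieved ids. retrieve_fn(query) returns chunk ids in
    rank order."""
    hits = 0
    for query, expected_ids in labeled_queries.items():
        retrieved = set(retrieve_fn(query)[:top_k])
        if retrieved & expected_ids:
            hits += 1
    return hits / len(labeled_queries)
```

Run this before and after each of the five fixes: if hit rate does not move, the final answers will not improve either, and you have isolated the problem to retrieval rather than synthesis.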