Default LlamaIndex settings are great for demos. Here are the five changes that make retrieval good enough for production.

The Gap

LlamaIndex's defaults are optimised for getting started quickly, not for production retrieval quality. The default chunk size, default similarity_top_k, and default response synthesiser work well in demos. They break down when your knowledge base grows beyond a handful of documents, when users ask precise questions, or when documents contain mixed content types.

These are the five highest-impact changes you can make.

Fix 1: Custom Text Splitter

The default splitter uses a fixed chunk size with no awareness of document structure. For technical docs, this splits code blocks in half. For articles, it cuts sentences mid-thought. A sentence-aware or semantic splitter dramatically improves chunk quality.

from llama_index.core.node_parser import (
    SentenceSplitter,          # sentence-aware chunking
    SemanticSplitterNodeParser, # semantic boundary detection (slower, better)
    CodeSplitter,              # for code-heavy documents
)
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
 
documents = SimpleDirectoryReader("./docs").load_data()
 
# For most text content:
splitter = SentenceSplitter(
    chunk_size=512,        # tokens, not characters
    chunk_overlap=64,
)
 
# For code-heavy technical docs:
# splitter = CodeSplitter(language="python", chunk_lines=40)
 
# For highest quality (uses an embedding model to find semantic boundaries):
# splitter = SemanticSplitterNodeParser(
#     buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
# )
 
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
Splitter comparison:

  • SentenceSplitter: general text, articles, documentation (fast)
  • SemanticSplitterNodeParser: dense mixed-content documents (slow; calls embeddings)
  • CodeSplitter: code files, technical references (fast)
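For intuition, sentence-aware chunking amounts to packing whole sentences into a chunk until the size budget is hit. A toy, character-based sketch (SentenceSplitter itself counts tokens and handles many more edge cases):

```python
import re

def sentence_chunks(text, max_len=512):
    """Greedy sentence-aware chunking: pack whole sentences into chunks
    of up to max_len characters, never splitting mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > max_len:
            chunks.append(cur)   # budget exceeded: close the current chunk
            cur = s
        else:
            cur = f"{cur} {s}".strip() if cur else s
    if cur:
        chunks.append(cur)
    return chunks
```

The key property is that chunk boundaries always fall between sentences, so no chunk ends mid-thought; the real splitters add token counting, overlap, and structure awareness on top of this idea.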

Fix 2: Metadata Extraction and Filtering

Attaching metadata to nodes at index time and filtering at query time is the single biggest precision improvement for multi-category knowledge bases. Without filtering, a question about billing might surface chunks from engineering docs.

from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
 
# Add metadata during ingestion
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        TitleExtractor(),    # extracts document title into metadata
        KeywordExtractor(),  # extracts keywords for each chunk
    ]
)
 
nodes = pipeline.run(documents=documents)
 
# Add custom metadata manually
for node in nodes:
    node.metadata["category"] = "billing"
    node.metadata["product_version"] = "3.2"
 
index = VectorStoreIndex(nodes)
 
# Filter at query time
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
 
filters = MetadataFilters(filters=[
    MetadataFilter(key="category", value="billing"),
])
 
query_engine = index.as_query_engine(
    similarity_top_k=5,
    filters=filters,
)
 
response = query_engine.query("What are the billing options?")
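Conceptually, the filter restricts the candidate pool before any similarity scoring happens, which is why it improves precision rather than just reordering results. A framework-free sketch of the same idea (the node dicts here are illustrative, not llama_index's internal representation):

```python
def filter_nodes(nodes, required):
    """Keep only nodes whose metadata matches every required key/value pair."""
    return [n for n in nodes
            if all(n["metadata"].get(k) == v for k, v in required.items())]

nodes = [
    {"text": "Invoices are issued monthly.", "metadata": {"category": "billing"}},
    {"text": "Deploys run via CI.",          "metadata": {"category": "engineering"}},
]

# Only billing nodes survive; engineering chunks can never be retrieved
billing_only = filter_nodes(nodes, {"category": "billing"})
```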

Fix 3: Reranking

Vector similarity retrieves the most embedding-similar chunks, not necessarily the most relevant ones for the specific question. A reranker re-scores the top-k retrieved chunks using a dedicated model that understands query-document relevance more precisely than embedding similarity alone.

from llama_index.postprocessor.cohere_rerank import CohereRerank
# Or a local alternative:
# from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
import os
 
reranker = CohereRerank(
    api_key=os.environ["COHERE_API_KEY"],  # read from the environment, not hardcoded
    top_n=3,  # rerank top 10, keep best 3
)
 
query_engine = index.as_query_engine(
    similarity_top_k=10,           # retrieve more for the reranker to work with
    node_postprocessors=[reranker], # reranker narrows to top 3
)

A common pattern: retrieve the top 10-15 with vector search, then rerank down to the top 3-5. The extra retrieval cost is small, and the precision gain from reranking is typically one of the largest single improvements you can make to answer quality.
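The retrieve-then-rerank pattern, stripped of any particular library (retrieve_fn and score_fn are stand-ins for your retriever and relevance model):

```python
def retrieve_then_rerank(query, retrieve_fn, score_fn, retrieve_k=10, top_n=3):
    """Cast a wide net with cheap vector retrieval, then keep only the
    chunks a dedicated relevance model scores highest for this query."""
    candidates = retrieve_fn(query, retrieve_k)
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]

# Toy usage: the "relevance model" is just query-occurrence counting
docs = ["alpha billing", "beta", "billing plans gamma"]
top = retrieve_then_rerank("billing", lambda q, k: docs[:k],
                           lambda q, c: c.count(q))
```

The design point is the asymmetry: retrieval must be fast because it scans the whole index, while the reranker can afford a heavier model because it only sees retrieve_k candidates.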

Fix 4: Hybrid Search

Pure vector search excels at semantic similarity but misses exact keyword matches. Hybrid search combines vector scores with BM25 keyword scores to handle both semantic queries ('tell me about the cancellation process') and exact queries ('what is the SLA for P1 incidents?').

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
 
# Vector retriever
vector_retriever = index.as_retriever(similarity_top_k=5)
 
# BM25 keyword retriever (same nodes)
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=5,
)
 
# Fuse both retrievers
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,    # set >1 to generate query variations (improves recall)
    mode="reciprocal_rerank",  # RRF fusion scoring
)
 
# Use as a query engine
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
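Under the hood, reciprocal_rerank fuses the two ranked lists with Reciprocal Rank Fusion (RRF): each document's fused score is the sum of 1/(k + rank) across the lists it appears in, so documents ranked well by both retrievers rise to the top. A minimal pure-Python sketch (k=60 is the conventional constant; the library's exact value may differ):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    over every ranked list it appears in (rank is 1-based)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
bm25_hits   = ["doc_c", "doc_a", "doc_d"]  # ranked by BM25 keyword score
fused = rrf_fuse([vector_hits, bm25_hits])
# doc_a and doc_c appear in both lists, so they outrank the single-list hits
```

Because RRF only uses ranks, it sidesteps the problem that vector similarities and BM25 scores live on incomparable scales.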

Fix 5: Response Synthesiser Mode

The default response mode (compact) works well for short answers. For longer documents and more nuanced questions, changing the synthesis mode can dramatically improve answer quality.

Response modes:

  • compact: stuffs all retrieved chunks into one prompt (short answers, fast responses)
  • refine: iteratively refines the answer chunk by chunk (long-form, comprehensive answers)
  • tree_summarize: builds a tree of summaries (very long documents, summarisation tasks)
  • accumulate: generates a response per chunk, then combines them (research-style answers citing multiple sources)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="refine",  # better quality for complex questions
)
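What refine does, schematically: the first chunk produces a draft answer, and each subsequent chunk gets a chance to improve it. A sketch with llm_call as a stand-in for the real LLM call (the library's actual prompts differ):

```python
def refine_synthesize(question, chunks, llm_call):
    """Schematic of response_mode="refine": draft from the first chunk,
    then let each later chunk refine the running answer."""
    answer = llm_call(f"Answer '{question}' using:\n{chunks[0]}")
    for chunk in chunks[1:]:
        answer = llm_call(
            f"Existing answer to '{question}': {answer}\n"
            f"Refine it using this new context:\n{chunk}"
        )
    return answer
```

This structure also explains the trade-off: refine makes one LLM call per retrieved chunk, so it is slower and costlier than compact's single call, but no chunk is ever truncated out of the prompt.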

Production RAG Checklist for LlamaIndex

  • Use SentenceSplitter or SemanticSplitterNodeParser instead of default splitter
  • Add metadata (category, source, date) during ingestion and filter at query time
  • Add a reranker (Cohere or FlagEmbedding): retrieve the top 10-15, rerank to the top 3-5
  • Use hybrid retrieval (vector + BM25) via QueryFusionRetriever
  • Choose response_mode based on answer style: compact for speed, refine for quality
  • Test retrieval quality in isolation before testing end-to-end (check what chunks are returned, not just the final answer)
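The last checklist item can be made concrete with two standard retrieval metrics, hit rate and MRR, computed over a small hand-labelled set of (question, expected chunk) pairs. A framework-free sketch (retrieve_fn stands in for your retriever and should return chunk IDs in ranked order):

```python
def retrieval_metrics(eval_set, retrieve_fn, top_k=5):
    """hit_rate: fraction of questions whose expected chunk is in the top_k.
    mrr: mean reciprocal rank of the expected chunk (0 if not retrieved)."""
    hits, rr_sum = 0, 0.0
    for question, expected_id in eval_set:
        retrieved = retrieve_fn(question)[:top_k]
        if expected_id in retrieved:
            hits += 1
            rr_sum += 1.0 / (retrieved.index(expected_id) + 1)
    n = len(eval_set)
    return {"hit_rate": hits / n, "mrr": rr_sum / n}

# Toy example with a canned retriever
fake_results = {"q1": ["c3", "c1"], "q2": ["c9", "c2", "c5"]}
metrics = retrieval_metrics([("q1", "c1"), ("q2", "c7")],
                            lambda q: fake_results[q])
```

Running this before and after each of the five fixes tells you which change actually moved retrieval quality, independent of the LLM's answer-writing ability.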