Dify RAG in Production: Chunking, Metadata Filters, and Dynamic Updates

Dify's default knowledge base setup works for demos. Here's what you need to change before it's production-ready.

The Gap

Dify's built-in knowledge base is one of its strongest features: upload documents, connect the knowledge base to an agent or workflow, and it retrieves relevant chunks on demand. The default setup gets you to a working demo in minutes.

But the default settings -- auto-chunking, no metadata, no filtering -- are not production defaults. They're onboarding defaults. When your knowledge base grows beyond a handful of documents, retrieval quality drops, stale content surfaces, and you have no way to restrict what gets retrieved for a given query.

This article walks through the four settings that matter most for production RAG in Dify.

1. Chunking Strategy

Dify offers three indexing modes: High Quality, Economy, and QA Pairing. Most tutorials use High Quality without explaining what it actually does, or how to tune it.

Mode	What it does	When to use it
High Quality	Uses an embedding model to vectorize each chunk -- better semantic understanding, higher cost	Most production use cases
Economy	Uses keyword-based indexing -- fast and cheap, lower recall quality	Large document sets where cost matters more than precision
QA Pairing	Automatically generates Q&A pairs from your documents -- very high precision for FAQ-style retrieval	Support docs, product manuals, structured knowledge

Within High Quality mode, the key settings to tune are chunk size and overlap. Dify's defaults (1000 characters, 200 overlap) work for general text. Adjust them for your content type:

Content type	Recommended chunk size	Overlap
General text / articles	800–1000 chars	150–200
Technical documentation / code	400–600 chars	80–100
Legal / dense structured text	1200–1500 chars	200–300
Short FAQ entries	200–400 chars	50

If retrieval returns chunks that seem off-topic, your chunks are probably too large: irrelevant surrounding text dilutes the similarity score. If retrieved chunks are missing critical context, they are probably too small. Tune chunk size before touching anything else.
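To see how chunk size and overlap interact, here is a minimal sketch of fixed-size character chunking. It is an illustration only: `chunk_text` is a hypothetical helper, not Dify's chunker, which also splits on delimiters before applying length limits.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 2500
    for size, overlap in [(1000, 200), (500, 100)]:
        pieces = chunk_text(sample, size, overlap)
        print(size, overlap, len(pieces), [len(p) for p in pieces])
```

Running the same document through a few size and overlap pairs like this gives a quick feel for how many chunks each setting produces before you commit to re-indexing a whole knowledge base.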

2. Metadata: The Feature Most Builders Skip

Every document uploaded to a Dify knowledge base can have custom metadata attached: source, author, date, category, product version, department -- whatever dimensions matter for filtering. Almost nobody sets this up during onboarding, and almost everybody wishes they had when the knowledge base grows.

Adding metadata to documents

Go to your Knowledge Base in Dify.
Click on a document to open its settings.
Add metadata key-value pairs under Document Metadata.
Repeat for all documents, or use the API to batch-set metadata.

{
  "source": "help-center",
  "category": "billing",
  "product_version": "3.2",
  "last_updated": "2026-01-15",
  "audience": "admin"
}
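Metadata is only useful for filtering if it is consistent across documents, so it can help to validate records before attaching them. The sketch below assumes a minimum schema of source, category, and last_updated; `REQUIRED_KEYS` and `validate_metadata` are hypothetical names for illustration, not part of Dify.

```python
from datetime import date

REQUIRED_KEYS = {"source", "category", "last_updated"}  # assumed minimum schema


def validate_metadata(metadata: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(metadata))]
    for key, value in metadata.items():
        if not isinstance(value, str) or not value.strip():
            problems.append(f"empty or non-string value for: {key}")
    if "last_updated" in metadata:
        try:
            date.fromisoformat(metadata["last_updated"])
        except (TypeError, ValueError):
            problems.append("last_updated is not an ISO date (YYYY-MM-DD)")
    return problems
```

A batch job could run this over every record and refuse to upload documents that return a non-empty list.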

Using metadata in retrieval

Once metadata is set, you can filter retrieval in two ways: statically (always filter by a fixed value) or dynamically (filter based on the user's context, detected by the agent or workflow).

In a Dify workflow, use the Knowledge Retrieval node and set a Metadata Filter condition. For example, only retrieve documents where category equals the value passed in from the user query classifier.
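Conceptually, a metadata filter narrows the candidate set before similarity ranking decides what reaches the model. The following is a rough sketch of that idea in plain Python; `Chunk` and `filter_then_rank` are hypothetical names, and this is not how Dify implements filtering internally.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    score: float  # similarity score from the retriever, higher is better
    metadata: dict[str, str] = field(default_factory=dict)


def filter_then_rank(chunks: list[Chunk], conditions: dict[str, str], top_k: int = 3) -> list[Chunk]:
    """Keep chunks whose metadata equals every condition, then return the top_k by score."""
    matching = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in conditions.items())
    ]
    return sorted(matching, key=lambda c: c.score, reverse=True)[:top_k]
```

In the dynamic case, `conditions` would be built from the classifier's output, for example `{"category": "billing"}`, so a high-scoring chunk from another category never reaches the prompt.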

Dify's metadata filter support varies by the vector store backend you've configured. Weaviate and Qdrant have full filter support. If you're using the built-in vector store, check the current filter capabilities in the Dify docs -- they expand with each release.

3. Handling Document Updates

When a document in your knowledge base changes, you need to re-index it. Dify does not automatically detect source changes -- you have to trigger re-indexing manually or via the API.

The re-indexing workflow

In the Knowledge Base, select the outdated document and click Re-Index.
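For automated updates, use the Dify Knowledge API: POST /datasets/{dataset_id}/documents/{document_id}/update_by_text (or update_by_file) replaces the document's content and triggers re-indexing. Check the API reference for your Dify version, since newer docs spell these paths with hyphens (update-by-text).
Build a scheduled Dify workflow (or external cron job) that calls this endpoint for time-sensitive documents.

# Update a specific document's text via the Dify Knowledge API (triggers re-indexing)
curl -X POST 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{document_id}/update_by_text' \
  -H 'Authorization: Bearer {api_key}' \
  -H 'Content-Type: application/json' \
  -d '{"name": "{document_name}", "text": "{updated_text}"}'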
 
# List documents to find stale ones (check updated_at vs your source)
curl 'https://api.dify.ai/v1/datasets/{dataset_id}/documents' \
  -H 'Authorization: Bearer {api_key}'
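Deciding which documents are stale is the part you have to build yourself. One option, sketched here under the assumption that you store a content hash each time you push a document, is to compare hashes instead of timestamps. `content_hash` and `find_stale` are hypothetical helpers, and where the hashes are stored is up to you.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of source documents that are new or whose text changed since the last sync.

    source_docs maps a document ID to its current text; indexed_hashes maps a
    document ID to the hash recorded when it was last pushed to the knowledge base.
    """
    return sorted(
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
```

A cron job can call `find_stale` on each run and send only the returned IDs to the API, which avoids paying to re-embed documents that have not changed.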

For knowledge bases that pull from external sources (Notion, Google Drive, web URLs), Dify's Knowledge Pipeline feature (released 2025) provides a visual ETL pipeline that can be scheduled to sync and re-index automatically -- this is the cleanest production option if you're on a recent Dify version.

4. Retrieval Mode: Vector vs Full-Text vs Hybrid

Dify supports three retrieval modes. Most demos use vector search only. For production, hybrid search usually outperforms either mode alone.

Mode	How it works	Best for
Vector search	Semantic similarity -- finds conceptually related chunks	Natural language queries, fuzzy matching
Full-text search	Keyword matching -- finds exact terms	Structured queries, product names, IDs, codes
Hybrid (recommended)	Runs both searches and merges the results using weighted scores or a rerank model	Most production use cases

To enable hybrid search in Dify: go to Knowledge Base Settings > Retrieval Settings > select Hybrid Search. You can also enable a Reranker model (Cohere Rerank, BGE Reranker, etc.) to re-score the top results before returning them to the agent. This typically improves answer quality, at the cost of some extra latency per query.

If your users regularly ask about specific product names, version numbers, or IDs, pure vector search can miss them -- it matches concepts, not exact strings. Hybrid search addresses this by adding keyword matching.
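To make the idea of blending the two signals concrete, here is a minimal sketch of weighted score fusion. It assumes each retriever returns a score per document ID and uses min-max normalization; `normalize`, `hybrid_rank`, and the 0.7 default weight are illustrative choices, not Dify's internal implementation.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_rank(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.7,
) -> list[tuple[str, float]]:
    """Blend normalized vector and keyword scores; a missing score counts as 0."""
    if not 0.0 <= vector_weight <= 1.0:
        raise ValueError("vector_weight must be between 0 and 1")
    v, k = normalize(vector_scores), normalize(keyword_scores)
    combined = {
        doc_id: vector_weight * v.get(doc_id, 0.0) + (1 - vector_weight) * k.get(doc_id, 0.0)
        for doc_id in v.keys() | k.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Lowering `vector_weight` favors exact-term matches, which is the direction to move if queries are dominated by product names or IDs.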

Production RAG Checklist for Dify

Indexing mode set to High Quality (not Economy) for production use cases
Chunk size tuned for your document type -- not left at default
Metadata added to all documents (source, category, date at minimum)
Metadata filters configured in Knowledge Retrieval nodes
Re-indexing pipeline set up for documents that change -- manual or automated via API
Retrieval mode set to Hybrid Search with a reranker enabled
Knowledge Pipeline used for external sources (Notion, Drive, URLs) if on recent Dify version