# RAG System Integration

The `context.md` file generated by ContextMD is specifically optimized for high-density information retrieval. Unlike raw HTML or basic Markdown scrapes, the output is "pre-digested" by AI to remove noise, making it an ideal source for Retrieval-Augmented Generation (RAG) systems.
## Ingestion Strategies

When integrating ContextMD output into your RAG pipeline, consider the following best practices for chunking, embedding, and metadata mapping.
### 1. Document Chunking

Because ContextMD refines content into a dense format, you can often use larger chunk sizes than you would with raw documentation. The output uses clear Markdown headers (`## Source: ...`) to delineate original pages.
- Recommended Strategy: Markdown-based splitting.
- Chunk Size: 1000–1500 tokens (depending on your embedding model).
- Overlap: 10–15% to maintain context across logical breaks.
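The strategy above can be sketched without any framework: split on the `## Source:` headers that ContextMD emits, then sub-split oversized sections with overlap. This is a minimal illustration, not a production splitter; the sizes below are measured in characters for simplicity, and a real pipeline should count tokens with its embedding model's tokenizer.

```python
import re

def split_context_md(text, chunk_size=4000, overlap=400):
    """Split a ContextMD file on its per-page '## Source:' headers,
    then sub-split oversized sections with a sliding overlap window.
    Sizes are in characters here; swap in a tokenizer for token budgets."""
    # Split at zero-width positions so each piece starts with its header.
    sections = re.split(r"(?m)^(?=## Source:)", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= chunk_size:
            chunks.append(section)
        else:
            # Slide a window with overlap to maintain context across breaks.
            step = chunk_size - overlap
            for start in range(0, len(section), step):
                chunks.append(section[start:start + chunk_size])
    return chunks
```

Each resulting chunk begins with its own `## Source:` header, which keeps the page attribution attached through the embedding step.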
### 2. Metadata Extraction

ContextMD automatically inserts source headers for every processed page:

```markdown
## Source: [Page Title](https://example.com/page)
```
When ingesting, use a regex or a Markdown parser to extract these URLs and assign them to the `source` metadata field in your vector database. This ensures that when your RAG system cites a source, it points back to the live documentation.
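As a sketch, the regex approach looks like this. The `title` and `source` field names are placeholders; use whatever metadata schema your vector store expects.

```python
import re

# Matches ContextMD's per-page headers: ## Source: [Title](URL)
SOURCE_RE = re.compile(
    r"^## Source: \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)",
    re.MULTILINE,
)

def extract_source_metadata(text):
    """Return one metadata dict per '## Source:' header in the file."""
    return [
        {"title": m.group("title"), "source": m.group("url")}
        for m in SOURCE_RE.finditer(text)
    ]
```

Attach the dict for a given page to every chunk derived from that page's section so citations resolve to the correct URL.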
## Integration Examples

### LangChain (TypeScript)

Use the `MarkdownTextSplitter` to maintain the structural integrity of the refined documentation.
```typescript
import { TextLoader } from "langchain/document_loaders/fs/text";
import { MarkdownTextSplitter } from "langchain/text_splitter";

// 1. Load the context.md file
const loader = new TextLoader("context.md");
const docs = await loader.load();

// 2. Split based on Markdown headers
const splitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 100,
});
const chunks = await splitter.splitDocuments(docs);

// 3. Add to a vector store (e.g., Pinecone, Milvus, Supabase)
// Each chunk preserves the refined technical density
```
### LlamaIndex (Python)

LlamaIndex can ingest the `context.md` file using the `MarkdownReader` to build a queryable vector index.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.readers.file import MarkdownReader

# Initialize the Markdown reader
reader = MarkdownReader()
documents = SimpleDirectoryReader(
    input_files=["context.md"],
    file_extractor={".md": reader},
).load_data()

# Build the index and a query engine over it
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# The agent now has access to refined, high-density context
response = query_engine.query("How do I configure the authentication middleware?")
```
## Why ContextMD for RAG?

Integrating ContextMD into your RAG pipeline offers several advantages over standard scrapers:
- Reduced Noise: By stripping navigation bars, footers, and sidebars before embedding, you save on vector storage costs and improve retrieval accuracy (the model doesn't retrieve "Next Page" or "Sign Up" links).
- Token Efficiency: The LLM-powered refinement step compresses verbose tutorials into concise technical logic, allowing more documentation to fit into the same retrieval window.
- Improved Reasoning: Since the content is rewritten for "machine comprehension," embedding models can more easily capture the semantic relationships between API signatures and their implementations.
## Continuous Integration (CI)

For production RAG systems, run ContextMD as a scheduled job (e.g., via GitHub Actions) to regenerate the `context.md` file whenever your documentation changes. This keeps your vector database in sync with your latest product updates.
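A scheduled workflow for this might look like the following sketch. The `contextmd` invocation, the docs URL, and the `ingest.py` re-ingestion step are all assumptions, stand-ins for however you run ContextMD and load your vector store; only the scheduling pattern is the point.

```yaml
name: Regenerate context.md
on:
  schedule:
    - cron: "0 3 * * 1"   # every Monday at 03:00 UTC
  workflow_dispatch: {}    # allow manual runs

jobs:
  regenerate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Hypothetical CLI invocation -- replace with your actual ContextMD command.
      - name: Run ContextMD
        run: contextmd https://docs.example.com --output context.md
      # Hypothetical re-ingestion step -- replace with your vector-store loader.
      - name: Re-ingest into vector store
        run: python ingest.py context.md
```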