# Crawl Engine
The Crawl Engine is the core component responsible for traversing documentation websites and discovering content. It uses a breadth-first search (BFS) strategy to navigate from your provided entry point, ensuring that every relevant sub-page is identified and queued for processing.
## How the Crawler Works
ContextMD is designed to be "domain-aware." When you provide a base URL, the engine pins itself to that specific hostname. This prevents the crawler from "leaking" into external sites like GitHub (for star counts), Twitter, or third-party integrations often found in documentation footers.
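The boundary check can be sketched with the WHATWG `URL` API. The helper below is hypothetical, written only to illustrate the rule; the engine's actual internals may differ:

```typescript
// Hypothetical helper illustrating the same-hostname rule.
// A link is followed only if its hostname matches the entry point's.
function isSameDomain(baseUrl: string, link: string): boolean {
  try {
    const base = new URL(baseUrl);
    const target = new URL(link, base); // also resolves relative links
    return target.hostname === base.hostname;
  } catch {
    return false; // malformed links are simply ignored
  }
}
```

Under this rule, a relative link like `/guide` resolves against the base and passes, while a footer link to GitHub or Twitter does not.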
### Key Logic
- Normalization: Every discovered link is converted to an absolute URL. Fragment identifiers (e.g., `#section-1`) are stripped to prevent redundant crawling of the same page.
- Breadth-First Traversal: The engine processes the queue page by page, extracting new links from the HTML as it goes.
- Deduplication: An internal registry of visited URLs ensures that no page is processed twice, even if it is linked from multiple locations.
- Content Filtering: The engine only processes `text/html` content types, automatically skipping assets like PDFs, images, or ZIP files.
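Taken together, the steps above amount to a classic BFS over a visited set. Here is a minimal sketch; the function names and queue shape are illustrative, not the library's internals:

```typescript
// Normalize a link: resolve it against the current page and strip #fragments.
function normalizeUrl(link: string, base: string): string | null {
  try {
    const u = new URL(link, base); // make absolute
    u.hash = '';                   // drop the fragment identifier
    return u.href;
  } catch {
    return null; // malformed link: ignore
  }
}

// Breadth-first traversal with deduplication, driven by a static link map
// (the real engine extracts links from fetched HTML instead).
function bfsOrder(
  start: string,
  links: Record<string, string[]>,
  limit: number
): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  const order: string[] = [];
  while (queue.length > 0 && order.length < limit) {
    const page = queue.shift()!;
    order.push(page);
    for (const raw of links[page] ?? []) {
      const url = normalizeUrl(raw, page);
      if (url && !visited.has(url)) { // each page is queued at most once
        visited.add(url);
        queue.push(url);
      }
    }
  }
  return order;
}
```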
## Usage via CLI
The crawler is triggered automatically when you run the `contextmd` command. You can control its behavior using the `--limit` flag.
```bash
# Crawl the site but stop after 50 pages
npx contextmd https://docs.example.com --limit 50
```
| Option | Default | Description |
| :--- | :--- | :--- |
| `<url>` | (Required) | The starting point for the crawl. The hostname of this URL defines the crawl boundary. |
| `-l, --limit` | `100` | The maximum number of unique pages to crawl. Use this to cap runaway crawls on massive sites. |
## Programmatic Interface
If you are extending ContextMD or using the Crawler class within your own TypeScript project, you can interact with the engine directly.
### `Crawler` Class
The Crawler class handles the network requests and link extraction.
```typescript
import { Crawler } from 'contextmd/dist/crawler';

const crawler = new Crawler('https://docs.example.com');

// Start the crawl
const pages = await crawler.crawl(50, (url) => {
  console.log(`Discovered: ${url}`);
});
```
#### `crawl(maxPages, onUrlFound)`

- `maxPages` (number): The hard limit for the number of pages to fetch.
- `onUrlFound` (callback): An optional function triggered every time a new, valid URL is pulled from the queue for processing.
- Returns: `Promise<Page[]>`, an array of objects containing the `url`, the raw `content` (HTML), and the page `title`.
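The returned objects have roughly this shape. Both the `Page` interface below (inferred from the description above) and the lookup helper are illustrative sketches, not part of the library:

```typescript
// Hypothetical Page shape, inferred from the crawl() return description.
interface Page {
  url: string;     // absolute URL of the fetched page
  content: string; // raw HTML
  title: string;   // page title
}

// Illustrative helper: index crawl results by URL for quick lookup.
function indexByUrl(pages: Page[]): Map<string, Page> {
  const index = new Map<string, Page>();
  for (const page of pages) {
    index.set(page.url, page);
  }
  return index;
}
```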
## Crawling Constraints
To ensure high-quality context and respect for the source site, the engine follows these strict rules:
- Same-Domain Policy: The crawler will only follow links that match the hostname of the initial URL. For example, if you start at `docs.example.com`, it will not follow links to `blog.example.com`.
- Protocol Support: Only `http:` and `https:` protocols are supported.
- User-Agent: The crawler identifies itself as `AgenticDocsConverter/1.0`.
- Timeouts: To prevent hung processes, individual page fetches time out after 10 seconds.
- Error Handling: If a single page fails to load (404, 500, etc.), the crawler logs the issue internally and continues to the next URL in the queue without crashing the entire process.
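These constraints can be combined into a single resilient fetch. The sketch below assumes Node 18+ `fetch` with `AbortController` for the timeout; it mirrors the rules above but is not the engine's actual implementation:

```typescript
// Illustrative sketch: fetch one page under the constraints above.
// Returns the HTML string, or null if the page should be skipped.
async function fetchPage(url: string, timeoutMs = 10_000): Promise<string | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      signal: controller.signal,
      headers: { 'User-Agent': 'AgenticDocsConverter/1.0' },
    });
    if (!res.ok) return null; // 404, 500, etc.: skip and keep crawling
    const type = res.headers.get('content-type') ?? '';
    if (!type.includes('text/html')) return null; // skip PDFs, images, ZIPs
    return await res.text();
  } catch {
    return null; // network error or timeout: skip this page
  } finally {
    clearTimeout(timer);
  }
}
```

Returning `null` instead of throwing is what lets the crawl loop simply move on to the next queued URL.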