Noise Reduction Logic
Overview of Noise Reduction
To ensure AI agents receive high-signal context without being distracted by UI clutter, ContextMD employs a multi-stage noise reduction pipeline. This process automatically identifies, isolates, and strips non-essential elements such as navigation bars, footers, and sidebars before the content is transformed into Markdown.
Structural HTML Cleaning
The first layer of noise reduction happens at the HTML level. The Processor utilizes a structural analysis strategy to remove boilerplate components that do not contribute to the core documentation.
Element Exclusions
The following elements are targeted and removed from the DOM:
- Scripts and Styles:
<script>and<style>tags are purged to prevent code execution or styling metadata from entering the context. - Navigation Components: Elements with
<nav>tags,.navclasses, or[role="navigation"]attributes. - UI Scaffolding: Footers (
<footer>,.footer), sidebars (.sidebar), and iframes (<iframe>) are discarded. - Accessibility Fallbacks:
<noscript>tags are removed to avoid redundant text.
Content Prioritization
Once the clutter is removed, the processor attempts to locate the "source of truth" within the page. It follows a hierarchical fallback system to find the primary content container:
<main>: The primary semantic container for the page.<article>: Used if a main tag is absent, common in blog-style documentation.<body>: The final fallback if no semantic containers are identified.
AI-Powered Semantic Refinement
After the HTML is cleaned and converted to Markdown, the system performs a final "semantic" noise reduction using a high-speed LLM (gpt-4o-mini). This step ensures the output is "Agentic-ready" by applying the following logic:
- Fluff Removal: Strips conversational filler and introductory pleasantries (e.g., "In this section, we will explore...").
- Density Optimization: Re-writes content to be extremely high-density and concise while strictly preserving API signatures and code blocks.
- Keyword Prioritization: Optimizes the text for retrieval, ensuring that technical constraints and logic are clear and discoverable.
Handling Edge Cases
While the noise reduction logic is aggressive, it is designed to be safe for technical documentation:
- Class-Based Sidebars: The logic specifically targets classes like
.sidebar. While this carries a minor risk of removing content if a site uses non-standard naming, it is industry-standard for documentation frameworks like Docusaurus, Mintlify, and GitBook. - Code Block Preservation: The cleaning process occurs before Markdown conversion, and the AI refinement step is explicitly instructed to maintain all technical constraints and fenced code blocks.
Example: Before and After
Raw Scrape Content:
<nav>...</nav>
<main>
<h1>Installation</h1>
<p>Welcome! In this guide, we'll help you install the tool.</p>
<pre><code>npm install contextmd</code></pre>
</main>
<aside class="sidebar">Related links...</aside>
<footer>Copyright 2024</footer>
Cleaned AI Context Output:
## Source: [Installation](https://docs.example.com/install)
# Installation
Install the utility via NPM:
`npm install contextmd`