Introduction

Context-Dump is a high-performance, security-first native Rust engine designed to aggregate source code and technical documentation into a unified, token-optimized context format. It acts as a secure bridge between complex, multi-format file systems and Large Language Models (LLMs).

The Core Problem

Software projects are increasingly complex, containing not just code, but also sensitive configuration, minified third-party assets, and deeply nested archives. Feeding this data to an LLM manually is risky and inefficient.

The Solution

Context-Dump provides a highly concurrent binary that:

  • Secures Data: Automatically detects and masks PII (Emails, IPs, Credit Cards) and Secrets (AWS, JWT, Stripe).
  • Detects Noise: Flags suspicious, minified, or obfuscated code that consumes tokens without providing value.
  • Tracks Origin: Injects “Provenance” metadata (Source URL and Commit Hash) for remote repositories.
  • Respects Privacy: Maps .gitignore rules onto its internal selection state, so ignored files start deselected, while leaving the final choice to the user.
  • Native Extraction: Processes PDFs, Office docs, and ZIP/TAR archives without external runtimes.

Design Philosophy: Absolute Portability & RAII

The project follows a “Tank” architecture: a single static binary with no runtime dependencies. It uses strict RAII (Resource Acquisition Is Initialization) patterns so that temporary data from remote clones is wiped from disk whenever the owning value goes out of scope, including on early returns and on panics that unwind the stack.
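The RAII cleanup idea can be sketched with the standard library alone. `TempDirGuard` and its methods are hypothetical names chosen for illustration, not the project's actual types. Note that `Drop` runs on normal scope exit and during panic unwinding; it does not run if the process is aborted or killed by a signal.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Owns a temporary directory and deletes it when dropped.
/// Hypothetical type for illustration; the project's real guard may differ.
struct TempDirGuard {
    path: PathBuf,
}

impl TempDirGuard {
    /// Creates the directory (and any missing parents) and takes over its cleanup.
    fn create(path: impl Into<PathBuf>) -> io::Result<Self> {
        let path = path.into();
        fs::create_dir_all(&path)?;
        Ok(Self { path })
    }

    /// Location of the guarded directory, e.g. the target of a remote clone.
    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempDirGuard {
    fn drop(&mut self) {
        // Best effort: errors cannot be propagated out of `drop`.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

Because the guard owns the path, any early `return` or `?` in the code that holds it still triggers the cleanup.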

System Architecture

Context-Dump is structured using a Modular Native Architecture. We deliberately avoid overly abstract patterns (like pure Hexagonal/Ports-and-Adapters) where direct implementation provides better performance and readability, while maintaining clear boundaries between IO and Domain logic.

System Flow

The application operates as a linear pipeline, transforming file system paths into a single, token-optimized string.

graph TD
    A[Project Root / Remote URL] -->|FsScanner: WalkBuilder| B(File Tree Discovery)
    B -->|Smart Filters: NoiseDetector| C{Selection Mode}
    C -->|TUI: Mouse & Keyboard| D[Interactive Selection]
    C -->|CLI: Automated| E[Filter Engine]
    
    D & E --> F[Confirmed FileNodes]
    F -->|FsReader: Rayon Parallel| G{Format Router}
    
    G -->|.pdf / .docx / .xlsx| H[Native Document Parsers]
    G -->|.zip / .tar.gz| I[Archive X-Ray Engine]
    G -->|Text / .ipynb| J[Structure Parsers]
    
    H & I & J --> K[PII Masker: Luhn & Regex]
    K --> L[Code Analyzer: Obfuscation Detect]
    L --> M[Tokenization]
    M --> N[Priority Sorter: Hoisting Docs]
    N --> O[XML / Markdown Generator]
    O --> P[Secure Dispatcher: Clipboard/File]
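The pipeline names Luhn validation as part of the PII Masker stage. A minimal sketch of how a Luhn checksum can confirm that a regex match is plausibly a card number before it is masked; `luhn_valid` is a hypothetical helper, and the 13 to 19 digit bound is an assumption about typical card lengths.

```rust
/// Returns true if `candidate` passes the Luhn checksum.
/// Spaces and dashes are ignored; any other non-digit rejects the input.
/// Hypothetical helper for illustration, not Context-Dump's actual API.
fn luhn_valid(candidate: &str) -> bool {
    let mut digits = Vec::new();
    for ch in candidate.chars() {
        match ch {
            ' ' | '-' => continue,
            c => match c.to_digit(10) {
                Some(d) => digits.push(d),
                None => return false,
            },
        }
    }
    // Assumed length bound for payment card numbers.
    if digits.len() < 13 || digits.len() > 19 {
        return false;
    }
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            // Every second digit, counted from the right, is doubled.
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}
```

Running the checksum after the regex match keeps arbitrary 16-digit identifiers from being masked as credit cards.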

Core Layers

1. The Domain Layer (src/core/)

Contains the fundamental data structures that have no dependencies on the outside world.

  • FileNode: A lightweight pointer to a filesystem entry containing metadata (hidden status, ignored status, token estimate).
  • FileContext: The final domain object containing the raw extracted text, language identification, and token count.
  • ContextConfig: The central state object defining rules, limits, and output preferences.
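As a rough sketch, the three domain types could look like the following. Only the type names and `token_estimate` come from this document; every other field name, the `OutputFormat` enum, and `selected_by_default` are assumptions made for illustration.

```rust
use std::path::PathBuf;

/// Lightweight pointer to a filesystem entry (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
struct FileNode {
    path: PathBuf,
    is_hidden: bool,
    is_ignored: bool,
    token_estimate: usize,
}

impl FileNode {
    /// Hidden or ignored entries start deselected.
    fn selected_by_default(&self) -> bool {
        !self.is_hidden && !self.is_ignored
    }
}

/// Final domain object produced after extraction.
#[derive(Debug, Clone, PartialEq)]
struct FileContext {
    path: PathBuf,
    content: String,
    language: String,
    token_count: usize,
}

/// Hypothetical enum for the two documented output formats.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OutputFormat {
    Xml,
    Markdown,
}

/// Central state object: rules, limits, and output preferences.
#[derive(Debug, Clone, PartialEq)]
struct ContextConfig {
    extensions: Vec<String>,
    exclude_substrings: Vec<String>,
    max_depth: Option<usize>,
    format: OutputFormat,
}
```

None of these types touch IO, which is what keeps the layer free of outside dependencies.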

2. The Adapter Layer (src/adapters/)

Handles interactions with external systems and complex data formats.

  • Scanners: Uses the ignore crate to traverse directories while respecting standard exclusion rules.
  • Parsers: Pure-Rust implementations that extract raw text from structured binary formats (PDFs, Office docs) and structured JSON (Jupyter Notebooks).
  • Output: Serializes FileContext arrays into LLM-friendly formats (XML, Markdown).

3. The Engine Layer (src/engine/)

The orchestrator. It uses rayon to manage a thread pool, parallelizing the ingestion of files. It connects the FsScanner to the FsReader and finally to the OutputDispatcher.
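To illustrate the parallel ingestion step without pulling in an external crate, the sketch below swaps rayon for `std::thread::scope` from the standard library. It is a stand-in for the idea, not the engine's code; `read_all_parallel` and its `workers` parameter are hypothetical.

```rust
use std::fs;
use std::path::PathBuf;
use std::thread;

/// Reads every path in parallel and returns (path, contents) pairs in input order.
/// Unreadable files yield `None` instead of aborting the whole batch.
fn read_all_parallel(paths: &[PathBuf], workers: usize) -> Vec<(PathBuf, Option<String>)> {
    let workers = workers.max(1);
    // Ceiling division so that every path lands in exactly one chunk.
    let chunk_size = ((paths.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = paths
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|p| (p.clone(), fs::read_to_string(p).ok()))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Joining in spawn order preserves the original ordering.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

With rayon, the same shape is usually written as a parallel iterator over the paths; the scoped-thread version only makes the chunking explicit.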

4. The UI Layer (src/ui/)

A deterministic, state-driven Terminal User Interface built with ratatui. It translates the linear file tree into a navigable, hierarchical visual tree, allowing users to select subsets of a project dynamically.

Headless Mode (CLI)

The non-interactive mode is designed for speed and automation. It allows developers to integrate context extraction into shell scripts, CI/CD pipelines, or quick terminal workflows.

Basic Execution

To scan the current directory and generate an XML report:

context .

By default, this scans the folder, applies the smart ignore heuristics, extracts the content, and writes the result to standard output (or launches the TUI if no flag indicates automation).

Core Arguments

| Flag | Name | Description |
|------|------|-------------|
| -o, --output <FILE> | Output Path | Writes the final report to the specified file. |
| -f, --format <FMT> | Format | Forces serialization format (xml or markdown). |
| -s, --stdout | Standard Output | Forces the report to the terminal stdout stream. |
| -c, --clip | Clipboard | Copies the final report directly to the OS clipboard. |
| -I, --interactive | Force TUI | Forces the Terminal User Interface to launch. |

Filtering Context

You can heavily restrict what is included in the dump to preserve token limits:

Extension Whitelisting

Only include specific file types:

context . -e rs,md,toml

Path Exclusion

Exclude any directory or file whose path contains one of the given substrings:

context . -X tests,migrations,vendor

Depth Limiting

Prevent the scanner from diving too deeply into the file tree:

context . --depth 2
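The three filters above can be combined into a single predicate. This is a sketch under assumptions: the `Filters` struct and `is_included` are hypothetical, extension matching is assumed to be case-insensitive, and depth is assumed to count path components relative to the scan root.

```rust
use std::path::Path;

/// Filter settings mirroring `-e`, `-X`, and `--depth` (hypothetical struct).
struct Filters<'a> {
    /// Allowed extensions; empty means every extension is allowed.
    extensions: &'a [&'a str],
    /// Substrings matched against the whole relative path.
    excludes: &'a [&'a str],
    /// Maximum number of path components below the scan root (assumed semantics).
    max_depth: Option<usize>,
}

/// `relative` is the path relative to the scan root.
fn is_included(relative: &Path, filters: &Filters) -> bool {
    let text = relative.to_string_lossy();
    if filters.excludes.iter().any(|needle| text.contains(*needle)) {
        return false;
    }
    if let Some(limit) = filters.max_depth {
        if relative.components().count() > limit {
            return false;
        }
    }
    if filters.extensions.is_empty() {
        return true;
    }
    relative
        .extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| {
            filters
                .extensions
                .iter()
                .any(|allowed| allowed.eq_ignore_ascii_case(ext))
        })
}
```

Checking exclusions first is the cheap path: a file under `vendor` is rejected before its extension or depth is examined.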

Example: Quick PR Review Context

To copy the source code of a specific module to your clipboard for an LLM review, formatted as Markdown:

context src/api -c -f markdown -e rs

Interactive Mode (TUI)

The Terminal User Interface (TUI) provides a deterministic, visual way to explore a project’s architecture and selectively include files into the context dump.

Initialization

Running the application without explicit filtering flags or output targets launches the TUI:

context

Interface Layout

The interface is divided into three distinct panels:

  1. Project Explorer (Top): A hierarchical view of the scanned file system. Hidden or ignored files are displayed in dark gray and deselected by default.
  2. Context Summary (Middle): Real-time statistics showing the total selected files, total token budget, target output destination, and a language distribution breakdown.
  3. Controls (Bottom): A reference bar for keyboard shortcuts.

| Action | Keybinding | Description |
|--------|------------|-------------|
| Move Cursor | Up / Down | Navigates vertically through the visible list. |
| Expand/Collapse | Right / Left | Opens or closes the currently selected directory. |
| Toggle Selection | Space | Selects or deselects the highlighted file. If a directory is highlighted, it recursively toggles all its children. |
| Cycle Destination | o | Toggles the output target between TERMINAL, FILE, and SYSTEM CLIPBOARD. |
| Cycle Format | f | Toggles the output structure between XML and MARKDOWN. |
| Execute | Enter | Confirms the selection, exits the TUI, and begins the extraction phase. |
| Quit | q or Esc | Aborts the application without processing. |
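The recursive behaviour of Toggle Selection can be sketched as follows. `TreeNode` is a hypothetical shape for the selection tree, and treating a node without children as a file is a simplification made for this example.

```rust
/// A node in the selection tree (hypothetical shape for illustration).
struct TreeNode {
    name: String,
    selected: bool,
    /// Empty for plain files in this simplified model.
    children: Vec<TreeNode>,
}

impl TreeNode {
    /// Sets this node and every descendant to `value`.
    fn set_recursive(&mut self, value: bool) {
        self.selected = value;
        for child in &mut self.children {
            child.set_recursive(value);
        }
    }

    /// Space-bar behaviour: flip the highlighted node and push the new
    /// state down to all children, so a directory toggles as a unit.
    fn toggle(&mut self) {
        let next = !self.selected;
        self.set_recursive(next);
    }

    /// Counts selected files (leaves), as a summary panel might display.
    fn selected_files(&self) -> usize {
        if self.children.is_empty() {
            usize::from(self.selected)
        } else {
            self.children.iter().map(TreeNode::selected_files).sum()
        }
    }
}
```

Deriving the new state from the directory itself, rather than flipping each child independently, keeps a mixed selection from turning into its inverse.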

Smart Selection

The TUI automatically pre-selects files based on the engine’s noise detection heuristics. Build artifacts, lockfiles, and hidden directories (.git) are automatically excluded from the initial selection, ensuring that a rapid Enter press yields a clean, LLM-ready context block.

Configuration & State Management

Context-Dump implements a silent state-persistence mechanism. This ensures that a developer’s workflow remains uninterrupted across multiple executions without requiring them to repeatedly pass the same CLI flags.

Persistence Lifecycle

When the application executes, it follows this configuration lifecycle:

  1. Load Phase: The engine checks the user’s OS-specific configuration directory (e.g., ~/.config/context/last_run.json on Linux or %AppData% on Windows). If it exists, it deserializes the previous session’s state.
  2. Override Phase: Any explicit flags passed via the CLI during the current execution (e.g., -f markdown) override the loaded persistent state.
  3. Save Phase: Upon successful confirmation in the TUI, the active ContextConfig is serialized and saved back to the disk.

Note: Headless CLI executions (where flags like --stdout or --output are used) bypass the TUI and do not overwrite the persistent state, ensuring automated scripts do not pollute your interactive preferences.
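The override and save rules described above can be sketched as two small functions. All type and function names here are hypothetical, and deserialization from last_run.json is left out to keep the example free of external crates.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Xml,
    Markdown,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Destination {
    Terminal,
    File,
    Clipboard,
}

/// State restored from the previous session (fields are assumptions).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Persisted {
    format: Format,
    destination: Destination,
}

/// Flags given on the current command line; `None` means "not passed".
#[derive(Debug, Default, Clone, Copy)]
struct CliOverrides {
    format: Option<Format>,
    destination: Option<Destination>,
}

/// Override phase: explicit flags win, everything else falls back to the loaded state.
fn resolve(loaded: Persisted, cli: CliOverrides) -> Persisted {
    Persisted {
        format: cli.format.unwrap_or(loaded.format),
        destination: cli.destination.unwrap_or(loaded.destination),
    }
}

/// Save phase: only a confirmed TUI session writes state back to disk.
fn should_persist(ran_tui: bool, confirmed: bool) -> bool {
    ran_tui && confirmed
}
```

Modelling each flag as an `Option` makes "not passed" distinct from "passed with the default value", which is what lets the loaded state survive an unrelated flag.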

Smart Ignore Heuristics

The engine employs a multi-layered approach to filter out irrelevant data via the NoiseDetector module. This is more aggressive than a standard .gitignore parser.

1. Standard Exclusions

Common development artifacts are hard-blocked to save IO time:

  • Version Control: .git, .svn, .hg
  • Dependency Caches: node_modules, vendor, .venv, bin, obj
  • Build Outputs: target, dist, build, out

2. Heavy Artifact Detection

Files are evaluated based on extension and size limits. Even if a file isn’t explicitly ignored, it will be flagged as noise if it exceeds safety limits:

  • Source code: Max 50 MB (prevents reading massive minified JS files).
  • Data files (XML/JSON/CSV): Max 250 MB.
  • Binary Office files: Max 1 GB (handles large documentation).
  • Pure Binaries: .exe, .dll, .png, .mp4, .zip are hard-blocked to prevent token wastage.
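The size rules above map naturally onto a single match on the file extension. This is a sketch, not the NoiseDetector itself: `is_noise` is a hypothetical name, 1 MB is assumed to mean 1024 × 1024 bytes, and the extensions assigned to the data and Office categories are assumptions based on the formats mentioned in this document.

```rust
const MB: u64 = 1024 * 1024;

/// Returns true when a file should be flagged as noise based on its
/// extension and size. Hypothetical helper for illustration.
fn is_noise(extension: &str, size_bytes: u64) -> bool {
    let ext = extension.to_ascii_lowercase();
    match ext.as_str() {
        // Pure binaries are blocked regardless of size.
        "exe" | "dll" | "png" | "mp4" | "zip" => true,
        // Data files: 250 MB ceiling.
        "xml" | "json" | "csv" => size_bytes > 250 * MB,
        // Office documents: 1 GB ceiling (extension list is an assumption).
        "docx" | "xlsx" => size_bytes > 1024 * MB,
        // Everything else is treated as source code: 50 MB ceiling.
        _ => size_bytes > 50 * MB,
    }
}
```

The size would typically come from file metadata, so the check can run before any file content is read.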

Future Roadmap & Epics

The following epics outline the planned features and architectural expansions for Context-Dump. They are categorized by technical impact and priority.

Epic 1: Security & Protection (High Priority)

Preventing the accidental leakage of sensitive credentials to LLMs.

  • Hardcoded Blacklist: Absolute denial of parsing for standard key files (.env, id_rsa, *.pem, *.key).
  • TUI Security Warnings: A dedicated UI panel that flashes red to alert the user if a wallet or authentication directory is detected.
  • Entropy Filtering: Pre-parsing analysis to detect high-entropy strings (potential API keys or passwords) and sanitize them before output.
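One common way to approach the planned entropy filter is Shannon entropy over the characters of a token. The sketch below is an illustration of that technique, not a committed design; `looks_like_secret`, the minimum length of 20, and the caller-supplied threshold are all assumptions.

```rust
use std::collections::HashMap;

/// Shannon entropy of `s`, in bits per character.
fn shannon_entropy(s: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for ch in s.chars() {
        *counts.entry(ch).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Flags long tokens whose entropy exceeds `threshold`.
/// The minimum length of 20 is an illustrative value, not a project setting.
fn looks_like_secret(token: &str, threshold: f64) -> bool {
    token.chars().count() >= 20 && shannon_entropy(token) > threshold
}
```

A length floor matters because short strings cannot reach high entropy values, and ordinary identifiers would otherwise dominate the false positives.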

Epic 2: Token Economics & Budgeting

Enhancing the user’s control over the exact payload size sent to the LLM.

  • Test Exclusion Shortcut: A dedicated TUI shortcut to instantly deselect all localized test files (e.g., *_test.rs, *.spec.ts, tests/).
  • Token Weight Heatmap: Rendering file names in the TUI using color gradients (Green/Yellow/Red) based on their token_estimate size.
  • Context Truncation: Setting a hard token limit per file; if exceeded, the parser will append an <omitted for brevity> marker instead of halting or flooding the context.
  • Priority Dumping: Reordering the final XML/Markdown output so critical files (README.md, configuration files) appear at the top, capitalizing on the LLM’s primary attention window.
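Context Truncation and Priority Dumping can each be sketched in a few lines. These are illustrations under assumptions: whitespace splitting stands in for a real tokenizer, the ranking rules in `priority_rank` are invented for the example, and all function names are hypothetical.

```rust
/// Truncates `text` to at most `max_tokens` whitespace-separated words and
/// appends the marker when something was cut. Untruncated text is returned as-is.
fn truncate_tokens(text: &str, max_tokens: usize) -> String {
    let mut words = text.split_whitespace();
    let kept: Vec<&str> = words.by_ref().take(max_tokens).collect();
    if words.next().is_none() {
        return text.to_string();
    }
    let mut out = kept.join(" ");
    out.push_str("\n<omitted for brevity>");
    out
}

/// Lower rank sorts first; README and configuration files are hoisted.
/// The rules are illustrative only.
fn priority_rank(file_name: &str) -> u8 {
    let lower = file_name.to_ascii_lowercase();
    if lower.starts_with("readme") {
        0
    } else if lower.ends_with(".toml") || lower.ends_with(".json") || lower.ends_with(".yaml") {
        1
    } else {
        2
    }
}

/// Stable sort, so files keep their original order within each priority class.
fn hoist_priority_files(files: &mut [String]) {
    files.sort_by_key(|name| priority_rank(name));
}
```

Note that the truncated output is re-joined with single spaces, so original line breaks in the kept portion are lost; a production version would more likely cut on byte or token offsets.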

Epic 3: UI Quality of Life & State Persistence

  • Mass Collapse/Expand: Shortcuts (e.g., Shift+Arrows) to recursively expand or collapse the entire file tree.
  • Project-Specific Persistence: Modifying the state manager to remember file selections on a per-project basis. If new files are added, they are selected by default, while previously unselected files remain ignored.
  • Clipboard Limits: Checking the native OS clipboard buffer capacity and warning the user if the project dump exceeds it.

Epic 4: External Integrations

  • Remote Repository Ingestion: Allowing the CLI to accept a Git URL, cloning it to a temporary directory, extracting the context, and self-cleaning.
  • Web Content Ingestion: A lightweight, headless HTTP crawler to extract text from a provided documentation URL without requiring heavy external browser binaries.
  • Third-Party File Filtering: Specific detection algorithms to auto-exclude generated dist or vendor content injected by package managers.