A high-performance web service that converts JavaScript-heavy documentation sites into clean, LLM-optimized Markdown. Built specifically to solve the problem of AI agents being unable to parse modern documentation sites that rely heavily on client-side rendering.
Why llm.codes exists: Modern AI agents like Claude Code struggle with JavaScript-heavy documentation sites, particularly Apple's developer docs. This tool bridges that gap by converting dynamic content into clean, parseable Markdown that AI agents can actually use.
📖 Read the full story: How llm.codes Transforms Developer Documentation for AI Agents
Modern documentation sites (especially Apple's) use heavy JavaScript rendering that makes content invisible to AI agents. llm.codes solves this by:
- Using Firecrawl's headless browser to execute JavaScript and capture fully-rendered content
- Converting dynamic HTML to clean, semantic Markdown
- Removing noise (navigation, footers, duplicate content) that wastes AI context tokens
- Providing parallel URL processing for efficient multi-page documentation crawling
- Parallel Processing: Fetches up to 20 URLs concurrently using batched promises
- Smart Caching: Redis-backed 30-day cache reduces API calls and improves response times
- Content Filtering: Multiple filtering strategies to remove:
- Navigation elements and boilerplate
- Platform availability strings (iOS 14.0+, etc.)
- Duplicate content across pages
- Empty sections and formatting artifacts
- Recursive Crawling: Configurable depth-first crawling with intelligent link extraction
- Browser Notifications: Web Notifications API integration for background processing alerts
- URL State Management: Query parameter-based URL sharing for easy documentation links
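The batched-promise approach behind the parallel fetching above can be sketched as follows (a minimal illustration; `processInBatches` and its signature are not the project's actual helper):

```typescript
// Process items in fixed-size batches, awaiting each batch before starting
// the next. Failed items are skipped rather than failing the whole run.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const settled = await Promise.allSettled(batch.map(worker));
    for (const s of settled) {
      if (s.status === "fulfilled") results.push(s.value);
    }
  }
  return results;
}
```

`Promise.allSettled` (rather than `Promise.all`) lets one failed URL in a batch be dropped without aborting the remaining fetches.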
🚀 Try it now at llm.codes
Experience the tool instantly without any setup required.
- Node.js 20+
- npm or yarn
- Firecrawl API key
- Clone the repository:
git clone https://github.com/amantus-ai/llm-codes.git
cd llm-codes
- Install dependencies:
npm install
- Create a `.env.local` file:
cp .env.local.example .env.local
- Add your Firecrawl API key to `.env.local`:
# Required
FIRECRAWL_API_KEY=your_api_key_here
# Optional - Redis Cache (Recommended for production)
UPSTASH_REDIS_REST_URL=https://your-redis-instance.upstash.io
UPSTASH_REDIS_REST_TOKEN=your_redis_token_here
# Optional - Cache Admin
CACHE_ADMIN_KEY=your_secure_admin_key_here
- Run the development server:
npm run dev
The easiest way to deploy is using Vercel:
- Click the button above
- Create a new repository
- Add your `FIRECRAWL_API_KEY` environment variable
- Deploy!
- Push to your GitHub repository
- Import project on Vercel
- Add environment variables:
  - `FIRECRAWL_API_KEY`: Your Firecrawl API key (required)
  - `UPSTASH_REDIS_REST_URL`: Your Upstash Redis URL (optional)
  - `UPSTASH_REDIS_REST_TOKEN`: Your Upstash Redis token (optional)
  - `CACHE_ADMIN_KEY`: Admin key for cache endpoints (optional)
- Deploy
- Enter URL: Paste any documentation URL
- Most documentation sites are automatically supported through pattern matching
- Click "Learn more" to see the supported URL patterns
- Configure Options (click "Show Options"):
- Crawl Depth: How deep to follow links (0 = main page only, max 5)
- Max URLs: Maximum number of pages to process (1-1000, default 200)
- Filter URLs: Remove hyperlinks from content (recommended for LLMs)
- Deduplicate Content: Remove duplicate paragraphs to save tokens
- Filter Availability: Remove platform availability strings (iOS 14.0+, etc.)
- Process: Click "Process Documentation" and grant notification permissions if prompted
- Monitor Progress:
- Real-time progress bar shows completion percentage
- Activity log displays detailed processing information
- Browser notifications alert you when complete
- Download: View statistics and download your clean Markdown file
llm.codes uses intelligent pattern matching to support most documentation sites automatically. Rather than maintaining a list of thousands of individual sites, we use regex patterns to match common documentation URL structures.
We support documentation sites that match these patterns:
- Documentation Subdomains (`docs.*`, `developer.*`, `learn.*`, etc.)
  - Pattern: Any subdomain like docs, developer, dev, learn, help, api, guide, wiki, or devcenter
  - Examples: `docs.python.org`, `developer.apple.com`, `learn.microsoft.com`
- Documentation Paths (`/docs`, `/guide`, `/learn`, etc.)
  - Pattern: URLs ending with paths like /docs, /documentation, /api-docs, /guides, /learn, /help, /stable, or /latest
  - Examples: `angular.io/docs`, `redis.io/docs`, `react.dev/learn`
- Programming Language Sites (`*js.org`, `*lang.org`, etc.)
  - Pattern: Domains ending with js, lang, py, or -doc followed by .org or .com
  - Examples: `vuejs.org`, `kotlinlang.org`, `ruby-doc.org`
- GitHub Pages (`*.github.io`)
  - Pattern: All subdomains of github.io
  - Examples: Any GitHub Pages documentation site
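A minimal re-implementation of the pattern matching described above might look like this (the regexes are a sketch reconstructed from the descriptions; the exact patterns in llm.codes may differ):

```typescript
// Heuristics for recognizing documentation URLs, mirroring the four
// pattern categories: doc subdomains, doc paths, language sites, GitHub Pages.
const DOC_SUBDOMAIN = /^(docs|developer|dev|learn|help|api|guide|wiki|devcenter)\./i;
const DOC_PATH = /\/(docs|documentation|api-docs|guides|learn|help|stable|latest)\/?$/i;
const LANG_DOMAIN = /(js|lang|py|-doc)\.(org|com)$/i;
const GITHUB_PAGES = /\.github\.io$/i;

function matchesDocPattern(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a valid absolute URL
  }
  return (
    DOC_SUBDOMAIN.test(url.hostname) ||
    DOC_PATH.test(url.pathname) ||
    LANG_DOMAIN.test(url.hostname) ||
    GITHUB_PAGES.test(url.hostname)
  );
}
```

Parsing with `URL` first keeps the hostname and path checks separate, so `example.com/blog` is rejected while `react.dev/learn` passes on its path.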
A small number of popular documentation sites don't follow standard patterns and are explicitly supported:
- Swift Package Index (`swiftpackageindex.com`)
- Flask (`flask.palletsprojects.com`)
- Material-UI (`mui.com/material-ui`)
- pip (`pip.pypa.io/en/stable`)
- PHP (`www.php.net/docs.php`)
Most documentation sites are automatically supported! If your site follows standard documentation URL patterns (like having `/docs` in the path or `docs.` as a subdomain), it should work without any changes.
If you find a documentation site that isn't supported, please open an issue and we'll either adjust our patterns or add it as an exception.
| Option | Description | Default | Range |
|---|---|---|---|
| Crawl Depth | How many levels deep to follow links | 2 | 0-5 |
| Max URLs | Maximum number of URLs to process | 200 | 1-1000 |
| Batch Size | URLs processed concurrently | 20 | N/A |
| Cache Duration | How long results are cached | 30 days | N/A |
The core API endpoint that handles documentation conversion.
Request Flow:
- URL validation against allowed domains whitelist
- Cache check (Redis/in-memory with 30-day TTL)
- Firecrawl API call with optimized scraping parameters
- Content post-processing and filtering
- Response with markdown and cache status
Request Body:
{
"url": "https://developer.apple.com/documentation/swiftui",
"action": "scrape"
}
Response:
{
"success": true,
"data": {
"markdown": "# SwiftUI Documentation\n\n..."
},
"cached": false
}
Error Handling:
- Domain validation errors (400)
- Firecrawl API errors (500)
- Network timeouts (504)
- Rate limiting (429)
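From a client, the endpoint can be called roughly like this (a sketch following the request/response shapes above; the `extractMarkdown` helper is illustrative, not part of the project):

```typescript
interface ScrapeResponse {
  success: boolean;
  data?: { markdown: string };
  cached?: boolean;
}

// Pull the markdown out of a scrape response, throwing on failure.
function extractMarkdown(json: ScrapeResponse): string {
  if (!json.success || !json.data) throw new Error("Scrape returned no data");
  return json.data.markdown;
}

async function scrapeToMarkdown(url: string): Promise<string> {
  const res = await fetch("/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, action: "scrape" }),
  });
  if (!res.ok) throw new Error(`Scrape failed with status ${res.status}`);
  return extractMarkdown(await res.json());
}
```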
- Framework: Next.js 15 with App Router
- Language: TypeScript
- Styling: Tailwind CSS v4
- API: Firecrawl for web scraping
- Cache: Upstash Redis for distributed caching
- Deployment: Vercel
- Development: Turbopack for fast refreshes
llm-codes/
├── src/
│ ├── app/
│ │ ├── api/
│ │ │ └── scrape/
│ │ │ ├── route.ts # API endpoint
│ │ │ └── __tests__/ # API tests
│ │ ├── globals.css # Global styles & Tailwind
│ │ ├── layout.tsx # Root layout
│ │ ├── page.tsx # Main page component
│ │ └── icon.tsx # Dynamic favicon
│ ├── constants.ts # Configuration constants
│ ├── utils/ # Utility functions
│ │ ├── content-processing.ts # Content cleaning logic
│ │ ├── file-utils.ts # File handling
│ │ ├── notifications.ts # Browser notifications
│ │ ├── scraping.ts # Scraping utilities
│ │ ├── url-utils.ts # URL validation & handling
│ │ └── __tests__/ # Utility tests
│ └── test/
│ └── setup.ts # Test configuration
├── public/
│ └── favicon.svg # Static favicon
├── next.config.js # Next.js configuration
├── postcss.config.js # PostCSS with Tailwind v4
├── tsconfig.json # TypeScript configuration
├── vitest.config.ts # Vitest test configuration
├── spec.md # Detailed specification
└── package.json # Dependencies
- URL Extraction: Custom regex patterns extract links from markdown and HTML
- Domain-Specific Filtering: Each documentation site has custom rules for link following
- Parallel Batch Processing: URLs processed in batches of 20 for optimal performance
- Content Deduplication: Hash-based paragraph and section deduplication
- Multi-Stage Filtering: Sequential filters for URLs, navigation, boilerplate, and platform strings
- Batched API Calls: Reduces Firecrawl API latency by processing multiple URLs per request
- Progressive Loading: UI updates with real-time progress during long crawls
- Smart Link Extraction: Only follows relevant documentation links based on URL patterns
- Client-Side Caching: Browser-based result caching for repeat operations
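The hash-based deduplication step can be sketched like this (illustrative only; the real filters in `content-processing.ts` are more involved):

```typescript
import { createHash } from "node:crypto";

// Drop paragraphs whose normalized text has been seen before. Passing a
// shared `seen` set across pages deduplicates across the whole crawl.
function deduplicateParagraphs(markdown: string, seen = new Set<string>()): string {
  const kept: string[] = [];
  for (const para of markdown.split(/\n{2,}/)) {
    const normalized = para.trim().replace(/\s+/g, " ").toLowerCase();
    if (!normalized) continue; // skip empty sections
    const digest = createHash("sha256").update(normalized).digest("hex");
    if (seen.has(digest)) continue; // duplicate paragraph
    seen.add(digest);
    kept.push(para.trim());
  }
  return kept.join("\n\n");
}
```

Hashing normalized text (collapsed whitespace, lowercased) catches paragraphs that are repeated with trivial formatting differences, which is common in boilerplate shared across documentation pages.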
# Run all tests
npm test
# Run tests with UI
npm run test:ui
# Run tests with coverage
npm run test:coverage
# Type checking
npm run type-check
Tests cover:
- URL validation and domain filtering
- Content processing and deduplication
- API error handling
- Cache behavior
- UI component interactions
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Check browser permissions for notifications
- Ensure you're using a supported browser (Chrome, Firefox, Safari on macOS 10.14+, Edge)
- Try resetting notification permissions in browser settings
The app includes a 30-day cache to minimize API calls. If you're hitting rate limits:
- Reduce crawl depth
- Lower maximum URLs
- Wait for cached results
- Consider setting up Redis cache for better performance
For production use, we recommend setting up Redis cache:
- Sign up for Upstash (free tier available)
- Create a Redis database
- Add the credentials to your environment variables
- The app will automatically use Redis for caching
Benefits:
- Cache persists across deployments
- Shared cache across all instances
- Automatic compression for large documents
- ~70% reduction in Firecrawl API calls
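The cache-aside flow looks roughly like this (a sketch assuming the `get`/`set`-with-TTL shape of the Upstash Redis client; the in-memory stand-in keeps the example self-contained and runnable without credentials):

```typescript
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, opts?: { ex?: number }): Promise<unknown>;
}

const THIRTY_DAYS = 60 * 60 * 24 * 30; // TTL in seconds

// Return cached markdown if present; otherwise fetch fresh and cache it.
async function getCachedMarkdown(
  cache: CacheClient,
  url: string,
  fetchFresh: (url: string) => Promise<string>,
): Promise<{ markdown: string; cached: boolean }> {
  const key = `scrape:${url}`;
  const hit = await cache.get(key);
  if (hit !== null) return { markdown: hit, cached: true };
  const markdown = await fetchFresh(url);
  await cache.set(key, markdown, { ex: THIRTY_DAYS });
  return { markdown, cached: false };
}

// In-memory stand-in so the sketch runs without a Redis instance.
function memoryCache(): CacheClient {
  const store = new Map<string, string>();
  return {
    async get(key) { return store.get(key) ?? null; },
    async set(key, value) { store.set(key, value); },
  };
}
```

In production the `memoryCache()` stand-in would be replaced by a real Upstash client constructed from `UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN`.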
- Ensure `FIRECRAWL_API_KEY` is set in environment variables
- Check Vercel function logs for errors
- Verify your API key is valid
This project is licensed under the MIT License - see the LICENSE file for details.
LLM Codes supports 69 documentation sites across multiple categories:
- Python, MDN Web Docs, TypeScript, Rust, Go, Java, Ruby, PHP, Swift, Kotlin
- React, Vue.js, Angular, Next.js, Nuxt, Svelte, Django, Flask, Express.js, Laravel
- AWS, Google Cloud, Azure, DigitalOcean, Heroku, Vercel, Netlify, Salesforce
- PostgreSQL, MongoDB, MySQL, Redis, Elasticsearch, Couchbase, Cassandra
- Docker, Kubernetes, Terraform, Ansible, GitHub, GitLab
- PyTorch, TensorFlow, Hugging Face, scikit-learn, LangChain, pandas, NumPy
- Tailwind CSS, Bootstrap, Material-UI, Chakra UI
- npm, webpack, Vite, pip, Cargo, Maven
- Jest, Cypress, Playwright, pytest, Mocha
- React Native, Flutter, Android, Apple Developer
If you need support for a documentation site that's not listed, please open an issue on GitHub!
- Handles JavaScript-heavy sites that traditional scrapers can't parse
- Built-in markdown conversion with semantic structure preservation
- Reliable headless browser automation at scale
- Server-side API key security
- Built-in caching with fetch()
- Streaming responses for large documentation sets
- Edge-ready deployment on Vercel
- Reduces server load for filtering operations
- Enables real-time UI updates during processing
- Allows users to customize output without re-fetching
- WebSocket support for real-time crawl progress
- Custom domain rule configuration
- Batch URL upload via CSV/JSON
- Export to multiple formats (PDF, EPUB, Docusaurus)
- LLM-specific formatting profiles
- Powered by Firecrawl for JavaScript rendering
- Inspired by the challenges of making documentation accessible to AI agents
- Built with Next.js 15, Tailwind CSS v4, and TypeScript
Built by Peter Steinberger | Blog Post | Twitter