A high-performance web service that converts JavaScript-heavy documentation sites into clean, LLM-optimized Markdown. Built specifically to solve the problem of AI agents being unable to parse modern documentation sites that rely heavily on client-side rendering.
Why llm.codes exists: Modern AI agents like Claude Code struggle with JavaScript-heavy documentation sites, particularly Apple's developer docs. This tool bridges that gap by converting dynamic content into clean, parseable Markdown that AI agents can actually use.
📖 Read the full story: How llm.codes Transforms Developer Documentation for AI Agents
Modern documentation sites (especially Apple's) use heavy JavaScript rendering that makes content invisible to AI agents. llm.codes solves this by:
- Using Firecrawl's headless browser to execute JavaScript and capture fully-rendered content
- Converting dynamic HTML to clean, semantic Markdown
- Removing noise (navigation, footers, duplicate content) that wastes AI context tokens
- Providing parallel URL processing for efficient multi-page documentation crawling
- Parallel Processing: Fetches up to 20 URLs concurrently using batched promises
- Smart Caching: Redis-backed 30-day cache reduces API calls and improves response times
- Content Filtering: Multiple filtering strategies to remove:
- Navigation elements and boilerplate
- Platform availability strings (iOS 14.0+, etc.)
- Duplicate content across pages
- Empty sections and formatting artifacts
- Recursive Crawling: Configurable depth-first crawling with intelligent link extraction
- Browser Notifications: Web Notifications API integration for background processing alerts
- URL State Management: Query parameter-based URL sharing for easy documentation links
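The batched-promise approach behind the parallel fetching above can be sketched as follows (a minimal illustration; `processInBatches` and its signature are not the project's actual helper):

```typescript
// Process items in fixed-size batches, awaiting each batch before starting
// the next. Failed items are skipped rather than failing the whole run.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const settled = await Promise.allSettled(batch.map(worker));
    for (const s of settled) {
      if (s.status === "fulfilled") results.push(s.value);
    }
  }
  return results;
}
```

`Promise.allSettled` (rather than `Promise.all`) lets one failed URL in a batch be dropped without aborting the remaining fetches.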
🚀 Try it now at llm.codes
Experience the tool instantly without any setup required.
- Node.js 20+
- npm or yarn
- Firecrawl API key
- Clone the repository:
git clone https://github.com/amantus-ai/llm-codes.git
cd llm-codes
- Install dependencies:
npm install
- Create a `.env.local` file:
cp .env.local.example .env.local
- Add your Firecrawl API key to `.env.local`:
# Required
FIRECRAWL_API_KEY=your_api_key_here
# Optional - Redis Cache (Recommended for production)
UPSTASH_REDIS_REST_URL=https://your-redis-instance.upstash.io
UPSTASH_REDIS_REST_TOKEN=your_redis_token_here
# Optional - Cache Admin
CACHE_ADMIN_KEY=your_secure_admin_key_here
- Run the development server:
npm run dev
The easiest way to deploy is using Vercel:
- Click the button above
- Create a new repository
- Add your `FIRECRAWL_API_KEY` environment variable
- Deploy!
- Push to your GitHub repository
- Import project on Vercel
- Add environment variables:
  - `FIRECRAWL_API_KEY`: Your Firecrawl API key (required)
  - `UPSTASH_REDIS_REST_URL`: Your Upstash Redis URL (optional)
  - `UPSTASH_REDIS_REST_TOKEN`: Your Upstash Redis token (optional)
  - `CACHE_ADMIN_KEY`: Admin key for cache endpoints (optional)
- Deploy
- Enter URL: Paste any documentation URL
- Most documentation sites are automatically supported through pattern matching
- Click "Learn more" to see the supported URL patterns
- Configure Options (click "Show Options"):
- Crawl Depth: How deep to follow links (0 = main page only, max 5)
- Max URLs: Maximum number of pages to process (1-1000, default 200)
- Filter URLs: Remove hyperlinks from content (recommended for LLMs)
- Deduplicate Content: Remove duplicate paragraphs to save tokens
- Filter Availability: Remove platform availability strings (iOS 14.0+, etc.)
- Process: Click "Process Documentation" and grant notification permissions if prompted
- Monitor Progress:
- Real-time progress bar shows completion percentage
- Activity log displays detailed processing information
- Browser notifications alert you when complete
- Download: View statistics and download your clean Markdown file
llm.codes uses intelligent pattern matching to support most documentation sites automatically. Rather than maintaining a list of thousands of individual sites, we use regex patterns to match common documentation URL structures.
We support documentation sites that match these patterns:
- Documentation Subdomains (`docs.*`, `developer.*`, `learn.*`, etc.)
  - Pattern: Any subdomain like docs, developer, dev, learn, help, api, guide, wiki, or devcenter
  - Examples: `docs.python.org`, `developer.apple.com`, `learn.microsoft.com`
- Documentation Paths (`/docs`, `/guide`, `/learn`, etc.)
  - Pattern: URLs ending with paths like /docs, /documentation, /api-docs, /guides, /learn, /help, /stable, or /latest
  - Examples: `angular.io/docs`, `redis.io/docs`, `react.dev/learn`
- Programming Language Sites (`*js.org`, `*lang.org`, etc.)
  - Pattern: Domains ending with js, lang, py, or -doc followed by .org or .com
  - Examples: `vuejs.org`, `kotlinlang.org`, `ruby-doc.org`
- GitHub Pages (`*.github.io`)
  - Pattern: All subdomains of github.io
  - Examples: Any GitHub Pages documentation site
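A minimal re-implementation of the pattern matching described above might look like this (the regexes are a sketch reconstructed from the descriptions; the exact patterns in llm.codes may differ):

```typescript
// Heuristics for recognizing documentation URLs, mirroring the four
// pattern categories: doc subdomains, doc paths, language sites, GitHub Pages.
const DOC_SUBDOMAIN = /^(docs|developer|dev|learn|help|api|guide|wiki|devcenter)\./i;
const DOC_PATH = /\/(docs|documentation|api-docs|guides|learn|help|stable|latest)\/?$/i;
const LANG_DOMAIN = /(js|lang|py|-doc)\.(org|com)$/i;
const GITHUB_PAGES = /\.github\.io$/i;

function matchesDocPattern(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a valid absolute URL
  }
  return (
    DOC_SUBDOMAIN.test(url.hostname) ||
    DOC_PATH.test(url.pathname) ||
    LANG_DOMAIN.test(url.hostname) ||
    GITHUB_PAGES.test(url.hostname)
  );
}
```

Parsing with `URL` first keeps the hostname and path checks separate, so `example.com/blog` is rejected while `react.dev/learn` passes on its path.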
A small number of popular documentation sites don't follow standard patterns and are explicitly supported:
- Swift Package Index (`swiftpackageindex.com`)
- Flask (`flask.palletsprojects.com`)
- Material-UI (`mui.com/material-ui`)
- pip (`pip.pypa.io/en/stable`)
- PHP (`www.php.net/docs.php`)
Most documentation sites are automatically supported! If your site follows standard documentation URL patterns (like having `/docs` in the path or `docs.` as a subdomain), it should work without any changes.
If you find a documentation site that isn't supported, please open an issue and we'll either adjust our patterns or add it as an exception.
| Option | Description | Default | Range |
|---|---|---|---|
| Crawl Depth | How many levels deep to follow links | 2 | 0-5 |
| Max URLs | Maximum number of URLs to process | 200 | 1-1000 |
| Batch Size | URLs processed concurrently | 20 | N/A |
| Cache Duration | How long results are cached | 30 days | N/A |
The core API endpoint that handles documentation conversion.
Request Flow:
- URL validation against allowed domains whitelist
- Cache check (Redis/in-memory with 30-day TTL)
- Firecrawl API call with optimized scraping parameters
- Content post-processing and filtering
- Response with markdown and cache status
Request Body:
{
"url": "https://developer.apple.com/documentation/swiftui",
"action": "scrape"
}
Response:
{
"success": true,
"data": {
"markdown": "# SwiftUI Documentation\n\n..."
},
"cached": false
}
Error Handling:
- Domain validation errors (400)
- Firecrawl API errors (500)
- Network timeouts (504)
- Rate limiting (429)
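From a client, the endpoint can be called roughly like this (a sketch following the request/response shapes above; the `extractMarkdown` helper is illustrative, not part of the project):

```typescript
interface ScrapeResponse {
  success: boolean;
  data?: { markdown: string };
  cached?: boolean;
}

// Pull the markdown out of a scrape response, throwing on failure.
function extractMarkdown(json: ScrapeResponse): string {
  if (!json.success || !json.data) throw new Error("Scrape returned no data");
  return json.data.markdown;
}

async function scrapeToMarkdown(url: string): Promise<string> {
  const res = await fetch("/api/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, action: "scrape" }),
  });
  if (!res.ok) throw new Error(`Scrape failed with status ${res.status}`);
  return extractMarkdown(await res.json());
}
```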
- Framework: Next.js 15 with App Router
- Language: TypeScript
- Styling: Tailwind CSS v4
- API: Firecrawl for web scraping
- Cache: Upstash Redis for distributed caching
- Deployment: Vercel
- Development: Turbopack for fast refreshes
llm-codes/
├── src/
│ ├── app/
│ │ ├── api/
│ │ │ └── scrape/
│ │ │ ├── route.ts # API endpoint
│ │ │ └── __tests__/ # API tests
│ │ ├── globals.css # Global styles & Tailwind
│ │ ├── layout.tsx # Root layout
│ │ ├── page.tsx # Main page component
│ │ └── icon.tsx # Dynamic favicon
│ ├── constants.ts # Configuration constants
│ ├── utils/ # Utility functions
│ │ ├── content-processing.ts # Content cleaning logic
│ │ ├── file-utils.ts # File handling
│ │ ├── notifications.ts # Browser notifications
│ │ ├── scraping.ts # Scraping utilities
│ │ ├── url-utils.ts # URL validation & handling
│ │ └── __tests__/ # Utility tests
│ └── test/
│ └── setup.ts # Test configuration
├── public/
│ └── favicon.svg # Static favicon
├── next.config.js # Next.js configuration
├── postcss.config.js # PostCSS with Tailwind v4
├── tsconfig.json # TypeScript configuration
├── vitest.config.ts # Vitest test configuration
├── spec.md # Detailed specification
└── package.json # Dependencies
- URL Extraction: Custom regex patterns extract links from markdown and HTML
- Domain-Specific Filtering: Each documentation site has custom rules for link following
- Parallel Batch Processing: URLs processed in batches of 20 for optimal performance
- Content Deduplication: Hash-based paragraph and section deduplication
- Multi-Stage Filtering: Sequential filters for URLs, navigation, boilerplate, and platform strings
- Batched API Calls: Reduces Firecrawl API latency by processing multiple URLs per request
- Progressive Loading: UI updates with real-time progress during long crawls
- Smart Link Extraction: Only follows relevant documentation links based on URL patterns
- Client-Side Caching: Browser-based result caching for repeat operations
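The hash-based deduplication step can be sketched like this (illustrative only; the real filters in `content-processing.ts` are more involved):

```typescript
import { createHash } from "node:crypto";

// Drop paragraphs whose normalized text has been seen before. Passing a
// shared `seen` set across pages deduplicates across the whole crawl.
function deduplicateParagraphs(markdown: string, seen = new Set<string>()): string {
  const kept: string[] = [];
  for (const para of markdown.split(/\n{2,}/)) {
    const normalized = para.trim().replace(/\s+/g, " ").toLowerCase();
    if (!normalized) continue; // skip empty sections
    const digest = createHash("sha256").update(normalized).digest("hex");
    if (seen.has(digest)) continue; // duplicate paragraph
    seen.add(digest);
    kept.push(para.trim());
  }
  return kept.join("\n\n");
}
```

Hashing normalized text (collapsed whitespace, lowercased) catches paragraphs that are repeated with trivial formatting differences, which is common in boilerplate shared across documentation pages.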
# Run all tests
npm test
# Run tests with UI
npm run test:ui
# Run tests with coverage
npm run test:coverage
# Type checking
npm run type-check
Tests cover:
- URL validation and domain filtering
- Content processing and deduplication
- API error handling
- Cache behavior
- UI component interactions
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Check browser permissions for notifications
- Ensure you're using a supported browser (Chrome, Firefox, Safari on macOS 10.14+, Edge)
- Try resetting notification permissions in browser settings
The app includes a 30-day cache to minimize API calls. If you're hitting rate limits:
- Reduce crawl depth
- Lower maximum URLs
- Wait for cached results
- Consider setting up Redis cache for better performance
For production use, we recommend setting up Redis cache:
- Sign up for Upstash (free tier available)
- Create a Redis database
- Add the credentials to your environment variables
- The app will automatically use Redis for caching
Benefits:
- Cache persists across deployments
- Shared cache across all instances
- Automatic compression for large documents
- ~70% reduction in Firecrawl API calls
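The cache-aside flow looks roughly like this (a sketch assuming the `get`/`set`-with-TTL shape of the Upstash Redis client; the in-memory stand-in keeps the example self-contained and runnable without credentials):

```typescript
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, opts?: { ex?: number }): Promise<unknown>;
}

const THIRTY_DAYS = 60 * 60 * 24 * 30; // TTL in seconds

// Return cached markdown if present; otherwise fetch fresh and cache it.
async function getCachedMarkdown(
  cache: CacheClient,
  url: string,
  fetchFresh: (url: string) => Promise<string>,
): Promise<{ markdown: string; cached: boolean }> {
  const key = `scrape:${url}`;
  const hit = await cache.get(key);
  if (hit !== null) return { markdown: hit, cached: true };
  const markdown = await fetchFresh(url);
  await cache.set(key, markdown, { ex: THIRTY_DAYS });
  return { markdown, cached: false };
}

// In-memory stand-in so the sketch runs without a Redis instance.
function memoryCache(): CacheClient {
  const store = new Map<string, string>();
  return {
    async get(key) { return store.get(key) ?? null; },
    async set(key, value) { store.set(key, value); },
  };
}
```

In production the `memoryCache()` stand-in would be replaced by a real Upstash client constructed from `UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN`.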
- Ensure `FIRECRAWL_API_KEY` is set in environment variables
- Check Vercel function logs for errors
- Verify your API key is valid
This project is licensed under the MIT License - see the LICENSE file for details.
LLM Codes supports 69 documentation sites across multiple categories:
- Python, MDN Web Docs, TypeScript, Rust, Go, Java, Ruby, PHP, Swift, Kotlin
- React, Vue.js, Angular, Next.js, Nuxt, Svelte, Django, Flask, Express.js, Laravel
- AWS, Google Cloud, Azure, DigitalOcean, Heroku, Vercel, Netlify, Salesforce
- PostgreSQL, MongoDB, MySQL, Redis, Elasticsearch, Couchbase, Cassandra
- Docker, Kubernetes, Terraform, Ansible, GitHub, GitLab
- PyTorch, TensorFlow, Hugging Face, scikit-learn, LangChain, pandas, NumPy
- Tailwind CSS, Bootstrap, Material-UI, Chakra UI
- npm, webpack, Vite, pip, Cargo, Maven
- Jest, Cypress, Playwright, pytest, Mocha
- React Native, Flutter, Android, Apple Developer
If you need support for a documentation site that's not listed, please open an issue on GitHub!
- Handles JavaScript-heavy sites that traditional scrapers can't parse
- Built-in markdown conversion with semantic structure preservation
- Reliable headless browser automation at scale
- Server-side API key security
- Built-in caching with fetch()
- Streaming responses for large documentation sets
- Edge-ready deployment on Vercel
- Reduces server load for filtering operations
- Enables real-time UI updates during processing
- Allows users to customize output without re-fetching
- WebSocket support for real-time crawl progress
- Custom domain rule configuration
- Batch URL upload via CSV/JSON
- Export to multiple formats (PDF, EPUB, Docusaurus)
- LLM-specific formatting profiles
- Powered by Firecrawl for JavaScript rendering
- Inspired by the challenges of making documentation accessible to AI agents
- Built with Next.js 15, Tailwind CSS v4, and TypeScript
Built by Peter Steinberger | Blog Post | Twitter