- Takes a website URL as input
- Finds all pages on that website automatically
- Downloads and converts content to clean, structured text
- Saves everything into `llm_full.txt` in a format that's easy for LLMs to process
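Under the hood, a crawler like this boils down to a loop of three steps: fetch a page, extract its links and visible text, and append the cleaned text to the output file. As a minimal sketch of the parsing step using only Python's standard library (the actual `crawler.py` may use different libraries, such as `requests` or `BeautifulSoup`):

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []       # href values found in <a> tags
        self.text_parts = []  # visible text fragments
        self._skip = 0        # depth inside <script>/<style>, whose content is not visible text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

# Example page (in the real tool this HTML would be fetched over HTTP)
html = "<html><body><h1>Docs</h1><a href='/about'>About</a><script>x=1</script></body></html>"
parser = PageParser()
parser.feed(html)
print(parser.links)       # ['/about']
print(parser.text_parts)  # ['Docs', 'About']
```

Newly discovered links that belong to the same site are queued for crawling, and the collected text fragments are what ends up in `llm_full.txt`.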
- Make sure Python is installed
- Set up your workspace:

  ```shell
  # Create environment
  python -m venv .venv

  # Activate it (Windows)
  .venv\Scripts\activate
  # or (Mac/Linux)
  source .venv/bin/activate

  # Install packages
  pip install -r requirements.txt
  ```
- Start the crawler:

  ```shell
  python crawler.py
  ```

- Enter a website URL (example: https://example.com)
- The program will:
  - Scan the website for all pages
  - Convert content to an LLM-friendly format
  - Save everything to `llm_full.txt`
Pull and run the container:

```shell
docker pull ghcr.io/YOUR_GITHUB_USERNAME/url-crawler:latest
docker run -it ghcr.io/YOUR_GITHUB_USERNAME/url-crawler
```

Or build locally:

```shell
docker build -t url-crawler .
docker run -it url-crawler
```
The `llm_full.txt` file contains:
- Clean, structured text without HTML or other markup
- Clear separation between different pages
- Content organized in a way that's optimal for LLM processing
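The exact layout depends on the crawler's implementation, but a page-separated output file typically looks something like this (illustrative only, not the tool's guaranteed format):

```text
=== Page: https://example.com/ ===
Example Domain

This domain is for use in illustrative examples in documents.

=== Page: https://example.com/about ===
About

...
```

Clear per-page delimiters like these make it easy to split the file back into documents before feeding it to an LLM.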
- Some websites might block automated access
- Large websites might take longer to process
- Make sure you have a stable internet connection
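Sites that block automated access often declare their rules in `robots.txt`. You can check whether a URL is allowed before crawling it with the standard library's `urllib.robotparser` (a general-purpose sketch; whether `crawler.py` itself does this is not specified here):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; against a real site you would instead call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/docs"))       # True  (not disallowed)
print(rp.can_fetch("*", "https://example.com/private/x"))  # False (matches Disallow rule)
```

Respecting these rules, and adding a small delay between requests, also reduces the chance of being rate-limited on large sites.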