A specialized web crawler that converts website content into LLM-friendly text format.

LLM URL Crawler

What it Does

  1. Takes a website URL as input
  2. Finds all pages on that website automatically
  3. Downloads and converts content to clean, structured text
  4. Saves everything into llm_full.txt in a format that's easy for LLMs to process

Setup

  1. Make sure Python is installed

  2. Set up your workspace:

# Create environment
python -m venv .venv

# Activate it (Windows)
.venv\Scripts\activate
# or (Mac/Linux)
source .venv/bin/activate

# Install packages
pip install -r requirements.txt

How to Use

  1. Start the crawler:

python crawler.py

  2. Enter a website URL (example: https://example.com)

  3. The program will:

    • Scan the website for all pages
    • Convert content to LLM-friendly format
    • Save everything to llm_full.txt
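"Scan the website for all pages" typically means collecting every same-site link from each fetched page and crawling them in turn. A hypothetical sketch of the link-discovery part, using only the standard library (function names are illustrative, not taken from crawler.py):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Gathers absolute, same-site links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Keep only pages hosted on the same domain as the start URL
        if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
            self.links.add(absolute.split("#")[0])  # drop fragments


def same_site_links(html: str, base_url: str) -> list:
    """Return sorted same-site links found in an HTML snippet."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)


print(same_site_links(
    '<a href="/docs">Docs</a><a href="https://other.com/x">Ext</a>',
    "https://example.com",
))
# → ['https://example.com/docs']
```

A full crawler would feed each discovered link back into a queue and track visited URLs to avoid fetching the same page twice.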

Docker Usage

Pull and run the container:

docker pull ghcr.io/YOUR_GITHUB_USERNAME/url-crawler:latest
docker run -it ghcr.io/YOUR_GITHUB_USERNAME/url-crawler

Or build locally:

docker build -t url-crawler .
docker run -it url-crawler

Output Format

The llm_full.txt file contains:

  • Clean, structured text without HTML or other markup
  • Clear separation between different pages
  • Content organized in a way that's optimal for LLM processing
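The "clear separation between different pages" can be pictured as a delimiter line carrying each page's URL. The exact layout of llm_full.txt is defined by crawler.py; the separator below is purely an assumed example:

```python
def format_page(url: str, text: str) -> str:
    """Render one crawled page as a delimited block (hypothetical layout)."""
    return f"===== {url} =====\n{text}\n"


# Concatenate several pages the way the output file might be assembled
doc = "".join(format_page(u, t) for u, t in [
    ("https://example.com/", "Welcome"),
    ("https://example.com/docs", "Documentation"),
])
print(doc)
```

Keeping the source URL next to each page's text lets an LLM attribute answers to specific pages when processing the file.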

Common Issues

  • Some websites might block automated access
  • Large websites might take longer to process
  • Make sure you have a stable internet connection
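Transient failures (flaky connections, temporary blocks) are usually handled by retrying with a growing delay. A generic sketch of that pattern, not a description of how crawler.py handles errors:

```python
import time


def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    """Call fetch(url), retrying on OSError with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error
            time.sleep(backoff * 2 ** attempt)


# Demonstrate with a fake fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary failure")
    return "ok"

print(fetch_with_retry(flaky, "https://example.com", backoff=0))
# → ok
```

For sites that block automated access outright, retries won't help; respecting robots.txt and the site's terms of use is the right response.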
