WebSpider

A sophisticated web directory crawler built in Go that provides advanced rate limiting, intelligent filtering, and selective downloading capabilities. Designed as a modern alternative to wget --spider with enhanced features for respectful web crawling.

Features

  • Advanced Rate Limiting: Supports both standard requests-per-second limiting and specialized burst-with-backoff patterns
  • Robots.txt Support: Automatic robots.txt fetching, parsing, and compliance with crawl-delay and disallow/allow rules
  • Two-Phase Operation: Discover directory structure first, then selectively download only desired files
  • Intelligent Filtering: Powerful regex-based URL acceptance and rejection patterns
  • Respectful Crawling: Built-in detection of rate limiting responses with automatic backoff
  • Directory Structure Preservation: Maintains original directory hierarchy during downloads
  • Concurrent Processing: Configurable concurrent request limiting with semaphore control
  • Comprehensive Logging: Detailed progress tracking and verbose debugging options

Quick Start

Installation
# Clone the repository
git clone https://github.com/0xRepo-Source/WebSpider.git
cd WebSpider

# Build from source
go mod tidy
go build -o webspider

# Or install directly
go install github.com/0xRepo-Source/WebSpider@latest
Windows Setup for Easy Access

For convenient command-line usage on Windows, you can set up WebSpider to be accessible from anywhere:

  1. Download/Build WebSpider:

    • Download webspider-windows-amd64.exe from the releases page
    • Or build from source as shown above
  2. Create Directory and Rename:

    # Create a dedicated directory
    mkdir C:\WebSpider
    
    # Copy the executable and rename it for easier typing
    copy webspider-windows-amd64.exe C:\WebSpider\ws.exe
    
  3. Add to PATH Environment Variable:

    • Press Win + R, type sysdm.cpl, press Enter
    • Click "Environment Variables" button
    • Under "User variables" or "System variables", find and select "Path"
    • Click "Edit" → "New" → Add C:\WebSpider
    • Click "OK" to close all dialogs
  4. Usage from Anywhere:

    # Now you can use 'ws' from any directory
    ws -url "https://example.com/files/" -discover-only -verbose
    ws -urls "discovered-urls.txt" -rate 0.5
    ws -special-rate -url "https://sensitive-server.com/" -discover-only
    
PowerShell Alternative Setup

If you prefer using PowerShell profiles:

# Create a PowerShell function (add to your PowerShell profile)
function ws { & "C:\WebSpider\ws.exe" $args }

# Usage
ws -url "https://example.com/" -discover-only
Basic Usage
# Discover directory structure (recommended first step)
./webspider -url "https://example.com/files/" -discover-only -verbose

# Edit the generated discovered-urls.txt file to select desired files

# Download selected files
./webspider -urls "discovered-urls.txt" -rate 0.5

Windows (after PATH setup):

# Same commands but using the shorter 'ws' alias
ws -url "https://example.com/files/" -discover-only -verbose
ws -urls "discovered-urls.txt" -rate 0.5

Advanced Usage

Standard Rate Limiting

For servers with standard rate limiting policies:

# Conservative crawling
./webspider -url "https://example.com/docs/" -rate 0.5 -discover-only

# Moderate speed crawling
./webspider -url "https://example.com/files/" -rate 2.0 -depth 4
Special Rate Limiting

For servers that implement burst-then-block rate limiting (e.g., 2 requests per 5 seconds, followed by a 10-second block):

# Default special rate limiting (2 req/5sec, 10sec block)
./webspider -url "https://sensitive-server.com/" -special-rate -discover-only

# Custom burst limiting (3 req/10sec, 15sec block)
./webspider -url "https://custom-server.com/" \
  -special-rate \
  -max-requests 3 \
  -time-window 10s \
  -block-duration 15s \
  -verbose
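
The sketch below illustrates the pattern these flags describe: at most -max-requests requests inside -time-window, then a pause of -block-duration before the next burst. It is a simplified Go illustration of the flag semantics, not WebSpider's actual limiter.

// Illustrative sketch only: a sliding-window limiter that sends at most
// maxRequests per timeWindow and, once the window is full, waits out
// blockDuration (mirrors -max-requests / -time-window / -block-duration).
package main

import (
	"fmt"
	"time"
)

type burstLimiter struct {
	maxRequests   int
	timeWindow    time.Duration
	blockDuration time.Duration
	sent          []time.Time // timestamps of recent requests
}

func (b *burstLimiter) Wait() {
	now := time.Now()
	// Drop timestamps that have slid out of the window.
	recent := b.sent[:0]
	for _, t := range b.sent {
		if now.Sub(t) < b.timeWindow {
			recent = append(recent, t)
		}
	}
	b.sent = recent

	// Window is full: pause for the block duration before continuing.
	if len(b.sent) >= b.maxRequests {
		time.Sleep(b.blockDuration)
		b.sent = b.sent[:0]
	}
	b.sent = append(b.sent, time.Now())
}

func main() {
	lim := &burstLimiter{maxRequests: 2, timeWindow: 5 * time.Second, blockDuration: 10 * time.Second}
	for i := 0; i < 5; i++ {
		lim.Wait()
		fmt.Println("request", i, "at", time.Now().Format("15:04:05"))
	}
}
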
Content Filtering

Target specific file types and exclude unwanted content:

# Academic papers and documents
./webspider -url "https://university.edu/publications/" \
  -accept "\.(pdf|doc|docx|ppt|pptx)$" \
  -discover-only \
  -save-list "academic-papers.txt"

# Software packages only
./webspider -url "https://releases.example.com/" \
  -accept "\.(tar\.gz|zip|deb|rpm|dmg)$" \
  -reject "/archive/|/old/" \
  -discover-only

# Exclude web assets
./webspider -url "https://docs.example.com/" \
  -reject "\.(css|js|jpg|jpeg|png|gif|svg|ico)$" \
  -depth 5

Command Line Reference

Core Options
Flag            Description                          Default              Example
-url            Base URL to crawl                    Required             https://example.com/files/
-urls           File containing URLs to download     -                    discovered-urls.txt
-discover-only  Only discover URLs, don't download   false                -
-depth          Maximum crawling depth               3                    5
-output         Output directory for downloads       ./downloads          ./my-files
-save-list      File to save discovered URLs         discovered-urls.txt  results.txt
-verbose        Enable verbose logging               false                -
Rate Limiting Options
Flag             Description                          Default  Example
-rate            Standard rate limit (req/sec)        1.0      0.5
-special-rate    Enable burst-then-block limiting     false    -
-max-requests    Max requests in time window          2        3
-time-window     Time window for request limiting     5s       10s
-block-duration  Duration server blocks after limit   10s      15s
Filtering Options
Flag         Description                       Default        Example
-accept      Regex pattern for URLs to accept  -              \.(pdf|doc)$
-reject      Regex pattern for URLs to reject  -              \.(css|js)$
-user-agent  Custom User-Agent string          WebSpider/1.0  Mozilla/5.0...
-timeout     HTTP request timeout              30s            60s
Robots.txt Compliance
Flag            Description              Default                       Example
-ignore-robots  Ignore robots.txt rules  false (respects robots.txt)   -ignore-robots

Robots.txt Features:

  • Automatic fetching: Downloads and parses robots.txt from each domain
  • User-agent matching: Respects rules for WebSpider, *, and custom user agents
  • Crawl-delay support: Automatically adjusts rate limiting based on Crawl-delay directive
  • Path filtering: Honors Disallow and Allow path patterns
  • Caching: Caches robots.txt for 24 hours to reduce server load
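
For reference, the behaviour above boils down to fetching /robots.txt and reading the group that applies to the crawler's user agent. The Go snippet below is a deliberately minimal illustration of that idea (real robots.txt parsing handles wildcards, Allow precedence, group boundaries, and more); it is not WebSpider's parser.

// Minimal illustration of robots.txt handling: fetch the file, then collect
// Disallow paths and Crawl-delay for a matching User-agent group.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

func fetchRobots(host, agent string) (disallow []string, delay time.Duration, err error) {
	resp, err := http.Get("https://" + host + "/robots.txt")
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()

	applies := false // inside a User-agent group that matches us?
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if i := strings.Index(line, "#"); i >= 0 {
			line = strings.TrimSpace(line[:i]) // strip comments
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key, val = strings.ToLower(strings.TrimSpace(key)), strings.TrimSpace(val)
		switch key {
		case "user-agent":
			applies = val == "*" || strings.EqualFold(val, agent)
		case "disallow":
			if applies && val != "" {
				disallow = append(disallow, val)
			}
		case "crawl-delay":
			if applies {
				if secs, err := strconv.ParseFloat(val, 64); err == nil {
					delay = time.Duration(secs * float64(time.Second))
				}
			}
		}
	}
	return disallow, delay, sc.Err()
}

func main() {
	rules, delay, err := fetchRobots("example.com", "WebSpider")
	if err != nil {
		fmt.Println("robots.txt fetch failed:", err)
		return
	}
	fmt.Println("disallowed paths:", rules, "crawl-delay:", delay)
}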

Use Cases

Academic Research

Download research papers and documentation while respecting university server policies:

./webspider -url "https://university.edu/papers/" \
  -special-rate \
  -accept "\.(pdf|doc|docx)$" \
  -discover-only \
  -verbose

Windows:

ws -url "https://university.edu/papers/" -special-rate -accept "\.(pdf|doc|docx)$" -discover-only -verbose
Software Distribution

Mirror software releases with conservative rate limiting:

./webspider -url "https://releases.project.org/" \
  -rate 0.5 \
  -accept "\.(tar\.gz|zip|deb|rpm)$" \
  -depth 3 \
  -output "./software-mirror"

Windows:

ws -url "https://releases.project.org/" -rate 0.5 -accept "\.(tar\.gz|zip|deb|rpm)$" -depth 3 -output ".\software-mirror"
Documentation Archival

Archive website documentation excluding assets:

./webspider -url "https://docs.example.com/" \
  -reject "\.(jpg|jpeg|png|gif|css|js)$" \
  -accept "\.(html|htm|pdf|txt|md)$" \
  -rate 2.0

Windows:

ws -url "https://docs.example.com/" -reject "\.(jpg|jpeg|png|gif|css|js)$" -accept "\.(html|htm|pdf|txt|md)$" -rate 2.0
Robots.txt Compliant Crawling

Crawl a website while automatically respecting robots.txt rules:

./webspider -url "https://example.com/data/" \
  -verbose \
  -discover-only \
  -accept "\.(csv|json|xml)$"

Windows:

ws -url "https://example.com/data/" -verbose -discover-only -accept "\.(csv|json|xml)$"

This will:

  • Automatically fetch and parse robots.txt from example.com
  • Respect any Disallow paths that block access to /data/ or subdirectories
  • Honor Crawl-delay directives by adjusting the rate limiter
  • Skip URLs blocked by robots.txt (shown in verbose output)

To override robots.txt protection (use responsibly):

./webspider -url "https://example.com/data/" -ignore-robots -rate 0.5

Best Practices

Respectful Crawling
  • Always start with -discover-only to understand the site structure
  • WebSpider automatically respects robots.txt by default
  • Use conservative rate limits (0.5-1.0 req/sec) for unknown servers
  • Monitor server responses with -verbose flag
  • Only use -ignore-robots when you have permission to bypass robots.txt
Efficient Filtering
  • Use -accept patterns to target specific file types early
  • Combine with -reject patterns to exclude unwanted content
  • Set appropriate -depth limits to avoid unnecessary crawling
  • Test regex patterns before large crawls (see the sketch below)
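
A quick way to test a pattern is a few lines of Go, assuming -accept and -reject are evaluated with Go's regexp package (RE2 syntax):

// Sanity-check an -accept pattern against sample URLs before a large crawl.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	accept := regexp.MustCompile(`\.(pdf|doc|docx)$`)
	for _, u := range []string{
		"https://example.com/papers/report.pdf",
		"https://example.com/assets/logo.png",
	} {
		fmt.Println(u, "accepted:", accept.MatchString(u))
	}
}
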
Error Recovery
  • Use -timeout for unreliable connections
  • Enable -verbose logging for debugging
  • Check generated URL lists before downloading
  • Consider using -special-rate for sensitive servers

Technical Details

Rate Limiting Implementation
  • Standard Mode: Token bucket algorithm via golang.org/x/time/rate
  • Special Mode: Sliding window with automatic block detection
  • Backoff Strategy: Exponential backoff on HTTP 429/503 responses
  • Concurrent Control: Configurable semaphore limiting
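
As a rough sketch of how these standard-mode pieces fit together (assumed shape for illustration, not the project's exact code):

// Standard-mode building blocks: a token-bucket limiter from
// golang.org/x/time/rate, a buffered channel acting as a semaphore for
// concurrency control, and exponential backoff on HTTP 429/503 responses.
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	limiter := rate.NewLimiter(rate.Limit(1.0), 1) // -rate 1.0: one request per second
	sem := make(chan struct{}, 4)                  // at most 4 requests in flight
	ctx := context.Background()

	fetch := func(url string) {
		sem <- struct{}{}        // acquire a concurrency slot
		defer func() { <-sem }() // release it

		backoff := time.Second
		for attempt := 0; attempt < 5; attempt++ {
			_ = limiter.Wait(ctx) // block until the token bucket allows a request
			resp, err := http.Get(url)
			if err != nil {
				return
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode == http.StatusServiceUnavailable {
				time.Sleep(backoff) // exponential backoff on 429/503
				backoff *= 2
				continue
			}
			fmt.Println(url, resp.Status)
			return
		}
	}

	var wg sync.WaitGroup
	for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			fetch(u)
		}(u)
	}
	wg.Wait()
}
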
URL Discovery
  • HTML Parsing: Uses goquery for robust DOM traversal
  • Link Resolution: Handles relative and absolute URLs correctly
  • Deduplication: Prevents revisiting discovered URLs
  • Filtering: Real-time regex-based URL filtering
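
A condensed example of that pipeline, using goquery and the standard net/url package (illustrative only; WebSpider's internals may differ):

// Sketch of the discovery pipeline: parse a page with goquery, resolve
// relative links against the base URL, deduplicate, and apply an accept
// regex filter (a -reject pattern would be checked the same way).
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"regexp"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	base, _ := url.Parse("https://example.com/files/")
	accept := regexp.MustCompile(`\.(pdf|zip)$`)
	seen := map[string]bool{} // deduplication of discovered URLs

	resp, err := http.Get(base.String())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		ref, err := url.Parse(href)
		if err != nil {
			return
		}
		abs := base.ResolveReference(ref).String() // handles relative and absolute links
		if seen[abs] || !accept.MatchString(abs) {
			return
		}
		seen[abs] = true
		fmt.Println("discovered:", abs)
	})
}
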
Download Management
  • Directory Preservation: Maintains original site structure
  • Atomic Operations: Safe concurrent file creation
  • Error Handling: Graceful failure recovery
  • Progress Tracking: Comprehensive logging system
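
The structure preservation and atomic writes reduce to standard library calls; a rough sketch of the assumed behaviour (not the project's exact implementation):

// Structure-preserving, atomic download: map the URL path onto the output
// directory, stream the body to a temporary file, then rename it into place
// so partially written files never appear under their final name.
package main

import (
	"io"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
)

func download(rawURL, outputDir string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	// Preserve the site's hierarchy: <outputDir>/<host>/<path>
	dest := filepath.Join(outputDir, u.Host, filepath.FromSlash(u.Path))
	if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
		return err
	}

	resp, err := http.Get(rawURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	tmp, err := os.CreateTemp(filepath.Dir(dest), ".partial-*")
	if err != nil {
		return err
	}
	if _, err := io.Copy(tmp, resp.Body); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dest) // atomic on the same filesystem
}

func main() {
	_ = download("https://example.com/files/report.pdf", "./downloads")
}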

Troubleshooting

Common Issues

Rate Limiting Errors (HTTP 429)

# Solution: Use special rate limiting mode
./webspider -special-rate -max-requests 1 -time-window 10s

Connection Timeouts

# Solution: Increase timeout and reduce rate
./webspider -timeout 60s -rate 0.1

Memory Usage with Large Sites

# Solution: Limit depth and use filtering
./webspider -depth 2 -accept "\.(pdf|zip)$"
Windows-Specific Issues

"ws is not recognized" Error

# Check if PATH was added correctly
echo %PATH%

# Verify the executable exists
dir C:\WebSpider\ws.exe

# Try full path if PATH isn't working
C:\WebSpider\ws.exe -url "https://example.com/" -discover-only

Permission Issues on Windows

# Run Command Prompt as Administrator when setting up PATH
# Or use the PowerShell profile method instead
Getting Help
  • Use -h flag to see all available options
  • Enable -verbose logging for detailed operation information
  • Check the generated URL list files for unexpected results
  • Test with small depth values before full crawls

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup
git clone https://github.com/0xRepo-Source/WebSpider.git
cd WebSpider
go mod tidy
go build -o webspider
Running Tests
go test ./...

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with goquery for HTML parsing
  • Rate limiting powered by golang.org/x/time/rate
  • Inspired by wget's spider functionality with modern improvements
