How to Crawl Your Entire Site with JavaScript and UTM Parameters Using Crawlbase on macOS

At illphated.com, I like to automate everything—especially when it comes to tracking page performance and how different links behave. Recently, I needed a reliable way to crawl all pages of my site, including those with dynamic JavaScript and UTM parameters, for SEO and analytics research.

Enter Crawlbase (formerly ProxyCrawl), a powerful scraping platform that supports real browser rendering—perfect for modern websites with JavaScript and multimedia.

Here’s how I automated crawling illphated.com from macOS using a simple shell script, including support for ?utm=1 through ?utm=999.

🔧 The Problem
Most scrapers choke on JavaScript-heavy pages or fail to simulate realistic browser traffic. I needed something that:

Renders JavaScript like a real browser

Handles multimedia and dynamic loading

Looks like a real user (not a bot)

Can cycle through all my UTM-tracked URLs

🚀 The Solution: Crawlbase + Shell Script
Using Crawlbase’s JavaScript token, I built a macOS script to crawl:

✅ Static pages like /about, /blog, etc.
✅ Dynamic URLs like /?utm=1 through /?utm=999
✅ Rendered JavaScript just like a real user would see
✅ A modern Windows Chrome user-agent to stay stealthy
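
Before kicking off the full crawl, it's worth sanity-checking the token with a single one-off request. A minimal sketch (YOUR_JS_TOKEN is a placeholder for your own Crawlbase JavaScript token; the render=true parameter is the same one the script below uses):

curl -s "https://api.crawlbase.com/?token=YOUR_JS_TOKEN&url=https%3A%2F%2Fillphated.com%2F&render=true" -o test.html

If test.html contains the rendered homepage markup, the token and rendering are working and the full script should behave the same way.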

🧠 The Script (Save as crawl_illphated.sh)

#!/bin/bash

# Crawlbase JavaScript token
TOKEN="GqmUjbLgg1HOFyfYIQnhlQ"
BASE_URL="https://api.crawlbase.com"
SITE_URL="https://illphated.com"

# Latest Windows Chrome user-agent (July 2025)
USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.127 Safari/537.36"

# Output directory
OUTPUT_DIR="./crawl_results"
mkdir -p "$OUTPUT_DIR"

# Percent-encode a URL for the API (passing it via sys.argv avoids shell-quoting issues)
urlencode() {
  python3 -c "import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1], safe=''))" "$1"
}

# Static paths to crawl ("" is the homepage)
STATIC_PATHS=(
  ""
  "/about"
  "/blog"
  "/contact"
  "/travel"
  "/search?q=cyberpunk"
)

# Crawl static pages
echo "🚀 Crawling static pages…"
for path in "${STATIC_PATHS[@]}"; do
  FULL_URL="${SITE_URL}${path}"
  ENCODED_URL=$(urlencode "$FULL_URL")

  # Build a filesystem-safe filename; the empty homepage path becomes index.html
  SLUG=$(echo "$path" | tr '/?' '__' | sed 's/^_//')
  OUTPUT_FILE="$OUTPUT_DIR/${SLUG:-index}.html"

  echo "🔍 Crawling: $FULL_URL"
  # --compressed requests gzip and lets curl decode it before saving
  curl -s -A "$USER_AGENT" \
    --compressed \
    "$BASE_URL/?token=$TOKEN&url=$ENCODED_URL&render=true" \
    -o "$OUTPUT_FILE"
  sleep 0.25
done

# Crawl all ?utm=1 to ?utm=999 URLs
echo "🚀 Crawling UTM query pages…"
for i in $(seq 1 999); do
  FULL_URL="${SITE_URL}/?utm=$i"
  ENCODED_URL=$(urlencode "$FULL_URL")
  OUTPUT_FILE="$OUTPUT_DIR/utm_$i.html"

  echo "🔍 Crawling: $FULL_URL"
  curl -s -A "$USER_AGENT" \
    --compressed \
    "$BASE_URL/?token=$TOKEN&url=$ENCODED_URL&render=true" \
    -o "$OUTPUT_FILE"
  sleep 0.25
done

echo "🎉 Full crawl complete! Results saved to $OUTPUT_DIR"

🧪 How to Run It on macOS
Save the script as crawl_illphated.sh

Make it executable:

chmod +x crawl_illphated.sh

Run it:

./crawl_illphated.sh
It will create a folder called crawl_results/ filled with rendered HTML files from every page crawled.
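
Once it finishes, a couple of quick optional checks on the output (assuming the default crawl_results/ directory from the script):

# how many snapshots were saved?
ls crawl_results | wc -l

# flag any files that came back suspiciously small (often errors or blocked requests)
find crawl_results -name "*.html" -size -2k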

🎯 Why It’s Useful
This is perfect for:

Verifying UTM tracking link functionality (see the quick check below)

Benchmarking SEO and rendering speed

Testing dynamic elements or lazy-loaded content

Running offline snapshots of your website for archival or QA
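
For the UTM point, a rough spot-check is to grep a few rendered snapshots for the utm value. This only proves something if your analytics snippet or internal links echo the query string back into the markup, so treat it as a hypothetical starting point:

# spot-check a few UTM snapshots for the utm value in the rendered HTML
for i in 1 50 999; do
  printf "utm=%s: %s matching line(s)\n" "$i" "$(grep -c "utm=$i" "crawl_results/utm_$i.html")"
done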

👀 What’s Next?
Want to level it up?

Add link discovery to crawl deeper from internal pages

Parse HTML and export metadata (title, meta tags, H1s)

Auto-upload the data to S3 or Google Drive for backup

Schedule this script via cron for nightly automation (example below)
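
For the cron idea, a nightly crontab entry could look like this (the paths are placeholders; point them at wherever the script and log should live):

# edit the crontab with `crontab -e`, then add a line like:
0 2 * * * /Users/illphated/scripts/crawl_illphated.sh >> /Users/illphated/logs/crawl.log 2>&1

macOS also has launchd for scheduled jobs, but a plain cron entry is enough for a nightly run like this.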

If you’re obsessed with automation like I am, this is just the beginning.

Follow @illphated for more web hacking, automation, and real-world digital wizardry.

🛰️
