How to Crawl Your Entire Site with JavaScript and UTM Parameters Using Crawlbase on macOS
At illphated.com, I like to automate everything—especially when it comes to tracking page performance and how different links behave. Recently, I needed a reliable way to crawl all pages of my site, including those with dynamic JavaScript and UTM parameters, for SEO and analytics research.
Enter Crawlbase (formerly ProxyCrawl), a powerful scraping platform that supports real browser rendering—perfect for modern websites with JavaScript and multimedia.
Here’s how I automated crawling illphated.com from macOS using a simple shell script, including support for ?utm=1 through ?utm=999.
🔧 The Problem
Most scrapers choke on JavaScript-heavy pages or fail to simulate realistic browser traffic. I needed something that:
Renders JavaScript like a real browser
Handles multimedia and dynamic loading
Looks like a real user (not a bot)
Can cycle through all my UTM-tracked URLs
🚀 The Solution: Crawlbase + Shell Script
Using Crawlbase’s JavaScript token, I built a macOS script to crawl:
✅ Static pages like /about, /blog, etc.
✅ Dynamic URLs like /?utm=1 through /?utm=999
✅ Fully rendered JavaScript, just as a real user would see it
✅ Every request sent with a modern Windows Chrome user-agent to stay stealthy
🧠 The Script (Save as crawl_illphated.sh)
#!/bin/bash

# Crawlbase JavaScript token
TOKEN="GqmUjbLgg1HOFyfYIQnhlQ"
BASE_URL="https://api.crawlbase.com"
SITE_URL="https://illphated.com"

# Latest Windows Chrome user-agent (July 2025)
USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.127 Safari/537.36"

# Output directory
OUTPUT_DIR="./crawl_results"
mkdir -p "$OUTPUT_DIR"

# Percent-encode URLs for the API
urlencode() {
  local raw="$1"
  python3 -c "import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1], safe=''))" "$raw"
}

# Static paths to crawl
STATIC_PATHS=(
  ""
  "/about"
  "/blog"
  "/contact"
  "/travel"
  "/search?q=cyberpunk"
)

# Crawl static pages
echo "🚀 Crawling static pages…"
for path in "${STATIC_PATHS[@]}"; do
  FULL_URL="${SITE_URL}${path}"
  ENCODED_URL=$(urlencode "$FULL_URL")
  # "" (the homepage) becomes home.html; "/about" becomes about.html, and so on
  OUTPUT_FILE="$OUTPUT_DIR/$(echo "${path:-home}" | tr '/?' '__' | sed 's/^_//').html"
  echo "🔍 Crawling: $FULL_URL"
  # request gzip, and let curl decode it before saving the rendered HTML
  curl -s -A "$USER_AGENT" \
    --compressed \
    "$BASE_URL/?token=$TOKEN&url=$ENCODED_URL&render=true" \
    -o "$OUTPUT_FILE"
  sleep 0.25
done

# Crawl all ?utm=1 to ?utm=999 URLs
echo "🚀 Crawling UTM query pages…"
for i in $(seq 1 999); do
  FULL_URL="${SITE_URL}/?utm=$i"
  ENCODED_URL=$(urlencode "$FULL_URL")
  OUTPUT_FILE="$OUTPUT_DIR/utm_$i.html"
  echo "🔍 Crawling: $FULL_URL"
  curl -s -A "$USER_AGENT" \
    --compressed \
    "$BASE_URL/?token=$TOKEN&url=$ENCODED_URL&render=true" \
    -o "$OUTPUT_FILE"
  sleep 0.25
done

echo "🎉 Full crawl complete! Results saved to $OUTPUT_DIR"
🧪 How to Run It on macOS
Save the script as crawl_illphated.sh
Make it executable:
chmod +x crawl_illphated.sh
Run it:
./crawl_illphated.sh
It will create a folder called crawl_results/ filled with rendered HTML files from every page crawled.
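As a quick sanity check (optional, and just a minimal sketch), count the snapshots and spot-check one rendered page for a title:

# should print 1005 (6 static pages + 999 UTM pages)
ls crawl_results/*.html | wc -l
# the rendered page should contain a real <title>, not an empty JS shell
grep -o "<title>[^<]*</title>" crawl_results/utm_1.html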
🎯 Why It’s Useful
This is perfect for:
Verifying UTM tracking link functionality (quick check below)
Benchmarking SEO and rendering speed
Testing dynamic elements or lazy-loaded content
Running offline snapshots of your website for archival or QA
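For the UTM use case, here's a rough check, just a sketch that greps the raw snapshots rather than properly parsing them, to confirm the UTM value survived the client-side render:

# pick a few UTM snapshots and count lines mentioning their utm value
for i in 1 50 999; do
  echo "utm=$i: $(grep -c "utm=$i" "crawl_results/utm_$i.html") matching line(s)"
done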
👀 What’s Next?
Want to level it up?
Add link discovery to crawl deeper from internal pages
Parse HTML and export metadata (title, meta tags, H1s), as sketched below
Auto-upload the data to S3 or Google Drive for backup
Schedule this script via cron for nightly automation (example below)
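Here's a minimal sketch of the last two ideas. It assumes the script lives at /Users/you/scripts/crawl_illphated.sh, so adjust the paths to wherever you saved it.

# dump every snapshot's <title> into a simple CSV (file,title)
for f in crawl_results/*.html; do
  title=$(grep -o "<title>[^<]*</title>" "$f" | head -1 | sed 's/<[^>]*>//g')
  echo "$f,$title"
done > crawl_metadata.csv

For the nightly run, add a line like this with crontab -e (2:00 AM every day, output logged):

0 2 * * * /Users/you/scripts/crawl_illphated.sh >> /Users/you/scripts/crawl.log 2>&1

Keep in mind that cron typically runs the job from your home directory, so crawl_results/ will land there unless you cd into a specific folder first.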
If you’re obsessed with automation like I am, this is just the beginning.
Follow @illphated for more web hacking, automation, and real-world digital wizardry.
🛰️