How to Crawl Your Entire Site with JavaScript and UTM Parameters Using Crawlbase on macOS


illphated

At illphated.com, I like to automate everything—especially when it comes to tracking page performance and how different links behave. Recently, I needed a reliable way to crawl all pages of my site, including those with dynamic JavaScript and UTM parameters, for SEO and analytics research.

Enter Crawlbase (formerly ProxyCrawl), a powerful scraping platform that supports real browser rendering—perfect for modern websites with JavaScript and multimedia.


Here’s how I automated crawling illphated.com from macOS using a simple shell script, including support for ?utm=1 through ?utm=999.
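Before looking at the full script, here's the shape of a single Crawlbase request: the target URL must be percent-encoded before it is passed as the `url` query parameter, and `render=true` asks the API for a real-browser render. This sketch only builds and prints the request URL (the token here is a placeholder, not a real credential):

```shell
# Build one Crawlbase API request URL. TOKEN is a placeholder for your
# Crawlbase JavaScript token; render=true requests full browser rendering.
TOKEN="YOUR_JS_TOKEN"
TARGET="https://illphated.com/?utm=1"

# Percent-encode the target so its own ? and = don't collide with the
# API's query string.
ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$TARGET")

echo "https://api.crawlbase.com/?token=$TOKEN&url=$ENCODED&render=true"
```

Pass that URL to `curl` (as the full script below does) and the response body is the rendered HTML of the target page.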

🔧 The Problem
Most scrapers choke on JavaScript-heavy pages or fail to simulate realistic browser traffic. I needed something that:

Renders JavaScript like a real browser

Handles multimedia and dynamic loading

Looks like a real user (not a bot)

Can cycle through all my UTM-tracked URLs

🚀 The Solution: Crawlbase + Shell Script
Using Crawlbase’s JavaScript token, I built a macOS script to crawl:

✅ Static pages like /about, /blog, etc.
✅ Dynamic URLs like /?utm=1 through /?utm=999
✅ Rendered JavaScript just like a real user would see
✅ A modern Windows Chrome user-agent to stay stealthy

🧠 The Script (Save as crawl_illphated.sh)

#!/bin/bash

# Crawlbase JavaScript Token
TOKEN="GqmUjbLgg1HOFyfYIQnhlQ"
BASE_URL="https://api.crawlbase.com"
SITE_URL="https://illphated.com"

# Latest Windows Chrome user-agent (July 2025)
USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.127 Safari/537.36"

# Output directory
OUTPUT_DIR="./crawl_results"
mkdir -p "$OUTPUT_DIR"

# Percent-encode a URL for the API (passed as argv to avoid quoting issues)
urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

# Static paths to crawl
STATIC_PATHS=(
  ""
  "/about"
  "/blog"
  "/contact"
  "/travel"
  "/search?q=cyberpunk"
)

# Crawl static pages
echo "🚀 Crawling static pages…"
for path in "${STATIC_PATHS[@]}"; do
  FULL_URL="${SITE_URL}${path}"
  ENCODED_URL=$(urlencode "$FULL_URL")
  NAME=$(echo "$path" | tr '/?' '__' | sed 's/^_//')
  OUTPUT_FILE="$OUTPUT_DIR/${NAME:-home}.html"   # empty path = home page

  echo "🔍 Crawling: $FULL_URL"
  # --compressed asks for gzip AND decompresses it before saving
  curl -s --compressed -A "$USER_AGENT" \
    "$BASE_URL/?token=$TOKEN&url=$ENCODED_URL&render=true" \
    -o "$OUTPUT_FILE"
  sleep 0.25
done

# Crawl all ?utm=1 to ?utm=999 URLs
echo "🚀 Crawling UTM query pages…"
for i in $(seq 1 999); do
  FULL_URL="${SITE_URL}/?utm=$i"
  ENCODED_URL=$(urlencode "$FULL_URL")
  OUTPUT_FILE="$OUTPUT_DIR/utm_$i.html"

  echo "🔍 Crawling: $FULL_URL"
  curl -s --compressed -A "$USER_AGENT" \
    "$BASE_URL/?token=$TOKEN&url=$ENCODED_URL&render=true" \
    -o "$OUTPUT_FILE"
  sleep 0.25
done

echo "🎉 Full crawl complete! Results saved to $OUTPUT_DIR"

🧪 How to Run It on macOS
Save the script as crawl_illphated.sh

Make it executable:

chmod +x crawl_illphated.sh

Run it:

./crawl_illphated.sh
It will create a folder called crawl_results/ filled with rendered HTML files from every page crawled.
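Once the crawl finishes, it's worth a quick sanity check before trusting the output. This sketch (assuming the default `./crawl_results` directory from the script) counts the saved pages and flags suspiciously small files, which usually mean an API error or an empty response:

```shell
# Sanity-check the crawl output in ./crawl_results.
RESULTS_DIR="./crawl_results"
mkdir -p "$RESULTS_DIR"   # no-op if the crawl already created it

# Count how many HTML files were actually saved.
total=$(find "$RESULTS_DIR" -name '*.html' | wc -l)
echo "Saved $total HTML files"

# List files under 1 KB -- likely error bodies rather than rendered pages.
find "$RESULTS_DIR" -name '*.html' -size -1k -print
```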

🎯 Why It’s Useful
This is perfect for:

Verifying UTM tracking link functionality

Benchmarking SEO and rendering speed

Testing dynamic elements or lazy-loaded content

Running offline snapshots of your website for archival or QA
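For the QA and snapshot use cases above, a quick way to confirm the renders came back with real content is to pull the `<title>` out of each saved file. This is a rough `sed` sketch (a real HTML parser is more robust, and it assumes the `./crawl_results` directory from the script):

```shell
# Print filename and <title> for every saved page.
for f in ./crawl_results/*.html; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  # Extract the first <title>...</title> on a line, if any.
  title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$f" | head -n 1)
  echo "$f: ${title:-<no title>}"
done
```

A page with an empty or missing title is a good candidate for a re-crawl.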

👀 What’s Next?
Want to level it up?

Add link discovery to crawl deeper from internal pages

Parse HTML and export metadata (title, meta tags, H1s)

Auto-upload the data to S3 or Google Drive for backup

Schedule this script via cron for nightly automation
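The cron idea from the list could look like this crontab entry (the paths are hypothetical; adjust them to wherever you keep the script and logs):

```shell
# Run the crawler every night at 02:30 and append output to a log.
# Install with `crontab -e` on macOS (launchd is the native alternative).
30 2 * * * /Users/illphated/scripts/crawl_illphated.sh >> /Users/illphated/logs/crawl.log 2>&1
```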

If you’re obsessed with automation like I am, this is just the beginning.

Follow @illphated for more web hacking, automation, and real-world digital wizardry.

🛰️
