🕷️ Scraping · Mar 2026 · 3 pillars · 35 min read

playwright-scraping-advanced

A production-grade guide to advanced browser scraping with Playwright — covering the three pillars of industrial-strength automation: Network Interception for capturing and mocking XHR/fetch/WebSocket traffic, Chrome DevTools Protocol (CDP) for raw Chromium engine access, and Stealth Mode for fingerprint evasion and bypassing bot-detection systems like Cloudflare, DataDome, and Imperva.

page.route() · route.fetch() · WebSocketRoute · HAR · CDP · Network.enable · playwright-stealth · addInitScript · navigator.webdriver · bezier_move · proxy_rotation · FlareSolverr · capsolver · 2captcha · Profiler.coverage · Emulation.setCPUThrottlingRate

Three Pillars of Advanced Scraping

Most Playwright tutorials stop at page.goto() and page.locator(). Production scrapers need to go deeper — intercepting the API calls that power dynamic pages, reaching into the Chromium engine via CDP, and evading the fingerprinting systems that block headless browsers within milliseconds of the first request.

3 · skill pillars in this file
10+ · bot-detector signals patched
40% · load time saved by blocking assets
PILLAR 01
Network Interception
page.route() intercepts every request before it leaves the browser. Capture JSON from background XHR/fetch calls, mock API responses for testing, block analytics/images to speed up crawling, intercept WebSocket frames, and record or replay traffic as HAR files.
PILLAR 02
Chrome DevTools Protocol (CDP)
CDP gives direct access to Chromium's internals below Playwright's abstraction. Collect JS coverage, measure performance timing, throttle CPU/network to simulate mobile, override geolocation, set cookies at the protocol level, and listen to raw browser events. Chromium only — Firefox and WebKit do not support CDP.
PILLAR 03
Stealth & Anti-Bot Evasion
Bot detectors check dozens of browser signals: navigator.webdriver, canvas fingerprints, WebGL strings, plugin arrays, mouse trajectory, timing anomalies, and IP reputation. This pillar covers baseline Chrome flags, playwright-stealth, manual fingerprint patches, human mouse simulation, proxy rotation, captcha solving, and Cloudflare bypass strategies.

Environment Setup

bash · installation
# Core — always required
pip install playwright
playwright install chromium        # preferred for CDP + stealth

# Stealth plugin (Python)
pip install playwright-stealth

# Node.js stealth alternative
npm install playwright-extra puppeteer-extra-plugin-stealth

# Optional helpers
pip install httpx faker            # proxy rotation, realistic UAs

Always Use Chromium for Advanced Scraping

Firefox and WebKit lack full CDP support. Chromium is also the only engine where playwright-stealth and fingerprint patches are effective: the patches target Chromium-specific leaks, and a Chrome user-agent served from a Firefox or WebKit engine is itself an inconsistency that detectors flag.

Universal Boilerplate

Start every advanced scraper with this skeleton, then layer in the patterns from each pillar. Apply stealth before goto(), register routes before navigation, open CDP session after context creation.

python · base scraper skeleton (async)
import asyncio
from playwright.async_api import async_playwright, Page, BrowserContext

STEALTH_ARGS = [
    "--no-sandbox",
    "--disable-blink-features=AutomationControlled",   # removes webdriver flag
    "--disable-infobars",
    "--disable-dev-shm-usage",
]

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=STEALTH_ARGS,
        )
        context: BrowserContext = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
        )
        page: Page = await context.new_page()

        # 1. Stealth patches (see Pillar 3)
        # 2. Route handlers (see Pillar 1)
        # 3. CDP session (see Pillar 2)
        # 4. Navigate last
        await page.goto("https://target.com", wait_until="networkidle")

        await browser.close()

asyncio.run(main())

Pillar 1 — Network Interception

page.route(pattern, handler) intercepts all matching requests before they leave the browser. The handler must call exactly one terminal method — route.continue_(), route.fulfill(), route.abort(), or route.fetch() followed by route.fulfill() — or the request hangs forever.

Goal · Method
Read-only capture, no side effects · page.on("response", ...)
Capture + pass through · route.fetch() then route.fulfill(response=resp)
Fully mock response · route.fulfill(status=200, body=...)
Block resource · route.abort()
Modify headers only · route.continue_(headers=...)
Redirect URL · route.continue_(url=new_url)
WebSocket tap · page.route_web_socket(...)
Full offline replay · context.route_from_har(...)
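Two rows of the table get no example elsewhere in this guide: full mocking and header modification. A minimal sketch, assuming a hypothetical products endpoint and payload:

```python
import json

# Hypothetical payload for a fully mocked endpoint
MOCK_PRODUCTS = {"items": [{"id": 1, "name": "demo"}], "total": 1}

async def mock_products(route, request):
    # The page receives this JSON; the real server is never contacted
    await route.fulfill(
        status=200,
        content_type="application/json",
        body=json.dumps(MOCK_PRODUCTS),
    )

def with_extra_headers(headers: dict, extra: dict) -> dict:
    # Merge extra headers over the request's originals
    merged = dict(headers)
    merged.update(extra)
    return merged

async def add_headers(route, request):
    # Let the request proceed, but with modified headers
    await route.continue_(
        headers=with_extra_headers(request.headers, {"x-requested-with": "XMLHttpRequest"})
    )

# await page.route("**/api/products*", mock_products)
# await page.route("**/api/**",        add_headers)
```

Mocking is especially useful for testing pagination logic against a fixed payload before pointing the scraper at the live site.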

Capturing API Responses

The most common use case: grab JSON from background XHR/fetch calls instead of scraping rendered HTML. This gives you clean, structured data directly from the source.

python · capture XHR/fetch JSON — intercept + pass-through
import json

captured_data = []

async def intercept(route, request):
    response = await route.fetch()           # fetch from real server
    body = await response.body()
    try:
        captured_data.append(json.loads(body))
    except Exception:
        pass
    await route.fulfill(response=response)   # pass through to page

await page.route("**/api/products*", intercept)
await page.goto("https://shop.example.com/catalog")
await page.wait_for_load_state("networkidle")
print(f"Captured {len(captured_data)} API responses")
python · read-only capture with page.on("response")
# Simpler for read-only — no route needed, no risk of hanging
responses = []
page.on("response", lambda resp: responses.append(resp) if "api/items" in resp.url else None)

await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")

for resp in responses:
    if resp.ok:
        data = await resp.json()
        print(data)
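When you know exactly which call the page will fire, page.expect_response is often simpler than collecting everything: it waits for the first response matching a predicate. A sketch, where the "api/items" fragment is an assumed endpoint:

```python
def is_items_response(resp) -> bool:
    # Predicate for expect_response: match the assumed endpoint, success only
    return "api/items" in resp.url and resp.status == 200

async def fetch_items_json(page, url: str):
    # Register the expectation BEFORE triggering navigation
    async with page.expect_response(is_items_response) as resp_info:
        await page.goto(url)
    response = await resp_info.value
    return await response.json()
```

Because the expectation is armed before goto(), there is no race between navigation and the response arriving.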

Intercepting Paginated APIs

python · capture all pages via Load More button
all_pages = []

async def capture_page(route, request):
    response = await route.fetch()
    data = await response.json()
    all_pages.append(data)
    await route.fulfill(response=response)

await page.route("**/api/items*", capture_page)
await page.goto("https://example.com/items")

# Click "Load More" until it disappears
while await page.locator("#load-more").is_visible():
    await page.click("#load-more")
    await page.wait_for_load_state("networkidle")

print(f"Total pages captured: {len(all_pages)}")

Blocking Unwanted Resources

python · smart resource blocker — 40–70% faster loads
BLOCKED_TYPES   = {"image", "media", "font", "stylesheet"}
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net", "hotjar.com", "facebook.com"}

async def smart_block(route, request):
    if request.resource_type in BLOCKED_TYPES:
        await route.abort()
        return
    if any(d in request.url for d in BLOCKED_DOMAINS):
        await route.abort()
        return
    await route.continue_()

await page.route("**/*", smart_block)

WebSocket Interception

Playwright 1.48+ exposes WebSocketRoute for full WS control — ideal for scraping real-time dashboards, live trading feeds, or chat applications.

python · capture and forward WebSocket frames
messages = []    # module scope, so it is visible after the handler runs

def ws_handler(ws_route):
    server = ws_route.connect_to_server()   # open the real upstream connection

    def on_server_message(message):
        messages.append(message)            # log inbound frames
        ws_route.send(message)              # forward to page unchanged

    server.on_message(on_server_message)
    # page → server frames are forwarded automatically as long as no
    # on_message handler is installed on ws_route itself

await page.route_web_socket("wss://stream.example.com/**", ws_handler)
await page.goto("https://example.com/live")
await page.wait_for_timeout(10_000)    # collect 10s of frames
print(f"Captured {len(messages)} WS frames")
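The same WebSocketRoute API can mock a WS server outright: skip connect_to_server() and answer the page's frames yourself. The frame format below is a made-up example protocol, not any real feed's schema.

```python
import json

def make_tick(symbol: str, price: float) -> str:
    # Hypothetical server frame for the mocked feed
    return json.dumps({"type": "tick", "symbol": symbol, "price": price})

def ws_mock(ws_route):
    # No connect_to_server() call, so the real server is never contacted
    def on_page_message(message):
        if "subscribe" in message:
            ws_route.send(make_tick("DEMO", 42.0))
    ws_route.on_message(on_page_message)

# await page.route_web_socket("wss://stream.example.com/**", ws_mock)
```

Useful for testing a scraper's frame-parsing logic offline before pointing it at the live stream.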

HAR Recording & Offline Replay

python · record HAR → replay fully offline
# RECORD — capture all API traffic to HAR
context = await browser.new_context(
    record_har_path="capture.har",
    record_har_url_filter="**/api/**",    # only API traffic
    record_har_content="attach",          # embed response bodies
)
page = await context.new_page()
await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")
await context.close()    # HAR written on close

# REPLAY — serve all requests from HAR, no real network
context = await browser.new_context()
await context.route_from_har(
    "capture.har",
    url="**/api/**",
    update=False,         # strict mode — abort on miss
)
page = await context.new_page()
await page.goto("https://example.com")   # fully offline

Pillar 2 — Chrome DevTools Protocol (CDP)

CDP provides direct access to Chromium's internals below Playwright's high-level API. Open a session with page.context.new_cdp_session(page), then send commands and listen to events from any CDP domain. Always enable the domain before listening to its events.

python · opening a CDP session
# Open session scoped to a specific page
client = await page.context.new_cdp_session(page)

# Enable a domain before using it
await client.send("Network.enable")

# Send a command and get the result
result = await client.send("Network.getAllCookies")
cookies = result["cookies"]

# Listen to events
client.on("Network.requestWillBeSent", lambda p: print(p["request"]["url"]))

# Detach when done (optional — auto-closes with context)
await client.detach()

Network Domain — Richer Than page.route()

python · capture full request/response details via CDP
client = await page.context.new_cdp_session(page)
await client.send("Network.enable")

requests: dict = {}

def on_request(params):
    requests[params["requestId"]] = {
        "url":     params["request"]["url"],
        "method":  params["request"]["method"],
        "headers": params["request"]["headers"],
    }

def on_response(params):
    rid = params["requestId"]
    if rid in requests:
        requests[rid]["status"] = params["response"]["status"]
        requests[rid]["mime"]   = params["response"]["mimeType"]

client.on("Network.requestWillBeSent", on_request)
client.on("Network.responseReceived",  on_response)

await page.goto("https://example.com")

# Get response body via CDP (bypasses Playwright's body restrictions);
# call while the resource is still retained, before navigating away
import base64

async def get_body(request_id: str) -> bytes:
    result = await client.send("Network.getResponseBody", {"requestId": request_id})
    if result.get("base64Encoded"):
        return base64.b64decode(result["body"])
    return result["body"].encode()

Performance Metrics & JS Coverage

python · performance metrics + JS coverage collection
# Performance metrics (CDP reports timestamps in seconds, sizes in bytes)
await client.send("Performance.enable")
metrics_result = await client.send("Performance.getMetrics")
metrics = {m["name"]: m["value"] for m in metrics_result["metrics"]}
print(f"JS Heap Used: {metrics.get('JSHeapUsedSize', 0) / 1e6:.1f} MB")
print(f"DOM Nodes: {metrics.get('Nodes', 0):.0f}")

# JS coverage — which code actually ran?
await client.send("Profiler.enable")
await client.send("Profiler.startPreciseCoverage", {"callCount": True, "detailed": True})

await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")

result = await client.send("Profiler.takePreciseCoverage")
for script in result["result"]:
    # a function's first range spans its whole body; nested ranges refine it,
    # so this percentage is approximate
    total   = sum(
        fn["ranges"][0]["endOffset"] - fn["ranges"][0]["startOffset"]
        for fn in script["functions"] if fn["ranges"]
    )
    covered = sum(
        r["endOffset"] - r["startOffset"]
        for fn in script["functions"]
        for r in fn["ranges"] if r["count"] > 0
    )
    pct = (covered / total * 100) if total else 0
    print(f"{script['url']}: {pct:.1f}% executed")

Emulation: CPU, Network & Geolocation

python · throttling and geolocation override
# CPU throttling — simulate mid-range Android device (4× slowdown)
await client.send("Emulation.setCPUThrottlingRate", {"rate": 4})

# Network throttling — simulate 3G connection
await client.send("Network.emulateNetworkConditions", {
    "offline":             False,
    "downloadThroughput":  1.5 * 1024 * 1024 / 8,   # 1.5 Mbps
    "uploadThroughput":    750 * 1024 / 8,            # 750 Kbps
    "latency":             150,                       # 150ms RTT
})

# Simulate offline
await client.send("Network.emulateNetworkConditions", {
    "offline": True, "downloadThroughput": -1, "uploadThroughput": -1, "latency": 0,
})

# Override geolocation
await client.send("Emulation.setGeolocationOverride", {
    "latitude":  1.3521,    # Singapore
    "longitude": 103.8198,
    "accuracy":  100,
})

Full CDP Command Reference

Domain · Command · Purpose
Network · Network.enable · Start network events
Network · Network.getResponseBody · Fetch body after load
Network · Network.setBlockedURLs · Block URL patterns
Network · Network.emulateNetworkConditions · Simulate slow network
Network · Network.getAllCookies · Read all cookies
Performance · Performance.getMetrics · Perf counters
Profiler · Profiler.startPreciseCoverage · JS coverage collection
Emulation · Emulation.setCPUThrottlingRate · CPU slowdown
Emulation · Emulation.setGeolocationOverride · Fake GPS coordinates
Emulation · Emulation.setDeviceMetricsOverride · Custom viewport
Security · Security.setIgnoreCertificateErrors · Skip TLS errors
Runtime · Runtime.evaluate · Run JS in page context
Runtime · Runtime.addBinding · Expose callback to page
Input · Input.dispatchKeyEvent · Synthetic keyboard
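Three of these commands have no example above, so here is a sketch wrapping them in one helper; the blocked patterns and viewport values are arbitrary illustrations:

```python
def cdp_block_params(patterns: list[str]) -> dict:
    # Payload shape expected by Network.setBlockedURLs
    return {"urls": list(patterns)}

async def apply_cdp_tweaks(page) -> str:
    client = await page.context.new_cdp_session(page)
    await client.send("Network.enable")

    # Block URL patterns at the protocol level; no page.route handler needed
    await client.send("Network.setBlockedURLs",
                      cdp_block_params(["*doubleclick.net*", "*.woff2"]))

    # Emulate a phone-sized screen via Emulation.setDeviceMetricsOverride
    await client.send("Emulation.setDeviceMetricsOverride", {
        "width": 390, "height": 844, "deviceScaleFactor": 3, "mobile": True,
    })

    # Run JS in the page and return the value via Runtime.evaluate
    result = await client.send("Runtime.evaluate", {
        "expression": "document.title",
        "returnByValue": True,
    })
    return result["result"]["value"]
```

Network.setBlockedURLs is lighter than a page.route handler when you only need blocking, since no round-trip into Python happens per request.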

Pillar 3 — Stealth & Anti-Bot Evasion

Modern bot-detection systems check dozens of browser signals simultaneously. A headless Chromium without stealth patches fails detection within milliseconds. This section covers the full layered defence: baseline flags, the playwright-stealth library, manual fingerprint patches, human interaction simulation, proxy rotation, and captcha solving.

Threat Model — What Bot Detectors Check

Signal · What Detectors Check · Countermeasure
navigator.webdriver · === true · Override to undefined before load
Chrome automation extension · window.chrome.app missing · Spoof window.chrome object
Permissions API · Returns denied for notifications · Patch API return value
Plugin array · Empty navigator.plugins · Inject fake plugin list (≥3 entries)
WebGL renderer · ANGLE headless string · Override via canvas hook
Canvas fingerprint · Deterministic pixel output · Add subtle random noise
User-Agent mismatch · UA says Windows, platform says Linux · Sync all UA fields
Mouse trajectory · Teleporting cursor, no mouseover · Bezier curve movement
IP reputation · Datacenter IP range · Residential proxy rotation
TLS fingerprint (JA3) · Node.js TLS differs from real Chrome · Use real Chromium binary

Baseline Chromium Flags — Free Stealth

python · stealth launch args — always include these
STEALTH_ARGS = [
    "--no-sandbox",
    "--disable-blink-features=AutomationControlled",    # removes webdriver flag
    "--disable-infobars",
    "--disable-dev-shm-usage",
    "--disable-browser-side-navigation",
    "--disable-features=IsolateOrigins,site-per-process",
    "--flag-switches-begin",
    "--disable-site-isolation-trials",
    "--flag-switches-end",
]

browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)

# Always set a matching real user-agent
context = await browser.new_context(
    user_agent=(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
)

playwright-stealth (Python)

playwright-stealth patches the most common fingerprint leaks automatically. Apply it to the page before any navigation.

python · playwright-stealth — full setup
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)
    page    = await browser.new_page()

    await stealth_async(page)   # ← MUST be called before goto()

    await page.goto("https://bot.sannysoft.com")   # detection test page
    await page.screenshot(path="stealth_test.png")

What playwright-stealth patches automatically

navigator.webdriver → undefined · window.chrome → realistic object · navigator.plugins → fake 3-entry list · navigator.languages → ["en-US", "en"] · WebGL vendor/renderer strings · Permissions API · window.outerWidth/Height · screen properties

Manual Fingerprint Patches via addInitScript

When libraries are unavailable, patch manually. addInitScript runs before any page JavaScript — the only reliable injection point.

python · fingerprint patches — injected before page JS
FINGERPRINT_PATCHES = """
// 1. Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined, configurable: true,
});

// 2. Fake chrome object
window.chrome = {
    app: { isInstalled: false, InstallState: {}, RunningState: {} },
    runtime: {}, loadTimes: function() {}, csi: function() {},
};

// 3. Fake plugins (need at least 3)
Object.defineProperty(navigator, 'plugins', {
    get: () => {
        const plugins = [
            { name: 'Chrome PDF Plugin',  filename: 'internal-pdf-viewer' },
            { name: 'Chrome PDF Viewer',  filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
            { name: 'Native Client',      filename: 'internal-nacl-plugin' },
        ];
        plugins.__proto__ = PluginArray.prototype;
        return plugins;
    },
});

// 4. Fix permissions API (preserve the original method's `this` binding)
const _origQuery = window.navigator.permissions.query.bind(window.navigator.permissions);
window.navigator.permissions.query = (p) => (
    p.name === 'notifications'
        ? Promise.resolve({ state: Notification.permission })
        : _origQuery(p)
);

// 5. Canvas noise — subtle per-session randomisation
const _origGetContext = HTMLCanvasElement.prototype.getContext;
HTMLCanvasElement.prototype.getContext = function(type, ...args) {
    const ctx = _origGetContext.apply(this, [type, ...args]);
    if (type === '2d' && ctx) {
        const _origGetImageData = ctx.getImageData;
        ctx.getImageData = function(...a) {
            const img = _origGetImageData.apply(this, a);
            for (let i = 0; i < img.data.length; i += 100) {
                img.data[i] = img.data[i] ^ (Math.random() * 3 | 0);
            }
            return img;
        };
    }
    return ctx;
};
"""

await context.add_init_script(FINGERPRINT_PATCHES)  # apply to all pages in context

Human Simulation — Mouse, Keyboard & Scroll

python · bezier mouse movement + human typing + realistic scroll
import math, random, asyncio

# Natural mouse movement along a cubic Bezier curve
async def bezier_move(page, x1, y1, x2, y2, steps=30):
    cx1 = x1 + random.randint(20, 80);  cy1 = y1 + random.randint(-40, 40)
    cx2 = x2 - random.randint(20, 80);  cy2 = y2 + random.randint(-40, 40)
    for i in range(steps + 1):
        t = i / steps
        x = (1-t)**3*x1 + 3*(1-t)**2*t*cx1 + 3*(1-t)*t**2*cx2 + t**3*x2
        y = (1-t)**3*y1 + 3*(1-t)**2*t*cy1 + 3*(1-t)*t**2*cy2 + t**3*y2
        await page.mouse.move(x, y)
        await asyncio.sleep(random.uniform(0.005, 0.015))

# Human-like typing with per-keystroke delays
async def human_type(page, selector: str, text: str):
    await page.click(selector)
    await asyncio.sleep(random.uniform(0.3, 0.7))
    for char in text:
        await page.keyboard.type(char)
        await asyncio.sleep(random.uniform(0.05, 0.18))
    await asyncio.sleep(random.uniform(0.2, 0.5))

# Realistic scroll with jitter
async def scroll_down(page, total_px=2000, steps=15):
    per_step = total_px // steps
    for _ in range(steps):
        await page.mouse.wheel(0, per_step + random.randint(-30, 30))
        await asyncio.sleep(random.uniform(0.05, 0.2))

# Random wait helper — use between every action
async def jitter(min_ms=500, max_ms=2000):
    await asyncio.sleep(random.uniform(min_ms, max_ms) / 1000)

# Usage
await bezier_move(page, 100, 200, 400, 350)
await page.mouse.click(400, 350)
await jitter(300, 800)
await human_type(page, "#search", "playwright scraping")

Proxy Rotation

python · per-request proxy rotation
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "socks5://user:pass@proxy3.example.com:1080",
]

async def scrape_with_rotation(urls: list[str]):
    async with async_playwright() as p:
        # Launch once; the placeholder launch proxy lets each context
        # override it with its own proxy (required by Chromium)
        browser = await p.chromium.launch(
            proxy={"server": "http://per-context"},
            args=STEALTH_ARGS,
        )
        results = []
        for i, url in enumerate(urls):
            context = await browser.new_context(
                proxy={"server": PROXIES[i % len(PROXIES)]},
                user_agent="Mozilla/5.0 ...",
            )
            page = await context.new_page()
            await page.goto(url)
            results.append(await page.content())
            await context.close()
        await browser.close()
        return results

# Residential proxy (e.g. Bright Data)
PROXY_CONFIG = {
    "server":   "http://brd.superproxy.io:22225",
    "username": "YOUR_ZONE_USERNAME",
    "password": "YOUR_PASSWORD",
}
context = await browser.new_context(proxy=PROXY_CONFIG)
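Dead proxies waste a full browser launch, so it pays to health-check the pool first. A sketch, using httpbin.org/ip as an assumed IP-echo service (any echo endpoint works; note that httpx's keyword is `proxy=` in recent versions, `proxies=` in older ones):

```python
async def proxy_egress_ip(proxy_url: str, timeout: float = 10.0):
    # Returns the proxy's public IP, or None if it is dead or too slow
    import httpx   # this guide already uses httpx for FlareSolverr calls
    try:
        async with httpx.AsyncClient(proxy=proxy_url, timeout=timeout) as client:
            r = await client.get("https://httpbin.org/ip")
            return r.json()["origin"]
    except (httpx.HTTPError, KeyError):
        return None

def filter_alive(proxies: list[str], ips: list) -> list[str]:
    # Keep only proxies whose health check returned an IP
    return [p for p, ip in zip(proxies, ips) if ip is not None]

# ips   = await asyncio.gather(*(proxy_egress_ip(p) for p in PROXIES))
# alive = filter_alive(PROXIES, ips)
```

Run the checks concurrently with asyncio.gather so a pool of dozens of proxies is validated in one timeout window rather than sequentially.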

Captcha Solving

OPTION A — 2captcha
reCAPTCHA v2/v3 solving service
pip install 2captcha-python. Sends the sitekey to the 2captcha API, which uses human workers to solve it and returns a token. Good for reCAPTCHA v2/v3. Costs ~$1–3 per 1000 solves.
OPTION B — capsolver (recommended)
Supports Cloudflare Turnstile
pip install capsolver. Supports reCAPTCHA, hCaptcha, and Cloudflare Turnstile via AntiTurnstileTaskProxyLess. Faster than 2captcha for Turnstile challenges.
OPTION C — FlareSolverr (self-hosted)
Free Cloudflare bypass via Docker
Run docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest and send requests to its HTTP API. Returns cookies and HTML after solving the JS challenge. No per-solve cost, but requires self-hosting.
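For Option A, a sketch of solving reCAPTCHA v2 with 2captcha-python and injecting the token; the sitekey, API key, and the assumption that the page exposes the standard g-recaptcha-response textarea are all things to verify against the target page:

```python
# JS to drop the solved token into the standard reCAPTCHA response field
INJECT_TOKEN_JS = """(token) => {
    const el = document.getElementById('g-recaptcha-response');
    if (el) { el.style.display = 'block'; el.value = token; }
}"""

def solve_recaptcha_v2(api_key: str, sitekey: str, page_url: str) -> str:
    # Blocking call (can take 15-60s); wrap in asyncio.to_thread() in async code
    from twocaptcha import TwoCaptcha   # pip install 2captcha-python
    solver = TwoCaptcha(api_key)
    result = solver.recaptcha(sitekey=sitekey, url=page_url)
    return result["code"]               # the g-recaptcha-response token

# token = await asyncio.to_thread(solve_recaptcha_v2, "API_KEY", "SITEKEY", page.url)
# await page.evaluate(INJECT_TOKEN_JS, token)
```

After injection the site's own submit handler usually needs to be triggered (a form submit or callback call); that step is site-specific.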
python · FlareSolverr — self-hosted Cloudflare bypass
import httpx

async def flaresolverr_get(url: str) -> str:
    async with httpx.AsyncClient() as client:
        r = await client.post("http://localhost:8191/v1", json={
            "cmd":        "request.get",
            "url":        url,
            "maxTimeout": 60000,
        })
        return r.json()["solution"]["response"]

html = await flaresolverr_get("https://cloudflare-protected-site.com")

Cloudflare / DataDome / Imperva Strategies

Cloudflare Bot Management — Key Rules

1. Use real Chromium with all stealth patches applied.
2. Residential proxy — datacenter IPs are pre-blocked.
3. Add a realistic delay (3–8 s) before any interaction after landing.
4. Do NOT abort CSS or font resources — Cloudflare uses resource loading as a bot signal.
5. Set Sec-Fetch-* and Sec-CH-UA headers consistently.

python · Cloudflare-compatible context headers
context = await browser.new_context(
    extra_http_headers={
        "Accept-Language":    "en-US,en;q=0.9",
        "Sec-Ch-Ua":          '"Chromium";v="124", "Google Chrome";v="124"',
        "Sec-Ch-Ua-Mobile":   "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
    }
)
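Even with correct headers, the JS challenge interstitial can appear. A heuristic wait loop helps; the title markers below are assumptions based on common challenge pages, not a guaranteed list:

```python
import asyncio
import time

# Assumed title fragments of Cloudflare interstitials (verify against the target)
CHALLENGE_MARKERS = ("just a moment", "checking your browser", "verify you are human")

def looks_like_challenge(title: str) -> bool:
    t = title.lower()
    return any(m in t for m in CHALLENGE_MARKERS)

async def wait_past_challenge(page, max_wait_s: float = 30.0) -> bool:
    # Poll the page title until the challenge clears or we give up
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if not looks_like_challenge(await page.title()):
            return True
        await asyncio.sleep(1.0)
    return False
```

Call it right after goto(); if it returns False, rotate the proxy and retry rather than hammering the same IP.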

Combining All Three Pillars

When a task requires stealth + interception + CDP simultaneously, apply in this exact order: stealth patches first, then route handlers, then CDP session, then navigate.

python · full-stack example — stealth + network + CDP
async def full_stack_scrape(url: str) -> list:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        )
        page = await context.new_page()

        # 1. Stealth patches — BEFORE anything else
        await stealth_async(page)

        # 2. Network interception — registered before goto()
        captured = []
        async def capture_api(route, request):
            if "api/v2/products" in request.url:
                response = await route.fetch()
                captured.append(await response.json())
                await route.fulfill(response=response)
            else:
                await route.continue_()
        await page.route("**/*", capture_api)

        # 3. CDP session — after context creation
        client = await context.new_cdp_session(page)
        await client.send("Network.enable")

        # 4. Navigate — always last
        await jitter(500, 1200)                      # pre-load delay
        await page.goto(url, wait_until="networkidle")
        await jitter(2000, 4000)                     # post-load human pause

        # Scroll to trigger lazy-loaded API calls
        await scroll_down(page, total_px=3000)
        await page.wait_for_load_state("networkidle")

        await browser.close()
        return captured

Detection Testing Checklist

Always test your stealth setup against these pages before deploying. A green result on all four means your browser looks convincingly human.

python · automated stealth verification
TEST_URLS = [
    "https://bot.sannysoft.com",                     # webdriver, plugins, chrome object
    "https://fingerprintjs.com/demo/",               # fingerprint hash
    "https://deviceandbrowserinfo.com/",             # detailed browser signals
    "https://abrahamjuliot.github.io/creepjs/",      # overall CreepJS trust score
]

for url in TEST_URLS:
    await page.goto(url)
    await page.screenshot(path=f"test_{url.split('/')[2]}.png")

# Inline JS assertions — run after stealth applied
checks = await page.evaluate("""() => ({
    webdriver: navigator.webdriver,
    hasChrome:  !!window.chrome,
    plugins:    navigator.plugins.length,
    ua:         navigator.userAgent,
    platform:   navigator.platform,
    langs:      navigator.languages,
})""")

assert checks["webdriver"] is None or checks["webdriver"] is False
assert checks["hasChrome"]  is True
assert checks["plugins"]    >= 3
assert "HeadlessChrome" not in checks["ua"]
print("✓ All stealth checks passed")

Common Pitfalls

Pitfall · Fix
route.continue_() not awaited · Always await route.continue_() — unawaited routes hang forever
CDP session on wrong target · Use page.context.new_cdp_session(page), not browser.new_cdp_session()
navigator.webdriver still true · Stealth must be applied via addInitScript BEFORE goto()
Stealth plugin applied to page, not context · Apply stealth to the context for full coverage across all pages
HAR recording misses service workers · Set record_har_url_filter broadly; SW traffic needs CDP interception
Cloudflare JS challenge timeout · Increase the goto() timeout used with wait_until="networkidle"; add a random pre-challenge delay
Blocking CSS/fonts on Cloudflare-protected sites · Do NOT block stylesheet or font resources — CF uses them as a bot signal
Datacenter IP blocked on first request · Residential proxy is non-negotiable for sites with strict IP reputation checks

Anti-Patterns to Avoid

  • Navigating before applying stealth patches — bot detectors see the real fingerprint on first load
  • Calling route.continue_() without await — the request hangs indefinitely
  • Using Firefox or WebKit for CDP-heavy scrapers — CDP is Chromium-only
  • Blocking CSS and fonts on Cloudflare-protected sites — resource loading patterns are a detection signal
  • Using datacenter IPs against sites with IP reputation systems — switch to residential proxies
  • Teleporting the mouse directly to elements — always move along a bezier curve first
  • Opening a CDP session on browser instead of page.context
  • Forgetting await context.close() before reading HAR — the file is incomplete until context closes
  • Hard-coding time.sleep() instead of jitter() — fixed delays are trivially fingerprinted
AI Skill File

Download playwright-scraping-advanced Skill

This .skill file contains four complete reference documents covering every aspect of advanced Playwright scraping — network interception, CDP commands, stealth fingerprint patches, and anti-bot strategies — ready to load into Claude or any AI tool as expert context for your scraping questions.

Network interception guide
Full CDP reference
Stealth patch scripts
Cloudflare bypass strategies
Human simulation helpers
Proxy rotation patterns

Hosted by ZynU Host · host.zynu.net