Scrape Safety Runbook

Operational handling for scrape safety cooldowns, threshold alerts, and blocked-IP events.

Key Signals

Use structured log event fields (recommended with LOG_FORMAT=json):

| Event | Description |
| --- | --- |
| ingestion.safety_policy_blocked_run_start | Run blocked by safety policy |
| ingestion.safety_cooldown_entered | Cooldown activated |
| ingestion.safety_cooldown_cleared | Cooldown expired |
| ingestion.alert_blocked_failure_threshold_exceeded | Blocked failure threshold tripped |
| ingestion.alert_network_failure_threshold_exceeded | Network failure threshold tripped |
| ingestion.alert_retry_scheduled_threshold_exceeded | Retry threshold tripped |
| api.runs.manual_blocked_policy | Manual run blocked by policy |
| api.runs.manual_blocked_safety | Manual run blocked by safety cooldown |
| scheduler.run_skipped_safety_cooldown | Scheduled run skipped due to cooldown |
| scheduler.run_skipped_safety_cooldown_precheck | Scheduler precheck blocked |
| scheduler.queue_item_deferred_safety_cooldown | Queue item deferred due to cooldown |

Each event includes metric-style fields (metric_name, metric_value) that log-based alert rules can match on. Suggested rules:

  • Cooldown enters: Trigger on event=ingestion.safety_cooldown_entered
  • Repeated start blocks: High rate of event=api.runs.manual_blocked_safety
  • Threshold trips: Any of the *_threshold_exceeded events
  • Scheduler pressure: Sustained scheduler.queue_item_deferred_safety_cooldown
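The rules above can be sketched as a small filter over a window of JSON log lines. This is a minimal illustration, not the project's alerting implementation: it assumes each log line is a JSON object whose event name sits under an `event` key, and the function name `evaluate_alerts` and both rate limits (`manual_block_limit`, `deferred_limit`) are hypothetical values chosen for the example.

```python
import json
from collections import Counter

# Event names taken from the Key Signals table.
COOLDOWN_ENTERED = "ingestion.safety_cooldown_entered"
MANUAL_BLOCKED = "api.runs.manual_blocked_safety"
QUEUE_DEFERRED = "scheduler.queue_item_deferred_safety_cooldown"
THRESHOLD_SUFFIX = "_threshold_exceeded"


def evaluate_alerts(log_lines, manual_block_limit=5, deferred_limit=10):
    """Return the names of the alert rules that fire for a window of log lines.

    Assumptions: the event name is stored under the "event" key, and the
    two limits are illustrative, not documented defaults.
    """
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(record, dict):
            continue
        event = record.get("event")
        if isinstance(event, str):
            counts[event] += 1

    fired = []
    if counts[COOLDOWN_ENTERED] > 0:
        fired.append("cooldown_entered")
    if counts[MANUAL_BLOCKED] >= manual_block_limit:
        fired.append("repeated_start_blocks")
    if any(name.endswith(THRESHOLD_SUFFIX) for name in counts):
        fired.append("threshold_trip")
    if counts[QUEUE_DEFERRED] >= deferred_limit:
        fired.append("scheduler_pressure")
    return fired
```

In practice the same conditions would more likely live in the log platform's own alert rules; the sketch only shows which events each rule keys on.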

If Your IP Appears Blocked

Symptoms

  • Cooldown reason: blocked_failure_threshold_exceeded
  • Parse state: blocked_or_captcha
  • Redirects toward Google account sign-in flows

Actions

  1. Stop manual retries immediately.
  2. Let the cooldown expire; do not retrigger runs repeatedly.
  3. Increase INGESTION_MIN_REQUEST_DELAY_SECONDS and user request delay values.
  4. Reduce concurrency pressure (keep one scheduler instance).
  5. Keep name-search disabled if login-gated responses persist.
  6. Resume with a small monitored run and verify blocked rate drops.
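To judge when it is reasonable to resume, the remaining cooldown can be estimated from the time the cooldown was entered. The sketch below assumes that timestamp is read off the ingestion.safety_cooldown_entered log event and that the documented default of 1800 seconds for INGESTION_SAFETY_COOLDOWN_BLOCKED_SECONDS is in effect; the helper name `cooldown_remaining` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Documented default for INGESTION_SAFETY_COOLDOWN_BLOCKED_SECONDS.
BLOCKED_COOLDOWN_SECONDS = 1800


def cooldown_remaining(entered_at, now, cooldown_seconds=BLOCKED_COOLDOWN_SECONDS):
    """Return how much longer to wait; zero once the cooldown has expired."""
    expires_at = entered_at + timedelta(seconds=cooldown_seconds)
    return max(expires_at - now, timedelta(0))


if __name__ == "__main__":
    # Hypothetical timestamp copied from a cooldown-entered log event.
    entered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(cooldown_remaining(entered, datetime.now(timezone.utc)))
```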

Avoid

  • Aggressive rapid retries
  • Rotating through risky scraping patterns that increase challenge rates
  • Bypass/CAPTCHA-solving workflows that may violate source platform rules

Environment Controls

Policy floors and safety controls:

| Variable | Default | Description |
| --- | --- | --- |
| INGESTION_MIN_REQUEST_DELAY_SECONDS | 2 | Floor delay between requests |
| INGESTION_MIN_RUN_INTERVAL_MINUTES | 15 | Minimum time between runs |
| INGESTION_ALERT_BLOCKED_FAILURE_THRESHOLD | 1 | Blocked failures before alert |
| INGESTION_ALERT_NETWORK_FAILURE_THRESHOLD | 2 | Network failures before alert |
| INGESTION_ALERT_RETRY_SCHEDULED_THRESHOLD | 3 | Retries before alert |
| INGESTION_SAFETY_COOLDOWN_BLOCKED_SECONDS | 1800 | Cooldown after blocked threshold (30 min) |
| INGESTION_SAFETY_COOLDOWN_NETWORK_SECONDS | 900 | Cooldown after network threshold (15 min) |
| INGESTION_MANUAL_RUN_ALLOWED | 1 | Enable manual runs |
| INGESTION_AUTOMATION_ALLOWED | 1 | Enable automated runs |

Apply stricter values first, then relax slowly only after sustained healthy runs.
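As a quick way to check which values a process would see, the sketch below reads these variables and falls back to the documented defaults. It is an illustration only: how the application itself parses them is not shown here, so the integer parsing (including treating the two `*_ALLOWED` flags as 0/1) and the helper name `load_safety_settings` are assumptions.

```python
import os

# Defaults copied from the Environment Controls table.
DEFAULTS = {
    "INGESTION_MIN_REQUEST_DELAY_SECONDS": 2,
    "INGESTION_MIN_RUN_INTERVAL_MINUTES": 15,
    "INGESTION_ALERT_BLOCKED_FAILURE_THRESHOLD": 1,
    "INGESTION_ALERT_NETWORK_FAILURE_THRESHOLD": 2,
    "INGESTION_ALERT_RETRY_SCHEDULED_THRESHOLD": 3,
    "INGESTION_SAFETY_COOLDOWN_BLOCKED_SECONDS": 1800,
    "INGESTION_SAFETY_COOLDOWN_NETWORK_SECONDS": 900,
    "INGESTION_MANUAL_RUN_ALLOWED": 1,
    "INGESTION_AUTOMATION_ALLOWED": 1,
}


def load_safety_settings(env=None):
    """Return effective settings, using the documented default when unset.

    Assumption: every value is a plain integer string.
    """
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, "")
        settings[name] = int(raw) if raw.strip() else default
    return settings


if __name__ == "__main__":
    for name, value in load_safety_settings().items():
        marker = "" if value == DEFAULTS[name] else "  (overridden)"
        print(f"{name}={value}{marker}")
```

Running it before and after a change makes it easy to confirm that a stricter value, such as a longer request delay, is actually in effect.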