Scraper Engine

The Scraper is the entry point of the pipeline. Traditional scrapers rely on fragile HTML selectors (CSS/XPath) that break whenever a site changes its theme; WP AutoFlow reads structured data instead.

The WordPress API Strategy

WP AutoFlow is built to consume the WordPress REST API.

Most WordPress sites expose their content via standard JSON endpoints (/wp-json/wp/v2/posts). Our engine automatically connects to these endpoints to fetch:

  • Clean Content: Fetches the post body as HTML, without sidebars, ads, or popups.
  • High-Res Images: Fetches the original media URL directly from the metadata.
  • Categories & Tags: Preserves the original classification context.
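
The three items above can be sketched in a few lines of Python. This is a minimal illustration, not WP AutoFlow's actual code: the function names (posts_url, fetch_posts, normalize_post) and the User-Agent string are assumptions made for the example. It relies on the standard `_embed` query parameter, which asks WordPress to inline the featured media and taxonomy terms in the same response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def posts_url(site: str, per_page: int = 10) -> str:
    """Build the posts endpoint URL; _embed inlines featured media and terms."""
    query = urlencode({"per_page": per_page, "_embed": 1})
    return f"{site.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def fetch_posts(site: str, per_page: int = 10) -> list:
    """Download the latest posts as a list of dicts (performs a network request)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    req = Request(posts_url(site, per_page), headers={"User-Agent": "wp-autoflow-sketch"})
    with urlopen(req, timeout=15) as resp:
        return json.load(resp)


def normalize_post(post: dict) -> dict:
    """Reduce one REST API post object to the fields listed above."""
    embedded = post.get("_embedded", {})
    media = embedded.get("wp:featuredmedia") or []
    # source_url points at the originally uploaded file, not a resized thumbnail.
    image = media[0].get("source_url") if media else None
    # wp:term is a list of lists, one inner list per taxonomy.
    terms = [term for group in embedded.get("wp:term", []) for term in group]
    return {
        "id": post["id"],
        "title": post["title"]["rendered"],
        "html": post["content"]["rendered"],
        "image_url": image,
        "categories": [t["name"] for t in terms if t.get("taxonomy") == "category"],
        "tags": [t["name"] for t in terms if t.get("taxonomy") == "post_tag"],
    }
```

Calling fetch_posts("https://example.com") and mapping normalize_post over the result would yield one flat record per post. Posts without a featured image simply get image_url set to None.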

How it works

  1. Add a Source: You simply input the homepage URL (e.g., https://techcrunch.com).
  2. Auto-Discovery: The system scans the URL to find the API endpoint.
  3. Ingestion: It fetches the latest posts and checks against your database to avoid duplicates.
  4. Queueing: New posts are sent to the Redis Queue for processing.
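
Steps 2 to 4 can be sketched as follows. This is a hedged illustration under stated assumptions, not the engine's real implementation. WordPress advertises its API root in a Link response header with the relation https://api.w.org/, so discovery can read that header and fall back to the conventional /wp-json/ path. The names api_root_from_link_header, discover_api_root, and ingest are hypothetical; the seen set stands in for the database lookup, and the enqueue callable stands in for a push onto the Redis queue.

```python
import json
import re
from typing import Callable, Iterable, Optional, Set, Tuple

API_REL = "https://api.w.org/"  # link relation WordPress uses for its REST API root


def api_root_from_link_header(link_header: str) -> Optional[str]:
    """Return the URL tagged rel="https://api.w.org/" in a Link header, if any."""
    # Splitting on commas is a simplification; it assumes URLs without commas.
    for part in link_header.split(","):
        match = re.match(r"\s*<([^>]+)>\s*;(.*)", part)
        if match and f'rel="{API_REL}"' in match.group(2):
            return match.group(1)
    return None


def discover_api_root(site: str, link_header: Optional[str]) -> str:
    """Prefer the advertised API root; otherwise assume the default /wp-json/ path."""
    found = api_root_from_link_header(link_header or "")
    return found or f"{site.rstrip('/')}/wp-json/"


def ingest(
    source: str,
    posts: Iterable[dict],
    seen: Set[Tuple[str, int]],
    enqueue: Callable[[str], object],
) -> int:
    """Queue posts not seen before and return how many were queued."""
    queued = 0
    for post in posts:
        key = (source, post["id"])  # post IDs are only unique within one site
        if key in seen:
            continue
        seen.add(key)
        enqueue(json.dumps(post))  # serialized payload for the worker
        queued += 1
    return queued
```

With the redis-py client, enqueue could be something like `lambda payload: client.rpush("autoflow:posts", payload)`, where the key name is made up for this example. Keying duplicates on the (source, id) pair avoids collisions between different sites that happen to reuse the same numeric post ID.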

Why is this better?

Method        | Stability              | Content Quality         | Image Quality
HTML Scraping | ❌ Low (Breaks easily) | ⚠️ Messy (Includes ads) | ⚠️ Low Res
WP REST API   | High (Standardized)    | Clean (JSON)            | Original

Proxy Support

Even though we use the API, some high-traffic sites may rate-limit your server's IP address. WP AutoFlow includes native proxy support so that requests can go out through different IP addresses.

  • Direct: Uses your server's IP (fastest; use only with open, unrestricted APIs).
  • Tunnel Proxy: Supports standard HTTP/HTTPS proxy strings.
  • API Mode: Integrates with services like Scrape.do for enterprise-grade rotation.
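
To make the three modes concrete, here is a minimal sketch using only Python's standard library. It is an assumption about how such a switch could look, not WP AutoFlow's configuration format. The mode names ("direct", "tunnel", "api"), the function names, and the token/url query parameters are all hypothetical; the real parameters of a service such as Scrape.do should be taken from that service's own documentation.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import OpenerDirector, ProxyHandler, build_opener

MODES = ("direct", "tunnel", "api")  # hypothetical mode names for this sketch


def opener_for(mode: str, proxy_url: Optional[str] = None) -> OpenerDirector:
    """Return a urllib opener configured for the chosen mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "tunnel":
        if not proxy_url:
            raise ValueError("tunnel mode needs a proxy URL")
        # Route both plain and TLS traffic through the same proxy string.
        return build_opener(ProxyHandler({"http": proxy_url, "https": proxy_url}))
    # direct and api: an empty mapping disables proxies picked up from the environment.
    return build_opener(ProxyHandler({}))


def request_url(
    mode: str,
    target: str,
    api_endpoint: Optional[str] = None,
    token: Optional[str] = None,
) -> str:
    """Return the URL to request: the target itself, or a wrapped call in API mode."""
    if mode != "api":
        return target
    if not api_endpoint or not token:
        raise ValueError("api mode needs an endpoint and a token")
    # The target URL is passed percent-encoded as a query parameter
    # (parameter names are assumptions, not a documented contract).
    return f"{api_endpoint}?{urlencode({'token': token, 'url': target})}"
```

In tunnel mode the proxy string would look like http://user:pass@proxy.example:8080. In API mode the request goes directly to the rotation service, which fetches the WordPress endpoint on your behalf, so no local proxy is configured.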