Scraper Engine
The Scraper is the entry point of the pipeline. Traditional scrapers rely on fragile HTML selectors (CSS/XPath) that break whenever a site changes its theme. WP AutoFlow relies on structured data instead.
The WordPress API Strategy
WP AutoFlow is built to consume the WordPress REST API.
Most WordPress sites expose their content via standard JSON endpoints (/wp-json/wp/v2/posts). Our engine automatically connects to these endpoints to fetch:
- Clean Content: Gets the post body as HTML, without sidebars, ads, or popups.
- High-Res Images: Fetches the original media URL directly from the metadata.
- Categories & Tags: Preserves the original classification context.
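As a rough illustration of how these three pieces map onto an API response, the sketch below pulls them out of a single post object. It assumes the post was requested with the `_embed` query parameter (for example `/wp-json/wp/v2/posts?_embed`), so the featured image and terms arrive inline. The helper name `extract_post` and the output keys are hypothetical and are not part of WP AutoFlow.

```python
def extract_post(post):
    """Pick content, featured image, and terms out of one REST API post object.

    Assumes the post was fetched with `?_embed`; without it, `_embedded`
    is absent and the image and term fields come back empty.
    """
    embedded = post.get("_embedded", {})

    # Featured image: `source_url` points at the originally uploaded file.
    media = embedded.get("wp:featuredmedia") or []
    image_url = media[0].get("source_url") if media else None

    # `wp:term` is a list of term groups (one per taxonomy).
    categories, tags = [], []
    for group in embedded.get("wp:term", []):
        for term in group:
            if term.get("taxonomy") == "category":
                categories.append(term["name"])
            elif term.get("taxonomy") == "post_tag":
                tags.append(term["name"])

    return {
        "id": post["id"],
        "url": post["link"],
        "title": post["title"]["rendered"],
        "html": post["content"]["rendered"],  # post body only, no theme markup
        "image_url": image_url,
        "categories": categories,
        "tags": tags,
    }
```

A post without a featured image or embedded terms still parses; `image_url` is then `None` and the two lists are empty.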
How it works
- Add a Source: You simply input the homepage URL (e.g., https://techcrunch.com).
- Auto-Discovery: The system scans the URL to find the API endpoint.
- Ingestion: It fetches the latest posts and checks against your database to avoid duplicates.
- Queueing: New posts are sent to the Redis Queue for processing.
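The discovery and duplicate-check steps above can be sketched as follows. WordPress advertises its API root through a `<link rel="https://api.w.org/">` element in the page head, so one plausible approach is to look for that element and fall back to `/wp-json/` when it is missing. The function names are hypothetical, the in-memory `seen_ids` set merely stands in for the database lookup, and the posts returned by `select_new_posts` are the ones that would go on to the queue. This is an assumption about how such a step could work, not a description of WP AutoFlow's internals.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _ApiLinkParser(HTMLParser):
    """Records the href of the first <link rel="https://api.w.org/"> tag."""

    def __init__(self):
        super().__init__()
        self.api_root = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.api_root is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").split()
        if "https://api.w.org/" in rels and attr.get("href"):
            self.api_root = attr["href"]


def discover_posts_endpoint(homepage_url, html):
    """Return the posts endpoint for a site, given its homepage HTML."""
    parser = _ApiLinkParser()
    parser.feed(html)
    # Fall back to the default pretty-permalink location if no link tag exists.
    root = urljoin(homepage_url, parser.api_root or "/wp-json/")
    # Plain concatenation also works for "?rest_route=/" style roots.
    return root.rstrip("/") + "/wp/v2/posts"


def select_new_posts(posts, seen_ids):
    """Return posts whose IDs are not in seen_ids, and mark them as seen."""
    fresh = []
    for post in posts:
        if post["id"] not in seen_ids:
            seen_ids.add(post["id"])
            fresh.append(post)
    return fresh
```

Post IDs are only unique within one site, so a real duplicate check would key on the source together with the ID, or on the post URL.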
Why is this better?
| Method | Stability | Content Quality | Image Quality |
|---|---|---|---|
| HTML Scraping | ❌ Low (Breaks easily) | ⚠️ Messy (Includes ads) | ⚠️ Low Res |
| WP REST API | ✅ High (Standardized) | ✅ Clean (JSON) | ✅ Original |
Proxy Support
Even though we use the API, some high-traffic sites might rate-limit your server IP. WP AutoFlow includes native proxy support so that requests can go out through other IP addresses.
- Direct: Uses your server's IP (Fastest, strictly for open APIs).
- Tunnel Proxy: Accepts standard HTTP/HTTPS proxy strings.
- API Mode: Integrates with services like Scrape.do for enterprise-grade rotation.
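To make the difference between the three modes concrete, here is a minimal sketch of how a request could be prepared for each. The function `build_request`, the mode names, and the `gateway_url`/`token` parameters are all hypothetical. The query layout used for API mode is a generic placeholder, so check your provider's documentation for the real parameter names. The returned `proxies` mapping follows the shape accepted by the `proxies` argument of the `requests` library.

```python
from urllib.parse import urlencode, urlsplit


def build_request(mode, target_url, proxy_url=None, gateway_url=None, token=None):
    """Return (url_to_fetch, proxies) for the chosen connection mode."""
    if mode == "direct":
        # No proxy: the request leaves from the server's own IP.
        return target_url, {}

    if mode == "tunnel":
        # A standard proxy string such as "http://user:pass@host:8080".
        if not proxy_url or urlsplit(proxy_url).scheme not in ("http", "https"):
            raise ValueError("tunnel mode needs an http(s) proxy string")
        return target_url, {"http": proxy_url, "https": proxy_url}

    if mode == "api":
        # The target URL is handed to a rotation service as a query parameter.
        # Parameter names here are placeholders, not a real provider's API.
        if not gateway_url or not token:
            raise ValueError("api mode needs gateway_url and token")
        query = urlencode({"token": token, "url": target_url})
        return f"{gateway_url}?{query}", {}

    raise ValueError(f"unknown mode: {mode!r}")
```

With `requests` installed, the result can be used as `url, proxies = build_request(...)` followed by `requests.get(url, proxies=proxies, timeout=10)`.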