This guide introduces advanced web scraping techniques using Crawl for AI, a library designed to extract structured information from websites efficiently. You’ll learn to extract model pricing details from an Anthropic pricing page using asynchronous crawling and browser automation via Playwright. The tutorial assumes experience with async programming, data extraction, and structured data manipulation.
First, install the necessary packages. Crawl for AI requires browser automation through Playwright, so Playwright and its dependencies must be installed.
pip install crawl-for-ai
pip install playwright
playwright install
Crawl for AI’s AI-powered features require an OpenAI API key. Define it as an environment variable so the library can find it:
export OPENAI_API_KEY='your-openai-api-key'
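As a quick sanity check before crawling, you can confirm the key is visible to the Python process. This is a minimal sketch using the standard-library os module:

import os

# Fail fast if the key was not exported in this shell session
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the crawler.")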
This tutorial uses async programming throughout so that pages can be fetched with efficient concurrency. Import the primary tools:
import asyncio

from crawl_for_ai import AsyncCrawl, SyncCrawl  # though async is recommended
from playwright.async_api import async_playwright
Next, define the data schema for extraction: each record captures a model name together with its input and output fees.
class ModelPricing:
    def __init__(self, model_name, input_fee, output_fee):
        self.model_name = model_name
        self.input_fee = input_fee
        self.output_fee = output_fee

    def __repr__(self):
        return (
            f"ModelPricing(model_name={self.model_name}, "
            f"input_fee={self.input_fee}, output_fee={self.output_fee})"
        )
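Equivalently, and with less boilerplate, you can express the same schema as a dataclass, which generates __init__ and __repr__ for you. This is an optional alternative, not required by the rest of the tutorial:

from dataclasses import dataclass

@dataclass
class ModelPricing:
    model_name: str
    input_fee: float   # USD, e.g. per million input tokens
    output_fee: float  # USD, e.g. per million output tokens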
Create an asynchronous function that fetches the page, scrapes the pricing elements, and maps them onto the schema defined above.
async def fetch_and_extract_pricing(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        # Extract each pricing card or row from the page
        elements = await page.query_selector_all("div.pricing-info")
        pricing_data = []
        for elem in elements:
            # query_selector returns a coroutine, so await it before reading text;
            # Playwright exposes element text via text_content(), not get_text_content()
            name_el = await elem.query_selector("h2.model-name")
            input_el = await elem.query_selector("span.input-fee")
            output_el = await elem.query_selector("span.output-fee")
            model_name = (await name_el.text_content()).strip()
            input_fee = (await input_el.text_content()).strip()
            output_fee = (await output_el.text_content()).strip()
            pricing_data.append(ModelPricing(
                model_name,
                float(input_fee.replace("$", "")),
                float(output_fee.replace("$", "")),
            ))
        await browser.close()
        return pricing_data

# Run the crawler
async def main():
    url = "https://example.com/anthropic-pricing"
    pricing_info = await fetch_and_extract_pricing(url)
    for data in pricing_info:
        print(data)

# Example execution
if __name__ == "__main__":
    asyncio.run(main())
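Because fetch_and_extract_pricing is a coroutine, you can also crawl several pages concurrently with asyncio.gather. Below is a minimal sketch; the second URL is a placeholder, and note that each call launches its own browser instance:

async def crawl_many(urls):
    # Schedule all fetches at once; gather returns results in input order
    results = await asyncio.gather(*(fetch_and_extract_pricing(u) for u in urls))
    return dict(zip(urls, results))

# Example: crawl two (hypothetical) pricing pages in parallel
# asyncio.run(crawl_many([
#     "https://example.com/anthropic-pricing",
#     "https://example.com/other-provider-pricing",
# ]))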
Avoid repeatedly fetching the same page by adding a caching layer. You can integrate a library like aiocache or build a custom cache (a hand-rolled sketch follows the example below). Here’s the aiocache version, which memoizes results for an hour:
from aiocache import cached

@cached(ttl=3600)
async def cached_fetch_and_extract_pricing(url):
    return await fetch_and_extract_pricing(url)
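If you would rather not add a dependency, the custom-cache route mentioned above can be as simple as a dict keyed by URL with a timestamp. A minimal in-memory sketch (not shared across processes, no eviction beyond the TTL check):

import time

_cache = {}  # url -> (timestamp, pricing_data)

async def custom_cached_fetch(url, ttl=3600):
    # Serve a fresh-enough cached result; otherwise re-crawl and store it
    now = time.monotonic()
    hit = _cache.get(url)
    if hit is not None and now - hit[0] < ttl:
        return hit[1]
    data = await fetch_and_extract_pricing(url)
    _cache[url] = (now, data)
    return data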
Crawl for AI’s flexibility lets you extend the same scraping process to other platforms. Try similar extractions from sources like Google AI Studio by adjusting the selectors and URLs, as in the config sketch below.
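One way to make that reuse concrete is to keep the per-site details in a config mapping instead of hard-coding them. The selectors and URLs below are illustrative placeholders, not the real markup of any site; you would fill them in after inspecting each page’s DOM:

# Per-site extraction config; verify every selector against the live page
SITE_CONFIGS = {
    "anthropic": {
        "url": "https://example.com/anthropic-pricing",
        "row": "div.pricing-info",
        "name": "h2.model-name",
        "input_fee": "span.input-fee",
        "output_fee": "span.output-fee",
    },
    "google-ai-studio": {  # hypothetical second target
        "url": "https://example.com/google-ai-studio-pricing",
        "row": "tr.price-row",
        "name": "td.model",
        "input_fee": "td.input",
        "output_fee": "td.output",
    },
}

With this in place, fetch_and_extract_pricing can take a config entry instead of a bare URL and look up its selectors there, leaving the crawling logic untouched.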
With these elements in place, you can handle the common challenges of web scraping: managing latency, avoiding redundant fetches, and maintaining data accuracy. You now have the tools to extract structured data efficiently for in-depth analysis.