PitchHut logo
wxpath
Declarative web crawling with XPath made simple.
Pitch

wxpath is a unique tool that allows users to perform web crawling by expressing traversal in pure XPath. It enables efficient data extraction without the complexity of traditional imperative coding methods, providing a streamlined process that executes queries concurrently while streaming results in real-time.

Description

wxpath is an innovative declarative web crawler designed to simplify web data extraction using XPath expressions. Unlike traditional crawling methods that require the writing of imperative loops, wxpath allows users to define the crawling and extraction process in a single intuitive expression. This approach facilitates efficient concurrent execution and outputs results as they are discovered, optimizing the data gathering experience.

Key Features

  • Declarative Crawling: Define crawl paths and extraction logic using XPath for readability and maintainability.

  • Concurrent Execution: Utilize asynchronous capabilities for fast, breadth-first crawling, minimizing wait times while navigating complex web structures.

  • Deep Web Traversal: Take advantage of the custom url(...) operator and the /// syntax to perform deep and paginated crawling effortlessly.

  • Output Flexibility: Results are structured and can include various data types such as dictionaries, lists, and even custom lxml objects, enabling users to handle the output more effectively.

Example Usage

Here's a simple example demonstrating wxpath in action:

import wxpath

# Define the crawl expression
path_expr = """
url('https://en.wikipedia.org/wiki/Expression_language')
 ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
 /map{
    'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
    'url': string(base-uri(.)),
    'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
    'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
 }
"""

# Iteratively crawl and print results
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)

The example above starts at the Wikipedia page for Expression Language, follows all internal links, and extracts key details including titles and URLs of linked pages, streaming the results in real-time.

Language and API Design

Built to support XPath 3.1, wxpath unlocks advanced features, allowing users to write complex queries with maps and arrays seamlessly. The API is structured to support both blocking and asynchronous operations, catering to various programming environments.

Polite Crawling

By default, wxpath respects the site's robots.txt rules, promoting ethical crawling practices while ensuring compliance with web standards.

Community and Support

As a project in early development, feedback is welcomed to enhance the features and API. Engagement with the community is encouraged for sharing issues, feature requests, or suggestions for improvement. Support and commercial consulting services are available for those needing more customized solutions.

For detailed documentation and more examples, refer to the official documentation.

For any inquiries regarding commercial support or contributions, contact through the provided email.

Explore the potential of web crawling with wxpath and simplify your data extraction processes!

0 comments

No comments yet.

Sign in to be the first to comment.