A spatial query layer optimized for Polars users.
Pitch

PyCanopy introduces a high-performance spatial query layer designed specifically for Polars, leveraging a Rust core for speed and efficiency. It simplifies spatial queries while maintaining Polars-like syntax, offering the advantages of relational databases, such as query planning and indexing, in a seamless dataframe interface.

Description

PyCanopy: A High-Performance Spatial Query Layer for Polars

PyCanopy is a spatial query layer built for the Polars DataFrame engine. It pairs a Rust core with a Python API and optimizes spatial queries while keeping the familiar Polars syntax.

Key Features

  • Polars-Native API: Direct integration with Polars eliminates the need for SQL or data type conversions, simplifying the user experience.
  • Advanced Query Planning: The integrated spatial query planner optimizes operations through techniques like reordering, fusing, and pushdown.
  • Cost-Model-Driven Index Selection: A cost model decides which index to use for each query.
  • Dynamic Index Capability: On-the-fly index selection ensures that the best indexing strategy is employed based on the query structure and data distribution.
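To make the idea of cost-model-driven index selection concrete, here is a minimal plain-Python sketch. It is not PyCanopy's actual cost model: the `Stats` class, the `choose_strategy` function, and the cost constants are all hypothetical and chosen only for illustration. The sketch compares the estimated cost of scanning every row against the cost of building an index once and then testing only candidate rows.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    n_rows: int         # rows in the frame being queried
    selectivity: float  # estimated fraction of rows a query window touches


def choose_strategy(stats: Stats, n_queries: int = 1,
                    build_cost_per_row: float = 2.0) -> str:
    """Return "scan" or "index", whichever has the lower estimated cost."""
    # A scan tests every row for every query.
    scan_total = n_queries * stats.n_rows
    # An index pays a one-off build, then tests only the candidate rows.
    index_total = (build_cost_per_row * stats.n_rows
                   + n_queries * stats.selectivity * stats.n_rows)
    return "index" if index_total < scan_total else "scan"


stats = Stats(n_rows=1_000_000, selectivity=0.01)
print(choose_strategy(stats, n_queries=1))   # one probe: build cost dominates
print(choose_strategy(stats, n_queries=10))  # repeated probes amortize the build
```

Even in this toy model the decision depends on both the query (how selective it is, how often it runs) and the data (how many rows), which is the kind of trade-off the feature list describes.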

Performance Benchmarking

PyCanopy has been benchmarked against established libraries using Apache SpatialBench, where it was the fastest engine on multiple queries:

  • Single Node Spatial Query Benchmark: Compared against SedonaDB, DuckDB, and GeoPandas, and fastest on multiple queries.

Here is a brief overview of the performance on SF1 and SF10 datasets:

Query   PyCanopy   SedonaDB   DuckDB   GeoPandas
q1      1.41       0.66       0.96     12.78
q2      3.94       8.07       9.95     20.74
...     ...        ...        ...      ...
q12     14.00      14.55      ERROR    TIMEOUT

Example Usage

Below is an example of how to utilize PyCanopy for spatial queries:

import polars as pl
from pycanopy import SpatialFrame

# Create a SpatialFrame from a dataset
sf = SpatialFrame(pl.read_parquet("cities.parquet"), x_col="lon", y_col="lat")

# Perform a lazy spatial range query
result = (
    sf.lazy()
    .filter(pl.col("population") > 100_000)
    .range_query(-10.0, 35.0, 40.0, 70.0)
    .collect()
)
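For readers new to range queries, the following plain-Python sketch shows what the example above would compute under one assumption: that the four arguments to `range_query` are `(min_x, min_y, max_x, max_y)` with inclusive bounds. The source does not confirm the argument order, and the rows below are made-up toy data, not real cities.

```python
def in_window(x, y, min_x, min_y, max_x, max_y):
    """True if (x, y) lies inside the axis-aligned window, bounds inclusive."""
    return min_x <= x <= max_x and min_y <= y <= max_y


# Toy rows (hypothetical values, for illustration only).
cities = [
    {"name": "city_a", "lon": 2.0, "lat": 48.0, "population": 2_000_000},
    {"name": "city_b", "lon": 13.0, "lat": 52.0, "population": 50_000},
    {"name": "city_c", "lon": -74.0, "lat": 40.0, "population": 8_000_000},
    {"name": "city_d", "lon": 40.0, "lat": 35.0, "population": 500_000},
]

# Attribute filter plus window test, mirroring the filter + range_query chain.
result = [
    c["name"]
    for c in cities
    if c["population"] > 100_000
    and in_window(c["lon"], c["lat"], -10.0, 35.0, 40.0, 70.0)
]
print(result)
```

Here `city_b` is dropped by the population filter, `city_c` falls outside the window, and `city_d` sits on the window corner and is kept because the bounds are treated as inclusive.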

Comprehensive Querying Capabilities

PyCanopy supports a wide array of complex spatial operations:

  • Query Plan Inspection: Analyze and optimize query execution paths.
  • kNN Join: Efficiently find the nearest neighbors with automatic streaming for large datasets.
  • Proximity Join with Aggregation: Execute joins based on distance and aggregate results without materializing the full pair frame.
  • Polygon Intersects Self-Join: Identify intersecting geometric shapes with precision.
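The proximity join with aggregation listed above can be illustrated with a small plain-Python sketch. It is not PyCanopy's implementation: the `build_grid` and `count_within` helpers are hypothetical, and a uniform grid stands in for whatever index the engine actually selects. The point it shows is that a per-row aggregate (here, a neighbor count) can be accumulated while probing, so the full list of matching pairs is never stored.

```python
import math
from collections import defaultdict


def build_grid(points, cell):
    """Bucket points into square cells of side `cell`."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(math.floor(x / cell), math.floor(y / cell))].append((x, y))
    return grid


def count_within(left, right, radius):
    """For each left point, count right points within `radius`."""
    # With cell size equal to the radius, every match lies in the
    # 3x3 block of cells around the probe point.
    grid = build_grid(right, radius)
    counts = []
    for x, y in left:
        cx, cy = math.floor(x / radius), math.floor(y / radius)
        n = 0
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for px, py in grid.get((cx + dx, cy + dy), ()):
                    if math.hypot(px - x, py - y) <= radius:
                        n += 1  # aggregate directly, no pair is kept
        counts.append(n)
    return counts


left = [(0.0, 0.0), (10.0, 10.0)]
right = [(0.5, 0.0), (0.0, 0.9), (3.0, 3.0), (10.0, 10.5)]
counts = count_within(left, right, radius=1.0)
print(counts)
```

Memory use is proportional to the number of left rows plus the grid, rather than to the number of matching pairs, which matters when dense regions produce many matches.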

Optimized for Performance

The engine uses advanced query planning techniques, including:

  • Predicate Pushdown: Filters out irrelevant data early in the process.
  • Fusion of Operations: Combines multiple spatial operations into a single efficient step.
  • Cost Model Utilization: Dynamically determines the best execution strategy based on the query and data characteristics.
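Predicate pushdown in particular is easy to demonstrate in isolation. The sketch below is plain Python and independent of PyCanopy's planner; the `proximity_filter` function and the sample rows are hypothetical. It runs the same query twice, once testing distance for every row and once applying the cheap attribute filter first, and counts how many distance tests each plan performs.

```python
import math


def proximity_filter(rows, center, radius, min_pop, pushdown):
    """Return matching names and how many distance tests were performed."""
    matches, distance_tests = [], 0
    for name, x, y, pop in rows:
        if pushdown and pop <= min_pop:
            continue  # cheap attribute filter applied before the spatial test
        distance_tests += 1
        near = math.dist((x, y), center) <= radius
        if near and pop > min_pop:
            matches.append(name)
    return matches, distance_tests


# Toy rows: (name, x, y, population), made up for illustration.
rows = [("a", 0.0, 0.0, 500_000), ("b", 0.1, 0.1, 20_000),
        ("c", 5.0, 5.0, 900_000), ("d", 0.2, 0.0, 150_000),
        ("e", 9.0, 9.0, 10_000)]

naive = proximity_filter(rows, (0.0, 0.0), 1.0, 100_000, pushdown=False)
pushed = proximity_filter(rows, (0.0, 0.0), 1.0, 100_000, pushdown=True)
print(naive, pushed)
```

Both plans return the same rows, but the pushed-down plan skips the distance computation for rows that the population filter already rules out.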

Why Choose PyCanopy?

For users needing high-performance spatial operations within a Polars DataFrame, PyCanopy provides:

  • A seamless and powerful interface for spatial data querying.
  • Benchmark results that compare well against leading frameworks.
  • A comprehensive and intuitive API designed to facilitate complex spatial analysis tasks effortlessly.

For further details and documentation, please refer to the official documentation.
