pdperf is a powerful static performance linter designed to identify hidden performance issues in Pandas code. By scanning Python scripts for common anti-patterns, it catches potential slowdowns that could hinder scalability. Local-first and CI-friendly, pdperf ensures code quality without requiring execution, making it an invaluable tool for any data-centric project.
pdperf — A Static Performance Linter for Pandas
pdperf is a static linter that identifies common performance pitfalls in Pandas code before deployment. It scans Python scripts for patterns that produce correct results but can severely degrade performance, often by factors of 10 to 100 on large datasets. Key aspects of the project:
Key Features
- Static Analysis: Operates without executing code, ensuring safety and predictability.
- Deterministic Results: Produces consistent output, making it suitable for continuous integration pipelines.
- Comprehensive Rule Set: Detects a variety of performance anti-patterns that commonly affect Pandas.
Why Use pdperf?
Many Pandas operations are straightforward but can lead to inefficient execution when scaled. For instance, consider the following code:
# This works, but is painfully slow on large datasets
total = 0
for idx, row in df.iterrows():
    total += row['price'] * row['quantity']
pdperf highlights this issue and suggests a more efficient alternative:
# Use vectorized operations for better performance
total = (df['price'] * df['quantity']).sum()
How It Works
pdperf parses Python code and builds an Abstract Syntax Tree (AST). It then traverses the AST to identify patterns that match known performance issues. This analytical approach enables pdperf to:
- Flag inefficient looping constructs like iterrows().
- Warn against using apply(axis=1) for row-wise operations, which is less optimal than vectorized alternatives.
- Recognize inefficient DataFrame concatenation patterns that lead to O(n²) complexity.
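The parse-and-traverse approach described above can be illustrated with Python's standard `ast` module. This is a minimal sketch of the general technique, not pdperf's actual implementation: the `IterrowsVisitor` class and `find_iterrows_loops` helper are hypothetical names, and the check covers only the simplest case, an `.iterrows()` call used directly as a `for` loop iterable.

```python
import ast

# Sample input: the loop from the earlier example, as source text.
SOURCE = '''
total = 0
for idx, row in df.iterrows():
    total += row["price"] * row["quantity"]
'''


class IterrowsVisitor(ast.NodeVisitor):
    """Hypothetical checker: records `.iterrows()` calls used as a for-loop iterable."""

    def __init__(self):
        self.findings = []  # list of (line, column) tuples

    def visit_For(self, node):
        iterable = node.iter
        if (
            isinstance(iterable, ast.Call)
            and isinstance(iterable.func, ast.Attribute)
            and iterable.func.attr == "iterrows"
        ):
            self.findings.append((iterable.lineno, iterable.col_offset))
        # Keep walking so nested loops are checked too.
        self.generic_visit(node)


def find_iterrows_loops(source):
    """Parse source text and return positions of iterrows() loop iterables."""
    visitor = IterrowsVisitor()
    visitor.visit(ast.parse(source))
    return visitor.findings


for line, col in find_iterrows_loops(SOURCE):
    print(f"{line}:{col} iterrows() used as a loop iterable")
```

Because the match is purely syntactic, a check like this cannot confirm that the receiver is actually a DataFrame. It reports any method named `iterrows`, which is the usual trade-off of analyzing code without executing it.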
Example Output
When scanning code, pdperf provides detailed reports indicating the specific lines and types of issues detected:
📄 etl/transform.py
⚠️ 45:12 [PPO001] Avoid df.iterrows() or df.itertuples() in loops; prefer vectorized operations.
💡 Use vectorized column operations like df['a'] + df['b'], or np.where().
❌ 67:8 [PPO003] Building DataFrame via append/concat in a loop is O(n²); accumulate in a list first.
💡 Collect DataFrames in a list, then call pd.concat(frames, ignore_index=True) once after the loop.
Supported Performance Rules
pdperf includes eight rules specifically targeting the most impactful Pandas performance anti-patterns:
| Rule | Description | Severity |
|---|---|---|
| PPO001 | Avoid df.iterrows() or df.itertuples() in loops | ⚠️ WARN |
| PPO002 | row-wise apply() operations are slow | ⚠️ WARN |
| PPO003 | Avoid concat/append in loops | ❌ ERROR |
| PPO004 | Chained indexing can lead to silent failures | ❌ ERROR |
| PPO005 | Index reconstruction in loops is expensive | ⚠️ WARN |
| PPO006 | Using .values can yield inconsistent results | ⚠️ WARN |
| PPO007 | groupby().apply() is not optimized for performance | ⚠️ WARN |
| PPO008 | String operations within loops should be avoided | ⚠️ WARN |
Getting Started
Integrating pdperf into projects is straightforward. It can be run from the command line to scan specified files or directories:
pdperf scan your_code.py
pdperf scan src/
Each finding can be referenced with detailed explanations, making it easier to adopt efficient coding practices.
Conclusion
pdperf helps improve performance in Pandas applications by flagging inefficient patterns and pointing to faster alternatives. Because it relies on static analysis, it can run on every commit without executing any code. It catches known anti-patterns early, but it does not guarantee that code will perform optimally under load.