Graph-Code is an advanced Retrieval-Augmented Generation system tailored for Python codebases. By utilizing knowledge graphs and natural language processing, it allows developers to intuitively query their monorepos, extract relationships, and retrieve relevant code snippets directly, enhancing productivity and understanding of complex code structures.
Graph-Code: A Graph-Based RAG System for Python Codebases
Graph-Code is a Retrieval-Augmented Generation (RAG) system for analyzing Python repositories. It builds a knowledge graph of the codebase and lets users explore its structure and relationships with natural language queries.
Features
- AST-based Code Analysis: Conducts in-depth parsing of Python files to extract classes, functions, methods, and their interrelations, ensuring a comprehensive understanding of the code structure.
- Knowledge Graph Storage: Utilizes Memgraph to store the codebase structure as an interconnected graph, facilitating efficient data retrieval and exploration.
- Natural Language Querying: Users can query the codebase using plain language, making it intuitive to retrieve necessary information.
- AI-Powered Cypher Generation: Integrates Google Gemini to translate natural language queries into Cypher queries, simplifying the process of extracting data.
- Code Snippet Retrieval: Provides actual source code snippets corresponding to the retrieved functions or methods, enhancing usability and understanding.
- Dependency Analysis: Analyzes `pyproject.toml` files to identify and understand external dependencies, ensuring an accurate representation of the project environment.
Architecture
The system is composed of two main components:
- Repository Parser (`repo_parser.py`): Analyzes Python codebases and ingests the extracted data into Memgraph.
- RAG System (`codebase_rag/`): An interactive command-line interface (CLI) that allows users to query the resulting knowledge graph directly.
Core Components
- Graph Database: Memgraph serves as the underlying storage system for the code structure, represented as nodes and relationships.
- LLM Integration: Uses Google Gemini to translate natural language questions into Cypher queries.
- Code Analysis: Employs abstract syntax tree (AST) traversal to accurately extract code components and their attributes.
- Query Tools: Offers specialized utilities for engaging with the graph and retrieving code snippets.
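To make the AST traversal step concrete, the following sketch uses Python's standard `ast` module to pull classes, module-level functions, and methods out of a single source string. It only shows the general technique. The function name `extract_definitions` and the returned dictionary shape are assumptions, not Graph-Code's actual interface.

```python
import ast


def extract_definitions(source: str) -> dict[str, list[str]]:
    """Collect class, function, and method names from one module's source."""
    found = {"classes": [], "functions": [], "methods": []}
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    tree = ast.parse(source)
    # Only top-level statements are inspected; nested classes and
    # functions defined inside other functions are ignored in this sketch.
    for node in tree.body:
        if isinstance(node, function_types):
            found["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
            for item in node.body:
                if isinstance(item, function_types):
                    found["methods"].append(f"{node.name}.{item.name}")
    return found


sample = '''
class User:
    def login(self):
        pass

def connect():
    pass
'''
print(extract_definitions(sample))
# {'classes': ['User'], 'functions': ['connect'], 'methods': ['User.login']}
```

Separating methods from module-level functions mirrors the distinction between the Method and Function node types in the graph schema.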
Usage
Step 1: Parse a Repository
To ingest a Python repository into the knowledge graph, execute the following command:
python repo_parser.py /path/to/your/python/repo --clean
Options include:
- `--clean`: Clears existing data before parsing
- `--host`: Specify Memgraph host (default: localhost)
- `--port`: Specify Memgraph port (default: 7687)
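For readers who want to see how such a command line could be declared, here is a sketch using `argparse` from the standard library. The flags and defaults come from the list above. The positional argument name `repo_path` and the `build_parser` helper are assumptions, and the real `repo_parser.py` may be built with a different library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; names other than the flags are assumed."""
    parser = argparse.ArgumentParser(prog="repo_parser.py")
    parser.add_argument("repo_path", help="Path to the Python repository to parse")
    parser.add_argument("--clean", action="store_true",
                        help="Clear existing data before parsing")
    parser.add_argument("--host", default="localhost", help="Memgraph host")
    parser.add_argument("--port", type=int, default=7687, help="Memgraph port")
    return parser


# Parse the example invocation shown above without touching sys.argv.
args = build_parser().parse_args(["/path/to/your/python/repo", "--clean"])
print(args.clean, args.host, args.port)  # True localhost 7687
```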
Step 2: Query the Codebase
To start the interactive RAG CLI, use:
python -m codebase_rag.main --repo-path /path/to/your/repo
Example queries include:
- "Show me all classes that contain 'user' in their name"
- "Find functions related to database operations"
- "What methods does the User class have?"
- "Show me functions that handle authentication"
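To give a sense of what the generated Cypher might look like, the sketch below hand-writes a query for the first example, "Show me all classes that contain 'user' in their name". The `Class` label comes from the graph schema. The `name` property and the helper `classes_named_like` are assumptions, and the query Gemini actually produces may differ.

```python
def classes_named_like(fragment: str) -> tuple[str, dict[str, str]]:
    """Build a parameterized Cypher query for a case-insensitive name match."""
    query = (
        "MATCH (c:Class) "
        "WHERE toLower(c.name) CONTAINS $fragment "
        "RETURN c.name AS name "
        "ORDER BY name"
    )
    # The fragment is passed as a parameter rather than interpolated,
    # so user input never becomes part of the query text.
    return query, {"fragment": fragment.lower()}


query, params = classes_named_like("User")
print(query)
print(params)  # {'fragment': 'user'}
```

Because Memgraph speaks the Bolt protocol, the query and parameter pair could be handed to a Bolt-compatible client for execution.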
Graph Schema
The knowledge graph includes several node types and relationships that represent the structure of the codebase effectively.
Node Types
- Project: Represents the entire repository.
- Package: Python packages denoted by directories containing `__init__.py` files.
- Module: Individual Python files.
- Class: Defined class structures.
- Function: Module-level functions.
- Method: Functions defined within classes.
- Folder: Standard directories within the project.
- File: Non-Python files related to the codebase.
- ExternalPackage: Represents external dependencies.
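The filesystem-level node types above can be told apart with a few path checks. This sketch is one plausible reading of the definitions: a directory with `__init__.py` is a Package, any other directory is a Folder, a `.py` file is a Module, and everything else is a File. The `classify` helper is hypothetical and the real parser may apply additional rules.

```python
import tempfile
from pathlib import Path


def classify(path: Path) -> str:
    """Map a filesystem entry to one of the node labels listed above."""
    if path.is_dir():
        return "Package" if (path / "__init__.py").is_file() else "Folder"
    return "Module" if path.suffix == ".py" else "File"


# Build a tiny throwaway repository layout and label every entry in it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app").mkdir()
    (root / "app" / "__init__.py").write_text("")
    (root / "app" / "models.py").write_text("class User: ...\n")
    (root / "docs").mkdir()
    (root / "README.md").write_text("# demo\n")
    labels = {
        entry.relative_to(root).as_posix(): classify(entry)
        for entry in root.rglob("*")
    }

for name, label in sorted(labels.items()):
    print(f"{label:8} {name}")
```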
Relationships
- `CONTAINS_PACKAGE/MODULE/FILE/FOLDER`: Depicts hierarchical containment.
- `DEFINES`: Indicates that a module defines specific classes or functions.
- `DEFINES_METHOD`: Indicates that a class defines certain methods.
- `DEPENDS_ON_EXTERNAL`: Highlights dependencies on external packages.
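The relationship list can be written down as a small lookup table that records which node labels each relationship type may point at. This is only a sketch of the schema as described here. Expanding the `CONTAINS_PACKAGE/MODULE/FILE/FOLDER` shorthand into four separate types is an assumption, as are the names `ALLOWED_TARGETS` and `is_valid_edge`.

```python
# Relationship type -> node labels it may point at, following the list above.
ALLOWED_TARGETS = {
    "CONTAINS_PACKAGE": {"Package"},
    "CONTAINS_MODULE": {"Module"},
    "CONTAINS_FILE": {"File"},
    "CONTAINS_FOLDER": {"Folder"},
    "DEFINES": {"Class", "Function"},
    "DEFINES_METHOD": {"Method"},
    "DEPENDS_ON_EXTERNAL": {"ExternalPackage"},
}


def is_valid_edge(relationship: str, target_label: str) -> bool:
    """Check that a relationship type is allowed to point at a given label."""
    return target_label in ALLOWED_TARGETS.get(relationship, set())


edges = [
    ("app.models", "DEFINES", "Class", "User"),
    ("User", "DEFINES_METHOD", "Method", "login"),
    ("User", "DEFINES", "Method", "login"),  # deliberately wrong: methods use DEFINES_METHOD
]
for source, relationship, label, target in edges:
    status = "ok" if is_valid_edge(relationship, label) else "rejected"
    print(f"{source} -[{relationship}]-> {label}:{target} {status}")
```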
Graph-Code stands out as a valuable tool for developers looking to enhance their understanding of complex Python codebases through intelligent querying and data retrieval.