🧠 CodeRAG AI

Chat with Your Codebase Using AI-Powered Dependency Graphs

MIT License Python FastAPI React Neo4j Gemini

🌐 Live Demo β€’ πŸ“Έ View Screenshots β€’ πŸš€ Quick Start β€’ πŸ—οΈ Architecture


Revolutionary RAG system that builds dependency graphs using Tree-sitter for context-aware code understanding


🌟 Why CodeRAG AI?

Traditional RAG systems break code context with random chunking. CodeRAG AI is different.

❌ Traditional RAG

  • πŸ“„ Random chunking breaks meaning
  • πŸ” Loses code dependencies
  • ❌ No function/class relationships
  • πŸ’¬ Generic, context-free responses
  • ⚠️ Ignores import dependencies

βœ… CodeRAG AI

  • 🌳 Context-rich AST-based chunking
  • 🧬 Full dependency graph tracking
  • βœ… Function, class, import resolution
  • 🎯 Context-aware with relationships
  • πŸ”— Inter-file dependency mapping

✨ Key Features

| Feature | Description |
| --- | --- |
| 🌳 Tree-sitter Parsing | Context-rich chunking that preserves code structure |
| 🔗 Dependency Graphs | Complete function, class, and import relationship mapping |
| 🗄️ Neo4j Storage | Graph database with vector embeddings for semantic search |
| 🤖 Gemini-Powered | AI responses with full codebase context |
| 🔄 Session Management | Create new sessions or rejoin existing conversations |
| 🎯 Smart Context | Retrieves code with all dependencies intact |
| 💡 Natural Queries | Ask "define loginController" - automatically enhanced |
| 📊 Multi-Language | Supports multiple programming languages via Tree-sitter |
| 🔐 Persistent Sessions | Use session IDs to continue conversations anytime |

🏗️ Architecture

Core Innovation: Dependency Graph Construction

┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Git Clone     │ ───> │  Tree-sitter     │ ───> │  Extract Nodes  │
│   Repository    │      │  AST Parser      │      │  & Chunks       │
└─────────────────┘      └──────────────────┘      └─────────────────┘
                                                            │
                    ┌───────────────────────────────────────┘
                    ▼
        ┌─────────────────────────┐
        │  Build Dependency Graph │
        │  • Function calls       │
        │  • Class calls          │
        │  • Import resolution    │
        │  • Sibling relationships│
        └─────────────────────────┘
                    │
        ┌───────────┴────────────┐
        ▼                        ▼
┌───────────────┐        ┌──────────────┐
│ Generate      │        │ Store in     │
│ Embeddings    │        │ Neo4j with   │
│ (Gemini)      │        │ Session ID   │
└───────────────┘        └──────────────┘

📥 Ingestion Pipeline

  1. Repository Cloning - Clone via Git with unique session ID
  2. File Walking - Extract language from extension, process code files
  3. AST Generation - Tree-sitter parses code into Abstract Syntax Trees
  4. Import Extraction - Capture all imports for inter-file dependencies
  5. Chunk Creation - DFS traversal with MIN_CHUNK_SIZE threshold
    • Node ID: {file_path}:{start_line}:{node_type}
    • Extract: name, calls, siblings, parent relationships
  6. Call Resolution - Match function/class calls to actual definitions
  7. Import Resolution - Link imports to their source definitions
  8. Document Handling - Process README, docs with recursive chunking
  9. Graph Storage - Store in Neo4j with embeddings and session ID
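
The chunk-creation step can be illustrated with a minimal sketch. It is not the project's implementation: Python's built-in ast module stands in for Tree-sitter so the example runs without extra dependencies, which means node types read FunctionDef rather than Tree-sitter's function_definition. The MIN_CHUNK_SIZE value, the chunk_source and collect_calls helpers, and the rule that definitions below the threshold do not become their own chunk are all assumptions made for illustration.

```python
import ast

MIN_CHUNK_SIZE = 20  # hypothetical threshold, measured in characters

DEFINITION_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def collect_calls(node):
    """Return the names of simple calls such as foo() made inside `node`."""
    names = []
    for child in ast.walk(node):
        if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
            names.append(child.func.id)
    return names


def chunk_source(file_path, source):
    """Walk the syntax tree depth-first and emit one chunk per definition."""
    chunks = []

    def visit(node, parent_id, depth):
        node_id = parent_id
        if isinstance(node, DEFINITION_TYPES):
            code = ast.get_source_segment(source, node) or ""
            # Assumption: definitions below the threshold stay part of
            # their parent instead of becoming a separate chunk.
            if len(code) >= MIN_CHUNK_SIZE:
                node_id = f"{file_path}:{node.lineno}:{type(node).__name__}"
                chunks.append({
                    "id": node_id,
                    "name": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    "size": len(code),
                    "parent": parent_id,
                    "depth": depth,
                    "calls": collect_calls(node),
                })
        for child in ast.iter_child_nodes(node):
            visit(child, node_id, depth + 1)

    visit(ast.parse(source), f"{file_path}:1:Module", 0)
    return chunks
```

Each chunk carries the ID format described in step 5 together with the raw call names, which the later resolution steps would turn into graph edges.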

🔍 Retrieval Pipeline

  1. User Query - Natural language question with session ID
  2. Query Enhancement - LLM expands "define loginController" to full context
  3. Vector Embedding - Convert enhanced query to vector (Gemini)
  4. Top-K Retrieval - Find most similar chunks from Neo4j
  5. Dependency Fetching - Retrieve all related nodes (calls, imports, siblings)
  6. Context Assembly - Concatenate code with dependencies
  7. LLM Response - Gemini generates answer with full context
  8. Return Result - Send back to client with session preserved
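
Steps 4 to 6 can be sketched in memory as follows. This is an illustration under stated assumptions, not the project's code: embeddings are plain Python lists instead of Gemini vectors, a dictionary keyed by node ID stands in for Neo4j, and the cosine, retrieve, and assemble_context helpers are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, k=2):
    """Return the top-k chunk IDs by similarity plus every chunk they link to."""
    ranked = sorted(chunks.values(),
                    key=lambda c: cosine(query_vec, c["embedding"]),
                    reverse=True)
    selected = [c["id"] for c in ranked[:k]]
    # Dependency fetching: follow each relationship of the top-k chunks once.
    for chunk_id in list(selected):
        for targets in chunks[chunk_id]["relationships"].values():
            for target in targets:
                if target in chunks and target not in selected:
                    selected.append(target)
    return selected


def assemble_context(chunk_ids, chunks):
    """Concatenate the code of the selected chunks, labelled by node ID."""
    return "\n\n".join(f"# {cid}\n{chunks[cid]['code_str']}" for cid in chunk_ids)
```

The sketch expands dependencies by one hop only; whether the real pipeline follows relationships further is not stated in this document.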

🎯 Session Management

πŸ†• Create New Session

  • Paste any GitHub repository URL
  • System generates unique session ID
  • Index repository and build dependency graph
  • Start chatting immediately

🔄 Join Existing Session

  • Use session ID from previous conversations
  • Instantly access same codebase context
  • Continue conversations anytime
  • Share session IDs with team members
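
A minimal sketch of the create/join flow, assuming session IDs are random UUIDs and using an in-memory dictionary in place of the real storage. The SessionStore class and its method names are hypothetical and only illustrate the idea that a session ID is the key to a previously indexed repository.

```python
import uuid


class SessionStore:
    """In-memory stand-in for session-scoped storage (hypothetical helper)."""

    def __init__(self):
        self._sessions = {}

    def create(self, repo_url):
        """Register a repository and return a new unique session ID."""
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {"repo_url": repo_url, "history": []}
        return session_id

    def join(self, session_id):
        """Return the stored session, or raise KeyError if the ID is unknown."""
        if session_id not in self._sessions:
            raise KeyError(f"unknown session: {session_id}")
        return self._sessions[session_id]
```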

Live Application: code-rag.vercel.app


📊 Dependency Graph Structure

Chunk Node Schema

{
  "id": "file.py:42:function_definition",
  "name": "loginController",
  "code_str": "def loginController(req, res): ...",
  "ast_type": "function_definition",
  "file": "src/controllers/auth.py",
  "language": "python",
  "start_line": 42,
  "end_line": 58,
  "size": 245,
  "relationships": {
    "belongs_to": ["file.py:10:class_declaration"],
    "parent": ["file.py:5:module"],
    "sibling": ["file.py:60:function_definition"],
    "function_call": ["utils.py:15:function_definition"],
    "class_call": ["models.py:20:class_declaration"],
    "imports_from": ["auth_service.py:5:function_definition"]
  },
  "metadata": {
    "depth": 2,
    "calls": ["validateUser", "generateToken"],
    "type_references": ["User", "AuthService"],
    "is_definition": true,
    "definition_type": "function"
  }
}
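
One way to read this schema is that every entry under relationships is a typed edge from the chunk to another node. The sketch below flattens a chunk into (source, type, target) triples, which is roughly the shape a graph database expects; the relationship_edges helper is a hypothetical name, and how the project actually writes edges to Neo4j is not shown in this document.

```python
def relationship_edges(chunk):
    """Flatten a chunk's `relationships` mapping into (source, type, target) triples."""
    edges = []
    for rel_type, targets in chunk.get("relationships", {}).items():
        for target in targets:
            edges.append((chunk["id"], rel_type, target))
    return edges


example_chunk = {
    "id": "file.py:42:function_definition",
    "relationships": {
        "parent": ["file.py:5:module"],
        "function_call": ["utils.py:15:function_definition"],
        "class_call": [],
    },
}

for source, rel_type, target in relationship_edges(example_chunk):
    print(f"{source} -[{rel_type}]-> {target}")
```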

Why Dependency Graphs Matter

Problem: Traditional RAG loses context

# Random chunk breaks meaning
def process_payment(order):
    validator = OrderValidator()  # What is OrderValidator?
    if validator.check(order):     # check() definition lost
        return payment_gateway.charge()  # No import context

Solution: CodeRAG preserves everything

  • OrderValidator → Links to class definition
  • check() → Links to method implementation
  • payment_gateway → Resolves import source
  • Siblings → Related functions in same file

📸 Screenshots

🖥️ Session Selection

Session Management

Create a new session or join an existing conversation

💬 Context-Aware Chat

Chat Interface

Natural conversation with full dependency context

🔍 Neo4j Dependency Graph

Graph Visualization

Visual representation of code relationships


🚀 Quick Start

📋 Prerequisites

  • 🐍 Python 3.9+
  • 📦 Node.js 18+ and npm
  • 🔑 API keys: a Gemini API key and Neo4j database credentials

1️⃣ Clone Repository

git clone https://github.com/shivamsahu-tech/coderag-ai.git
cd coderag-ai

2️⃣ Environment Configuration

Client (client/.env):

VITE_SERVER_URL=http://localhost:8000

Server (server/.env):

NEO4J_URI=neo4j+s://xxxxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password
LLM_API_KEY=your-gemini-api-key
EMBEDDING_API_KEY=your-gemini-api-key

3️⃣ Install Dependencies

Frontend:

cd client
npm install

Backend:

cd server
# The virtual environment is named .venv so it does not collide with the server/.env file
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

4️⃣ Run Application

Backend (Terminal 1):

cd server
source .venv/bin/activate
uvicorn main:app --reload

✅ Running at: http://localhost:8000

Frontend (Terminal 2):

cd client
npm run dev

✅ Running at: http://localhost:5173

5️⃣ Start Using! πŸŽ‰

  1. Open http://localhost:5173
  2. New Session: Paste GitHub URL β†’ Wait for indexing
  3. Join Session: Enter existing session ID
  4. Ask questions about the codebase!

🛠️ Technical Deep Dive

Tree-sitter Chunking Strategy

MIN_CHUNK_SIZE determines granularity based on:

  • ✅ LLM token limits (directly proportional)
  • ✅ Embedding dimensions (directly proportional)
  • ✅ Graph density (inversely proportional)

Node ID Format: {file_path}:{start_line}:{node_type}

Example: src/auth.py:42:function_definition

Call Resolution Algorithm

  1. Extract calls from AST nodes (e.g., foo())
  2. Traverse all chunks to find matching definitions
  3. Link caller → callee with function_call relationship
  4. Store resolved node IDs in relationship fields
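
The four steps above can be sketched as follows. This is an assumption-laden illustration rather than the project's algorithm: it matches calls to definitions purely by name, builds a name index up front instead of rescanning every chunk per call, and skips self-references. The resolve_calls helper is a hypothetical name; the field names follow the chunk node schema.

```python
from collections import defaultdict


def resolve_calls(chunks):
    """Fill each chunk's `function_call` relationship with matching definition IDs."""
    # Index definitions by name so each call is a dictionary lookup.
    definitions = defaultdict(list)
    for chunk in chunks:
        if chunk["metadata"].get("is_definition"):
            definitions[chunk["name"]].append(chunk["id"])

    for chunk in chunks:
        resolved = []
        for call_name in chunk["metadata"].get("calls", []):
            for target_id in definitions.get(call_name, []):
                # Assumption: a chunk does not link to itself (recursion is skipped).
                if target_id != chunk["id"] and target_id not in resolved:
                    resolved.append(target_id)
        chunk.setdefault("relationships", {})["function_call"] = resolved
    return chunks
```

Name-only matching links a call to every definition that shares the name, so two functions called validate in different files would both be attached; narrowing that down would need the import information from the next section.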

Import Resolution

  1. Parse imports_from field (e.g., from auth import login)
  2. Identify source file from import path
  3. Search the source file's chunks for the matching definition
  4. Create imports_from relationship edge
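
A minimal sketch of these steps for Python-style imports, under explicit assumptions: only absolute statements of the form `from module import name` are handled (no relative or parenthesized imports), the module path maps directly to a .py file, and the resolve_import and file_matches helpers are hypothetical names. The chunk fields follow the chunk node schema.

```python
import re

IMPORT_PATTERN = re.compile(r"^from\s+([\w.]+)\s+import\s+(.+)$")


def file_matches(path, source_file):
    """True if `path` is `source_file` or ends with it as a whole path component."""
    return path == source_file or path.endswith("/" + source_file)


def resolve_import(statement, chunks):
    """Map `from module import a, b` to the IDs of the matching definitions."""
    match = IMPORT_PATTERN.match(statement.strip())
    if not match:
        return []
    module, names = match.groups()
    # Assumption: dotted module paths correspond to directories on disk.
    source_file = module.replace(".", "/") + ".py"
    # Drop aliases such as `login as do_login` and keep the original name.
    wanted = {name.strip().split(" as ")[0] for name in names.split(",")}
    return [chunk["id"] for chunk in chunks
            if file_matches(chunk["file"], source_file) and chunk["name"] in wanted]
```

Each returned ID would become the target of an imports_from edge from the importing chunk.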

💡 Use Cases

👨‍💻 For Developers

  • 🔍 Understand unfamiliar codebases instantly
  • 📚 Onboard new team members 10x faster
  • 🐛 Debug with full dependency context
  • 📝 Auto-generate documentation

🏢 For Teams

  • 🤝 Share session IDs for collaboration
  • 📊 Code review with context awareness
  • 🔄 Refactoring impact analysis
  • 🎯 Architecture exploration

🧪 Example Queries

"Define loginController"
→ Enhanced: "Explain loginController function, its purpose, related 
   functions, and how it handles authentication"

"Where is the database initialized?"
"Show all functions that call validateUser"
"What does the UserService class do?"
"How are imports structured in this project?"
"Find all API endpoints"

shsax

🎓 B.Tech in Information Technology

LinkedIn Portfolio

Live Demo: code-rag.vercel.app


🌟 If you find CodeRAG AI helpful, please star the repository! 🌟
