
CodeChat - Repository-Intelligent Code Search

Note: This README contains all essential documentation. Previous documentation files have been consolidated here. Component-specific READMEs remain in their respective directories (chat/README.md, etc.). Archived components and old docs are in the archive/ directory.

πŸš€ Quick Start - Run the Main Pipeline

The easiest way to test everything is to run the main pipeline:

cd /Users/sagar/projects/codechat

# Set environment variables (or use .env file)
export OPENAI_API_KEY="your-openai-key"
export PINECONE_API_KEY="your-pinecone-key"
export PINECONE_ENVIRONMENT="us-east-1-aws"
export PINECONE_INDEX_NAME="model-earth"

# Run the complete pipeline
python main.py

What this does:

  1. Loads the repositories listed in config/repositories.yml
  2. Chunks each source file with the smart chunker
  3. Generates code summaries and embeddings via the OpenAI API
  4. Stores the resulting vectors and metadata in the configured Pinecone index
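
For orientation, here is a minimal sketch of that flow, assuming the class and method names used in the tests below (SmartChunker, CodeSummarizer, CodeEmbeddingGenerator); the actual orchestration inside main.py may differ:

# Illustrative sketch only - not the exact contents of main.py
import os
from src.core.chunker.smart_chunker import SmartChunker
from src.core.summarizer import CodeSummarizer
from src.core.embedding_generator import CodeEmbeddingGenerator

def process_file(file_path: str, content: str):
    chunker = SmartChunker()
    summarizer = CodeSummarizer(os.getenv("OPENAI_API_KEY"))
    generator = CodeEmbeddingGenerator(os.getenv("OPENAI_API_KEY"))

    for chunk in chunker.smart_chunk_file_from_content(file_path, content):
        summary = summarizer.summarize_full_code(chunk["content"], file_path)
        embedding = generator.generate_embedding(chunk["content"])
        # ...then upsert embedding + metadata (repo_name, file_path, chunk_summary, ...) into Pinecone
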
πŸ§ͺ Testing the Restructured Codebase

Prerequisites

  1. Python Environment: Make sure you have Python 3.8+ installed
  2. Dependencies: Install required packages
    pip install -r requirements.txt
    
  3. Environment Variables: Set up your API keys
    cp config/.env.example .env
    # Edit .env with your actual API keys:
    # OPENAI_API_KEY=your_openai_key
    # PINECONE_API_KEY=your_pinecone_key
    # PINECONE_ENVIRONMENT=your_pinecone_env
    

πŸ§ͺ Test 1: Basic Import Test

Verify that all modules can be imported correctly:

# Test core modules
python -c "from src.core.simple_processor import SimpleCodeProcessor; print('βœ… Simple processor import OK')"
python -c "from src.core.code_processor_main import Config; print('βœ… Main processor import OK')"
python -c "from src.core.summarizer import CodeSummarizer; print('βœ… Summarizer import OK')"
python -c "from src.core.embedding_generator import CodeEmbeddingGenerator; print('βœ… Embedding generator import OK')"

# Test utility modules
python -c "from src.utils.evaluate import *; print('βœ… Utils import OK')"

πŸ§ͺ Test 2: Simple Processor (No API Keys Required)

Test the basic file processing pipeline without external APIs:

# Run the simple processor (will use mock embeddings)
python -m src.core.simple_processor

# Expected output:
# πŸš€ Processing 1 repositories...
# πŸ“¦ Processing: codechat
# ⚠️ Repository path repo_analysis_output/test_repo not found, using current workspace for testing
# πŸ“ Found X files to process
# πŸ”„ Processing: [filename]
# πŸ’Ύ Would store chunk chunk_0 in Pinecone
#      Index: model-earth, Namespace: codechat-test
#      Embedding size: 1536
#      Metadata keys: ['repo_name', 'file_path', 'chunk_content', 'chunk_summary', 'chunk_id', 'language', 'timestamp']
# βœ… Completed: codechat
# πŸŽ‰ Processing complete!

πŸ§ͺ Test 3: Configuration Loading

Test that configuration files are loaded correctly:

# Test repositories.yml loading
python -c "
from src.core.simple_processor import SimpleCodeProcessor
processor = SimpleCodeProcessor()
repos = processor.load_repositories()
print(f'βœ… Loaded {len(repos)} repositories from config')
for repo in repos:
    print(f'  - {repo.get(\"name\", \"unnamed\")}: {repo.get(\"url\", \"no url\")}')
"

πŸ§ͺ Test 4: Individual Components

Test individual components separately:

4.1 Test the Chunker

python -c "
from src.core.chunker.smart_chunker import SmartChunker
chunker = SmartChunker()
test_content = '''def hello():
    print('Hello World')

class TestClass:
    def method(self):
        return 'test' '''
chunks = chunker.smart_chunk_file_from_content('test.py', test_content)
print(f'βœ… Generated {len(chunks)} chunks')
for i, chunk in enumerate(chunks[:2]):  # Show first 2 chunks
    print(f'  Chunk {i}: {len(chunk[\"content\"])} chars')
"

4.2 Test the Summarizer (requires OpenAI API key)

python -c "
import os
if os.getenv('OPENAI_API_KEY'):
    from src.core.summarizer import CodeSummarizer
    summarizer = CodeSummarizer(os.getenv('OPENAI_API_KEY'))
    test_code = 'def calculate_sum(a, b): return a + b'
    summary = summarizer.summarize_full_code(test_code, 'test.py')
    print('βœ… Summarizer working')
    print(f'Summary: {summary.get(\"summary\", \"No summary\")[:100]}...')
else:
    print('⚠️  OPENAI_API_KEY not set - skipping summarizer test')
"

4.3 Test the Embedding Generator (requires OpenAI API key)

python -c "
import os
if os.getenv('OPENAI_API_KEY'):
    from src.core.embedding_generator import CodeEmbeddingGenerator
    generator = CodeEmbeddingGenerator(os.getenv('OPENAI_API_KEY'))
    test_text = 'def hello(): return \"world\"'
    embedding = generator.generate_embedding(test_text)
    print(f'βœ… Embedding generated: {len(embedding)} dimensions')
else:
    print('⚠️  OPENAI_API_KEY not set - skipping embedding test')
"

πŸ§ͺ Test 5: Utility Scripts

Test the utility scripts in src/utils/:

5.1 Run the Demo Script

python src/utils/run_chunker_demo.py

5.2 Test Evaluation Script

python src/utils/evaluate.py

5.3 Run Unit Tests

python src/utils/test.py

πŸ§ͺ Test 6: Full Integration Test

Run a complete end-to-end test (requires API keys):

# Make sure your .env file has the correct API keys
export $(cat .env | xargs)

# Run the main processor
python -m src.core.code_processor_main

# Expected: Complete processing pipeline with real API calls

πŸ§ͺ Test 7: Lambda Function Tests

Test the Lambda functions locally:

# Test code processor Lambda
python -c "
from src.lambdas.code_processor.index import lambda_handler
event = {'repositories': [{'url': 'https://github.com/test/repo', 'name': 'test'}]}
result = lambda_handler(event, None)
print('βœ… Lambda function executed')
print(f'Result: {result}')
"

πŸ§ͺ Test 8: Web Interface

If you want to test the web interface:

# Check if there are any web files
ls src/web/

# If there's an index.html, you can open it in a browser
# Or run a local server
python -m http.server 8000
# Then visit http://localhost:8000/src/web/index.html

πŸ” Debugging Tips

If imports fail:

# Check Python path
python -c "import sys; print('\n'.join(sys.path))"

# Add src to Python path manually
export PYTHONPATH=$PYTHONPATH:$(pwd)/src

If configuration fails:

# Check if config files exist
ls -la config/

# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('config/repositories.yml'))"

If API calls fail:

# Check environment variables
echo $OPENAI_API_KEY
echo $PINECONE_API_KEY

# Confirm the OpenAI client picks up your key (does not make a billable API call)
python -c "import os, openai; openai.api_key = os.getenv('OPENAI_API_KEY'); print('βœ… OpenAI client configured')"

πŸ“Š Expected Test Results

Test             | Expected Result           | Notes
Import Test      | βœ… Success messages       | All modules should import without errors
Simple Processor | βœ… File processing output | Should show chunking and mock storage
Config Loading   | βœ… Repository list loaded | Should show configured repositories
Chunker Test     | βœ… Chunks generated       | Should create multiple chunks from test code
Summarizer Test  | βœ… Summary generated      | Requires OpenAI API key
Embedding Test   | βœ… 1536-dim vector        | Requires OpenAI API key
Utility Scripts  | βœ… Script execution       | May require additional setup
Full Integration | βœ… Complete pipeline      | Requires all API keys
Lambda Tests     | βœ… Function execution     | Should handle events properly
Test Suite       | βœ… Passing tests          | May have some integration test failures

🚨 Common Issues & Solutions

Issue: β€œModule not found”

Solution: Add src to Python path

export PYTHONPATH=$PYTHONPATH:$(pwd)/src

Issue: β€œNo repositories found”

Solution: Check config/repositories.yml exists and has valid YAML

Issue: β€œAPI key not set”

Solution: Copy config/.env.example to .env and fill in your keys

Issue: β€œPermission denied”

Solution: Make sure scripts are executable

chmod +x src/utils/*.py

🎯 Quick Test Command

For a fast sanity check, run this one-liner:

cd /Users/sagar/projects/codechat && python -c "
from src.core.simple_processor import SimpleCodeProcessor
from src.core.code_processor_main import Config
print('βœ… Core modules import successfully')
processor = SimpleCodeProcessor()
repos = processor.load_repositories()
print(f'βœ… Configuration loaded: {len(repos)} repositories')
print('πŸŽ‰ Ready for testing!')
"

This will verify that the restructuring worked correctly and the basic functionality is intact!


πŸ—οΈ Backend + Chunking Strategies

Infrastructure Overview

CodeChat uses a streamlined AWS serverless architecture focused on essential components:

Core Components:

  - Frontend chat interface (chat/)
  - API Gateway exposing the /query and /repositories endpoints
  - Lambda functions (query_handler, get_repositories)
  - Lambda layer providing the shared Python dependencies
  - S3 bucket holding the repository configuration (modelearth_repos.yml)

Deployment Architecture

Frontend (chat/) 
    ↓ HTTP requests
API Gateway (/query, /repositories)
    ↓ Lambda invocations  
Lambda Functions (query_handler, get_repositories)
    ↓ Dependencies
Lambda Layer (Python packages)
    ↓ Configuration
S3 Bucket (modelearth_repos.yml)

Intelligent Chunking Strategy

CodeChat implements an advanced multi-agent chunking system for optimal code understanding:

1. Smart File Processing

2. Multi-Level Chunking

3. Agentic Search

The system uses agentic components for enhanced search:

# Query Analysis Agent
class QueryAnalysisAgent:
    def analyze_query(self, query: str) -> "QueryAnalysis":
        # Determines the query type and search strategy
        # Returns: code_search, conceptual_search, debugging_help, etc.
        ...

# Repository Intelligence Agent
class RepositoryIntelligentSearchAgent:
    def search(self, query_analysis: "QueryAnalysis", repo_context: str):
        # Executes a targeted search based on the query type
        # Returns: relevant code chunks with explanations
        ...
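
A hypothetical call sequence for these agents (how query_handler actually wires them together may differ):

analysis = QueryAnalysisAgent().analyze_query("How does authentication work?")
results = RepositoryIntelligentSearchAgent().search(analysis, repo_context="modelearth/webroot")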

4. Context Generation

Clean Deployment Process

Quick Deploy (One Command)

# Set environment variables
export TF_VAR_openai_api_key="your-openai-key"
export TF_VAR_pinecone_api_key="your-pinecone-key"

# Deploy everything
./deploy-clean.sh

Manual Deployment

# 1. Build Lambda layers
cd backend/lambda_layers
pip3 install -r lambda_layer_query_handler_requirements.txt -t temp_layer/python/
(cd temp_layer && zip -r ../lambda-layer-query-handler.zip python/)  # Lambda layers expect python/ at the zip root

# 2. Deploy infrastructure
cd ../infra
terraform init
terraform apply -var-file="terraform-clean.tfvars"

# 3. Configure frontend
API_URL=$(terraform output -raw api_gateway_url)
echo "window.CODECHAT_API_ENDPOINT = '$API_URL';" >> ../../chat/script.js

Configuration Management

Repository Configuration (config/modelearth_repos.yml)

repositories:
  - name: "modelearth/webroot"
    description: "Main website repository"
    priority: "high"
  - name: "modelearth/cloud" 
    description: "Cloud infrastructure"
    priority: "medium"

Environment Variables

# Required for deployment
export TF_VAR_openai_api_key="sk-..."
export TF_VAR_pinecone_api_key="..."

# Optional (have defaults)
export TF_VAR_aws_region="us-east-1"
export TF_VAR_pinecone_environment="us-east-1-aws"
export TF_VAR_pinecone_index="model-earth-jam-stack"

Performance Optimizations

Lambda Configuration

Search Optimizations

Architecture Cleanup (Recent Changes)

Archived Components (moved from active to archive):

Essential Components Kept:

Benefits Achieved:

API Endpoints

POST /query

Submit search queries to the repository-intelligent search system:

{
  "query": "How does authentication work?",
  "repo_name": "modelearth/webroot", 
  "llm_provider": "bedrock"
}

GET /repositories

Get list of available repositories for search:

{
  "repositories": [
    {"name": "modelearth/webroot", "description": "Main website"},
    {"name": "modelearth/cloud", "description": "Cloud infrastructure"}
  ]
}
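
For a quick smoke test of both endpoints from Python, something like the following works, assuming the placeholder base URL is replaced with the API Gateway URL from terraform output and that the responses match the JSON shapes above:

import requests

API = "https://your-api-id.execute-api.us-east-1.amazonaws.com/prod"  # placeholder API Gateway URL

repos = requests.get(f"{API}/repositories", timeout=30).json()
print([r["name"] for r in repos["repositories"]])

answer = requests.post(f"{API}/query", json={
    "query": "How does authentication work?",
    "repo_name": "modelearth/webroot",
    "llm_provider": "bedrock",
}, timeout=60)
print(answer.json())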

Frontend Integration

The chat interface (chat/index.html) provides:

Monitoring & Troubleshooting

CloudWatch Integration

# View Lambda logs
aws logs describe-log-groups --log-group-name-prefix "/aws/lambda/codechat"

# Monitor API Gateway request counts (get-metric-statistics also needs a metric, statistic, period, and time range)
aws cloudwatch get-metric-statistics --namespace AWS/ApiGateway --metric-name Count \
  --statistics Sum --period 300 \
  --start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)" --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

Common Issues

This streamlined architecture provides a robust, scalable foundation for repository-intelligent code search while maintaining simplicity and cost-effectiveness.


For component-specific documentation, see individual README files in chat/, backend/, etc. All major system documentation has been consolidated into this main README.