IntelliDoc is a persona-driven document intelligence system that extracts and ranks the most relevant sections of PDF documents for a given persona and their job-to-be-done. It helps users quickly find the information most pertinent to their needs and expertise level.
- Features
- Installation
- Usage
- Project Structure
- Architecture
- Input Format
- Output Format
- Docker Support
- Development
- License
- Multi-format Support: Extracts headings from both text-based and image-based PDFs
- Multilingual OCR: Supports multiple languages including English, Japanese, Chinese, Arabic, Hindi, and Korean
- Smart Heading Detection:
- Identifies headings based on font size, weight, and style
- Handles nested heading hierarchies
- Distinguishes between main headings and subheadings
- Document Structure Analysis:
- Automatically detects document structure
- Identifies body text vs. headings
- Handles complex layouts and columns
- Output Formats:
- JSON output with hierarchical heading structure
- Preserves page numbers and positions
- Extracts text content under each heading
- Enhanced PDF text extraction with outline parsing
- OCR fallback for image-based PDFs
- Multilingual support (English, Japanese, Chinese, Arabic, Hindi, Korean)
- Intelligent section boundary detection
- Content type classification
- Table and figure extraction
- Automatic persona type identification (researcher, analyst, student, etc.)
- Domain focus detection (academic, business, technical, etc.)
- Experience level assessment
- Expertise area extraction
- Job requirement parsing
- Multi-dimensional scoring system:
- Semantic Similarity (35%)
- Keyword Overlap (25%)
- Content Type Match (20%)
- Expertise Alignment (15%)
- Structural Importance (5%)
- Granular content extraction from top sections
- Paragraph-level relevance scoring
- Intelligent text chunking for optimal readability
- Context-aware text refinement
- Python 3.8 or higher
- Tesseract OCR (for image-based PDFs)
- Poppler (for PDF processing)
- Git (for cloning the repository)
git clone https://github.com/Codealpha07/IntelliDoc.git
cd IntelliDoc
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On Unix or MacOS:
source venv/bin/activate
pip install -r requirements.txt
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
macOS (using Homebrew):
brew install tesseract poppler
Windows:
- Download and install Tesseract OCR from UB Mannheim
- During installation, check "Add Tesseract to your system PATH"
- Download and install Poppler from poppler-windows
- Add Poppler to your system PATH
python -c "import fitz; print('PyMuPDF version:', fitz.__version__)"
python -c "import pytesseract; print('Tesseract version:', pytesseract.get_tesseract_version())"
- Prepare Input Files
  - Place 3-10 PDF files in the inputs/ directory
  - Create a configuration file (see Input Format)
  - Example directory structure:

        IntelliDoc/
        ├── inputs/
        │   ├── document1.pdf
        │   ├── document2.pdf
        │   └── config.json
        └── ...

- Run the Application

      # Basic usage with default settings
      python main.py
      # With custom input/output directories
      python main.py --input-dir ./my_inputs --output-dir ./my_outputs
      # With verbose logging
      python main.py --verbose

- View Results
  - Output will be saved in the outputs/ directory by default
  - Main output file: challenge1b_output.json
  - Logs are available in app.log (when verbose mode is enabled)
python main.py [options]
Options:
--input-dir PATH Directory containing input PDFs (default: inputs/)
--output-dir PATH Directory to save output (default: outputs/)
--config FILE Path to configuration file (default: auto-detected)
--verbose Enable verbose logging (default: False)
--debug Enable debug mode (more detailed logging) (default: False)
--max-docs N Maximum number of documents to process (default: 10)
--no-ocr Disable OCR processing (faster but may miss text in images)
--language LANG Set OCR language (default: eng)
Create a config.json file in your input directory with the following format:
{
"persona": "[Persona description]",
"job_to_be_done": "[Detailed description of the task]",
"scoring_weights": {
"semantic_similarity": 0.35,
"keyword_overlap": 0.25,
"content_type_match": 0.20,
"expertise_alignment": 0.15,
"structural_importance": 0.05
}
}
Note: The scoring_weights block is optional. If provided but the weights do not sum to 1.0, they will be normalized proportionally. If omitted or if the sum is 0, defaults will be used.
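The normalization rule in the note can be sketched as follows. This is a minimal illustration, not IntelliDoc's actual code: the function name `resolve_weights` is hypothetical, and it assumes that unknown keys are ignored and that keys missing from a user-supplied block count as 0 before normalizing.

```python
DEFAULT_WEIGHTS = {
    "semantic_similarity": 0.35,
    "keyword_overlap": 0.25,
    "content_type_match": 0.20,
    "expertise_alignment": 0.15,
    "structural_importance": 0.05,
}


def resolve_weights(config):
    """Return scoring weights that sum to 1.0, falling back to the defaults."""
    raw = config.get("scoring_weights")
    if not raw:
        # Block omitted (or empty): use the documented defaults.
        return dict(DEFAULT_WEIGHTS)
    # Assumption: unknown keys are ignored, missing keys count as 0.
    weights = {name: float(raw.get(name, 0.0)) for name in DEFAULT_WEIGHTS}
    total = sum(weights.values())
    if total <= 0:
        # Sum of 0: the note says defaults are used.
        return dict(DEFAULT_WEIGHTS)
    # Proportional normalization so the weights sum to 1.0.
    return {name: value / total for name, value in weights.items()}
```

With this sketch, a block such as `{"semantic_similarity": 2, "keyword_overlap": 2}` would resolve to 0.5 for each of those two components and 0 for the rest.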
Example:
{
"persona": "Senior Machine Learning Engineer with 5 years of experience in NLP",
"job_to_be_done": "Research state-of-the-art transformer architectures for document understanding",
"scoring_weights": {
"semantic_similarity": 0.50,
"keyword_overlap": 0.30,
"content_type_match": 0.10,
"expertise_alignment": 0.05,
"structural_importance": 0.05
}
}
You can also configure the application using environment variables:
# Set input/output directories
export INTELLIDOC_INPUT_DIR=./my_inputs
export INTELLIDOC_OUTPUT_DIR=./my_outputs
# Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export INTELLIDOC_LOG_LEVEL=INFO
# Enable/disable features
export INTELLIDOC_ENABLE_OCR=true
export INTELLIDOC_MAX_DOCS=5
The system expects:
- 3-10 PDF files in the /app/input/ directory
- Configuration file (one of):
  - config.json: JSON format with persona and job_to_be_done fields
  - persona.json: Alternative JSON configuration
  - input.json: Another alternative format
  - *.txt: Plain text with persona on first line, job description following
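One way the configuration lookup described above could work is sketched here. The function name `load_persona_config` and the lookup order (JSON files first, in the order listed, then any `.txt` file) are assumptions for illustration; the actual auto-detection logic may differ.

```python
import json
from pathlib import Path

# Assumed lookup order, taken from the order of the list above.
CONFIG_NAMES = ("config.json", "persona.json", "input.json")


def load_persona_config(input_dir):
    """Return a dict with 'persona' and 'job_to_be_done' from the first config found."""
    folder = Path(input_dir)
    for name in CONFIG_NAMES:
        path = folder / name
        if path.is_file():
            return json.loads(path.read_text(encoding="utf-8"))
    # Plain-text fallback: persona on the first line, job description after it.
    for path in sorted(folder.glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        if lines:
            job = " ".join(line.strip() for line in lines[1:] if line.strip())
            return {"persona": lines[0].strip(), "job_to_be_done": job}
    raise FileNotFoundError(f"No configuration file found in {folder}")
```

Calling `load_persona_config("inputs")` would then return the parsed persona and job description, or raise `FileNotFoundError` if the directory contains none of the supported files.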
{
"persona": "PhD Researcher in Computational Biology with expertise in machine learning applications",
"job_to_be_done": "Prepare a comprehensive literature review focusing on methodologies, datasets, and performance benchmarks for graph neural networks in drug discovery"
}
The system generates challenge1b_output.json with:
{
"metadata": {
"input_documents": ["doc1.pdf", "doc2.pdf"],
"persona": "...",
"job_to_be_done": "...",
"processing_timestamp": "2025-01-XX...",
"total_sections_analyzed": 45,
"documents_processed": 3
},
"extracted_sections": [
{
"document": "doc1.pdf",
"page_number": 3,
"section_title": "Graph Neural Network Architectures",
"importance_rank": 1
}
],
"subsection_analysis": [
{
"document": "doc1.pdf",
"page_number": 3,
"refined_text": "Graph neural networks have emerged as..."
}
]
}
The system recognizes 8 persona types:
- Researcher: Focus on methodology, literature, benchmarks
- Student: Emphasis on concepts, examples, fundamentals
- Analyst: Business metrics, trends, performance analysis
- Engineer: Technical implementation, architecture, systems
- Manager: Strategy, planning, execution, processes
- Consultant: Recommendations, best practices, optimization
- Journalist: Facts, events, context, impact analysis
- Entrepreneur: Market opportunities, innovation, scaling
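As a rough illustration of how a persona description might be mapped to one of these types, the sketch below counts keyword matches. The keyword sets and the function name `detect_persona_type` are hypothetical, loosely derived from the focus areas listed above; the real classifier may use different signals.

```python
import re

# Hypothetical keyword sets, not taken from the IntelliDoc source.
PERSONA_KEYWORDS = {
    "researcher": {"researcher", "phd", "literature", "methodology", "benchmarks"},
    "student": {"student", "undergraduate", "learner", "fundamentals"},
    "analyst": {"analyst", "metrics", "trends"},
    "engineer": {"engineer", "developer", "architecture", "implementation"},
    "manager": {"manager", "strategy", "planning"},
    "consultant": {"consultant", "advisor", "recommendations"},
    "journalist": {"journalist", "reporter", "editor"},
    "entrepreneur": {"entrepreneur", "founder", "startup"},
}


def detect_persona_type(description, default=None):
    """Return the persona type whose keywords match the description most often."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    scores = {ptype: len(words & kws) for ptype, kws in PERSONA_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # With no keyword hits at all, report the caller-supplied default.
    return best if scores[best] > 0 else default
```

For the example persona used earlier, "Senior Machine Learning Engineer with 5 years of experience in NLP", this sketch would return `"engineer"`.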
Automatically identifies 6 content types:
- Methodology: Approaches, techniques, frameworks
- Results: Findings, data, measurements, outcomes
- Background: Context, literature, historical information
- Analysis: Evaluation, interpretation, discussion
- Examples: Cases, illustrations, applications
- Summary: Conclusions, abstracts, key points
def calculate_score(section, persona, job):
    # Weighted sum of the five relevance components, capped at 1.0
    score = (
        0.35 * semantic_similarity(section, persona, job) +
        0.25 * keyword_overlap(section, persona, job) +
        0.20 * content_type_match(section, job) +
        0.15 * expertise_alignment(section, persona) +
        0.05 * structural_importance(section)
    )
    return min(score, 1.0)
- Efficient Text Processing: Optimized regular expressions and string operations
- Smart Caching: Avoid redundant computations
- Memory Management: Process documents sequentially to minimize RAM usage
- Early Filtering: Skip irrelevant content early in the pipeline
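The calculate_score function shown before this list depends on component functions that are not defined in this README. The sketch below fills that gap under stated assumptions: keyword overlap is approximated with Jaccard similarity over word sets (a stand-in, not necessarily the measure IntelliDoc uses), the other components are passed in as precomputed values between 0 and 1, and the helper names `combine` and `rank_sections` are hypothetical.

```python
import re

WEIGHTS = {
    "semantic_similarity": 0.35,
    "keyword_overlap": 0.25,
    "content_type_match": 0.20,
    "expertise_alignment": 0.15,
    "structural_importance": 0.05,
}


def tokens(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(section_text, query_text):
    """Jaccard similarity of the two word sets (assumed stand-in measure)."""
    a, b = tokens(section_text), tokens(query_text)
    return len(a & b) / len(a | b) if a and b else 0.0


def combine(components, weights=WEIGHTS):
    """Weighted sum of component scores, capped at 1.0; missing components count as 0."""
    score = sum(weights[name] * components.get(name, 0.0) for name in weights)
    return min(score, 1.0)


def rank_sections(scored):
    """Turn (title, score) pairs into (importance_rank, title, score), best first."""
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    return [(rank, title, score) for rank, (title, score) in enumerate(ordered, start=1)]
```

The rank produced by `rank_sections` corresponds to the importance_rank field in the output format, with 1 as the most relevant section.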
- Build the Docker image:

      docker build --platform linux/amd64 -t intellidoc:latest .

- Verify the image was built successfully:

      docker images | grep intellidoc

- Create input and output directories:

      mkdir -p ./input ./output

- Place your PDF files and config in the input directory:

      cp your-document.pdf ./input/
      cp config.json ./input/

- Run the container:

      docker run --rm \
        -v $(pwd)/input:/app/input \
        -v $(pwd)/output:/app/output \
        --network none \
        intellidoc:latest

For easier management, you can use Docker Compose:
- Create a docker-compose.yml file:

      version: '3.8'
      services:
        intellidoc:
          build: .
          volumes:
            - ./input:/app/input
            - ./output:/app/output
          environment:
            - INTELLIDOC_LOG_LEVEL=INFO
            - INTELLIDOC_MAX_DOCS=5
          deploy:
            resources:
              limits:
                cpus: '2'
                memory: 2G

- Start the service:

      docker-compose up --build

- Resource Limits: The example above sets reasonable CPU and memory limits
- Volume Mounting: Use named volumes for production deployments
- Security: Runs with non-root user inside container
- Caching: Leverages Docker layer caching for faster builds
- Python 3.8+: Core programming language
- PyMuPDF (fitz): Advanced PDF processing and text extraction
- pytesseract: OCR capabilities for image-based PDFs
- Pillow: Image processing for OCR preprocessing
- numpy: Numerical computations and array operations
- pandas: Data manipulation and analysis
- scikit-learn: Machine learning utilities for text processing
- nltk: Natural language processing toolkit
- tqdm: Progress bars for long-running operations
- pytest: Testing framework
- black: Code formatting
- flake8: Linting
- mypy: Static type checking
- pylint: Code quality checking
- Tesseract OCR: For optical character recognition
- Poppler: For PDF rendering and processing
- Git: For version control
- Graceful Fallbacks: Automatically falls back to OCR for image-based PDFs
- Robust Parsing: Handles malformed PDF structures and recovers gracefully
- Input Validation: Validates all inputs before processing
- Configuration Defaults: Provides sensible defaults for missing configuration
- Resource Management: Properly manages file handles and system resources
Logs are written to app.log and can be configured via environment variables:
import logging
import os
# Basic configuration
logging.basicConfig(
    level=os.getenv('INTELLIDOC_LOG_LEVEL', 'INFO'),
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('app.log'),
        logging.StreamHandler()
    ]
)
# Get logger for a module
logger = logging.getLogger(__name__)
- OCR Failures:
  - Ensure Tesseract is installed and in PATH
  - Check that the language packs are installed
  - Verify image quality if using scanned documents
- PDF Processing Errors:
  - Corrupt PDFs may cause issues
  - Try opening the PDF in a viewer to verify integrity
  - Consider converting to a different format if problems persist
- Performance Issues:
  - Large PDFs may require more memory
  - Consider splitting large documents
  - Enable logging to identify bottlenecks
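For the first troubleshooting item, a quick way to confirm that the external tools are reachable is to look them up on PATH. The helper below is a hypothetical diagnostic, not part of IntelliDoc; it assumes the Poppler installation provides the `pdftoppm` executable, which is the case for the standard poppler-utils package.

```python
import shutil


def check_dependencies(tools=("tesseract", "pdftoppm")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}
```

Printing the result of `check_dependencies()` shows which tools were found. If `tesseract` resolves but OCR still fails for a given language, running `tesseract --list-langs` in a terminal lists the installed language packs.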
| Metric | Value | Notes |
|---|---|---|
| Processing Time | ~30-45s | For 5 average-sized PDFs |
| Memory Usage | 400-600MB | Peak usage during processing |
| CPU Usage | 2 cores | Can be scaled based on workload |
| Model Size | <100MB | No large external models |
| Document Size | Up to 50MB | Per document |
| Batch Size | Up to 10 | Documents per run |
- Vertical Scaling:
  - Increase CPU cores for faster processing
  - Add more RAM for larger documents or batches
- Horizontal Scaling:
  - Process documents in parallel across multiple instances
  - Use a message queue for job distribution
- Optimization Tips:
  - Enable caching for repeated documents
  - Disable OCR if not needed with --no-ocr
  - Process documents in smaller batches
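The tip about caching repeated documents could be approached by keying results on file content rather than file name. The sketch below is only an illustration under that assumption: `ResultCache` and `file_digest` are hypothetical names, the cache lives in memory, and this README does not document how IntelliDoc's own caching works.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks so large PDFs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ResultCache:
    """In-memory cache keyed by file content (hypothetical; not part of IntelliDoc)."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, path, compute):
        # Identical bytes give the same key, even under a different file name.
        key = file_digest(path)
        if key not in self._store:
            self._store[key] = compute(Path(path))
        return self._store[key]
```

Here `compute` stands for whatever function extracts sections from a single PDF; it runs once per distinct file content.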
| Resource | Minimum | Recommended |
|---|---|---|
| CPU Cores | 1 | 2+ |
| RAM | 1GB | 2GB |
| Disk Space | 100MB | 1GB |
| OS | Linux/Windows/macOS | Linux |
The codebase includes comprehensive test coverage for:
- Document processing and text extraction
- Persona analysis and classification
- Relevance scoring algorithms
- Input/output handling
- Error conditions and edge cases
Run the full test suite:
pytest tests/ -v --cov=.
- Unit Tests:
  - Individual component testing
  - Mocked dependencies
  - Edge case validation
- Integration Tests:
  - End-to-end document processing
  - Configuration handling
  - File system operations
- Performance Tests:
  - Processing time benchmarks
  - Memory usage profiling
  - Scaling characteristics
Test data is stored in tests/test_data/ and includes:
- Sample PDF documents
- Configuration files
- Expected output files
The project includes a .github/workflows/ci.yml file that runs:
- Unit tests
- Type checking
- Linting
- Code coverage
- Semantic Understanding:
  - Context-aware text analysis
  - Topic modeling
  - Entity recognition
- Persona Adaptation:
  - Dynamic adjustment based on expertise level
  - Domain-specific processing
  - Customizable scoring weights
- Content Intelligence:
  - Automatic section detection
  - Table and figure extraction
  - Citation analysis
- Scoring Weights: You can customize the relevance engine scoring weights by adding a scoring_weights block to your config.json file. If omitted, the default weights are:

      {
        "semantic_similarity": 0.35,
        "keyword_overlap": 0.25,
        "content_type_match": 0.20,
        "expertise_alignment": 0.15,
        "structural_importance": 0.05
      }

- Plugins and Extensions:
  - Custom document processors
  - Specialized analyzers
  - Output formatters
- API Access:
  - RESTful interface
  - Python package
  - Command-line interface
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
For support, please open an issue in the GitHub repository.
- Adobe for the original challenge
- Open source contributors
- The Python community
This implementation combines the robust PDF extraction capabilities from Round 1A with sophisticated intelligence layers to deliver highly relevant, persona-specific document insights.