Pdf2Table

A Python library for detecting, extracting, and processing tables from PDF documents.

Overview

This project provides a robust solution for extracting tabular data from PDF documents using state-of-the-art computer vision models.

The extraction pipeline implements a 4-step methodology:

PDF Rendering - Convert pages to high-resolution images with text extraction
Table Detection - Identify table regions using transformer models
Structure Recognition - Detect cells, rows, columns, and headers
Grid Construction - Build structured grids with intelligent text extraction (direct PDF + OCR fallback)

Technologies Used

Table Transformer Models (Microsoft Research): Detection and structure recognition
PyMuPDF (fitz): PDF processing and rendering with direct text extraction
TrOCR (Microsoft): Optional OCR fallback for scanned documents
Transformers (Hugging Face): Model inference pipeline
Pydantic: Entity validation and data modeling

Features

Advanced Table Detection: High-confidence table region identification using transformer models
Structure Recognition: Automatic detection of cells, merged cells, headers, rows, and columns
Intelligent Text Extraction: Two-stage approach (direct PDF text + OCR fallback)
Flexible Configuration: Customizable thresholds, DPI, and processing options
Clean Architecture: Well-organized codebase following software engineering best practices
Comprehensive Logging: Detailed logs for debugging and monitoring
Multiple Output Formats: JSON export with metadata and structured data

Installation

# Install the package in development mode
pip install -e .

Usage

from pdf2table.frameworks.pipeline import create_pipeline

# Create the extraction pipeline with configuration
pipeline = create_pipeline(
    device="cpu",
    detection_threshold=0.9,
    structure_threshold=0.6,
    pdf_dpi=300,
    load_ocr=False,
    visualize=False
)

# Extract tables from a specific page
response = pipeline.extract_tables(pdf_path="document.pdf", page_number=0)

# Or extract tables from all pages
response = pipeline.extract_tables(pdf_path="document.pdf")

# Check if extraction was successful
if response.success:
    print(f"Successfully extracted {len(response.tables)} tables")
    
    # Access extracted tables
    for table in response.tables:
        print(f"Table with {len(table.grid.cells)} cells")
        print(f"Grid size: {table.grid.n_rows} x {table.grid.n_cols}")
    
    # Convert to dictionary format
    result_dict = response.to_dict()
    print(result_dict)
    
    # Save results to JSON file
    response.save_to_json("output/extracted_tables.json")
else:
    print(f"Extraction failed: {response.error_message}")

Configuration Options

The create_pipeline() method accepts the following parameters:

device (str): Device for ML models - "cpu" or "cuda" (default: "cpu")
detection_threshold (float): Confidence threshold for table detection (default: 0.9)
- High (0.9+): Fewer false positives, may miss some tables
- Medium (0.7-0.9): Balanced detection
- Low (<0.7): More tables detected, more false positives
structure_threshold (float): Confidence threshold for structure recognition (default: 0.6)
- High (0.8+): Clean structure, may miss some cells
- Medium (0.5-0.8): Balanced recognition (recommended)
- Low (<0.5): Detects more cells, more noise
pdf_dpi (int): DPI for PDF page rendering (default: 300)
- 150: Faster processing, lower quality
- 300: Balanced (recommended)
- 600+: Better quality, slower processing
load_ocr (bool): Whether to load OCR service (default: False)
- False: Direct PDF text extraction only (faster, recommended for native PDFs)
- True: Enable TrOCR fallback (essential for scanned documents)
visualize (bool): Whether to enable visualization (default: False)
visualization_save_dir (str): Directory to save visualizations (default: "data/table_visualizations")

📋 Logging

The project includes comprehensive logging capabilities for debugging and monitoring:

Log Files: By default, logs are written to:

logs/pdf2table.log - Main application log
logs/pdf2table_errors.log - Error-only log

Documentation: See docs/logging_guide.md for detailed logging documentation.

🏗️ Architecture

The project follows Clean Architecture with three main layers:

Entities Layer: Core business objects (PageImage, DetectedTable, TableGrid, GridCell)
Use Cases Layer: Business logic orchestration (TableExtractionUseCase, TableGridBuilder)
Frameworks Layer: External tools (PyMuPDF, Table Transformer, TrOCR)

For detailed technical documentation, see docs/technical_report/technical_report.md.

🎯 Use Cases

Document Processing Pipelines

Financial Reports: Extract financial tables from annual reports and earnings statements
Research Papers: Extract data tables from scientific publications
Invoice Processing: Extract line items and totals from invoices
Government Documents: Process regulatory filings and public documents

Integration Scenarios

Data Analytics: Feed extracted data into analytical workflows
Document Management: Enhance document search with structured table data
Compliance: Automated extraction for regulatory compliance reporting
Business Intelligence: Convert PDF reports into structured datasets for analysis

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Follow Clean Architecture principles
Add comprehensive tests
Update documentation
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Microsoft Research: Table Transformer models
Hugging Face: TrOCR and Transformers library
PyMuPDF Team: Excellent PDF processing capabilities

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
docs		docs
examples		examples
pdf2table		pdf2table
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Pdf2Table

Overview

Technologies Used

Features

Installation

Usage

Configuration Options

📋 Logging

🏗️ Architecture

🎯 Use Cases

Document Processing Pipelines

Integration Scenarios

🤝 Contributing

📄 License

🙏 Acknowledgments

About

Uh oh!

Releases 1

Packages

Languages

License

Alijanloo/Pdf2Table

Folders and files

Latest commit

History

Repository files navigation

Pdf2Table

Overview

Technologies Used

Features

Installation

Usage

Configuration Options

📋 Logging

🏗️ Architecture

🎯 Use Cases

Document Processing Pipelines

Integration Scenarios

🤝 Contributing

📄 License

🙏 Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages