A Python library for detecting, extracting, and processing tables from PDF documents.
This project extracts tabular data from PDF documents using transformer-based computer vision models for table detection and structure recognition.
The extraction pipeline runs in four steps:
- PDF Rendering - Convert pages to high-resolution images and extract the embedded text
- Table Detection - Identify table regions using transformer models
- Structure Recognition - Detect cells, rows, columns, and headers
- Grid Construction - Build structured grids with intelligent text extraction (direct PDF + OCR fallback)
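
The direct-text-then-OCR fallback in the last step can be sketched as follows. This is a minimal illustration, not the library's implementation: `extract_cell_text`, `direct_text`, and `ocr_fn` are hypothetical names, and the rule "fall back to OCR only when direct extraction yields no text" is an assumption.

```python
from typing import Callable, Optional


def extract_cell_text(direct_text: Optional[str],
                      ocr_fn: Optional[Callable[[], str]] = None) -> str:
    """Return text for one cell, preferring text read directly from the PDF.

    direct_text: text found inside the cell's bounding box (hypothetical input).
    ocr_fn: zero-argument callable that runs OCR on the cell image;
            None when OCR is not loaded.
    """
    text = (direct_text or "").strip()
    if text:
        return text              # stage 1: native PDF text is available
    if ocr_fn is not None:
        return ocr_fn().strip()  # stage 2: OCR fallback for scanned content
    return ""                    # no text layer and no OCR loaded


# Native PDF cell: the OCR callable is never invoked.
native = extract_cell_text("Revenue", ocr_fn=lambda: "unused")
# Scanned cell with no text layer: the OCR result is used.
scanned = extract_cell_text("", ocr_fn=lambda: " 1,250 ")
```

Passing OCR as a callable keeps the expensive model call lazy, so native PDFs never pay for it.
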
- Table Transformer Models (Microsoft Research): Detection and structure recognition
- PyMuPDF (fitz): PDF processing and rendering with direct text extraction
- TrOCR (Microsoft): Optional OCR fallback for scanned documents
- Transformers (Hugging Face): Model inference pipeline
- Pydantic: Entity validation and data modeling
- Advanced Table Detection: High-confidence table region identification using transformer models
- Structure Recognition: Automatic detection of cells, merged cells, headers, rows, and columns
- Intelligent Text Extraction: Two-stage approach (direct PDF text + OCR fallback)
- Flexible Configuration: Customizable thresholds, DPI, and processing options
- Clean Architecture: Codebase organized into entities, use cases, and frameworks layers
- Comprehensive Logging: Detailed logs for debugging and monitoring
- Multiple Output Formats: Python dictionary and JSON export with metadata and structured data
# Install the package in development mode
pip install -e .

from pdf2table.frameworks.pipeline import create_pipeline
# Create the extraction pipeline with configuration
pipeline = create_pipeline(
    device="cpu",
    detection_threshold=0.9,
    structure_threshold=0.6,
    pdf_dpi=300,
    load_ocr=False,
    visualize=False
)
# Extract tables from a specific page
response = pipeline.extract_tables(pdf_path="document.pdf", page_number=0)
# Or extract tables from all pages
response = pipeline.extract_tables(pdf_path="document.pdf")
# Check if extraction was successful
if response.success:
    print(f"Successfully extracted {len(response.tables)} tables")
    # Access extracted tables
    for table in response.tables:
        print(f"Table with {len(table.grid.cells)} cells")
        print(f"Grid size: {table.grid.n_rows} x {table.grid.n_cols}")
    # Convert to dictionary format
    result_dict = response.to_dict()
    print(result_dict)
    # Save results to JSON file
    response.save_to_json("output/extracted_tables.json")
else:
    print(f"Extraction failed: {response.error_message}")

The create_pipeline() function accepts the following parameters:
- device (str): Device for ML models, "cpu" or "cuda" (default: "cpu")
- detection_threshold (float): Confidence threshold for table detection (default: 0.9)
  - High (0.9+): Fewer false positives, may miss some tables
  - Medium (0.7-0.9): Balanced detection
  - Low (<0.7): More tables detected, more false positives
- structure_threshold (float): Confidence threshold for structure recognition (default: 0.6)
  - High (0.8+): Clean structure, may miss some cells
  - Medium (0.5-0.8): Balanced recognition (recommended)
  - Low (<0.5): Detects more cells, more noise
- pdf_dpi (int): DPI for PDF page rendering (default: 300)
  - 150: Faster processing, lower quality
  - 300: Balanced (recommended)
  - 600+: Better quality, slower processing
- load_ocr (bool): Whether to load OCR service (default: False)
  - False: Direct PDF text extraction only (faster, recommended for native PDFs)
  - True: Enable TrOCR fallback (essential for scanned documents)
- visualize (bool): Whether to enable visualization (default: False)
- visualization_save_dir (str): Directory to save visualizations (default: "data/table_visualizations")
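
To make the threshold and DPI trade-offs concrete, here is a small self-contained sketch. The confidence scores are made-up values, `filter_by_confidence` and `rendered_size` are hypothetical helpers rather than part of pdf2table, and the size calculation relies on the standard PDF unit of 72 points per inch.

```python
from typing import List, Tuple


def filter_by_confidence(scores: List[float], threshold: float) -> List[float]:
    """Keep only detections whose confidence meets the threshold."""
    return [s for s in scores if s >= threshold]


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel size of a page rendered at `dpi` (PDF points are 1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


scores = [0.97, 0.85, 0.62]  # illustrative confidences, not real model output

high = filter_by_confidence(scores, 0.9)  # only the most confident region survives
low = filter_by_confidence(scores, 0.6)   # all three regions, including likely noise

# A US Letter page measures 612 x 792 points.
size_150 = rendered_size(612, 792, 150)
size_300 = rendered_size(612, 792, 300)
```

Doubling the DPI quadruples the number of pixels per page, which is the main reason higher DPI settings process more slowly.
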
The project includes comprehensive logging capabilities for debugging and monitoring:
Log Files: By default, logs are written to:
- logs/pdf2table.log - Main application log
- logs/pdf2table_errors.log - Error-only log
Documentation: See docs/logging_guide.md for detailed logging documentation.
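
A two-file layout like this can be reproduced with the standard `logging` module. The sketch below is an assumption about how such a setup might look, not the project's actual configuration; `configure_logging` is a hypothetical helper, and only the two file names are taken from this README.

```python
import logging
from pathlib import Path


def configure_logging(log_dir: str = "logs") -> logging.Logger:
    """Attach a main log file and an error-only log file to one logger."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("pdf2table")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    main = logging.FileHandler(Path(log_dir) / "pdf2table.log", encoding="utf-8")
    main.setLevel(logging.DEBUG)    # every record goes to the main log

    errors = logging.FileHandler(Path(log_dir) / "pdf2table_errors.log", encoding="utf-8")
    errors.setLevel(logging.ERROR)  # only ERROR and above

    for handler in (main, errors):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

Filtering at the handler level rather than the logger level lets a single `logger.error(...)` call land in both files while routine messages stay in the main log only.
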
The project follows Clean Architecture with three main layers:
- Entities Layer: Core business objects (PageImage, DetectedTable, TableGrid, GridCell)
- Use Cases Layer: Business logic orchestration (TableExtractionUseCase, TableGridBuilder)
- Frameworks Layer: External tools (PyMuPDF, Table Transformer, TrOCR)
For detailed technical documentation, see docs/technical_report/technical_report.md.
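
As a rough illustration of the entities layer, the sketch below models a grid and its cells with plain dataclasses. The usage example in this README only confirms `grid.cells`, `grid.n_rows`, and `grid.n_cols`; every other field (`row`, `col`, `text`, `row_span`, `col_span`) and the `to_rows` helper are assumptions, and the project itself lists Pydantic, not dataclasses, for entity modeling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GridCell:
    row: int
    col: int
    text: str = ""
    row_span: int = 1  # >1 for vertically merged cells (assumed field)
    col_span: int = 1  # >1 for horizontally merged cells (assumed field)


@dataclass
class TableGrid:
    n_rows: int
    n_cols: int
    cells: List[GridCell] = field(default_factory=list)

    def to_rows(self) -> List[List[str]]:
        """Expand cells into a dense row-major matrix, repeating merged text."""
        rows = [["" for _ in range(self.n_cols)] for _ in range(self.n_rows)]
        for cell in self.cells:
            for r in range(cell.row, min(cell.row + cell.row_span, self.n_rows)):
                for c in range(cell.col, min(cell.col + cell.col_span, self.n_cols)):
                    rows[r][c] = cell.text
        return rows


grid = TableGrid(
    n_rows=2,
    n_cols=2,
    cells=[
        GridCell(row=0, col=0, text="Quarter", col_span=2),  # merged header cell
        GridCell(row=1, col=0, text="Q1"),
        GridCell(row=1, col=1, text="1,250"),
    ],
)
matrix = grid.to_rows()
```

Because the entities carry no dependency on PyMuPDF or the models, they can be built and tested in isolation, which is the main benefit of keeping them in their own layer.
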
- Financial Reports: Extract financial tables from annual reports and earnings statements
- Research Papers: Extract data tables from scientific publications
- Invoice Processing: Extract line items and totals from invoices
- Government Documents: Process regulatory filings and public documents
- Data Analytics: Feed extracted data into analytical workflows
- Document Management: Enhance document search with structured table data
- Compliance: Automated extraction for regulatory compliance reporting
- Business Intelligence: Convert PDF reports into structured datasets for analysis
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Follow Clean Architecture principles
- Add comprehensive tests
- Update documentation
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Microsoft Research: Table Transformer models
- Hugging Face: TrOCR and Transformers library
- PyMuPDF Team: Excellent PDF processing capabilities
