A simple command-line client for LLMWhisperer, a powerful document extraction service from Unstract that converts complex documents (PDFs, images, scanned files) into LLM-ready text.
- Extract text from PDFs, images, and scanned documents
- Multiple extraction modes for different document types
- Table structure preservation with optional border recreation
- Page-specific extraction
- Save output to file or display in terminal
- Environment-based API key configuration
Requirements:
- Python 3.7 or higher
- pip package manager
- Clone this repository:
git clone https://github.com/Zipstack/llmwhisperer-cli-test-script.git
cd llmwhisperer-cli-test-script
- Create and activate a virtual environment:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure your API key:
Create a .env file in the project directory:
LLMWHISPERER_API_KEY=your_api_key_here
Alternatively, set it as an environment variable:
export LLMWHISPERER_API_KEY=your_api_key_here
Get your API key from Unstract LLMWhisperer.
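The two configuration routes above (a .env file or an exported variable) could be resolved roughly as follows. This is a minimal standard-library sketch, not the script's actual code: the helper names `parse_env_file` and `resolve_api_key` are hypothetical, and letting the process environment take precedence over the .env file is an assumption.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(env_file=".env", environ=None):
    """Return the API key; the environment wins over the file (assumed order)."""
    environ = os.environ if environ is None else environ
    key = environ.get("LLMWHISPERER_API_KEY")
    if not key:
        key = parse_env_file(env_file).get("LLMWHISPERER_API_KEY")
    if not key:
        raise RuntimeError("LLMWHISPERER_API_KEY not found")
    return key
```

Calling `resolve_api_key()` from the project directory would then return the key from either source, or raise if neither is set.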
Extract text from a document:
python llmwhisperer_cli.py document.pdf
To save the output to a file:
python llmwhisperer_cli.py document.pdf -o extracted_text.txt
LLMWhisperer supports different extraction modes optimized for various document types:
- native_text: For digitally created PDFs with embedded text (fastest)
- low_cost: For clean, printed documents with good scan quality
- high_quality (default): For challenging documents, including handwritten text
- form: For documents with forms, checkboxes, and structured layouts
- table: For documents with dense table structures
Example:
python llmwhisperer_cli.py document.pdf -m table
Extract specific pages or page ranges:
# Extract pages 1-5
python llmwhisperer_cli.py document.pdf -p "1-5"
# Extract pages 1-5 and page 7
python llmwhisperer_cli.py document.pdf -p "1-5,7"
# Extract from page 21 to end
python llmwhisperer_cli.py document.pdf -p "21-"
For better table structure preservation:
# Add vertical borders
python llmwhisperer_cli.py document.pdf --vert
# Add both vertical and horizontal borders
python llmwhisperer_cli.py document.pdf --vert --horiz
Note: --horiz requires --vert to be enabled.
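The dependency between the two border flags can be enforced at argument-parsing time. The sketch below uses the standard `argparse` module and includes only the positional file argument and the two border flags. The function name `parse_border_args` is hypothetical, and the real script may perform this check differently.

```python
import argparse


def parse_border_args(argv=None):
    """Parse the border flags and reject --horiz without --vert."""
    parser = argparse.ArgumentParser(description="Border option sketch")
    parser.add_argument("file_path", help="Path to the document to process")
    parser.add_argument("--vert", action="store_true",
                        help="Recreate vertical table borders")
    parser.add_argument("--horiz", action="store_true",
                        help="Recreate horizontal table borders")
    args = parser.parse_args(argv)
    if args.horiz and not args.vert:
        # parser.error prints the usage text and exits with status 2.
        parser.error("--horiz requires --vert")
    return args
```

Reporting the problem through `parser.error` keeps the message next to the usage text, which is where users look first when a flag combination is rejected.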
Extract tables from specific pages with borders and save to file:
python llmwhisperer_cli.py financial_report.pdf \
-m table \
-p "10-15" \
--vert --horiz \
-o tables_output.txt

| Option | Description | Default |
|---|---|---|
| file_path | Path to the document to process | Required |
| -o, --output | Output file to save extracted text | None (prints to console) |
| -m, --mode | Extraction mode (see modes above) | high_quality |
| -p, --pages | Pages to extract (e.g., "1-5,7,21-") | All pages |
| --vert | Recreate vertical table borders | False |
| --horiz | Recreate horizontal table borders | False |
| -h, --help | Show help message | - |
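To make the page specification accepted by -p concrete, the sketch below expands a string such as "1-5,7,21-" into explicit page numbers. It only illustrates the syntax: the function `expand_pages` is hypothetical, the client may simply pass the string through to the service, and clamping ranges to the document length is an assumption.

```python
def expand_pages(spec, total_pages):
    """Expand a page spec such as "1-5,7,21-" into a sorted list of pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, _, end_text = part.partition("-")
            start = int(start_text)
            # An open-ended range ("21-") runs to the last page.
            end = int(end_text) if end_text else max(total_pages, start)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length (assumed behavior).
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)
```

For a 22-page document, `expand_pages("1-5,7,21-", 22)` yields pages 1 through 5, page 7, and pages 21 and 22.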
The client provides:
- Extracted text (to console or file)
- Total number of pages processed
- Processing status and progress indicators
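The outputs listed above could be produced by something like the following. This is an illustrative sketch only: `emit_result` is a hypothetical helper, and sending status lines to standard error (so they do not mix with extracted text on standard output) is a design assumption, not confirmed behavior of the script.

```python
import sys
from pathlib import Path


def emit_result(text, page_count, output_path=None, status_stream=None):
    """Write extracted text to a file or stdout, then report the page count."""
    status_stream = sys.stderr if status_stream is None else status_stream
    if output_path is None:
        # No -o given: print the extracted text to the console.
        print(text)
    else:
        Path(output_path).write_text(text, encoding="utf-8")
        print(f"Saved extracted text to {output_path}", file=status_stream)
    print(f"Total pages processed: {page_count}", file=status_stream)
```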
The client uses the following environment variables:
- LLMWHISPERER_API_KEY: Your API key (required)
- LLMWHISPERER_BASE_URL_V2: API endpoint (optional, defaults to US region)
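The optional endpoint variable can be read with a fallback, as in this small sketch. The function `resolve_base_url` is hypothetical, and the US default URL shown is an assumption that should be checked against the LLMWhisperer documentation.

```python
import os

# Assumed US-region default; verify against the official documentation.
DEFAULT_BASE_URL = "https://llmwhisperer-api.us-central.unstract.com/api/v2"


def resolve_base_url(environ=None):
    """Return LLMWHISPERER_BASE_URL_V2 if set and non-empty, else the default."""
    environ = os.environ if environ is None else environ
    return environ.get("LLMWHISPERER_BASE_URL_V2") or DEFAULT_BASE_URL
```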
For EU region, set:
LLMWHISPERER_BASE_URL_V2=https://llmwhisperer-api.eu-west.unstract.com/api/v2
Additional usage examples:
python llmwhisperer_cli.py scanned_document.pdf -m low_cost
python llmwhisperer_cli.py application_form.pdf -m form -o form_data.txt
python llmwhisperer_cli.py data_tables.pdf -m table --vert --horiz
python llmwhisperer_cli.py manual.pdf -p "1-10,50-55" -o summary.txt
Common issues:
- "LLMWHISPERER_API_KEY not found": Ensure your .env file is in the same directory as the script, or set the environment variable.
- "--horiz requires --vert": Horizontal borders can only be added when vertical borders are enabled.
- Timeout errors: For large documents, the default timeout is 200 seconds. The script will wait for processing to complete.
This project is provided under the MIT license.