A simple command-line client for LLMWhisperer, a powerful document extraction service from Unstract that converts complex documents (PDFs, images, scanned files) into LLM-ready text.
- Extract text from PDFs, images, and scanned documents
- Multiple extraction modes for different document types
- Table structure preservation with optional border recreation
- Page-specific extraction
- Save output to file or display in terminal
- Environment-based API key configuration
- Python 3.7 or higher
- pip package manager
- Clone this repository:
git clone https://github.com/Zipstack/llmwhisperer-cli-test-script.git
cd llmwhisperer-cli-test-script
- Create and activate a virtual environment:
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure your API key:
Create a .env file in the project directory:
LLMWHISPERER_API_KEY=your_api_key_here
Alternatively, set it as an environment variable:
export LLMWHISPERER_API_KEY=your_api_key_here
Get your API key from Unstract LLMWhisperer.
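The two configuration routes above can be sketched in Python. This is an illustration only, not the script's actual loading code: the helper names `read_dotenv` and `resolve_api_key` are hypothetical, and the precedence (environment variable first, then the .env file) is an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def read_dotenv(path: Path) -> dict:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="value" and KEY=value behave alike.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env_file: Path = Path(".env")) -> Optional[str]:
    """Return the key from the environment, else from the .env file, else None."""
    # Assumption: an exported variable takes precedence over the .env file.
    key = os.environ.get("LLMWHISPERER_API_KEY")
    if key:
        return key
    return read_dotenv(env_file).get("LLMWHISPERER_API_KEY")
```

If `resolve_api_key()` returns `None`, neither source provided a key, which corresponds to the "LLMWHISPERER_API_KEY not found" case.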
Extract text from a document:
python llmwhisperer_cli.py document.pdf
Save the extracted text to a file:
python llmwhisperer_cli.py document.pdf -o extracted_text.txt
LLMWhisperer supports different extraction modes optimized for various document types:
- native_text: For digitally created PDFs with embedded text (fastest)
- low_cost: For clean, printed documents with good scan quality
- high_quality (default): For challenging documents including handwritten text
- form: For documents with forms, checkboxes, and structured layouts
- table: For documents with dense table structures
Example:
python llmwhisperer_cli.py document.pdf -m table
Extract specific pages or page ranges:
# Extract pages 1-5
python llmwhisperer_cli.py document.pdf -p "1-5"
# Extract pages 1-5 and page 7
python llmwhisperer_cli.py document.pdf -p "1-5,7"
# Extract from page 21 to end
python llmwhisperer_cli.py document.pdf -p "21-"
For better table structure preservation:
# Add vertical borders
python llmwhisperer_cli.py document.pdf --vert
# Add both vertical and horizontal borders
python llmwhisperer_cli.py document.pdf --vert --horiz
Note: --horiz requires --vert to be enabled.
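The dependency between the two border flags can be enforced at argument-parsing time. The following is a minimal sketch using the standard `argparse` module. It assumes nothing about how the script implements the check, and the function name `parse_border_flags` is hypothetical.

```python
import argparse


def parse_border_flags(argv):
    """Parse --vert/--horiz and reject --horiz when --vert is absent."""
    parser = argparse.ArgumentParser(prog="llmwhisperer_cli.py")
    parser.add_argument("--vert", action="store_true",
                        help="Recreate vertical table borders")
    parser.add_argument("--horiz", action="store_true",
                        help="Recreate horizontal table borders")
    args = parser.parse_args(argv)
    if args.horiz and not args.vert:
        # parser.error prints usage plus the message and exits with status 2.
        parser.error("--horiz requires --vert")
    return args
```

Calling `parse_border_flags(["--horiz"])` exits with an error, while `["--vert"]` and `["--vert", "--horiz"]` are both accepted.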
Extract tables from specific pages with borders and save to file:
python llmwhisperer_cli.py financial_report.pdf \
  -m table \
  -p "10-15" \
  --vert --horiz \
  -o tables_output.txt

| Option | Description | Default |
|---|---|---|
| file_path | Path to the document to process | Required | 
| -o, --output | Output file to save extracted text | None (prints to console) | 
| -m, --mode | Extraction mode (see modes above) | high_quality | 
| -p, --pages | Pages to extract (e.g., "1-5,7,21-") | All pages | 
| --vert | Recreate vertical table borders | False | 
| --horiz | Recreate horizontal table borders | False | 
| -h, --help | Show help message | - | 
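The `-p` value in the table combines single pages ("7"), closed ranges ("1-5"), and an open-ended range ("21-"). To make those semantics concrete, here is a small sketch that expands such a specification into page numbers. It is illustrative only: the client most likely passes the string to the service unchanged, and both `expand_pages` and its `total_pages` argument are hypothetical names.

```python
def expand_pages(spec, total_pages):
    """Expand a spec such as "1-5,7,21-" into a sorted list of 1-based pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on non-numeric input
        if not sep:
            end = start            # single page, e.g. "7"
        elif end_text:
            end = int(end_text)    # closed range, e.g. "1-5"
        else:
            end = total_pages      # open-ended range, e.g. "21-"
        if start < 1 or (end_text and end < start):
            raise ValueError(f"invalid page range: {part!r}")
        # Pages beyond the end of the document are silently dropped.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)
```

For a 10-page document, `expand_pages("1-5,7", 10)` returns `[1, 2, 3, 4, 5, 7]`.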
The client provides:
- Extracted text (to console or file)
- Total number of pages processed
- Processing status and progress indicators
The client uses the following environment variables:
- LLMWHISPERER_API_KEY: Your API key (required)
- LLMWHISPERER_BASE_URL_V2: API endpoint (optional, defaults to US region)
For EU region, set:
LLMWHISPERER_BASE_URL_V2=https://llmwhisperer-api.eu-west.unstract.com/api/v2
# Scanned document in low_cost mode
python llmwhisperer_cli.py scanned_document.pdf -m low_cost
# Form in form mode, saved to a file
python llmwhisperer_cli.py application_form.pdf -m form -o form_data.txt
# Tables in table mode with vertical and horizontal borders
python llmwhisperer_cli.py data_tables.pdf -m table --vert --horiz
# Selected page ranges, saved to a file
python llmwhisperer_cli.py manual.pdf -p "1-10,50-55" -o summary.txt
- "LLMWHISPERER_API_KEY not found": Ensure your .env file is in the same directory as the script, or set the environment variable.
- "--horiz requires --vert": Horizontal borders can only be added when vertical borders are enabled.
- Timeout errors: The script waits for processing to complete, with a default timeout of 200 seconds. Large documents can exceed this limit.
This project is provided under the MIT license.