@YoungVor YoungVor commented Oct 7, 2025

TL;DR

Added PDF file support for OpenAI models with proper token counting and estimation.

What changed?

  • Made max_completion_tokens optional in OpenAI chat completions requests
  • Implemented PDF file token counting for OpenAI models:
    • Added methods to count tokens for PDF files in both input and output
    • Updated the token counter to handle PDF files with proper token estimation
    • Added PDF parsing support to various OpenAI models in the model catalog
  • Refactored the OpenAI token counting logic
  • Mini refactor: separated the _max_output_tokens user-limit concept from _estimate_output_tokens, which is used for cost estimation and throttling
    • Added to OpenAI and updated Gemini
    • Max tokens (sent in the request):
      • Use the max tokens provided by the semantic operator, if set
      • Otherwise, for page parsing specifically, use an upper limit based on the output limit of our smallest supported VLM (8,000 tokens)
      • Add the expected reasoning effort
    • Estimated output tokens (for cost estimation and throttling):
      • Use the max tokens provided by the semantic operator, if set
      • Otherwise, estimate the file output tokens
      • Add the expected reasoning effort
  • Added OpenAI models to the semantic_parse_pdf tests
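The split above between the request-level limit and the throttling estimate can be sketched roughly as follows. All names and the 8,000-token constant's usage are illustrative; this is not the actual implementation.

```python
# Illustrative sketch of the two concepts separated in this PR:
# the max-tokens cap placed in the request vs. the output-token
# estimate used for cost accounting and throttling.

SMALLEST_VLM_OUTPUT_LIMIT = 8000  # output limit of the smallest supported VLM


def max_output_tokens(operator_max_tokens, is_page_parsing, reasoning_tokens):
    """Upper bound sent in the request itself (hypothetical helper)."""
    if operator_max_tokens is not None:
        base = operator_max_tokens
    elif is_page_parsing:
        # Page parsing specifically falls back to the smallest VLM's limit
        base = SMALLEST_VLM_OUTPUT_LIMIT
    else:
        return None  # leave the request uncapped
    # Budget extra room for the expected reasoning effort
    return base + reasoning_tokens


def estimate_output_tokens(operator_max_tokens, file_token_estimate, reasoning_tokens):
    """Estimate used for cost estimation and throttling (hypothetical helper)."""
    base = operator_max_tokens if operator_max_tokens is not None else file_token_estimate
    return base + reasoning_tokens
```

The key design point is that the two values can legitimately differ: the request cap protects against runaway generation, while the estimate can be smaller (or file-derived) so throttling and cost projections stay realistic.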

Out of scope

Token estimation should ultimately happen at the semantic operator level, since the operator has the context for what it expects from the model. Currently, the semantic operator only passes a 'max token' limit to the client, and we use that upper limit in our estimates. As a future improvement, we should refactor so the semantic operator decides on the output token limit for the request.

How to test?

  1. Run the new token counter tests: `pytest tests/_inference/test_openai_token_counter.py`
  2. Test PDF parsing with OpenAI models: `pytest tests/_backends/local/functions/test_semantic_parse_pdf.py`
  3. Verify that token estimation works correctly with PDF files by using a model that supports PDF parsing
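For a quick sanity check of the estimation behavior, a page-count heuristic like the following captures the general shape. The per-page constant here is purely illustrative (back-of-envelope from the 200-page benchmark later in this thread, which reported estimated_input_tokens=365142); the real counter depends on the model's document tokenization.

```python
# Hypothetical page-based token estimate for a PDF input.
# TOKENS_PER_PAGE is an illustrative heuristic, not the project's constant.
TOKENS_PER_PAGE = 1826  # ~365,142 estimated input tokens / 200 pages


def estimate_pdf_input_tokens(num_pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Rough input-token estimate for a num_pages-page PDF."""
    return num_pages * tokens_per_page
```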

YoungVor commented Oct 7, 2025

@YoungVor YoungVor changed the title Add openai support for semantic parse_pdf feat: Add openai support for semantic parse_pdf Oct 7, 2025
@YoungVor YoungVor force-pushed the 09-25-add_openai_support_for_semantic_parse_pdf branch 3 times, most recently from f71a6bc to 37e3057 Compare October 7, 2025 20:29
YoungVor commented Oct 7, 2025

Running tests with 200 pages of financial documentation

o3 and gpt-4o-mini did the best job of reproducing the table structure.

### gpt-4o-mini

- Time taken: 0:04:11.227989
estimated_input_tokens=365142, estimated_output_tokens=147632
Session Usage Summary:
  App Name: evaluate_pdf_parsing
  Session ID: f7d378de-2d81-47f8-923f-e2b34cbbd8d5
  Total queries executed: 4
  Total execution time: 249608.92ms
  Total rows processed: 9
  Total language model cost: $0.078701
  Total language model requests: 69
  Total language model tokens: 171,084 input tokens, 2,944 cached input tokens, 88,030 output tokens
  Total embedding model cost: $0.00
  Total cost: $0.078701

### o3

- Time taken: 0:05:44.930210
estimated_input_tokens=365142, estimated_output_tokens=430256
Session Usage Summary:
  App Name: evaluate_pdf_parsing
  Session ID: dd3224ae-ce1a-41b1-b97b-935aa24a8d0f
  Total queries executed: 4
  Total execution time: 340675.98ms
  Total rows processed: 9
  Total language model cost: $1.473413
  Total language model requests: 69
  Total language model tokens: 173,959 input tokens, 0 cached input tokens, 140,687 output tokens
  Total embedding model cost: $0.00
  Total cost: $1.473413

### gpt-5-nano

- Time taken: 0:02:38.444469
estimated_input_tokens=365142, estimated_output_tokens=288944
Session Usage Summary:
  App Name: evaluate_pdf_parsing
  Session ID: a43d8f82-cd38-4965-8223-54e01663a5b7
  Total queries executed: 4
  Total execution time: 155641.77ms
  Total rows processed: 9
  Total language model cost: $0.061618
  Total language model requests: 69
  Total language model tokens: 221,763 input tokens, 0 cached input tokens, 126,325 output tokens
  Total embedding model cost: $0.00
  Total cost: $0.061618

### gpt-5-mini

- Time taken: 0:04:05.514505
estimated_input_tokens=365142, estimated_output_tokens=288944
Session Usage Summary:
  App Name: evaluate_pdf_parsing
  Session ID: 9a7c6fe8-6e01-4a10-98ab-c49dda8410f7
  Total queries executed: 4
  Total execution time: 243652.21ms
  Total rows processed: 9
  Total language model cost: $0.347699
  Total language model requests: 69
  Total language model tokens: 218,307 input tokens, 3,456 cached input tokens, 146,518 output tokens
  Total embedding model cost: $0.00
  Total cost: $0.347699

### gpt-5

- Time taken: 0:05:00.949263
estimated_input_tokens=365142, estimated_output_tokens=288944
Session Usage Summary:
  App Name: evaluate_pdf_parsing
  Session ID: 8ab36464-19ec-4499-b8ca-832131cf2a33
  Total queries executed: 4
  Total execution time: 299190.23ms
  Total rows processed: 9
  Total language model cost: $1.680932
  Total language model requests: 69
  Total language model tokens: 172,167 input tokens, 1,792 cached input tokens, 146,550 output tokens
  Total embedding model cost: $0.00
  Total cost: $1.680932

@YoungVor YoungVor force-pushed the 09-25-add_openai_support_for_semantic_parse_pdf branch 2 times, most recently from c4ccd60 to ef58d6d Compare October 8, 2025 20:23
@YoungVor YoungVor marked this pull request as ready for review October 8, 2025 20:24
@YoungVor YoungVor requested a review from bcallender October 8, 2025 20:24
@YoungVor YoungVor force-pushed the 09-25-add_openai_support_for_semantic_parse_pdf branch 2 times, most recently from 3395eac to 64879d0 Compare October 13, 2025 16:38
@bcallender bcallender left a comment


I'm just a bit confused as to why the token limit needs to be set in each client when they're all using the same constant -- I get the simplification of not trying to calculate it right now.

@YoungVor YoungVor force-pushed the 09-25-add_openai_support_for_semantic_parse_pdf branch 2 times, most recently from 07cf907 to baf3028 Compare October 15, 2025 21:15
@bcallender bcallender left a comment


lgtm! separating out the concepts of output limit vs output estimate across all of the clients is a nice way of allowing us to implement provider-specific behavior in a standardized way.

@YoungVor YoungVor force-pushed the 09-25-add_openai_support_for_semantic_parse_pdf branch from baf3028 to faeb5db Compare October 21, 2025 18:29
@YoungVor YoungVor merged commit e3f58cd into main Oct 21, 2025
12 checks passed
@YoungVor YoungVor deleted the 09-25-add_openai_support_for_semantic_parse_pdf branch October 21, 2025 18:52