feat: add page-range arg to pdf parse #265

YoungVor · 2025-10-21T16:50:01Z

TL;DR

Added support for specifying page ranges when parsing PDFs with semantic.parse_pdf().

What changed?

- Added a new pages parameter to semantic.parse_pdf() that lets users specify which pages to parse.
- The pages parameter can be:
  - A single page number (e.g., 5)
  - A list of page numbers (e.g., [1, 3, 5])
  - A list of page ranges (e.g., [[1, 3], [5, 7]])
  - A column expression that resolves to any of the above
- Implemented validation for the pages parameter to reject invalid page numbers and ranges.
- Added logic to coalesce overlapping page ranges so each page is processed only once.
- Updated the PDF chunking logic to respect the specified page ranges.
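
To make the validation and coalescing steps concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the function name normalize_pages is hypothetical, and it assumes 1-indexed pages, inclusive [start, end] ranges, and that None means "parse every page". It also merges adjacent ranges (e.g., [1, 3] and [4, 6]), which goes slightly beyond the "overlapping" case described above.

```python
from typing import List, Optional, Tuple, Union

# Shape taken from the validate_pages_argument signature in the diff.
PagesArg = Optional[Union[int, List[Union[int, List[int]]]]]


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def normalize_pages(pages: PagesArg) -> Optional[List[Tuple[int, int]]]:
    """Validate a pages argument and return sorted, merged (start, end) ranges.

    Assumptions (not confirmed by the PR): pages are 1-indexed, ranges are
    inclusive, and None means "parse every page".
    """
    if pages is None:
        return None
    items = [pages] if _is_int(pages) else pages
    if not isinstance(items, list) or not items:
        raise ValueError("pages must be an int or a non-empty list")

    ranges: List[Tuple[int, int]] = []
    for item in items:
        if _is_int(item):
            start, end = item, item
        elif isinstance(item, list) and len(item) == 2 and all(_is_int(v) for v in item):
            start, end = item
        else:
            raise ValueError(f"invalid page entry: {item!r}")
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: [{start}, {end}]")
        ranges.append((start, end))

    # Sort by start page, then fold each range into the previous one
    # whenever they overlap or touch.
    ranges.sort()
    merged = [ranges[0]]
    for start, end in ranges[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end + 1:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


print(normalize_pages([1, [3, 5], 7]))            # [(1, 1), (3, 5), (7, 7)]
print(normalize_pages([[5, 7], [1, 3], [2, 6]]))  # [(1, 7)]
```

Sorting before merging keeps the pass linear after the sort and guarantees that a range only ever needs to be compared with the most recently merged one.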

How to test?

# Parse specific pages from a PDF
df = session.create_dataframe({"pdf_path": ["document.pdf"]})

# Parse only pages 1, 3-5, and 7
result = df.select(
    semantic.parse_pdf(col("pdf_path"), pages=[1, [3, 5], 7])
)

# Use a column to specify different pages for each PDF
df = session.create_dataframe({
    "pdf_path": ["doc1.pdf", "doc2.pdf"],
    "pages": [[1, 3, 5], [[2, 4]]]
})
result = df.select(
    semantic.parse_pdf(col("pdf_path"), pages=col("pages"))
)

# Test with a UDF:
@fc.udf(return_type=fc.ArrayType(fc.IntegerType))
def get_end_page_range(page_count: int) -> list[int]:
    """
    Return the last two page numbers as a list.
    If page_count is 9, returns [8, 9].
    """
    return [page_count - 1, page_count]

pdf_metadata_df = session.read.pdf_metadata(f"{data_dir}/**/*.pdf")
pdf_to_md_content = pdf_metadata_df.with_column(
    "end_pages_markdown_content",
    fc.semantic.parse_pdf(
        fc.col("file_path"),
        model_alias="parse_model",
        pages=get_end_page_range(fc.col("page_count")),
    ),
)
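
One edge case worth checking when testing with the UDF: for a one-page PDF, page_count - 1 is 0, which is presumably not a valid page number if pages are 1-indexed. The sketch below is a plain-Python variant of the helper (not part of this PR) that clamps to page 1. The name end_page_range and the n_last parameter are hypothetical, and how parse_pdf treats a null pages value is not stated here, so returning None for a missing page count is only an assumption.

```python
from typing import List, Optional


def end_page_range(page_count: Optional[int], n_last: int = 2) -> Optional[List[int]]:
    """Return the last `n_last` page numbers, assuming 1-indexed pages.

    Returns None when the page count is missing or non-positive; whether
    parse_pdf accepts a null pages value is an assumption, not confirmed.
    """
    if page_count is None or page_count < 1:
        return None
    # Clamp so that short documents never produce page 0 or below.
    first = max(1, page_count - n_last + 1)
    return list(range(first, page_count + 1))


print(end_page_range(9))  # [8, 9]
print(end_page_range(1))  # [1]
```

The body of this function could be dropped into the UDF above unchanged if the clamping behavior is wanted.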

Why make this change?

This feature allows users to selectively parse specific pages from PDFs, which is useful for:

- Extracting information from specific sections of large documents
- Reducing processing time and costs by focusing only on relevant pages
- Creating more targeted and efficient workflows when dealing with structured documents
- Supporting use cases where only certain parts of a document need to be analyzed

YoungVor · 2025-10-21T16:50:17Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

feat: add page-range arg to pdf parse #265 👈 (View in Graphite)
feat: tweak pdf parser for corner cases and add 120s demo #259: 1 other dependent PR (#260)
feat: Add pdf_parsing to openrouter #257
feat: Add openai support for semantic parse_pdf #253
main

This stack of pull requests is managed by Graphite.

rohitrastogi · 2025-10-21T17:56:38Z

src/fenic/_backends/local/utils/doc_loader.py

 logger = logging.getLogger(__name__)

+
+def validate_pages_argument(pages: Optional[Union[int, List[Union[int, List[int]]]]]) -> None:


What do you think of making the page range argument a proper Pydantic type with built-in validation, like our other configuration objects?

YoungVor · 2025-10-21 (edited)

I think that's a good idea. It would be clearer to the user and easier to validate.

I wanted to have something working first, but I can certainly make the change before we merge.

For the column case, the type would still be array[int] or array[array[int]] though, right? But we can convert it to the Pydantic type during the validate step while resolving the column.

YoungVor · 2025-10-27

We spoke in person:

- For the Column case, it will continue to be array[int] OR array[array[int]].
- We'll keep the ability to do array[array[int]], because it lets the user whitelist an entire document without enumerating the pages.
- The drawback is that validating the page mask must be done dynamically on plan conversion.
- We'll fail loudly (instead of nulling the row) if the page mask is malformed.
- Adding a convenience Pydantic model is a nice-to-have for the static (not Column) case; I'll add it.
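
The outcome of this thread can be illustrated with a small sketch. It uses a standard-library dataclass as a stand-in so that it runs without dependencies; the actual convenience type discussed here would be a Pydantic model, and its name and fields are not given in this PR. PageRange and page_ranges_from_row are hypothetical names, and 1-indexed inclusive ranges are an assumption.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class PageRange:
    """Hypothetical inclusive, 1-indexed page range with built-in validation."""

    start: int
    end: int

    def __post_init__(self) -> None:
        for name, value in (("start", self.start), ("end", self.end)):
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        if self.start < 1:
            raise ValueError(f"start must be >= 1, got {self.start}")
        if self.end < self.start:
            raise ValueError(f"end ({self.end}) must be >= start ({self.start})")


def page_ranges_from_row(value: Any) -> List[PageRange]:
    """Convert one row's array[int] OR array[array[int]] value to PageRange objects.

    Raises on malformed input instead of returning None, mirroring the
    "fail loudly" decision for the Column case.
    """
    if not isinstance(value, list) or not value:
        raise ValueError(f"page mask must be a non-empty list, got {value!r}")
    if all(isinstance(v, int) and not isinstance(v, bool) for v in value):
        # array[int]: each entry is a single page.
        return [PageRange(v, v) for v in value]
    if all(isinstance(v, list) and len(v) == 2 for v in value):
        # array[array[int]]: each entry is a [start, end] pair.
        return [PageRange(v[0], v[1]) for v in value]
    raise ValueError(f"page mask must be array[int] or array[array[int]], got {value!r}")


print(page_ranges_from_row([1, 3, 5]))
print(page_ranges_from_row([[2, 4]]))
```

Because the column's values are only known at execution time, a check like page_ranges_from_row has to run per row, whereas the static (non-Column) argument can be validated once when the expression is built.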

YoungVor marked this pull request as ready for review October 21, 2025 17:54

YoungVor force-pushed the 10-21-feat_add_page-range_arg_to_pdf_parse branch from f952bf0 to 329b4d0 Compare October 21, 2025 17:56

rohitrastogi reviewed Oct 21, 2025

YoungVor force-pushed the 10-21-feat_add_page-range_arg_to_pdf_parse branch from 329b4d0 to b76bcdf Compare October 24, 2025 23:45

feat: add page-range arg to pdf parse

ccaa943

YoungVor force-pushed the 10-21-feat_add_page-range_arg_to_pdf_parse branch from b76bcdf to ccaa943 Compare October 24, 2025 23:48

YoungVor changed the base branch from main to 10-12-feat_tweak_pdf_parser_for_corner_cases_and_add_120s_demo October 24, 2025 23:48

YoungVor added the publish ("Publish assets") label Oct 24, 2025

This was referenced Oct 24, 2025:

- feat: Add openai support for semantic parse_pdf #253 (Merged)
- feat: Add pdf_parsing to openrouter #257 (Merged)
- feat: tweak pdf parser for corner cases and add 120s demo #259 (Open)
- feat: pdf parsing evaluation tool and test pipeline #260 (Draft)
