@YoungVor YoungVor commented Oct 21, 2025

TL;DR

Added support for specifying page ranges when parsing PDFs with semantic.parse_pdf().

What changed?

  • Added a new pages parameter to semantic.parse_pdf() that allows users to specify which pages to parse
  • The pages parameter can be:
    • A single page number (e.g., 5)
    • A list of page numbers (e.g., [1, 3, 5])
    • A list of page ranges (e.g., [[1, 3], [5, 7]])
    • A column expression that resolves to any of the above
  • Implemented validation for the pages parameter to ensure valid page numbers and ranges
  • Added logic to coalesce overlapping page ranges for efficient processing
  • Updated the PDF chunking logic to respect the specified page ranges
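
For reviewers who want a feel for the coalescing step, here is a minimal standalone sketch. It is not the code in this PR: the name `coalesce_pages` is hypothetical, ranges are assumed to be inclusive `[start, end]` pairs, and merging directly adjacent ranges (not just overlapping ones) is an assumption about what is useful for chunking.

```python
from typing import List, Tuple, Union

# Shape of the static `pages` argument as described above:
# a single int, or a list mixing ints and [start, end] pairs.
PageSpec = Union[int, List[Union[int, List[int]]]]


def coalesce_pages(pages: PageSpec) -> List[Tuple[int, int]]:
    """Normalize a pages spec into sorted, non-overlapping inclusive ranges.

    Hypothetical helper for illustration; assumes the input was already validated.
    """
    items = [pages] if isinstance(pages, int) else pages

    # Turn every entry into an inclusive (start, end) tuple.
    ranges: List[Tuple[int, int]] = []
    for item in items:
        if isinstance(item, int):
            ranges.append((item, item))
        else:
            start, end = item
            ranges.append((start, end))

    ranges.sort()

    # Sweep left to right, merging ranges that overlap or touch (assumption).
    merged: List[Tuple[int, int]] = []
    for start, end in ranges:
        if merged and start <= merged[-1][1] + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

With this sketch, `coalesce_pages([[1, 3], [2, 6], 8])` collapses the two overlapping ranges into one and keeps page 8 separate.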

How to test?

# Parse specific pages from a PDF
df = session.create_dataframe({"pdf_path": ["document.pdf"]})

# Parse only pages 1, 3-5, and 7
result = df.select(
    semantic.parse_pdf(col("pdf_path"), pages=[1, [3, 5], 7])
)

# Use a column to specify different pages for each PDF
df = session.create_dataframe({
    "pdf_path": ["doc1.pdf", "doc2.pdf"],
    "pages": [[1, 3, 5], [[2, 4]]]
})
result = df.select(
    semantic.parse_pdf(col("pdf_path"), pages=col("pages"))
)

# Test with a UDF:
from typing import Optional

@fc.udf(return_type=fc.ArrayType(fc.IntegerType))
def get_end_page_range(page_count: int) -> Optional[list[int]]:
    """
    Returns the last two page values as a list.
    If page_count is 9, returns [8, 9].
    """
    return [page_count - 1, page_count]

pdf_metadata_df = session.read.pdf_metadata(f"{data_dir}/**/*.pdf")
pdf_to_md_content = pdf_metadata_df.with_column(
    "end_pages_markdown_content",
    fc.semantic.parse_pdf(
        fc.col("file_path"),
        model_alias="parse_model",
        pages=get_end_page_range(fc.col("page_count")),
    ),
)

Why make this change?

This feature allows users to selectively parse specific pages from PDFs, which is useful for:

  • Extracting information from specific sections of large documents
  • Reducing processing time and costs by focusing only on relevant pages
  • Creating more targeted and efficient workflows when dealing with structured documents
  • Supporting use cases where only certain parts of a document need to be analyzed

YoungVor commented Oct 21, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

@YoungVor YoungVor marked this pull request as ready for review October 21, 2025 17:54
@YoungVor YoungVor force-pushed the 10-21-feat_add_page-range_arg_to_pdf_parse branch from f952bf0 to 329b4d0 Compare October 21, 2025 17:56
logger = logging.getLogger(__name__)


def validate_pages_argument(pages: Optional[Union[int, List[Union[int, List[int]]]]]) -> None:

Contributor

What do you think of making the page-range argument a proper Pydantic type with built-in validation, like our other configuration objects?

@YoungVor YoungVor Oct 21, 2025

I think that's a good idea. It would be clearer to the user and easier to validate.

I wanted to have something working first, but I can certainly make the change before we merge.

For the column case, the type would still be array[int] or array[array[int]], though, right? We can convert it to the Pydantic type during the validate step while resolving the column.

Contributor Author

We spoke in person:

  • For the Column case, it will continue to be an array[int] OR array[array[int]].
  • We'll keep the ability to use array[array[int]], because it lets the user whitelist an entire document without enumerating the pages.
  • The drawback is that validating the page mask must be done dynamically on plan conversion.
  • We'll fail loudly (instead of nulling the row) if the page mask is malformed.
  • Adding a convenience Pydantic model is a nice-to-have for the static (non-Column) case; I'll add it.
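
To make "fail loudly" concrete, here is a rough sketch of the kind of check a resolved page mask could go through. This is illustrative only and not the validation in this PR: `check_page_mask` is a hypothetical name, and the rules it enforces (1-based page numbers, ranges as inclusive `[start, end]` pairs with start <= end, no empty lists) are assumptions.

```python
from typing import Any


def _is_page(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 1


def check_page_mask(pages: Any) -> None:
    """Raise ValueError on a malformed page mask instead of silently nulling the row.

    Hypothetical helper; assumes 1-based pages and inclusive [start, end] ranges.
    """
    if _is_page(pages):
        return
    if not isinstance(pages, list) or not pages:
        raise ValueError(f"pages must be a positive int or a non-empty list, got {pages!r}")

    for item in pages:
        if _is_page(item):
            continue
        is_range = (
            isinstance(item, list)
            and len(item) == 2
            and all(_is_page(v) for v in item)
            and item[0] <= item[1]
        )
        if not is_range:
            raise ValueError(f"invalid page or range in pages: {item!r}")
```

Because Column values are only known at execution time, a check like this would have to run per row while resolving the column, which is the dynamic-validation drawback noted above.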

@YoungVor YoungVor force-pushed the 10-21-feat_add_page-range_arg_to_pdf_parse branch from 329b4d0 to b76bcdf Compare October 24, 2025 23:45
@YoungVor YoungVor force-pushed the 10-21-feat_add_page-range_arg_to_pdf_parse branch from b76bcdf to ccaa943 Compare October 24, 2025 23:48
@YoungVor YoungVor changed the base branch from main to 10-12-feat_tweak_pdf_parser_for_corner_cases_and_add_120s_demo October 24, 2025 23:48
@YoungVor YoungVor added the publish Publish assets label Oct 24, 2025