-
Notifications
You must be signed in to change notification settings - Fork 23
feat: add page-range arg to pdf parse #265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 10-12-feat_tweak_pdf_parser_for_corner_cases_and_add_120s_demo
Are you sure you want to change the base?
feat: add page-range arg to pdf parse #265
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
f952bf0 to
329b4d0
Compare
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def validate_pages_argument(pages: Optional[Union[int, List[Union[int, List[int]]]]]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of making the page range a argument a proper Pydantic type with built-in validation like our other configuration objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a good idea- It would be more clear to the user and easier to validate.
I wanted to have something working, but can certainly make the change before we merge.
For the column case, the type would still be array[int] or array[array[int]] though, right? But we can convert it to the pydantic type during the validate step while resolving the column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We spoke in person -
- for the Column case, it will continue to be a array[int] OR array[array[int]]
- we'll keep the ability to do array[array[int]], because it lets the user white-list an entire document without enumerating the pages
- drawback is that validating the page mask must be done dynamically on plan conversion
- We'll fail loudly (instead of nulling the row) if the page mask is malformed
- adding a convenience pydantic model is a nice to have for the static (not Column) case, I'll add it
329b4d0 to
b76bcdf
Compare
b76bcdf to
ccaa943
Compare

TL;DR
Added support for specifying page ranges when parsing PDFs with
semantic.parse_pdf().What changed?
pagesparameter tosemantic.parse_pdf()that allows users to specify which pages to parsepagesparameter can be:5)[1, 3, 5])[[1, 3], [5, 7]])pagesparameter to ensure valid page numbers and rangesHow to test?
Why make this change?
This feature allows users to selectively parse specific pages from PDFs, which is useful for: