-
Notifications
You must be signed in to change notification settings - Fork 1k
General virtual columns support + row numbers as a first use-case #8715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
General virtual columns support + row numbers as a first use-case #8715
Conversation
Co-authored-by: scovich <scovich@users.noreply.github.com>
…eature tests pass
| } | ||
|
|
||
| fn skip_records(&mut self, num_records: usize) -> Result<usize> { | ||
| // TODO: Use advance_by when it stabilizes to improve performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO from original PR
| } | ||
|
|
||
| fn row_groups(&self) -> Box<dyn Iterator<Item = &RowGroupMetaData> + '_> { | ||
| Box::new(std::iter::once(self.metadata.row_group(self.row_group_idx))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this duplicates a lot, not sure if anything can be done here
parquet/src/arrow/schema/complex.rs
Outdated
| /// - If nullable: def_level = parent_def_level + 1 | ||
| /// - If required: def_level = parent_def_level | ||
| /// - rep_level = parent_rep_level (virtual fields are not repeated) | ||
| fn convert_virtual_field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name used here is not aligned with what other convert_ functions do
…hen metadata parsing may skip row groups
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @vustef -- I went through this PR carefully and I think we can merge it as is
I took the liberty to push a few cleanups and reduce some duplication
This is a major step forward in my mind.
I have some thoughts on how to improve the tests that I will either push directly to this PR or propose as follow ons.
cc @mbrobbel @etseidl @adriangb in case you are interested in reviewing. I am especially interested in other reviewer comments on the public API surface
| } | ||
|
|
||
| #[derive(Debug, Clone)] | ||
| pub enum ParquetFieldType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked at this enum is not publically exposed: https://docs.rs/parquet/latest/parquet/?search=ParquetFieldType
Thus this is a backwards compatible change
| Ok(ParquetMetaData::new(fmd, row_groups)) | ||
| } | ||
|
|
||
| /// Assign [`RowGroupMetaData::ordinal`] if it is missing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I extracted the ordinal assignment to a structure, mostly so I had a place to add additional comments as well as avoiding duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb for the review. Your commits look good to me, thanks for that as well.
| Ok(ParquetMetaData::new(fmd, row_groups)) | ||
| } | ||
|
|
||
| /// Assign [`RowGroupMetaData::ordinal`] if it is missing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick first pass seems ok, just a couple of comments.
| } | ||
|
|
||
| // decrypt column chunk info | ||
| // decrypt column chunk info and handle ordinal assignment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this doesn't hurt, but it shouldn't be necessary. The row groups already passed through the OrdinalAssigner in parquet_metadata_from_bytes, and row_group_from_encrypted_thrift re-uses the ordinal from the RowGroupMetaData passed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the change here.
| if actual_ordinal == 0 { | ||
| self.first_has_ordinal = rg_has_ordinal; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if first_has_ordinal should be an option, and then set it if it's None. If/when we implement row group selection the first ordinal seen may not be 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then enumeration won't work at all, and we'd have to rely on ordinal in the metadata? Actually, row numbers feature won't work at all, since it relies on having information (sizes in rows) about all row groups to figure out first row index of the row group it reads. Unless some trick that I'm not aware of is used.
Note that row numbers have to be stable across queries (i.e. independent of whether there was filtering), otherwise we would've implemented them on the client side by just enumerating rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it seems this code is executed regardless of the presence of a virtual column. It looks to me like skipping the first rowgroup will result in an error if ordinals are present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code should only error out if either ordinal is present in metadata for some row group but not for others, or vice-versa. That shouldn't happen with row group skipping, i.e. even if we skip, all the remaining row groups that we iterate through would fulfil this condition, right?
What I'd like though, is that the build_row_number_reader fails if skipping occurs. Or that it ensures that there's no row skipping if it's invoked. Not sure how to ensure that at this point though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I skip row groups 0 and 1 and go straight to row group 2 (think externally cached statistics with point look up), which has ordinal set to 2 in the footer, self.first_has_ordinal is false and rg_has_ordinal is true. L904 will evaluate false, L906 will evaluate true and return an error.
I'll take a closer look at the rest of this PR to see if there's a way to make skipping and row numbers mutually exclusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why we think differently. I was assuming then that the row group 2 would be the one to set self.first_has_ordinal. Because for example the loop that invokes ensure would be going for ordinal in 0..list_ident.size, whe size is going to be based only on the row groups after skipping.
Is that not going to happen?
Perhaps there might need some changes to be made once row group skipping is implemented, not sure how best to guard against getting the intended behaviour broken without catching it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because for example the loop that invokes ensure would be going for ordinal in 0..list_ident.size, whe size is going to be based only on the row groups after skipping.
Ah, but here we're in a function whose input is the entire encoded footer. list_ident.size will always be the unfiltered number of row groups. So I think we'll either:
- loop over
ordinal in 0..list_ident.size, but skip decoding ifordinalnot in a list - assuming an index, loop over a list of ordinals, decode the row group using a range from the index, and pass
ordinalasactual_ordinal
In either event, we'd pass 2 as actual_ordinal in my scenario above. But yeah, that's pretty far down the road. Guess we can solve it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see why you proposed to use Option. I pushed a change for that.
|
How is this going to integrate into DataFusion? I feel like |
I'll defer that to @alamb . I only thought about how to integrate it in iceberg-rust, and perhaps there are similarities. There the user of the library would have to request this column, and then we'd propagate this information to the underlying arrow reader. |
Remove unused import
I don't have any real plan for the DataFusion integration. FWIW the first usecase of these numbers I think will be various iceberg / other table format integrations to support delete vectors In order to support "virtual columns" in DataFusion, I suspect we will need to update ListingTable to have some notion of virtual columns (in addition to partition columns). Other virtual columns people have talked about are file names, for example, which wouldn't come from the parquet reader but instead would come from the file opening machinery |
…ef/arrow-rs into feature/parquet-virtual-row-numbers
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// # Ok(()) | ||
| /// # } | ||
| /// ``` | ||
| pub fn with_virtual_columns(self, virtual_columns: Vec<FieldRef>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vustef I think this is where we'd be able to detect row group filtering. There would be a with_row_group_selection() or some such function added to control skipping, and a check could be added both here and in the new function to disallow setting both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for figuring this out. I guess there's no action that we can take right now then, please let me know if it's otherwise.
|
Also thanks to @jkylling whose started this project |
|
Once the Ci is green I'll merge this PR. Thank you @vustef |
|
gogoogogogogo!!! |
|
The 57.1.0 patch release may be the most epic minor release we have ever had |
Based on #7307.
Which issue does this PR close?
Rationale for this change
We need row numbers for many of the downstream features, e.g. computing unique row identifier in iceberg.
What changes are included in this PR?
New API to get row numbers as a virtual column:
This column is defined as an extension type.
Parquet metadata is propagated to the array builder to compute first row indexes.
New Virtual column is included in addition to Primitive and Group.
Are these changes tested?
Yes
Are there any user-facing changes?
This is user facing feature, and has added docstrings.
No breaking changes, at least I tried not to, by creating a duplicate of public method to add more parameters.