Add ability to skip or transform page encoding statistics in Parquet metadata #8797
Excerpts from the new benchmarks: skipping the stats is not any faster than turning them into a mask 😮.
// The outer option acts as a global boolean, so if `skip_encoding_stats.is_some()`
// is `true` then we're at least skipping some stats. The inner `Option` is a keep
// list of column indices to decode.
skip_encoding_stats: Option<Option<Arc<HashSet<usize>>>>,
This is my solution to per-column options. For huge schemas I didn't want a `Vec<bool>` that's mostly filled with `false`. The set is wrapped in an `Arc` so cloning the options should be cheap.
| `skip_encoding_stats` | behavior |
|---|---|
| `None` | decode all |
| `Some(None)` | decode none |
| `Some(Some(set))` | decode if in set |
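As a rough illustration of the table above, the three states could be resolved per column along these lines. This is only a sketch: the `SkipEncodingStats` alias and the `should_decode_encoding_stats` helper are hypothetical names, not code from the PR.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical alias for the option field discussed above.
type SkipEncodingStats = Option<Option<Arc<HashSet<usize>>>>;

/// Returns `true` if the encoding stats for column `col_idx` should be decoded.
/// (Hypothetical helper; the PR may structure this check differently.)
fn should_decode_encoding_stats(skip: &SkipEncodingStats, col_idx: usize) -> bool {
    match skip {
        // Option not set: decode stats for every column.
        None => true,
        // Skipping enabled with no keep list: decode nothing.
        Some(None) => false,
        // Skipping enabled with a keep list: decode only the listed columns.
        Some(Some(keep)) => keep.contains(&col_idx),
    }
}
```

Cloning a `SkipEncodingStats` value only bumps the `Arc` reference count, so the keep set itself is never copied.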
#[derive(Default, Debug, Clone)]
pub struct ParquetMetaDataOptions {
    schema_descr: Option<SchemaDescPtr>,
    encoding_stats_as_mask: bool,
This defaults to `false`, so there is no behavior change. If we want to enable the mask by default, we can either implement `Default` by hand, or rename this to `encoding_stats_as_vec` (or something similar) so that the old behavior becomes the opt-in.
/// })
/// }
/// ```
pub fn page_encoding_stats_mask(&self) -> Option<&EncodingMask> {
I wonder if this should be `data_page_encoding_stats_mask` (or just `data_page_encoding_stats`) to make it clear it only has the stats for data pages.
let bigger_expected_size = 3224;
#[cfg(feature = "encryption")]
let bigger_expected_size = 3360;
let bigger_expected_size = 3392;
This size increase comes from adding the mask to `ColumnChunkMetaData`. An alternative might be to create an enum with `Vec` and mask variants if we don't want more bloat.
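For reference, the enum alternative might look roughly like the following. Every type here (`PageEncodingStats`, `EncodingMask`, `EncodingStats`) is a simplified, hypothetical stand-in written for this sketch and does not mirror the crate's real definitions.

```rust
/// Simplified stand-in for one entry of the decoded stats array (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct PageEncodingStats {
    encoding: u8,
    count: i32,
}

/// Simplified stand-in for the bitmask representation (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
struct EncodingMask(u32);

/// One field holding either representation, instead of two separate fields.
#[derive(Debug, Clone, PartialEq)]
enum EncodingStats {
    Full(Vec<PageEncodingStats>),
    Mask(EncodingMask),
}

impl EncodingStats {
    /// Returns the full stats if that is the representation that was decoded.
    fn as_full(&self) -> Option<&[PageEncodingStats]> {
        match self {
            EncodingStats::Full(stats) => Some(stats.as_slice()),
            EncodingStats::Mask(_) => None,
        }
    }

    /// Returns the mask if that is the representation that was decoded.
    fn as_mask(&self) -> Option<EncodingMask> {
        match self {
            EncodingStats::Full(_) => None,
            EncodingStats::Mask(mask) => Some(*mask),
        }
    }
}
```

The trade-off is that callers would have to handle both variants, whereas a separate mask field leaves the existing accessor untouched.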
// with_encoding_stats_as_mask
add_mutator!(encoding_stats_as_mask, bool);
Now that I've done the macro, it saves all of 3 lines per instance. I'm fine with doing away with it.
One reason to keep this concise is that once an API is finalized for this setting, I'll add options for chunk statistics, size statistics, and geo statistics, and potentially an option to skip decoding the page index length/offsets as well (oh, and bloom filters too). That's a lot of repetition with high chances of cut-and-paste errors that I'd like to avoid, especially with the addition of per-column setters (i.e. the poorly named set_keep_X). I'd really like the whole suite of accessors/setters to be generated with a macro, but then documentation becomes problematic. Suggestions welcome 😄
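One possible direction for the documentation problem, sketched under assumptions: a `macro_rules!` macro that takes the method names and a doc string explicitly and attaches the string with `#[doc = ...]`. The `Options` struct, the `skip_size_stats` field, and this five-argument form of `add_mutator!` are hypothetical and differ from the two-argument macro in the PR.

```rust
/// Hypothetical options struct used only for this sketch.
#[derive(Default, Debug, Clone)]
struct Options {
    encoding_stats_as_mask: bool,
    // Hypothetical second option, present only to show the macro being reused.
    skip_size_stats: bool,
}

/// Generates a `set_*` and a builder-style `with_*` method for one field.
/// Method names are passed explicitly because `macro_rules!` cannot
/// concatenate identifiers without a helper crate.
macro_rules! add_mutator {
    ($field:ident, $setter:ident, $with:ident, $ty:ty, $doc:literal) => {
        #[doc = $doc]
        pub fn $setter(&mut self, value: $ty) {
            self.$field = value;
        }

        #[doc = $doc]
        pub fn $with(mut self, value: $ty) -> Self {
            self.$setter(value);
            self
        }
    };
}

impl Options {
    add_mutator!(
        encoding_stats_as_mask,
        set_encoding_stats_as_mask,
        with_encoding_stats_as_mask,
        bool,
        "Sets whether page encoding stats are decoded as a bitmask."
    );
    add_mutator!(
        skip_size_stats,
        set_skip_size_stats,
        with_skip_size_stats,
        bool,
        "Sets whether size statistics are skipped (hypothetical option)."
    );
}
```

Each invocation is longer than the one-liner in the diff, but the doc string travels with the generated methods, so rustdoc output stays per-option.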
///
/// [`encoding_stats`]:
/// https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L917
pub fn set_keep_encoding_stats(&mut self, keep: &[usize]) {
I'm open to a better name for this. `set_decode_encoding_stats_for_columns` is a bit long-winded.
Which issue does this PR close?
Rationale for this change
This builds on #8763 to add options to either skip decoding of the page encoding statistics array, or transform it to a bitmask. This gets rid of the last of the heavy allocations in the metadata decoder.
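To illustrate the bitmask idea in general terms: each encoding seen in the stats sets one bit, so no per-column `Vec` has to be allocated. The sketch below is not the crate's `EncodingMask`. The `Encoding` enum here is a local copy whose discriminants follow `parquet.thrift`, and `Mask` and `mask_from_encodings` are hypothetical names.

```rust
/// Local copy of the Parquet encodings, with discriminants as in parquet.thrift.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum Encoding {
    Plain = 0,
    PlainDictionary = 2,
    Rle = 3,
    BitPacked = 4,
    DeltaBinaryPacked = 5,
    DeltaLengthByteArray = 6,
    DeltaByteArray = 7,
    RleDictionary = 8,
    ByteStreamSplit = 9,
}

/// Hypothetical mask type: bit `n` is set if the encoding with discriminant `n` was seen.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct Mask(u32);

impl Mask {
    /// Marks `encoding` as present.
    fn insert(&mut self, encoding: Encoding) {
        self.0 |= 1u32 << (encoding as u8);
    }

    /// Returns `true` if `encoding` was marked as present.
    fn contains(&self, encoding: Encoding) -> bool {
        self.0 & (1u32 << (encoding as u8)) != 0
    }
}

/// Folds a sequence of encodings into a mask without allocating.
fn mask_from_encodings(encodings: impl IntoIterator<Item = Encoding>) -> Mask {
    let mut mask = Mask::default();
    for encoding in encodings {
        mask.insert(encoding);
    }
    mask
}
```

A mask like this drops the per-encoding page counts, which is the information traded away for avoiding the allocation.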
What changes are included in this PR?
Adds new options to `ParquetMetaDataOptions`. Also adds more metadata benchmarks.
Are these changes tested?
Yes
Are there any user-facing changes?
No, just adds a new field to `ColumnChunkMetaData`.