@etseidl etseidl commented Nov 6, 2025

Which issue does this PR close?

Rationale for this change

This builds on #8763 to add options to either skip decoding of the page encoding statistics array, or transform it to a bitmask. This gets rid of the last of the heavy allocations in the metadata decoder.

What changes are included in this PR?

Adds new options to ParquetMetaDataOptions. Also adds more metadata benchmarks.

Are these changes tested?

Yes

Are there any user-facing changes?

No, this just adds a new field to ColumnChunkMetaData.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 6, 2025

etseidl commented Nov 6, 2025

Excerpts from the new benchmarks:

decode parquet metadata time:   [15.050 µs 15.105 µs 15.164 µs]
decode metadata with schema
                        time:   [7.7038 µs 7.7286 µs 7.7561 µs]
decode metadata with stats mask
                        time:   [14.035 µs 14.100 µs 14.182 µs]
decode metadata with skip PES
                        time:   [13.976 µs 14.016 µs 14.060 µs]
decode parquet metadata (wide)
                        time:   [54.013 ms 54.236 ms 54.468 ms]
decode metadata (wide) with schema
                        time:   [48.399 ms 48.562 ms 48.738 ms]
decode metadata (wide) with stats mask
                        time:   [44.912 ms 45.077 ms 45.253 ms]
decode metadata (wide) with skip PES
                        time:   [44.500 ms 44.616 ms 44.739 ms]

Skipping the stats is not any faster than turning them into a mask 😮.

@etseidl etseidl force-pushed the page_enc_stats branch 2 times, most recently from f069128 to f004a1f Compare November 7, 2025 18:30
Comment on lines 37 to 40
// The outer option acts as a global boolean, so if `skip_encoding_stats.is_some()`
// is `true` then we're at least skipping some stats. The inner `Option` is a keep
// list of column indices to decode.
skip_encoding_stats: Option<Option<Arc<HashSet<usize>>>>,

This is my solution to per-column options. For huge schemas I didn't want a Vec<bool> that's mostly filled with false. Using Arc so cloning should be faster.

skip_encoding_stats    behavior
None                   decode all
Some(None)             decode none
Some(Some(set))        decode if in set
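
To make the three states concrete, here is a minimal sketch of how the nested `Option` could be interpreted. `StatsOptions`, `skip_all`, `set_keep`, and `should_decode` are hypothetical names used only for illustration; they are not the PR's actual API.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for the relevant part of `ParquetMetaDataOptions`.
#[derive(Default, Debug, Clone)]
pub struct StatsOptions {
    // None             => decode encoding stats for every column
    // Some(None)       => decode encoding stats for no column
    // Some(Some(keep)) => decode only the column indices listed in `keep`
    skip_encoding_stats: Option<Option<Arc<HashSet<usize>>>>,
}

impl StatsOptions {
    /// Skip encoding stats for all columns.
    pub fn skip_all(&mut self) {
        self.skip_encoding_stats = Some(None);
    }

    /// Skip encoding stats for all columns except those listed in `keep`.
    pub fn set_keep(&mut self, keep: &[usize]) {
        let set: HashSet<usize> = keep.iter().copied().collect();
        self.skip_encoding_stats = Some(Some(Arc::new(set)));
    }

    /// Returns true if the encoding stats for column `col` should be decoded.
    pub fn should_decode(&self, col: usize) -> bool {
        match &self.skip_encoding_stats {
            None => true,
            Some(None) => false,
            Some(Some(keep)) => keep.contains(&col),
        }
    }
}
```

Because the keep list sits behind an `Arc`, cloning the options only bumps a reference count instead of copying the set.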

#[derive(Default, Debug, Clone)]
pub struct ParquetMetaDataOptions {
schema_descr: Option<SchemaDescPtr>,
encoding_stats_as_mask: bool,

This defaults to false, so there is no behavior change. If we want to enable the mask by default, we can either implement Default by hand instead of deriving it, or rename this to encoding_stats_as_vec (or similar) so that the old behavior becomes the opt-in.

@etseidl etseidl marked this pull request as ready for review November 11, 2025 18:30
@etseidl etseidl changed the title from "[WIP] Add ability to skip or transform page encoding statistics in Parquet metadata" to "Add ability to skip or transform page encoding statistics in Parquet metadata" Nov 11, 2025
/// })
/// }
/// ```
pub fn page_encoding_stats_mask(&self) -> Option<&EncodingMask> {

I wonder if this should be data_page_encoding_stats_mask (or just data_page_encoding_stats) to make it clear it only has the stats for data pages.
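
For context, a rough sketch of what collapsing the stats into a data-page-only bitmask could look like. The types below (`Enc`, `PageKind`, `PageEncStat`, `DataPageEncodingMask`) are simplified, hypothetical stand-ins rather than the parquet crate's `Encoding`, `PageEncodingStats`, or `EncodingMask`, and the bit layout is an assumption.

```rust
/// Hypothetical, simplified encoding identifiers.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Enc {
    Plain = 0,
    PlainDictionary = 2,
    Rle = 3,
    RleDictionary = 8,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PageKind {
    Data,
    Dictionary,
}

/// One entry of the per-column page encoding statistics array.
pub struct PageEncStat {
    pub kind: PageKind,
    pub encoding: Enc,
    pub count: i32,
}

/// One bit per encoding seen on data pages; page counts are discarded.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct DataPageEncodingMask(u32);

impl DataPageEncodingMask {
    /// Fold the stats array into a mask, ignoring dictionary pages.
    pub fn from_stats(stats: &[PageEncStat]) -> Self {
        let mut bits = 0u32;
        for s in stats {
            if s.kind == PageKind::Data {
                bits |= 1 << (s.encoding as u32);
            }
        }
        Self(bits)
    }

    /// Returns true if at least one data page used `enc`.
    pub fn contains(&self, enc: Enc) -> bool {
        self.0 & (1 << (enc as u32)) != 0
    }
}
```

Since dictionary pages never set a bit here, a name that mentions data pages would describe the result more precisely.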

let bigger_expected_size = 3224;
#[cfg(feature = "encryption")]
let bigger_expected_size = 3360;
let bigger_expected_size = 3392;

This is for adding the mask to ColumnChunkMetaData. An alternative might be to create an enum with Vec and mask variants if we don't want more bloat.
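
A sketch of the enum alternative, assuming a single field that holds either representation. `StoredEncodingStats`, `FullStat`, and the bit layout are hypothetical and only illustrate the shape; they are not part of this PR.

```rust
/// Hypothetical simplified entry of the full statistics array.
#[derive(Debug, Clone, PartialEq)]
pub struct FullStat {
    pub is_data_page: bool,
    pub encoding_id: u8, // assumed to be < 32 so it fits in the mask
    pub count: i32,
}

/// A single field that holds either the full array or the compact mask,
/// instead of two separate optional fields on the column chunk metadata.
#[derive(Debug, Clone, PartialEq)]
pub enum StoredEncodingStats {
    Full(Vec<FullStat>),
    Mask(u32),
}

impl StoredEncodingStats {
    /// The data-page encoding mask, computed on demand for the `Full` variant.
    pub fn data_page_mask(&self) -> u32 {
        match self {
            StoredEncodingStats::Mask(bits) => *bits,
            StoredEncodingStats::Full(stats) => stats
                .iter()
                .filter(|s| s.is_data_page)
                .fold(0u32, |bits, s| bits | (1 << s.encoding_id)),
        }
    }

    /// The full array, if it was decoded.
    pub fn full(&self) -> Option<&[FullStat]> {
        match self {
            StoredEncodingStats::Full(stats) => Some(stats.as_slice()),
            StoredEncodingStats::Mask(_) => None,
        }
    }
}
```

The trade-off is that every accessor has to handle both variants, in exchange for not growing the struct by a second field.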

Comment on lines +100 to +101
// with_encoding_stats_as_mask
add_mutator!(encoding_stats_as_mask, bool);
@etseidl etseidl Nov 11, 2025

Now that I've done the macro, it saves all of 3 lines per instance. I'm fine with doing away with it.

One reason to keep this concise is that once an API is finalized for this setting, I'll add options for chunk statistics, size statistics, and geo statistics, and potentially an option to skip decoding the page index length/offsets as well (oh, and bloom filters too). That's a lot of repetition with high chances of cut-and-paste errors that I'd like to avoid, especially with the addition of per-column setters (i.e. the poorly named set_keep_X). I'd really like the whole suite of accessors/setters to be generated with a macro, but then documentation becomes problematic. Suggestions welcome 😄
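
One possible direction, sketched under assumptions: a `macro_rules!` macro can accept attributes (including `///` doc comments) and forward them to the generated method, so each generated setter still gets its own documentation. Unlike the `add_mutator!` in this PR, the hypothetical variant below takes the method name explicitly, which avoids identifier concatenation; `MetaDataOptions` and `skip_size_stats` are made up for the example.

```rust
/// Generates a builder-style setter and forwards any attributes
/// (including doc comments) to the generated method.
macro_rules! add_mutator {
    ($(#[$attr:meta])* $method:ident, $field:ident, $ty:ty) => {
        $(#[$attr])*
        pub fn $method(mut self, value: $ty) -> Self {
            self.$field = value;
            self
        }
    };
}

/// Hypothetical options struct, not the real `ParquetMetaDataOptions`.
#[derive(Default, Debug, Clone)]
pub struct MetaDataOptions {
    encoding_stats_as_mask: bool,
    skip_size_stats: bool,
}

impl MetaDataOptions {
    add_mutator!(
        /// Decode page encoding statistics as a bitmask instead of a `Vec`.
        with_encoding_stats_as_mask, encoding_stats_as_mask, bool
    );
    add_mutator!(
        /// Skip decoding of size statistics (hypothetical option).
        with_skip_size_stats, skip_size_stats, bool
    );

    /// Returns whether encoding stats are decoded as a mask.
    pub fn encoding_stats_as_mask(&self) -> bool {
        self.encoding_stats_as_mask
    }

    /// Returns whether size statistics are skipped.
    pub fn skip_size_stats(&self) -> bool {
        self.skip_size_stats
    }
}
```

The doc comments live at the call site, so rustdoc output for each setter stays specific even though the method body is generated.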

///
/// [`encoding_stats`]:
/// https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L917
pub fn set_keep_encoding_stats(&mut self, keep: &[usize]) {

I'm open to a better name for this. set_decode_encoding_stats_for_columns is a bit long-winded.

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce allocations in ParquetMetaData for improved performance

1 participant