-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Enforce shredded-type validation in shred_variant
#8796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
98c2469
9ecdfca
c293b80
aca092c
a6e38ce
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -25,7 +25,7 @@ use crate::{VariantArray, VariantValueArrayBuilder}; | |||||||||||
| use arrow::array::{ArrayRef, BinaryViewArray, NullBufferBuilder}; | ||||||||||||
| use arrow::buffer::NullBuffer; | ||||||||||||
| use arrow::compute::CastOptions; | ||||||||||||
| use arrow::datatypes::{DataType, Fields}; | ||||||||||||
| use arrow::datatypes::{DataType, Fields, TimeUnit}; | ||||||||||||
| use arrow::error::{ArrowError, Result}; | ||||||||||||
| use parquet_variant::{Variant, VariantBuilderExt}; | ||||||||||||
|
|
||||||||||||
|
|
@@ -123,13 +123,39 @@ pub(crate) fn make_variant_to_shredded_variant_arrow_row_builder<'a>( | |||||||||||
| "Shredding variant array values as arrow lists".to_string(), | ||||||||||||
| )); | ||||||||||||
| } | ||||||||||||
| _ => { | ||||||||||||
| // Supported shredded primitive types, see Variant shredding spec: | ||||||||||||
| // https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types | ||||||||||||
| DataType::Boolean | ||||||||||||
|
Comment on lines
+127
to
+128
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Big question:
Suggested change
I'm pretty sure that So maybe we intentionally forbid
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Update: I confused myself; the variant shredding spec does not allow shredding as null, so this code is correct as-is. HOWEVER, while looking at this I realized that we do have a row builder for The simplest fix would be to adjust the null builder's definition: define_variant_to_primitive_builder!(
struct VariantToNullArrowRowBuilder<'a>
|capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },
- |_value| Some(Variant::Null),
+ |value| value.as_null(),
type_name: "Null"
);and change the fake row builder's append_value method to suit: impl FakeNullBuilder {
fn new(capacity: usize) -> Self {
Self(NullArray::new(capacity))
}
- fn append_value<T>(&mut self, _: T) {}
+ fn append_value(&mut self, _: ()) {}
fn append_null(&mut self) {}... but that might produce clippy warnings about passing unit type as a function argument. If so, we'd need to adjust the value conversion to produce Some dummy value instead, e.g. Also, the fake null builder should probably track how many "values" were "appended" and either produce a NullArray of that length or blow up if the call count disagrees with the array's length. The former is probably more correct than the latter, since it matches all the other builders for whom "capacity" is only a hint.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The latter is unrelated to this PR, tracking it as #8810 |
||||||||||||
| | DataType::Int8 | ||||||||||||
| | DataType::Int16 | ||||||||||||
| | DataType::Int32 | ||||||||||||
| | DataType::Int64 | ||||||||||||
| | DataType::Float32 | ||||||||||||
| | DataType::Float64 | ||||||||||||
| | DataType::Decimal32(..) | ||||||||||||
| | DataType::Decimal64(..) | ||||||||||||
| | DataType::Decimal128(..) | ||||||||||||
| | DataType::Date32 | ||||||||||||
| | DataType::Time64(TimeUnit::Microsecond) | ||||||||||||
| | DataType::Timestamp(TimeUnit::Microsecond | TimeUnit::Nanosecond, _) | ||||||||||||
| | DataType::Binary | ||||||||||||
| | DataType::BinaryView | ||||||||||||
| | DataType::Utf8 | ||||||||||||
| | DataType::Utf8View | ||||||||||||
| | DataType::FixedSizeBinary(16) // UUID | ||||||||||||
| => { | ||||||||||||
| let builder = | ||||||||||||
| make_primitive_variant_to_arrow_row_builder(data_type, cast_options, capacity)?; | ||||||||||||
| let typed_value_builder = | ||||||||||||
| VariantToShreddedPrimitiveVariantRowBuilder::new(builder, capacity, top_level); | ||||||||||||
| VariantToShreddedVariantRowBuilder::Primitive(typed_value_builder) | ||||||||||||
| } | ||||||||||||
| DataType::FixedSizeBinary(_) => { | ||||||||||||
| return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported."))) | ||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need to distinguish this with the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this provides a user friendly message and it's actually moved from |
||||||||||||
| } | ||||||||||||
|
Comment on lines
+153
to
+155
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure whether this distinction is important enough to merit its own error message? Also, we eventually need to check the field for UUID extension type and not just rely on the data type. If #8673 merges first, we should fix it here; if this PR merges first the other PR needs to incorporate the change.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi, so sorry. I swear I never caught this notification. In terms of order, I'm fine with this merging first. I'll read through and update accordingly
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. lol -- I certainly will never give someone a hard time for missing a notification! |
||||||||||||
| _ => { | ||||||||||||
| return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type"))) | ||||||||||||
| } | ||||||||||||
| }; | ||||||||||||
| Ok(builder) | ||||||||||||
| } | ||||||||||||
|
|
@@ -327,7 +353,7 @@ mod tests { | |||||||||||
| use super::*; | ||||||||||||
| use crate::VariantArrayBuilder; | ||||||||||||
| use arrow::array::{Array, FixedSizeBinaryArray, Float64Array, Int64Array}; | ||||||||||||
| use arrow::datatypes::{DataType, Field, Fields}; | ||||||||||||
| use arrow::datatypes::{DataType, Field, Fields, TimeUnit, UnionFields, UnionMode}; | ||||||||||||
| use parquet_variant::{ObjectBuilder, ReadOnlyMetadataBuilder, Variant, VariantBuilder}; | ||||||||||||
| use std::sync::Arc; | ||||||||||||
| use uuid::Uuid; | ||||||||||||
|
|
@@ -536,6 +562,60 @@ mod tests { | |||||||||||
| assert!(typed_value_float64.is_null(2)); // string doesn't convert | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| #[test] | ||||||||||||
| fn test_invalid_shredded_types_rejected() { | ||||||||||||
| let input = VariantArray::from_iter([Variant::from(42)]); | ||||||||||||
|
|
||||||||||||
| let invalid_types = vec![ | ||||||||||||
| DataType::UInt8, | ||||||||||||
| DataType::Float16, | ||||||||||||
| DataType::Decimal256(38, 10), | ||||||||||||
| DataType::Date64, | ||||||||||||
| DataType::Time32(TimeUnit::Second), | ||||||||||||
| DataType::Time64(TimeUnit::Nanosecond), | ||||||||||||
| DataType::Timestamp(TimeUnit::Millisecond, None), | ||||||||||||
| DataType::LargeBinary, | ||||||||||||
| DataType::LargeUtf8, | ||||||||||||
| DataType::FixedSizeBinary(17), | ||||||||||||
| DataType::Union( | ||||||||||||
| UnionFields::new( | ||||||||||||
| vec![0_i8, 1_i8], | ||||||||||||
| vec![ | ||||||||||||
| Field::new("int_field", DataType::Int32, false), | ||||||||||||
| Field::new("str_field", DataType::Utf8, true), | ||||||||||||
| ], | ||||||||||||
| ), | ||||||||||||
| UnionMode::Dense, | ||||||||||||
| ), | ||||||||||||
| DataType::Map( | ||||||||||||
| Arc::new(Field::new( | ||||||||||||
| "entries", | ||||||||||||
| DataType::Struct(Fields::from(vec![ | ||||||||||||
| Field::new("key", DataType::Utf8, false), | ||||||||||||
| Field::new("value", DataType::Int32, true), | ||||||||||||
| ])), | ||||||||||||
| false, | ||||||||||||
| )), | ||||||||||||
| false, | ||||||||||||
| ), | ||||||||||||
| DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)), | ||||||||||||
| DataType::RunEndEncoded( | ||||||||||||
| Arc::new(Field::new("run_ends", DataType::Int32, false)), | ||||||||||||
| Arc::new(Field::new("values", DataType::Utf8, true)), | ||||||||||||
| ), | ||||||||||||
| ]; | ||||||||||||
|
|
||||||||||||
| for data_type in invalid_types { | ||||||||||||
| let err = shred_variant(&input, &data_type).unwrap_err(); | ||||||||||||
| assert!( | ||||||||||||
| matches!(err, ArrowError::InvalidArgumentError(_)), | ||||||||||||
| "expected InvalidArgumentError for {:?}, got {:?}", | ||||||||||||
| data_type, | ||||||||||||
| err | ||||||||||||
| ); | ||||||||||||
| } | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
| #[test] | ||||||||||||
| fn test_object_shredding_comprehensive() { | ||||||||||||
| let mut builder = VariantArrayBuilder::new(7); | ||||||||||||
|
|
||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't make do the type cast after this? not sure if this is ok.
Currently, we will do the type check in
make_primitive_variant_to_arrow_row_builderwith the match arms, so maybe we don't need to add it here, and add the type check in two places seems will add maintaince.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that
make_primitive_variant_to_arrow_row_builderdoesn’t enforce the Parquet-primitive constraint. As a result, types likeUInt, which aren’t Parquet primitives, are accepted for shredding becausemake_primitive_variant_to_arrow_row_builderallows.