[Variant] Enforce shredded-type validation in shred_variant #8796

Conversation
liamzwbao
left a comment
Hi @klion26, this is the PR to address the confusion in #8768 (comment).
There’s some noise from fmt, please review with “Hide whitespace” on.
```rust
DataType::FixedSizeBinary(size) => {
    return Err(ArrowError::InvalidArgumentError(format!(
        "FixedSizeBinary({size}) is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported."
    )));
}
```
Moved to shred_variant
```rust
_ if data_type.is_primitive() => {
    return Err(ArrowError::NotYetImplemented(format!(
        "Primitive data_type {data_type:?} not yet implemented"
    )));
}
_ => {
    return Err(ArrowError::InvalidArgumentError(format!(
        "Not a primitive type: {data_type:?}"
```
Here, primitive refers to Variant primitive, not Arrow primitive, so we shouldn’t throw an error about “not an Arrow primitive.” Ideally this function should accept any Arrow type, so I changed this to NotYetImplemented instead.
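For illustration, a minimal standalone sketch of that idea (hypothetical function, not the actual builder code): any Arrow type without a dedicated arm maps to NotYetImplemented rather than being rejected as invalid.

```rust
use arrow_schema::{ArrowError, DataType};

// Hypothetical sketch: unhandled Arrow types are reported as "not yet
// implemented" instead of "not a primitive type", since any Arrow type is
// in principle a valid conversion target.
fn unhandled_target_error(data_type: &DataType) -> ArrowError {
    ArrowError::NotYetImplemented(format!(
        "Data type {data_type:?} not yet implemented"
    ))
}

fn main() {
    // Binary is not an Arrow primitive, but it should not be treated as invalid.
    println!("{}", unhandled_target_error(&DataType::Binary));
}
```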
```rust
_ => {
    // Supported shredded primitive types, see Variant shredding spec:
    // https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
    DataType::Boolean
```
Can't we do the type cast after this? Not sure if this is OK.
Currently we do the type check in make_primitive_variant_to_arrow_row_builder with its match arms, so maybe we don't need to add it here; doing the type check in two places seems like it will add maintenance burden.
The issue is that make_primitive_variant_to_arrow_row_builder doesn’t enforce the Parquet-primitive constraint. As a result, types like UInt, which aren’t Parquet primitives, are accepted for shredding because make_primitive_variant_to_arrow_row_builder allows them.
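For context, here is a minimal sketch of the kind of allow-list this PR adds to shred_variant, based on the shredded value types in the Variant shredding spec (the helper name and the exact Arrow representations chosen for string/binary/decimal are assumptions, not the PR's actual code):

```rust
use arrow_schema::{DataType, TimeUnit};

// Hypothetical helper: accept only the shredded value types listed in
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
fn is_spec_approved_shredding_type(data_type: &DataType) -> bool {
    use DataType::*;
    matches!(
        data_type,
        Boolean
            | Int8
            | Int16
            | Int32
            | Int64
            | Float32
            | Float64
            | Decimal32(_, _)
            | Decimal64(_, _)
            | Decimal128(_, _)
            | Date32
            | Time64(TimeUnit::Microsecond)
            | Timestamp(TimeUnit::Microsecond, _)
            | Timestamp(TimeUnit::Nanosecond, _)
            | FixedSizeBinary(16) // UUID
            | Binary
            | Utf8
    )
}

fn main() {
    assert!(is_spec_approved_shredding_type(&DataType::Int32));
    // UInt32 is not a Variant/Parquet primitive, so shredding to it is rejected.
    assert!(!is_spec_approved_shredding_type(&DataType::UInt32));
}
```

With a check like this, unsigned integer types are rejected up front instead of being silently accepted by the row builder.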
```rust
    VariantToShreddedVariantRowBuilder::Primitive(typed_value_builder)
}
DataType::FixedSizeBinary(_) => {
    return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported.")))
```
Do we need to distinguish this from the _ match arm?
I think this provides a user-friendly message, and it's actually moved from variant_to_arrow. I don't mind removing it though.
```rust
    StringView(VariantToStringArrowBuilder::new(cast_options, capacity))
}
_ => {
    return Err(ArrowError::NotYetImplemented(format!(
```
Maybe we shouldn't remove the _ if data_type.is_primitive() arm for now; it seems this arm is for types we may support but haven't implemented yet, while the _ match arm is for invalid types.
After #8768 merged, we completed the 1-1 mapping (and some transforms for certain types) for all Variant primitive types, but we may also want to support some DataTypes here that are not valid Variant primitives (e.g. Timestamp with a different unit). Keeping _ if data_type.is_primitive() reminds us that we may still need to support those; we can remove it once we reach a conclusion.
I am sorting out the possible conversions and will create an issue to discuss them once that work is done.
Actually, data_type.is_primitive() does not check for Parquet primitives, which is why I wanted to remove it to avoid confusion. It checks for Arrow primitives, which only include numeric and temporal types. If we were to keep it, we shouldn't add Binary or String to this builder, as they are not Arrow primitives.
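A quick illustration of that distinction, assuming arrow-schema's DataType::is_primitive (which reflects Arrow's fixed-width primitive layout, not any Parquet or Variant notion of "primitive"):

```rust
use arrow_schema::{DataType, TimeUnit};

fn main() {
    // Numeric and temporal types are Arrow primitives...
    assert!(DataType::UInt8.is_primitive());
    assert!(DataType::Timestamp(TimeUnit::Microsecond, None).is_primitive());
    // ...but variable-length types like String and Binary are not,
    // even though they are perfectly valid Variant shredding targets.
    assert!(!DataType::Utf8.is_primitive());
    assert!(!DataType::Binary.is_primitive());
}
```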
My understanding is that
- All Parquet Variant primitive types will map to Arrow DataTypes
- Some Parquet Variant primitive types can be converted to other Arrow DataTypes (with a cast)
- Here we want to support all Parquet Variant primitive types, and since the target type here is an Arrow DataType, we check the Arrow DataTypes here
- The data_type.is_primitive() check (my guess) exists because we could support all Arrow DataTypes but had not done so when this code was written, so a NotYetImplemented error was added to mark this
However, it is true that we might need an expert to confirm our understanding here.
> - All Parquet Variant primitive types will map to Arrow DataTypes
> - Some Parquet Variant primitive types can be converted to other Arrow DataTypes (with a cast)
> - Here we want to support all Parquet Variant primitive types, and since the target type here is an Arrow DataType, we check the Arrow DataTypes here

That's my understanding as well.

> - The data_type.is_primitive() check (my guess) exists because we could support all Arrow DataTypes but had not done so when this code was written, so a NotYetImplemented error was added to mark this
This is why I’m proposing the change. I’m not removing NotYetImplemented. Instead, I’m removing the data_type.is_primitive() check and the InvalidArgumentError: Not a primitive type, because, as you noted, we can support all Arrow DataTypes.
Example: variant_get() with Binary
- Before this change: fails with InvalidArgumentError: Not a primitive type because Binary is not an Arrow primitive. This incorrectly suggests Binary is an invalid target type and shouldn’t be supported.
- After this change: fails with NotYetImplemented, which is the expected behavior until Binary is supported.
Hope this example makes sense
scovich
left a comment
```rust
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
DataType::Boolean
```
Big question:
```diff
  // https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
- DataType::Boolean
+ DataType::Null
+ | DataType::Boolean
```
I'm pretty sure that Variant::Null (JSON null) is an actual value... but I'm NOT sure how useful it would be in practice. Strict casting would produce an error if even one row had a normal value, and non-strict casting would produce an all-null output no matter what the input was.
So maybe we intentionally forbid DataType::Null, with an explanatory comment?
Update:
I confused myself; the variant shredding spec does not allow shredding as null, so this code is correct as-is.
HOWEVER, while looking at this I realized that we do have a row builder for DataType::Null, and it does not currently enforce strict casting (all values are blindly treated as null, no matter the casting mode).
The simplest fix would be to adjust the null builder's definition:
```diff
 define_variant_to_primitive_builder!(
     struct VariantToNullArrowRowBuilder<'a>
     |capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },
-    |_value| Some(Variant::Null),
+    |value| value.as_null(),
     type_name: "Null"
 );
```

... and change the fake row builder's append_value method to suit:
```diff
 impl FakeNullBuilder {
     fn new(capacity: usize) -> Self {
         Self(NullArray::new(capacity))
     }
-    fn append_value<T>(&mut self, _: T) {}
+    fn append_value(&mut self, _: ()) {}
     fn append_null(&mut self) {}
```

... but that might produce clippy warnings about passing the unit type as a function argument. If so, we'd need to adjust the value conversion to produce Some dummy value instead, e.g. value.as_null().map(|_| 0) or matches!(value, Variant::Null).then_some(0)
Also, the fake null builder should probably track how many "values" were "appended" and either produce a NullArray of that length or blow up if the call count disagrees with the array's length. The former is probably more correct than the latter, since it matches all the other builders for whom "capacity" is only a hint.
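A possible shape for that length-tracking builder (a sketch only; the struct layout and how it plugs into the surrounding row-builder machinery are assumptions):

```rust
use arrow_array::{Array, NullArray};

// Hypothetical sketch: count every append and build a NullArray of exactly
// that length, treating the initial capacity as a hint like the other builders.
struct FakeNullBuilder {
    len: usize,
}

impl FakeNullBuilder {
    fn new(_capacity: usize) -> Self {
        Self { len: 0 }
    }
    fn append_value(&mut self, _: ()) {
        self.len += 1;
    }
    fn append_null(&mut self) {
        self.len += 1;
    }
    fn finish(self) -> NullArray {
        NullArray::new(self.len)
    }
}

fn main() {
    let mut builder = FakeNullBuilder::new(2);
    builder.append_value(());
    builder.append_null();
    builder.append_null();
    assert_eq!(builder.finish().len(), 3);
}
```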
The latter is unrelated to this PR, tracking it as #8810
```rust
DataType::FixedSizeBinary(_) => {
    return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported.")))
}
```
Not sure whether this distinction is important enough to merit its own error message?
Also, we eventually need to check the field for UUID extension type and not just rely on the data type. If #8673 merges first, we should fix it here; if this PR merges first the other PR needs to incorporate the change.
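For reference, a rough sketch of what that field-level check could look like, assuming the canonical arrow.uuid extension name stored under the ARROW:extension:name metadata key (how this integrates with #8673 may differ):

```rust
use arrow_schema::{DataType, Field};

// Hypothetical helper: treat FixedSizeBinary(16) as UUID only when the field
// is annotated with the canonical `arrow.uuid` extension type.
fn is_uuid_field(field: &Field) -> bool {
    field.data_type() == &DataType::FixedSizeBinary(16)
        && field
            .metadata()
            .get("ARROW:extension:name")
            .is_some_and(|name| name.as_str() == "arrow.uuid")
}

fn main() {
    use std::collections::HashMap;

    let plain = Field::new("col", DataType::FixedSizeBinary(16), true);
    let uuid = plain.clone().with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.uuid".to_string(),
    )]));

    assert!(!is_uuid_field(&plain));
    assert!(is_uuid_field(&uuid));
}
```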
```rust
_ => {
    return Err(ArrowError::NotYetImplemented(format!(
        "Data_type {data_type:?} not yet implemented"
```
Instead of a catch-all, should we enumerate the remaining data types so it's obvious at a glance which ones are still unsupported? I don't think there are very many left.
Makes sense to me, done
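For illustration only, the enumerated arms might look roughly like this (the concrete list of remaining types is a placeholder, not the PR's actual set):

```rust
use arrow_schema::{ArrowError, DataType};

// Hypothetical sketch of enumerating unsupported targets instead of a catch-all,
// so a reader can see at a glance which conversions are still missing.
fn reject_unsupported(data_type: &DataType) -> Result<(), ArrowError> {
    match data_type {
        // Types the builder does not handle yet (illustrative list only).
        DataType::Binary | DataType::LargeBinary | DataType::BinaryView => {
            Err(ArrowError::NotYetImplemented(format!(
                "Data type {data_type:?} not yet implemented"
            )))
        }
        // Everything else is assumed to have a dedicated builder arm.
        _ => Ok(()),
    }
}

fn main() {
    assert!(reject_unsupported(&DataType::Binary).is_err());
    assert!(reject_unsupported(&DataType::Utf8).is_ok());
}
```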
```rust
let builder = match data_type {
    DataType::Null => Null(VariantToNullArrowRowBuilder::new(cast_options, capacity)),
    DataType::Boolean => Boolean(VariantToBooleanArrowRowBuilder::new(cast_options, capacity)),
    DataType::Int8 => Int8(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Int16 => Int16(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Int32 => Int32(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Int64 => Int64(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::UInt8 => UInt8(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::UInt16 => UInt16(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::UInt32 => UInt32(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::UInt64 => UInt64(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Float16 => Float16(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Float32 => Float32(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Float64 => Float64(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Decimal32(precision, scale) => Decimal32(VariantToDecimalArrowRowBuilder::new(
        cast_options,
        capacity,
        *precision,
        *scale,
    )?),
    DataType::Decimal64(precision, scale) => Decimal64(VariantToDecimalArrowRowBuilder::new(
        cast_options,
        capacity,
        *precision,
        *scale,
    )?),
    DataType::Decimal128(precision, scale) => Decimal128(VariantToDecimalArrowRowBuilder::new(
        cast_options,
        capacity,
        *precision,
        *scale,
    )?),
    DataType::Decimal256(precision, scale) => Decimal256(VariantToDecimalArrowRowBuilder::new(
        cast_options,
        capacity,
        *precision,
        *scale,
    )?),
    DataType::Timestamp(TimeUnit::Microsecond, None) => TimestampMicroNtz(
        VariantToTimestampNtzArrowRowBuilder::new(cast_options, capacity),
    ),
    DataType::Timestamp(TimeUnit::Microsecond, tz) => TimestampMicro(
        VariantToTimestampArrowRowBuilder::new(cast_options, capacity, tz.clone()),
    ),
    DataType::Timestamp(TimeUnit::Nanosecond, None) => TimestampNanoNtz(
        VariantToTimestampNtzArrowRowBuilder::new(cast_options, capacity),
    ),
    DataType::Timestamp(TimeUnit::Nanosecond, tz) => TimestampNano(
        VariantToTimestampArrowRowBuilder::new(cast_options, capacity, tz.clone()),
    ),
    DataType::Date32 => Date(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::Time64(TimeUnit::Microsecond) => Time(VariantToPrimitiveArrowRowBuilder::new(
        cast_options,
        capacity,
    )),
    DataType::FixedSizeBinary(16) => {
        Uuid(VariantToUuidArrowRowBuilder::new(cast_options, capacity))
    }
    DataType::FixedSizeBinary(size) => {
        return Err(ArrowError::InvalidArgumentError(format!(
            "FixedSizeBinary({size}) is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported."
        )));
    }
    DataType::Utf8 => String(VariantToStringArrowBuilder::new(cast_options, capacity)),
    DataType::LargeUtf8 => {
        LargeString(VariantToStringArrowBuilder::new(cast_options, capacity))
    }
    DataType::Utf8View => StringView(VariantToStringArrowBuilder::new(cast_options, capacity)),
    _ if data_type.is_primitive() => {
        return Err(ArrowError::NotYetImplemented(format!(
            "Primitive data_type {data_type:?} not yet implemented"
        )));
    }
    _ => {
        return Err(ArrowError::InvalidArgumentError(format!(
            "Not a primitive type: {data_type:?}"
        )));
    }
};
let builder =
    match data_type {
```
aside: Thanks, clippy... (best to review with whitespace ignored)
```rust
DataType::FixedSizeBinary(16) => {
    Uuid(VariantToUuidArrowRowBuilder::new(cast_options, capacity))
}
```
This is actually incorrect. It's perfectly legal to (attempt to) convert any Variant::Binary (plus maybe also Variant::String and Variant::Uuid) value to fixed-len binary -- including length 16. The conversion just might fail if the length is wrong.
We really need a UUID extension type check to know whether the caller intended to work with UUID or binary values.
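In other words, the constraint is per-value rather than per-type; a rough sketch of that distinction (hypothetical helper, not the builder's actual code):

```rust
use arrow_schema::ArrowError;

// Hypothetical sketch: any binary value may target FixedSizeBinary(16); the
// conversion only fails for individual values whose length does not match.
fn check_fixed_len(value: &[u8], size: usize) -> Result<(), ArrowError> {
    if value.len() == size {
        Ok(())
    } else {
        Err(ArrowError::InvalidArgumentError(format!(
            "expected {size} bytes for FixedSizeBinary({size}), got {}",
            value.len()
        )))
    }
}

fn main() {
    assert!(check_fixed_len(&[0u8; 16], 16).is_ok());
    assert!(check_fixed_len(b"too short", 16).is_err());
}
```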
I agree, which is why I removed the branch that errored on FixedSizeBinary(size). But this issue should be handled in a separate PR, as it needs to change the FixedSizeBinary impl. I'll add a TODO here.
klion26
left a comment
Thanks for the update, LGTM.
alamb
left a comment
This makes sense to me -- thank you @liamzwbao and @klion26
Which issue does this PR close?
shred_variant #8795.

Rationale for this change
Mentioned in the issue
What changes are included in this PR?
Add validation in shred_variant to allow spec-approved types only.

Are these changes tested?
Yes
Are there any user-facing changes?