Skip to content

Conversation

@liamzwbao
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Mentioned in the issue

What changes are included in this PR?

Add validation in shred_variant to allow spec-approved types only.

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Nov 5, 2025
Copy link
Contributor Author

@liamzwbao liamzwbao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @klion26, this is the PR to address the confusion in #8768 (comment).

There’s some noise from fmt, please review with “Hide whitespace” on.

cc @alamb @scovich

Comment on lines -315 to -331
DataType::FixedSizeBinary(size) => {
return Err(ArrowError::InvalidArgumentError(format!(
"FixedSizeBinary({size}) is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported."
)));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to shred_variant

@liamzwbao liamzwbao marked this pull request as ready for review November 6, 2025 01:15
Comment on lines -325 to -351
_ if data_type.is_primitive() => {
return Err(ArrowError::NotYetImplemented(format!(
"Primitive data_type {data_type:?} not yet implemented"
)));
}
_ => {
return Err(ArrowError::InvalidArgumentError(format!(
"Not a primitive type: {data_type:?}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, primitive refers to Variant primitive, not Arrow primitive. So we shouldn’t throw an error about “not an Arrow primitive.” Ideally this function should accept any Arrow type, therefore changed this to NotYetImplemented instead.

_ => {
// Supported shredded primitive types, see Variant shredding spec:
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
DataType::Boolean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't make do the type cast after this? not sure if this is ok.

Currently, we will do the type check in make_primitive_variant_to_arrow_row_builder with the match arms, so maybe we don't need to add it here, and add the type check in two places seems will add maintaince.

Copy link
Contributor Author

@liamzwbao liamzwbao Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that make_primitive_variant_to_arrow_row_builder doesn’t enforce the Parquet-primitive constraint. As a result, types like UInt, which aren’t Parquet primitives, are accepted for shredding because make_primitive_variant_to_arrow_row_builder allows.

VariantToShreddedVariantRowBuilder::Primitive(typed_value_builder)
}
DataType::FixedSizeBinary(_) => {
return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported.")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to distinguish this with the _ match arm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this provides a user friendly message and it's actually moved from variant_to_arrow. Don't mind removing it tho

StringView(VariantToStringArrowBuilder::new(cast_options, capacity))
}
_ => {
return Err(ArrowError::NotYetImplemented(format!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can't remove the _ if data_type.is_primitive() for now, seems this arm is for the type we may but have not implement, and _ match arm is for the types invalid.

After #8768 merged, we complete the 1-1 mapping(and some transforms for some types) for all Variant primitive types, but we may support some DataTypes here which is not a valid variant primitive(e.g. Timestamp with different unit), and keep the _ if data_type.is_primitive() so that we know we may support the required, and remove it after we have a conclusion.

I am sorting out possible conversions and will create an issue to discuss them after the work done.

Copy link
Contributor Author

@liamzwbao liamzwbao Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually data_type.is_primitive() is not checking Parquet primitive, that's why I wanted to remove it to avoid confusion. It checks arrow primitive which only includes numeric and temporal. If we were to keep it, we shouldn't add Binary or String into this builder as they are not arrow primitive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that

  • All Parquet Variant primitive types will map to Arrow DataTypes
  • Some Parquet Variant primitive types can be converted to other Arrow DataTypes(with cast)
  • Here we want to support all parquet variant primitive types, and the target type here is Arrow DataType, we check the Arrow DataTypes here
  • The dataype.is_primitive() check(my guess) is that we can support all Arrow DataTypes, but has not done(when this code written), so add a NotYetImplement here to mark this.

However, it is true that we might need an expert to confirm our understanding here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • All Parquet Variant primitive types will map to Arrow DataTypes
  • Some Parquet Variant primitive types can be converted to other Arrow DataTypes(with cast)
  • Here we want to support all parquet variant primitive types, and the target type here is Arrow DataType, we check the Arrow DataTypes here

That's my understanding as well.

  • The dataype.is_primitive() check(my guess) is that we can support all Arrow DataTypes, but has not done(when this code written), so add a NotYetImplement here to mark this.

This is why I’m proposing the change. I’m not removing NotYetImplemented. Instead, I’m removing the datatype.is_primitive() check and the InvalidArgumentError: Not a primitive type, because, as you noted, we can support all Arrow DataTypes.

Example: variant_get() with Binary

  • Before this change: fails with InvalidArgumentError: Not a primitive type because Binary is not an Arrow primitive. This incorrectly suggests Binary is an invalid target type and shouldn’t be supported.
  • After this change: fails with NotYetImplemented, which is the expected behavior until Binary is supported.

Hope this example makes sense

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, but trips over UUID extension type issues. I'm not sure if we should merge as-is with follow-up items to fix those once #8420 is addressed? Or wait for #8673 to merge so we have access to the field and not just a bare data type?

Comment on lines +127 to +128
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
DataType::Boolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big question:

Suggested change
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
DataType::Boolean
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
DataType::Null
| DataType::Boolean

I'm pretty sure that Variant::Null (JSON null) is an actual value... but I'm NOT sure how useful it would be in practice. Strict casting would produce an error if even one row had a normal value, and non-strict casting would produce an all-null output no matter what the input was.

So maybe we intentionally forbid DataType::Null, with an explanatory comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update:

I confused myself; the variant shredding spec does not allow shredding as null, so this code is correct as-is.

HOWEVER, while looking at this I realized that we do have a row builder for DataType::Null, and it does not currently enforce strict casting (all values are blindly treated as null, no matter the casting mode).

The simplest fix would be to adjust the null builder's definition:

define_variant_to_primitive_builder!(
     struct VariantToNullArrowRowBuilder<'a>
     |capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },
-    |_value|  Some(Variant::Null),
+    |value|  value.as_null(),
     type_name: "Null"
 );

and change the fake row builder's append_value method to suit:

 impl FakeNullBuilder {
     fn new(capacity: usize) -> Self {
         Self(NullArray::new(capacity))
     }
-    fn append_value<T>(&mut self, _: T) {}
+    fn append_value(&mut self, _: ()) {}
     fn append_null(&mut self) {}

... but that might produce clippy warnings about passing unit type as a function argument. If so, we'd need to adjust the value conversion to produce Some dummy value instead, e.g. value.as_null().map(|_| 0) or matches!(value, Variant::Null).then_some(0)

Also, the fake null builder should probably track how many "values" were "appended" and either produce a NullArray of that length or blow up if the call count disagrees with the array's length. The former is probably more correct than the latter, since it matches all the other builders for whom "capacity" is only a hint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter is unrelated to this PR, tracking it as #8810

Comment on lines +153 to +155
DataType::FixedSizeBinary(_) => {
return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported.")))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether this distinction is important enough to merit its own error message?

Also, we eventually need to check the field for UUID extension type and not just rely on the data type. If #8673 merges first, we should fix it here; if this PR merges first the other PR needs to incorporate the change.

CC @friendlymatthew

Comment on lines 312 to 314
_ => {
return Err(ArrowError::NotYetImplemented(format!(
"Data_type {data_type:?} not yet implemented"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a catch-all, should we enumerate the remaining data types so it's obvious at a glance which ones are still unsupported? I don't think there are very many left.

Copy link
Contributor Author

@liamzwbao liamzwbao Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, done

Comment on lines 221 to 222
let builder = match data_type {
DataType::Null => Null(VariantToNullArrowRowBuilder::new(cast_options, capacity)),
DataType::Boolean => Boolean(VariantToBooleanArrowRowBuilder::new(cast_options, capacity)),
DataType::Int8 => Int8(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Int16 => Int16(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Int32 => Int32(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Int64 => Int64(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::UInt8 => UInt8(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::UInt16 => UInt16(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::UInt32 => UInt32(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::UInt64 => UInt64(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Float16 => Float16(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Float32 => Float32(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Float64 => Float64(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Decimal32(precision, scale) => Decimal32(VariantToDecimalArrowRowBuilder::new(
cast_options,
capacity,
*precision,
*scale,
)?),
DataType::Decimal64(precision, scale) => Decimal64(VariantToDecimalArrowRowBuilder::new(
cast_options,
capacity,
*precision,
*scale,
)?),
DataType::Decimal128(precision, scale) => Decimal128(VariantToDecimalArrowRowBuilder::new(
cast_options,
capacity,
*precision,
*scale,
)?),
DataType::Decimal256(precision, scale) => Decimal256(VariantToDecimalArrowRowBuilder::new(
cast_options,
capacity,
*precision,
*scale,
)?),
DataType::Timestamp(TimeUnit::Microsecond, None) => TimestampMicroNtz(
VariantToTimestampNtzArrowRowBuilder::new(cast_options, capacity),
),
DataType::Timestamp(TimeUnit::Microsecond, tz) => TimestampMicro(
VariantToTimestampArrowRowBuilder::new(cast_options, capacity, tz.clone()),
),
DataType::Timestamp(TimeUnit::Nanosecond, None) => TimestampNanoNtz(
VariantToTimestampNtzArrowRowBuilder::new(cast_options, capacity),
),
DataType::Timestamp(TimeUnit::Nanosecond, tz) => TimestampNano(
VariantToTimestampArrowRowBuilder::new(cast_options, capacity, tz.clone()),
),
DataType::Date32 => Date(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::Time64(TimeUnit::Microsecond) => Time(VariantToPrimitiveArrowRowBuilder::new(
cast_options,
capacity,
)),
DataType::FixedSizeBinary(16) => {
Uuid(VariantToUuidArrowRowBuilder::new(cast_options, capacity))
}
DataType::FixedSizeBinary(size) => {
return Err(ArrowError::InvalidArgumentError(format!(
"FixedSizeBinary({size}) is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported."
)));
}
DataType::Utf8 => String(VariantToStringArrowBuilder::new(cast_options, capacity)),
DataType::LargeUtf8 => {
LargeString(VariantToStringArrowBuilder::new(cast_options, capacity))
}
DataType::Utf8View => StringView(VariantToStringArrowBuilder::new(cast_options, capacity)),
_ if data_type.is_primitive() => {
return Err(ArrowError::NotYetImplemented(format!(
"Primitive data_type {data_type:?} not yet implemented"
)));
}
_ => {
return Err(ArrowError::InvalidArgumentError(format!(
"Not a primitive type: {data_type:?}"
)));
}
};
let builder =
match data_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: Thanks, clippy... (best to review with whitespace ignored)

Comment on lines +302 to +304
DataType::FixedSizeBinary(16) => {
Uuid(VariantToUuidArrowRowBuilder::new(cast_options, capacity))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually incorrect. It's perfectly legal to (attempt to) convert any Variant::Binary (plus maybe also Variant::String and Variant::Uuid) value to fixed-len binary -- including length 16. The conversion just might fail if the length is wrong.

We really need a UUID extension type check to know whether the caller intended to work with UUID or binary values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, which is why I removed the branch that errored on FixedSizeBinary(size). But this issue should be handled in a separate PR as it needs to change the impl the FixedSizeBinary. I'll add a TODO here.

Copy link
Member

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, LGTM.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me -- thank you @liamzwbao and @klion26

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Enforce shredded-type validation in shred_variant

4 participants