Merged
86 changes: 83 additions & 3 deletions parquet-variant-compute/src/shred_variant.rs
@@ -25,7 +25,7 @@ use crate::{VariantArray, VariantValueArrayBuilder};
use arrow::array::{ArrayRef, BinaryViewArray, NullBufferBuilder};
use arrow::buffer::NullBuffer;
use arrow::compute::CastOptions;
-use arrow::datatypes::{DataType, Fields};
+use arrow::datatypes::{DataType, Fields, TimeUnit};
use arrow::error::{ArrowError, Result};
use parquet_variant::{Variant, VariantBuilderExt};

@@ -123,13 +123,39 @@ pub(crate) fn make_variant_to_shredded_variant_arrow_row_builder<'a>(
"Shredding variant array values as arrow lists".to_string(),
));
}
_ => {
// Supported shredded primitive types, see Variant shredding spec:
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
DataType::Boolean
Member:
Can't we do the type cast after this instead? Not sure if this is OK.

Currently we do the type check in make_primitive_variant_to_arrow_row_builder via its match arms, so maybe we don't need to add it here; having the type check in two places adds a maintenance burden.

Contributor Author (@liamzwbao), Nov 6, 2025:

The issue is that make_primitive_variant_to_arrow_row_builder doesn't enforce the Parquet-primitive constraint. As a result, types like UInt, which aren't Parquet primitives, are accepted for shredding because make_primitive_variant_to_arrow_row_builder allows them.

Contributor, commenting on lines +127 to +128:

Big question:

Suggested change:
 // https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
-DataType::Boolean
+DataType::Null
+| DataType::Boolean

I'm pretty sure that Variant::Null (JSON null) is an actual value... but I'm NOT sure how useful it would be in practice. Strict casting would produce an error if even one row had a normal value, and non-strict casting would produce an all-null output no matter what the input was.

So maybe we intentionally forbid DataType::Null, with an explanatory comment?

Contributor:

Update:

I confused myself; the variant shredding spec does not allow shredding as null, so this code is correct as-is.

HOWEVER, while looking at this I realized that we do have a row builder for DataType::Null, and it does not currently enforce strict casting (all values are blindly treated as null, no matter the casting mode).

The simplest fix would be to adjust the null builder's definition:

define_variant_to_primitive_builder!(
     struct VariantToNullArrowRowBuilder<'a>
     |capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },
-    |_value|  Some(Variant::Null),
+    |value|  value.as_null(),
     type_name: "Null"
 );

and change the fake row builder's append_value method to suit:

 impl FakeNullBuilder {
     fn new(capacity: usize) -> Self {
         Self(NullArray::new(capacity))
     }
-    fn append_value<T>(&mut self, _: T) {}
+    fn append_value(&mut self, _: ()) {}
     fn append_null(&mut self) {}

... but that might produce clippy warnings about passing unit type as a function argument. If so, we'd need to adjust the value conversion to produce Some dummy value instead, e.g. value.as_null().map(|_| 0) or matches!(value, Variant::Null).then_some(0)

Also, the fake null builder should probably track how many "values" were "appended" and either produce a NullArray of that length or blow up if the call count disagrees with the array's length. The former is probably more correct than the latter, since it matches all the other builders for whom "capacity" is only a hint.
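The two suggestions above can be sketched together with stand-in types. This is a hypothetical illustration, not the actual parquet-variant-compute code: `Variant`, `FakeNullBuilder`, and `convert` here are simplified stand-ins for the real macro-generated builder.

```rust
// Simplified stand-in for parquet_variant::Variant (illustration only).
enum Variant {
    Null,
    BooleanTrue,
}

// Stand-in fake null builder that tracks how many rows were appended, so the
// final length reflects the appends rather than the capacity hint.
struct FakeNullBuilder {
    len: usize,
}

impl FakeNullBuilder {
    fn new(_capacity: usize) -> Self {
        // capacity is only a hint, matching the other builders
        Self { len: 0 }
    }
    fn append_value(&mut self, _: u8) {
        self.len += 1;
    }
    fn append_null(&mut self) {
        self.len += 1;
    }
    fn finish(&self) -> usize {
        // stand-in for NullArray::new(self.len)
        self.len
    }
}

// Clippy-friendly conversion: produce Some dummy value only for Variant::Null,
// instead of passing the unit type through append_value.
fn convert(value: &Variant) -> Option<u8> {
    matches!(value, Variant::Null).then_some(0)
}

fn main() {
    let mut builder = FakeNullBuilder::new(16);
    for v in [Variant::Null, Variant::BooleanTrue, Variant::Null] {
        match convert(&v) {
            Some(dummy) => builder.append_value(dummy),
            // under strict casting this branch would instead be an error
            None => builder.append_null(),
        }
    }
    // length tracks the three appends, not the capacity of 16
    assert_eq!(builder.finish(), 3);
}
```

With this shape, strict casting can reject non-null inputs in the `None` branch, while non-strict casting appends null, and the produced array length always matches the number of appends.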

Contributor:

The latter is unrelated to this PR, tracking it as #8810

| DataType::Int8
| DataType::Int16
| DataType::Int32
| DataType::Int64
| DataType::Float32
| DataType::Float64
| DataType::Decimal32(..)
| DataType::Decimal64(..)
| DataType::Decimal128(..)
| DataType::Date32
| DataType::Time64(TimeUnit::Microsecond)
| DataType::Timestamp(TimeUnit::Microsecond | TimeUnit::Nanosecond, _)
| DataType::Binary
| DataType::BinaryView
| DataType::Utf8
| DataType::Utf8View
| DataType::FixedSizeBinary(16) // UUID
=> {
let builder =
make_primitive_variant_to_arrow_row_builder(data_type, cast_options, capacity)?;
let typed_value_builder =
VariantToShreddedPrimitiveVariantRowBuilder::new(builder, capacity, top_level);
VariantToShreddedVariantRowBuilder::Primitive(typed_value_builder)
}
DataType::FixedSizeBinary(_) => {
return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported.")))
Member:
Do we need to distinguish this from the _ match arm?

Contributor Author:

I think this provides a user-friendly message, and it was actually moved from variant_to_arrow. I don't mind removing it, though.

}
Comment on lines +153 to +155
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether this distinction is important enough to merit its own error message?

Also, we eventually need to check the field for UUID extension type and not just rely on the data type. If #8673 merges first, we should fix it here; if this PR merges first the other PR needs to incorporate the change.

CC @friendlymatthew

Contributor:

Hi, so sorry. I swear I never caught this notification. In terms of order, I'm fine with this merging first. I'll read through and update accordingly

Contributor:

lol -- I certainly will never give someone a hard time for missing a notification!

_ => {
return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type")))
}
};
Ok(builder)
}
@@ -327,7 +353,7 @@ mod tests {
use super::*;
use crate::VariantArrayBuilder;
use arrow::array::{Array, FixedSizeBinaryArray, Float64Array, Int64Array};
-use arrow::datatypes::{DataType, Field, Fields};
+use arrow::datatypes::{DataType, Field, Fields, TimeUnit, UnionFields, UnionMode};
use parquet_variant::{ObjectBuilder, ReadOnlyMetadataBuilder, Variant, VariantBuilder};
use std::sync::Arc;
use uuid::Uuid;
@@ -536,6 +562,60 @@
assert!(typed_value_float64.is_null(2)); // string doesn't convert
}

#[test]
fn test_invalid_shredded_types_rejected() {
let input = VariantArray::from_iter([Variant::from(42)]);

let invalid_types = vec![
DataType::UInt8,
DataType::Float16,
DataType::Decimal256(38, 10),
DataType::Date64,
DataType::Time32(TimeUnit::Second),
DataType::Time64(TimeUnit::Nanosecond),
DataType::Timestamp(TimeUnit::Millisecond, None),
DataType::LargeBinary,
DataType::LargeUtf8,
DataType::FixedSizeBinary(17),
DataType::Union(
UnionFields::new(
vec![0_i8, 1_i8],
vec![
Field::new("int_field", DataType::Int32, false),
Field::new("str_field", DataType::Utf8, true),
],
),
UnionMode::Dense,
),
DataType::Map(
Arc::new(Field::new(
"entries",
DataType::Struct(Fields::from(vec![
Field::new("key", DataType::Utf8, false),
Field::new("value", DataType::Int32, true),
])),
false,
)),
false,
),
DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
DataType::RunEndEncoded(
Arc::new(Field::new("run_ends", DataType::Int32, false)),
Arc::new(Field::new("values", DataType::Utf8, true)),
),
];

for data_type in invalid_types {
let err = shred_variant(&input, &data_type).unwrap_err();
assert!(
matches!(err, ArrowError::InvalidArgumentError(_)),
"expected InvalidArgumentError for {:?}, got {:?}",
data_type,
err
);
}
}

#[test]
fn test_object_shredding_comprehensive() {
let mut builder = VariantArrayBuilder::new(7);
30 changes: 29 additions & 1 deletion parquet-variant-compute/src/variant_get.rs
@@ -320,7 +320,7 @@ mod test {
use arrow::datatypes::DataType::{Int16, Int32, Int64};
use arrow::datatypes::i256;
use arrow_schema::DataType::{Boolean, Float32, Float64, Int8};
-use arrow_schema::{DataType, Field, FieldRef, Fields, TimeUnit};
+use arrow_schema::{DataType, Field, FieldRef, Fields, IntervalUnit, TimeUnit};
use chrono::DateTime;
use parquet_variant::{
EMPTY_VARIANT_METADATA_BYTES, Variant, VariantDecimal4, VariantDecimal8, VariantDecimal16,
@@ -3685,6 +3685,34 @@
));
}

#[test]
fn get_non_supported_temporal_types_error() {
let values = vec![None, Some(Variant::Null), Some(Variant::BooleanFalse)];
let variant_array: ArrayRef = ArrayRef::from(VariantArray::from_iter(values));

let test_cases = vec![
FieldRef::from(Field::new(
"result",
DataType::Duration(TimeUnit::Microsecond),
true,
)),
FieldRef::from(Field::new(
"result",
DataType::Interval(IntervalUnit::YearMonth),
true,
)),
];

for field in test_cases {
let options = GetOptions::new().with_as_type(Some(field));
let err = variant_get(&variant_array, options).unwrap_err();
assert!(
err.to_string()
.contains("Casting Variant to duration/interval types is not supported")
);
}
}

perfectly_shredded_variant_array_fn!(perfectly_shredded_invalid_time_variant_array, || {
// 86401000000 is invalid for Time64Microsecond (max is 86400000000)
Time64MicrosecondArray::from(vec![