
Conversation

@vustef (Contributor) commented Oct 27, 2025

Based on #7307.

Which issue does this PR close?

Rationale for this change

We need row numbers for many downstream features, e.g. computing a unique row identifier in Iceberg.

What changes are included in this PR?

New API to get row numbers as a virtual column:

```rust
let file = File::open(path).unwrap();
let row_number_field =
    Field::new("row_number", ArrowDataType::Int64, false).with_extension_type(RowNumber);
// `with_virtual_columns` takes `Vec<FieldRef>`, so the field is wrapped in an `Arc`
let options = ArrowReaderOptions::new().with_virtual_columns(vec![Arc::new(row_number_field)]);
let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
    .unwrap()
    .build()
    .expect("Could not create reader");
reader
    .collect::<Result<Vec<_>, _>>()
    .expect("Could not read");
```

This column is defined as an Arrow extension type.
Parquet metadata is propagated to the array builder to compute each row group's first row index.
A new Virtual field type is included alongside the existing Primitive and Group variants.
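The first-row-index computation mentioned above can be sketched as a standalone function; this is an illustrative sketch based on the description here, not the crate's actual internals:

```rust
/// Illustrative sketch (not the crate's internals): the first row number of
/// each row group is the running sum of the row counts of all preceding
/// row groups in the file, which is why metadata for every row group is needed.
fn first_row_indexes(row_group_num_rows: &[i64]) -> Vec<i64> {
    let mut indexes = Vec::with_capacity(row_group_num_rows.len());
    let mut next = 0i64;
    for &num_rows in row_group_num_rows {
        indexes.push(next);
        next += num_rows;
    }
    indexes
}

fn main() {
    // Three row groups with 10, 5 and 7 rows start at rows 0, 10 and 15
    assert_eq!(first_row_indexes(&[10, 5, 7]), vec![0, 10, 15]);
}
```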

Are these changes tested?

Yes

Are there any user-facing changes?

This is a user-facing feature, and docstrings have been added.
There should be no breaking changes: where a public method needed extra parameters, a new method was added alongside the existing one rather than changing its signature.

}

fn skip_records(&mut self, num_records: usize) -> Result<usize> {
// TODO: Use advance_by when it stabilizes to improve performance
@vustef (author) commented:
TODO from original PR
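For context, the fallback the TODO refers to: `Iterator::advance_by` is still unstable, so skipping is done by calling `next()` repeatedly. A minimal standalone sketch of that pattern (illustrative only, not the actual `skip_records` body):

```rust
/// Skip up to `n` items from an iterator by calling `next()` repeatedly,
/// returning how many items were actually skipped. `Iterator::advance_by`
/// would do this in one call once it stabilizes.
fn skip_n<I: Iterator>(iter: &mut I, n: usize) -> usize {
    let mut skipped = 0;
    while skipped < n {
        if iter.next().is_none() {
            break; // iterator exhausted before n items were skipped
        }
        skipped += 1;
    }
    skipped
}

fn main() {
    let mut it = 0..10;
    assert_eq!(skip_n(&mut it, 3), 3); // three items skipped
    assert_eq!(it.next(), Some(3)); // iterator now at the fourth item
    let mut short = 0..2;
    assert_eq!(skip_n(&mut short, 5), 2); // only two items were available
}
```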

}

fn row_groups(&self) -> Box<dyn Iterator<Item = &RowGroupMetaData> + '_> {
Box::new(std::iter::once(self.metadata.row_group(self.row_group_idx)))
@vustef (author) commented:
this duplicates a lot, not sure if anything can be done here

/// - If nullable: def_level = parent_def_level + 1
/// - If required: def_level = parent_def_level
/// - rep_level = parent_rep_level (virtual fields are not repeated)
fn convert_virtual_field(
@vustef (author) commented:
the name used here is not aligned with what other convert_ functions do
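The level rules quoted in the doc comment above can be expressed as a tiny function; this is an illustrative sketch with made-up names, not the actual convert_virtual_field implementation:

```rust
/// Illustrative sketch of the level rules for virtual fields: a nullable
/// virtual field bumps the definition level by one, a required one inherits
/// it, and the repetition level is always inherited because virtual fields
/// are never repeated.
fn virtual_field_levels(parent_def: i16, parent_rep: i16, nullable: bool) -> (i16, i16) {
    let def_level = if nullable { parent_def + 1 } else { parent_def };
    (def_level, parent_rep)
}

fn main() {
    assert_eq!(virtual_field_levels(0, 0, true), (1, 0)); // nullable: def_level + 1
    assert_eq!(virtual_field_levels(2, 1, false), (2, 1)); // required: levels inherited
}
```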

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 27, 2025
@alamb (Contributor) left a comment:
Thank you @vustef -- I went through this PR carefully and I think we can merge it as is

I took the liberty to push a few cleanups and reduce some duplication

This is a major step forward in my mind.

I have some thoughts on how to improve the tests that I will either push directly to this PR or propose as follow ons.

cc @mbrobbel @etseidl @adriangb in case you are interested in reviewing. I am especially interested in other reviewer comments on the public API surface

}

#[derive(Debug, Clone)]
pub enum ParquetFieldType {
@alamb commented:
I double checked that this enum is not publicly exposed: https://docs.rs/parquet/latest/parquet/?search=ParquetFieldType

Thus this is a backwards-compatible change

Ok(ParquetMetaData::new(fmd, row_groups))
}

/// Assign [`RowGroupMetaData::ordinal`] if it is missing.
@alamb commented:
I extracted the ordinal assignment to a structure, mostly so I had a place to add additional comments as well as avoiding duplication

@vustef (author) replied:
This is great, thanks

@vustef (author) left a comment:
Thanks @alamb for the review. Your commits look good to me, thanks for that as well.


@etseidl (Contributor) left a comment:
Quick first pass seems ok, just a couple of comments.

}

// decrypt column chunk info
// decrypt column chunk info and handle ordinal assignment
@etseidl commented:
I guess this doesn't hurt, but it shouldn't be necessary. The row groups already passed through the OrdinalAssigner in parquet_metadata_from_bytes, and row_group_from_encrypted_thrift re-uses the ordinal from the RowGroupMetaData passed in.

@vustef (author) replied:
I reverted the change here.

Comment on lines 899 to 901
if actual_ordinal == 0 {
self.first_has_ordinal = rg_has_ordinal;
}
@etseidl commented:
I wonder if first_has_ordinal should be an option, and then set it if it's None. If/when we implement row group selection the first ordinal seen may not be 0.

@vustef (author) replied:
But then enumeration won't work at all, and we'd have to rely on the ordinal in the metadata? Actually, the row numbers feature won't work at all, since it relies on having the row counts of all row groups to figure out the first row index of the row group it reads, unless some trick that I'm not aware of is used.

Note that row numbers have to be stable across queries (i.e. independent of whether there was filtering), otherwise we would've implemented them on the client side by just enumerating rows.

@etseidl replied:
Yes, but it seems this code is executed regardless of the presence of a virtual column. It looks to me like skipping the first rowgroup will result in an error if ordinals are present.

@vustef (author) replied:
This code should only error out if the ordinal is present in the metadata for some row groups but not for others. That shouldn't happen with row group skipping: even after skipping, all the remaining row groups we iterate through would fulfil this condition, right?

What I'd like though, is that the build_row_number_reader fails if skipping occurs. Or that it ensures that there's no row skipping if it's invoked. Not sure how to ensure that at this point though.

@etseidl replied:
If I skip row groups 0 and 1 and go straight to row group 2 (think externally cached statistics with point look up), which has ordinal set to 2 in the footer, self.first_has_ordinal is false and rg_has_ordinal is true. L904 will evaluate false, L906 will evaluate true and return an error.

I'll take a closer look at the rest of this PR to see if there's a way to make skipping and row numbers mutually exclusive.

@vustef (author) replied:
I see why we think differently. I was assuming that row group 2 would be the one to set self.first_has_ordinal, because, for example, the loop that invokes ensure would iterate over ordinal in 0..list_ident.size, where size would be based only on the row groups remaining after skipping.

Is that not going to happen?
Perhaps some changes will be needed once row group skipping is implemented; I'm not sure how best to guard against the intended behaviour getting broken without catching it.

@etseidl replied:
Because for example the loop that invokes ensure would be going for ordinal in 0..list_ident.size, where size is going to be based only on the row groups after skipping.

Ah, but here we're in a function whose input is the entire encoded footer. list_ident.size will always be the unfiltered number of row groups. So I think we'll either:

  • loop over ordinal in 0..list_ident.size, but skip decoding if ordinal not in a list
  • assuming an index, loop over a list of ordinals, decode the row group using a range from the index, and pass ordinal as actual_ordinal

In either event, we'd pass 2 as actual_ordinal in my scenario above. But yeah, that's pretty far down the road. Guess we can solve it then.

@vustef (author) replied:
Now I see why you proposed to use Option. I pushed a change for that.
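The Option-based fix discussed here could look roughly like the following standalone sketch (names are illustrative, not the crate's actual OrdinalAssigner):

```rust
/// Illustrative sketch: remember whether the first row group *seen* carries an
/// ordinal (it need not be ordinal 0 once row group skipping exists), then
/// require every subsequent row group to agree.
struct OrdinalPresenceCheck {
    first_has_ordinal: Option<bool>,
}

impl OrdinalPresenceCheck {
    fn new() -> Self {
        Self { first_has_ordinal: None }
    }

    /// Returns Err if this row group's ordinal presence disagrees with the first one seen.
    fn check(&mut self, rg_has_ordinal: bool) -> Result<(), String> {
        match self.first_has_ordinal {
            None => {
                // First row group seen becomes the baseline
                self.first_has_ordinal = Some(rg_has_ordinal);
                Ok(())
            }
            Some(first) if first == rg_has_ordinal => Ok(()),
            Some(_) => Err("some row groups have ordinals and some do not".to_string()),
        }
    }
}

fn main() {
    // Even if we skip straight to row group 2, its ordinal presence sets the baseline
    let mut check = OrdinalPresenceCheck::new();
    assert!(check.check(true).is_ok());
    assert!(check.check(true).is_ok());
    assert!(check.check(false).is_err()); // mixed presence is rejected
}
```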

@adriangb (Contributor) commented:
How is this going to integrate into DataFusion? I feel like FileScanConfig will have to lie and say it has this column, etc.

@vustef (author) commented Nov 13, 2025

How is this going to integrate into DataFusion? I feel like FileScanConfig will have to lie and say it has this column, etc.

I'll defer that to @alamb . I only thought about how to integrate it in iceberg-rust, and perhaps there are similarities. There the user of the library would have to request this column, and then we'd propagate this information to the underlying arrow reader.

Remove unused import
@alamb (Contributor) commented Nov 13, 2025

How is this going to integrate into DataFusion? I feel like FileScanConfig will have to lie and say it has this column, etc.

I don't have any real plan for the DataFusion integration. FWIW, I think the first use case of these row numbers will be various iceberg / other table format integrations, to support delete vectors.

In order to support "virtual columns" in DataFusion, I suspect we will need to update ListingTable to have some notion of virtual columns (in addition to partition columns).

Other virtual columns people have talked about are file names, for example, which wouldn't come from the parquet reader but instead would come from the file opening machinery

@etseidl (Contributor) left a comment:
Impressive. Thanks @vustef and @alamb.

/// # Ok(())
/// # }
/// ```
pub fn with_virtual_columns(self, virtual_columns: Vec<FieldRef>) -> Self {
@etseidl commented:
@vustef I think this is where we'd be able to detect row group filtering. There would be a with_row_group_selection() or some such function added to control skipping, and a check could be added both here and in the new function to disallow setting both.

@vustef (author) replied:
Thanks for figuring this out. I guess there's no action we can take right now then; please let me know if that's not the case.

@vustef (author) commented Nov 14, 2025

Impressive. Thanks @vustef and @alamb.

Thank you @etseidl for the review. @alamb, now that we have another approval, are we good to merge this before the Monday release?

@alamb (Contributor) commented Nov 14, 2025

Also thanks to @jkylling, who started this project

@alamb commented Nov 14, 2025

Impressive. Thanks @vustef and @alamb.

Thank you @etseidl for the review. @alamb, now that we have another approval, are we good to merge this before the Monday release?

Yeah, I don't see any reason to hold off merging. Let's do it!

@alamb commented Nov 14, 2025

Once the CI is green I'll merge this PR. Thank you @vustef

@alamb commented Nov 14, 2025

gogoogogogogo!!!

@alamb alamb merged commit 3d5428d into apache:main Nov 14, 2025
16 checks passed
@alamb commented Nov 14, 2025

The 57.1.0 patch release may be the most epic minor release we have ever had

@alamb commented Nov 14, 2025

Thanks again @jkylling @vustef and @etseidl

@vustef (author) commented Nov 14, 2025

Thanks again @jkylling @vustef and @etseidl

It was my pleasure, thanks to you all from me as well.


Labels

parquet Changes to the parquet crate


Development

Successfully merging this pull request may close these issues.

[Parquet] Support file row number in Parquet reader

5 participants