Prototype raster #249

jesspav · 2025-10-27T22:52:35Z

Simple prototype code to demonstrate viability of a StructArray data type for rasters (issue: #247 ).

If we are happy with the basic ideas here, we will break this up into many small PRs so we can get down to the nitty gritty. :)

Kontinuation

The overall design looks good to me.

Kontinuation · 2025-10-28T07:01:57Z

rust/sedona-raster/src/display_functions.rs

+    let data_type = band.metadata().data_type();
+    if band.metadata().storage_type() != StorageType::InDb {
+        return Err(ArrowError::InvalidArgumentError(
+            "Pretty print indb not supported for non-InDb storage".to_string(),


Pretty print not supported for non-InDb storage

Kontinuation · 2025-10-28T07:03:57Z

rust/sedona-raster/src/datatype_functions.rs

+pub fn bytes_per_pixel(data_type: BandDataType) -> Result<usize, ArrowError> {
+    match data_type {
+        BandDataType::UInt8 => Ok(1),
+        BandDataType::Int16 => Ok(2),
+        BandDataType::UInt16 => Ok(2),
+        BandDataType::Int32 => Ok(4),
+        BandDataType::UInt32 => Ok(4),
+        BandDataType::Float32 => Ok(4),
+        BandDataType::Float64 => Ok(8),
+    }
+}


We don't need to wrap the return type with Result if the match is always exhaustive.

Kontinuation · 2025-10-28T07:09:08Z

rust/sedona-raster/src/display_functions.rs

+                    out.write_fmt(format_args!("{:8.*} ", precision, value))
+                        .unwrap();


The implementation of Write trait may fail writing bytes due to lots of reasons, and the caller may want to handle some of them. We may want to propagate this error using ? instead of .unwrap().

Kontinuation · 2025-10-28T07:38:27Z

rust/sedona-schema/src/datatypes.rs

+    pub fn create_band_buffer(
+        &mut self,
+        capacity: usize,
+    ) -> (MutableBuffer, impl FnOnce(MutableBuffer) + '_) {
+        let mut buffer = MutableBuffer::with_capacity(capacity);
+
+        // Pre-allocate the buffer to the exact size
+        buffer.resize(capacity, 0);
+
+        let commit = move |buffer: MutableBuffer| {
+            // Convert MutableBuffer to &[u8] and append to BinaryBuilder
+            let data = buffer.as_slice();
+            self.band_data_writer().append_value(data);
+        };
+
+        (buffer, commit)
+    }


We already have another method commit_band_buffer, so we don't have to return a committer with the buffer.

Kontinuation · 2025-10-28T07:40:56Z

rust/sedona-schema/src/datatypes.rs

+    /// Alternative: Get a mutable slice from a MutableBuffer for GDAL
+    /// This provides the most direct access for zero-copy operations
+    /// TODO: have this 3 different way.... pick one!!
+    pub fn get_band_buffer_slice(&mut self, size: usize) -> (MutableBuffer, &mut [u8]) {
+        let mut buffer = MutableBuffer::with_capacity(size);
+        buffer.resize(size, 0);
+
+        // Get mutable slice that GDAL can write to
+        let slice = unsafe {
+            // This is safe because we just allocated the buffer with the exact size
+            std::slice::from_raw_parts_mut(buffer.as_mut_ptr(), size)
+        };
+
+        (buffer, slice)
+    }


This method has incorrect lifetime notation: the lifetime of the returned &mut [u8] is actually related to the lifetime of the returned MutableBuffer object, and not related to self (RasterBuilder).

We can simply remove this method since we already have create_band_buffer. The holder of the MutableBuffer can simply call as_slice_mut if they want a &mut [u8] from it.

Kontinuation · 2025-10-28T11:22:59Z

rust/sedona-schema/src/datatypes.rs

+        let metadata_struct = raster_struct
+            .column(raster_indices::METADATA)
+            .as_any()
+            .downcast_ref::<StructArray>()
+            .unwrap();
+
+        let crs = raster_struct
+            .column(raster_indices::CRS)
+            .as_any()
+            .downcast_ref::<StringArray>()
+            .unwrap();
+
+        let bbox = raster_struct
+            .column(raster_indices::BBOX)
+            .as_any()
+            .downcast_ref::<StructArray>()
+            .unwrap();
+
+        let bands_list = raster_struct
+            .column(raster_indices::BANDS)
+            .as_any()
+            .downcast_ref::<ListArray>()
+            .unwrap();


Performing all these .column, .as_any and downcast_ref for each raster could be a huge performance overhead, especially for queries that only fetches the metadata.

Kontinuation · 2025-10-28T11:25:11Z

rust/sedona-schema/src/datatypes.rs

+    fn width(&self) -> u64 {
+        self.metadata_struct
+            .column(metadata_indices::WIDTH)
+            .as_any()
+            .downcast_ref::<UInt64Array>()
+            .unwrap()
+            .value(self.index)


Calling .column, .as_any and .downcast_ref each time we access the metadata fields could also be a huge performance overhead. We may observe this when benchmarking rs_width.

Kontinuation · 2025-10-28T11:29:14Z

rust/sedona-schema/src/datatypes.rs

+    /// TODO: The band_metadata is in the finish in the band call, but in the
+    /// start in the raster call. Make it consistent.


I don't think it is a problem. start_raster takes raster metadata as pararameter, while finish_band takes band metadata as parameter.

Kontinuation · 2025-10-28T11:40:46Z

rust/sedona-schema/src/datatypes.rs

+
+    /// Band data schema (single binary blob)
+    pub fn band_data_type() -> DataType {
+        DataType::Binary // consider switching to BinaryView


I believe that BinaryView is a better option. Growing the binary array when appending in-db band data could be less painful.

Kontinuation · 2025-10-28T11:57:23Z

rust/sedona-gdal/src/readers.rs

+                let band_metadata = BandMetadata {
+                    nodata_value: nodata_value,
+                    storage_type: StorageType::InDb,
+                    datatype: data_type,
+                    outdb_url: None,
+                    outdb_band_id: None,
+                };


We can add a parameter to control if the loaded tiles are in-db or out-db, so that we can test basic functions of out-db rasters.

paleolimbot

This is very clean and there is great test coverage...thank you!

I have a few organizational comments but this is a great overall design (raster in SQL is hard!). A possible chain of PRs where we can discuss implementation details:

Add the data type to sedona-schema
Add sedona-raster with the data structure, builder, and iterator
Add sedona-testing helpers
Add sedona-raster-functions with the metadata extracting functions
GDAL

paleolimbot · 2025-10-28T15:16:55Z

rust/sedona-schema/src/datatypes.rs

+    /// CRS schema to store json representation
+    pub fn crs_type() -> DataType {
+        DataType::Utf8 // TODO: Consider Utf8View
+    }


I think a Utf8View here is a good choice since this value will be frequently repeated, less than 12 bytes, or both!

paleolimbot · 2025-10-28T15:25:33Z

rust/sedona-schema/src/datatypes.rs

    Arrow(DataType),
    Wkb(Edges, Crs),
    WkbView(Edges, Crs),
+    Raster(RasterSchema),


I think this can probably be unparameterized (i.e., SedonaType::Raster). My reading of this PR is that the underlying Arrow schema is a constant (so it can be cloned from RASTER_DATATYPE when requested by storage_type()).

paleolimbot · 2025-10-28T15:37:12Z

rust/sedona-schema/src/datatypes.rs

+/// Schema for storing raster data in Apache Arrow format.
+/// Utilizing nested structs and lists to represent raster metadata and bands.
+#[derive(Debug, PartialEq, Clone)]
+pub struct RasterSchema;
+impl RasterSchema {


I think it might be nice to have this live in a dedicated sedona-schema/src/raster.rs (arguably we should move the Wkb types into their own file as well in some future 😬 )

paleolimbot · 2025-10-28T15:39:53Z

rust/sedona-schema/src/datatypes.rs

+/// Builder for constructing raster arrays with zero-copy band data writing
+pub struct RasterBuilder {
+    main_builder: StructBuilder,
+}


Nice! I think your sedona-raster crate would be a good home for this one. I know our existing sedona-geometry doesn't depend on Arrow but I think that's fine for sedona-raster (we can move these things around if there's good reason to do so later on).

paleolimbot · 2025-10-28T15:48:32Z

rust/sedona-functions/src/executor.rs

+/// Helper for writing raster kernel implementations
+///
+/// The [RasterExecutor] provides a simplified interface for executing functions
+/// on raster arrays, handling the common pattern of downcasting to StructArray,
+/// creating raster iterators, and handling null values.
+pub struct RasterExecutor<'a, 'b> {
+    pub arg_types: &'a [SedonaType],
+    pub args: &'b [ColumnarValue],
+    num_iterations: usize,
+}


I think having a sedona-raster-functions crate would make sense! (sedona-functions is already a little crowded and might benefit from being split even in its current form)

paleolimbot · 2025-10-28T16:06:56Z

rust/sedona-gdal/src/rs_value.rs

+/// RS_Value() implementation
+pub fn rs_value_impl() -> ScalarKernelRef {
+    Arc::new(RSValue {})
+}


This is probably something we want to be an async scalar function (which we'd need to implement a Sedona wrapper for to get the kernel dispatch thing working): https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html#adding-an-async-scalar-udf . That would provide a mechanism for doing the IO required in parallel and a way to tell DataFusion that operating in small batches is a good idea.

paleolimbot · 2025-10-28T16:19:54Z

rust/sedona-schema/src/datatypes.rs

+    /// Raster metadata schema
+    pub fn metadata_type() -> DataType {
+        DataType::Struct(Fields::from(vec![
+            // Raster dimensions
+            Field::new(column::WIDTH, DataType::UInt64, false),
+            Field::new(column::HEIGHT, DataType::UInt64, false),
+            // Geospatial transformation parameters
+            Field::new(column::UPPERLEFT_X, DataType::Float64, false),
+            Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
+            Field::new(column::SCALE_X, DataType::Float64, false),
+            Field::new(column::SCALE_Y, DataType::Float64, false),
+            Field::new(column::SKEW_X, DataType::Float64, false),
+            Field::new(column::SKEW_Y, DataType::Float64, false),
+        ]))
+    }


A potential future consideration, that it might be just as fast to do something like:

#[derive(Serialize, Deserialize) struct RasterMetadata { shape: [u64; 2], upper_left: [f64; 2], geo_transform: [f64; 6], }

...and use serde_json to read and write from JSON. (i.e., the metadata field would just be Utf8).

You've done a great job abstracting this so that nobody ever has to manually create or read this and so it should be easy to experiment with changing it in the future 🙂

paleolimbot · 2025-10-28T16:26:05Z

rust/sedona-schema/src/datatypes.rs

+        DataType::Struct(Fields::from(vec![
+            Field::new(column::NODATAVALUE, DataType::Binary, true), // Allow null nodata values
+            Field::new(column::STORAGE_TYPE, DataType::UInt32, false),
+            Field::new(column::DATATYPE, DataType::UInt32, false),
+            // OutDb reference fields - only used when storage_type == OutDbRef
+            Field::new(column::OUTDB_URL, DataType::Utf8, true),
+            Field::new(column::OUTDB_BAND_ID, DataType::UInt32, true),
+        ]))


With JSON storage this could also collapse to:

#[derive(Serialize, Deserialize)] struct BandMetadata { nodata_value: Vec<u8>, storage_type: StorageType, data_type: BandDataType, #[serde(skip_serializing_if = None)] outdb_url: Option<String>, #[serde(skip_serializing_if = None)] outdb_band_id: Option<String> }

(Again, this is abstracted nicely so we don't have to consider this now!)

jesspav added 18 commits October 16, 2025 09:28

add schema

6394eb8

add a bulder

72d647f

Add iterator

8918929

ran fmt

87e131b

switch to indexes for column refs

5101829

add crs to schema

a0df683

add crs

07d2d70

pull out bounding box

4bed58d

move raster datatype to sedona-schema

b0a933d

move outdb url into metadata and add an executor function

f7c4659

fix builder to handle null rasters

aafa3c5

adding gdal

809923d

add a temp test for outdb

eabeb52

cargo lock

898b880

fix warnings

64c9a3a

adding a gdal loader

1ef03dd

use 1-based band_number consistently

4bd1618

Add benchmarking

b766f22

Kontinuation reviewed Oct 28, 2025

View reviewed changes

paleolimbot reviewed Oct 28, 2025

View reviewed changes

jesspav mentioned this pull request Oct 28, 2025

Add Raster Schema #252

Closed

		out.write_fmt(format_args!("{:8.*} ", precision, value))
		.unwrap();

		/// TODO: The band_metadata is in the finish in the band call, but in the
		/// start in the raster call. Make it consistent.

Prototype raster #249

Are you sure you want to change the base?

Prototype raster #249

Uh oh!

Conversation

jesspav commented Oct 27, 2025

Uh oh!

Kontinuation left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants