-
Notifications
You must be signed in to change notification settings - Fork 30
Prototype raster #249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Prototype raster #249
Conversation
Kontinuation
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall design looks good to me.
| let data_type = band.metadata().data_type(); | ||
| if band.metadata().storage_type() != StorageType::InDb { | ||
| return Err(ArrowError::InvalidArgumentError( | ||
| "Pretty print indb not supported for non-InDb storage".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty print not supported for non-InDb storage
| pub fn bytes_per_pixel(data_type: BandDataType) -> Result<usize, ArrowError> { | ||
| match data_type { | ||
| BandDataType::UInt8 => Ok(1), | ||
| BandDataType::Int16 => Ok(2), | ||
| BandDataType::UInt16 => Ok(2), | ||
| BandDataType::Int32 => Ok(4), | ||
| BandDataType::UInt32 => Ok(4), | ||
| BandDataType::Float32 => Ok(4), | ||
| BandDataType::Float64 => Ok(8), | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to wrap the return type with Result if the match is always exhaustive.
| out.write_fmt(format_args!("{:8.*} ", precision, value)) | ||
| .unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of Write trait may fail writing bytes due to lots of reasons, and the caller may want to handle some of them. We may want to propagate this error using ? instead of .unwrap().
| pub fn create_band_buffer( | ||
| &mut self, | ||
| capacity: usize, | ||
| ) -> (MutableBuffer, impl FnOnce(MutableBuffer) + '_) { | ||
| let mut buffer = MutableBuffer::with_capacity(capacity); | ||
|
|
||
| // Pre-allocate the buffer to the exact size | ||
| buffer.resize(capacity, 0); | ||
|
|
||
| let commit = move |buffer: MutableBuffer| { | ||
| // Convert MutableBuffer to &[u8] and append to BinaryBuilder | ||
| let data = buffer.as_slice(); | ||
| self.band_data_writer().append_value(data); | ||
| }; | ||
|
|
||
| (buffer, commit) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have another method commit_band_buffer, so we don't have to return a committer with the buffer.
| /// Alternative: Get a mutable slice from a MutableBuffer for GDAL | ||
| /// This provides the most direct access for zero-copy operations | ||
| /// TODO: have this 3 different way.... pick one!! | ||
| pub fn get_band_buffer_slice(&mut self, size: usize) -> (MutableBuffer, &mut [u8]) { | ||
| let mut buffer = MutableBuffer::with_capacity(size); | ||
| buffer.resize(size, 0); | ||
|
|
||
| // Get mutable slice that GDAL can write to | ||
| let slice = unsafe { | ||
| // This is safe because we just allocated the buffer with the exact size | ||
| std::slice::from_raw_parts_mut(buffer.as_mut_ptr(), size) | ||
| }; | ||
|
|
||
| (buffer, slice) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has incorrect lifetime notation: the lifetime of the returned &mut [u8] is actually related to the lifetime of the returned MutableBuffer object, and not related to self (RasterBuilder).
We can simply remove this method since we already have create_band_buffer. The holder of the MutableBuffer can simply call as_slice_mut if they want a &mut [u8] from it.
| let metadata_struct = raster_struct | ||
| .column(raster_indices::METADATA) | ||
| .as_any() | ||
| .downcast_ref::<StructArray>() | ||
| .unwrap(); | ||
|
|
||
| let crs = raster_struct | ||
| .column(raster_indices::CRS) | ||
| .as_any() | ||
| .downcast_ref::<StringArray>() | ||
| .unwrap(); | ||
|
|
||
| let bbox = raster_struct | ||
| .column(raster_indices::BBOX) | ||
| .as_any() | ||
| .downcast_ref::<StructArray>() | ||
| .unwrap(); | ||
|
|
||
| let bands_list = raster_struct | ||
| .column(raster_indices::BANDS) | ||
| .as_any() | ||
| .downcast_ref::<ListArray>() | ||
| .unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performing all these .column, .as_any and downcast_ref for each raster could be a huge performance overhead, especially for queries that only fetches the metadata.
| fn width(&self) -> u64 { | ||
| self.metadata_struct | ||
| .column(metadata_indices::WIDTH) | ||
| .as_any() | ||
| .downcast_ref::<UInt64Array>() | ||
| .unwrap() | ||
| .value(self.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling .column, .as_any and .downcast_ref each time we access the metadata fields could also be a huge performance overhead. We may observe this when benchmarking rs_width.
| /// TODO: The band_metadata is in the finish in the band call, but in the | ||
| /// start in the raster call. Make it consistent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is a problem. start_raster takes raster metadata as pararameter, while finish_band takes band metadata as parameter.
|
|
||
| /// Band data schema (single binary blob) | ||
| pub fn band_data_type() -> DataType { | ||
| DataType::Binary // consider switching to BinaryView |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that BinaryView is a better option. Growing the binary array when appending in-db band data could be less painful.
| let band_metadata = BandMetadata { | ||
| nodata_value: nodata_value, | ||
| storage_type: StorageType::InDb, | ||
| datatype: data_type, | ||
| outdb_url: None, | ||
| outdb_band_id: None, | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a parameter to control if the loaded tiles are in-db or out-db, so that we can test basic functions of out-db rasters.
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very clean and there is great test coverage...thank you!
I have a few organizational comments but this is a great overall design (raster in SQL is hard!). A possible chain of PRs where we can discuss implementation details:
- Add the data type to sedona-schema
- Add sedona-raster with the data structure, builder, and iterator
- Add sedona-testing helpers
- Add sedona-raster-functions with the metadata extracting functions
- GDAL
| /// CRS schema to store json representation | ||
| pub fn crs_type() -> DataType { | ||
| DataType::Utf8 // TODO: Consider Utf8View | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a Utf8View here is a good choice since this value will be frequently repeated, less than 12 bytes, or both!
| Arrow(DataType), | ||
| Wkb(Edges, Crs), | ||
| WkbView(Edges, Crs), | ||
| Raster(RasterSchema), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can probably be unparameterized (i.e., SedonaType::Raster). My reading of this PR is that the underlying Arrow schema is a constant (so it can be cloned from RASTER_DATATYPE when requested by storage_type()).
| /// Schema for storing raster data in Apache Arrow format. | ||
| /// Utilizing nested structs and lists to represent raster metadata and bands. | ||
| #[derive(Debug, PartialEq, Clone)] | ||
| pub struct RasterSchema; | ||
| impl RasterSchema { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be nice to have this live in a dedicated sedona-schema/src/raster.rs (arguably we should move the Wkb types into their own file as well in some future 😬 )
| /// Builder for constructing raster arrays with zero-copy band data writing | ||
| pub struct RasterBuilder { | ||
| main_builder: StructBuilder, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I think your sedona-raster crate would be a good home for this one. I know our existing sedona-geometry doesn't depend on Arrow but I think that's fine for sedona-raster (we can move these things around if there's good reason to do so later on).
| /// Helper for writing raster kernel implementations | ||
| /// | ||
| /// The [RasterExecutor] provides a simplified interface for executing functions | ||
| /// on raster arrays, handling the common pattern of downcasting to StructArray, | ||
| /// creating raster iterators, and handling null values. | ||
| pub struct RasterExecutor<'a, 'b> { | ||
| pub arg_types: &'a [SedonaType], | ||
| pub args: &'b [ColumnarValue], | ||
| num_iterations: usize, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a sedona-raster-functions crate would make sense! (sedona-functions is already a little crowded and might benefit from being split even in its current form)
| /// RS_Value() implementation | ||
| pub fn rs_value_impl() -> ScalarKernelRef { | ||
| Arc::new(RSValue {}) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably something we want to be an async scalar function (which we'd need to implement a Sedona wrapper for to get the kernel dispatch thing working): https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html#adding-an-async-scalar-udf . That would provide a mechanism for doing the IO required in parallel and a way to tell DataFusion that operating in small batches is a good idea.
| /// Raster metadata schema | ||
| pub fn metadata_type() -> DataType { | ||
| DataType::Struct(Fields::from(vec![ | ||
| // Raster dimensions | ||
| Field::new(column::WIDTH, DataType::UInt64, false), | ||
| Field::new(column::HEIGHT, DataType::UInt64, false), | ||
| // Geospatial transformation parameters | ||
| Field::new(column::UPPERLEFT_X, DataType::Float64, false), | ||
| Field::new(column::UPPERLEFT_Y, DataType::Float64, false), | ||
| Field::new(column::SCALE_X, DataType::Float64, false), | ||
| Field::new(column::SCALE_Y, DataType::Float64, false), | ||
| Field::new(column::SKEW_X, DataType::Float64, false), | ||
| Field::new(column::SKEW_Y, DataType::Float64, false), | ||
| ])) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential future consideration, that it might be just as fast to do something like:
#[derive(Serialize, Deserialize)
struct RasterMetadata {
shape: [u64; 2],
upper_left: [f64; 2],
geo_transform: [f64; 6],
}...and use serde_json to read and write from JSON. (i.e., the metadata field would just be Utf8).
You've done a great job abstracting this so that nobody ever has to manually create or read this and so it should be easy to experiment with changing it in the future 🙂
| DataType::Struct(Fields::from(vec![ | ||
| Field::new(column::NODATAVALUE, DataType::Binary, true), // Allow null nodata values | ||
| Field::new(column::STORAGE_TYPE, DataType::UInt32, false), | ||
| Field::new(column::DATATYPE, DataType::UInt32, false), | ||
| // OutDb reference fields - only used when storage_type == OutDbRef | ||
| Field::new(column::OUTDB_URL, DataType::Utf8, true), | ||
| Field::new(column::OUTDB_BAND_ID, DataType::UInt32, true), | ||
| ])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With JSON storage this could also collapse to:
#[derive(Serialize, Deserialize)]
struct BandMetadata {
nodata_value: Vec<u8>,
storage_type: StorageType,
data_type: BandDataType,
#[serde(skip_serializing_if = None)]
outdb_url: Option<String>,
#[serde(skip_serializing_if = None)]
outdb_band_id: Option<String>
}(Again, this is abstracted nicely so we don't have to consider this now!)
Simple prototype code to demonstrate viability of a StructArray data type for rasters (issue: #247 ).
If we are happy with the basic ideas here, we will break this up into many small PRs so we can get down to the nitty gritty. :)