-
Notifications
You must be signed in to change notification settings - Fork 30
feat(c/sedona-geos): Add ST_Polygonize() aggregate version #286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for cracking ground on the first aggregate function in geos! A bit of feedback.
Any ideas how to implement this with DafaFusion and keep support for both input styles?
See this comment (and PR) as an example for how to implement multiple different sets of arguments
Should we separate the "array form" to another PR?
I think it makes sense to do so. I'm not aware of any existing support for array input, so I think we'll need to add something to matchers.rs
| ('LINESTRING (25 10, 20 0)') | ||
| ) AS t(geom)""", | ||
| "GEOMETRYCOLLECTION (POLYGON ((10 0, 0 0, 5 10, 10 0)), POLYGON ((30 0, 20 0, 25 10, 30 0)))", | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also add these cases too.
# a single polygon
select st_astext(st_polygonize(array[st_geomfromtext('POLYGON ((10 0, 0 0, 10 10, 10 0))')]));
-- GEOMETRYCOLLECTION(POLYGON((10 0,0 0,10 10,10 0)))
one of each of the other geom types (e.g points, multipoint, etc) and linestring empty
-- GEOMETRYCOLLECTION EMPTY
# A single linestring that's really a LinearRing
select st_astext(st_polygonize(array[st_geomfromtext('linestring(0 0, 0 1, 1 1, 1 0, 0 0)')]));
-- GEOMETRYCOLLECTION(POLYGON((0 0,0 1,1 1,1 0,0 0)))These are using Postgis array syntax. Since we don't have array syntax yet, we could do something like this to avoid having to write a bunch of those polygon:
@pytest.mark.parametrize("eng", [SedonaDB, PostGIS])
@pytest.mark.parametrize(
("geom", "expected"),
def test_st_polygonize_single_geom(eng, geom, expected): # feel free to come up with a better name
eng = eng.create_or_skip()
eng.assert_query_result(
f"""SELECT ST_Polygonize(ST_GeomFromText(geom)) FROM (
VALUES
('{geom}')
) AS t(geom)""",
expected,Optional: It would be nice to have a few of these cases on the Rust level. If someone makes changes during development, it's nice to have cargo test catch buggy changes rather than waiting to occasionally pytest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the idea of having test cases for single geometries. Tests are now updated for both rust and python suites.
c/sedona-geos/src/st_polygonize.rs
Outdated
| let arg_types = [self.input_type.clone()]; | ||
| let args = [ColumnarValue::Array(values[0].clone())]; | ||
| let executor = sedona_functions::executor::WkbExecutor::new(&arg_types, &args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let arg_types = [self.input_type.clone()]; | |
| let args = [ColumnarValue::Array(values[0].clone())]; | |
| let executor = sedona_functions::executor::WkbExecutor::new(&arg_types, &args); | |
| let arg_types = std::slice::from_ref(&self.input_type); | |
| let args = [ColumnarValue::Array(values[0].clone())]; | |
| let executor = sedona_functions::executor::WkbExecutor::new(arg_types, &args); |
We can avoid this clone.
c/sedona-geos/src/st_polygonize.rs
Outdated
| if let Some(item) = maybe_item { | ||
| self.geometries.push(item.buf().to_vec()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dug in and found a way to avoid some of these to_vec() calls that copy the buffer. It's scattered across the file (and thus hard to convey here in a concise way), so I'm tried something new today and submitted a PR on your fork: joonaspessi#1. Merge if it looks good to you, and it should reflect here in your PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a similar idea to ST_Collect()...it may help here (or improvements you find here might be useful there!)
sedona-db/rust/sedona-functions/src/st_collect.rs
Lines 98 to 105 in 94ce7db
| #[derive(Debug)] | |
| struct CollectionAccumulator { | |
| input_type: SedonaType, | |
| unique_geometry_types: HashSet<GeometryTypeId>, | |
| unique_dimensions: HashSet<Dimensions>, | |
| count: i64, | |
| item: Option<Vec<u8>>, | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating the PR, merged it into the branch and I hope that most of the unnecessary copy operations are now gone.
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aggregate functions are hard...thank you!
I am not sure how well DataFusion handles aggregate functions and scalar function that have the same name...we might have to rename this to ST_Polygonize_Agg() (but we can evaluate when we get to the scalar functions).
I agree that the scalar function is a separate PR. We don't have any existing functions that accept an array of geometries and so it will probably be verbose to test and you'll have to write your own TypeMatcher.
c/sedona-geos/src/st_polygonize.rs
Outdated
| if let Some(item) = maybe_item { | ||
| self.geometries.push(item.buf().to_vec()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a similar idea to ST_Collect()...it may help here (or improvements you find here might be useful there!)
sedona-db/rust/sedona-functions/src/st_collect.rs
Lines 98 to 105 in 94ce7db
| #[derive(Debug)] | |
| struct CollectionAccumulator { | |
| input_type: SedonaType, | |
| unique_geometry_types: HashSet<GeometryTypeId>, | |
| unique_dimensions: HashSet<Dimensions>, | |
| count: i64, | |
| item: Option<Vec<u8>>, | |
| } |
c/sedona-geos/src/st_polygonize.rs
Outdated
| #[derive(Debug)] | ||
| struct PolygonizeAccumulator { | ||
| input_type: SedonaType, | ||
| geometries: Vec<Vec<u8>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally optional, but it might simplify our state() implementation (and make it more similar to ST_Collect()) to use Vec<u8> here (in other words, accumulate a geometrycollection).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to think if this would work out. Using single buffer like in ST_Collect, would it require then to keep track of offsets and then re-slice the buffer so that we can pass them individually geos::Geometry::polygonize(&[geom1, geom2, ...]). If I got it right, it feels little complicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd have to write it as a valid GEOMETRYCOLLECTION (like ST_Collect()), use GEOS to parse the geometrycollection (once), then pass its children to polygonize(). It should be faster for a few reasons (e.g., a Vec<Vec<u8>> will perform heap allocations for each element, whereas Vec<u8> can reuse the same heap allocation for multiple elements). Optional to handle that here but it would be nice as a precedent for future aggregate function writers.
5e0afd6 to
a497e1e
Compare
9859b25 to
870f871
Compare
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Just one note to take care of from my end 🙂
c/sedona-geos/src/st_polygonize.rs
Outdated
| thread_local! { | ||
| static GEOS_WKB_FACTORY: GEOSWkbFactory = GEOSWkbFactory::new(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need a thread local factory...we do that in at least one performance-critical place in SedonaDB but here I think it is sufficient to declare it once as a function-level variable before you loop over WKB geometries.
c/sedona-geos/src/st_polygonize.rs
Outdated
|
|
||
| let mut geos_geoms = Vec::with_capacity(self.geometries.len()); | ||
| for wkb_bytes in &self.geometries { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right before this loop is where you could declare the function-level GEOSWkbFactory. If we run into performance issues with this function we can reevlauate whether a thread local is appropriate.
| // Clone is necessary: get_geometry_n() returns ConstGeometry<'_> which doesn't | ||
| // implement Borrow<Geometry>. The GEOS polygonize() function signature requires | ||
| // T: Borrow<Geometry>, so we must clone to get owned Geometry instances. | ||
| geos_geoms.push(geom.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't get rid of this :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK! A heap allocation per output item is better than a heap allocation per input item 🙂
petern48
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit about using DataFusionError::Internal. Double check formatting after
c/sedona-geos/src/st_polygonize.rs
Outdated
| return Err(DataFusionError::Internal( | ||
| "No input arrays provided to accumulator in update_batch".to_string(), | ||
| )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return Err(DataFusionError::Internal( | |
| "No input arrays provided to accumulator in update_batch".to_string(), | |
| )); | |
| return sedona_internal_err!("No input arrays provided to accumulator in update_batch"); |
nit: We try to avoid using Datafusion::Internal because it throws a misleading error message. See here for more detail: #2. This is a handy macro we typically use instead.
Here's the import use sedona_common::sedona_internal_err;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. Should be fixed now
c/sedona-geos/src/st_polygonize.rs
Outdated
| return Err(DataFusionError::Internal(format!( | ||
| "Unexpected number of state fields for st_polygonize() (expected 2, got {})", | ||
| states.len() | ||
| ))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return Err(DataFusionError::Internal(format!( | |
| "Unexpected number of state fields for st_polygonize() (expected 2, got {})", | |
| states.len() | |
| ))); | |
| return sedona_internal_err!( | |
| "Unexpected number of state fields for st_polygonize() (expected 2, got {:?})", | |
| states.len() | |
| ); |
c/sedona-geos/src/st_polygonize.rs
Outdated
| } | ||
|
|
||
| let item_ref = self.item.as_mut().ok_or_else(|| { | ||
| DataFusionError::Internal("Unexpected internal state in ST_Polygonize()".to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| DataFusionError::Internal("Unexpected internal state in ST_Polygonize()".to_string()) | |
| sedona_internal_err!("Unexpected internal state in ST_Polygonize()") |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Peter's suggestion is a good one if you don't mind taking care of that (but in general this is good to go from my end!)
| // Clone is necessary: get_geometry_n() returns ConstGeometry<'_> which doesn't | ||
| // implement Borrow<Geometry>. The GEOS polygonize() function signature requires | ||
| // T: Borrow<Geometry>, so we must clone to get owned Geometry instances. | ||
| geos_geoms.push(geom.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK! A heap allocation per output item is better than a heap allocation per input item 🙂
|
We should rename this to https://sedona.apache.org/latest/api/sql/AggregateFunction/ |
|
@jiayuasu As suggested, renamed to |
|
Sorry to nitpick. I think Sedona uses |
|
Yes, I am aware. I plan to change them back to |
|
Oh, cool. Good to know! |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Implements
ST_Polygonizeaggregate function following PostGIS behavior. Creates polygons from a collection of linestrings that form closed rings using the GEOS polygonize algorithm.This is the first GEOS-based aggregate function in the codebase and asking for opinions about the implementation as I'm quite unsure if this is the correct approach.
How to support
geometry[]array input?PostGIS supports two function signatures:
Any ideas how to implement this with DafaFusion and keep support for both input styles?
Should we separate the "array form" to another PR?