Skip to content

Conversation

@joonaspessi
Copy link
Contributor

Implements ST_Polygonize aggregate function following PostGIS behavior. Creates polygons from a collection of linestrings that form closed rings using the GEOS polygonize algorithm.

This is the first GEOS-based aggregate function in the codebase and asking for opinions about the implementation as I'm quite unsure if this is the correct approach.

How to support geometry[] array input?

PostGIS supports two function signatures:

-- Aggregate form (implemented)
SELECT ST_Polygonize(geom) FROM table;

-- Array form (NOT implemented)
SELECT ST_Polygonize(ARRAY[geom1, geom2, geom3]);

Any ideas how to implement this with DafaFusion and keep support for both input styles?
Should we separate the "array form" to another PR?

Copy link
Collaborator

@petern48 petern48 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cracking ground on the first aggregate function in geos! A bit of feedback.

Any ideas how to implement this with DafaFusion and keep support for both input styles?

See this comment (and PR) as an example for how to implement multiple different sets of arguments

Should we separate the "array form" to another PR?

I think it makes sense to do so. I'm not aware of any existing support for array input, so I think we'll need to add something to matchers.rs

('LINESTRING (25 10, 20 0)')
) AS t(geom)""",
"GEOMETRYCOLLECTION (POLYGON ((10 0, 0 0, 5 10, 10 0)), POLYGON ((30 0, 20 0, 25 10, 30 0)))",
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also add these cases too.

# a single polygon
select st_astext(st_polygonize(array[st_geomfromtext('POLYGON ((10 0, 0 0, 10 10, 10 0))')]));
-- GEOMETRYCOLLECTION(POLYGON((10 0,0 0,10 10,10 0)))

one of each of the other geom types (e.g points, multipoint, etc) and linestring empty
-- GEOMETRYCOLLECTION EMPTY

# A single linestring that's really a LinearRing
select st_astext(st_polygonize(array[st_geomfromtext('linestring(0 0, 0 1, 1 1, 1 0, 0 0)')]));
-- GEOMETRYCOLLECTION(POLYGON((0 0,0 1,1 1,1 0,0 0)))

These are using Postgis array syntax. Since we don't have array syntax yet, we could do something like this to avoid having to write a bunch of those polygon:

@pytest.mark.parametrize("eng", [SedonaDB, PostGIS])
@pytest.mark.parametrize(
    ("geom", "expected"),  
def test_st_polygonize_single_geom(eng, geom, expected):  # feel free to come up with a better name
    eng = eng.create_or_skip()
    eng.assert_query_result(
        f"""SELECT ST_Polygonize(ST_GeomFromText(geom)) FROM (
            VALUES
                ('{geom}')
        ) AS t(geom)""",
        expected,

Optional: It would be nice to have a few of these cases on the Rust level. If someone makes changes during development, it's nice to have cargo test catch buggy changes rather than waiting to occasionally pytest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the idea of having test cases for single geometries. Tests are now updated for both rust and python suites.

Comment on lines 110 to 112
let arg_types = [self.input_type.clone()];
let args = [ColumnarValue::Array(values[0].clone())];
let executor = sedona_functions::executor::WkbExecutor::new(&arg_types, &args);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let arg_types = [self.input_type.clone()];
let args = [ColumnarValue::Array(values[0].clone())];
let executor = sedona_functions::executor::WkbExecutor::new(&arg_types, &args);
let arg_types = std::slice::from_ref(&self.input_type);
let args = [ColumnarValue::Array(values[0].clone())];
let executor = sedona_functions::executor::WkbExecutor::new(arg_types, &args);

We can avoid this clone.

Comment on lines 115 to 116
if let Some(item) = maybe_item {
self.geometries.push(item.buf().to_vec());
Copy link
Collaborator

@petern48 petern48 Nov 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug in and found a way to avoid some of these to_vec() calls that copy the buffer. It's scattered across the file (and thus hard to convey here in a concise way), so I'm tried something new today and submitted a PR on your fork: joonaspessi#1. Merge if it looks good to you, and it should reflect here in your PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a similar idea to ST_Collect()...it may help here (or improvements you find here might be useful there!)

#[derive(Debug)]
struct CollectionAccumulator {
input_type: SedonaType,
unique_geometry_types: HashSet<GeometryTypeId>,
unique_dimensions: HashSet<Dimensions>,
count: i64,
item: Option<Vec<u8>>,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating the PR, merged it into the branch and I hope that most of the unnecessary copy operations are now gone.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aggregate functions are hard...thank you!

I am not sure how well DataFusion handles aggregate functions and scalar function that have the same name...we might have to rename this to ST_Polygonize_Agg() (but we can evaluate when we get to the scalar functions).

I agree that the scalar function is a separate PR. We don't have any existing functions that accept an array of geometries and so it will probably be verbose to test and you'll have to write your own TypeMatcher.

Comment on lines 115 to 116
if let Some(item) = maybe_item {
self.geometries.push(item.buf().to_vec());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a similar idea to ST_Collect()...it may help here (or improvements you find here might be useful there!)

#[derive(Debug)]
struct CollectionAccumulator {
input_type: SedonaType,
unique_geometry_types: HashSet<GeometryTypeId>,
unique_dimensions: HashSet<Dimensions>,
count: i64,
item: Option<Vec<u8>>,
}

#[derive(Debug)]
struct PolygonizeAccumulator {
input_type: SedonaType,
geometries: Vec<Vec<u8>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally optional, but it might simplify our state() implementation (and make it more similar to ST_Collect()) to use Vec<u8> here (in other words, accumulate a geometrycollection).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to think if this would work out. Using single buffer like in ST_Collect, would it require then to keep track of offsets and then re-slice the buffer so that we can pass them individually geos::Geometry::polygonize(&[geom1, geom2, ...]). If I got it right, it feels little complicated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd have to write it as a valid GEOMETRYCOLLECTION (like ST_Collect()), use GEOS to parse the geometrycollection (once), then pass its children to polygonize(). It should be faster for a few reasons (e.g., a Vec<Vec<u8>> will perform heap allocations for each element, whereas Vec<u8> can reuse the same heap allocation for multiple elements). Optional to handle that here but it would be nice as a precedent for future aggregate function writers.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Just one note to take care of from my end 🙂

Comment on lines 35 to 37
thread_local! {
static GEOS_WKB_FACTORY: GEOSWkbFactory = GEOSWkbFactory::new();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need a thread local factory...we do that in at least one performance-critical place in SedonaDB but here I think it is sufficient to declare it once as a function-level variable before you loop over WKB geometries.

Comment on lines 96 to 98

let mut geos_geoms = Vec::with_capacity(self.geometries.len());
for wkb_bytes in &self.geometries {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right before this loop is where you could declare the function-level GEOSWkbFactory. If we run into performance issues with this function we can reevlauate whether a thread local is appropriate.

// Clone is necessary: get_geometry_n() returns ConstGeometry<'_> which doesn't
// implement Borrow<Geometry>. The GEOS polygonize() function signature requires
// T: Borrow<Geometry>, so we must clone to get owned Geometry instances.
geos_geoms.push(geom.clone());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't get rid of this :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK! A heap allocation per output item is better than a heap allocation per input item 🙂

Copy link
Collaborator

@petern48 petern48 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit about using DataFusionError::Internal. Double check formatting after

Comment on lines 135 to 137
return Err(DataFusionError::Internal(
"No input arrays provided to accumulator in update_batch".to_string(),
));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Err(DataFusionError::Internal(
"No input arrays provided to accumulator in update_batch".to_string(),
));
return sedona_internal_err!("No input arrays provided to accumulator in update_batch");

nit: We try to avoid using Datafusion::Internal because it throws a misleading error message. See here for more detail: #2. This is a handy macro we typically use instead.

Here's the import use sedona_common::sedona_internal_err;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. Should be fixed now

Comment on lines 181 to 184
return Err(DataFusionError::Internal(format!(
"Unexpected number of state fields for st_polygonize() (expected 2, got {})",
states.len()
)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Err(DataFusionError::Internal(format!(
"Unexpected number of state fields for st_polygonize() (expected 2, got {})",
states.len()
)));
return sedona_internal_err!(
"Unexpected number of state fields for st_polygonize() (expected 2, got {:?})",
states.len()
);

}

let item_ref = self.item.as_mut().ok_or_else(|| {
DataFusionError::Internal("Unexpected internal state in ST_Polygonize()".to_string())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataFusionError::Internal("Unexpected internal state in ST_Polygonize()".to_string())
sedona_internal_err!("Unexpected internal state in ST_Polygonize()")

@paleolimbot paleolimbot changed the title feat: st_polygonize feat(c/sedona-geos): Add ST_Polygonize() aggregate version Nov 11, 2025
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Peter's suggestion is a good one if you don't mind taking care of that (but in general this is good to go from my end!)

// Clone is necessary: get_geometry_n() returns ConstGeometry<'_> which doesn't
// implement Borrow<Geometry>. The GEOS polygonize() function signature requires
// T: Borrow<Geometry>, so we must clone to get owned Geometry instances.
geos_geoms.push(geom.clone());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK! A heap allocation per output item is better than a heap allocation per input item 🙂

@jiayuasu
Copy link
Member

jiayuasu commented Nov 11, 2025

We should rename this to ST_Polygonize_Agg. Sedona, Snowflake, Databricks all use _agg to separate their Spatial SQL aggregate functions from scalar functions.

https://sedona.apache.org/latest/api/sql/AggregateFunction/
https://docs.snowflake.com/en/sql-reference/functions-geospatial
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-st-geospatial-functions

@joonaspessi
Copy link
Contributor Author

@jiayuasu As suggested, renamed to ST_Polygonize_Agg. Maybe this makes most sense for both user and implementation perspective.

@yutannihilation
Copy link
Contributor

Sorry to nitpick. I think Sedona uses _Aggr instead of _Agg (although I feel _Agg is more familiar...)

@jiayuasu
Copy link
Member

Yes, I am aware. I plan to change them back to agg...

@yutannihilation
Copy link
Contributor

Oh, cool. Good to know!

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@paleolimbot paleolimbot merged commit 49cecb0 into apache:main Nov 12, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants