Contributor

@bcallender bcallender commented Oct 4, 2025

TL;DR

Add outer explode support that preserves null/empty arrays, plus position-aware variants. No breaking changes; default explode behavior unchanged.

What’s included

  • Explode “outer” semantics:
    • DataFrame APIs: explode_outer(col), posexplode_outer(col)
    • New keep_null_and_empty: bool on Explode and ExplodeWithIndex logical plans
    • Physical exec honors keep_null_and_empty:
      • Preserves rows with empty/null arrays
      • Emits null for the value, and also for the position in posexplode variants
  • Position-aware explode:
    • explode_with_index(col, index_name="index", value_name=None, keep_null_and_empty=False)
    • PySpark alias: posexplode(col) and posexplode_outer(col)
  • Serde + Protobuf:
    • New ExplodeWithIndex message
    • Explode and ExplodeWithIndex carry keep_null_and_empty
    • Optional value_name handled correctly (None when omitted)
  • Tests:
    • Outer behavior for explode/posexplode
    • Optional naming coverage (only value_name; only index_name; both; expression input)
    • Plan serde round-trip for ExplodeWithIndex (default/custom names, outer)

Behavior details

  • Regular explode: filters null/empty arrays (unchanged)
  • Outer explode: preserves all rows, emits nulls for missing elements
  • Position column:
    • Non-outer: 0-based indices per original row
    • Outer: null position for empty/null arrays
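
The rules above can be modeled in plain Python, independent of the engine. This is a minimal sketch and not the PR's implementation: `explode_rows` is a hypothetical helper that works on a list of dicts, and only the `keep_null_and_empty` flag mirrors a name from this PR.

```python
from typing import Any, Iterator, Optional


def explode_rows(
    rows: list[dict[str, Any]],
    column: str,
    index_name: Optional[str] = None,
    keep_null_and_empty: bool = False,
) -> Iterator[dict[str, Any]]:
    """Hypothetical reference model of explode / posexplode semantics."""
    for row in rows:
        values = row[column]
        if not values:  # the array is None or empty
            if keep_null_and_empty:
                # Outer behavior: keep the row, null out value and position.
                out = dict(row)
                out[column] = None
                if index_name is not None:
                    out[index_name] = None
                yield out
            continue  # regular behavior: drop the row
        for pos, value in enumerate(values):  # 0-based position per input row
            out = dict(row)
            out[column] = value
            if index_name is not None:
                out[index_name] = pos
            yield out


rows = [
    {"id": 1, "tags": ["red", "blue"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": None},
]
regular = list(explode_rows(rows, "tags"))
outer = list(explode_rows(rows, "tags", index_name="pos", keep_null_and_empty=True))
```

In this model `regular` keeps only the two rows for id 1, while `outer` also keeps ids 2 and 3 with both the value and the position set to None.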

Compatibility

  • Matches PySpark semantics for posexplode/posexplode_outer (0-based positions; null position for outer rows)
  • Backward compatible defaults
  • Protobuf/serde extended; existing fields untouched

Examples (with results)

df = session.create_dataframe({
    "id": [1, 2, 3],
    "tags": [["red", "blue"], [], None],
})
  • Regular explode (filters empty/null)
df.explode("tags").to_polars()
| id | tags |
|----|------|
| 1  | red  |
| 1  | blue |
  • Outer explode (preserves rows, yields nulls)
df.explode_outer("tags").to_polars()
| id | tags |
|----|------|
| 1  | red  |
| 1  | blue |
| 2  | null |
| 3  | null |
  • posexplode (position + value; filters empty/null)
df.posexplode("tags").to_polars()
| id | pos | col  |
|----|-----|------|
| 1  | 0   | red  |
| 1  | 1   | blue |
  • posexplode_outer (position + value; preserves rows; nulls for empty/null)
df.posexplode_outer("tags").to_polars()
| id | pos  | col  |
|----|------|------|
| 1  | 0    | red  |
| 1  | 1    | blue |
| 2  | null | null |
| 3  | null | null |
  • explode_with_index (default names; filters empty/null)
df2 = session.create_dataframe({
    "id": [1, 2],
    "tags": [["x", "y"], ["z"]],
})
df2.explode_with_index("tags").to_polars()
| id | index | tags |
|----|-------|------|
| 1  | 0     | x    |
| 1  | 1     | y    |
| 2  | 0     | z    |
  • explode_with_index (custom names; outer behavior)
df3 = session.create_dataframe({
    "id": [1, 2, 3],
    "letters": [["a", "b"], [], None],
})
df3.explode_with_index("letters", index_name="pos", value_name="val", keep_null_and_empty=True).to_polars()
| id | pos  | val  |
|----|------|------|
| 1  | 0    | a    |
| 1  | 1    | b    |
| 2  | null | null |
| 3  | null | null |
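
The examples above suggest that posexplode and posexplode_outer behave like explode_with_index with fixed column names (`pos`, `col`) and a fixed `keep_null_and_empty` value. The sketch below illustrates that relationship, plus a round trip in which an omitted `value_name` stays None. Everything here is an assumption for illustration: `ExplodeWithIndexSpec` and `round_trip` are hypothetical, and JSON stands in for the Protobuf serde that the PR actually uses.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ExplodeWithIndexSpec:
    """Hypothetical stand-in for the ExplodeWithIndex plan node."""
    column: str
    index_name: str = "index"
    value_name: Optional[str] = None  # None means "reuse the input column name"
    keep_null_and_empty: bool = False


def posexplode(column: str) -> ExplodeWithIndexSpec:
    # Assumed mapping: fixed names, null/empty arrays filtered out.
    return ExplodeWithIndexSpec(column, index_name="pos", value_name="col")


def posexplode_outer(column: str) -> ExplodeWithIndexSpec:
    # Assumed mapping: same names, null/empty arrays preserved.
    return ExplodeWithIndexSpec(
        column, index_name="pos", value_name="col", keep_null_and_empty=True
    )


def round_trip(spec: ExplodeWithIndexSpec) -> ExplodeWithIndexSpec:
    # JSON is only a stand-in here; None is encoded as null and decoded back to None.
    return ExplodeWithIndexSpec(**json.loads(json.dumps(asdict(spec))))


default_spec = ExplodeWithIndexSpec("tags")
restored_default = round_trip(default_spec)
restored_outer = round_trip(posexplode_outer("tags"))
```

Keeping a single spec with optional names is one way to let all four public functions share one logical node, which is the consolidation this PR describes.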

Testing and quality

  • Lints clean
  • Full suite passing (1198 passed, 23 skipped)
  • Added serde and API tests covering optional parameters and outer variants

Contributor Author

bcallender commented Oct 4, 2025

@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 0dd2fab to 082e2db Compare October 4, 2025 04:01
@bcallender bcallender marked this pull request as ready for review October 4, 2025 04:10
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 082e2db to fffb538 Compare October 6, 2025 17:17
@bcallender bcallender force-pushed the feat/explode-with-index branch 2 times, most recently from 7a31f76 to 10c1eb7 Compare October 6, 2025 17:20
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from fffb538 to 3c163b7 Compare October 13, 2025 17:24
@bcallender bcallender force-pushed the feat/explode-with-index branch 4 times, most recently from cbe46cd to 8c3ed51 Compare October 13, 2025 18:06
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 3c163b7 to d12deee Compare October 13, 2025 20:57
@bcallender bcallender force-pushed the feat/explode-with-index branch from 8c3ed51 to 9dd1148 Compare October 13, 2025 20:57
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from d12deee to fa21928 Compare October 13, 2025 21:02
@bcallender bcallender force-pushed the feat/explode-with-index branch from 9dd1148 to 718a5b9 Compare October 13, 2025 21:02
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from fa21928 to 796dbb9 Compare October 14, 2025 19:24
@bcallender bcallender force-pushed the feat/explode-with-index branch 3 times, most recently from eb9d019 to 8771644 Compare October 15, 2025 21:55
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 9eeb753 to e79ec74 Compare October 15, 2025 22:16
@bcallender bcallender force-pushed the feat/explode-with-index branch from 8771644 to 1c5aeb9 Compare October 15, 2025 22:16
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from e79ec74 to 0c8d0aa Compare October 15, 2025 22:23
@bcallender bcallender force-pushed the feat/explode-with-index branch from 1c5aeb9 to 47c7af9 Compare October 15, 2025 22:23
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 0c8d0aa to babcf09 Compare October 15, 2025 22:33
@bcallender bcallender force-pushed the feat/explode-with-index branch from 47c7af9 to 4ed937b Compare October 15, 2025 22:33
Contributor

@YoungVor YoungVor left a comment

Nearly finished a pass. Implementation and tests look great; I had a few comments on the API. I'll finish my pass tomorrow.

self._session_state,
)

def explode_with_index(
Contributor

maybe posexplode_with_fields?

Contributor Author

Hmm I don't think so -- this is just a superset of the functionality of those two functions, so it makes sense to keep it named separately, at least imo.

def explode_with_index(
self,
column: ColumnOrName,
index_name: str = "index",
Contributor

Any reason we couldn't keep everything as 'pos', so pos_name, defaulting to 'pos'? It's slightly strange to me to have the descriptions and default depart from the posexplode functions.

in the example you could show that you can change it to index.

Contributor Author

@bcallender bcallender Nov 3, 2025

I still think it makes sense for the names of this function to be internally consistent (index), but I think it does make sense to change the default to pos to match the existing functions.

column: Name of array column to explode (as string) or Column expression.
index_name: Name for the column containing 0-based array positions (default: "index").
value_name: Optional name for the exploded value column. If None, uses the original column name.
keep_null_and_empty: If True, preserves rows where the array is null or empty (default: False).
Contributor

Could call this 'outer', or at least mention that the behavior would mimic explode/explode_outer.

Contributor Author

added clarification. I think this is a more explicit name, but mentioned that it is the same as regular posexplode (False) or posexplode_outer (True).

@YoungVor
Contributor

Finished my pass, looks great! I love the consolidation of code with a single explode_with_index logical expression.

A few nits, but mostly around the API documentation

@bcallender bcallender force-pushed the feat/regexp-text-functions branch from babcf09 to 6813598 Compare October 21, 2025 17:59
@bcallender bcallender force-pushed the feat/explode-with-index branch from 4ed937b to f048b72 Compare October 21, 2025 17:59
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 6813598 to 60cf0bd Compare October 22, 2025 17:38
@bcallender bcallender force-pushed the feat/explode-with-index branch from f048b72 to 05bda1e Compare October 22, 2025 17:38
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 60cf0bd to 26c24d7 Compare October 22, 2025 22:11
@bcallender bcallender force-pushed the feat/explode-with-index branch from 05bda1e to 6140a8a Compare October 22, 2025 22:11
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 26c24d7 to 335c273 Compare October 22, 2025 23:10
@bcallender bcallender force-pushed the feat/explode-with-index branch 2 times, most recently from 221d3f5 to 6a43034 Compare October 23, 2025 18:18
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 88d048b to 27558aa Compare October 25, 2025 00:00
@bcallender bcallender force-pushed the feat/explode-with-index branch from 6a43034 to 615972c Compare October 25, 2025 00:00
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 27558aa to 83420e9 Compare October 27, 2025 17:57
@bcallender bcallender force-pushed the feat/explode-with-index branch from 615972c to 2855f73 Compare October 27, 2025 17:57
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 83420e9 to 404bb85 Compare November 3, 2025 18:41
@bcallender bcallender force-pushed the feat/explode-with-index branch from 2855f73 to 40feadc Compare November 3, 2025 18:42
…`/`posexplode_outer` (via explode_with_index)
@bcallender bcallender force-pushed the feat/regexp-text-functions branch from 404bb85 to 67e400a Compare November 3, 2025 19:47
@bcallender bcallender force-pushed the feat/explode-with-index branch from d531650 to fbfaa0f Compare November 3, 2025 19:47
@bcallender bcallender force-pushed the feat/explode-with-index branch from fbfaa0f to 686ce7f Compare November 3, 2025 22:39
@bcallender bcallender requested a review from YoungVor November 3, 2025 22:39

3 participants