
Conversation

xiaonanyang-db (Contributor) commented Oct 30, 2025

What changes were proposed in this pull request?

When parsing XML data with parse_xml that contains decimal numbers with very large
exponents (e.g., "1E+2147483647"), the conversion to Variant type fails with:

```
java.lang.ArithmeticException: BigInteger would overflow supported range
    at java.base/java.math.BigDecimal.setScale(BigDecimal.java:3000)
    at org.apache.spark.sql.catalyst.xml.StaxXmlParser$.org$apache$spark$sql$catalyst$xml$StaxXmlParser$$appendXMLCharacterToVariant(StaxXmlParser.scala:1335)
```

This happens because the parser calls `setScale(0)` to normalize the decimal. When the scale is extremely negative (e.g., -2147483647), `setScale(0)` attempts to multiply the unscaled value by 10^2147483647, which overflows BigInteger's supported range.
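The failure is reproducible outside Spark with plain `java.math.BigDecimal`; a minimal sketch using the exponent from the report:

```java
import java.math.BigDecimal;

public class DecimalOverflowDemo {
    public static void main(String[] args) {
        // "1E+2147483647" parses fine: unscaled value 1, scale -2147483647.
        BigDecimal d = new BigDecimal("1E+2147483647");
        System.out.println(d.scale()); // -2147483647

        try {
            // setScale(0) must multiply the unscaled value by 10^2147483647;
            // materializing that power of ten exceeds BigInteger's maximum
            // supported magnitude, so the call throws.
            d.setScale(0);
        } catch (ArithmeticException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```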

This PR catches all errors raised while parsing a string as a decimal in the XML variant parser and falls back to treating the value as a string.
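The fallback pattern can be sketched as follows (a hypothetical helper for illustration, not Spark's actual `StaxXmlParser` code; the real fix catches all errors, while this sketch names the two relevant exception types):

```java
import java.math.BigDecimal;

public class DecimalFallback {
    // Hypothetical helper: try to interpret an XML character value as a
    // decimal; on any parsing/normalization error, keep the raw string.
    static Object parseOrFallback(String text) {
        try {
            BigDecimal dec = new BigDecimal(text);
            // Normalization step analogous to the parser's setScale(0);
            // this is what overflows for huge negative scales.
            return dec.setScale(Math.max(dec.scale(), 0));
        } catch (NumberFormatException | ArithmeticException e) {
            return text; // fall back to string, as this PR does
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrFallback("12.5").getClass().getSimpleName());
        System.out.println(parseOrFallback("1E+2147483647").getClass().getSimpleName());
        System.out.println(parseOrFallback("not-a-number").getClass().getSimpleName());
    }
}
```

With this shape, "12.5" stays a decimal, while both the overflowing exponent and a non-numeric string come back unchanged as strings.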

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 30, 2025
The review comment below is attached to this snippet:

```scala
}

// Try parsing the value as decimal
val decimalParser = ExprUtils.getDecimalParser(options.locale)
```
Contributor commented:
Unrelated to the issue at hand, decimalParser should be reused rather than initializing for every value.
Can the caller pass it instead?

Contributor Author replied:

We may have had this conversation in the original PR. Yes, we can, but we need to change a few function signatures. We can improve it later.

HyukjinKwon (Member) commented:
Merged to master and branch-4.1.

HyukjinKwon pushed a commit that referenced this pull request Nov 2, 2025
…ecimal parsing errors


Closes #52801 from xiaonanyang-db/SPARK-54099.

Authored-by: Xiaonan Yang <xiaonan.yang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 88eef06)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>