[SPARK-54099][SQL] XML variant parser should fall back to string on decimal parsing errors #52801
Conversation
        
          
Three resolved (outdated) review threads on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala.
```scala
// Try parsing the value as decimal
val decimalParser = ExprUtils.getDecimalParser(options.locale)
```
Unrelated to the issue at hand: decimalParser should be reused rather than initialized for every value. Can the caller pass it in instead?
We may have had this conversation in the original PR. Yes, we can, but we need to change a few function signatures. We can improve it later.
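A rough sketch of that follow-up, with hypothetical names (the real StaxXmlParser signatures differ): the caller would build the parser once and thread it through, instead of calling `ExprUtils.getDecimalParser` for every value.

```scala
import java.util.Locale
import org.apache.spark.sql.catalyst.expressions.ExprUtils

// Hypothetical sketch only; not the actual StaxXmlParser structure.
class VariantValueAppender(locale: Locale) {
  // Built once and reused for every value, instead of once per value.
  private val decimalParser: String => java.math.BigDecimal =
    ExprUtils.getDecimalParser(locale)

  def append(value: String): Unit = {
    // the per-value parsing helper would receive `decimalParser` here
  }
}
```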
Merged to master and branch-4.1.
    
[SPARK-54099][SQL] XML variant parser should fall back to string on decimal parsing errors
### What changes were proposed in this pull request?
When parsing XML data with `parse_xml` that contains decimal numbers with very large
exponents (e.g., "1E+2147483647"), the conversion to Variant type fails with:
```
java.lang.ArithmeticException: BigInteger would overflow supported range
    at java.base/java.math.BigDecimal.setScale(BigDecimal.java:3000)
    at org.apache.spark.sql.catalyst.xml.StaxXmlParser$.org$apache$spark$sql$catalyst$xml$StaxXmlParser$$appendXMLCharacterToVariant(StaxXmlParser.scala:1335)
```
This happens because the parser calls `setScale(0)` to normalize the decimal. When the scale is extremely negative (e.g., -2147483647), `setScale(0)` attempts to multiply the unscaled value by 10^2147483647, causing a BigInteger overflow.
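The failure can be reproduced directly on `java.math.BigDecimal`, independent of Spark:

```scala
import java.math.BigDecimal

// "1E+2147483647" parses to unscaled value 1 with scale -2147483647.
val d = new BigDecimal("1E+2147483647")
try {
  // Rescaling to 0 requires multiplying the unscaled value by 10^2147483647,
  // which needs more bits than BigInteger can represent.
  d.setScale(0)
} catch {
  case e: ArithmeticException =>
    println(e.getMessage) // "BigInteger would overflow supported range"
}
```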
This PR catches all errors raised while parsing a string as a decimal in the XML variant parser and falls back to storing the value as a string.
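In outline, the decimal attempt is wrapped so that any failure falls back to the original string. The sketch below illustrates the pattern with an `Either` result rather than the real VariantBuilder append calls; the helper name is hypothetical.

```scala
import java.math.BigDecimal
import scala.util.control.NonFatal

// Hypothetical sketch of the fallback pattern, not the actual patch.
def parseDecimalOrString(
    value: String,
    decimalParser: String => BigDecimal): Either[BigDecimal, String] = {
  try {
    // Any error raised while interpreting the value as a decimal is caught;
    // in the real parser the setScale(0) normalization sits inside the same try.
    Left(decimalParser(value))
  } catch {
    case NonFatal(_) =>
      Right(value) // fall back to keeping the original string
  }
}
```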
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT.
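For illustration, a reproduction in the spirit of the new test might look like the sketch below; the exact call shape of `parse_xml` and the resulting output are assumptions, not the actual test code.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction sketch; the real unit test may differ.
val spark = SparkSession.builder().appName("repro").master("local[1]").getOrCreate()
val df = spark.sql(
  "SELECT parse_xml('<ROW><price>1E+2147483647</price></ROW>') AS v")
// Before this fix the conversion threw "BigInteger would overflow supported
// range"; with the fix the value is kept inside the variant as a string.
df.show(truncate = false)
```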
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #52801 from xiaonanyang-db/SPARK-54099.
Authored-by: Xiaonan Yang <xiaonan.yang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 88eef06)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
    