Update dependency org.jsoup:jsoup to v1.21.2 #21
This PR contains the following updates:
1.14.3 -> 1.21.2
Release Notes
jhy/jsoup (org.jsoup:jsoup)
v1.21.2
Changes
Normalizer#normalize(String, bool) and Attribute#shouldCollapseAttribute(Document.OutputSettings). These will be removed in a future version.
Connection#sslSocketFactory(SSLSocketFactory) in favor of the new Connection#sslContext(SSLContext). Using sslSocketFactory will force the use of the legacy HttpURLConnection implementation, which does not support HTTP/2. #2370
Improvements
Connection.Response#statusMessage() to return a simple loggable string message (e.g. "OK") when using the HttpClient implementation, which doesn't otherwise return any server-set status message. #2356
Attributes#size() and Attributes#isEmpty() now exclude any internal attributes (such as user data) from their count. This aligns with the attributes' serialized output and iterator. #2369
Connection#sslContext(SSLContext) to provide a custom SSL (TLS) context to requests, supporting both the HttpClient and the legacy HttpURLConnection implementations. #2370
element.child(0).remove(), and when using Parser#parseBodyFragment() to parse a large number of direct children. #2373.
Bug Fixes
NodeTraversor, if a last child element was removed during the head() call, the parent would be visited twice. #2355.
Attributes#size() and Attributes#isEmpty(). #2356
Element#children() on the same element concurrently, a race condition could happen when the method was generating the internal child element cache (a filtered view of its child nodes). Since concurrent reads of DOM objects should be threadsafe without external synchronization, this method has been updated to execute atomically. #2366
v1.21.1
Changes
:matchText pseudo-selector due to its side effects on the DOM; use the new ::textnode selector and the Element#selectNodes(String css, Class type) method instead. #2343
Connection.Response#bufferUp() in lieu of Connection.Response#readFully() which can throw a checked IOException.
Validate#ensureNotNull (replaced by typed Validate#expectNotNull); protected HTML appenders from Attribute and Node.
Improvements
Selector to support direct matching against nodes such as comments and text nodes. For example, you can now find an element that follows a specific comment: ::comment:contains(prices) + p will select p elements immediately after a <!-- prices: --> comment. Supported types include ::node, ::leafnode, ::comment, ::text, ::data, and ::cdata. Node contextual selectors like ::node:contains(text), :matches(regex), and :blank are also supported. Introduced Element#selectNodes(String css) and Element#selectNodes(String css, Class nodeType) for direct node selection. #2324
TagSet#onNewTag(Consumer<Tag> customizer): register a callback that’s invoked for each new or cloned Tag when it’s inserted into the set. Enables dynamic tweaks of tag options (for example, marking all custom tags as self-closing, or everything in a given namespace as preserving whitespace).
TokenQueue and CharacterReader autocloseable, to ensure that they will release their buffers back to the buffer pool, for later reuse.
Selector#evaluatorOf(String css), as a clearer way to obtain an Evaluator from a CSS query. An alias of QueryParser.parse(String css).
TagSet) in a foreign namespace (e.g. SVG) can be configured to parse as data tags.
NodeVisitor#traverse(Node) to simplify node traversal calls (vs. importing NodeTraversor).
Connection#readFully() as a replacement for Connection#bufferUp() with an explicit IOException. Similarly, added Connection#readBody() over Connection#body(). Deprecated Connection#bufferUp(). #2327
< and > characters are now escaped in attributes. This helps prevent a class of mutation XSS attacks. #2337
Connection to prefer using the JDK's HttpClient over HttpURLConnection, if available, to enable HTTP/2 support by default. Users can disable via -Djsoup.useHttpClient=false. #2340
Bug Fixes
script in a svg foreign context should be parsed as script data, not text. #2320
Tag#isFormSubmittable() was updating the Tag's options. #2323
v1.20.1
Changes
<foo />) to close HTML elements by default. Foreign content (SVG, MathML), and content parsed with the XML parser, still
supports self-closing tags. If you need specific HTML tags to support self-closing, you can register a custom tag via
the TagSet configured in Parser.tagSet(), using Tag#set(Tag.SelfClose). Standard void tags (such as <img>, <br>, etc.) continue to behave as usual and are not affected by this change. #2300.
ChangeNotifyingArrayList, Document.updateMetaCharsetElement(), Document.updateMetaCharsetElement(boolean), HtmlTreeBuilder.isContentForTagData(String), Parser.isContentForTagData(String), Parser.setTreeBuilder(TreeBuilder), Tag.formatAsBlock(), Tag.isFormListed(), TokenQueue.addFirst(String), TokenQueue.chompTo(String), TokenQueue.chompToIgnoreCase(String), TokenQueue.consumeToIgnoreCase(String), TokenQueue.consumeWord(), TokenQueue.matchesAny(String...)
Functional Improvements
Tags, and provide a cleaner path for ongoing improvements. The specific HTML produced by the pretty-printer may be
different from previous versions. #2286.
TagSet tag collection. Their properties can impact both the parse and how content is
serialized (output as HTML or XML). #2285.
Element.cssSelector() will prefer to return shorter selectors by using ancestor IDs when available and unique. E.g. #id > div > p instead of html > body > div > div > p #2283.
Elements.deselect(int index), Elements.deselect(Object o), and Elements.deselectAll() methods to remove elements from the Elements list without removing them from the underlying DOM. Also added Elements.asList() method to get a modifiable list of elements without affecting the DOM. (Individual Elements remain linked to the DOM.) #2100.
Connection.requestBodyStream(InputStream stream). #1122.
Attributes. Also, added Tag#prefix(), Tag#localName(), Attribute#prefix(), Attribute#localName(), and Attribute#namespace() to retrieve these. #2299.
Element#cssSelector() will emit appropriately escaped selectors, and the QueryParser supports those. Added Selector.escapeCssIdentifier() and Selector.unescapeCssIdentifier(). #2297, #2305
Structure and Performance Improvements
QueryParser into a clearer recursive descent parser. #2310.
div >> p) will throw an explicit parse exception. #2311.
#2307.
HTML. #2304.
Parser instances threadsafe, so that inadvertent use of the same instance across threads will not lead to errors. For actual concurrency, use Parser#newInstance() per thread. #2314.
Bug Fixes
serializing. #1496.
encoded). #1743.
Document to the W3C DOM in W3CDom, elements with an attribute in an undeclared namespace now get a declaration of xmlns:prefix="undefined". This allows subsequent serialization to XML via W3CDom.asString() to succeed. #2087.
StreamParser could emit the final elements of a document twice, due to how onNodeCompleted was fired when closing out the stack. #2295.
? in <?xml version="1.0"?> would incorrectly emit an error. #2298.
Element#cssSelector() on an element with combining characters in the class or ID now produces the correct output. #1984.
v1.19.1
Changes
Jsoup.connect(), when running on Java 11+, via the Java HttpClient implementation. #2257.
System.setProperty("jsoup.useHttpClient", "true"); to enable making requests via the HttpClient instead, which will enable http/2 support, if available. This will become the default in a later version of jsoup, so now is a good time to validate it.
that as a Multi-Release JAR.
HttpClient impl is not available in your JRE, requests will continue to be made via HttpURLConnection (in http/1.1 mode).
developers need to enable core library desugaring. The minimum Java version remains Java 8. #2173
org.jsoup.UncheckedIOException (replace with java.io.UncheckedIOException); moved previously deprecated method Element Element#forEach(Consumer) to void Element#forEach(Consumer()). #2246
Document#updateMetaCharsetElement(boolean) and Document#updateMetaCharsetElement(), as the setting had no effect. When Document#charset(Charset) is called, the document's meta charset or XML encoding instruction is always set. #2247
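The jsoup.useHttpClient switch mentioned in these notes is an ordinary JVM system property, so it can be passed as -Djsoup.useHttpClient=true on the command line or set programmatically. A minimal sketch of the programmatic form follows. The class and method names are hypothetical; only the property name comes from the release notes, and setting it before the first request is made is an assumption about the safest ordering rather than a documented requirement.

```java
// Hypothetical helper: only the property name "jsoup.useHttpClient" is taken from the release notes.
final class JsoupTransportToggle {
    static final String PROPERTY = "jsoup.useHttpClient";

    /** Sets the opt-in/opt-out flag; returns the previous property value, or null if it was unset. */
    static String useHttpClient(boolean enabled) {
        return System.setProperty(PROPERTY, Boolean.toString(enabled));
    }

    /**
     * Reports whether the property is currently set to "true".
     * This reflects the property only, not the default jsoup applies when it is unset.
     */
    static boolean isExplicitlyEnabled() {
        return Boolean.getBoolean(PROPERTY);
    }

    public static void main(String[] args) {
        // Assumption: flip the switch early, before any jsoup connection is created.
        useHttpClient(true);
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```

Because 1.21.1 makes the HttpClient transport the default where available, the same helper called with false is one way to fall back to the legacy implementation while validating this upgrade.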
Improvements
Safelist that preserves relative links, the isValid() method will now consider these links valid. Additionally, the enforced attribute rel=nofollow will only be added to external links when configured in the safelist. #2245
Element#selectStream(String query) and Element#selectStream(Evaluator) methods, that return a Stream of matching elements. Elements are evaluated and returned as they are found, and the stream can be terminated early. #2092
Element objects now implement Iterable, enabling them to be used in enhanced for loops.
Reader via Parser#parseFragmentInput(Reader, Element, String). #1177
jsoup-examples.jar. #1702
#id .class (and other similar descendant queries) by around 4.6x, by better balancing the Ancestor evaluator's cost function in the query planner. #2254
<isindex> tags, which would autovivify a form element with labels. This is no longer in the spec.
Elements.selectFirst(String cssQuery) and Elements.expectFirst(String cssQuery), to select the first matching element from an Elements list. #2263
through the HTML parser's bogus comment handler. Serialization for non-doctype declarations no longer ends with a spurious !. #2275
< are normalized to _ to ensure valid XML. For example, <foo<bar> becomes <foo_bar>, as XML does not allow < in element names, but HTML5 does. #2276
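The tag-name rule from #2276 can be pictured as a plain character substitution. The sketch below is not jsoup's implementation. It is a standalone illustration, with a hypothetical class and method name, of replacing < in an HTML5-permitted element name with _ so that the name stays valid in XML output. It handles only the < character mentioned in the notes.

```java
// Illustrative only: a hypothetical stand-in for the normalization described in the notes.
final class XmlTagNameSketch {

    /** Replaces every '<' in a tag name with '_'; all other characters pass through unchanged. */
    static String normalize(String htmlTagName) {
        return htmlTagName.replace('<', '_');
    }

    public static void main(String[] args) {
        // The release-note example: <foo<bar> becomes <foo_bar>
        System.out.println("<" + normalize("foo<bar") + ">"); // prints <foo_bar>
    }
}
```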
Bug Fixes
; in an attribute name, it could not be converted to a W3C DOM element, and so subsequent XPath queries could miss that element. Now, the attribute name is more completely normalized. #2244
"name". #2241
Connection, skip cookies that have no name, rather than throwing a validation exception. #2242
java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; could be thrown when calling Response#body() after parsing from a URL and the buffer size was exceeded. #2250
null InputStream inputs to Jsoup.parse(InputStream stream, ...), by returning an empty Document. #2252
template tag containing an li within an open li would be parsed incorrectly, as it was not recognized as a "special" tag (which have additional processing rules). Also, added the SVG and MathML namespace tags to the list of special tags. #2258
template tag containing a button within an open button would be parsed incorrectly, as the "in button scope" check was not aware of the template element. Corrected other instances including MathML and SVG elements, also. #2271
:nth-child selector with a negative digit-less step, such as :nth-child(-n+2), would be parsed incorrectly as a positive step, and so would not match as expected. #1147
doc.charset(charset) on an empty XML document would throw an IndexOutOfBoundsException. #2266
StructuralEvaluator (e.g., a selector ancestor chain like A B C) by ensuring cache reset calls cascade to inner members. #2277
doc.clone().append(html) were not supported. When a document was cloned, its Parser was not cloned but was a shallow copy of the original parser. #2281
v1.18.3
Bug Fixes
-, ., or digits were incorrectly marked as invalid and removed. 2235
v1.18.2
Improvements
down between -6% and -89%, and throughput improved up to +143% for small inputs. Most input sizes will see
throughput increases of ~ 20%. These performance improvements come through recycling the backing
byte[] and char[] arrays used to read and parse the input. 2186
html() and Entities.escape() when the input contains UTF characters in a supplementary plane, by around 49%. 2183
FormElement.elements() now reflect changes made to the DOM, subsequently to the original parse. 2140
TreeBuilder, the onNodeInserted() and onNodeClosed() events are now also fired for the outermost / root Document node. This enables source position tracking on the Document node (which was previously unset). And it also enables the node traversor to see the outer Document node. 2182
Elements#set(). 2212
Bug Fixes
Element.cssSelector() would fail if the element's class contained a * character. 2169
untracked. 2175
html, it should be parsed in QuirksMode. 2197
div:has(span + a), the has() component was not working correctly, as the inner combining query caused the evaluator to match those against the outer's siblings, not children. 2187
:has() components in a nested :has() might incorrectly execute. 2131
Connection.Response#cookies() will provide the last one set. Generally it is better to use the Jsoup.newSession method to maintain a cookie jar, as that applies appropriate path selection on cookies when making requests. 1831
attribute). 2207
created (html or body). 2204
< as part of a tag name, instead of emitting it as a character node. 2230
< as the start of an attribute name, vs creating a new element. The previous behavior was intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to how browsers behave. 1483
v1.18.1
Improvements
StreamParser provides a progressive parse of its input. As each Element is completed, it is emitted via a Stream or Iterator interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable. Elements (or their children) may be removed from the DOM during the parse, e.g. to conserve memory, providing a mechanism to parse an input document that would otherwise be too large to fit into memory, yet still providing a DOM interface to the document and its elements. Additionally, the parser provides a selectFirst(String query) / selectNext(String query), which will run the parser until a hit is found, at which point the parse is suspended. It can be resumed via another select() call, or via the stream() or iterator() methods. 2096
parsed). Supported on both a session and a single connection level. 2164, 656
Path accepting parse methods: Jsoup.parse(Path), Jsoup.parse(path, charsetName, baseUri, parser), etc. 2055
button tag configuration to include a space between multiple button elements in the Element.text() method. 2105
ns|* all elements in namespace Selector. 1811
_, vs being stripped. This should make the process clearer, and generally prevent an invalid attribute name being coerced unexpectedly. 2143
Changes
Bug Fixes
to -1. 2106
{, } in the path, a Malformed URL exception would be thrown (if in development), or the URL might otherwise not be escaped correctly (if in production). The URL encoding process has been improved to handle these characters correctly. 2142
W3CDom with a custom output Document, a Null Pointer Exception would be thrown. 2114
:has() selector did not match correctly when using sibling combinators (e.g. h1:has(+h2)). 2137
:empty selector incorrectly matched elements that started with a blank text node and were followed by non-empty nodes, due to an incorrect short-circuit. 2130
Element.cssSelector() would fail with "Did not find balanced marker" when building a selector for elements that had a ( or [ in their class names. And selectors with those characters escaped would not match as expected. 2146
Entities.escape(string) to make the escaped text suitable for both text nodes and attributes (previously was only for text nodes). This does not impact the output of Element.html() which correctly applies a minimal escape depending on if the use will be for text data or in a quoted attribute. 1278
<base href> URL, in the normalizing regex. 2165
v1.17.2
Improvements
Element.attribute(String) and Attributes.attribute(String) to more simply obtain an Attribute object. 2069
via Attribute.setKey(String)), the source range is now still tracked in Attribute.sourceRange(). 2070
[*] element with any attribute selector. And also restored support for selecting by an empty attribute name prefix ([^]). 2079
Bug Fixes
mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not
tracked correctly. 2067
exception was thrown. 2068
character. 2074
parent [attr=va], other, the, OR was binding to [attr=va] instead of parent [attr=va], causing incorrect selections. The fix includes an EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. 2073
sections would have an extraneous CData section added, causing script execution errors. Now, the data content is
emitted in an HTML/XML/XHTML polyglot format, if the data is not already within a CData section. 2078
:has evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. 2088
Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in change-archive.txt.
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.