New function LISTAGG #8689
This PR implements LISTAGG, one of the aggregate functions that depend on the order of the input stream records. The full list can be seen here: #7632
INSERT INTO TEST_T values(4, 'D', 'B', 'K', true, 'Й');
COMMIT;

SELECT LISTAGG (ALL COL4, ':') AS FROM TEST_T;
In the syntax format section, <within group specification> is mandatory but does not exist in some examples.
The SQL specification declares <within group specification> as mandatory. However, IMHO it's quite restrictive, and neither Oracle nor DB2 follows that rule; they make it optional. Given that LIST and LISTAGG share the same syntax in this PR, we've also made <within group specification> optional. So the easiest solution is to fix the README ;-)
Or we may go the standard way and separate the legacy LIST (leave it with the current grammar, without ordering) from LISTAGG (which is strictly standard-compliant) at the parser level. But IMHO it would be annoying for users to have to choose between them depending on whether they need ordering or not. So personally I'd keep everything "as is" and just fix the docs.
Other opinions?
I agree, LIST and LISTAGG should be complete synonyms.
I would also remove the mention of ON OVERFLOW from the documentation. It's standard, but we don't support it. It could be mentioned if it were simply ignored, but as things stand, specifying it leads to errors:
SELECT
LISTAGG(TRIM(RDB$RELATION_NAME), ';' ON OVERFLOW ERROR) WITHIN GROUP(ORDER BY RDB$RELATION_NAME) AS REL_NAMES
FROM RDB$RELATIONS;
Invalid token.
Dynamic SQL Error.
SQL error code = -104.
Token unknown - line 2, column 52.
ERROR.
----------------------------------
SQLCODE: -104
SQLSTATE: 42000
GDSCODE: 335544569
SELECT
LISTAGG(TRIM(RDB$RELATION_NAME), ';' ON OVERFLOW TRUNCATE '...' WITHOUT COUNT) WITHIN GROUP(ORDER BY RDB$RELATION_NAME) AS REL_NAMES
FROM RDB$RELATIONS;
Invalid token.
Dynamic SQL Error.
SQL error code = -104.
Token unknown - line 2, column 52.
TRUNCATE.
----------------------------------
SQLCODE: -104
SQLSTATE: 42000
GDSCODE: 335544569
Strange. I need to check it out
If we remove ON OVERFLOW from the docs, then I believe we should remove it from the parser too.
ON OVERFLOW can be kept if it will not cause errors.
=======
C:A:D:B

SELECT LISTAGG (ALL COL6, ':')FROM TEST_T;
Suggested change:
- SELECT LISTAGG (ALL COL6, ':')FROM TEST_T;
+ SELECT LISTAGG (ALL COL6, ':') FROM TEST_T;
for (auto& nodeOrder : sort->expressions)
{
    dsc toDesc = *(descOrder++);
    toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;
Suggested change:
- toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;
+ toDesc.dsc_address = data + (IPTR) toDesc.dsc_address;
if (IS_INTL_DATA(fromDsc))
    INTL_string_to_key(tdbb, INTL_TEXT_TO_INDEX(fromDsc->getTextType()),
        fromDsc, &toDesc, INTL_KEY_UNIQUE);
Suggested change:
- if (IS_INTL_DATA(fromDsc))
-     INTL_string_to_key(tdbb, INTL_TEXT_TO_INDEX(fromDsc->getTextType()),
-         fromDsc, &toDesc, INTL_KEY_UNIQUE);
+ if (IS_INTL_DATA(fromDsc))
+ {
+     INTL_string_to_key(tdbb, INTL_TEXT_TO_INDEX(fromDsc->getTextType()),
+         fromDsc, &toDesc, INTL_KEY_UNIQUE);
+ }
}

dsc toDesc = asb->desc;
toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;
Suggested change:
- toDesc.dsc_address = data + (IPTR)toDesc.dsc_address;
+ toDesc.dsc_address = data + (IPTR) toDesc.dsc_address;
if (distinct)
    desc.dsc_address = data + (asb->intl ? asb->keyItems[1].getSkdOffset() : 0);
else
    desc.dsc_address = data + (IPTR)asb->desc.dsc_address;
Suggested change:
- desc.dsc_address = data + (IPTR)asb->desc.dsc_address;
+ desc.dsc_address = data + (IPTR) asb->desc.dsc_address;
if (sort && distinct)
{
    ValueExprNode* const sortNode = *sort->expressions.begin();
    if (!arg->sameAs(sortNode, false) || sort->expressions.getCount() > 1)
Should they be identical? Why?
I think it should follow rules like those for GROUP BY.
Good point. Currently, we sort only once and this is a good bonus. If we follow your suggestion and allow slightly different expressions, then we should either sort twice or ignore the user-specified ordering after DISTINCT. BTW, in this PR it also seems to be ignored -- LISTAGG(DISTINCT COL) WITHIN GROUP (ORDER BY COL DESC) would produce ASC-ordered output. But I suppose it should work the same way as for plain SELECT DISTINCT(COL) FROM T ORDER BY COL DESC, i.e. respect the ORDER BY ordering and optimize the sorts (merge two sorts into one) only if they fully (expressions / directions / NULLs placement) match each other.
Why are direction and NULLs placement important for DISTINCT?
Nulls placement is not important, I agree, as NULLs are skipped by all aggregate functions.
Per the standard, DISTINCT eliminates duplicates in the ordered (if specified) result set; it should not change the user-defined ordering. Why do you think LISTAGG should behave differently?
It shouldn't, but I don't quite understand why you said:
"i.e. respect the ORDER BY ordering and optimize the sorts (merge two sorts into one) only if they fully (expressions / directions / NULLs placement) match each other."
BTW, IMHO, the sort->expressions.getCount() > 1 condition here is not needed, as it is fine for DISTINCT to use only the first sorting segment.
Ah, sorry, my bad. Surely, direction for the combined sort should be taken from ORDER BY -- like we already do in Optimizer::checkSorts().
"Nulls placement is not important, I agree, as NULLs are skipped by all aggregate functions."
With my suggestion to use the same existing rules, the DISTINCT LISTAGG expression may be something like COALESCE(field, 'z') with an ORDER BY Z.
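A minimal sketch of the behavior this thread converges on, assuming the ORDER BY expression is the same as the aggregate argument: one sort serves both DISTINCT and ORDER BY, and its direction comes from ORDER BY. The names `Direction` and `distinctOrdered` are hypothetical and do not exist in the Firebird sources; NULL handling is left out because NULLs are skipped before sorting.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the ORDER BY direction of a sort segment.
enum class Direction { Asc, Desc };

// Sorts once in the direction requested by ORDER BY, then removes duplicates.
// Equal values are adjacent after either sort, so a single pass of
// std::unique is enough and the requested ordering is preserved.
std::vector<std::string> distinctOrdered(std::vector<std::string> values, Direction dir)
{
    if (dir == Direction::Asc)
        std::sort(values.begin(), values.end());
    else
        std::sort(values.begin(), values.end(), std::greater<std::string>());

    values.erase(std::unique(values.begin(), values.end()), values.end());
    return values;
}
```

With this shape, LISTAGG(DISTINCT COL) WITHIN GROUP (ORDER BY COL DESC) would yield descending output without a second sort. The COALESCE case mentioned above, where the sort expression differs from the argument, is not covered by this sketch.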
const auto keyCount = aggNode->sort->expressions.getCount() * 2;
sort_key_def* sortKey = asb->keyItems.getBuffer(keyCount);

auto const* direction = aggNode->sort->direction.begin();
const auto* for consistency, please.
  // per function.
  return aggInfo.blr == o->aggInfo.blr && aggInfo.name == o->aggInfo.name &&
-     distinct == o->distinct && dialect1 == o->dialect1;
+     distinct == o->distinct && dialect1 == o->dialect1 && sort == o->sort;;
Comparing pointer addresses for sort here does not look correct.
@ChudaykinAlex IMO, the sort node should be added to the list by ListAggNode::getChildren() -- this way you may remove the sort comparison in dsqlMatch() and also remove doPass2(sort) in pass2(), as the sort node will be processed by the inherited methods automagically.
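To illustrate why the pointer comparison was flagged, here is a small self-contained sketch. `SortSpec` and `sameSort` are hypothetical stand-ins, not Firebird types; the fix suggested above relies on getChildren() instead, so this only shows the general difference between address equality and structural equality.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed sort clause; not the Firebird class.
struct SortSpec
{
    std::vector<std::string> expressions;
    std::vector<bool> descending;
};

// Structural match: two absent sorts match, and two present sorts match when
// their contents are equal, even if they are separate objects in memory.
bool sameSort(const SortSpec* a, const SortSpec* b)
{
    if (!a || !b)
        return a == b;  // matches only when both are absent

    return a->expressions == b->expressions && a->descending == b->descending;
}
```

Two aggregates parsed from identical text produce separate node objects, so `a == b` on the pointers is false for them even though `sameSort(a, b)` is true.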
Thanks for the recommendations, I'll fix it soon.
Purpose
The current implementation has an aggregate function LIST, which concatenates field values from multiple rows into a blob. The SQL standard has a similar function called LISTAGG. The major difference is that LISTAGG also supports ordered concatenation.
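The difference can be sketched as follows. This is an illustration only, with a hypothetical `listAgg` helper that is not part of the Firebird sources: NULL values are modeled as empty optionals and skipped, and the `ordered` flag stands in for WITHIN GROUP (ORDER BY value).

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: concatenates the non-NULL values with a delimiter.
// With ordered == false the input order is kept (LIST-like behavior);
// with ordered == true the values are sorted ascending first.
std::string listAgg(const std::vector<std::optional<std::string>>& rows,
                    const std::string& delimiter, bool ordered)
{
    std::vector<std::string> values;
    for (const auto& row : rows)
    {
        if (row)  // NULLs are skipped
            values.push_back(*row);
    }

    if (ordered)
        std::stable_sort(values.begin(), values.end());

    std::string result;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        if (i > 0)
            result += delimiter;
        result += values[i];
    }
    return result;
}
```

For rows C, A, NULL, D, B and the delimiter ':', the unordered call keeps the arrival order, while the ordered call returns the values sorted.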
Syntax and rules
The legacy LIST syntax is preserved for backward compatibility; LISTAGG is added to cover the standard features.
The standard defines a <listagg overflow clause> rule, which is intended to raise an error when the output value overflows. Since the LIST function always returns a BLOB, this rule was judged to be meaningless here. It is not implemented and is silently ignored if specified.
If DISTINCT is specified for LISTAGG, then the ORDER BY <sort specification list> must fully match the <character value expression>.
If DISTINCT is specified, WITHIN GROUP must obey this restriction, and its presence does not affect subsequent execution.
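The restriction can be pictured with a small check. This is a sketch under assumptions: `isValidDistinctOrder` is a hypothetical name, and expressions are compared as plain text here, whereas the PR compares expression nodes.

```cpp
#include <string>
#include <vector>

// Hypothetical validation of the DISTINCT rule described above.
// argument: the aggregated <character value expression>
// orderBy:  the expressions listed in WITHIN GROUP (ORDER BY ...)
bool isValidDistinctOrder(bool distinct, const std::string& argument,
                          const std::vector<std::string>& orderBy)
{
    // The restriction applies only when DISTINCT and an ordering are both present.
    if (!distinct || orderBy.empty())
        return true;

    // Exactly one sort expression is allowed and it must equal the argument.
    return orderBy.size() == 1 && orderBy.front() == argument;
}
```

Under this rule, LISTAGG(DISTINCT COL1) WITHIN GROUP (ORDER BY COL1) is accepted, while ordering by COL2, or by COL1 and COL2 together, is rejected.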