Remove Whitespace Tokens from Parser #2077
This PR implements a significant architectural refactoring by moving whitespace filtering from the parser to the tokenizer. Instead of emitting whitespace tokens (spaces, tabs, newlines, comments) and filtering them throughout the parser logic, the tokenizer now consumes whitespace during tokenization and never emits these tokens.
While some duplicated logic still remains in the parser (to be addressed in future PRs), this change eliminates a substantial amount of looping overhead. It also lays the groundwork for a cleaner streaming design, in which tokens are consumed as statements are parsed, with no parser memory and only local context passed between parser function calls.
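As a rough sketch of the tokenizer-side change (illustrative names only, not this crate's actual types), the idea is that whitespace is consumed inside the tokenizer and never materialized as a token:

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Symbol(char),
    Eof,
}

// Whitespace is consumed inside the tokenizer, so the parser never
// sees a whitespace token and needs no skipping loops.
fn next_token(chars: &mut Peekable<Chars<'_>>) -> Token {
    // Swallow any run of spaces, tabs, and newlines without emitting a token.
    while matches!(chars.peek(), Some(c) if c.is_whitespace()) {
        chars.next();
    }
    match chars.next() {
        None => Token::Eof,
        Some(c) if c.is_alphanumeric() || c == '_' => {
            let mut word = c.to_string();
            while matches!(chars.peek(), Some(n) if n.is_alphanumeric() || *n == '_') {
                word.push(chars.next().unwrap());
            }
            Token::Word(word)
        }
        Some(c) => Token::Symbol(c),
    }
}

fn main() {
    let sql = "SELECT\t a ,\n b";
    let mut chars = sql.chars().peekable();
    let mut tokens = Vec::new();
    loop {
        match next_token(&mut chars) {
            Token::Eof => break,
            tok => tokens.push(tok),
        }
    }
    // Only significant tokens come out; there is nothing to filter downstream.
    assert_eq!(
        tokens,
        vec![
            Token::Word("SELECT".into()),
            Token::Word("a".into()),
            Token::Symbol(','),
            Token::Word("b".into()),
        ]
    );
}
```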
Fixes #2076
Motivation
As discussed in #2076, whitespace tokens were being filtered at numerous points throughout the parser. This approach had several drawbacks:
The parser had extensive whitespace-handling logic scattered throughout:
Functions with whitespace-skipping loops:
- `peek_tokens_with_location` - loops to skip whitespace
- `peek_tokens_ref` - loops to skip whitespace
- `peek_nth_token_ref` - loops to skip whitespace
- `advance_token` - loops to skip whitespace
- `prev_token` - loops backward to skip whitespace

Special variant functions that are now obsolete:
- `peek_token_no_skip` - removed entirely (no longer needed)
- `peek_nth_token_no_skip` - removed entirely (no longer needed)
- `next_token_no_skip` - removed entirely (no longer needed)

Since SQL is not a whitespace-sensitive language (unlike Python), it should be safe to remove whitespace tokens entirely after tokenization; the before/after sketch below shows the effect on lookahead.
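Here is a hedged before/after sketch (simplified types, not the crate's actual code) of what the lookahead helpers listed above had to do, versus what they can do once no whitespace tokens exist:

```rust
// Simplified token type for illustration; the crate's real Token has
// many more variants.
#[derive(Debug)]
enum Token {
    Whitespace(String),
    Word(String),
}

// Before: finding the n-th significant token meant looping past
// whitespace on every single lookahead.
fn peek_nth_skipping_ws(tokens: &[Token], mut n: usize) -> Option<&Token> {
    let mut idx = 0;
    loop {
        match tokens.get(idx)? {
            Token::Whitespace(_) => idx += 1, // skipped; does not count toward n
            tok if n == 0 => return Some(tok),
            _ => {
                n -= 1;
                idx += 1;
            }
        }
    }
}

// After: whitespace never reaches the parser, so lookahead is a plain index.
fn peek_nth(tokens: &[Token], n: usize) -> Option<&Token> {
    tokens.get(n)
}

fn main() {
    let old_stream = vec![
        Token::Word("SELECT".into()),
        Token::Whitespace(" ".into()),
        Token::Word("a".into()),
    ];
    let new_stream = vec![Token::Word("SELECT".into()), Token::Word("a".into())];
    assert!(matches!(peek_nth_skipping_ws(&old_stream, 1), Some(Token::Word(w)) if w == "a"));
    assert!(matches!(peek_nth(&new_stream, 1), Some(Token::Word(w)) if w == "a"));
}
```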
Handling Edge Cases
While SQL is generally not whitespace-sensitive, there are specific edge cases that require careful consideration:
1. PostgreSQL COPY FROM STDIN
The `COPY FROM STDIN` statement requires preserving the actual data content, which may include meaningful whitespace and newlines. The data section is treated as raw input that should be parsed according to the specified format (tab-delimited, CSV, etc.).

Solution: The tokenizer now handles this by consuming the data section as a single token. The parser then actually parses the body of the CSV-like string, which was not done correctly before this refactoring. I have extended the associated tests accordingly.
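For a sense of what parsing the data body involves, here is a hedged sketch (the crate's actual representation and entry points differ): PostgreSQL's text format is tab-delimited, spells NULL as `\N`, and ends the data section with a line containing only `\.`:

```rust
// Hedged sketch of decoding a tab-delimited COPY FROM STDIN body;
// not the parser's real implementation.
fn parse_copy_body(raw: &str) -> Vec<Vec<Option<String>>> {
    raw.lines()
        // A line containing only "\." terminates the data section.
        .take_while(|line| *line != "\\.")
        .map(|line| {
            line.split('\t')
                .map(|field| {
                    if field == "\\N" {
                        None // the text format spells NULL as \N
                    } else {
                        Some(field.to_string())
                    }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let data = "1\talice\n2\t\\N\n\\.\n";
    let rows = parse_copy_body(data);
    assert_eq!(rows.len(), 2); // the terminator line is not a data row
    assert_eq!(rows[0], vec![Some("1".into()), Some("alice".into())]);
    assert_eq!(rows[1][1], None); // \N decoded as NULL
}
```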
2. Hyphenated and path identifiers
The tokenizer now includes enhanced logic for hyphenated identifier parsing with proper validation:
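For illustration (hypothetical helper names; the actual validation in the tokenizer is more involved), the rule can be sketched as: a hyphen extends an identifier only when the character immediately after it can continue the identifier, so BigQuery-style names like `my-project` are captured while `a - b` still tokenizes as subtraction:

```rust
// Hypothetical helper, not the crate's API: which characters may
// continue an identifier.
fn is_ident_part(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

// Scan the longest hyphenated identifier at the start of `input`.
// A '-' is only accepted when the character after it continues the
// identifier, so the result never ends with a dangling hyphen.
fn scan_hyphenated_ident(input: &str) -> &str {
    let mut end = 0;
    let mut chars = input.char_indices().peekable();
    while let Some(&(i, c)) = chars.peek() {
        if is_ident_part(c) {
            end = i + c.len_utf8();
            chars.next();
        } else if c == '-' {
            // Look one character past the hyphen before committing to it.
            let mut lookahead = chars.clone();
            lookahead.next(); // step over the '-'
            match lookahead.peek() {
                Some(&(_, next)) if is_ident_part(next) => {
                    chars.next(); // keep the hyphen inside the identifier
                }
                _ => break, // trailing or operator '-': stop before it
            }
        } else {
            break;
        }
    }
    &input[..end]
}

fn main() {
    assert_eq!(scan_hyphenated_ident("my-project.dataset"), "my-project");
    assert_eq!(scan_hyphenated_ident("a - b"), "a"); // '-' here is subtraction
    assert_eq!(scan_hyphenated_ident("x-"), "x"); // no dangling hyphen
}
```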