
Conversation

@LucaCappelletti94
Contributor

This PR implements a significant architectural refactoring by moving whitespace filtering from the parser to the tokenizer. Instead of emitting whitespace tokens (spaces, tabs, newlines, comments) and filtering them throughout the parser logic, the tokenizer now consumes whitespace during tokenization and never emits these tokens.
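To illustrate the shape of this change (a minimal sketch, not the actual sqlparser-rs tokenizer; all type and function names here are illustrative), a tokenizer can consume whitespace internally so that the parser never sees it:

```rust
// Minimal sketch of the idea, not the real sqlparser-rs tokenizer.
// Token variants and function names here are illustrative only.
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Comma,
    EOF,
}

struct Tokenizer<'a> {
    chars: std::iter::Peekable<std::str::Chars<'a>>,
}

impl<'a> Tokenizer<'a> {
    fn new(sql: &'a str) -> Self {
        Self { chars: sql.chars().peekable() }
    }

    /// Returns the next *significant* token: whitespace is consumed
    /// here and never surfaces as a token the parser has to skip.
    fn next_token(&mut self) -> Token {
        // Consume any run of whitespace before looking at the next token.
        while matches!(self.chars.peek(), Some(c) if c.is_whitespace()) {
            self.chars.next();
        }
        match self.chars.next() {
            None => Token::EOF,
            Some(',') => Token::Comma,
            Some(c) => {
                let mut word = String::from(c);
                while matches!(self.chars.peek(), Some(c) if c.is_alphanumeric() || **c == '_') {
                    word.push(self.chars.next().unwrap());
                }
                Token::Word(word)
            }
        }
    }
}

fn main() {
    let mut t = Tokenizer::new("SELECT  a ,\n\tb");
    // Prints: Word("SELECT"), Word("a"), Comma, Word("b"), EOF
    // No whitespace tokens are ever produced.
    loop {
        let tok = t.next_token();
        println!("{tok:?}");
        if tok == Token::EOF {
            break;
        }
    }
}
```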

While some duplicated logic still remains in the parser (to be addressed in future PRs), this change eliminates a substantial amount of looping overhead. It also lays the groundwork for a cleaner streaming version, in which tokens are consumed as the statements are parsed, with no token memory in the parser and only local context passed between parser function calls.

Fixes #2076

Motivation

As discussed in #2076, whitespace tokens were being filtered at numerous points throughout the parser. This approach had several drawbacks:

  • Poor separation of concerns: Whitespace handling was scattered across both tokenizer and parser
  • Memory overhead: Whitespace tokens were stored in memory unnecessarily
  • Code duplication: Multiple loops throughout the parser to skip whitespace tokens, looking ahead or backwards for non-whitespace tokens
  • Performance: Each token access required checking and skipping whitespace tokens (see the sketch after this list)
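
For contrast, the pre-refactor pattern being removed looks roughly like this (a hedged sketch; the actual parser helpers differ): every lookahead has to loop over buffered whitespace tokens.

```rust
// Illustrative only: a simplified version of the kind of helper this PR removes.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Whitespace(String),
    Word(String),
    EOF,
}

struct Parser {
    tokens: Vec<Token>,
    index: usize,
}

impl Parser {
    /// Pre-refactor style: every call scans forward over buffered
    /// whitespace tokens before returning something meaningful.
    fn next_significant_token(&mut self) -> Token {
        loop {
            let tok = self.tokens.get(self.index).cloned().unwrap_or(Token::EOF);
            self.index += 1;
            match tok {
                Token::Whitespace(_) => continue, // skipped at every call site
                other => return other,
            }
        }
    }
}

fn main() {
    let mut p = Parser {
        tokens: vec![
            Token::Word("SELECT".into()),
            Token::Whitespace(" ".into()),
            Token::Whitespace("\n".into()),
            Token::Word("a".into()),
        ],
        index: 0,
    };
    assert_eq!(p.next_significant_token(), Token::Word("SELECT".into()));
    assert_eq!(p.next_significant_token(), Token::Word("a".into()));
    // With whitespace filtered in the tokenizer, this skipping loop
    // (and the stored Whitespace tokens) disappear entirely.
}
```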

The parser had extensive whitespace-handling logic scattered throughout:

Functions with whitespace-skipping loops:

Special variant functions that are now obsolete:

Since SQL is not a whitespace-sensitive language (unlike Python), it should be safe to remove whitespace tokens entirely after tokenization.

Handling Edge Cases

While SQL is generally not whitespace-sensitive, there are specific edge cases that require careful consideration:

1. PostgreSQL COPY FROM STDIN

The COPY FROM STDIN statement requires preserving the actual data content, which may include meaningful whitespace and newlines. The data section is treated as raw input that should be parsed according to the specified format (tab-delimited, CSV, etc.).

Solution: The tokenizer now handles this case by consuming the data section as a single token. The parser then parses the body of that CSV-like string properly, which was not done correctly before this refactoring. I have extended the associated tests accordingly.
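
A rough sketch of the two steps, assuming the default tab-delimited TEXT format (illustrative code, not the actual implementation; the real TEXT format has additional escape rules):

```rust
// Illustrative sketch only; not the sqlparser-rs implementation.

/// Step 1: treat everything up to the `\.` terminator line as one raw
/// chunk, preserving embedded whitespace and newlines within each row.
fn take_copy_data(input: &str) -> Vec<&str> {
    input
        .lines()
        .take_while(|line| *line != "\\.")
        .collect()
}

/// Step 2: parse each data line according to the default TEXT format:
/// tab-delimited columns, with `\N` meaning NULL.
fn parse_copy_row(line: &str) -> Vec<Option<&str>> {
    line.split('\t')
        .map(|field| if field == "\\N" { None } else { Some(field) })
        .collect()
}

fn main() {
    // Data section as it might follow `COPY t (a, b) FROM STDIN;`
    let data = "1\tAlice\n2\t\\N\n\\.\nSELECT 1;";
    let rows: Vec<_> = take_copy_data(data).into_iter().map(parse_copy_row).collect();
    assert_eq!(rows[0], vec![Some("1"), Some("Alice")]);
    assert_eq!(rows[1], vec![Some("2"), None]);
    println!("{rows:?}");
}
```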

2. Hyphenated and path identifiers

The tokenizer now includes enhanced logic for hyphenated identifier parsing with proper validation (a rough sketch follows this list):

  • Detects when hyphens/paths/tildes are part of identifiers vs. operators
  • Validates that identifiers don't start with digits after hyphens
  • Ensures identifiers don't end with trailing hyphens
  • Handles the whitespace-dependent context correctly
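
A rough sketch of the validation rules listed above, expressed as a hypothetical standalone check (the actual tokenizer works incrementally on its character stream):

```rust
// Hypothetical helper illustrating the validation rules listed above;
// not the code in this PR.

/// Returns true if `s` is acceptable as a single hyphenated identifier,
/// e.g. BigQuery-style `my-project`, but not `my-1project` or `my-`.
fn is_valid_hyphenated_identifier(s: &str) -> bool {
    // A trailing hyphen means the hyphen is an operator, not part of the name.
    if s.ends_with('-') {
        return false;
    }
    let mut parts = s.split('-');
    // The first part must look like an ordinary identifier start.
    match parts.next() {
        Some(first)
            if first
                .chars()
                .next()
                .map_or(false, |c| c.is_alphabetic() || c == '_') => {}
        _ => return false,
    }
    // Every part after a hyphen must be non-empty and must not start with a
    // digit; otherwise `a-1` stays a subtraction rather than an identifier.
    parts.all(|part| {
        part.chars()
            .next()
            .map_or(false, |c| !c.is_ascii_digit())
    })
}

fn main() {
    assert!(is_valid_hyphenated_identifier("my-project"));
    assert!(!is_valid_hyphenated_identifier("my-1project")); // digit after hyphen
    assert!(!is_valid_hyphenated_identifier("my-project-")); // trailing hyphen
    println!("ok");
}
```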

@LucaCappelletti94
Contributor Author

The newly added csv dependency does not have an analogous no-std version, so the alloc-only test is currently failing. I will see whether a PR adding an alloc-only mode to csv is feasible. I used csv because I do not want to re-implement CSV-like document parsing in sqlparser.

@LucaCappelletti94
Contributor Author

I have decided to replace the csv crate with a custom solution, since that crate does not currently provide an alloc-only feature and it seems rather complex to contribute one upstream.
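
For illustration, a minimal alloc-only record splitter might look like the following (hypothetical sketch, not the code in this PR; it relies only on `core` and `alloc` types):

```rust
// Hypothetical sketch of an alloc-only splitter; not the code in this PR.
// The same pattern works in a `#![no_std]` crate with `extern crate alloc`.
extern crate alloc;
use alloc::{string::String, vec::Vec};

/// Split one CSV-like record into fields, honouring double-quoted fields
/// with `""` as an escaped quote. No std-only APIs are used.
fn split_record(line: &str, delimiter: char) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                // Escaped quote inside a quoted field.
                current.push('"');
                chars.next();
            }
            '"' => in_quotes = !in_quotes,
            c if c == delimiter && !in_quotes => {
                fields.push(core::mem::take(&mut current));
            }
            c => current.push(c),
        }
    }
    fields.push(current);
    fields
}

fn main() {
    assert_eq!(
        split_record(r#"1,"hello, ""world""",3"#, ','),
        vec!["1", r#"hello, "world""#, "3"]
    );
    println!("ok");
}
```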
