Conversation

@eyalleshem (Contributor):

Key points for this commit:

  • The Peekable trait isn't sufficient for working with string slices: we need the byte indexes (start/end) to create them, so the current byte position was added to the State struct. (Note: in the long term we could potentially remove Peekable and use only the current position as an iterator.)
  • Created internal functions that create slices from the original query instead of allocating strings, then converted their results to String to maintain compatibility; the idea is to keep the commit small and reviewable without changing the Token struct or the parser (see the sketch below).
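A hypothetical, simplified illustration of this pattern; the names below are invented for the example, not the crate's API. The inner helper borrows from the original query, and the wrapper copies once at the boundary, so Token and the parser stay unchanged for now:

fn take_digits(source: &str, start: usize) -> &str {
    // slice the source directly instead of collecting chars into a String
    let end = source[start..]
        .find(|c: char| !c.is_ascii_digit())
        .map_or(source.len(), |i| start + i);
    &source[start..end]
}

fn take_digits_string(source: &str, start: usize) -> String {
    // single allocation at the API boundary, for compatibility
    take_digits(source, start).to_string()
}

fn main() {
    assert_eq!(take_digits("123abc", 0), "123");
    assert_eq!(take_digits_string("a42b", 1), "42");
}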

@eyalleshem force-pushed the reduce-string-copies branch 2 times, most recently from 94afa83 to ad1f74a on October 23, 2025 17:29
@eyalleshem force-pushed the reduce-string-copies branch from ad1f74a to 82c6657 on October 23, 2025 17:31
@eyalleshem (Contributor, Author):

This commit is a first step toward addressing issue #2036.

It doesn't contain any behavioral changes yet, and the scope is intentionally limited because it touches some core elements of the tokenizer. I wanted to:

  1. Keep the changes small and easy to review
  2. Get early feedback before proceeding with the full implementation

I'd appreciate any thoughts on this approach before I continue!

@iffyio (Contributor) left a comment:


Thanks for starting to look into this @eyalleshem! Took a look now and I think the changes look reasonable overall; left some comments.

Comment on lines +778 to +779
/// return the character after the next character (lookahead by 2) without advancing the stream
pub fn peek_next(&self) -> Option<char> {
@iffyio:

Suggested change:
- /// return the character after the next character (lookahead by 2) without advancing the stream
- pub fn peek_next(&self) -> Option<char> {
+ /// Return the `nth` next character without advancing the stream.
+ pub fn peek_nth(&self, n: usize) -> Option<char> {

Thinking we can make this more generic so that it can be reused in other contexts? Similar to the peek_nth* parser methods.
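A minimal sketch of what the generic version might look like, assuming State holds the source text alongside the byte_pos field added in this PR (the struct is simplified here for illustration):

struct State<'a> {
    source: &'a str,
    byte_pos: usize,
}

impl<'a> State<'a> {
    /// Return the `nth` next character without advancing the stream
    /// (n == 0 peeks the immediate next character).
    fn peek_nth(&self, n: usize) -> Option<char> {
        // `get` returns None on an out-of-range or non-boundary index,
        // so a stale byte_pos cannot cause a panic here
        self.source.get(self.byte_pos..)?.chars().nth(n)
    }
}

fn main() {
    let s = State { source: "abc", byte_pos: 1 };
    assert_eq!(s.peek_nth(0), Some('b'));
    assert_eq!(s.peek_nth(1), Some('c'));
    assert_eq!(s.peek_nth(2), None);
}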


/// return the character after the next character (lookahead by 2) without advancing the stream
pub fn peek_next(&self) -> Option<char> {
// Use the source and byte_pos instead of cloning the peekable iterator
@iffyio:

Can we add a guard here checking that the index is in range, so that we don't panic if we hit a bug or the API is misused? e.g.

if self.byte_pos >= self.source.len() {
    return None
}
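Relatedly, str::get performs the same bounds check and returns None on a bad index, which could subsume the manual guard; a minimal, self-contained demo (not code from the PR):

fn main() {
    let source = "SELECT";
    let stale_pos = 10; // deliberately out of range
    // `get` returns None instead of panicking on an out-of-range slice
    assert!(source.get(stale_pos..).is_none());
    // in-range indexes still work as usual
    assert_eq!(source.get(2..), Some("LECT"));
}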

chars.next(); // consume the first char
let ch: String = ch.into_iter().collect();
let word = self.tokenize_word(ch, chars);
// Calculate total byte length without allocating a String
@iffyio:

The indentation looks a bit off on this line?

let ch: String = ch.into_iter().collect();
let word = self.tokenize_word(ch, chars);
// Calculate total byte length without allocating a String
let consumed_byte_len: usize = ch.into_iter().map(|c| c.len_utf8()).sum();
@iffyio:

I wonder if we can instead replace the ch parameter to this tokenize_identifier_or_keyword function with the actual consumed_byte_len: usize?

If I'm understanding the intent/requirement of this change correctly, the caller of this function should have that value on hand, given that we seem to require that ch contains the preceding characters in the stream?

The current flow was initially unclear to me: the caller passes in an iterator of items whose contents we don't use here, and it required digging further into tokenize_word to realise why that was the case.
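If it helps, a minimal, self-contained sketch of the caller's side; the helper name is invented for illustration, not the crate's API:

fn consumed_byte_len_of(prefix: &[char]) -> usize {
    // the byte length the consumed characters occupied in the source
    prefix.iter().map(|c| c.len_utf8()).sum()
}

fn main() {
    // 'é' is 2 bytes in UTF-8, so this two-char prefix spans 3 bytes
    assert_eq!(consumed_byte_len_of(&['é', 'x']), 3);
}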

/// `consumed_byte_len` is the byte length of the consumed character(s).
fn tokenize_word(&self, consumed_byte_len: usize, chars: &mut State<'a>) -> String {
// Calculate where the first character started
let first_char_byte_pos = chars.byte_pos - consumed_byte_len;
@iffyio:

Can we add a check that the subtraction doesn't overflow? e.g.

if consumed_byte_len > chars.byte_pos {
    // equality is fine (the word would start at byte 0);
    // anything larger would make the subtraction below overflow
    return "".to_string();
}
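As an alternative, usize::checked_sub makes the bad case explicit rather than relying on a manual comparison; a minimal, self-contained demo (not code from the PR):

fn main() {
    let byte_pos: usize = 3;
    let consumed_byte_len: usize = 5; // deliberately inconsistent
    // checked_sub yields None on underflow instead of panicking
    match byte_pos.checked_sub(consumed_byte_len) {
        Some(start) => println!("word starts at byte {start}"),
        None => println!("inconsistent state, bail out"),
    }
}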

Comment on lines +2213 to +2220
/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
///
/// # Arguments
/// * `chars` - The character iterator state (contains reference to original source)
/// * `predicate` - Function that returns true while we should continue taking characters
///
/// # Returns
/// A borrowed slice of the source string containing the matched characters
@iffyio:

Doc-wise I think it's easier to reference only the peeking_take_while method and mention just the difference from this function; that way we describe the functionality only once. Also, the project doesn't use the # Arguments / # Returns documentation format, so I think we can skip that for consistency.

chars: &mut State<'a>,
mut predicate: impl FnMut(char) -> bool,
) -> &'a str {
// Record the starting byte position
@iffyio:

We can sanity-check the index before using it here as well?

pub line: u64,
pub col: u64,
/// Byte position in the source string
pub byte_pos: usize,
@iffyio:

Suggested change:
- pub byte_pos: usize,
+ byte_pos: usize,
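If callers outside the tokenizer module ever need the position, a read-only accessor would keep the field encapsulated; a hypothetical, self-contained sketch (not code from the PR):

struct State<'a> {
    source: &'a str,
    byte_pos: usize, // private: only the tokenizer advances it
}

impl<'a> State<'a> {
    /// Read-only access to the current byte position.
    pub fn byte_pos(&self) -> usize {
        self.byte_pos
    }
}

fn main() {
    let s = State { source: "SELECT 1", byte_pos: 0 };
    assert_eq!(s.byte_pos(), 0);
    let _ = s.source;
}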

///
/// # Returns
/// A borrowed slice of the source string containing the matched characters
fn borrow_slice_until_next<'a>(
@iffyio:

Wondering, given this function is quite similar, would it make sense to implement borrow_slice_until as follows instead, to reuse the same definition/impl?

fn borrow_slice_until<'a>(
    chars: &mut State<'a>,
    mut predicate: impl FnMut(char) -> bool,
) -> &'a str {
    // delegate to the two-character variant, ignoring the lookahead char
    borrow_slice_until_next(chars, |ch, _| predicate(ch))
}

/// Same as peeking_take_while, but also passes the next character to the predicate.
fn peeking_next_take_while(
chars: &mut State,
/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
@iffyio:

Just flagging that similar comments to those on borrow_slice_until apply to this function.
