@jlamypoirier jlamypoirier commented Oct 18, 2025

✨ Description

New format for memmap datasets, which allows for more varied input types:

  • MemmapDataset is now agnostic of the data (sample type) being referred to and delegates much of the work to dynamic readers and writers.
  • Associate each sample type with a dynamic reader config holding metadata for the stored dataset (e.g., dynamic type, buffer range, document and token counts), and a reader/writer pair that handles the actual data.
  • Keep the existing memmap dataset for backward compatibility ("legacy memmap"), but remove dataset writing capability (except in test_match_megatron).
  • Simplify the GPT data preparator, merging its multiple tokenization methods into a single _prepare_sample method.
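
The delegation described above can be sketched roughly as follows. This is an illustrative outline only; the class and method names (MemmapReader, MemmapWriter, get_document) are assumptions, not the actual Fast-LLM API:

```python
# Hypothetical sketch: MemmapDataset is agnostic of the sample type and
# delegates document access to a dynamic reader. Names are illustrative.
import abc


class MemmapReader(abc.ABC):
    """Reads samples of one sample type from a memory-mapped buffer."""

    @abc.abstractmethod
    def get_document(self, index: int):
        ...


class MemmapWriter(abc.ABC):
    """Writes samples and returns the reader config metadata for the file footer."""

    @abc.abstractmethod
    def write(self, stream, documents) -> dict:
        ...


class MemmapDataset:
    """Sample-type agnostic: all sample-specific logic lives in the reader."""

    def __init__(self, name: str, reader: MemmapReader):
        self._name = name
        self._reader = reader

    def get_document(self, index: int):
        # Delegate entirely to the dynamic reader.
        return self._reader.get_document(index)
```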

File structure:

  • Hard-coded header ("fast_llm_prepared_dataset")
  • Pointer to the reader config (int64). The config itself is not written here because it is not available until the whole dataset has been written.
  • Reader-specific content.
  • Reader config (json-serialized).
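
The layout above can be sketched as a write/read round trip: write the header and a placeholder pointer, append the reader content and the json-serialized config, then seek back and patch the pointer once the config's offset is known. This is a minimal sketch; the header string comes from the description, but the pointer encoding (little-endian int64) and function names are assumptions:

```python
# Sketch of the file layout: header, int64 pointer to the config,
# reader-specific content, then the json-serialized reader config.
import io
import json
import struct

HEADER = b"fast_llm_prepared_dataset"


def write_dataset(stream: io.BufferedIOBase, content: bytes, config: dict) -> None:
    stream.write(HEADER)
    pointer_offset = stream.tell()
    stream.write(struct.pack("<q", 0))  # placeholder: config offset unknown until the end
    stream.write(content)               # reader-specific content
    config_offset = stream.tell()
    stream.write(json.dumps(config).encode("utf-8"))
    # Patch the pointer now that the config location is known.
    stream.seek(pointer_offset)
    stream.write(struct.pack("<q", config_offset))


def read_config(stream: io.BufferedIOBase) -> dict:
    assert stream.read(len(HEADER)) == HEADER
    (config_offset,) = struct.unpack("<q", stream.read(8))
    stream.seek(config_offset)
    return json.loads(stream.read().decode("utf-8"))
```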

Three reader types are currently implemented:

  1. Tokens:
    • Hard-coded header
    • Tokens
    • Cumulative sums of document token counts (for locating begin/end of documents)
    • Hard-coded footer.
  2. Range:
    • Hard-coded header
    • Ranges
    • Cumulative sums of number of ranges in each document (for locating begin/end of documents)
    • Hard-coded footer.
  3. Language model:
    • Hard-coded header
    • Token reader content
    • (Optional) Loss masking span (range) reader content.
    • (Optional) Chosen span (range) reader content.
    • (Optional) Rejected span (range) reader content.
    • Hard-coded footer.
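
The cumulative sums in the token and range readers let a document's begin/end be found in O(1). A minimal sketch of the token case (the numpy layout and names here are assumptions for illustration):

```python
# Locating a document's tokens via cumulative sums of document token counts.
import numpy as np

# All documents' tokens, concatenated into one flat buffer.
tokens = np.array([5, 6, 7, 8, 9, 10], dtype=np.int64)
# Token count per document: doc 0 has 2 tokens, doc 1 has 1, doc 2 has 3.
doc_lengths = np.array([2, 1, 3], dtype=np.int64)
# Prepend 0 so cumsum[i]:cumsum[i + 1] bounds document i -> [0, 2, 3, 6].
cumsum = np.concatenate([[0], np.cumsum(doc_lengths)])


def get_document_tokens(index: int) -> np.ndarray:
    begin, end = cumsum[index], cumsum[index + 1]
    return tokens[begin:end]
```

The range reader works the same way, with cumulative sums over the number of ranges per document instead of token counts.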

Note: The preparator isn't working as of this PR; it will be fixed and tested in #383.

@jlamypoirier jlamypoirier marked this pull request as ready for review October 29, 2025 23:45