@jlamypoirier jlamypoirier commented Oct 18, 2025

✨ Description

New format for memmap datasets, which allows for more varied input types:

  • MemmapDataset is now agnostic of the data (sample type) being referred to and delegates much of the work to dynamic readers and writers.
  • Associate each sample type with a dynamic reader config holding metadata for the stored dataset (e.g., dynamic type, buffer range, document and token counts), and a reader/writer pair that handles the actual data.
  • Keep the existing memmap dataset for backward compatibility ("legacy memmap"), but remove dataset writing capability (except in test_match_megatron).
  • Simplify the GPT data preparator, merging its multiple tokenization methods into a single _prepare_sample method.
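
The delegation described above can be sketched roughly as follows. This is an illustrative outline only; the class and method names (MemmapReader, MemmapWriter, get_document) are assumptions, not the actual Fast-LLM API:

```python
# Hypothetical sketch: MemmapDataset is agnostic of the sample type and
# delegates document access to a dynamic reader. Names are illustrative.
import abc


class MemmapReader(abc.ABC):
    """Reads samples of one sample type from a memory-mapped buffer."""

    @abc.abstractmethod
    def get_document(self, index: int):
        ...


class MemmapWriter(abc.ABC):
    """Writes samples and returns the reader config metadata for the file footer."""

    @abc.abstractmethod
    def write(self, stream, documents) -> dict:
        ...


class MemmapDataset:
    """Sample-type agnostic: all sample-specific logic lives in the reader."""

    def __init__(self, name: str, reader: MemmapReader):
        self._name = name
        self._reader = reader

    def get_document(self, index: int):
        # Delegate entirely to the dynamic reader.
        return self._reader.get_document(index)
```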

File structure:

  • Hard-coded header ("fast_llm_prepared_dataset")
  • Pointer to the reader config (int64). The config itself is not written here because it is not available until the whole dataset has been written.
  • Reader-specific content.
  • Reader config (json-serialized).
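
The layout above can be sketched as a write/read round trip: write the header and a placeholder pointer, append the reader content and the json-serialized config, then seek back and patch the pointer once the config's offset is known. This is a minimal sketch; the header string comes from the description, but the pointer encoding (little-endian int64) and function names are assumptions:

```python
# Sketch of the file layout: header, int64 pointer to the config,
# reader-specific content, then the json-serialized reader config.
import io
import json
import struct

HEADER = b"fast_llm_prepared_dataset"


def write_dataset(stream: io.BufferedIOBase, content: bytes, config: dict) -> None:
    stream.write(HEADER)
    pointer_offset = stream.tell()
    stream.write(struct.pack("<q", 0))  # placeholder: config offset unknown until the end
    stream.write(content)               # reader-specific content
    config_offset = stream.tell()
    stream.write(json.dumps(config).encode("utf-8"))
    # Patch the pointer now that the config location is known.
    stream.seek(pointer_offset)
    stream.write(struct.pack("<q", config_offset))


def read_config(stream: io.BufferedIOBase) -> dict:
    assert stream.read(len(HEADER)) == HEADER
    (config_offset,) = struct.unpack("<q", stream.read(8))
    stream.seek(config_offset)
    return json.loads(stream.read().decode("utf-8"))
```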

Three reader types are currently implemented:

  1. Tokens:
    • Hard-coded header
    • Tokens
    • Cumulative sums of document token counts (for locating begin/end of documents)
    • Hard-coded footer.
  2. Range:
    • Hard-coded header
    • Ranges
    • Cumulative sums of number of ranges in each document (for locating begin/end of documents)
    • Hard-coded footer.
  3. Language model:
    • Hard-coded header
    • Token reader content
    • (Optional) Loss masking span (range) reader content.
    • (Optional) Chosen span (range) reader content.
    • (Optional) Rejected span (range) reader content.
    • Hard-coded footer.
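
The cumulative sums in the token and range readers let a document's begin/end be found in O(1). A minimal sketch of the token case (the numpy layout and names here are assumptions for illustration):

```python
# Locating a document's tokens via cumulative sums of document token counts.
import numpy as np

# All documents' tokens, concatenated into one flat buffer.
tokens = np.array([5, 6, 7, 8, 9, 10], dtype=np.int64)
# Token count per document: doc 0 has 2 tokens, doc 1 has 1, doc 2 has 3.
doc_lengths = np.array([2, 1, 3], dtype=np.int64)
# Prepend 0 so cumsum[i]:cumsum[i + 1] bounds document i -> [0, 2, 3, 6].
cumsum = np.concatenate([[0], np.cumsum(doc_lengths)])


def get_document_tokens(index: int) -> np.ndarray:
    begin, end = cumsum[index], cumsum[index + 1]
    return tokens[begin:end]
```

The range reader works the same way, with cumulative sums over the number of ranges per document instead of token counts.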

Note: The preparator isn't working as of this PR; it will be fixed and tested in #383.

@jlamypoirier jlamypoirier marked this pull request as ready for review October 29, 2025 23:45