
Conversation

@NicolasRR
Contributor

Implemented checkpointing and retrieval of the model, scheduler, and optimizer state_dicts, as well as the random generator states, for reproducibility.
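
For readers skimming the thread, a minimal sketch of what such a checkpoint can look like in plain PyTorch. This is an illustration of the technique, not the PR's actual code; `model`, `opt`, `scheduler`, and `step` are placeholder names:

```python
import torch

def save_checkpoint(path, model, opt, scheduler, step):
    """Save training state plus RNG states so a resumed run is reproducible."""
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": opt.state_dict(),
        "scheduler": scheduler.state_dict(),
        # Random generator states for reproducibility.
        "rng_torch": torch.get_rng_state(),
        "rng_cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    }, path)

def load_checkpoint(path, model, opt, scheduler):
    """Restore training state and RNG states from a checkpoint file."""
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    torch.set_rng_state(ckpt["rng_torch"])
    if ckpt["rng_cuda"] is not None:
        torch.cuda.set_rng_state_all(ckpt["rng_cuda"])
    return ckpt["step"]
```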

@mkrima
Collaborator

mkrima commented Apr 16, 2024

LGTM

@mkrima mkrima merged commit c542d4b into epfml:main Apr 16, 2024
@haeggee
Collaborator

haeggee commented Apr 16, 2024

Quick question: without having gone through the code in detail, does this also ensure that data sampling is deterministic (i.e., that the dataloader state is restored)? In other words, does the model see the same data when resuming from a checkpoint as it would in an uninterrupted training run?

@NicolasRR
Contributor Author

Good point. I have added those modifications in another PR.

@mkrima
Collaborator

mkrima commented Apr 16, 2024

Sorry, I think I merged this too quickly. I forgot that we switched to using a dataloader in this repo. The PR resets the RNG state, but our dataloader has its own sampler with its own generator, so we have to set the RNG state for that too. I can do that in another PR during our hackathon tomorrow:
https://stackoverflow.com/questions/60993677/how-can-i-save-pytorchs-dataloader-instance
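
For reference, a hedged sketch of what that fix could look like, assuming the dataloader uses a standard PyTorch `RandomSampler`; `train_set` and `ckpt` here are placeholders, not the repo's actual objects:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Placeholder dataset; in the repo this would be the actual training set.
train_set = TensorDataset(torch.arange(100))

# Give the sampler an explicit generator so its state can be checkpointed.
g = torch.Generator()
g.manual_seed(0)
sampler = RandomSampler(train_set, generator=g)
loader = DataLoader(train_set, batch_size=10, sampler=sampler)

# On save: include the sampler generator's state alongside the other states.
ckpt = {"sampler_rng": g.get_state()}

# On load: restore it before iterating the dataloader so the shuffle
# order matches the uninterrupted run.
g.set_state(ckpt["sampler_rng"])

# Note: this reproduces the epoch's shuffle order; resuming mid-epoch would
# additionally require skipping the batches already consumed.
```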
