
Conversation

@NicolasRR
Contributor

Implemented checkpointing and retrieval of the model, scheduler, and optimizer state_dicts, as well as the random generator states, for reproducibility.
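
For readers skimming the thread, a minimal sketch of what such a checkpoint can look like in plain PyTorch. This is an illustration of the technique, not the PR's actual code; `model`, `opt`, `scheduler`, and `step` are placeholder names:

```python
import torch

def save_checkpoint(path, model, opt, scheduler, step):
    """Save training state plus RNG states so a resumed run is reproducible."""
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": opt.state_dict(),
        "scheduler": scheduler.state_dict(),
        # Random generator states for reproducibility.
        "rng_torch": torch.get_rng_state(),
        "rng_cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    }, path)

def load_checkpoint(path, model, opt, scheduler):
    """Restore training state and RNG states from a checkpoint file."""
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    torch.set_rng_state(ckpt["rng_torch"])
    if ckpt["rng_cuda"] is not None:
        torch.cuda.set_rng_state_all(ckpt["rng_cuda"])
    return ckpt["step"]
```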

@mkrima
Collaborator

mkrima commented Apr 16, 2024

LGTM

@mkrima mkrima merged commit c542d4b into epfml:main Apr 16, 2024
@haeggee
Collaborator

haeggee commented Apr 16, 2024

Quick question: without having gone through the code in detail, does this also ensure that data sampling is deterministic (i.e., that the dataloader state is restored)? In other words, does the model see the same data when resuming from a checkpoint as it would in an uninterrupted training run?

@NicolasRR
Contributor Author

Good point. I have added those modifications in another PR.

@mkrima
Collaborator

mkrima commented Apr 16, 2024

Sorry, I think I merged this too quickly. I forgot that we switched to using a dataloader in this repo. The PR resets the RNG state, but our dataloader has its own sampler with its own generator, so we have to set the RNG state for that too. I can do that in another PR during our hackathon tomorrow:
https://stackoverflow.com/questions/60993677/how-can-i-save-pytorchs-dataloader-instance
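
For reference, a hedged sketch of what that fix could look like, assuming the dataloader uses a standard PyTorch `RandomSampler`; `train_set` and `ckpt` here are placeholders, not the repo's actual objects:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Placeholder dataset; in the repo this would be the actual training set.
train_set = TensorDataset(torch.arange(100))

# Give the sampler an explicit generator so its state can be checkpointed.
g = torch.Generator()
g.manual_seed(0)
sampler = RandomSampler(train_set, generator=g)
loader = DataLoader(train_set, batch_size=10, sampler=sampler)

# On save: include the sampler generator's state alongside the other states.
ckpt = {"sampler_rng": g.get_state()}

# On load: restore it before iterating the dataloader so the shuffle
# order matches the uninterrupted run.
g.set_state(ckpt["sampler_rng"])

# Note: this reproduces the epoch's shuffle order; resuming mid-epoch would
# additionally require skipping the batches already consumed.
```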
