-
Notifications
You must be signed in to change notification settings - Fork 9
Add build_dataset_for_testing for creating cached datasets for testing #376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
/unit_test |
GiGL Automation@ 23:16:12UTC : 🔄 @ 24:25:36UTC : ✅ Workflow completed successfully. |
|
/unit_test |
GiGL Automation@ 23:17:21UTC : 🔄 @ 24:22:06UTC : ✅ Workflow completed successfully. |
|
/unit_test |
GiGL Automation@ 01:09:33UTC : 🔄 @ 02:25:38UTC : ✅ Workflow completed successfully. |
| cache_key = ( | ||
| task_config_uri, | ||
| partitioner_class.__name__, | ||
| splitter.__class__.__name__, | ||
| ssl_positive_label_percentage, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems error prone -- what if we want to test edge_dir=outbut have already cached a dataset whereedge_dir = in`? I think hard-coding the keys we are using to cache like this could make it harder to maintain in the future.
Do you know if we see also may see these runtime improvements if we instead use a lru_cache on the run_distributed_dataset and build_dataset classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should use lru_cache on build_dataset since that's used for prod and I'm not sure it'd be safe to keep the big datasets in memory like that (it may be! but I haven't looked enough into the details of it to know for sure.)
And we don't really use run_distributed_dataset much elsewhere, it's a bit limited since it takes in a mocked_dataset_info vs some task config uri. Though I guess we could convert the other tests to use it if we wanted to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the perf changes here don't seem to big, wdyt about punting on the caching and change this PR to only include create_test_process_group and removing DistContext from the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the perf changes here don't seem to big, wdyt about punting on the caching and change this PR to only include create_test_process_group and removing DistContext from the tests?
This makes sense to me, thanks Kyle!
| return f"tcp://{host}:{port_picker()}" | ||
|
|
||
|
|
||
| def create_test_process_group() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Scope of work done
I was hoping this would save more time but it seems like it's just a few minutes, sadly...
Guess I should properly benchmark :P
Anyways this should save time and logspam regardless?
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO