Skip to content

Conversation

@kmontemayor2-sc
Copy link
Collaborator

@kmontemayor2-sc kmontemayor2-sc commented Nov 6, 2025

Scope of work done
I was hoping this would save more time but it seems like it's just a few minutes, sadly...
Guess I should properly benchmark :P

Anyways this should save time and logspam regardless?

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

@kmontemayor2-sc
Copy link
Collaborator Author

/unit_test

@github-actions
Copy link
Contributor

github-actions bot commented Nov 6, 2025

GiGL Automation

@ 23:16:12UTC : 🔄 Unit Test started.

@ 24:25:36UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Copy link
Collaborator Author

/unit_test

@github-actions
Copy link
Contributor

github-actions bot commented Nov 6, 2025

GiGL Automation

@ 23:17:21UTC : 🔄 Unit Test started.

@ 24:22:06UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc kmontemayor2-sc changed the title Copy over some files Add build_dataset_for_testing for creating cached datasets for testing Nov 7, 2025
@kmontemayor2-sc
Copy link
Collaborator Author

/unit_test

@github-actions
Copy link
Contributor

github-actions bot commented Nov 7, 2025

GiGL Automation

@ 01:09:33UTC : 🔄 Unit Test started.

@ 02:25:38UTC : ✅ Workflow completed successfully.

Comment on lines +127 to +132
cache_key = (
task_config_uri,
partitioner_class.__name__,
splitter.__class__.__name__,
ssl_positive_label_percentage,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems error prone -- what if we want to test edge_dir=outbut have already cached a dataset whereedge_dir = in`? I think hard-coding the keys we are using to cache like this could make it harder to maintain in the future.

Do you know if we see also may see these runtime improvements if we instead use a lru_cache on the run_distributed_dataset and build_dataset classes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should use lru_cache on build_dataset since that's used for prod and I'm not sure it'd be safe to keep the big datasets in memory like that (it may be! but I haven't looked enough into the details of it to know for sure.)

And we don't really use run_distributed_dataset much elsewhere, it's a bit limited since it takes in a mocked_dataset_info vs some task config uri. Though I guess we could convert the other tests to use it if we wanted to.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the perf changes here don't seem to big, wdyt about punting on the caching and change this PR to only include create_test_process_group and removing DistContext from the tests?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the perf changes here don't seem to big, wdyt about punting on the caching and change this PR to only include create_test_process_group and removing DistContext from the tests?

This makes sense to me, thanks Kyle!

return f"tcp://{host}:{port_picker()}"


def create_test_process_group() -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants