Add build_dataset_for_testing for creating cached datasets for testing #376

kmontemayor2-sc · 2025-11-06T23:15:45Z

Scope of work done
I was hoping this would save more time but it seems like it's just a few minutes, sadly...
Guess I should properly benchmark :P

Anyways, this should save time and cut down on logspam regardless?

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

kmontemayor2-sc · 2025-11-06T23:15:54Z

/unit_test

github-actions · 2025-11-06T23:16:12Z

GiGL Automation

@ 23:16:12UTC : 🔄 Unit Test started.

@ 00:25:36UTC : ✅ Workflow completed successfully.

kmontemayor2-sc · 2025-11-06T23:17:08Z

/unit_test

github-actions · 2025-11-06T23:17:21Z

GiGL Automation

@ 23:17:21UTC : 🔄 Unit Test started.

@ 00:22:06UTC : ✅ Workflow completed successfully.

kmontemayor2-sc · 2025-11-07T01:09:23Z

/unit_test

github-actions · 2025-11-07T01:09:33Z

GiGL Automation

@ 01:09:33UTC : 🔄 Unit Test started.

@ 02:25:38UTC : ✅ Workflow completed successfully.

mkolodner-sc · 2025-11-07T21:57:01Z

python/tests/test_assets/distributed/run_distributed_dataset.py

+    cache_key = (
+        task_config_uri,
+        partitioner_class.__name__,
+        splitter.__class__.__name__,
+        ssl_positive_label_percentage,
+    )


This seems error prone -- what if we want to test `edge_dir=out` but have already cached a dataset where `edge_dir=in`? I think hard-coding the keys we use for caching like this could make it harder to maintain in the future.

Do you know if we might also see these runtime improvements if we instead use an lru_cache on the run_distributed_dataset and build_dataset classes?
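The collision described in this comment can be sketched in a few lines. This is a minimal illustration, not the PR's code: the cache dictionary, both helper functions, and the `gs://hypothetical-bucket/...` URI are made up for the example, and the "dataset" is just a dict standing in for an expensive build.

```python
from typing import Any, Dict, Hashable

# Hypothetical module-level cache; not taken from the PR.
_DATASET_CACHE: Dict[Hashable, Any] = {}


def build_with_handpicked_key(task_config_uri: str, edge_dir: str) -> Dict[str, str]:
    """Cache on a hand-written key that forgets to include edge_dir."""
    key = (task_config_uri,)
    if key not in _DATASET_CACHE:
        # Stand-in for an expensive dataset build.
        _DATASET_CACHE[key] = {"uri": task_config_uri, "edge_dir": edge_dir}
    return _DATASET_CACHE[key]


def build_with_derived_key(**build_kwargs: str) -> Dict[str, str]:
    """Derive the key from every argument, so a new parameter cannot be missed."""
    key = ("derived",) + tuple(sorted(build_kwargs.items()))
    if key not in _DATASET_CACHE:
        _DATASET_CACHE[key] = dict(build_kwargs)
    return _DATASET_CACHE[key]


uri = "gs://hypothetical-bucket/task_config.yaml"
build_with_handpicked_key(uri, edge_dir="in")
stale = build_with_handpicked_key(uri, edge_dir="out")
fresh_in = build_with_derived_key(task_config_uri=uri, edge_dir="in")
fresh_out = build_with_derived_key(task_config_uri=uri, edge_dir="out")

print(stale["edge_dir"])      # "in": the second call reused the wrong dataset
print(fresh_out["edge_dir"])  # "out": the derived key kept the two apart
```

Deriving the key from all arguments only works when every argument is hashable, which may not hold for the real partitioner and splitter objects.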

kmontemayor2-sc · 2025-11-07

I'm not sure we should use lru_cache on build_dataset since that's used for prod and I'm not sure it'd be safe to keep the big datasets in memory like that (it may be! but I haven't looked enough into the details of it to know for sure.)

And we don't really use run_distributed_dataset much elsewhere, it's a bit limited since it takes in a mocked_dataset_info vs some task config uri. Though I guess we could convert the other tests to use it if we wanted to.
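The memory concern about lru_cache can be checked on a toy example. The sketch below assumes nothing about build_dataset itself: `FakeDataset` and `build_fake_dataset` are hypothetical stand-ins. It only shows the standard-library behavior that `functools.lru_cache` holds a strong reference to each returned object until the entry is evicted or the cache is cleared.

```python
import functools
import gc
import weakref


class FakeDataset:
    """Hypothetical stand-in for a large in-memory dataset."""

    def __init__(self, name: str) -> None:
        self.name = name


@functools.lru_cache(maxsize=None)
def build_fake_dataset(name: str) -> FakeDataset:
    # Stand-in for an expensive build; runs once per distinct argument.
    return FakeDataset(name)


ds = build_fake_dataset("a")
again = build_fake_dataset("a")
same_object = ds is again               # the second call is served from the cache
info = build_fake_dataset.cache_info()  # one miss, then one hit

# Drop every caller-side reference; only the cache still points at the object.
ref = weakref.ref(ds)
del ds, again
alive_while_cached = ref() is not None

# Clearing the cache releases the last strong reference.
build_fake_dataset.cache_clear()
gc.collect()
freed_after_clear = ref() is None
```

With `maxsize=None` nothing is ever evicted, so a cached production dataset would stay resident for the life of the process unless `cache_clear()` is called.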

kmontemayor2-sc · 2025-11-07

Since the perf changes here don't seem too big, wdyt about punting on the caching and changing this PR to only include create_test_process_group and the removal of DistContext from the tests?

mkolodner-sc · 2025-11-07

> Since the perf changes here don't seem too big, wdyt about punting on the caching and changing this PR to only include create_test_process_group and the removal of DistContext from the tests?

This makes sense to me, thanks Kyle!

mkolodner-sc · 2025-11-07T22:15:25Z

python/tests/test_assets/distributed/utils.py

    return f"tcp://{host}:{port_picker()}"
+
+
+def create_test_process_group() -> None:
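The diff above shows only the signature of create_test_process_group, so its body is not known here. As a rough sketch of the surrounding pattern, the snippet below builds a `tcp://host:port` init-method URL from an OS-assigned free port using only the standard library. `pick_free_port` and `make_init_method` are hypothetical names and are not the repository's `port_picker`.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def make_init_method(host: str = "127.0.0.1") -> str:
    """Format the URL in the same tcp://host:port shape as the line in the diff."""
    return f"tcp://{host}:{pick_free_port(host)}"


init_method = make_init_method()
port = int(init_method.rsplit(":", 1)[1])

# Assumption, not shown in the PR: a single-process test helper could pass this
# URL on to PyTorch, e.g.
#   torch.distributed.init_process_group(
#       backend="gloo", init_method=init_method, rank=0, world_size=1
#   )
# That call is left as a comment so the sketch runs without torch installed.
```

The port is released before the URL is used, so another process could in principle claim it in between; for tests this race is usually tolerated.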


Copy over some files

7f71f59

Update tests

a8e363e

kmonte added 2 commits November 6, 2025 23:23

copy

18946be

fix cache key

db3ff34

kmontemayor2-sc changed the title ~~Copy over some files~~ Add build_dataset_for_testing for creating cached datasets for testing Nov 7, 2025

mkolodner-sc reviewed Nov 7, 2025
