-
Notifications
You must be signed in to change notification settings - Fork 38
feat: Add --test-mode for resilient bootstrap with failure handling #719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add --test-mode for resilient bootstrap with failure handling #719
Conversation
09630a6 to
919ec13
Compare
|
Have not added the tests for this. Planning to add the tests after initial round of reviews |
dhellmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is going to be very slow, since it will re-process all of the packages that build successfully until it gets to the one that failed and is now marked as prebuilt.
It would be more efficient to do the check in the bootstrapper class at the point where a wheel is being built. If that build fails then we can treat the package as though it was prebuilt by running the logic to handle a prebuilt wheel, even though that wheel is not marked as prebuilt.
You could extract the logic from https://github.com/python-wheel-build/fromager/blob/main/src/fromager/bootstrapper.py#L188-L295 into its own function to make some of that logic easier to deal with. A closure inside the existing function might be easier, since that code uses a lot of the variables from elsewhere in the existing function.
714b3f4 to
4a343fa
Compare
tiran
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for the blunt feedback, but I really don't like this feature. It's making one of the most complex parts of Fromager even more complicated. I spent half an hour with the new code and I still don't understand all its nooks and crannies. That worries me. I found at least one fundamental design flaw with bootstrap-parallel and failing wheel builds.
What use case do you want to solve? Make onboarding of new components easier for a developer? Can we implement simpler features to improve the UX of onboarding components?
- add a global command line flag
--prebuilt-packagesto mark packages are pre-built without creating a config file. - add a
--keep-goingflag tobuild,build-parallel, andbootstrap-parallelcommands. When the flag is set, then an exception fromwheels.build_wheelno longer stops the build and Fromager continues with building packages. At the end, print which packages have failed to build. - a new command that takes a constraints and requirements file, resolves all install dependencies, and tells the user which packages do neither have an sdist on PyPI nor settings/hooks to fetch sdist from somewhere else. Thanks to PEP 714 core metadata, this check should be fast.
src/fromager/commands/bootstrap.py
Outdated
| cache_wheel_server_url=cache_wheel_server_url, | ||
| sdist_only=True, | ||
| skip_constraints=skip_constraints, | ||
| test_mode=test_mode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bootstrap-parallel runs bootstrap in sdist-only mode. That means it's not compiling any wheels except for build system requirements. The test-mode flag does not affect the wheel building for most packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, If you want to test actual wheel compilation failures, you'd need to run fromager bootstrap --test-mode package1 package2
However it will be still useful for bootstrap-parallel to identify some issues with the source distribution e.g. identifying of the source of a package is not available, or any-other issue with dependency resolution
|
As discussed with @tiran last week, I will create a design doc and get it reviewed first then will update this PR. cc @dhellmann |
0cb15a6 to
54dfc06
Compare
|
@dhellmann @tiran Here is the design document for the current implementation : doc |
54dfc06 to
877f2c5
Compare
That design seems more like what I was expecting. It isolates the check for test mode in the bootstrapper where it should be. To handle the parallel case, we could say this feature is not supported, to avoid having to change the build code in a similar way. Downstream we can control the mode when we want to run a test. |
877f2c5 to
23eb2a6
Compare
I am not clear on what we want to do for the build-parallel use case. |
23eb2a6 to
7ecef97
Compare
|
Updated the design doc with |
|
|
||
|
|
||
| @dataclasses.dataclass | ||
| class BuildResult: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the use of a data class here. How about you include the requirement, resolved version, and exception for failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I see #837
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have accepted the suggestion and fixed the code.
|
|
||
| def _find_cached_wheel( | ||
| self, req: Requirement, resolved_version: Version | ||
| ) -> tuple[pathlib.Path | None, pathlib.Path | None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all four combinations possible or just these two?
| ) -> tuple[pathlib.Path | None, pathlib.Path | None]: | |
| ) -> tuple[pathlib.Path, pathlib.Path] | tuple[None, None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like these returns are possible:
- (Path, Path): Wheel found with metadata successfully extracted
- (Path, None): Wheel found but metadata extraction failed
- (None, None): No matching wheel found in any location
So current code looks right to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not changed the code for this, PTAL at the code and let me know if you disagree.
src/fromager/packagesettings.py
Outdated
| self._patches_dir = patches_dir | ||
| self._max_jobs = max_jobs | ||
| self._pbi_cache: dict[Package, PackageBuildInfo] = {} | ||
| self.pre_built_override: set[NormalizedName] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make more sense to have _pre_built_override on the WorkContext object and then have a small API to add / get pre_built overrides. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Fixing it.
92a3dbb to
cb38aef
Compare
cb38aef to
409554d
Compare
|
If multiple packages fails and they are added as prebuilt, here is a simulated output |
|
Here is the actual test and its output |
src/fromager/bootstrapper.py
Outdated
| "get install dependencies of wheel %s", | ||
| wheel_filename.name, | ||
| # Get install dependencies - much simpler logic | ||
| if result.failed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should check for test-mode, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of result.failed is true only in the test-mode. However your concern is valid. We can make the code more defensive for future errors by changing it to if self.test_mode and result.failed:
src/fromager/bootstrapper.py
Outdated
| ) | ||
|
|
||
| try: | ||
| self._mark_package_as_pre_built_runtime(req) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of changing state and going through self._build_wheel_and_sdist() again, could we just invoke self._download_prebuilt() directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch. The previous code was not efficient. Fixed it.
| except Exception as err: | ||
| if test_mode: | ||
| # Test mode: log error but continue processing | ||
| logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is where we end up if bootstrap() fails to resolve a version, but that error isn't saved in the list of errors to be reported at the end of the program and cause it to exit with an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats right, there is the else condition for normal mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't clear.
I want resolution errors to be included with all of the others. In test mode, the bootstrapper should build everything it can, without stopping. It should collect all errors of any kind and save them to be reported when the program exits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, thanks for pointing this out. Fixed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added following.
+ bt.failed_builds.append(
+ bootstrapper.BuildResult.failure(req=req, exception=err)
dc5ee5a to
f32aa3d
Compare
| except Exception as err: | ||
| if test_mode: | ||
| # Test mode: log error but continue processing | ||
| logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't clear.
I want resolution errors to be included with all of the others. In test mode, the bootstrapper should build everything it can, without stopping. It should collect all errors of any kind and save them to be reported when the program exits.
src/fromager/context.py
Outdated
| def enable_parallel_builds(self) -> None: | ||
| self._parallel_builds = True | ||
|
|
||
| def add_pre_built_override(self, package_name: str | NormalizedName) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the change in the logic in the bootstrapper to invoke the logic for prebuilt packages directly, I think you can remove this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I misunderstood. I removed my previous comments. Going to fix it.
6539a5a to
ca3e0da
Compare
Add --test-mode flag that enables resilient bootstrapping by marking failed packages as pre-built and continuing until all packages are processed. Uses optimal n+1 retry logic with comprehensive failure reporting including exception types, messages, and per-package context. Major changes: - Enhanced BuildResult dataclass with req, resolved_version, and exception tracking for detailed failure analysis - Refactored pre_built_override from Settings to WorkContext for proper separation of static config vs runtime state - Introduced public WorkContext.package_build_info() API, replacing direct Settings access across commands (bootstrap, build, graph, list-overrides) - Fixed build-parallel command to use new public API - Added 4 essential test scenarios in test_bootstrap_test_mode.py Benefits: - Discover all build failures in one run rather than stopping on first failure - Support mixed source/binary dependency workflows - Better error context for debugging failed builds - Cleaner API boundaries between configuration and runtime context Fixes python-wheel-build#713 Co-developed-with: Cursor IDE with Claude 4.0 Sonnet Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
ca3e0da to
1bf3690
Compare
Add --test-mode flag that enables resilient bootstrapping by marking failed
packages as pre-built and continuing until all packages are processed. Uses
optimal n+1 retry logic with comprehensive failure reporting including exception
types, messages, and per-package context.
Fixes #713
Co-developed-with: Cursor IDE with Claude 4.0 Sonnet
Command prompts: https://gist.github.com/LalatenduMohanty/762baf9999a09ef3d2d3e63220b9c52e