[WIP] update deps #368
base: main
Co-authored-by: Shubham Vij <reachme@shubhamvij.com>
/help

GiGL Automation @ 21:21:19 UTC: 🤖 Available PR Commands
You can trigger the following workflows by commenting on this PR:
💡 Usage: Simply comment on this PR with any of the commands above (e.g.,
⏱️ Note: Commands may take some time to complete. Progress updates will be posted as comments.
Semgrep found 1 finding:
Risk: Affected versions of pyarrow are vulnerable to Deserialization of Untrusted Data. Improper handling of untrusted data during deserialization lets an attacker achieve arbitrary code execution. Deserialization of Arrow IPC, Feather, or Parquet data is affected.
Fix: Upgrade this library to at least version 14.0.1 at GiGL/uv.lock:3060.
Reference(s): GHSA-5wvp-7f3h-6wmm, CVE-2023-47248
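A minimal sketch, not part of this PR, of a guard that checks a resolved pyarrow version against the 14.0.1 floor named in the finding. The helper names (`parse_release`, `is_patched`, `installed_pyarrow_is_patched`) are hypothetical, and the parser is deliberately coarse: it reads only the leading digits of the first three dot-separated components, so a pre-release or dev build of 14.0.1 would count as patched.

```python
from importlib import metadata

# Floor taken from the Semgrep finding above.
MIN_SAFE_PYARROW = (14, 0, 1)


def parse_release(version):
    """Return the first three numeric components of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_patched(version):
    """True if the version is at or above the floor from the finding."""
    return parse_release(version) >= MIN_SAFE_PYARROW


def installed_pyarrow_is_patched():
    """Check the installed pyarrow; returns None when it is not installed."""
    try:
        return is_patched(metadata.version("pyarrow"))
    except metadata.PackageNotFoundError:
        return None
```

Such a check could run in CI after the lock file is resolved; it does not replace the Semgrep scan itself.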
    torch.distributed.broadcast_object_list(
        object_list=ip_list, src=node_rank, device=device
    )
Semgrep identified an issue in your code:
Functions reliant on pickle can result in arbitrary code execution
To resolve this comment:
✨ Commit Assistant fix suggestion
    - torch.distributed.broadcast_object_list(
    -     object_list=ip_list, src=node_rank, device=device
    - )
    + # NOTE: torch.distributed.broadcast_object_list uses Python's pickle under the hood, which can be unsafe if any node is untrusted.
    + # Here, IP addresses are plain strings, which are safe as long as all nodes are trusted and fully controlled, with no user input.
    + # If this environment is not trusted, consider using tensor-based communication instead.
    + torch.distributed.broadcast_object_list(
    +     object_list=ip_list, src=node_rank, device=device
    + )
    + # If you need stricter security and cannot trust all parties, see the following manual tensor-safe approach:
    + # import torch
    + # MAX_IP_LEN = 64  # Maximum expected length of IP string
    + # if rank == node_rank:
    + #     ip_str_bytes = ip_list[0].encode('utf-8')
    + #     ip_tensor = torch.zeros(MAX_IP_LEN, dtype=torch.uint8, device=device)
    + #     ip_tensor[:len(ip_str_bytes)] = torch.tensor(list(ip_str_bytes), dtype=torch.uint8, device=device)
    + # else:
    + #     ip_tensor = torch.zeros(MAX_IP_LEN, dtype=torch.uint8, device=device)
    + # torch.distributed.broadcast(ip_tensor, src=node_rank)
    + # node_ip = bytes(ip_tensor.cpu().numpy()).rstrip(b'\x00').decode('utf-8')
    + # logger.info(f"Rank {rank} received master internal IP: {node_ip}")
    + # assert node_ip, "Could not retrieve master node's internal IP"
    + # The rest of the code can remain unchanged for primitive string communication in trusted setups.
View step-by-step instructions
- Avoid using `torch.distributed.broadcast_object_list` with arbitrary objects, since this uses Python's pickle under the hood and can be unsafe if any sender is malicious.
- Change `ip_list` to contain only safe and primitive data types. In this case, IP addresses are already plain strings, which are safe.
- If you control every node in your distributed setup and trust all sources, document in your code that use of these broadcast functions requires a trusted environment and does not accept user-supplied objects.
- Alternatively, if any part of your distributed system could be compromised or you want maximum safety, replace `broadcast_object_list` with tensor-based communication, converting IP strings to byte tensors using `ip_tensor = torch.tensor(bytearray(ip_str, "utf-8"))`, and reconstruct the string on the receiving end with `ip_str = bytes(ip_tensor.tolist()).decode("utf-8")`.
- Review all other uses of `broadcast_object_list`, `all_gather_object`, `gather_object`, and `scatter_object_list` to ensure they only serialize primitive types like integers or validated ASCII strings.
IP addresses as plain strings are safe as long as you trust all nodes in your torch.distributed setup; pickle is only a risk if objects originate from or can be influenced by an attacker.
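The tensor-based alternative from the instructions above can be sketched as follows. This is an illustration under stated assumptions, not code from this PR: `encode_ip`, `decode_ip`, and `broadcast_ip` are hypothetical names, `MAX_IP_LEN = 64` is borrowed from the suggestion, and validating the decoded value with the standard-library `ipaddress` module is an added assumption that the payload is always an IP literal rather than a hostname. The encoding helpers are pure Python so they can be exercised without a process group; only `broadcast_ip` touches `torch.distributed`.

```python
import ipaddress

MAX_IP_LEN = 64  # assumed upper bound on the encoded IP string, as in the suggestion


def encode_ip(ip, max_len=MAX_IP_LEN):
    """Encode an IP string as a fixed-length, zero-padded list of byte values."""
    raw = ip.encode("ascii")
    if len(raw) > max_len:
        raise ValueError(f"IP string longer than {max_len} bytes")
    return list(raw) + [0] * (max_len - len(raw))


def decode_ip(values):
    """Strip the zero padding, decode, and reject anything that is not an IP literal."""
    ip = bytes(values).rstrip(b"\x00").decode("ascii")
    ipaddress.ip_address(ip)  # raises ValueError for empty or malformed input
    return ip


def broadcast_ip(ip, src, device="cpu"):
    """Broadcast an IP string from rank `src` as a uint8 tensor (no pickle involved).

    `ip` is only read on the source rank. Requires an initialized process group.
    """
    import torch
    import torch.distributed as dist

    if dist.get_rank() == src:
        payload = torch.tensor(encode_ip(ip), dtype=torch.uint8, device=device)
    else:
        payload = torch.zeros(MAX_IP_LEN, dtype=torch.uint8, device=device)
    dist.broadcast(payload, src=src)
    return decode_ip(payload.cpu().tolist())
```

Because every rank allocates the same fixed-size buffer, receivers never deserialize a sender-controlled object; the worst a misbehaving sender can do is deliver bytes that fail validation in `decode_ip`.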
💬 Ignore this finding
Reply with Semgrep commands to ignore this finding.
- /fp <comment> for false positive
- /ar <comment> for acceptable risk
- /other <comment> for all other reasons
Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by pickles-in-pytorch-distributed.
You can view more details about this finding in the Semgrep AppSec Platform.
Scope of work done
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO