Skip to content

Conversation

@svij-sc
Copy link
Collaborator

@svij-sc svij-sc commented Oct 28, 2025

Scope of work done

  • build_base_docker_images
  • PR merge
    • unit test
    • integration test
    • e2e test
    • lint test
  • Comment
  • unit test
  • integration test
  • e2e test
  • lint test
  • Nightly release
  • Generate hashed requirements (removed)
  • fossa
  • create_release
  • release_documentation
  • release

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

@svij-sc
Copy link
Collaborator Author

svij-sc commented Oct 30, 2025

/help

@github-actions
Copy link
Contributor

GiGL Automation

@ 21:21:19UTC :

🤖 Available PR Commands

You can trigger the following workflows by commenting on this PR:

  • /help - Checkout code
  • /unit_test - Run Unit Tests
  • /integration_test - Run Integration Tests
  • /e2e_test - Run E2E Tests
  • /notebook_tests - Run Example Notebooks Tests
  • /lint_test - Run Linting Tests

💡 Usage: Simply comment on this PR with any of the commands above (e.g., /unit_test)

⏱️ Note: Commands may take some time to complete. Progress updates will be posted as comments.

@semgrep-code-snapchat
Copy link

Semgrep found 1 ssc-3c51c742-25ca-4da5-8ab1-a5a6225e2d25 finding:

  • python/gigl/experimental/knowledge_graph_embedding/common/graph_dataset.py

Risk: Affected versions of pyarrow are vulnerable to Deserialization of Untrusted Data. An attacker can achieve arbitrary code execution due to a vulnerability in the package, stemming from improper handling of untrusted data during deserialization. Deserialization of Arrow IPC, Feather or Parquet data is affected.

Fix: Upgrade this library to at least version 14.0.1 at GiGL/uv.lock:3060.

Reference(s): GHSA-5wvp-7f3h-6wmm, CVE-2023-47248

Comment on lines +147 to +149
torch.distributed.broadcast_object_list(
object_list=ip_list, src=node_rank, device=device
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semgrep identified an issue in your code:
Functions reliant on pickle can result in arbitrary code execution

To resolve this comment:

✨ Commit Assistant fix suggestion

Suggested change
torch.distributed.broadcast_object_list(
object_list=ip_list, src=node_rank, device=device
)
# NOTE: torch.distributed.broadcast_object_list uses Python's pickle under the hood, which can be unsafe if any node is untrusted.
# Here, IP addresses are plain strings, which are safe as long as all nodes are trusted and fully controlled, with no user input.
# If this environment is not trusted, consider using tensor-based communication instead.
torch.distributed.broadcast_object_list(
object_list=ip_list, src=node_rank, device=device
)
# If you need stricter security and cannot trust all parties, see the following manual tensor-safe approach:
# import torch
# MAX_IP_LEN = 64 # Maximum expected length of IP string
# if rank == node_rank:
# ip_str_bytes = ip_list[0].encode('utf-8')
# ip_tensor = torch.zeros(MAX_IP_LEN, dtype=torch.uint8, device=device)
# ip_tensor[:len(ip_str_bytes)] = torch.tensor(list(ip_str_bytes), dtype=torch.uint8, device=device)
# else:
# ip_tensor = torch.zeros(MAX_IP_LEN, dtype=torch.uint8, device=device)
# torch.distributed.broadcast(ip_tensor, src=node_rank)
# node_ip = bytes(ip_tensor.cpu().numpy()).rstrip(b'\x00').decode('utf-8')
# logger.info(f"Rank {rank} received master internal IP: {node_ip}")
# assert node_ip, "Could not retrieve master node's internal IP"
# The rest of the code can remain unchanged for primitive string communication in trusted setups.
View step-by-step instructions
  1. Avoid using torch.distributed.broadcast_object_list with arbitrary objects, since this uses Python's pickle under the hood and can be unsafe if any sender is malicious.
  2. Change ip_list to contain only safe and primitive data types. In this case, IP addresses are already plain strings, which are safe.
  3. If you control every node in your distributed setup and trust all sources, document in your code that use of these broadcast functions requires a trusted environment and does not accept user-supplied objects.
  4. Alternatively, if any part of your distributed system could be compromised or you want maximum safety, replace broadcast_object_list with tensor-based communication, converting IP strings to byte tensors using ip_tensor = torch.tensor(bytearray(ip_str, "utf-8")), and reconstruct the string on the receiving end with ip_str = bytes(ip_tensor.tolist()).decode("utf-8").
  5. Review all other uses of broadcast_object_list, all_gather_object, gather_object, and scatter_object_list to ensure they only serialize primitive types like integers or validated ASCII strings.

IP addresses as plain strings are safe as long as you trust all nodes in your torch.distributed setup; pickle is only a risk if objects originate from or can be influenced by an attacker.

💬 Ignore this finding

Reply with Semgrep commands to ignore this finding.

  • /fp <comment> for false positive
  • /ar <comment> for acceptable risk
  • /other <comment> for all other reasons

Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by pickles-in-pytorch-distributed.

You can view more details about this finding in the Semgrep AppSec Platform.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants