Skip to content

Conversation

@IAMDAVID0920
Copy link
Contributor

@IAMDAVID0920 IAMDAVID0920 commented Nov 10, 2025

📋 PR Title Format

The PR title should follow the format:

type(scope): concise message (max 50 chars)

Where:

  • type is one of: feat, fix, docs, refactor, perf, test, chore.
  • scope is optional and describes the part of the codebase affected (e.g., auth, ui, api).
  • concise message is a short description of the change (max 50 chars).

📝 Change Type

Please select the type of change this PR introduces (choose one or more):

  • feat: New feature.
  • fix: Bug fix.
  • docs: Documentation only changes.
  • refactor: A code change that neither fixes a bug nor adds a feature.
  • perf: Performance improvement.
  • test: Adding missing tests or correcting existing tests.
  • chore: Maintenance tasks (e.g., updating dependencies).

💡 Description

I discovered a bug where self.nodes contains duplicate node entries after bootstrap() completes. This occurs when some nodes remain unallocated after the initial pipeline construction.
I am new to this so please let me know if this behavior is intended or i am thinking it wrong or this is indeed a bug.

The following test consistently produces 3 nodes instead of the expected 2:

def test_scheduler_with_2_nodes():
    model = build_model_info(12)
    
    n1 = build_node("a100-0", model, tflops=312.0, mem_gb=80.0, x=0, y=0)
    n2 = build_node("5090-1", model, tflops=165.0, mem_gb=32.0, x=1, y=0)
    set_rtt_from_coords([n1, n2])
    
    sched = Scheduler(model, [n1, n2], strategy="dp", min_nodes_bootstrapping=2)
    ok = sched.bootstrap()
    assert ok
    assert n1.start_layer is not None and n1.end_layer is not None
    assert n2.start_layer is not None and n2.end_layer is not None

    assert len(sched.list_node_allocations()) == 2
    assert len(sched.nodes) == 2

Expect to see self.nodes to be [Node(node_id='a100-0'...), Node(node_id='5090-1'...)] but

AssertionError: assert 3 == 2
E        +  where 3 = len([Node(node_id='a100-0'...), Node(node_id='5090-1'...), Node(node_id='5090-1'...)] 

During global_allocation(), when nodes remain unallocated after pipeline construction:

  1. allocate_left_over_nodes() iterates through unallocated nodes
  2. Calls join(node)declare(node) for each node
  3. declare() checks: if node.node_id not in self.node_id_to_node
  4. Since node_id_to_node is initialized as an empty dict in init, this check always pass
  5. self.nodes.append(node) adds unallocated node again, creating a duplicate

The issue is that node_id_to_node is never pre-populated with the initial nodes passed to the allocator, so declare() cannot detect that nodes already exist in self.nodes. This will happen on both allocator strategy if there are unallocated nodes left.

Key Changes

  1. Pre-populate node_id_to_node dict in BaseLayerAllocator.__init__ to prevent declare() from re-adding existing nodes
  2. Add tests for both DP and Greedy Allocator

🔗 Related Issues

List any issues this PR closes or relates to:

  • Closes #IssueNumber (e.g., Closes #123)

✅ Checklist

Please ensure the following points are addressed before merging:

  • I have performed a self-review of my own code.
  • I have added/updated tests that prove my fix or feature works (if applicable).
  • I have updated the documentation (if necessary).
  • My code follows the project's style guidelines.

@IAMDAVID0920 IAMDAVID0920 requested a review from a team November 10, 2025 03:35
@TianyiZhao1437
Copy link
Collaborator

TianyiZhao1437 commented Nov 11, 2025

I think currently we always give an empty node list when initializing scheduler. But this fix makes sense to me.
@christ-tt please take a look at this.

Copy link
Collaborator

@TianyiZhao1437 TianyiZhao1437 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@TianyiZhao1437 TianyiZhao1437 merged commit 0e83a7c into GradientHQ:main Nov 12, 2025
4 checks passed
@danget1345
Copy link

like

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants