fix(bug): Prevent node duplication during leftover node allocation #209

IAMDAVID0920 · 2025-11-10T03:35:11Z

📋 PR Title Format

The PR title should follow the format:

type(scope): concise message (max 50 chars)

Where:

type is one of: feat, fix, docs, refactor, perf, test, chore.
scope is optional and describes the part of the codebase affected (e.g., auth, ui, api).
concise message is a short description of the change (max 50 chars).

📝 Change Type

Please select the type of change this PR introduces (choose one or more):

feat: New feature.
fix: Bug fix.
docs: Documentation only changes.
refactor: A code change that neither fixes a bug nor adds a feature.
perf: Performance improvement.
test: Adding missing tests or correcting existing tests.
chore: Maintenance tasks (e.g., updating dependencies).

💡 Description

I discovered a bug where self.nodes contains duplicate node entries after bootstrap() completes. This occurs when some nodes remain unallocated after the initial pipeline construction.
I am new to this codebase, so please let me know whether this behavior is intended, whether I am misunderstanding it, or whether it is indeed a bug.

The following test consistently produces 3 nodes instead of the expected 2:

def test_scheduler_with_2_nodes():
    model = build_model_info(12)
    
    n1 = build_node("a100-0", model, tflops=312.0, mem_gb=80.0, x=0, y=0)
    n2 = build_node("5090-1", model, tflops=165.0, mem_gb=32.0, x=1, y=0)
    set_rtt_from_coords([n1, n2])
    
    sched = Scheduler(model, [n1, n2], strategy="dp", min_nodes_bootstrapping=2)
    ok = sched.bootstrap()
    assert ok
    assert n1.start_layer is not None and n1.end_layer is not None
    assert n2.start_layer is not None and n2.end_layer is not None

    assert len(sched.list_node_allocations()) == 2
    assert len(sched.nodes) == 2

I expected self.nodes to be [Node(node_id='a100-0'...), Node(node_id='5090-1'...)], but the test fails with:

AssertionError: assert 3 == 2
E        +  where 3 = len([Node(node_id='a100-0'...), Node(node_id='5090-1'...), Node(node_id='5090-1'...)]

During global_allocation(), when nodes remain unallocated after pipeline construction:

allocate_left_over_nodes() iterates through the unallocated nodes.
It calls join(node) → declare(node) for each node.
declare() checks: if node.node_id not in self.node_id_to_node.
Since node_id_to_node is initialized as an empty dict in __init__, this check always passes.
self.nodes.append(node) then adds the unallocated node again, creating a duplicate.

The issue is that node_id_to_node is never pre-populated with the initial nodes passed to the allocator, so declare() cannot detect that those nodes already exist in self.nodes. This happens with both allocator strategies (DP and greedy) whenever unallocated nodes are left over.
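The call chain described above can be reproduced in isolation. The following is only a minimal sketch: LeftoverAllocatorSketch and the one-field Node are hypothetical stand-ins written for illustration, not the project's actual classes, and they model only the node list, the id lookup dict, and the declare() check.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in: the real Node carries many more fields.
    node_id: str


class LeftoverAllocatorSketch:
    """Illustrative model of the buggy behavior, not the real allocator."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # The lookup dict starts empty even though self.nodes is not.
        self.node_id_to_node = {}

    def declare(self, node):
        # With an empty dict, this check passes for nodes already in self.nodes.
        if node.node_id not in self.node_id_to_node:
            self.nodes.append(node)
            self.node_id_to_node[node.node_id] = node

    def join(self, node):
        self.declare(node)

    def allocate_left_over_nodes(self, leftover):
        for node in leftover:
            self.join(node)


n1, n2 = Node("a100-0"), Node("5090-1")
buggy = LeftoverAllocatorSketch([n1, n2])
# Assume n2 was left unallocated after pipeline construction.
buggy.allocate_left_over_nodes([n2])
duplicate_ids = [n.node_id for n in buggy.nodes]
print(duplicate_ids)  # the leftover node's id appears twice
```

In this sketch the leftover node ends up in the list twice, which matches the 3 == 2 assertion failure reported above.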

Key Changes

Pre-populate node_id_to_node dict in BaseLayerAllocator.__init__ to prevent declare() from re-adding existing nodes
Add tests for both DP and Greedy Allocator
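As a rough illustration of the first change, the sketch below pre-populates the lookup dict in the constructor so that declare() recognizes nodes it was given at construction time. PrepopulatedAllocatorSketch and the one-field Node are hypothetical names used only for this example; the real change lives in BaseLayerAllocator.__init__ and may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in: the real Node carries many more fields.
    node_id: str


class PrepopulatedAllocatorSketch:
    """Illustrative model of the fix, not the real BaseLayerAllocator."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Index the initial nodes by id so declare() can see them.
        self.node_id_to_node = {node.node_id: node for node in self.nodes}

    def declare(self, node):
        # Only nodes with an unseen id are appended.
        if node.node_id not in self.node_id_to_node:
            self.nodes.append(node)
            self.node_id_to_node[node.node_id] = node

    def join(self, node):
        self.declare(node)

    def allocate_left_over_nodes(self, leftover):
        for node in leftover:
            self.join(node)


n1, n2 = Node("a100-0"), Node("5090-1")
fixed = PrepopulatedAllocatorSketch([n1, n2])
# Re-joining a node that was passed in at construction is now a no-op.
fixed.allocate_left_over_nodes([n2])
ids_after_leftover = [n.node_id for n in fixed.nodes]

# A node with a new id is still added as before.
fixed.join(Node("h100-2"))
ids_after_new_join = [n.node_id for n in fixed.nodes]
```

The dict comprehension keeps the existing declare() check unchanged, so joining a node with a new id still appends it exactly once.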

🔗 Related Issues

List any issues this PR closes or relates to:

Closes #IssueNumber (e.g., Closes #123)

✅ Checklist

Please ensure the following points are addressed before merging:

I have performed a self-review of my own code.
I have added/updated tests that prove my fix or feature works (if applicable).
I have updated the documentation (if necessary).
My code follows the project's style guidelines.

TianyiZhao1437 · 2025-11-11T03:47:22Z

I think we currently always pass an empty node list when initializing the scheduler, but this fix makes sense to me.
@christ-tt please take a look at this.

TianyiZhao1437

LGTM. Thanks!

danget1345 · 2025-11-12T15:24:17Z

like

fix(bug): Prevent node duplication during leftover node allocation

78beea7

IAMDAVID0920 requested a review from a team November 10, 2025 03:35

TianyiZhao1437 approved these changes Nov 12, 2025

View reviewed changes

TianyiZhao1437 merged commit 0e83a7c into GradientHQ:main Nov 12, 2025
4 checks passed
