Logic Inconsistency In ScatterMoE during Expert Parallel #121

@fabianlim

@willmj I noticed an inconsistency in the logic, although the resulting behavior is correct:

  1. When creating the ScatterMoE module we use num_experts_per_device. When ep_degree > 1, this results in the router being constructed with num_experts_per_device outputs.
  2. However, the router weights need to be replicated across devices. This does happen in load_experts_onto_device, because the state_dict sd loaded there always contains the full-sized router weights.

So we end up with the following inconsistency: the module is declared with 20 output features, but its weight has 40 rows.

(Pdb) mod
Linear(in_features=1536, out_features=20, bias=False)
(Pdb) mod.weight.shape
torch.Size([40, 1536])
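
The mismatch can be reproduced in isolation with a plain nn.Linear. This is only a minimal sketch, not the actual ScatterMoE or load_experts_onto_device code: the sizes 1536, 20 and 40 come from the Pdb output above, ep_degree = 2 is an assumption chosen so that 40 experts split into 20 per device, and the direct parameter assignment is a hypothetical stand-in for the checkpoint load.

```python
import torch
from torch import nn

# Sizes taken from the Pdb output; ep_degree = 2 is an assumption.
hidden_size = 1536
num_experts = 40
ep_degree = 2
num_experts_per_device = num_experts // ep_degree  # 20

# Step 1: the router is constructed with the per-device expert count.
router = nn.Linear(hidden_size, num_experts_per_device, bias=False)

# Step 2: hypothetical stand-in for the checkpoint load. The full-sized
# router weight replaces the parameter directly, so the module's
# out_features attribute is never updated.
router.weight = nn.Parameter(torch.zeros(num_experts, hidden_size))

declared = router.out_features   # value shown by repr(router)
actual = router.weight.shape[0]  # value the forward pass really uses

# nn.Linear computes its output from the weight tensor, not from
# out_features, which is why the behavior stays correct.
logits = router(torch.zeros(3, hidden_size))

print(router)
print(declared, actual, tuple(logits.shape))
```

Because the forward pass only depends on the weight tensor, the router still produces logits over all experts. The stale out_features value affects only the printed module description and any code that reads that attribute.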

Labels: bug, help wanted
