Logic Inconsistency In ScatterMoE during Expert Parallel #121

@willmj I noticed an inconsistency in the logic, although the resulting behavior is correct:
1. When creating the [ScatterMoE](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/scattermoe_prepare.py#L279) module we use `num_experts_per_device`. When `ep_degree > 1`, this results in the router `weights` being constructed with only `num_experts_per_device` outputs.
2. But the `router` weights need to be replicated across devices, and this does happen in [load_experts_onto_device](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/scattermoe_prepare.py#L328C17-L328C41), because the state_dict `sd` loaded [here](https://github.com/foundation-model-stack/fms-acceleration/blob/main/plugins/accelerated-moe/src/fms_acceleration_moe/utils/scattermoe_prepare.py#L247-L266) always contains the full-sized router.
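
To make the two sizes concrete, here is a minimal sketch of the arithmetic behind points 1 and 2. The helper names are hypothetical and not part of `fms_acceleration_moe`; the values `num_experts = 40` and `ep_degree = 2` are assumptions chosen to be consistent with the 20 vs. 40 shapes reported in this issue.

```python
# Hypothetical helpers; only `num_experts_per_device` and `ep_degree`
# are names taken from the issue text.

def declared_router_outputs(num_experts: int, ep_degree: int) -> int:
    """Router width implied by building the module with num_experts_per_device."""
    if num_experts % ep_degree != 0:
        raise ValueError("num_experts must be divisible by ep_degree")
    return num_experts // ep_degree


def loaded_router_outputs(num_experts: int) -> int:
    """Router width actually loaded: the router in the state dict is full-sized."""
    return num_experts


# Assumed values, consistent with the 20 vs. 40 shapes in this issue.
num_experts, ep_degree = 40, 2
declared = declared_router_outputs(num_experts, ep_degree)
loaded = loaded_router_outputs(num_experts)
print(declared, loaded)  # 20 40
```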

So we end up with this inconsistency, where the module reports `out_features=20` but its weight has 40 rows:

```
(Pdb) mod
Linear(in_features=1536, out_features=20, bias=False)
(Pdb) mod.weight.shape
torch.Size([40, 1536])
```
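
The mismatch can be reproduced in isolation with plain PyTorch. This is only a sketch: it assumes the full router weight ends up on the module through direct parameter assignment, which may not be exactly what `load_experts_onto_device` does. It also suggests why the behavior is still correct: `nn.Linear.forward` uses the weight tensor itself, so the stale `out_features` attribute only affects the printed representation.

```python
import torch
from torch import nn

# Build the router the way the module is constructed under expert parallel:
# with num_experts_per_device (20 here) outputs.
router = nn.Linear(1536, 20, bias=False)

# Assumption: the full-sized router weight (40 experts) is installed by
# assigning a new parameter, which bypasses any shape check.
router.weight = nn.Parameter(torch.zeros(40, 1536))

# The constructor argument is stale, but the forward pass follows the weight.
x = torch.zeros(2, 1536)
out = router(x)
print(router.out_features, tuple(router.weight.shape), tuple(out.shape))
```

One possible way to remove the inconsistency would be to construct the router with the full number of experts regardless of `ep_degree`, so that the declared and loaded shapes agree.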
