
Simplex reconfiguration framework - Part III (MSM implementation) #365

Merged
yacovm merged 17 commits into main from reconfig-3 on May 13, 2026

Conversation

Collaborator

@yacovm yacovm commented Apr 16, 2026

  • Add block building to msm.go
  • Add verification.go which contains logic for block verification
  • Add tests that mimic Simplex flow (fake_node_test.go)


Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
@yacovm yacovm force-pushed the reconfig-3 branch 2 times, most recently from 62fef9a to 7430e36 on May 5, 2026 at 20:04
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
yacovm and others added 2 commits May 5, 2026 22:40
Addresses review comments: finalized blocks were a prefix of notarized
blocks, so keeping two separate slices was redundant and error prone
(e.g. tryFinalizeNextBlock panicked when the two went out of sync).
Replaces both with a single blocks []blockState slice where each entry
carries a finalized flag.

Also removes the duplicate lookup in the GetBlock test fixture that
would return finalized blocks as non-finalized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Comment thread msm/msm_test.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm_test.go Outdated
Comment thread msm/msm_test.go Outdated
Comment thread msm/msm_test.go Outdated
Comment on lines +1032 to +1037
bs := make(blockStore)
bs[0] = &outerBlock{block: genesisBlock}

var testConfig testConfig
testConfig.blockStore = bs
testConfig.validatorSetRetriever.result = NodeBLSMappings{
Collaborator

maybe we should have constructors for helpers? seems like it would set them up to be more re-usable + maintainable in the future?

Collaborator Author

maybe we can do that after i re-introduce verification to do that in bulk.
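A constructor along the lines the reviewer suggests might look like this (the fixture types below are hypothetical stand-ins mirroring the snippet above, not the real test helpers):

```go
package main

import "fmt"

// Hypothetical fixture types mirroring the test snippet in the thread.
type block struct{ seq uint64 }
type outerBlock struct{ block *block }
type blockStore map[uint64]*outerBlock

type testConfig struct {
	blockStore blockStore
}

// newTestConfig is the kind of constructor the reviewer proposes: it
// seeds the store with the genesis block so each test doesn't repeat
// the setup by hand.
func newTestConfig(genesis *block) *testConfig {
	bs := make(blockStore)
	bs[0] = &outerBlock{block: genesis}
	return &testConfig{blockStore: bs}
}

func main() {
	cfg := newTestConfig(&block{seq: 0})
	fmt.Println(len(cfg.blockStore)) // 1
}
```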

Comment thread msm/msm.go Outdated
Comment on lines +40 to +46
type SignatureAggregator interface {
AggregateSignatures(signatures ...[]byte) ([]byte, error)

IsQuorum(approverWeights []uint64, totalWeight uint64) bool
Collaborator

i don't think we should redefine this interface in the msm. Would it be better to use the interface defined in api.go?

I think it's better for two reasons. The first is that we would currently need to implement these SignatureAggregator interfaces using two different structs, because function names cannot be overloaded. More importantly, I think we should have just one implementation of IsQuorum. I think it's reasonable to assume the SignatureAggregator would have the weight of each node in a PoS setting, so we can just use the IsQuorum method defined in api.go:SignatureAggregator.

Collaborator Author

The problem is that the signature aggregator cannot know which P-chain block we're referring to, so if it gets an invocation from an epoch or from an MSM it has no clue which P-chain state to look up against.

I agree that it's best for IsQuorum to be a single, unified API.

What do you think about doing the complete opposite: have the Epoch use IsQuorum(approverWeights []uint64, totalWeight uint64) bool and also add weights to the membership of the communication? This would simplify the implementation of IsQuorum on the AvalancheGo side.

Collaborator

to recap our quick sync offline: let's initialize SignatureAggregators with a set of node <-> weight mappings, and then use the newly initialized signature aggregator when calling IsQuorum in the MSM approval logic. This allows us to not worry about passing in weights and keep the abstraction in AvalancheGo as is.
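The agreed approach, constructing the aggregator with a node-to-weight mapping so that IsQuorum needs no extra weight arguments, could be sketched like this (all types, names, and the >2/3 quorum rule here are assumptions for illustration, not the PR's API):

```go
package main

import "fmt"

type nodeID string

// weightedAggregator is initialized with a node -> weight mapping, so
// callers in the MSM approval logic never pass weights explicitly.
type weightedAggregator struct {
	weights     map[nodeID]uint64
	totalWeight uint64
}

func newWeightedAggregator(weights map[nodeID]uint64) *weightedAggregator {
	var total uint64
	for _, w := range weights {
		total += w
	}
	return &weightedAggregator{weights: weights, totalWeight: total}
}

// IsQuorum checks whether the approvers hold a strict >2/3 weight
// majority, a common BFT quorum rule (assumed here, not taken from the PR).
func (a *weightedAggregator) IsQuorum(approvers []nodeID) bool {
	var approved uint64
	for _, id := range approvers {
		approved += a.weights[id]
	}
	return approved*3 > a.totalWeight*2
}

func main() {
	agg := newWeightedAggregator(map[nodeID]uint64{"a": 40, "b": 30, "c": 30})
	fmt.Println(agg.IsQuorum([]nodeID{"a", "b"})) // true: 70 of 100
	fmt.Println(agg.IsQuorum([]nodeID{"a"}))      // false: 40 of 100
}
```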

Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm_test.go Outdated
Comment thread msm/msm_test.go Outdated
Comment thread msm/msm_test.go Outdated
Comment thread msm/msm.go
seq := pmd.Seq

if seq == 0 {
return fmt.Errorf("attempted to build a genesis inner block")
Collaborator

this is in verification though, not block building

Collaborator Author

yeah but the idea is "hey, someone tried to build a genesis block".

Also it shouldn't be "inner" - it's a copy-paste error I thought I fixed.

Will fix it to something more reasonable like "received a genesis block"
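The fixed check might read as follows (a minimal sketch; the function name verifySeq is hypothetical, and only the reworded error message comes from the discussion):

```go
package main

import "fmt"

// verifySeq sketches the check discussed above, with the error message
// reworded as the author suggests ("received a genesis block").
func verifySeq(seq uint64) error {
	if seq == 0 {
		// This is verification, not block building: a peer sent us a
		// block claiming the genesis sequence, which is never valid.
		return fmt.Errorf("received a genesis block")
	}
	return nil
}

func main() {
	fmt.Println(verifySeq(0))  // received a genesis block
	fmt.Println(verifySeq(42)) // <nil>
}
```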

Comment thread msm/msm_test.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go
Comment thread msm/msm.go Outdated
}
simplexEpochInfo := constructSimplexZeroBlock(pChainHeight, newValidatorSet, prevVMBlockSeq)

return sm.buildBlockImpatiently(ctx, parentBlock, simplexMetadata, simplexBlacklist, simplexEpochInfo, pChainHeight)
Collaborator

🤔

The caller, BuildBlock, is only called when we are selected to build a block. We also do the wait-impatiently thing here, but more often than not we won't actually have a block to build. I say this because when we start a chain we will probably not begin issuing transactions right away.

Therefore, more often than not the first Simplex block will just be an empty one that is produced right after the Simplex chain is created. This also means that most "zero" blocks will be the same.

Because most zero blocks will be the same, I'm thinking we could hardcode the zeroth block and remove a lot of the branching caused by treating zeroth blocks so specially (no need for the build-block-zero method, the verify method becomes simple, and a lot of branching goes away).

Another possibility: why rush to build the zeroth block? Can't we just wait like the normal blocks?

Maybe my logic is wrong somewhere, but I generally have a feeling there is a much simpler solution for handling the zeroth case.

Collaborator Author

Because most zero blocks will be the same I'm thinking we could either hardcode the zeroth block and remove a lot of the branching that is caused by treating zeroth blocks so specially (no need for the build block zero method, the verify method becomes simple and a lot of branching is removed).

we can't hardcode the zero block because it has metadata fields we cannot precompute, such as ICM epoch and timestamp, previous VM block seq, etc.

Another possibility -> why rush to build the zeroth block? can't we just wait like the normal blocks?

so we can easily expand the validator set. Essentially, if we spin up a single-node network, we can then add a node, then another, then another, all by interacting with the P-chain, and we will never need any user traffic.

If we need user traffic to build the zero epoch, we're stuck.
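A rough sketch of the build-time metadata the author lists, showing why the zeroth block can't be precomputed (field names are guesses based on the discussion, not the PR's actual encoding):

```go
package main

import (
	"fmt"
	"time"
)

// zeroBlockMetadata illustrates why the zeroth block can't be
// hardcoded: every field depends on state known only at build time.
type zeroBlockMetadata struct {
	ICMEpoch       uint64    // current ICM epoch when the block is built
	Timestamp      time.Time // wall-clock build time
	PrevVMBlockSeq uint64    // sequence of the last VM block before Simplex
	PChainHeight   uint64    // P-chain height defining the validator set
}

func main() {
	md := zeroBlockMetadata{
		ICMEpoch:       7,
		Timestamp:      time.Now(),
		PrevVMBlockSeq: 1234,
		PChainHeight:   99,
	}
	fmt.Println(md.ICMEpoch, md.PrevVMBlockSeq, md.PChainHeight)
}
```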

yacovm added 3 commits May 8, 2026 16:48
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Collaborator

@samliok samliok left a comment

w

Comment thread msm/msm.go
Comment on lines +180 to +183
case stateBuildBlockNormalOp:
return sm.buildBlockNormalOp(ctx, parentBlock, simplexMetadataBytes, simplexBlacklistBytes, prevBlockSeq)
case stateBuildCollectingApprovals:
return sm.buildBlockCollectingApprovals(ctx, parentBlock, simplexMetadataBytes, simplexBlacklistBytes, prevBlockSeq)
Collaborator

What happens if we are collecting approvals for a validator set change v1, but the P-chain height has advanced to a new validator set v2? Am I correct to say the current code would first finish collecting approvals for v1 and then transition to v2 after all approvals are collected?

Would it be better to stop the transition of v1 when we notice v2? Curious about your thoughts.

Collaborator Author

Once we have the next P-chain reference height (we moved from normal block to collecting approvals) we are essentially "locked" on that P-chain height for the epoch transition and cannot change it.

am i correct to say the current code would first finish collecting approvals for v1 and then transition to v2 after all approvals are collected?

Yes, in such a case we'll have two epoch changes one after the other.

would it be better to stop the transition of v1 when we notice v2? curious about your thoughts

So, I did think about this "problem" as well when I designed the MSM, but I don't think this is a use case we should optimize for.

We might just be able to solve this problem passively by using the P-chain height from ICM epochs, because they advance their P-chain height slowly, once per time frame (I think 30 seconds or so), which implicitly batches P-chain changes.

I'll look at the code once I re-introduce ICM epoching and see if I can make it more robust.

Thanks for asking this 👍
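The locking behavior described above, where the P-chain height is pinned once approval collection starts, might be sketched like this (the state names loosely mirror the snippet in the thread; the struct and method are hypothetical):

```go
package main

import "fmt"

type msmState int

const (
	stateBuildBlockNormalOp msmState = iota
	stateBuildCollectingApprovals
)

// msm pins the P-chain height for the epoch transition once it enters
// approval collection; a later advance (v2) queues a second transition
// rather than restarting the current one (illustrative only).
type msm struct {
	state              msmState
	lockedPChainHeight uint64
}

// observePChainHeight returns the height the transition should use.
func (m *msm) observePChainHeight(h uint64) uint64 {
	if m.state == stateBuildCollectingApprovals {
		// Locked on v1: v2 must wait until this transition completes.
		return m.lockedPChainHeight
	}
	m.lockedPChainHeight = h
	return h
}

func main() {
	m := &msm{}
	fmt.Println(m.observePChainHeight(10)) // 10: normal op tracks the height
	m.state = stateBuildCollectingApprovals
	fmt.Println(m.observePChainHeight(20)) // 10: locked for the transition
}
```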

Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
yacovm added 4 commits May 12, 2026 16:42
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Comment thread api.go Outdated
Comment thread api.go Outdated
Comment thread epoch.go Outdated
Comment thread pos_test.go
Comment thread msm/msm.go
Comment thread msm/msm.go
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/encoding.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go
Comment thread msm/msm.go Outdated
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Comment thread msm/encoding.go
NextEpochApprovals *NextEpochApprovals `canoto:"pointer,7"`
// SealingBlockSeq is the block sequence of the sealing block of the current epoch.
// It defines the validator set of the next epoch.
SealingBlockSeq uint64 `canoto:"uint,8"`
Collaborator

Is this only set when we are building Telocks?

Collaborator Author

Yes. It's set once we build the first Telock and then we copy over the previous Telock's sealing block seq in induction.
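That induction could be sketched as follows (names are assumptions based on the discussion, not the PR's actual code):

```go
package main

import "fmt"

type telockMetadata struct {
	// SealingBlockSeq identifies the block that seals the current epoch
	// and thereby defines the next epoch's validator set.
	SealingBlockSeq uint64
}

// nextTelockMetadata sketches the induction: the first Telock records
// the sealing block's sequence, and every later Telock copies it from
// its parent unchanged.
func nextTelockMetadata(parent *telockMetadata, sealingSeq uint64) telockMetadata {
	if parent == nil {
		// First Telock of the transition: record the sealing block's seq.
		return telockMetadata{SealingBlockSeq: sealingSeq}
	}
	// Later Telocks inherit the value from their parent.
	return telockMetadata{SealingBlockSeq: parent.SealingBlockSeq}
}

func main() {
	first := nextTelockMetadata(nil, 5)
	second := nextTelockMetadata(&first, 999) // the later seq is ignored
	fmt.Println(first.SealingBlockSeq, second.SealingBlockSeq) // 5 5
}
```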

Comment thread msm/msm.go
}

// TODO: This P-chain height should be taken from the ICM epoch
childBlock, err := sm.BlockBuilder.BuildBlock(ctx, sm.GetPChainHeight())
Collaborator

🤔 we are building a new block for the epoch, but a couple things can change between epochs

  1. validator set ordering -> this means we might have been the leader for seq x in epoch 1, but now that the epoch has changed we may not be the leader for seq x in epoch 2.
  2. Our node may be kicked out of validator set, so we shouldn't be building a block anyways

Collaborator Author

we are building a new block for the epoch, but a couple things can change between epochs

To be precise, we're building a block for a new epoch.

  1. validator set ordering -> this means we might have been the leader for seq x in epoch 1, but now that the epoch has changed we may not be the leader for seq x in epoch 2.
  2. Our node may be kicked out of validator set, so we shouldn't be building a block anyways

A new epoch, a new Epoch instance, right?

Why does it matter which node is the leader at the current round? The MSM assumes nothing about the relative position of the node within the validator set. Eventually some non-faulty node will be the leader and will execute this line. When a node executes this line, it will already have the new Epoch instance with the new validator set, right?

Keep in mind that the API of the MSM is very primitive - we can either BuildBlock or VerifyBlock. There is no assumption of any kind of continuation. We can only build a single block at any given round.

So, if the first time we execute this function we haven't sealed our epoch (finalized the sealing block), we are running under the context of the previous epoch. Otherwise, if the epoch has been sealed, then this is executed under the context of the new epoch, and therefore a node that has been evicted isn't expected to execute it (and even if it did, no one would accept such a block).

Does that make sense?

Collaborator

ah makes sense, thanks for the explanation 👍

Comment thread msm/msm_test.go
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
@yacovm yacovm merged commit c7114b6 into main May 13, 2026
5 checks passed