Hi @mat10d, thank you for EVOLVEpro. It has been very helpful.
I'm running an experimental campaign (Cas protein, Rounds 1–5, WT-normalized activity with WT = 1.0; ESM-2 15B embeddings + RF top layer). I understand from #31 that high-activity single substitutions are used to seed multi-mutant combinations in later rounds. My question is about the single-mutant training set itself, not the multi-mutant seeds.
Question
When training the RF top-layer model for recommending the next round of single mutants, is it better to:
- (A) train on all measured variants across rounds (full activity range, including low/neutral/deleterious ones), or
- (B) train on only the beneficial subset above an activity cutoff (e.g. activity > 1.1 or > 1.2)?
Why I'm asking — what I observed
I ran a cutoff sensitivity analysis, keeping only variants above a threshold for training:
| Training set | # train variants | Top-1 recommendation | Top-1 y_pred |
|---|---|---|---|
| activity > 0.9 | 31 | N423L | 1.610 |
| activity > 1.0 | 24 | D78K | 1.730 |
| activity > 1.1 | 20 | E274F | 1.815 |
| activity > 1.2 | 16 | D78K | 1.740 |
| IQR-outlier removed | 54 | E236R | 1.261 |
Two things stood out:
- Higher cutoff → fewer training points but a higher top-1 y_pred (it peaks at activity > 1.1 and dips slightly at > 1.2). I'm unsure whether the higher predicted values reflect a genuinely better-targeted search, or simply an optimistic bias from training on a truncated, high-only label distribution (the RF can no longer "see" what low activity looks like).
- Removing the single strongest variant as an outlier completely changed the recommendation direction (toward re-exploring residues near that hotspot) and lowered all predictions.
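To make the first concern concrete, here is a minimal sketch on synthetic data (not my real measurements, and not EVOLVEpro's own code). It assumes scikit-learn's `RandomForestRegressor` as a stand-in for the RF top layer; `cutoff_sweep`, the random 8-dimensional "embeddings", and the toy activity model are all hypothetical names and choices made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def cutoff_sweep(X, y, X_candidates, cutoffs, seed=0):
    """Train one forest per cutoff and summarize its candidate predictions.

    A cutoff of None means "train on the full measured range".
    """
    rows = []
    for cutoff in cutoffs:
        # Keep every variant, or only those strictly above the cutoff.
        keep = np.ones(len(y), dtype=bool) if cutoff is None else y > cutoff
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X[keep], y[keep])
        pred = model.predict(X_candidates)
        rows.append({
            "cutoff": cutoff,
            "n_train": int(keep.sum()),
            "min_train_label": float(y[keep].min()),
            "min_pred": float(pred.min()),
            "mean_pred": float(pred.mean()),
            "top_pred": float(pred.max()),
        })
    return rows


# Synthetic stand-ins (assumptions): 60 "measured" variants, 200 candidates.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                       # fake embeddings
y = 1.0 + 0.3 * X[:, 0] + rng.normal(scale=0.1, size=60)  # fake activity
X_candidates = rng.normal(size=(200, 8))

results = cutoff_sweep(X, y, X_candidates, [None, 0.9, 1.0, 1.1, 1.2])
for row in results:
    print(row)
```

In this toy setup, every prediction from a cutoff-trained forest sits at or above the lowest retained label, because a regression forest predicts by averaging training labels within its leaves. So at least part of the rise in y_pred with a higher cutoff is expected from the filtering alone, independent of whether the recommendations are actually better.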
Specific questions
- In your experience, does restricting the training set to high-activity variants improve the real hit rate of the next round, or does it mainly inflate predicted scores by removing the negative/low part of the label distribution?
- Does the RF top layer need low/neutral/deleterious examples to learn a useful activity gradient, or is it robust to a high-only (left-truncated) training distribution?
- Is there a recommended way to choose the activity cutoff, or do you recommend always training on the full measured range and only applying activity filtering at a later step (for example, when choosing seeds for multi-mutants)?
Thanks again for your time and for sharing this tool!