Is it better to train the top-layer model on only high-activity variants, or on the full activity range (including low/neutral variants)?

Hi @mat10d, thank you for EVOLVEpro; it has been very helpful.

I'm running an experimental campaign (Cas protein, Rounds 1–5, WT-normalized activity with WT = 1.0; ESM-2 15B embeddings + RF top layer). I understand from #31 that high-activity single substitutions are used to *seed multi-mutant combinations* in later rounds. My question is about the **single-mutant training set itself**, not the multi-mutant seeds.

**Question**
When training the RF top-layer model for recommending the next round of *single* mutants, is it better to:

- (A) train on **all measured variants** across rounds (full activity range, including low/neutral/deleterious ones), or
- (B) train on **only the beneficial subset** above an activity cutoff (e.g. activity > 1.1 or > 1.2)?

**Why I'm asking — what I observed**
I ran a cutoff sensitivity analysis, keeping only variants above a threshold for training:

| Training set | # train variants | Top-1 recommendation | Top-1 y_pred |
|---|---|---|---|
| activity > 0.9 | 31 | N423L | 1.610 |
| activity > 1.0 | 24 | D78K | 1.730 |
| activity > 1.1 | 20 | E274F | 1.815 |
| activity > 1.2 | 16 | D78K | 1.740 |
| IQR-outlier removed | 54 | E236R | 1.261 |

Two things stood out:

1. **Higher cutoff → fewer training points but higher top-1 y_pred** (peaking at activity > 1.1). I'm unsure whether the higher predicted values reflect a genuinely better-targeted search, or simply an optimistic bias from training on a censored, high-only label distribution (the RF can no longer "see" what low activity looks like).
2. **Removing the single strongest variant as an outlier completely changed the recommendation direction** (toward re-exploring residues near that hotspot) and lowered all predictions.
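
To make the concern in point 1 concrete, here is a minimal sketch on synthetic data. It is not EVOLVEpro code: it assumes scikit-learn's `RandomForestRegressor` as a stand-in for the RF top layer, random vectors in place of ESM-2 embeddings, and a made-up activity function. Since a regression forest predicts averages of training labels, a model trained only on `activity > cutoff` cannot predict below the lowest label it was shown, whatever the candidate actually does.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Assumption: synthetic stand-ins for embeddings and WT-normalized activity.
n_measured, n_candidates, n_dim = 80, 200, 16
X_measured = rng.normal(size=(n_measured, n_dim))
X_candidates = rng.normal(size=(n_candidates, n_dim))

# Hypothetical activity: baseline of 1.0 (WT-like) plus signal from two dimensions.
y_measured = (
    1.0
    + 0.3 * X_measured[:, 0]
    - 0.2 * X_measured[:, 1]
    + rng.normal(scale=0.05, size=n_measured)
)


def sweep(cutoffs):
    """Train one forest per cutoff and summarize its predictions on the candidates."""
    rows = []
    for cutoff in cutoffs:
        # cutoff=None means "train on the full measured range".
        if cutoff is None:
            mask = np.ones(n_measured, dtype=bool)
        else:
            mask = y_measured > cutoff
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_measured[mask], y_measured[mask])
        y_pred = model.predict(X_candidates)
        rows.append(
            {
                "cutoff": cutoff,
                "n_train": int(mask.sum()),
                "train_min": float(y_measured[mask].min()),
                "min_pred": float(y_pred.min()),
                "top1_pred": float(y_pred.max()),
            }
        )
    return rows


results = sweep([None, 0.9, 1.0, 1.1])
for row in results:
    print(row)
```

In this toy setup `min_pred` never drops below `train_min`, so raising the cutoff lifts the floor of every prediction by construction. That alone does not say whether the ranking of candidates got better or worse.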

**Specific questions**
1. In your experience, does restricting the training set to high-activity variants improve the *real* hit rate of the next round, or does it mainly inflate predicted scores by removing the negative/low part of the label distribution?
2. Does the RF top layer need low/neutral/deleterious examples to learn a useful activity gradient, or is it robust to a high-only (right-censored) training distribution?
3. Is there a recommended way to choose the activity cutoff, or do you recommend always training on the full measured range and applying activity filtering only when selecting seeds for multi-mutant combinations (as in #31)?
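
One way I could probe questions 1 and 2 on my own data is a retrospective split: train on the earlier rounds under each policy, then score the latest round, which has been measured but is hidden from the model. The sketch below shows the idea on synthetic data only. It assumes scikit-learn's `RandomForestRegressor`, random vectors instead of ESM-2 embeddings, and hypothetical helper names (`rank_corr`, `evaluate_policy`) that are not part of EVOLVEpro.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n_dim = 16

# Assumption: synthetic stand-ins for "rounds 1-4" (train) and "round 5" (held out).
X_early = rng.normal(size=(120, n_dim))
X_late = rng.normal(size=(40, n_dim))


def activity(X):
    """Hypothetical WT-normalized activity with a baseline of 1.0."""
    return 1.0 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.05, size=len(X))


y_early, y_late = activity(X_early), activity(X_late)


def rank_corr(a, b):
    """Spearman-style correlation: Pearson on ranks (ties broken arbitrarily)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


def evaluate_policy(cutoff, top_k=5):
    """Train on early rounds (optionally filtered) and score the held-out round."""
    if cutoff is None:
        mask = np.ones(len(y_early), dtype=bool)  # policy (A): full range
    else:
        mask = y_early > cutoff                   # policy (B): high-only subset
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_early[mask], y_early[mask])
    y_pred = model.predict(X_late)
    top = np.argsort(y_pred)[::-1][:top_k]        # indices of the top-k predictions
    return {
        "cutoff": cutoff,
        "n_train": int(mask.sum()),
        "spearman": rank_corr(y_pred, y_late),
        "top_k_mean_measured": float(y_late[top].mean()),
        "mean_overprediction": float((y_pred - y_late).mean()),
    }


report = [evaluate_policy(c) for c in (None, 1.0, 1.1)]
for row in report:
    print(row)
```

Comparing `top_k_mean_measured` (what the recommended variants actually did) with `mean_overprediction` (how inflated the scores are) separates a real gain in hit rate from a shift in the predicted scale. With only a few dozen variants per round the estimates will be noisy, so repeating this over several random seeds or held-out rounds seems prudent.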

Thanks again for your time and for sharing this tool!
