Is it better to train the top-layer model on only high-activity variants, or on the full activity range (including low/neutral variants)? #64

@lyh951024

Hi @mat10d, thank you for EVOLVEpro; it has been very helpful.

I'm running an experimental campaign (Cas protein, Rounds 1–5, WT-normalized activity with WT = 1.0; ESM-2 15B embeddings + RF top layer). I understand from #31 that high-activity single substitutions are used to seed multi-mutant combinations in later rounds. My question is about the single-mutant training set itself, not the multi-mutant seeds.

Question
When training the RF top-layer model for recommending the next round of single mutants, is it better to:

  • (A) train on all measured variants across rounds (full activity range, including low/neutral/deleterious ones), or
  • (B) train on only the beneficial subset above an activity cutoff (e.g. activity > 1.1 or > 1.2)?

Why I'm asking — what I observed
I ran a cutoff sensitivity analysis, keeping only variants above a threshold for training:

| Training set | # train variants | Top-1 recommendation | Top-1 y_pred |
|---|---|---|---|
| activity > 0.9 | 31 | N423L | 1.610 |
| activity > 1.0 | 24 | D78K | 1.730 |
| activity > 1.1 | 20 | E274F | 1.815 |
| activity > 1.2 | 16 | D78K | 1.740 |
| IQR-outlier removed | 54 | E236R | 1.261 |
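
For reference, the cutoff sweep can be sketched roughly as below. This is a minimal illustration on synthetic stand-in data, not the campaign data and not EVOLVEpro's own API: the function name `cutoff_sensitivity`, the array names, and the RF hyperparameters are all assumptions, and scikit-learn's `RandomForestRegressor` stands in for the top-layer model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def cutoff_sensitivity(X_train, y_train, X_pool, cutoffs, seed=0):
    """Hypothetical helper: for each cutoff, fit an RF on variants with
    activity > cutoff and report the top-ranked candidate in the pool."""
    rows = []
    for c in cutoffs:
        keep = y_train > c  # boolean mask of variants above the cutoff
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        model.fit(X_train[keep], y_train[keep])
        pred = model.predict(X_pool)
        best = int(np.argmax(pred))
        rows.append({
            "cutoff": c,
            "n_train": int(keep.sum()),
            "top1_index": best,
            "top1_pred": float(pred[best]),
            "min_pred": float(pred.min()),
        })
    return rows


# Synthetic stand-in data (assumption): 16-dim "embeddings", WT-like baseline of 1.0
rng = np.random.default_rng(0)
X_train = rng.normal(size=(55, 16))
y_train = 1.0 + 0.3 * X_train[:, 0] + 0.1 * rng.normal(size=55)
X_pool = rng.normal(size=(200, 16))

# float("-inf") means "no cutoff", i.e. train on the full activity range
results = cutoff_sensitivity(X_train, y_train, X_pool,
                             cutoffs=[float("-inf"), 0.9, 1.0, 1.1])
for row in results:
    print(row)
```

Reporting `min_pred` next to `top1_pred` makes it visible when the whole prediction range shifts upward with the cutoff, rather than only the top candidate.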

Two things stood out:

  1. Higher cutoff → fewer training points but a higher top-1 y_pred (peaking at activity > 1.1). I'm unsure whether the higher predicted values reflect a genuinely better-targeted search, or simply an optimistic bias from training on a truncated, high-only label distribution (the RF can no longer "see" what low activity looks like).
  2. Removing the single strongest variant as an outlier completely changed the recommendation direction (toward re-exploring residues near that hotspot) and lowered all predictions.
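
One mechanical reason to suspect the optimistic-bias explanation: a random forest regressor predicts averages of training labels, so its output can never fall below the smallest label it was trained on. The sketch below illustrates this on synthetic data, where activity is driven by a single feature. The data-generating rule, array names, and hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in: 8-dim "embeddings"; activity increases with feature 0
X = rng.normal(size=(120, 8))
y = 1.0 + 0.4 * X[:, 0] + 0.05 * rng.normal(size=120)

cutoff = 1.1
high = y > cutoff  # high-only training subset, as in option (B)

rf_full = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_high = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[high], y[high])

# Probe variants that are clearly deleterious under the synthetic rule
X_bad = rng.normal(size=(50, 8))
X_bad[:, 0] = -2.0

pred_full = rf_full.predict(X_bad)
pred_high = rf_high.predict(X_bad)

print("smallest label seen by high-only model:", y[high].min())
print("full-range model, mean prediction on bad probes:", pred_full.mean())
print("high-only model, mean prediction on bad probes:", pred_high.mean())
```

In this toy setup the high-only model scores every deleterious probe above the cutoff, simply because it has no lower labels to average over. Whether the same effect explains the rising y_pred in the table above is not something the sketch can establish.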

Specific questions

  1. In your experience, does restricting the training set to high-activity variants improve the real hit rate of the next round, or does it mainly inflate predicted scores by removing the negative/low part of the label distribution?
  2. Does the RF top layer need low/neutral/deleterious examples to learn a useful activity gradient, or is it robust to a high-only (left-truncated) training distribution?
  3. Is there a recommended way to choose the activity cutoff, or do you recommend always training on the full measured range and applying activity filtering only afterwards, when selecting variants?
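
One way these questions could be probed empirically is to hold out the most recent round, train options (A) and (B) on the earlier rounds, and compare the measured hit rate among each model's top-ranked held-out variants instead of comparing y_pred. The sketch below uses synthetic data; `top_k_hit_rate`, the split, `k`, and the hit threshold are hypothetical choices, not part of EVOLVEpro.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def top_k_hit_rate(model, X_test, y_test, k=10, hit_threshold=1.0):
    """Fraction of the model's k highest-ranked held-out variants whose
    measured activity exceeds hit_threshold (hypothetical metric)."""
    order = np.argsort(model.predict(X_test))[::-1][:k]
    return float(np.mean(y_test[order] > hit_threshold))


rng = np.random.default_rng(2)
X = rng.normal(size=(150, 8))
y = 1.0 + 0.4 * X[:, 0] + 0.05 * rng.normal(size=150)

X_prev, y_prev = X[:100], y[:100]  # stand-in for earlier rounds
X_next, y_next = X[100:], y[100:]  # stand-in for the held-out latest round

cutoff = 1.1
keep = y_prev > cutoff

# (A) full activity range vs (B) beneficial subset only
model_a = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_prev, y_prev)
model_b = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_prev[keep], y_prev[keep])

hit_a = top_k_hit_rate(model_a, X_next, y_next)
hit_b = top_k_hit_rate(model_b, X_next, y_next)
print(f"(A) full range: top-10 hit rate = {hit_a:.2f}")
print(f"(B) activity > {cutoff}: top-10 hit rate = {hit_b:.2f}")
```

Because the comparison uses measured activities of held-out variants, it is not affected by any upward shift in predicted scores caused by filtering the training labels.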

Thanks again for your time and for sharing this tool!
