Adds `image_features` parameter to predict for pre-computed embeddings by NetZissou · Pull Request #169 · Imageomics/pybioclip

NetZissou · 2026-03-25T13:45:28Z

Disclaimer: This PR was developed with assistance from Claude Opus 4.6 (1M context). The author has reviewed all code changes and test additions. CI has been executed successfully in the forked repo. Opening this PR to request review from the package maintainers for further feedback and iteration.

Summary

This PR adds an optional image_features parameter to predict() on TreeOfLifeClassifier and CustomLabelsClassifier (CustomLabelsBinningClassifier inherits this through CustomLabelsClassifier). When provided, the method skips image encoding and computes classification directly from pre-computed embeddings.

Embedding validation

The method validates input embeddings before classification:

Verifies tensor is 2D (N, embedding_dim)
Checks that embedding_dim matches the model's expected dimension (model.visual.output_dim)
Normalizes the embedding vector via L2 norm only if not already normalized
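
The three checks above can be sketched as follows. This is an illustrative, dependency-free sketch, not the PR's code: plain Python lists stand in for the torch tensors the PR works with, and the function name `validate_image_features`, the `tol` tolerance, the per-row normalization check, and the zero-vector error are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def validate_image_features(
    features: Sequence[Sequence[float]],
    expected_dim: int,
    tol: float = 1e-4,  # hypothetical tolerance for "already normalized"
) -> List[List[float]]:
    """Check shape and dimension, then L2-normalize rows that are not unit length."""
    # 2D check: a non-empty sequence of rows, each row itself a sequence.
    if len(features) == 0 or not all(isinstance(row, (list, tuple)) for row in features):
        raise ValueError("image_features must be 2D: (N, embedding_dim)")

    validated = []
    for row in features:
        # Dimension check against the model's expected embedding size.
        if len(row) != expected_dim:
            raise ValueError(f"expected embedding_dim {expected_dim}, got {len(row)}")
        norm = math.sqrt(sum(x * x for x in row))
        if norm == 0.0:
            # Assumption: a zero vector cannot be normalized, so reject it.
            raise ValueError("cannot normalize a zero vector")
        if abs(norm - 1.0) <= tol:
            # Already unit length: leave untouched to avoid floating point drift.
            validated.append(list(row))
        else:
            validated.append([x / norm for x in row])
    return validated
```

In the real classifiers the expected dimension would come from model.visual.output_dim, as stated above.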

Test plan

New tests in TestPredictFromEmbeddings:

Results from embeddings match results from images exactly (all fields except file_name)
Species, family, and multi-image predictions
Unnormalized features auto-normalized with correct classifications
CustomLabelsClassifier and CustomLabelsBinningClassifier
Error cases: no inputs, wrong tensor dim, wrong embedding dim, image-embedding length mismatch

Closes #167

hlapp

Thanks @NetZissou. The part of the implementation approach that I don't like here is that now creating probabilities is taking place redundantly in two different functions. This also creates more code noise than I think should be needed in the predict() method.

Instead, shouldn't the clean way to handle this in the predict() method be to check whether image_features are already provided? If they are, apply basic checks such as correct dimensions. If they are not, create them (as they are being created now from images). Then proceed with creating probabilities from image_features.
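
The control flow proposed in this comment can be sketched roughly as below. It is a sketch only: `predict_sketch` and its injected `encode`, `check`, and `create_probabilities` callables are hypothetical stand-ins for illustration, not the pybioclip API.

```python
from typing import Callable, List, Optional, Sequence

Features = List[List[float]]


def predict_sketch(
    images: Optional[Sequence[str]] = None,
    image_features: Optional[Features] = None,
    *,
    encode: Callable[[Sequence[str]], Features],
    check: Callable[[Features], None],
    create_probabilities: Callable[[Features], List[List[float]]],
) -> List[List[float]]:
    """Resolve image features once, then follow a single probability path."""
    if image_features is not None:
        # Features were supplied: only run the basic checks.
        check(image_features)
    elif images:
        # No features supplied: create them from the images.
        image_features = encode(images)
    else:
        raise ValueError("provide images or image_features")
    # Shared tail: probabilities are always created from image features.
    return create_probabilities(image_features)
```

The point of this shape is that probability creation appears exactly once, after the branch, rather than in two separate functions.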

hlapp · 2026-04-17T23:08:06Z

@NetZissou just FYI, it might be advisable to rebase on main to bring in the changes from #179. It's well possible your changes so far are not in conflict at all, but some stuff did get moved around.

Allows passing pre-computed image embeddings directly to predict() on TreeOfLifeClassifier, CustomLabelsClassifier, and CustomLabelsBinningClassifier, avoiding redundant image encoding when embeddings are already available.

Validates input: checks tensor is 2D, embedding_dim matches the model's expected dimension (model.visual.output_dim), and normalizes via L2 norm only if not already normalized to avoid floating point drift.

Closes Imageomics#167

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces `create_probabilities_from_features` with two smaller helpers on `BaseClassifier`: `_validate_image_features` and `_resolve_image_features`. Now `TreeOfLifeClassifier.predict` and `CustomLabelsClassifier.predict` share a single input-resolution path instead of duplicating it.

The commit also restores `rank` as required on `TreeOfLifeClassifier.predict` via a runtime TypeError, preserving the pre-PR behavior of the API.

Adds tests for 1) length mismatch on `CustomLabelsClassifier`, 2) images+image_features input, and 3) the rank-omitted error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@egrace479

Per @egrace479's comment, when image_features is passed, images must also be passed; they are used to generate identifiers for the output keys. The length of image_features must match the length of images.

Updated test cases to adopt the new API design. Updated the predict method docstring to reflect the changes and to make it easier to read.

Co-Authored-By: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Co-Authored-By: Hilmar Lapp <51458+hlapp@users.noreply.github.com>

NetZissou · 2026-05-20T16:39:35Z

@egrace479 @hlapp

@NetZissou, @hlapp and I discussed the implementation this morning. It would make the most sense to expect (require) the list of image identifiers (e.g., list of filename strings) that correspond to the embeddings to be passed as well. These can then be used for the image keys. The embeddings would go through the size checks previously discussed and you have the sanity check that the list lengths match. We just want to make sure it's well-documented then.

Re-implemented based on the comment above. Ready for re-review.
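
The agreed design (identifiers passed alongside the embeddings, with a length sanity check, and the identifiers used as output keys) can be sketched as below. The function name `key_features_by_identifier` is hypothetical, and plain lists stand in for tensors; the actual predict() signature and output records may differ.

```python
from typing import Dict, List, Sequence


def key_features_by_identifier(
    image_ids: Sequence[str],
    image_features: Sequence[List[float]],
) -> Dict[str, List[float]]:
    """Pair each identifier with its embedding row after a length check."""
    # Sanity check: one identifier per embedding row.
    if len(image_ids) != len(image_features):
        raise ValueError(
            f"got {len(image_ids)} identifiers for {len(image_features)} embeddings"
        )
    # Note: duplicate identifiers would overwrite each other in this sketch.
    return dict(zip(image_ids, image_features))
```

For example, `key_features_by_identifier(["a.jpg", "b.jpg"], [[1.0, 0.0], [0.0, 1.0]])` maps each filename to its row, while passing a single identifier for two rows raises a ValueError.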

hlapp

Thanks @NetZissou. Some thoughts and comments:

Since this is trying to optimize performance for the case where someone already has the image embeddings, I think we can expect, at least by default, that they are already normalized, rather than re-normalizing them (to test whether they are normalized) each time. So I think that should be conditional on a parameter that is off by default?
The _resolve_image_features() function doesn't really resolve image features, given that it returns probabilities, not resolved image features. So it should probably be named something like _create_probabilities_for_imgs_or_img_feats() (or maybe even just _create_probabilities()?). And given the number of parameters and None being allowed in several places, all invocations should arguably use named parameter syntax.
In this implementation (as per _resolve_image_features()), there's actually really nothing directly in common between creating probabilities from image features and creating them from images: there is no code before or after the conditional that would run either way. I guess the shared part is indirect: creating probabilities from images invokes create_batched_probabilities_for_images(), whereas creating them from image features invokes create_probabilities(). Of course, indirectly the former invokes create_probabilities_for_images() which then invokes create_probabilities(), too. It seems what this really means is the following:
- If I have images and no image features, encode the images to obtain image features. This should be done batched for performance optimization.
- Once I have image features, create probabilities. This need not be batched.
In your current implementation, creating probabilities from image features will be batched if starting from images, and the batching will be bypassed if image embeddings are already in hand. Is this reasonable?
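
The two-stage split described in this review (batch only the encoding, then create probabilities in one pass) can be sketched as below. This is a sketch under stated assumptions: the helper names are hypothetical, plain lists stand in for tensors, and the scoring assumes CLIP-style probabilities (scaled dot product against label embeddings followed by a softmax), which may not match pybioclip's create_probabilities() exactly.

```python
import math
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    images: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 2,
) -> List[Vector]:
    """Stage 1: encoding is the expensive step, so only this part is batched."""
    features: List[Vector] = []
    for chunk in batched(images, batch_size):
        features.extend(encode_batch(chunk))
    return features


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probabilities_from_features(
    features: Sequence[Vector],
    label_features: Sequence[Vector],
    logit_scale: float = 1.0,  # assumption: CLIP-style scaling factor
) -> List[Vector]:
    """Stage 2: one unbatched pass, identical for encoded and pre-computed features."""
    probs = []
    for feat in features:
        logits = [
            logit_scale * sum(a * b for a, b in zip(feat, label))
            for label in label_features
        ]
        probs.append(softmax(logits))
    return probs
```

With this split, pre-computed embeddings would skip stage 1 entirely and enter at `probabilities_from_features`, so both inputs go through the same probability code.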

hlapp requested changes Apr 8, 2026

NetZissou and others added 2 commits April 28, 2026 12:09

NetZissou force-pushed the feature/predict-from-embeddings branch from 508064a to a36724c on May 4, 2026 18:49

NetZissou commented May 4, 2026

Comment thread src/bioclip/predict.py Outdated

NetZissou commented May 5, 2026

Comment thread src/bioclip/predict.py Outdated

NetZissou requested review from egrace479 and hlapp May 8, 2026 20:15

NetZissou force-pushed the feature/predict-from-embeddings branch from 1146fe5 to 1ae72a4 on May 8, 2026 20:19

hlapp requested changes May 21, 2026

NetZissou wants to merge 3 commits into Imageomics:main from NetZissou:feature/predict-from-embeddings

3 participants
