# Autoeval Module

The biotrainer `autoeval` module allows you to automatically evaluate an embedder model on downstream prediction tasks.
This gives a better impression of the model's performance and of whether it actually creates useful embeddings.
## Downstream Tasks

All relevant files for the downstream tasks are downloaded before the evaluation to
`/{user_cache_dir}/biotrainer/autoeval`.
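If you want to inspect or clean up the downloaded files, the cache directory can be resolved as in the sketch below. This assumes the `{user_cache_dir}` placeholder follows the `platformdirs` convention; the exact resolution used by biotrainer is an assumption here.

```python
from pathlib import Path

from platformdirs import user_cache_dir

# Assumption: {user_cache_dir} is resolved via platformdirs conventions.
autoeval_cache = Path(user_cache_dir()) / "biotrainer" / "autoeval"
print(autoeval_cache)  # e.g. ~/.cache/biotrainer/autoeval on Linux
```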
### FLIP

The `autoeval` module provides a curated subset of datasets in FLIP.
The tasks have been chosen to be as hard as possible for a prediction model.
We provide a `flip_conversion_script.py` in the module that shows how the datasets have been preprocessed and curated.

A quick description of each task follows; you can find more information in the split descriptions:
- `aav`: Predict adeno-associated virus fitness.
    - `low_vs_high`: Training on low fitness values, testing on high fitness values
    - `two_vs_many`: Training on two mutations max, testing on many mutations
- `bind`: Predict binding sites of ligands for protein residues.
    - `from_publication`: Dataset split used in the original publication
- `gb1`: Predict a binding score for each variant of the GB1 protein.
    - `low_vs_high`: Training on low fitness values, testing on high fitness values
    - `two_vs_many`: Training on two mutations max, testing on others
- `meltome`: Predict the melting temperature of proteins.
    - `mixed_split`: Mixed human and non-human proteins
- `scl`: Predict protein subcellular localization.
    - `mixed_hard`: Mixed human and non-human proteins
- `secondary_structure`: Predict the secondary structure of proteins.
    - `sampled`: Only available FLIP split
## How to use

### CLI

The `autoeval` pipeline can be run directly through the CLI:

```bash
biotrainer autoeval --embedder-name embedder_name --framework framework_name [--min-seq-length min_length] [--max-seq-length max_length] [--use-half-precision]
```
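A concrete invocation might look like the following. The embedder name is illustrative, not a requirement; substitute a model available to your installation:

```bash
# Illustrative invocation; the embedder name is an example.
biotrainer autoeval --embedder-name Rostlab/prot_t5_xl_uniref50 \
    --framework flip \
    --max-seq-length 2000 \
    --use-half-precision
```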
### Script

You can also integrate `autoeval` into your scripts or training pipelines:
```python
from typing import Iterator

import numpy as np
from tqdm import tqdm

from biotrainer.autoeval import autoeval_pipeline
from biotrainer.utilities import seed_all


class ExampleRandomEmbedder:
    def __init__(self, embedding_dim: int = 21):
        self.embedding_dim = embedding_dim
        # Pre-generate random state for faster random number generation
        self.rng = np.random.default_rng()

    def embed_per_residue(self, sequences: Iterator[str]):
        for sequence in tqdm(sequences):
            # Generate all embeddings at once using rng
            embedding = self.rng.random((len(sequence), self.embedding_dim), dtype=np.float32)
            yield sequence, embedding

    def embed_per_sequence(self, sequences: Iterator[str]):
        for sequence in tqdm(sequences):
            # Generate single embedding using rng
            embedding = self.rng.random(self.embedding_dim, dtype=np.float32)
            yield sequence, embedding


seed_all(42)

embedder = ExampleRandomEmbedder()
for progress in autoeval_pipeline(embedder_name="your_embedder",
                                  framework="flip",
                                  custom_embedding_function_per_residue=lambda seqs: embedder.embed_per_residue(seqs),
                                  custom_embedding_function_per_sequence=lambda seqs: embedder.embed_per_sequence(seqs),
                                  ):
    print(progress)
```
Some deeper explanations:

- The `autoeval_pipeline` function is a generator function that yields `AutoEvalProgress` objects to track progress. Therefore, you must "do something" with the pipeline return values in order to execute the pipeline. Simply calling `autoeval_pipeline(...)` will not run the pipeline (see Python generators and the sketch after this list).
- The custom embedding functions take an iterable of sequences as strings and must yield each sequence together with its embedding. This ensures that the correct embedding is assigned to the respective sequence.
- All embeddings are calculated at once at the beginning of the training to avoid duplicated embedding computations.
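To illustrate the lazy-execution point, here is a minimal, self-contained sketch (using a hypothetical stub, not the real pipeline) showing that a generator's body only runs while it is being iterated:

```python
from typing import Iterator


def pipeline_stub() -> Iterator[str]:
    # The body of a generator only executes while it is iterated.
    print("running task 1")
    yield "task 1 finished"
    print("running task 2")
    yield "task 2 finished"


gen = pipeline_stub()  # nothing has run yet
for progress in gen:   # iteration drives the pipeline step by step
    print(progress)
```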
## Report

After all tasks have finished successfully, a report is created in the output directory. All metadata and model results are tracked there.

Example:
```json
{
  "embedder_name": "your_embedder_name", // Embedder name
  "training_date": "2025-07-02", // Training date
  "min_seq_len": 0, // Minimum sequence length
  "max_seq_len": 2000, // Maximum sequence length
  "results": {
    "aav-two_vs_many": {
      "config": {
      },
      "database_type": "Protein",
      "derived_values": {
      },
      "training_results": {
      },
      "test_results": {
        "test": {
          "metrics": {
            "loss": 12.494811077213766,
            "mse": 12.501708030700684,
            "rmse": 3.5357754230499268,
            "spearmans-corr-coeff": 0.0014577994588762522
          },
          "bootstrapping": {
            "results": {
              "mse": {
                "mean": 12.5078125,
                "error": 0.099365234375
              },
              "rmse": {
                "mean": 3.537109375,
                "error": 0.01436614990234375
              },
              "spearmans-corr-coeff": {
                "mean": 0.0007982254028320312,
                "error": 0.00795745849609375
              }
            },
            "iterations": 30,
            "sample_size": 50767,
            "confidence_level": 0.05
          },
          "test_baselines": {
          },
          "predictions": {}
        }
      }
    },
    // OTHER TASKS
  }
}
```
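Because the report is a regular JSON file, you can also post-process it programmatically. A minimal sketch, assuming a report file named `output/autoeval_report.json` (the path and filename are assumptions; adjust to your output directory) and that the written file contains no comments:

```python
import json
from pathlib import Path

# Hypothetical path; adjust to the report file in your output directory.
report = json.loads(Path("output/autoeval_report.json").read_text())

print(f"Embedder: {report['embedder_name']}")
for task_name, task_result in report["results"].items():
    boot = task_result["test_results"]["test"]["bootstrapping"]["results"]
    if "spearmans-corr-coeff" in boot:
        corr = boot["spearmans-corr-coeff"]
        print(f"{task_name}: spearman {corr['mean']:.4f} ± {corr['error']:.4f}")
```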
## Visualization and Leaderboard

You can visualize your results and compare them against other embedder models using biocentral. Simply load the report file in the pLM Evaluation module.