Hello, nice work!
I am reading the paper (https://arxiv.org/html/2601.11262v1) and trying to benchmark the model.
Could you confirm whether all reported retrieval metrics were computed using the Ranker class in src/livi/apps/retrieval_eval/ranker.py, specifically the logic at line 77?
I am asking because some of the models you compare against (e.g., CLEWS) do not L2-normalize their embeddings by design.
Computing cosine similarity in their original embedding space would distort the learned geometry and would likely understate their performance in the reported metrics.
Were these embeddings normalized before ranking, or was a different similarity function used for them?
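To make the concern concrete, here is a minimal sketch (toy data, not the repo's Ranker class) of how ranking the same unnormalized embeddings by raw inner product versus cosine similarity can produce different orderings:

```python
# Toy illustration: for unnormalized embeddings, dot-product ranking
# and cosine-similarity ranking are not equivalent, since cosine
# implicitly L2-normalizes both sides first.
import numpy as np

rng = np.random.default_rng(0)

# One query and a few candidate embeddings with varying norms,
# standing in for a model that does not L2-normalize by design.
query = rng.normal(size=8)
candidates = rng.normal(size=(5, 8)) * rng.uniform(0.5, 3.0, size=(5, 1))

# Raw inner-product scores (the space such a model was trained in).
dot_scores = candidates @ query

# Cosine similarity: same as normalizing both sides before the dot product.
cos_scores = dot_scores / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(query))

print("ranking by dot product:", np.argsort(-dot_scores))
print("ranking by cosine:     ", np.argsort(-cos_scores))
```

If the evaluation applied cosine similarity uniformly, the comparison would effectively rescale those baselines' embeddings, which is what I would like to rule out.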