[Base] Merge with upstream - fix collect results & new tasks by islobozhan · Pull Request #19 · elliot-project/elliot-cli

islobozhan · 2026-05-31T06:57:16Z

Merge with upstream - fix collect results & new tasks

* feat: add global-piqa-eu task groups with 32 European languages (completions + prompted, 64 tasks)
* fix: reduce max_gen_toks from 2048 to 256 in prompted template (model max_seq_len=2048)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
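The arithmetic behind the max_gen_toks fix can be sketched as follows. This is an illustration only: it assumes the evaluation harness reserves `max_gen_toks` tokens out of the model's context window and truncates the prompt to whatever remains. The helper name `prompt_budget` is hypothetical and not part of this PR.

```python
def prompt_budget(max_seq_len: int, max_gen_toks: int) -> int:
    """Tokens left for the prompt after reserving room for generation.

    Assumption: the harness reserves max_gen_toks inside the context window,
    so the prompt may use at most max_seq_len - max_gen_toks tokens.
    """
    return max(max_seq_len - max_gen_toks, 0)


# Old setting: generation reserves the whole window, leaving nothing for the prompt.
old_budget = prompt_budget(max_seq_len=2048, max_gen_toks=2048)

# New setting: most of the window remains available for the prompt.
new_budget = prompt_budget(max_seq_len=2048, max_gen_toks=256)
```

Under that assumption, the old value leaves a prompt budget of 0 tokens and the new value leaves 1792.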

…M#69)

- Add sib200-eu task group (lm-eval-harness, 0-shot) covering: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Catalan, Basque, Galician, Bosnian, Georgian, Macedonian, Albanian, Serbian, Turkish, Ukrainian, Icelandic, Norwegian
- Bundle sib200 task YAML definitions in custom_lm_eval_tasks/sib200/ (lm-eval 0.4.11 does not ship sib200 tasks; loading via --include_path)
- Register acc_norm metric for all sib200 tasks in task_metrics
- Drop 'group' field from _default_template_yaml (unsupported in 0.4.11)

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>

* Add arc-challenge-mt-eu task group (22 European languages, lm-eval-harness)
* Remove English (en) from arc-challenge-mt-eu: not in dataset
* Bundle arc_challenge_mt task YAMLs in custom_lm_eval_tasks
* Add Icelandic (is) to arc-challenge-mt-eu (mideind/icelandic-arc-challenge)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>

* collect-results: recursive multi-dir merge with duplicate override + tests
  - Recursively search results_dir for all jobs.csv files (rglob) instead of requiring a single top-level jobs.csv
  - Merge all found jobs.csv files into one DataFrame; later-sorted paths win for duplicate (model_path, task_path, n_shot) rows
  - Recursively search for all .json result files (rglob) instead of only looking one level deep or in a hardcoded 'results/' subdirectory
  - --check: compare merged results against merged jobs, write _missing.csv; if no jobs.csv found anywhere, check mode is silently disabled
  - Without --check: simply merge and write output_csv
  - Update README.md Collecting Results section to document new behaviour
  - Add tests/test_collect_results.py with 14 tests covering merge, duplicate override, check mode, and edge cases
* collect-results: fix lighteval result parsing (n_shot=0 + skip 'all' key)
* style: ruff format test_collect_results.py
* style: ruff format main.py
* test: skip test_datasets when HF_TOKEN not set (avoids rate-limiting)
* ci: pass HF_TOKEN to test step
* revert: remove conftest.py and HF_TOKEN ci.yml changes
* revert: remove test_collect_results.py
* collect-results: deduplicate result rows by (model_name, task, n_shot, metric_name)
* collect-results: add chrf++ and bleu to fallback metric resolution

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
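The recursive merge with duplicate override described above could look roughly like the sketch below. It is not the code from this PR: the commit message says the real implementation merges into a DataFrame, whereas this sketch uses only the standard library (`csv`, `pathlib`) to stay dependency-free. The function names `merge_jobs` and `missing_jobs`, the `source` column, and the demo directory layout are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Key columns named in the commit message for duplicate detection.
KEY_COLUMNS = ("model_path", "task_path", "n_shot")


def merge_jobs(results_dir):
    """Merge every jobs.csv under results_dir; later-sorted paths win on duplicates."""
    merged = {}
    for csv_path in sorted(Path(results_dir).rglob("jobs.csv")):
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                key = tuple(row[col] for col in KEY_COLUMNS)
                merged[key] = row  # a later file replaces an earlier duplicate
    return list(merged.values())


def missing_jobs(jobs, completed_keys):
    """Return scheduled jobs with no matching result (the idea behind --check)."""
    return [
        row for row in jobs
        if tuple(row[col] for col in KEY_COLUMNS) not in completed_keys
    ]


def _write_jobs(path, rows):
    # Helper for the demo only: writes a small jobs.csv with a hypothetical 'source' column.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=[*KEY_COLUMNS, "source"])
        writer.writeheader()
        writer.writerows(rows)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        _write_jobs(root / "run_a" / "jobs.csv", [
            {"model_path": "m1", "task_path": "t1", "n_shot": "0", "source": "run_a"},
            {"model_path": "m1", "task_path": "t2", "n_shot": "0", "source": "run_a"},
        ])
        _write_jobs(root / "run_b" / "nested" / "jobs.csv", [
            {"model_path": "m1", "task_path": "t1", "n_shot": "0", "source": "run_b"},
        ])
        jobs = merge_jobs(root)
        # Pretend only (m1, t1, 0) produced a result file.
        missing = missing_jobs(jobs, {("m1", "t1", "0")})
        return jobs, missing
```

In the demo, the row for (m1, t1, 0) from `run_b` replaces the one from `run_a` because `run_b` sorts later, and (m1, t2, 0) is reported as missing. If no jobs.csv is found, `merge_jobs` returns an empty list, which a caller could use to disable check mode.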

* rename: oellm-cli → oellm-eval (package name + binary)
* rename: schedule-eval → schedule, collect-results → collect
* keep oellm-output path (revert oellm-eval-output rename)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>

haideraltahan and others added 7 commits May 22, 2026 09:42

fix conflicts

ddb8645

fixes

5a898fb

islobozhan wants to merge 7 commits into main from merge-with-upstream-5

2 participants
