[Base] Merge with upstream - fix collect results & new tasks #19

Draft
islobozhan wants to merge 7 commits into main from merge-with-upstream-5

@islobozhan
Collaborator

Merge with upstream - fix collect results & new tasks

haideraltahan and others added 7 commits May 22, 2026 09:42
* feat: add global-piqa-eu task groups with 32 European languages (completions + prompted, 64 tasks)

* fix: reduce max_gen_toks from 2048 to 256 in prompted template (model max_seq_len=2048)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
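
The `max_gen_toks` fix above comes down to simple arithmetic. The sketch below assumes the prompt and the generated tokens share one context window, as is typical for decoder-only models. The PR does not say how the harness truncates prompts, and `prompt_token_budget` is a hypothetical helper, not part of the codebase.

```python
def prompt_token_budget(max_seq_len: int, max_gen_toks: int) -> int:
    """Tokens left for the prompt after reserving room for generation.

    Assumption: prompt and generation share a single window of max_seq_len tokens.
    """
    return max(max_seq_len - max_gen_toks, 0)


# Old template value: the generation budget consumes the whole window.
print(prompt_token_budget(2048, 2048))  # 0

# New template value: most of the window stays available for the prompt.
print(prompt_token_budget(2048, 256))  # 1792
```

Under that assumption, the old setting left no room for the prompt with `max_seq_len=2048`, while the new one leaves 1792 tokens.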
…M#69)

- Add sib200-eu task group (lm-eval-harness, 0-shot) covering:
  Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian,
  Finnish, French, German, Greek, Hungarian, Irish, Italian,
  Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian,
  Slovak, Slovenian, Spanish, Swedish, Catalan, Basque, Galician,
  Bosnian, Georgian, Macedonian, Albanian, Serbian, Turkish,
  Ukrainian, Icelandic, Norwegian
- Bundle sib200 task YAML definitions in custom_lm_eval_tasks/sib200/
  (lm-eval 0.4.11 does not ship sib200 tasks; loading via --include_path)
- Register acc_norm metric for all sib200 tasks in task_metrics
- Drop 'group' field from _default_template_yaml (unsupported in 0.4.11)

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
* Add arc-challenge-mt-eu task group (22 European languages, lm-eval-harness)

* Remove English (en) from arc-challenge-mt-eu: not in dataset

* Bundle arc_challenge_mt task YAMLs in custom_lm_eval_tasks

* Add Icelandic (is) to arc-challenge-mt-eu (mideind/icelandic-arc-challenge)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
* collect-results: recursive multi-dir merge with duplicate override + tests

- Recursively search results_dir for all jobs.csv files (rglob) instead
  of requiring a single top-level jobs.csv
- Merge all found jobs.csv files into one DataFrame; later-sorted paths
  win for duplicate (model_path, task_path, n_shot) rows
- Recursively search for all .json result files (rglob) instead of only
  looking one level deep or in a hardcoded 'results/' subdirectory
- --check: compare merged results against merged jobs, write _missing.csv;
  if no jobs.csv found anywhere, check mode is silently disabled
- Without --check: simply merge and write output_csv
- Update README.md Collecting Results section to document new behaviour
- Add tests/test_collect_results.py with 14 tests covering merge,
  duplicate override, check mode, and edge cases

* collect-results: fix lighteval result parsing (n_shot=0 + skip 'all' key)

* style: ruff format test_collect_results.py

* style: ruff format main.py

* test: skip test_datasets when HF_TOKEN not set (avoids rate-limiting)

* ci: pass HF_TOKEN to test step

* revert: remove conftest.py and HF_TOKEN ci.yml changes

* revert: remove test_collect_results.py

* collect-results: deduplicate result rows by (model_name, task, n_shot, metric_name)

* collect-results: add chrf++ and bleu to fallback metric resolution

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
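
The recursive merge with duplicate override described in the collect-results commit above can be sketched as follows. This is a minimal illustration, not the code in `main.py`. It uses the standard `csv` module instead of a DataFrame to stay dependency-free, and the function name `merge_jobs` is hypothetical. The key columns `(model_path, task_path, n_shot)` are taken from the commit message.

```python
import csv
from pathlib import Path

# Key columns named in the commit message for duplicate detection.
KEY_COLUMNS = ("model_path", "task_path", "n_shot")


def merge_jobs(results_dir):
    """Merge every jobs.csv found under results_dir (searched recursively).

    Files are visited in sorted path order, so a row from a later-sorted
    path replaces an earlier row with the same key.
    """
    merged = {}
    for path in sorted(Path(results_dir).rglob("jobs.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                key = tuple(row[column] for column in KEY_COLUMNS)
                merged[key] = row  # later paths overwrite earlier ones
    return list(merged.values())
```

An empty return value corresponds to the case where no `jobs.csv` exists anywhere under the directory, which the commit message says disables check mode. The same keyed-overwrite pattern would apply to result rows with the key `(model_name, task, n_shot, metric_name)`.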
* rename: oellm-cli → oellm-eval (package name + binary)

* rename: schedule-eval → schedule, collect-results → collect

* keep oellm-output path (revert oellm-eval-output rename)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
2 participants