[Base] Merge with upstream - fix collect results & new tasks #19
Draft
islobozhan wants to merge 7 commits into
* feat: add global-piqa-eu task groups with 32 European languages (completions + prompted, 64 tasks)
* fix: reduce max_gen_toks from 2048 to 256 in prompted template (model max_seq_len=2048)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
…M#69)

- Add sib200-eu task group (lm-eval-harness, 0-shot) covering: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Catalan, Basque, Galician, Bosnian, Georgian, Macedonian, Albanian, Serbian, Turkish, Ukrainian, Icelandic, Norwegian
- Bundle sib200 task YAML definitions in custom_lm_eval_tasks/sib200/ (lm-eval 0.4.11 does not ship sib200 tasks; loading via --include_path)
- Register acc_norm metric for all sib200 tasks in task_metrics
- Drop 'group' field from _default_template_yaml (unsupported in 0.4.11)

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
* Add arc-challenge-mt-eu task group (22 European languages, lm-eval-harness)
* Remove English (en) from arc-challenge-mt-eu: not in dataset
* Bundle arc_challenge_mt task YAMLs in custom_lm_eval_tasks
* Add Icelandic (is) to arc-challenge-mt-eu (mideind/icelandic-arc-challenge)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
* collect-results: recursive multi-dir merge with duplicate override + tests
  - Recursively search results_dir for all jobs.csv files (rglob) instead of requiring a single top-level jobs.csv
  - Merge all found jobs.csv files into one DataFrame; later-sorted paths win for duplicate (model_path, task_path, n_shot) rows
  - Recursively search for all .json result files (rglob) instead of only looking one level deep or in a hardcoded 'results/' subdirectory
  - --check: compare merged results against merged jobs, write _missing.csv; if no jobs.csv found anywhere, check mode is silently disabled
  - Without --check: simply merge and write output_csv
  - Update README.md Collecting Results section to document new behaviour
  - Add tests/test_collect_results.py with 14 tests covering merge, duplicate override, check mode, and edge cases
* collect-results: fix lighteval result parsing (n_shot=0 + skip 'all' key)
* style: ruff format test_collect_results.py
* style: ruff format main.py
* test: skip test_datasets when HF_TOKEN not set (avoids rate-limiting)
* ci: pass HF_TOKEN to test step
* revert: remove conftest.py and HF_TOKEN ci.yml changes
* revert: remove test_collect_results.py
* collect-results: deduplicate result rows by (model_name, task, n_shot, metric_name)
* collect-results: add chrf++ and bleu to fallback metric resolution

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>
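
For reviewers, the "later-sorted paths win" rule from the commit message above could look roughly like the sketch below. It is not the code in this PR: the PR merges into a DataFrame, while this sketch uses only the standard library, and the names `merge_jobs` and `demo`, the `note` column, and the `run_a`/`run_b` directories are hypothetical. Only the `jobs.csv` filename, the use of `rglob`, and the key columns come from the commit message.

```python
import csv
import tempfile
from pathlib import Path

# Duplicate key named in the commit message.
KEY_COLUMNS = ("model_path", "task_path", "n_shot")


def merge_jobs(results_dir):
    """Merge every jobs.csv found under results_dir (hypothetical helper).

    Files are visited in sorted path order, so a row from a later-sorted
    path replaces an earlier row with the same key.
    """
    merged = {}
    for csv_path in sorted(Path(results_dir).rglob("jobs.csv")):
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                key = tuple(row[column] for column in KEY_COLUMNS)
                merged[key] = row  # later file overrides the earlier duplicate
    return list(merged.values())


def demo():
    """Build two run directories containing the same job and merge them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for subdir, note in (("run_a", "old"), ("run_b", "new")):
            (root / subdir).mkdir()
            with (root / subdir / "jobs.csv").open("w", newline="") as handle:
                writer = csv.writer(handle)
                # "note" is an illustrative extra column, not part of the PR.
                writer.writerow(["model_path", "task_path", "n_shot", "note"])
                writer.writerow(["model-x", "task-y", "0", note])
        return merge_jobs(root)


if __name__ == "__main__":
    print(demo())
```

Keying a dict on the tuple keeps one row per job regardless of how many nested directories contain it; with a DataFrame, the comparable operation would be concatenating in sorted path order and dropping duplicates while keeping the last occurrence.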
* rename: oellm-cli → oellm-eval (package name + binary)
* rename: schedule-eval → schedule, collect-results → collect
* keep oellm-output path (revert oellm-eval-output rename)

---------

Co-authored-by: Haider Al-Tahan <haltahan@login02.leonardo.local>