Sync 1014 by joaner · Pull Request #1 · ioai-tech/lerobot

joaner · 2025-10-14T10:41:11Z

What this does

Explain what this PR does. Feel free to tag your PR with the appropriate label(s).

Examples:

Title	Label
Fixes #[issue]	(🐛 Bug)
Adds new dataset	(🗃️ Dataset)
Optimizes something	(⚡️ Performance)

How it was tested

Explain/show how you tested your changes.

Examples:

Added test_something in tests/test_stuff.py.
Added new_feature and checked that training converges with policy X on dataset/environment Y.
Optimized some_function, it now runs X times faster than previously.

How to checkout & try? (for the reviewer)

Provide a simple way for the reviewer to try out your changes.

Examples:

pytest -sx tests/test_stuff.py::test_something

lerobot-train --some.option=true

SECTION TO REMOVE BEFORE SUBMITTING YOUR PR

Note: Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.

Note: Before submitting this PR, please read the contributor guideline.

* feat(dataset-tools): add dataset utilities and example script
  - Introduced dataset tools for LeRobotDataset, including functions for deleting episodes, splitting datasets, adding/removing features, and merging datasets.
  - Added an example script demonstrating the usage of these utilities.
  - Implemented comprehensive tests for all new functionalities to ensure reliability and correctness.
* style fixes
* move example to dataset dir
* missing license
* fixes mostly path
* clean comments
* move tests to functions instead of class based
* video editing and file copying
  - fix video editing: decode, delete frames and re-encode video
  - copy unchanged video and parquet files to avoid recreating the entire dataset
* Fortify tooling tests
* Fix type issue resulting from saving numpy arrays with shape 3,1,1
* added lerobot_edit_dataset
* examples and split names
  - revert changes in examples
  - remove hardcoded split names
* update comment
* fix comment add lerobot-edit-dataset shortcut
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  Signed-off-by: Michel Aractingi <michel.aractingi@huggingface.co>
* style nit after copilot review
* fix: bug in dataset root when editing the dataset in place (without setting new_repo_id)
* Fix bug in aggregate.py when accumulating video timestamps; add tests to fortify aggregate videos
* Added missing output repo id
* migrate delete episode to using pyav instead of decoding, writing frames to disk and encoding again.
  Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
* added modified suffix in case repo_id is not set in delete_episode
* adding docs for dataset tools
* bump av version and add back time_base assignment
* linter
* modified push_to_hub logic in lerobot_edit_dataset
* fix(progress bar): fixing the progress bar issue in dataset tools
* chore(concatenate): removing no longer needed concatenate_datasets usage
* fix(file sizes forwarding): forwarding files and chunk sizes in metadata info when splitting and aggregating datasets
* style fix
* refactor(aggregate): Fix video indexing and timestamp bugs in dataset merging
  There were three critical bugs in aggregate.py that prevented correct dataset merging:
  1. Video file indices: Changed from += to = assignment to correctly reference merged video files
  2. Video timestamps: Implemented per-source-file offset tracking to maintain continuous timestamps when merging split datasets (was causing non-monotonic timestamp warnings)
  3. File rotation offsets: Store timestamp offsets after rotation decision to prevent out-of-bounds frame access (was causing "Invalid frame index" errors with small file size limits)
  Changes:
  - Updated update_meta_data() to apply per-source-file timestamp offsets
  - Updated aggregate_videos() to track offsets correctly during file rotation
  - Added get_video_duration_in_s import for duration calculation
* Improved docs for split dataset and added a check for the possible case that the split size results in zero episodes
* chore(docs): update merge documentation details
  Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>
---------
Co-authored-by: CarolinePascal <caroline8.pascal@gmail.com>
Co-authored-by: Jack Vial <vialjack@gmail.com>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
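The timestamp and rotation fixes described in the refactor(aggregate) entry can be pictured with a small sketch. This is not the code from aggregate.py: `Episode`, `merge_episode_timestamps`, and the `durations` input are hypothetical names, and rotation by file size is simplified to a limit on total duration. The point it illustrates is that each source file gets its own offset, recorded only after the decision to rotate to a new destination file.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Episode timestamps (seconds) relative to the start of its source video file."""
    source_file: str
    from_ts: float
    to_ts: float


def merge_episode_timestamps(episodes, durations, max_duration_s):
    """Re-map episode timestamps onto concatenated destination files.

    episodes: list of Episode (hypothetical structure).
    durations: dict mapping source file -> duration in seconds, in merge order.
    max_duration_s: start a new destination file when appending a source
        would exceed this length (a stand-in for a file-size limit).
    Returns a list of (dest_file_index, from_ts, to_ts) tuples.
    """
    dest_index = 0
    dest_duration = 0.0
    offsets = {}  # source file -> (destination index, timestamp offset)

    for source_file, duration in durations.items():
        # Decide on rotation first, then store the offset, so the offset
        # always refers to the file the source is actually appended to.
        if dest_duration > 0 and dest_duration + duration > max_duration_s:
            dest_index += 1
            dest_duration = 0.0
        offsets[source_file] = (dest_index, dest_duration)
        dest_duration += duration

    merged = []
    for ep in episodes:
        index, offset = offsets[ep.source_file]
        merged.append((index, ep.from_ts + offset, ep.to_ts + offset))
    return merged


durations = {"a.mp4": 10.0, "b.mp4": 5.0, "c.mp4": 8.0}
episodes = [
    Episode("a.mp4", 0.0, 4.0),
    Episode("a.mp4", 4.0, 10.0),
    Episode("b.mp4", 0.0, 5.0),
    Episode("c.mp4", 0.0, 8.0),
]
merged = merge_episode_timestamps(episodes, durations, max_duration_s=16.0)
```

In this toy run the first two sources share destination file 0 with continuous timestamps, and the third source starts destination file 1 at offset 0.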

* incremental parquet writing
* add .finalise() and a backup __del__ for stopping writers
* fix missing import
* precommit fixes added back the use of embed images
* added lazy loading for hf_Dataset to avoid frequently reloading the dataset during recording
* fix bug in video timestamps
* Added proper closing of parquet file before reading
* Added rigorous testing to validate the consistency of the meta data after creation of a new dataset
* fix bug in episode index during clear_episode_buffer
* fix(empty concat): check for empty paths list before data files concatenation
* fix(v3.0 message): updating v3.0 backward compatibility message.
* added fixes for the resume logic
* answering co-pilot review
* reverting some changes and style nits
* removed unused functions
* fix chunk_id and file_id when resuming
* parquet loading on resume
  - fix parquet loading when resuming
  - add test to verify the parquet file integrity when resuming so that data files are now overwritten
* added general function get_file_size_in_mb and removed the one for video
* fix table size value when resuming
* Remove unnecessary reloading of the parquet file when resuming record. Write to a new parquet file when resuming record
* added back reading parquet file for image datasets only
* review follow-ups
  - respond to Qlhoest comments
  - Use pyarrows `from_pydict` function
  - Add buffer for episode metadata to write to the parquet file in batches to improve efficiency
  - Remove the use of `to_parquet_with_hf_images`
* fix(dataset_tools) with the new logic using proper finalize; bug in finding the latest path of the metadata that was pointing to the data files; added check for the metadata size in the case the metadata buffer was not written yet
* nit in flush_metadata_buffer
* fix(lerobot_dataset) return the right dataset len when a subset of the dataset is requested
---------
Co-authored-by: Harsimrat Sandhawalia <hs.sandhawalia@gmail.com>
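The pattern named in these entries (buffer records, write them in batches, require an explicit finalize step, keep `__del__` only as a backup) can be sketched without any dependency. The example below is an illustration only: it writes JSON Lines with the standard library in place of Parquet, and `BufferedRecordWriter` and its methods are hypothetical names, not the LeRobotDataset API.

```python
import json
import os
import tempfile


class BufferedRecordWriter:
    """Appends records to one file in batches; call finalize() when done."""

    def __init__(self, path, batch_size=2):
        self.path = path
        self.batch_size = batch_size
        self._buffer = []
        self._file = open(path, "a", encoding="utf-8")

    def add(self, record):
        # Records stay in memory until a full batch is available.
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        for record in self._buffer:
            self._file.write(json.dumps(record) + "\n")
        self._file.flush()
        self._buffer.clear()

    def finalize(self):
        """Write any buffered records and close the file. Safe to call twice."""
        if self._file is None:
            return
        self.flush()
        self._file.close()
        self._file = None

    def __del__(self):
        # Backup only: an explicit finalize() is the reliable path.
        try:
            self.finalize()
        except Exception:
            pass


path = os.path.join(tempfile.mkdtemp(), "episodes.jsonl")
writer = BufferedRecordWriter(path, batch_size=2)
for i in range(3):
    writer.add({"episode_index": i})
writer.finalize()  # without this, the third record would still sit in the buffer

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```

The third record is only written by `finalize()`, which is why a missing finalize call in a recording script can leave the last batch unwritten.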

* fix outdated example
  Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com>
* Update docs/source/il_robots.mdx
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com>
---------
Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

imstevenpmwork and others added 12 commits October 8, 2025 14:27

chore(docs): add missing license headers (#2140)

6c28ef8

refactor(datasets): add compress_level parameter to write_image() and…

9a49e57

… set it to 1 (#2135)
* refactor(datasets): add compress_level parameter to write_image() and set it to 1
* docs(dataset): add docs to write_image()
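The commit message does not say which image format write_image() produces or why level 1 was chosen. Assuming the parameter behaves like the zlib/Deflate level used by PNG encoders (an assumption, not confirmed here), the trade-off can be explored with the standard library alone. The sketch below compresses synthetic bytes at several levels and checks that every level is lossless; sizes and timings depend on the data and the machine, so no expected output is given.

```python
import time
import zlib

# Synthetic stand-in for raw image bytes (a 640x480 single-channel gradient).
data = bytes((x + y) % 256 for y in range(480) for x in range(640))

for level in (1, 6, 9):
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Deflate is lossless at every level, so the round trip must match.
    assert zlib.decompress(packed) == data
    print(f"level={level} size={len(packed)} bytes time={elapsed_ms:.2f} ms")
```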

Add act documentation (#2139)

4ccf284

* Add act documentation
* remove citation as we link the paper
* simplify docs
* fix pre commit

fix(docs): local docs links (#2149)

829d2d1

Remove validate_robot_cameras_for_policy (#2150)

656fc0f

* Remove validate_robot_cameras_for_policy as with rename processor the image keys can be renamed and mapped
* fix precommit

refactor(envs): add custom-observation-size (#2167)

0699b46

use TeleopEvents.RERECORD_EPISODE in gym_manipulator (#2165)

25f60c3

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>

Add missing finalize calls in example (#2175)

0c79cf8

- add missing calls to dataset.finalize in the example recording scripts
- add section in the dataset docs on calling dataset.finalize

@fracapuano

fix: very minor fix but hey devil is in details (#2168)

f29311c

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>

joaner merged commit c1d3d5c into ioai-tech:main Oct 14, 2025
5 checks passed

joaner merged 12 commits into ioai-tech:main from huggingface:main


7 participants
