
Questions on Video Representation and Inference in VITRA #22

@UEFI-code

  1. 2D -> 3D, ground-truth data
    The paper seems to state that the “raw videos” carry no labeling information.
    Are the raw videos monocular 2D videos captured by a single camera? If so, do you use a depth estimation algorithm to infer 3D information from them?

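  2. The "brain" doesn't participate in autoregressive inference
    As far as I can tell from the inference code below, `prepare_vlm_features` is called once, and `_forward_act_model` then runs the diffusion sampling on those features: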

        output_hs, inputs_masks = self.prepare_vlm_features(
            pixel_value,
            input_ids,
            attention_mask,
            current_state_mask,
            current_state,
            fov,
            use_cache=use_cache,
        )
        # handle multiple samples for one input
        samples, _ = self._forward_act_model(
            vlm_features = output_hs,
            attention_mask = inputs_masks,
            action_masks = x_mask,
            current_state = current_state,
            current_state_mask = current_state_mask,
            mode = "eval",
            repeated_diffusion_steps = sample_times,
            cfg_scale = cfg_scale,
            use_ddim = use_ddim,
            num_ddim_steps = num_ddim_steps,
        )
        action_np = samples.cpu().numpy() * x_mask.cpu().numpy()    # sample_times x T x D
        return action_np
  3. Open-loop robot control, with no feedback
    If the robot breaks something while executing the predicted actions, the model never finds out.
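
To make the open-loop concern concrete, here is a toy sketch. It is not VITRA code: `predict_chunk` and `rollout` are hypothetical names, and the "robot" is just a 1D position. It assumes the policy plans a whole chunk of actions from one observation, and compares executing the full chunk blindly with re-observing and re-planning after every step.

```python
from typing import List


def predict_chunk(state: float, target: float, horizon: int) -> List[float]:
    """Hypothetical stand-in for the policy: plan `horizon` equal steps to the target."""
    step = (target - state) / horizon
    return [step] * horizon


def rollout(
    start: float,
    target: float,
    horizon: int,
    replan_every: int,
    disturbance_at: int,
    disturbance: float,
) -> float:
    """Run `horizon` control steps, re-observing the state every `replan_every` steps."""
    state = start
    t = 0
    while t < horizon:
        # Observe the current state and plan all remaining steps.
        chunk = predict_chunk(state, target, horizon - t)
        # Execute only the first `replan_every` actions before observing again.
        for action in chunk[:replan_every]:
            state += action
            if t == disturbance_at:
                # Unmodelled event (e.g. a collision) that the plan did not foresee.
                state += disturbance
            t += 1
    return state


# Open loop: one observation, the whole chunk is executed without feedback.
open_loop = rollout(0.0, 1.0, horizon=10, replan_every=10,
                    disturbance_at=3, disturbance=-0.5)

# Closed loop: re-observe and re-plan after every action.
closed_loop = rollout(0.0, 1.0, horizon=10, replan_every=1,
                      disturbance_at=3, disturbance=-0.5)

print(f"open loop final error:   {abs(1.0 - open_loop):.3f}")
print(f"closed loop final error: {abs(1.0 - closed_loop):.3f}")
```

In this toy setup the open-loop run ends about 0.5 away from the target, because the disturbance is never observed, while the closed-loop run recovers. Whether VITRA can be run in the second mode (re-querying the model with fresh observations between chunks) is exactly what I would like to know.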
