Capture compiled artifacts #1

@deitch

The typical onnxruntime flow, as driven by llm-kvc.py, is the following (very simplified):

  1. Find the ONNX model in the Hugging Face cache, as specified via the CLI flags
  2. Run a compilation process, although it is more like transpilation plus compilation: the input is an ONNX model, the output is a .bundle with an ELF in it. Uses et-glow, gcc, llvm, neuralizer, etc.
  3. Run inference. Uses InferenceServer.

If you run the same model a second time, the process recognizes that the model has already been converted from ONNX to .bundle, and runs just the inference step. This takes about 1/10 of the time.
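To make the "compile once, then reuse" behavior concrete, here is a minimal sketch. It is not the actual llm-kvc.py logic: the function name `run_with_cache`, the bundle naming scheme, and the `compile_fn` / `infer_fn` callables are all hypothetical stand-ins for the real compilation toolchain and InferenceServer.

```python
from pathlib import Path
from typing import Callable


def run_with_cache(
    model_path: Path,
    cache_dir: Path,
    compile_fn: Callable[[Path, Path], None],
    infer_fn: Callable[[Path], str],
) -> str:
    """Compile the model only if no bundle exists yet, then run inference.

    compile_fn(model_path, bundle_path) and infer_fn(bundle_path) are
    hypothetical placeholders for the real compile and inference steps.
    """
    # Assumption: one bundle per model, named after the model file.
    bundle = cache_dir / (model_path.stem + ".bundle")
    if not bundle.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        compile_fn(model_path, bundle)  # slow path, first run only
    return infer_fn(bundle)  # fast path on every later run
```

If the cache directory does not survive between runs (for example, because it lives only inside a container), the slow path is taken every time, which is the problem this issue describes.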

The current nekko CLI does not recognize that there are intermediate artifacts to capture. To capture them, we need to:

  • decide on a compiled artifacts cache directory default location
  • have an option to the CLI to override that location
  • mount the location into the container
  • configure the llm-kvc.py script to support controlling that location
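The items above could fit together roughly as follows. This is only a sketch under stated assumptions, not the nekko implementation: the flag name `--compiled-cache-dir`, the default location under the user's cache directory, the in-container path `/cache/compiled`, and the `COMPILED_CACHE_DIR` environment variable are all hypothetical. The `-v` and `-e` arguments are the standard `docker run` bind-mount and environment flags.

```python
import argparse
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Hypothetical path where the cache is mounted inside the container.
CONTAINER_CACHE_DIR = "/cache/compiled"


def default_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick a default host location (assumed: under the user's cache dir)."""
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "nekko" / "compiled"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="nekko-sketch")
    # Hypothetical flag name; the real CLI may choose a different one.
    parser.add_argument(
        "--compiled-cache-dir",
        type=Path,
        default=None,
        help="host directory for compiled .bundle artifacts",
    )
    return parser


def resolve_cache_dir(
    argv: List[str], env: Optional[Mapping[str, str]] = None
) -> Path:
    """The CLI override wins; otherwise fall back to the default location."""
    args = build_parser().parse_args(argv)
    if args.compiled_cache_dir is not None:
        return args.compiled_cache_dir
    return default_cache_dir(env)


def mount_args(host_dir: Path) -> List[str]:
    """Arguments for `docker run`: bind-mount the cache and tell the
    script inside the container where to find it (hypothetical env var)."""
    return [
        "-v", f"{host_dir}:{CONTAINER_CACHE_DIR}",
        "-e", f"COMPILED_CACHE_DIR={CONTAINER_CACHE_DIR}",
    ]
```

With something like this in place, llm-kvc.py would only need to read the location it is given (here, the assumed `COMPILED_CACHE_DIR` variable) and write its .bundle output there instead of to a container-local path.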

cc @jerenkrantz
