diff --git a/README.md b/README.md index 2be964e75..acb7b96a8 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![GitHub Stars](https://img.shields.io/github/stars/NVIDIA-NeMo/Export-Deploy.svg?style=social&label=Star)](https://github.com/NVIDIA-NeMo/Export-Deploy/stargazers/) - + [![📖 Documentation](https://img.shields.io/badge/docs-nvidia-informational?logo=book)](https://docs.nvidia.com/nemo/export-deploy/latest/index.html) [![🔧 Installation](https://img.shields.io/badge/install-guide-blue?logo=terminal)](https://github.com/NVIDIA-NeMo/Export-Deploy?tab=readme-ov-file#-install) @@ -21,7 +21,7 @@ -The **Export-Deploy library ("NeMo Export-Deploy")** provides tools and APIs for exporting and deploying NeMo and 🤗Hugging Face models to production environments. It supports various deployment paths including TensorRT and vLLM deployment through NVIDIA Triton Inference Server and Ray Serve. +The **Export-Deploy library ("NeMo Export-Deploy")** provides tools and APIs for exporting and deploying NeMo and Hugging Face models to production environments. It supports various deployment paths including TensorRT and vLLM deployment through NVIDIA Triton Inference Server and Ray Serve. 
![image](docs/NeMo_Repo_Overview_ExportDeploy.png) @@ -32,8 +32,8 @@ The **Export-Deploy library ("NeMo Export-Deploy")** provides tools and APIs for ## 🚀 Key Features - Support for Large Language Models (LLMs) and Multimodal Models (MMs) -- Export Megatron-Brdige and Hugging Face models to optimized inference formats including vLLM -- Deploy Megatron-Brdige and Hugging Face models using Ray Serve or NVIDIA Triton Inference Server +- Export Megatron-Bridge and Hugging Face models to optimized inference formats including vLLM +- Deploy Megatron-Bridge, Megatron-LM and Hugging Face models using Ray Serve or NVIDIA Triton Inference Server - Multi-GPU and distributed inference capabilities - Multi-instance deployment options @@ -43,9 +43,10 @@ The **Export-Deploy library ("NeMo Export-Deploy")** provides tools and APIs for | Model / Checkpoint | vLLM | ONNX | TensorRT | |-------------------------------------------------------------------------------------------------|:---------:|:--------------------------:|:----------------------:| +| [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) | bf16 | N/A | N/A | | [Hugging Face](https://huggingface.co/docs/transformers/en/index) | bf16 | N/A | N/A | | [NIM Embedding](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html) | N/A | bf16, fp8, int8 (PTQ) | bf16, fp8, int8 (PTQ) | -| [NIM Reranking](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html) | N/A | Coming Soon | Coming Soon | +| [NIM Reranking](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html) | N/A | bf16, fp8, int8 (PTQ) | bf16, fp8, int8 (PTQ) | The support matrix above outlines the export capabilities for each model or checkpoint, including the supported precision options across various inference-optimized libraries. 
The export module enables exporting models that have been quantized using post-training quantization (PTQ) with the [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) library, as shown above. Models trained with low precision or quantization-aware training are also supported, as indicated in the table. @@ -57,6 +58,7 @@ Please note that not all large language models (LLMs) and multimodal models (MMs | Model / Checkpoint | RayServe | PyTriton | |-------------------------------------------------------------------------------------------|------------------------------------------|-------------------------| +| [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) | Single/Multi-Node Multi-GPU | Single-Node Multi-GPU | | [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) | Limited | Limited | | [Hugging Face](https://huggingface.co/docs/transformers/en/index) | Single-Node Multi-GPU,
Multi-instance | Single-Node Multi-GPU | | [vLLM](https://github.com/vllm-project/vllm) | N/A | Single-Node Multi-GPU | diff --git a/docs/llm/automodel/optimized/automodel-trtllm.md b/docs/llm/automodel/optimized/automodel-trtllm.md deleted file mode 100644 index 7db557d8e..000000000 --- a/docs/llm/automodel/optimized/automodel-trtllm.md +++ /dev/null @@ -1,302 +0,0 @@ - -# Deploy Automodel LLMs with TensorRT-LLM and Triton Inference Server - -This section shows how to use scripts and APIs to export a [NeMo AutoModel](https://docs.nvidia.com/nemo/automodel/latest/index.html) model (Hugging Face) model to TensorRT-LLM, and deploy it with the NVIDIA Triton Inference Server. - - -## Quick Example - -1. Pull down and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use: - - ```shell - docker pull nvcr.io/nvidia/nemo:vr - - docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \ - -w /opt/Export-Deploy \ - --name nemo-fw \ - nvcr.io/nvidia/nemo:vr - ``` - -2. Run the following deployment script to verify that everything is working correctly. The script exports the Hugging Face model to TensorRT-LLM and subsequently serves it on the Triton server: - - ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_triton.py \ - --hf_model_id_path meta-llama/Llama-3.2-1B \ - --model_type LlamaForCausalLM \ - --triton_model_name llama \ - --tensor_parallelism_size 1 - ``` - -3. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example). - -4. In a separate terminal, access the running container as follows: - - ```shell - docker exec -it nemo-fw bash - ``` - -5. To send a query to the Triton server, run the following script: - - ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/query.py -mn llama -p "What is the color of a banana?" 
-mol 5 - ``` - -## Use a Script to Deploy Hugging Face Models on a Triton Server - -You can deploy a Hugging Face model on Triton using the provided script. - -### Export and Deploy a Hugging Face Model - -After executing the script, it will export the model to TensorRT-LLM and then initiate the service on Triton. - -1. Start the container using the steps described in the **Quick Example** section. - -2. To deploy a model that needs to be downloaded from Hugging Face, you need to generate a Hugging Face token that has access to these models. Visit `Hugging Face `__ for more information. After you have the token, perform one of the following steps. - - - Log in to Hugging Face: - - ```shell - huggingface-cli login - ``` - - - Or, set the HF_TOKEN environment variable: - - ```shell - export HF_TOKEN=your_token_here - ``` - - **Note:** If you're using a locally downloaded model, you don't need to provide a Hugging Face token unless the model requires it for downloading additional resources. - -2. To begin serving a Hugging Face model, you can use either a model ID from the Hugging Face Hub or a path to a locally downloaded model: - - a. Using a Hugging Face model ID: - - ```shell - - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_triton.py \ - --hf_model_id_path meta-llama/Meta-Llama-3-8B-Instruct \ - --model_type LlamaForCausalLM \ - --triton_model_name llama \ - --tensor_parallelism_size 1 - ``` - - b. To use a locally downloaded model: - - ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_triton.py \ - --hf_model_id_path /path/to/your/local/model \ - --model_type LlamaForCausalLM \ - --triton_model_name llama \ - --tensor_parallelism_size 1 - ``` - - The following parameters are defined in the ``deploy_triton.py`` script: - - - ``--hf_model_id_path``: path or identifier of the Hugging Face model. 
This can be either: - - A Hugging Face model ID (e.g., "meta-llama/Meta-Llama-3-8B-Instruct") - - A local path to a downloaded model directory (e.g., "/path/to/your/local/model") - - ``--model_type``: type of the model. See the table below for supported model types. - - ``--triton_model_name``: name of the model on Triton. - - ``--triton_model_version``: version of the model. Default is 1. - - ``--triton_port``: port for the Triton server to listen for requests. Default is 8000. - - ``--triton_http_address``: HTTP address for the Triton server. Default is 0.0.0.0. - - ``--triton_model_repository``: TensorRT temp folder. Default is ``/tmp/trt_llm_model_dir/``. - - ``--tensor_parallelism_size``: number of GPUs to split the tensors for tensor parallelism. Default is 1. - - ``--pipeline_parallelism_size``: number of GPUs to split the model for pipeline parallelism. Default is 1. - - ``--dtype``: data type of the model on TensorRT-LLM. Default is "bfloat16". Currently, only "bfloat16" is supported. - - ``--max_input_len``: maximum input length of the model. Default is 256. - - ``--max_output_len``: maximum output length of the model. Default is 256. - - ``--max_batch_size``: maximum batch size of the model. Default is 8. - - ``--max_num_tokens``: maximum number of tokens. Default is None. - - ``--opt_num_tokens``: optimum number of tokens. Default is None. - -3. 
The following table shows the supported Hugging Face model types and their corresponding ``model_type`` values: - - - | Hugging Face Model | model_type | - | :------------------- | ------------------------------| - | GPT2LMHeadModel | GPTForCausalLM | - | GPT2LMHeadCustomModel| GPTForCausalLM| - | GPTBigCodeForCausalLM | GPTForCausalLM| - | Starcoder2ForCausalLM | GPTForCausalLM| - | JAISLMHeadModel | GPTForCausalLM| - | GPTForCausalLM | GPTForCausalLM| - | NemotronForCausalLM | GPTForCausalLM| - | OPTForCausalLM | OPTForCausalLM| - | BloomForCausalLM | BloomForCausalLM| - | RWForCausalLM | FalconForCausalLM| - | FalconForCausalLM | FalconForCausalLM| - | PhiForCausalLM | PhiForCausalLM| - | Phi3ForCausalLM | Phi3ForCausalLM| - | Phi3VForCausalLM | Phi3ForCausalLM| - | Phi3SmallForCausalLM | Phi3ForCausalLM| - | PhiMoEForCausalLM | Phi3ForCausalLM| - | MambaForCausalLM | MambaForCausalLM| - | GPTNeoXForCausalLM | GPTNeoXForCausalLM| - | GPTJForCausalLM | GPTJForCausalLM| - | MptForCausalLM | MPTForCausalLM| - | MPTForCausalLM | MPTForCausalLM| - | GLMModel | ChatGLMForCausalLM| - | ChatGLMModel | ChatGLMForCausalLM| - | ChatGLMForCausalLM | ChatGLMForCausalLM| - | ChatGLMForConditionalGeneration| ChatGLMForCausalLM| - | LlamaForCausalLM | LLaMAForCausalLM| - | LlavaLlamaModel | LLaMAForCausalLM| - | ExaoneForCausalLM | LLaMAForCausalLM| - | MistralForCausalLM | LLaMAForCausalLM| - | MixtralForCausalLM | LLaMAForCausalLM| - | ArcticForCausalLM | LLaMAForCausalLM| - | Grok1ModelForCausalLM | GrokForCausalLM| - | InternLMForCausalLM | LLaMAForCausalLM| - | InternLM2ForCausalLM | LLaMAForCausalLM| - | InternLMXComposer2ForCausalLM | LLaMAForCausalLM| - | GraniteForCausalLM | LLaMAForCausalLM| - | GraniteMoeForCausalLM | LLaMAForCausalLM| - | MedusaForCausalLM | MedusaForCausalLm| - | MedusaLlamaForCausalLM | MedusaForCausalLm| - | ReDrafterForCausalLM | ReDrafterForCausalLM| - | BaichuanForCausalLM | BaichuanForCausalLM| - | BaiChuanForCausalLM | BaichuanForCausalLM| - 
| SkyworkForCausalLM | LLaMAForCausalLM| - | GEMMA | GemmaForCausalLM| - | GEMMA2 | GemmaForCausalLM| - | QWenLMHeadModel | QWenForCausalLM| - | QWenForCausalLM | QWenForCausalLM| - | Qwen2ForCausalLM | QWenForCausalLM| - | Qwen2MoeForCausalLM | QWenForCausalLM| - | Qwen2ForSequenceClassification | QWenForCausalLM| - | Qwen2VLForConditionalGeneration| QWenForCausalLM| - | Qwen2VLModel | QWenForCausalLM| - | WhisperEncoder | WhisperEncoder| - | EncoderModel | EncoderModel| - | DecoderModel | DecoderModel| - | DbrxForCausalLM | DbrxForCausalLM| - | RecurrentGemmaForCausalLM | RecurrentGemmaForCausalLM| - | CogVLMForCausalLM | CogVLMForCausalLM| - | DiT | DiT| - | DeepseekForCausalLM | DeepseekForCausalLM| - | DeciLMForCausalLM | DeciLMForCausalLM| - | DeepseekV2ForCausalLM | DeepseekV2ForCausalLM| - | EagleForCausalLM | EagleForCausalLM| - | CohereForCausalLM | CohereForCausalLM| - | MLLaMAModel | MLLaMAForCausalLM| - | MllamaForConditionalGeneration | MLLaMAForCausalLM| - | BertForQuestionAnswering | BertForQuestionAnswering| - | BertForSequenceClassification | BertForSequenceClassification| - | BertModel | BertModel| - | RobertaModel | RobertaModel| - | RobertaForQuestionAnswering | RobertaForQuestionAnswering| - | RobertaForSequenceClassification | RobertaForSequenceClassification| - -4. Whenever the script is executed, it initiates the service by exporting the Hugging Face model to TensorRT-LLM. To skip the exporting step in the optimized inference option, you can specify an empty directory to save the TensorRT-LLM engine produced. 
Stop the running container and then run the following command to specify an empty directory: - - ```shell - - mkdir tmp_triton_model_repository - - docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \ - -v ${PWD}:/opt/checkpoints/ \ - -w /opt/Export-Deploy \ - nvcr.io/nvidia/nemo:vr - - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_triton.py \ - --hf_model_id_path /path/to/your/local/model \ - --model_type LlamaForCausalLM \ - --triton_model_name llama \ - --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \ - --tensor_parallelism_size 1 - ``` - - The model will be exported to the specified folder after executing the script mentioned above so that it can be reused later. - -5. To load the exported model directly, run the following script within the container: - - ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_triton.py \ - --triton_model_name llama \ - --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \ - --model_type LlamaForCausalLM - ``` - -## Export and Deploy a LLM Model with TensorRT-LLM API - -Alternatively, if the **deploy_triton** script is unable to export your model to TensorRT-LLM, you can leverage the new [TensorRT-LLM LLM API](https://nvidia.github.io/TensorRT-LLM/latest/quick-start-guide.html#run-offline-inference-with-llm-api). This API provides a streamlined way to export and deploy models. See the example below: - -```shell -python /opt/Export-Deploy/scripts/deploy/nlp/deploy_trtllm_api_triton.py \ - --hf_model_id_path /opt/checkpoints/hf_llama31_8B_nemo2.nemo \ - --triton_model_name llama \ - --tensor_parallel_size 1 -``` - -After starting the Triton server, you can query the deployed model using the **/opt/Export-Deploy/scripts/deploy/nlp/query.py** script, as demonstrated in the previous steps. - - -## Use NeMo Export and Deploy Module APIs to Run Inference - -Up until now, we have used scripts for exporting and deploying Hugging Face models. 
However, NeMo's deploy and export modules offer straightforward APIs for deploying models to Triton and exporting Hugging Face models to TensorRT-LLM. - -### Export a Hugging Face Model to TensorRT-LLM - -You can use the APIs in the export module to export a Hugging Face model to TensorRT-LLM. The following code example assumes the ``/opt/checkpoints/tmp_trt_llm`` path exists. - -1. Run the following command: - - ```python - from nemo_export.tensorrt_llm import TensorRTLLM - - trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/") - # Using a Hugging Face model ID - trt_llm_exporter.export_hf_model( - hf_model_path="meta-llama/Meta-Llama-3-8B-Instruct", - model_type="LlamaForCausalLM", - tensor_parallelism_size=1, - ) - # Or using a local model path - - trt_llm_exporter.export_hf_model( - hf_model_path="/path/to/your/local/model", - model_type="LlamaForCausalLM", - tensor_parallelism_size=1, - } - - trt_llm_exporter.forward( - ["What is the best city in the world?"], - max_output_token=15, - top_k=1, - top_p=0.0, - temperature=1.0, - ) - ``` - -2. Be sure to check the TensorRTLLM class docstrings for details. - -### Deploy a NeMo Automodel LLM to TensorRT-LLM with APIs - -You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the ``/opt/checkpoints/tmp_trt_llm`` path exists. - -1. 
Run the following command: - - ```python - from nemo_export.tensorrt_llm import TensorRTLLM - from nemo_deploy import DeployPyTriton - - trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/") - # Using a Hugging Face model ID - trt_llm_exporter.export_hf_model( - hf_model_path="meta-llama/Llama-2-7b-hf", - model_type="LlamaForCausalLM", - tensor_parallelism_size=1, - ) - # Or using a local model path - - trt_llm_exporter.export_hf_model( - hf_model_path="/path/to/your/local/model", - model_type="LlamaForCausalLM", - tensor_parallelism_size=1, - ) - - nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama", http_port=8000) - nm.deploy() - nm.serve() - ``` diff --git a/docs/llm/automodel/optimized/index.md b/docs/llm/automodel/optimized/index.md index 149fb0a12..3891d171e 100644 --- a/docs/llm/automodel/optimized/index.md +++ b/docs/llm/automodel/optimized/index.md @@ -1,12 +1,11 @@ # Export and Deploy NeMo Automodel LLMs -NeMo Export-Deploy library offers scripts and APIs to export [NeMo AutoModel](https://docs.nvidia.com/nemo/automodel/latest/index.html) models to two inference optimized libraries, TensorRT-LLM and vLLM, and to deploy the exported model with the NVIDIA Triton Inference Server. +NeMo Export-Deploy library offers scripts and APIs to export [NeMo AutoModel](https://docs.nvidia.com/nemo/automodel/latest/index.html) models to the vLLM inference optimized library, and to deploy the exported model with the NVIDIA Triton Inference Server. ```{toctree} :maxdepth: 4 :titlesonly: :hidden: -Deploy TensorRT-LLM with Triton Deploy vLLM with Triton ``` \ No newline at end of file diff --git a/docs/llm/index.md b/docs/llm/index.md index 40760c81b..93bd944de 100644 --- a/docs/llm/index.md +++ b/docs/llm/index.md @@ -1,6 +1,6 @@ # Export and Deploy Large Language Models -The Export-Deploy library provides comprehensive tools and APIs for exporting and deploying Large Language Models (LLMs) to production environments. 
This library supports multiple checkpoint formats and offers various deployment paths including TensorRT-LLM and vLLM deployment through NVIDIA Triton Inference Server and Ray Serve. +The Export-Deploy library provides comprehensive tools and APIs for exporting and deploying Large Language Models (LLMs) to production environments. This library supports multiple checkpoint formats and offers various deployment paths including vLLM deployment through NVIDIA Triton Inference Server and Ray Serve. ## Overview @@ -17,10 +17,7 @@ The library supports several checkpoint formats, each with specific capabilities **Supported Export and Deployment Paths:** - Model deployment with Triton and Ray Serve - -**Export and Deployment Paths Coming Soon:** -- TensorRT-LLM export and deployment with Triton and Ray Serve -- vLLM export and deployment with Triton and Ray Serve +- vLLM export and deployment with Triton ### AutoModel Model/Checkpoints @@ -29,7 +26,6 @@ The library supports several checkpoint formats, each with specific capabilities **Supported Export and Deployment Paths:** - Model deployment with Triton and Ray Serve -- TensorRT-LLM export and deployment with Triton and Ray Serve - vLLM export and deployment with Triton and Ray Serve diff --git a/docs/llm/mbridge/index.md b/docs/llm/mbridge/index.md index 35f403f59..6a5a0a3a4 100644 --- a/docs/llm/mbridge/index.md +++ b/docs/llm/mbridge/index.md @@ -5,8 +5,7 @@ The [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) checkpoint With the Export-Deploy library, you can seamlessly export and deploy Megatron-Bridge checkpoints across a variety of production environments. The following export and deployment paths are supported for Megatron-Bridge models: - **Model deployment with Triton and Ray Serve:** Directly serve Megatron-Bridge models using NVIDIA Triton Inference Server or Ray Serve for scalable inference. 
-- **TensorRT-LLM export and deployment with Triton and Ray Serve:** Convert Megatron-Bridge checkpoints into optimized TensorRT-LLM engines for high-performance inference, deployable via Triton or Ray Serve. Support for this feature is coming soon. -- **vLLM export and deployment with Triton:** Export Megatron-Bridge models to the vLLM format for efficient serving with Triton. Support for this feature is coming soon. +- **vLLM export and deployment with Triton:** Export Megatron-Bridge models to the vLLM format for efficient serving with Triton. ```{toctree} diff --git a/docs/llm/mbridge/optimized/index.md b/docs/llm/mbridge/optimized/index.md index 90429b532..6ff15b103 100644 --- a/docs/llm/mbridge/optimized/index.md +++ b/docs/llm/mbridge/optimized/index.md @@ -1,6 +1,6 @@ # Deploy Megatron-Bridge LLMs by Exporting to Inference Optimized Libraries -Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints using inference-optimized libraries such as vLLM and TensorRT-LLM. +Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints using inference-optimized libraries such as vLLM. ```{toctree} :maxdepth: 1 @@ -9,5 +9,3 @@ Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints usin vLLM ``` -**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM is coming soon. Please check back for updates. - diff --git a/docs/mm/index.md b/docs/mm/index.md index e94ea65fd..3d20a905e 100644 --- a/docs/mm/index.md +++ b/docs/mm/index.md @@ -1,6 +1,6 @@ # Export and Deploy Multimodal Models -The Export-Deploy library provides comprehensive tools and APIs for exporting and deploying Multimodal Models (MMs) to production environments. This library supports multiple checkpoint formats and offers various deployment paths including TensorRT-LLM deployment through NVIDIA Triton Inference Server. 
+The Export-Deploy library provides comprehensive tools and APIs for exporting and deploying Multimodal Models (MMs) to production environments. This library supports multiple checkpoint formats and offers various deployment paths through NVIDIA Triton Inference Server. ## Overview @@ -17,7 +17,6 @@ The library supports several checkpoint formats, each with specific capabilities **Export and Deployment Paths Coming Soon:** - Model deployment with Triton and Ray Serve -- TensorRT-LLM export and deployment with Triton and Ray Serve ### AutoModel Model/Checkpoints @@ -26,7 +25,6 @@ The library supports several checkpoint formats, each with specific capabilities **Export and Deployment Paths Coming Soon:** - Model deployment with Triton -- TensorRT-LLM export and deployment with Triton diff --git a/nemo_deploy/llm/query_llm.py b/nemo_deploy/llm/query_llm.py index 3a6973a77..e900e4377 100755 --- a/nemo_deploy/llm/query_llm.py +++ b/nemo_deploy/llm/query_llm.py @@ -449,7 +449,7 @@ def query_llm( class NemoQueryvLLM(NemoQueryLLMBase): - """Sends a query to Triton for TensorRT-LLM API deployment inference. + """Sends a query to Triton for vLLM deployment inference. Example: from nemo_deploy import NemoQueryvLLM diff --git a/nemo_export_deploy_common/import_utils.py b/nemo_export_deploy_common/import_utils.py index 1543ab46c..ca66938d8 100644 --- a/nemo_export_deploy_common/import_utils.py +++ b/nemo_export_deploy_common/import_utils.py @@ -36,7 +36,6 @@ ) MISSING_TRITON_MSG = "pytriton is not available. Please install it with `pip install nvidia-pytriton`." -MISSING_TENSORRT_LLM_MSG = "tensorrt_llm is not available. Please install it with `pip install tensorrt-llm`." MISSING_TENSORRT_MSG = "tensorrt is not available. Please install it with `pip install nvidia-tensorrt`." MISSING_NEMO_MSG = "nemo is not available. Please install it with `pip install nemo`." MISSING_MBRIDGE_MSG = (