[PoC] Improve TRTLLM deployment UX #650

rmccorm4 · 2024-11-22T21:37:11Z

Changes

- Replace mandatory template values in configs with sensible defaults
- Build the TRTLLM engine on model load, using the LLM API, if none is found
- Add env vars for conveniently configuring the engine and tokenizer from a single location instead of specifying them in every model config

Example Usage

Quickstart - no engine, no tokenizer, build on demand

# Launch TRTLLM container
docker run -ti \
    --gpus all \
    --network=host \
    --shm-size=1g \
    --ulimit memlock=-1 \
    -e HF_TOKEN \
    -v ${HOME}:/mnt \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3

# Clone these changes
git clone -b rmccormick/ux https://github.com/triton-inference-server/tensorrtllm_backend.git

# Specify the directory that engines and tokenizer config are read from, or written to when built on demand
export TRTLLM_ENGINE_DIR="/tmp/hackathon"
# Specify model to build if TRTLLM_ENGINE_DIR has no engines
export TRTLLM_MODEL="meta-llama/Meta-Llama-3.1-8B-Instruct"
# Workaround: point at the HF tokenizer explicitly so it is available while the engine
# is still being built on demand (avoids ordering issues with model loading), or when
# the tokenizer exists in a different location.
export TRTLLM_TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm

Pre-built engine + tokenizer already in same location

export TRTLLM_ENGINE_DIR="/tmp/hackathon"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm

Pre-built engine + tokenizer in different locations

export TRTLLM_ENGINE_DIR="/tmp/hackathon"
export TRTLLM_TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm
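
The three scenarios above differ only in which variables are set and in whether TRTLLM_ENGINE_DIR already holds engines. The sketch below spells out that decision as a small shell function. It is not code from this PR: the function name `trtllm_engine_plan` is hypothetical, detecting an engine by the presence of a `*.engine` file is an assumption, and the error cases are guesses at reasonable behavior rather than what the backend actually does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): report which path the env vars select.
trtllm_engine_plan() {
    local dir="${TRTLLM_ENGINE_DIR:-}"
    local f has_engine=0

    if [ -z "$dir" ]; then
        # Assumption: without an engine directory there is nothing to read or write.
        echo "error: TRTLLM_ENGINE_DIR is not set" >&2
        return 1
    fi

    # Assumption: a built engine shows up as at least one *.engine file.
    for f in "$dir"/*.engine; do
        if [ -e "$f" ]; then
            has_engine=1
            break
        fi
    done

    if [ "$has_engine" -eq 1 ]; then
        echo "reuse engines in $dir"
    elif [ -n "${TRTLLM_MODEL:-}" ]; then
        echo "build ${TRTLLM_MODEL} into $dir"
    else
        echo "error: no engines in $dir and TRTLLM_MODEL is not set" >&2
        return 1
    fi
}
```

Running `trtllm_engine_plan` before `tritonserver` gives a quick read on whether the server is about to spend time building an engine or will load an existing one.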

Further customization/configuration

Edit the config.pbtxt files directly to tune Triton or TRT-LLM runtime fields as needed.
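
As one illustration of such a manual edit, the sketch below changes the instance count in a config.pbtxt from the shell. The helper name `set_instance_count` is hypothetical and not part of this PR. It assumes the file has an `instance_group` block with a `count:` line on its own line, as in the tensorrt_llm_bls config touched by this PR, and it rewrites only the first `count:` that follows `instance_group`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): set instance_group count in a config.pbtxt.
# Usage: set_instance_count <path/to/config.pbtxt> <count>
set_instance_count() {
    local config="$1" count="$2" tmp
    tmp="$(mktemp)" || return 1
    # Track when we are inside instance_group, then rewrite the next "count:" line,
    # keeping its original indentation.
    awk -v n="$count" '
        /^[ \t]*instance_group/ { in_group = 1 }
        in_group && /^[ \t]*count:/ { sub(/count:.*/, "count: " n); in_group = 0 }
        { print }
    ' "$config" > "$tmp" && mv "$tmp" "$config"
}
```

For example, `set_instance_count ./tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt 2` would raise the BLS instance count from the default of 1. Whether a higher count helps is workload-dependent.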

Open Items

- Ordering: If the engine and tokenizer don't exist, and the preprocessing/postprocessing models load before the tensorrt_llm model has built the engine and downloaded the tokenizer, they fail to load because no tokenizer is found.
  - Added the TRTLLM_TOKENIZER env var as a workaround for the ordering issue for now.
- Support building the engine from a TRTLLM-generated config.json if the config is found but engines are not
- Support configuring more TRTLLM backend/runtime fields from the engine's config.json
- Test a multi-GPU engine (e.g. Llama 70B)
- Reuse common logic around tokenizer / env vars in the preprocessing and postprocessing models
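
Until the ordering issue is resolved, a pre-flight check can catch the failing case before the server starts. The sketch below is not part of this PR: the function name `trtllm_tokenizer_source` is hypothetical, and treating a `tokenizer_config.json` file in TRTLLM_ENGINE_DIR as proof that a tokenizer is present is an assumption about how the tokenizer was saved.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the PR): print where the tokenizer
# would come from, or fail if preprocessing/postprocessing would find none.
trtllm_tokenizer_source() {
    local dir="${TRTLLM_ENGINE_DIR:-}"

    # An explicit TRTLLM_TOKENIZER sidesteps the ordering issue entirely.
    if [ -n "${TRTLLM_TOKENIZER:-}" ]; then
        echo "${TRTLLM_TOKENIZER}"
        return 0
    fi

    # Assumption: a saved HF tokenizer leaves tokenizer_config.json in the directory.
    if [ -n "$dir" ] && [ -f "$dir/tokenizer_config.json" ]; then
        echo "$dir"
        return 0
    fi

    echo "no tokenizer available yet; set TRTLLM_TOKENIZER" >&2
    return 1
}
```

Chaining it as `trtllm_tokenizer_source && tritonserver --model-repository ...` would stop early with a clear message instead of a model load failure.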

[Extra] Probably out of scope for this PR, but the Python model also segfaults on shutdown:

[ced35d0-lcedt:2992 :0:2992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x60)
==== backtrace (tid:   2992) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000b44d0 triton::backend::python::Metric::SaveToSharedMemory()  :0
 2 0x00000000000b536e triton::backend::python::Metric::Clear()  :0
 3 0x00000000000b9291 triton::backend::python::MetricFamily::~MetricFamily()  :0
 4 0x00000000000677f2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()  ???:0
 5 0x000000000006822a pybind11::class_<triton::backend::python::MetricFamily, std::shared_ptr<triton::backend::python::MetricFamily> >::dealloc()  ???:0
 6 0x0000000000037b5d pybind11::detail::clear_instance()  :0
 7 0x0000000000038b13 pybind11_object_dealloc()  ???:0
 8 0x000000000011bea5 PyODict_DelItem()  ???:0
 9 0x0000000000144b37 PyType_GenericAlloc()  ???:0
10 0x000000000005142e triton::backend::python::Stub::~Stub()  :0
11 0x0000000000028f53 main()  ???:0
12 0x0000000000029d90 __libc_init_first()  ???:0
13 0x0000000000029e40 __libc_start_main()  ???:0
14 0x0000000000029db5 _start()  ???:0
=================================
I1122 22:08:07.287915 2117 model_lifecycle.cc:624] "successfully unloaded 'tensorrt_llm' version 1"

rmccorm4 · 2024-11-25T19:49:36Z

all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt

@@ -330,7 +330,7 @@ parameters: {

 instance_group [
  {
-    count: ${bls_instance_count}
+    count: 1


TODO: Evaluate what reasonable default instance counts are.

[PoC] Remove need for template values in configs, support building TRTLLM engine on model load if none found, add env vars for conveniently configuring engine and tokenizers from a single location

934ca8d

rmccorm4 marked this pull request as draft November 22, 2024 21:37

rmccorm4 added 2 commits November 22, 2024 14:54

Remove kvcache config, add TRTLLM_TOKENIZER env var support

4ad7815

Add placeholder for using engine build config from config.json, but comment it out because it can't be ingested directly

1bea632
