[PoC] Improve TRTLLM deployment UX #650 (Draft)

rmccorm4 wants to merge 3 commits into main from rmccormick/ux
Conversation

rmccorm4 (Contributor) commented Nov 22, 2024

Changes

  • Remove mandatory template values in configs with some sensible defaults
  • Support building the TRTLLM engine on model load if none is found, using the LLM API (see the sketch after this list)
  • Add env vars for conveniently configuring the engine and tokenizer from a single location instead of specifying them in all the model configs
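
A rough sketch of what the build-on-load path could look like. This is not the backend's actual implementation; the tensorrt_llm LLM API calls (LLM(...), llm.save(...)) and the *.engine file layout are assumptions and may vary by TRT-LLM version.

# Hypothetical build-on-load sketch; LLM API names and the *.engine layout
# are assumptions and may differ by TRT-LLM version.
import os
from pathlib import Path

def ensure_engine(engine_dir: str, model: str) -> str:
    """Build an engine into engine_dir if none is found there."""
    path = Path(engine_dir)
    path.mkdir(parents=True, exist_ok=True)
    if any(path.glob("*.engine")):
        return engine_dir  # Pre-built engine found, nothing to build

    from tensorrt_llm import LLM  # High-level LLM API
    llm = LLM(model=model)        # Downloads the HF checkpoint and builds an engine
    llm.save(engine_dir)          # Persist the built engine so later loads reuse it
    return engine_dir

if __name__ == "__main__":
    # Driven by the env vars introduced in this PR
    ensure_engine(os.environ["TRTLLM_ENGINE_DIR"], os.environ["TRTLLM_MODEL"])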

Example Usage

Quickstart - no engine, no tokenizer, build on demand

# Launch TRTLLM container
docker run -ti \
    --gpus all \
    --network=host \
    --shm-size=1g \
    --ulimit memlock=-1 \
    -e HF_TOKEN \
    -v ${HOME}:/mnt \
    -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3

# Clone these changes
git clone -b rmccormick/ux https://github.com/triton-inference-server/tensorrtllm_backend.git

# Specify directory for engines and tokenizer config to either be read from, or written to
export TRTLLM_ENGINE_DIR="/tmp/hackathon"
# Specify model to build if TRTLLM_ENGINE_DIR has no engines
export TRTLLM_MODEL="meta-llama/Meta-Llama-3.1-8B-Instruct"
# Workaround to support the HF tokenizer while the engine is built on demand (avoids
# model-loading ordering issues), or when the tokenizer lives in a different location.
export TRTLLM_TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm
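
Once the server is up, you can smoke test it through Triton's HTTP generate endpoint. The snippet below is a hedged example, assuming the "ensemble" model name and the text_input / max_tokens / text_output tensor names from the repo's model configs; adjust if your configs differ.

# Minimal smoke test against Triton's /generate endpoint; model and tensor
# names are taken from the repo's default configs and may differ in your setup.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "What is machine learning?", "max_tokens": 64},
)
resp.raise_for_status()
print(resp.json()["text_output"])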

Pre-built engine + tokenizer already in same location

export TRTLLM_ENGINE_DIR="/tmp/hackathon"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm

Pre-built engine + tokenizer in different locations

export TRTLLM_ENGINE_DIR="/tmp/hackathon"
export TRTLLM_TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"

# Start server
tritonserver --model-repository ./tensorrtllm_backend/all_models/inflight_batcher_llm

Further customization/configuration

Manually adjust values in the config.pbtxt files as needed to tune Triton or TRT-LLM runtime fields.

Open Items

  • Ordering: if the engine and tokenizer don't exist and the preprocessing/postprocessing models load before the tensorrt_llm model builds the engine and downloads the tokenizer, they will fail to load because no tokenizer is found.
    • Added the TRTLLM_TOKENIZER env var as a WAR for the ordering issue for now.
  • Support building the engine from a TRTLLM-generated config.json if the config is found but engines are not
  • Support configuring more TRTLLM backend/runtime fields from the engine's config.json
  • Test a multi-GPU engine (e.g. Llama 70B)
  • Re-use common logic around the tokenizer / env vars in the preprocessing and postprocessing models (see the sketch below)
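
For that last item, a minimal sketch of a shared helper that both model.py files could import. The module and function names are hypothetical; the fallback logic mirrors the TRTLLM_TOKENIZER WAR above.

# Hypothetical shared helper (e.g. tokenizer_utils.py) for the preprocessing
# and postprocessing models.
import os
from transformers import AutoTokenizer

def load_tokenizer():
    # Prefer the explicit override so pre/postprocessing can load even while
    # the tensorrt_llm model is still building the engine on demand.
    source = os.environ.get("TRTLLM_TOKENIZER") or os.environ.get("TRTLLM_ENGINE_DIR")
    if not source:
        raise RuntimeError("Set TRTLLM_TOKENIZER or TRTLLM_ENGINE_DIR")
    return AutoTokenizer.from_pretrained(source)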
[Extra] Probably not in scope for this PR, but there is also a Python model shutdown segfault:
[ced35d0-lcedt:2992 :0:2992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x60)
==== backtrace (tid:   2992) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000b44d0 triton::backend::python::Metric::SaveToSharedMemory()  :0
 2 0x00000000000b536e triton::backend::python::Metric::Clear()  :0
 3 0x00000000000b9291 triton::backend::python::MetricFamily::~MetricFamily()  :0
 4 0x00000000000677f2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()  ???:0
 5 0x000000000006822a pybind11::class_<triton::backend::python::MetricFamily, std::shared_ptr<triton::backend::python::MetricFamily> >::dealloc()  ???:0
 6 0x0000000000037b5d pybind11::detail::clear_instance()  :0
 7 0x0000000000038b13 pybind11_object_dealloc()  ???:0
 8 0x000000000011bea5 PyODict_DelItem()  ???:0
 9 0x0000000000144b37 PyType_GenericAlloc()  ???:0
10 0x000000000005142e triton::backend::python::Stub::~Stub()  :0
11 0x0000000000028f53 main()  ???:0
12 0x0000000000029d90 __libc_init_first()  ???:0
13 0x0000000000029e40 __libc_start_main()  ???:0
14 0x0000000000029db5 _start()  ???:0
=================================
I1122 22:08:07.287915 2117 model_lifecycle.cc:624] "successfully unloaded 'tensorrt_llm' version 1"

…TLLM engine on model load if none found, add env vars for conveniently configuring engine and tokenizers from a single location
@rmccorm4 rmccorm4 marked this pull request as draft November 22, 2024 21:37
@@ -330,7 +330,7 @@ parameters: {

 instance_group [
   {
-    count: ${bls_instance_count}
+    count: 1
rmccorm4 (author) commented:
TODO: Should evaluate what reasonable instance count defaults are
