
Parallel processing of /explain/ #108

Open
xsuchy opened this issue Jan 14, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@xsuchy (Member) commented Jan 14, 2025

Today, we are able to process only one request to /explain at a time. This does not scale.

We should investigate how to parallelize this.

In https://github.com/fedora-copr/logdetective/pull/106/files I discovered that:

I found that llama_cpp.server should have --parallel 2. See https://www.reddit.com/r/LocalLLaMA/comments/1be845y/multiple_concurrent_generations_with_llamacpp/
Is this what we need for parallel access?
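
A quick way to confirm the current serialization is to fire several requests at /explain concurrently and compare timings: if the server handles one request at a time, wall-clock time grows roughly linearly with the number of requests. A minimal sketch (the URL, payload shape, and timeout here are assumptions, not the actual API):

```python
# Hypothetical smoke test: send n concurrent requests to /explain and time them.
# If the server processes one request at a time, wall clock grows ~linearly with n.
import asyncio
import time

import httpx  # third-party async HTTP client

URL = "http://localhost:8080/explain/"  # assumed endpoint; adjust to the real deployment
PAYLOAD = {"url": "https://example.com/build.log"}  # assumed request shape

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    await client.post(URL, json=PAYLOAD, timeout=600)
    return time.perf_counter() - start

async def main(n: int = 4) -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        durations = await asyncio.gather(*(one_request(client) for _ in range(n)))
        total = time.perf_counter() - start
    print("per-request seconds:", [round(d, 1) for d in durations])
    print(f"wall clock for {n} concurrent requests: {total:.1f}s")

if __name__ == "__main__":
    asyncio.run(main())
```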

But later @tt discovered that:

Unfortunately, we are using llama-cpp-python's server, not the one in llama.cpp. The server we use doesn't know the parallel flag; only threading appears to be configurable: https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#openai-compatible-web-server

We should either replace our server or find another way to parallelize this.
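
One possible "other way", if we keep llama-cpp-python's server, would be to run several instances on different ports and round-robin requests across them from our side. A rough sketch only: the ports are hypothetical, and each instance loads its own copy of the model, so memory use multiplies with the instance count:

```python
# Sketch of a naive round-robin dispatcher over several llama-cpp-python
# server instances. Ports are hypothetical; each instance holds its own
# copy of the model in memory, which is the main cost of this approach.
import itertools

import httpx

BACKENDS = itertools.cycle([
    "http://localhost:8001",
    "http://localhost:8002",
])

async def completion(prompt: str) -> str:
    backend = next(BACKENDS)  # pick the next instance in rotation
    async with httpx.AsyncClient(timeout=600) as client:
        # llama-cpp-python's server exposes an OpenAI-compatible API,
        # so /v1/completions should work; payload details may differ.
        resp = await client.post(
            f"{backend}/v1/completions",
            json={"prompt": prompt, "max_tokens": 512},
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
```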

@jpodivin (Collaborator) commented

Replacing our server with the one from llama.cpp would be the most straightforward option.
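
For reference, llama.cpp's own server takes a --parallel (or -np) flag for the number of concurrent decoding slots, with the context window divided among the slots. A hedged sketch of what a drop-in launch might look like (model path and port are hypothetical; flags as documented upstream at the time of writing):

```python
# Sketch: launching llama.cpp's llama-server with parallel slots.
# Since it speaks the same OpenAI-style API, swapping it in should
# mostly mean pointing our client at the new port.
import subprocess

subprocess.Popen([
    "llama-server",                             # llama.cpp's bundled server binary
    "-m", "/models/mistral-7b-instruct.gguf",   # hypothetical model path
    "--host", "0.0.0.0",
    "--port", "8000",
    "-c", "8192",                               # total context, divided among slots
    "--parallel", "2",                          # number of concurrent request slots
])
```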

@jpodivin added the enhancement label Jan 17, 2025