-
@zainhoda Any thoughts on this? Say we have a SQL database and I can automatically extract the DDL. How should I chunk it and use it to get the accuracy I should expect from Vanna?
-
I think there are a few phases to this:
Short-term: In discussions with other users, what has been successful elsewhere is to use the DDL statements in full with a large-context model during the initial training process, and then use that to generate a lot of question-SQL pairs (a rough sketch is below).
Medium-term: I think that
Long-term: In the long term, any training data entered should construct a knowledge graph behind the scenes to give a structured understanding of what tables, columns, values, and concepts exist in the database. Constructing and traversing the knowledge graph is non-trivial. I've been looking into potentially using a similar approach to this: https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
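A rough sketch of that short-term flow, assuming the ChromaDB + OpenAI classes from the Vanna quickstart (import paths and config keys may differ by version, and the pair-generation prompt here is only illustrative, not part of Vanna itself):

```python
from openai import OpenAI
from vanna.openai import OpenAI_Chat            # import paths may vary by Vanna version
from vanna.chromadb import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={"api_key": "sk-...", "model": "gpt-4o"})

full_ddl = open("schema.sql").read()

# 1. Store the DDL itself as training data.
vn.train(ddl=full_ddl)

# 2. Ask a large-context model to propose question-SQL pairs for the schema,
#    review them, then store each pair individually so retrieval later matches
#    on questions rather than on one giant schema string.
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Given this schema, write 20 natural-language questions and the SQL "
            "that answers each, formatted as one 'question ||| sql' per line:\n\n"
            + full_ddl
        ),
    }],
)
for line in resp.choices[0].message.content.splitlines():
    if "|||" in line:
        question, sql = line.split("|||", 1)
        vn.train(question=question.strip(), sql=sql.strip())
```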
-
We have been successfully using Vanna in our application, but we recently moved back from GPT-4 to GPT-3.5 due to performance issues. With this change we started hitting token-limit errors again. This happens because we train with the entire DDL/schema as a single string; I only just realized we were doing that, so no wonder we were having issues.
What is the optimal way to build the vector store (ChromaDB) from the DDL? I assume we want to chunk it, but I did not want to do that manually since I was not sure of the best approach.
I also noticed that Vanna offers a training plan, which seems to take care of the DDL as well. Is that the recommended way to do it?
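For concreteness, a minimal sketch of both options, chunking the DDL one statement per training item versus handing the information schema to the training plan, assuming the ChromaDB + OpenAI setup from the Vanna quickstart and an already-connected database for `run_sql` (config keys and import paths may differ in your version):

```python
from vanna.openai import OpenAI_Chat            # import paths may vary by Vanna version
from vanna.chromadb import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={"api_key": "sk-...", "model": "gpt-3.5-turbo"})

# Option A: chunk the DDL yourself, one CREATE TABLE per training item, so
# retrieval only pulls the relevant tables into the prompt. Splitting on ';'
# is naive and will break on semicolons inside strings or procedure bodies.
full_ddl = open("schema.sql").read()
for statement in full_ddl.split(";"):
    statement = statement.strip()
    if statement:
        vn.train(ddl=statement + ";")

# Option B: let the training plan derive items from the information schema.
# This requires a database connection first, e.g. vn.connect_to_postgres(...).
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
plan = vn.get_training_plan_generic(df_information_schema)
vn.train(plan=plan)
```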