-
@zainhoda Any thoughts on this? Say we have a SQL database and I can automatically extract the DDL. How should I chunk it and use it to get the accuracy I should expect from Vanna?
-
I think there are a few phases to this:
Short-term: In discussions with other users, what has been successful elsewhere is to use the DDL statements in full with a large-context model during the initial training process, and then use that to generate a lot of question-SQL pairs (a rough sketch is below).
Medium-term: I think that
Long-term: In the long term, any training data entered should construct a knowledge graph behind the scenes to give a structured understanding of what tables, columns, values, and concepts exist in the database. Constructing and traversing the knowledge graph is non-trivial. I've been looking into potentially using a similar approach to this: https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
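A rough sketch of that short-term flow, assuming the ChromaDB + OpenAI classes from the Vanna quickstart (import paths and config keys may differ by version, and the pair-generation prompt here is only illustrative, not part of Vanna itself):

```python
from openai import OpenAI
from vanna.openai import OpenAI_Chat            # import paths may vary by Vanna version
from vanna.chromadb import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={"api_key": "sk-...", "model": "gpt-4o"})

full_ddl = open("schema.sql").read()

# 1. Store the DDL itself as training data.
vn.train(ddl=full_ddl)

# 2. Ask a large-context model to propose question-SQL pairs for the schema,
#    review them, then store each pair individually so retrieval later matches
#    on questions rather than on one giant schema string.
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Given this schema, write 20 natural-language questions and the SQL "
            "that answers each, formatted as one 'question ||| sql' per line:\n\n"
            + full_ddl
        ),
    }],
)
for line in resp.choices[0].message.content.splitlines():
    if "|||" in line:
        question, sql = line.split("|||", 1)
        vn.train(question=question.strip(), sql=sql.strip())
```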
-
We have been successfully using Vanna in our application, but we recently moved back from GPT-4 to GPT-3.5 due to performance issues. With this change we started hitting token-limit errors again. This happens because we train with the entire DDL/schema as a single string; I only just realized we were doing that, so no wonder we were having issues.
What is the optimal way to build the vector store (ChromaDB) from the DDL? I assume we want to chunk it, but I did not want to do that manually since I was not sure of the best approach.
I also noticed that Vanna offers a training plan, which seems to take care of the DDL as well. Is that the recommended way to do it?
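For concreteness, a minimal sketch of both options, chunking the DDL one statement per training item versus handing the information schema to the training plan, assuming the ChromaDB + OpenAI setup from the Vanna quickstart and an already-connected database for `run_sql` (config keys and import paths may differ in your version):

```python
from vanna.openai import OpenAI_Chat            # import paths may vary by Vanna version
from vanna.chromadb import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={"api_key": "sk-...", "model": "gpt-3.5-turbo"})

# Option A: chunk the DDL yourself, one CREATE TABLE per training item, so
# retrieval only pulls the relevant tables into the prompt. Splitting on ';'
# is naive and will break on semicolons inside strings or procedure bodies.
full_ddl = open("schema.sql").read()
for statement in full_ddl.split(";"):
    statement = statement.strip()
    if statement:
        vn.train(ddl=statement + ";")

# Option B: let the training plan derive items from the information schema.
# This requires a database connection first, e.g. vn.connect_to_postgres(...).
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
plan = vn.get_training_plan_generic(df_information_schema)
vn.train(plan=plan)
```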