VectorCode is a code repository indexing tool. It helps you write better prompts for your coding LLMs by indexing and providing information about the code repository you're working on. This repository also contains the corresponding Neovim plugin, because that's what I used to write this tool.
Note
This project is in beta quality and only implements very basic retrieval and embedding functionalities. There is plenty of room for improvement, and any help is welcome.
Note
Chromadb, the vector database backend behind this project, supports multiple embedding engines. I developed this tool using SentenceTransformer, but if you encounter any issues with a different embedding function, please open an issue (or even better, a pull request :D).
LLMs usually have very limited knowledge of closed-source and/or lesser-known projects, as well as cutting-edge developments that have not made it into a release, so their capabilities on these projects are quite limited. Take my little toy sudoku-solving project as an example: when I wrote the first few lines and wanted the LLM to fill in the list of solvers that I implemented in solver_candidates, without project context the completions were simply random guesses that might belong to some other sudoku project:
But with the RAG context provided by VectorCode, my completion LLM was able to suggest the solvers that I actually implemented:
This makes the completion results far more usable.
A similar strategy is implemented in continue, a popular AI completion and chat plugin available for VSCode and JetBrains products.
- A working instance of Chromadb. A local Docker image will suffice.
- An embedding tool supported by Chromadb. You can find out more here and here.
As long as you managed to install VectorCode
itself, you're good to go!
I recommend using pipx
. This will take care of
the dependencies of vectorcode
and create a dedicated virtual environment
without messing up your system Python.
Run the following command:
pipx install vectorcode
To install the latest commit from GitHub, clone the repo and run pipx install <path_to_repo>.
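For example, a minimal sketch of installing from a local clone (assuming the repository path used in the plugin spec below):

git clone https://github.com/Davidyz/VectorCode.git
pipx install ./VectorCode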
This repo doubles as a neovim plugin. Use your favourite plugin manager to install.
For lazy.nvim
:
{
  "Davidyz/VectorCode",
  dependencies = { "nvim-lua/plenary.nvim" },
  opts = {
    n_query = 1, -- number of retrieved documents
    notify = true, -- enable notifications
    timeout_ms = 5000, -- timeout in milliseconds for the query operation.
    exclude_this = true, -- exclude the buffer from which the query is called.
                         -- This avoids repetition when you change some code but
                         -- the embedding has not been updated.
  },
  cond = function() return vim.fn.executable('vectorcode') == 1 end,
}
It might be helpful to add VectorCode as a dependency of your AI completion plugin.
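For example, with lazy.nvim and cmp-ai (the cmp-ai repository path below is an assumption; adapt it to whichever completion plugin you actually use):

{
  "tzachar/cmp-ai",
  dependencies = { "Davidyz/VectorCode" },
}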
This tool uses a JSON file to store the configuration. The global config is located at
$HOME/.config/vectorcode/config.json
. You can also set a project-specific
configuration at <project_root>/.vectorcode/config.json
. Options in the
project configuration will override the global config. The closest parent directory
of the current working directory that contains a project-specific config will be
used as the project-root, but this can be overridden by the --project_root
flag.
{
  "embedding_function": "SomeEmbeddingFunction",
  "embedding_params": {},
  "host": "localhost",
  "port": 8000,
  "db_path": "~/.local/share/vectorcode/chromadb/"
}
The following are the available options for the JSON configuration file:
- embedding_function: one of the embedding functions supported by Chromadb (find more here and here). For example, Chromadb supports Ollama as chromadb.utils.embedding_functions.OllamaEmbeddingFunction, and the corresponding value for embedding_function would be OllamaEmbeddingFunction. Default: SentenceTransformerEmbeddingFunction;
- embedding_params: whatever initialisation parameters your embedding function takes. For OllamaEmbeddingFunction, if you set embedding_params to { "url": "http://127.0.0.1:11434/api/embeddings", "model_name": "nomic-embed-text" }, then the embedding function object will be initialised as OllamaEmbeddingFunction(url="http://127.0.0.1:11434/api/embeddings", model_name="nomic-embed-text"). Default: {};
- host and port: Chromadb server host and port. Default: not set, in favour of the local persistent client configured by db_path. Please only use this with a local or LAN Chromadb server, because Chromadb authentication is still WIP;
- db_path: path to the local persistent database. If host or port is set, this will be ignored. Default: ~/.local/share/vectorcode/chromadb/;
- chunk_size: integer, the maximum number of characters per chunk. A larger value reduces the number of items in the database and hence accelerates the search, but at the cost of potentially truncated data and lost information. Default: -1 (no chunking), but it's highly recommended to set it to a positive integer that works for your model when working with large documents;
- overlap_ratio: float between 0 and 1, the ratio of overlapping content between 2 adjacent chunks. A larger ratio improves the coherence of chunks, but at the cost of an increasing number of entries in the database and hence a slower search. Default: 0.2;
- query_multplier: when you use the query command to retrieve n documents, VectorCode will check n * query_multplier chunks and return at most n documents. A larger value of query_multplier guarantees the return of n documents, but with the risk of including too many less-relevant chunks that may affect the document selection. Default: -1 (any negative value means selecting documents based on all indexed chunks).
For the convenience of deployment, environment variables in the configuration values are automatically expanded, so you can override things at run time without modifying the JSON.
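For example, the following (illustrative) snippet assumes you export OLLAMA_HOST yourself (e.g. http://127.0.0.1:11434) and lets the Ollama endpoint be picked up from the environment at run time:

{
  "embedding_function": "OllamaEmbeddingFunction",
  "embedding_params": {
    "url": "$OLLAMA_HOST/api/embeddings",
    "model_name": "nomic-embed-text"
  }
}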
Also, some of the built-in embedding functions supported by Chromadb require external libraries (such as openai) that are not included in the dependency list. This is what Chromadb does, so I did the same. If you installed vectorcode via pipx, you can install extra libraries by running the following command:
pipx inject vectorcode openai
And openai will be added to the virtual environment of vectorcode.
This is an incomplete list of command-line options. You can always use
vectorcode -h
to view the full list of arguments.
This tool creates a collection
(just like tables in traditional databases) for each
project. The collections are identified by project root, which, by default, is
the current working directory. You can override this by using the --project_root <path_to_your_project_root>
argument.
vectorcode init
Create a project-local configuration at the current directory (or the directory
specified by the --project_root
flag). This directory acts like a .git
directory. Consider the following file directory:
foo/
foo/.vectorcode/
foo/bar/
Running the vectorcode init command in foo/ creates the foo/.vectorcode/ directory, which can contain the project-local config.json. When foo/.vectorcode/ is present, foo/ will be used as the project root for VectorCode whenever you run the vectorcode command from any of the subdirectories of foo/ (such as foo/bar/), unless overridden by --project_root.
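For instance (with an illustrative file name):

cd foo/bar
vectorcode vectorise baz.py  # foo/ is used as the project root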
When you run vectorcode init
and a global configuration file is present, it'll
be copied to your project-local config directory. If a project-local
configuration is found, the global configuration will be ignored to avoid
confusion.
vectorcode vectorise src/*.py
"Orphaned documents" that has been removed in your filesystem but still "exists"
in the database will be automatically cleaned. This will respect .gitignore
under project root, unless the -f
/--force
flag is set.
Extra options:
- --overlap or -o: ratio of overlaps between chunks;
- --chunk_size or -c: maximum number of characters per chunk;
- --recursive or -r: recursively vectorise files in a directory;
- --force or -f: override .gitignore.
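For example, to recursively index a directory with chunking enabled (the values below are illustrative, not recommendations):

vectorcode vectorise --recursive --chunk_size 2000 --overlap 0.2 src/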
vectorcode query "some query message"
Extra options:
- --overlap and --chunk_size: same as vectorcode vectorise;
- --number or -n: maximum number of returned documents;
- --multiplier or -m: query multiplier (see the query_multplier configuration option);
- --exclude: files that should be excluded from the query results.
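For example (an illustrative query against the sudoku project from earlier):

vectorcode query "sudoku solver" -n 5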
vectorcode ls
vectorcode drop
For vectorise, query and ls, adding the --pipe or -p flag will convert the output into a structured format. This is explained in detail here.
vectorcode -s {bash,zsh,tcsh}
or
vectorcode --print-completion {bash,zsh,tcsh}
will print the completion script for the corresponding shell. Please consult your shell's documentation for instructions on how to use it.
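For example, for zsh you could write the script into a directory on your fpath (the path below is an assumption; adapt it to your setup):

vectorcode -s zsh > ~/.zsh/completions/_vectorcode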
In this document, I will be using qwen2.5-coder as an example. Adjust your config as needed.
This is NOT a completion plugin, but a helper that facilitates prompting. It
provides APIs so that your completion engine (such as
cmp-ai
) can leverage the repository-level
context.
Using cmp-ai as an example, its configuration provides a prompt option, with which you can customise the prompt sent to the LLM for each completion.
By consulting the qwen2.5-coder documentation, we know that a trivial prompt can be constructed as follows:
prompt = function(lines_before, lines_after)
  return '<|fim_prefix|>'
    .. lines_before
    .. '<|fim_suffix|>'
    .. lines_after
    .. '<|fim_middle|>'
end
However, the information from such a context is limited to the document itself. By utilising VectorCode and this plugin, you'll be able to construct contexts that contain repository-level information:
prompt = function(lines_before, lines_after)
  local file_context = ""
  local ok, retrieval = pcall(
    -- safeguard the query call if your embedding function is over the
    -- network and may time out on large documents.
    require("vectorcode").query,
    lines_before .. " " .. lines_after,
    { n_query = n_query }
  )
  if ok then
    for _, source in pairs(retrieval) do
      -- This works for qwen2.5-coder.
      file_context = file_context
        .. "<|file_sep|>"
        .. source.path
        .. "\n"
        .. source.document
        .. "\n"
    end
  end
  return file_context
    .. '<|fim_prefix|>'
    .. lines_before
    .. '<|fim_suffix|>'
    .. lines_after
    .. '<|fim_middle|>'
end
Note that the use of <|file_sep|> is documented in the qwen2.5-coder documentation and is likely to be model-specific. You may need to figure out the best prompt structure for your own model.
The number of files returned by the query
function call can be configured
either by the setup
function, or passed as an argument to the query
call
which overrides the setup
setting for this call:
require("vectorcode").query(some_query_message, {n_query=5})
The second parameter follows the same structure as the opts
table for the
setup
function. Settings in this table will override the options in setup
for this query
call. This allows adjusting the number of retrieved documents
on the fly.
Note
This API is synchronous and will block your main nvim UI.
For applications that are sensitive to timing, the above process may not be responsive enough. As you can see from using the CLI, the query itself takes a noticeable amount of time. This is why I wrote a per-buffer async caching mechanism that overcomes the issue to some extent.
To use the per-buffer async cache, you need to use the following API:
- require("vectorcode.cacher").register_buffer(buf_nr?, opts?, query_cb?, events?): register a buffer for background queries.
  - buf_nr (optional): integer, the buffer number to set up the async runner in;
  - opts (optional): table, with the same structure as what you use for setup and the synchronous query. These opts will be used for all the async updates managed by this plugin. Defaults to the options configured in setup;
  - query_cb (optional): fun(bufnr: integer):string, a function that will be used to construct the query message. You can use this function to customise the message sent to the vectorcode CLI. Defaults to the whole buffer (for now);
  - events (optional): string[], an array of autocmd events on which the queries will be initiated. Defaults to { "BufWritePost", "InsertEnter", "BufReadPost" }.
  Calling this function on a buffer that has already been registered will update its opts and query_cb.
- require("vectorcode.cacher").query_from_cache(bufnr?): returns the retrieval results from the most recent async query for the given buffer. If the buffer has not been registered, it returns an empty array. The returned data is in the same format as that of the synchronous query API.
  - bufnr (optional): integer, the buffer number to retrieve the cache from. Defaults to the current buffer.
With this async caching mechanism, you'll be able to use the retrieval results with minimal latency and without blocking the main UI. All you need to do is set up some kind of autocmd that registers buffers, for example:
vim.api.nvim_create_autocmd("LspAttach", {
  callback = function()
    local bufnr = vim.api.nvim_get_current_buf()
    require("vectorcode.cacher").register_buffer(bufnr)
  end,
})
In your completion prompt construction, you can then use require("vectorcode.cacher").query_from_cache(bufnr) to get the cached retrieval results and use them to build your prompt.
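For example, the earlier cmp-ai prompt could be rewritten to read from the cache instead of querying synchronously (a sketch; the <|file_sep|> and FIM tokens are specific to qwen2.5-coder, as above):

prompt = function(lines_before, lines_after)
  local file_context = ""
  -- Non-blocking: read whatever the background jobs have cached for the current buffer.
  for _, source in pairs(require("vectorcode.cacher").query_from_cache()) do
    file_context = file_context
      .. "<|file_sep|>"
      .. source.path
      .. "\n"
      .. source.document
      .. "\n"
  end
  return file_context
    .. '<|fim_prefix|>'
    .. lines_before
    .. '<|fim_suffix|>'
    .. lines_after
    .. '<|fim_middle|>'
end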
If you use lualine, the cacher module also exposes a lualine component that you can add to your lualine opts, for example:
opts = {
  tabline = {
    lualine_y = { require("vectorcode.cacher").lualine() }
  }
}
Under the hood, the caching mechanism stores the information in
vim.b[bufnr].vectorcode_cache
. The variable is a table with the following
definition:
{
  enabled = true, -- controls whether the async jobs will be run.
  retrieval = {}, -- the cached retrieval result.
  options = {}, -- options passed from the `opts` argument when registering the buffer.
}
When the --pipe flag is set, the output of the CLI tool will be structured as JSON.
For vectorise, the number of added, updated and removed entries will be printed:
{
  "add": int,
  "update": int,
  "removed": int
}
- add: number of added documents;
- update: number of updated (existing) documents;
- removed: number of removed documents due to original documents being deleted.
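This makes the output easy to consume from scripts. For example, assuming jq is installed (hypothetical usage):

vectorcode vectorise -p src/*.py | jq '.add'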
For query, a JSON array of results in the following format will be printed:
{
  "path": str,
  "document": str
}
- path: path to the file;
- document: content of the file.
For ls, a JSON array of collection information in the following format will be printed:
{
  "project-root": str,
  "user": str,
  "hostname": str,
  "collection_name": str,
  "size": int,
  "num_files": int,
  "embedding_function": str
}
- project-root: path to the project directory (your code repository);
- user: your *nix username, which is automatically added when vectorising to avoid collision;
- hostname: your *nix hostname. The purpose of this field is the same as that of the user field;
- collection_name: the unique identifier of the collection in the database. This is the first 63 characters of the SHA-256 hash of the absolute path of the project root;
- size: number of chunks in the collection;
- num_files: number of files in the collection;
- embedding_function: name of the embedding function used for the collection.
The drop command doesn't offer a --pipe mode output at the moment.
- query by file path / excluded paths;
- chunking support:
  - add metadata for files;
  - chunk-size configuration;
  - smarter chunking (semantics/syntax based);
  - configurable document selection from query results;
- NeoVim Lua API with cache to skip the retrieval when a project has not been indexed (returns an empty array instead);
- job pool for async caching;
- persistent-client;
- proper remote Chromadb support (with authentication, etc.);
- respect .gitignore;
- implement some sort of project-root anchor (such as .git or a custom .vectorcode.json) that enhances automatic project-root detection (implemented: project-level .vectorcode/config.json as the root anchor).