VectorCode

VectorCode is a code repository indexing tool. It helps you write better prompts for your coding LLMs by indexing and providing information about the code repository you're working on. This repository also contains the corresponding NeoVim plugin, because that's the editor I used while writing this tool.

Note

This project is of beta quality and only implements very basic retrieval and embedding functionalities. There is plenty of room for improvement, and any help is welcome.

Note

Chromadb, the vector database backend behind this project, supports multiple embedding engines. I developed this tool using SentenceTransformer, but if you encounter any issues with a different embedding function, please open an issue (or even better, a pull request :D).

Why VectorCode?

LLMs usually have a very limited understanding of closed-source and/or lesser-known projects, as well as cutting-edge developments that have not yet made it into a release. Take my little toy sudoku-solving project as an example: when I wrote the first few lines and wanted the LLM to fill in the list of solvers that I had implemented in solver_candidates, the completions without project context were simply random guesses that might belong to some other sudoku project. With RAG context provided by VectorCode, however, my completion LLM was able to suggest the solvers I had actually implemented, which makes the completion results far more usable. A similar strategy is implemented in continue, a popular AI completion and chat plugin available for VSCode and JetBrains products.

Prerequisites

  • A working instance of Chromadb. A local docker image will suffice.
  • An embedding tool supported by Chromadb; you can find out more here and here

As long as you managed to install VectorCode itself, you're good to go!

Installation

I recommend using pipx. This will take care of the dependencies of vectorcode and create a dedicated virtual environment without messing up your system Python.

Run the following command:

pipx install vectorcode

To install the latest commit from GitHub, clone the repo and run pipx install <path_to_repo>.

NeoVim users:

This repo doubles as a NeoVim plugin. Use your favourite plugin manager to install it.

For lazy.nvim:

{
  "Davidyz/VectorCode",
  dependencies = { "nvim-lua/plenary.nvim" },
  opts = { 
    n_query = 1, -- number of retrieved documents
    notify = true, -- enable notifications
    timeout_ms = 5000, -- timeout in milliseconds for the query operation.
    exclude_this = true, -- exclude the buffer from which the query is called.
                         -- This avoids repetition when you change some code but
                         -- the embedding has not been updated.
  },
  cond = function() return vim.fn.executable('vectorcode') == 1 end,
}

It might be helpful to add VectorCode as a dependency of your AI completion plugin.
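
For example, if your completion engine is also managed by lazy.nvim, a minimal spec could declare VectorCode as a dependency so that it is available when the completion plugin loads. This is only a sketch; the plugin name below is a placeholder for whatever completion engine you actually use:

{
  "your-username/your-ai-completion-plugin", -- placeholder for your completion engine
  dependencies = { "Davidyz/VectorCode" },
}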

Configuration

CLI tool

This tool uses a JSON file to store the configuration. The global config is located at $HOME/.config/vectorcode/config.json. You can also set a project-specific configuration at <project_root>/.vectorcode/config.json. Options in the project configuration will override the global config. The closest parent directory of the current working directory that contains a project-specific config will be used as the project-root, but this can be overridden by the --project_root flag.

{
    "embedding_function": "SomeEmbeddingFunction",
    "embedding_params": {
    },
    "host": "localhost",
    "port": 8000,
    "db_path": "~/.local/share/vectorcode/chromadb/"
}

The following are the available options for the JSON configuration file:

  • embedding_function: One of the embedding functions supported by Chromadb (find more here and here). For example, Chromadb supports Ollama as chromadb.utils.embedding_functions.OllamaEmbeddingFunction, and the corresponding value for embedding_function would be OllamaEmbeddingFunction. Default: SentenceTransformerEmbeddingFunction;
  • embedding_params: Whatever initialisation parameters your embedding function takes. For OllamaEmbeddingFunction, if you set embedding_params to:
    {
      "url": "http://127.0.0.1:11434/api/embeddings",
      "model_name": "nomic-embed-text"
    }
    Then the embedding function object will be initialised as OllamaEmbeddingFunction(url="http://127.0.0.1:11434/api/embeddings", model_name="nomic-embed-text"). Default: {};
  • host and port: Chromadb server host and port. Default: not set, in favour of a local persistent client configured by db_path. Please only use this with a local or LAN Chromadb server, because ChromaDB authentication is still WIP;
  • db_path: Path to local persistent database. If host or port is set, this will be ignored. Default: ~/.local/share/vectorcode/chromadb/;
  • chunk_size: integer, the maximum number of characters per chunk. A larger value reduces the number of items in the database, and hence accelerates the search, but at the cost of potentially truncated data and lost information. Default: -1 (no chunking), but it's highly recommended to set it to a positive integer that works for your model when working with large documents;
  • overlap_ratio: float between 0 and 1, the ratio of overlapping content between 2 adjacent chunks. A larger ratio improves the coherence of chunks, but at the cost of an increasing number of entries in the database and hence a slower search. Default: 0.2;
  • query_multplier: when you use the query command to retrieve n documents, VectorCode will check n * query_multplier chunks and return at most n documents. A larger value of query_multplier makes it more likely that n documents are returned, but at the risk of including too many less-relevant chunks that may affect the document selection. Default: -1 (any negative value means selecting documents based on all indexed chunks).

For the convenience of deployment, environment variables in the configuration values will be automatically expanded, so you can override things at runtime without modifying the JSON.
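
For example (a sketch; the variable is just an illustration and assumes $XDG_DATA_HOME is set in your environment), db_path could reference an environment variable that differs between machines:

{
    "db_path": "$XDG_DATA_HOME/vectorcode/chromadb/"
}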

Also, some of the built-in embedding functions supported by Chromadb require external libraries (such as openai) that are not included in the dependency list. This follows what Chromadb itself does. If you installed vectorcode via pipx, you can install the extra libraries by running the following command:

pipx inject vectorcode openai

And openai will be added to the virtual environment of vectorcode.

Usage

CLI tool

This is an incomplete list of command-line options. You can always use vectorcode -h to view the full list of arguments.

This tool creates a collection (just like tables in traditional databases) for each project. The collections are identified by project root, which, by default, is the current working directory. You can override this by using the --project_root <path_to_your_project_root> argument.
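
For example, to query the collection of a project without cd-ing into it (the path below is illustrative):

vectorcode query "some query message" --project_root ~/projects/my_project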

Initialising Project-Local Configuration

vectorcode init 

Create a project-local configuration at the current directory (or the directory specified by the --project_root flag). This directory acts like a .git directory. Consider the following file directory:

foo/
foo/.vectorcode/
foo/bar/

Running the vectorcode init command in foo/ creates the foo/.vectorcode/ directory, which can contain the project-local config.json. When foo/.vectorcode/ is present, foo/ will be used as the project-root for VectorCode when you run a vectorcode command from any of the subdirectories of foo/ (such as foo/bar/), unless overridden by --project_root.

When you run vectorcode init and a global configuration file is present, it'll be copied to your project-local config directory. If a project-local configuration is found, the global configuration will be ignored to avoid confusion.

Vectorising documents

vectorcode vectorise src/*.py

"Orphaned documents" that has been removed in your filesystem but still "exists" in the database will be automatically cleaned. This will respect .gitignore under project root, unless the -f/--force flag is set.

Extra options:

  • --overlap or -o: ratio of overlaps between chunks;
  • --chunk_size or -c: maximum number of characters per chunk;
  • --recursive or -r: recursively vectorise files in a directory;
  • --force or -f: override .gitignore.
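
For example, a typical invocation that recursively vectorises a source tree with 2000-character chunks and 20% overlap might look like this (the values are illustrative; pick ones that suit your embedding model):

vectorcode vectorise --recursive --chunk_size 2000 --overlap 0.2 src/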

Querying from a collection

vectorcode query "some query message"

Extra options:

  • --overlap and --chunk_size: same as vectorcode vectorise;
  • --number or -n: maximum number of returned documents;
  • --multiplier or -m: query multiplier. See CLI tool;
  • --exclude: files that should be excluded from the query results.
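
For example (the query text and excluded file are illustrative):

vectorcode query "sudoku solver backtracking" -n 5 --exclude src/main.py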

Listing all collections

vectorcode ls 

Removing a collection

vectorcode drop 

For vectorise, query and ls, adding the --pipe or -p flag will convert the output into a structured format. This is explained in detail here.
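
For example, assuming jq is installed, you could extract just the paths of the retrieved documents:

vectorcode query "some query message" -n 3 --pipe | jq '.[].path'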

Shell Completion

vectorcode -s {bash,zsh,tcsh}

or

vectorcode --print-completion {bash,zsh,tcsh}

will print the completion script for the corresponding shell. Please consult your shell's documentation for instructions on how to use it.
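
For example, zsh users might save the script into a directory on their fpath (the destination below is just an example):

vectorcode -s zsh > ~/.local/share/zsh/completions/_vectorcode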

NeoVim plugin

In this document I will be using qwen2.5-coder as an example. Adjust your config as needed.

This is NOT a completion plugin, but a helper that facilitates prompting. It provides APIs so that your completion engine (such as cmp-ai) can leverage the repository-level context.

Using cmp-ai as an example: its configuration provides a prompt option, with which you can customise the prompt sent to the LLM for each completion.

By consulting the qwen2.5-coder documentation, we know that a trivial prompt can be constructed as follows:

prompt = function(lines_before, lines_after)
    return '<|fim_prefix|>' 
        .. lines_before 
        .. '<|fim_suffix|>' 
        .. lines_after 
        .. '<|fim_middle|>'
end

However, the information from such a context is limited to the document itself. By utilising VectorCode and this plugin, you'll be able to construct contexts that contain repository-level information:

prompt = function(lines_before, lines_after)
    local file_context = ""
    local ok, retrieval = pcall(
        -- Safeguard the query call in case your embedding function goes over
        -- the network and may time out on large documents.
        require("vectorcode").query,
        lines_before .. " " .. lines_after,
        { n_query = n_query }  -- n_query: however many documents you want to retrieve
    )
    if ok then
        for _, source in pairs(retrieval) do
            -- This works for qwen2.5-coder.
            file_context = file_context
                .. "<|file_sep|>"
                .. source.path
                .. "\n"
                .. source.document
                .. "\n"
        end
    end
    return file_context
        .. '<|fim_prefix|>'
        .. lines_before
        .. '<|fim_suffix|>'
        .. lines_after
        .. '<|fim_middle|>'
end

Note that the use of <|file_sep|> is documented in the qwen2.5-coder documentation and is likely to be model-specific. You may need to figure out the best prompt structure for your own model.

The number of files returned by the query function call can be configured either through the setup function, or passed as an argument to the query call, which overrides the setup setting for that call:

require("vectorcode").query(some_query_message, {n_query=5})

The second parameter follows the same structure as the opts table for the setup function. Settings in this table will override the options in setup for this query call. This allows adjusting the number of retrieved documents on the fly.

Note

This API is synchronous and will block your main nvim UI.

Asynchronous Caching

For applications that are sensitive to timing, the above process may not be responsive enough. As you can see from using the CLI, the query itself takes a noticeable amount of time. This is why I wrote a per-buffer async caching mechanism that overcomes the issue to some extent.

To use the per-buffer async cache, you need to use the following API:

  • require("vectorcode.cacher").register_buffer(buf_nr?, opts?, query_cb?, events?): Register a buffer for background query.

    • buf_nr (optional): integer, the buffer number to set up the async runner in;
    • opts (optional): table, the same structure as what you use for setup and the synchronous query. This opts will be used for all the async updates managed by this plugin. This defaults to the option configured in setup;
    • query_cb (optional): fun(bufnr: integer):string, a function that will be used to construct the query message. You can use this function to customise the message sent to the vectorcode CLI. This defaults to the whole buffer (for now);
    • events (optional): string[], an array of autocmd events on which the queries will be triggered. This defaults to { "BufWritePost", "InsertEnter", "BufReadPost" }.

    Calling this function on a buffer that has been registered will update its opts and query_cb.

  • require("vectorcode.cacher).query_from_cache(bufnr?): Returns the retrieval results from the most recent async cache for the given buffer. If the buffer has not been registered, it will return an empty array. The returned data is in the same format as the synchronous query API.

    • bufnr (optional): integer, the buffer number to retrieve cache from. Defaults to the current buffer.

With this async caching mechanism, you'll be able to utilise the retrieval results with minimal latency and without blocking the main UI. All you need to do is set up some kind of autocmd that registers buffers, for example:

vim.api.nvim_create_autocmd("LspAttach", {
  callback = function()
    local bufnr = vim.api.nvim_get_current_buf()
    require("vectorcode.cacher").register_buffer(bufnr)
  end,
})

And in your completion prompt construction, you can use require("vectorcode.cacher").query_from_cache(bufnr) to get the cached retrieval results that you use to build your prompt.
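
Putting it together, a cached variant of the earlier prompt function might look like the following sketch (it reuses the qwen2.5-coder template from above; adapt the special tokens to your own model):

prompt = function(lines_before, lines_after)
    local file_context = ""
    -- Non-blocking: read whatever the background query has cached for this buffer.
    local cached = require("vectorcode.cacher").query_from_cache(
        vim.api.nvim_get_current_buf()
    )
    for _, source in pairs(cached) do
        file_context = file_context
            .. "<|file_sep|>" .. source.path .. "\n"
            .. source.document .. "\n"
    end
    return file_context
        .. '<|fim_prefix|>' .. lines_before
        .. '<|fim_suffix|>' .. lines_after
        .. '<|fim_middle|>'
end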

Lualine Integration

opts = {
  tabline = {
    lualine_y = { require("vectorcode.cacher").lualine() }
  }
}

The Boring Stuff

Under the hood, the caching mechanism stores the information in vim.b[bufnr].vectorcode_cache. The variable is a table with the following definition:

{
  enabled = true, -- controls whether the async jobs will be run. 
  retrieval = {}, -- the cached retrieval result.
  options = {}, -- options passed from the `opts` argument when registering the
                -- buffer.
}

For Developers

When the --pipe flag is set, the output of the CLI tool will be structured as JSON.

vectorise

The number of added, updated and removed entries will be printed.

{
    "add": int,
    "update": int,
    "removed": int,
}
  • add: number of added documents;
  • update: number of updated (existing) documents;
  • removed: number of removed documents due to original documents being deleted.

query

A JSON array of query results of the following format will be printed:

{
    "path": str,
    "document": str,
}
  • path: path to the file;
  • document: content of the file.

ls

A JSON array of collection information of the following format will be printed:

{
    "project-root": str,
    "user": str,
    "hostname": str,
    "collection_name": str,
    "size": int,
    "num_files": int,
    "embedding_function": str
}
  • project-root: path to the project directory (your code repository);
  • user: your *nix username, which is automatically added when vectorising to avoid collisions;
  • hostname: your *nix hostname. The purpose of this field is the same as the user field;
  • collection_name: the unique identifier of the collection in the database. This is the first 63 characters of the sha256 hash of the absolute path of the project root;
  • size: number of chunks in the collection;
  • num_files: number of files in the collection;
  • embedding_function: name of the embedding function used for the collection.
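
For example, assuming jq is installed, you could list the project roots of all indexed collections:

vectorcode ls --pipe | jq '.[]."project-root"'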

drop

The drop command doesn't offer a --pipe mode output at the moment.

TODOs

  • query by file path / excluded paths;
  • chunking support:
    • add metadata for files;
    • chunk-size configuration;
    • smarter chunking (semantics/syntax based);
    • configurable document selection from query results;
  • NeoVim Lua API with cache to skip the retrieval when a project has not been indexed (returns an empty array instead);
  • job pool for async caching;
  • persistent-client;
  • proper remote Chromadb support (with authentication, etc.);
  • respect .gitignore;
  • implement some sort of project-root anchor (such as .git or a custom .vectorcode.json) to enhance automatic project-root detection (implemented: project-level .vectorcode/config.json as the root anchor).
