Language Processing & Indexing with FAISS

This repository hosts main.py, a Python script that walks a specified directory, processes documents in several formats (.doc, .xlsx, .pdf, .csv, and .txt), and indexes them with FAISS (Facebook AI Similarity Search), using embeddings generated by OpenAI's embedding models. The goal is an efficient search and retrieval system over a mixed collection of text documents.

Prerequisites

To successfully run this project, ensure that you have installed:

  • Python 3.8.1 or later (the minimum supported by langchain 0.1).
  • python-docx library (imported as docx).
  • pandas library.
  • The logging module, which ships with Python's standard library and needs no installation.
  • langchain library, version 0.1+. It provides the document loaders (TextLoader, CSVLoader, PyPDFLoader) and the embeddings wrapper (OpenAIEmbeddings).
  • faiss library.

To install the third-party dependencies, run the following pip command:

pip install python-docx pandas langchain faiss-cpu

Depending on your langchain version, PyPDFLoader and OpenAIEmbeddings may also need the pypdf and openai packages.

If you have a CUDA-capable GPU, you can install faiss-gpu instead of faiss-cpu to use GPU acceleration.
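
Before running the script, it can help to confirm that the packages are importable. The following is a minimal sketch and is not part of main.py; the helper name missing_modules is hypothetical, and the module names assume the usual import names (docx for python-docx, faiss for either faiss-cpu or faiss-gpu).

```python
import importlib.util

# Import names for the packages listed above (python-docx imports as "docx",
# faiss-cpu / faiss-gpu import as "faiss").
REQUIRED_MODULES = ["docx", "pandas", "langchain", "faiss"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only looks the modules up; it does not import them, so it stays fast even when faiss or pandas are installed.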

Usage

To run the script, follow the steps outlined below:

  1. Clone this repository to your local machine.

    git clone <repo_url>
  2. Populate a directory with the documents you wish to process.

  3. Set your OpenAI API key in the script by replacing 'YOUR_OPENAI_KEY'.

    openai_key = 'YOUR_OPENAI_KEY'
  4. Set the path to your documents directory by replacing '/path/to/your/directory'.

    root_dir = '/path/to/your/directory'
  5. Run the script.

    python main.py
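
Steps 3 and 4 place the key and the path directly in the script. As an alternative sketch (an assumption, not how main.py is written), both values could be read from environment variables so the key never lands in version control. OPENAI_API_KEY is the variable name the OpenAI client libraries conventionally read; DOCS_ROOT_DIR and the helper load_settings are hypothetical names chosen for this example.

```python
import os


def load_settings(env=None):
    """Read the API key and documents directory from environment variables.

    DOCS_ROOT_DIR is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable.")
    # Fall back to the current directory when no documents path is given.
    root_dir = env.get("DOCS_ROOT_DIR", ".")
    return openai_key, root_dir
```

With this in place, the two assignments in the script would become `openai_key, root_dir = load_settings()`.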

The script walks every file in the specified directory and its subdirectories. It first converts .doc files to .txt files and .xls files to .csv files. The converted files, together with any existing .pdf, .csv, and .txt files, are then loaded into memory one at a time; each is embedded with an OpenAI model and added to the FAISS index. Once all documents have been processed, the final FAISS index is saved locally as faiss.index.
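
The traversal step described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in main.py: the function name plan_files and the two lookup tables are hypothetical, and the sketch only decides what would be converted or loaded, leaving the conversion, embedding, and indexing steps out.

```python
import os

# Extensions named in this README; the table names are illustrative.
CONVERT = {".doc": ".txt", ".xls": ".csv"}
LOAD = {".pdf", ".csv", ".txt"}


def plan_files(root_dir):
    """Walk root_dir recursively and split files into convert/load lists."""
    to_convert, to_load = [], []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if ext in CONVERT:
                # Pair the source path with the path its converted copy would get.
                target = os.path.splitext(path)[0] + CONVERT[ext]
                to_convert.append((path, target))
            elif ext in LOAD:
                to_load.append(path)
    return to_convert, to_load
```

Files with any other extension are ignored, and extension matching is case-insensitive, so report.DOC is treated the same as report.doc.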

Please note that generating embeddings with an OpenAI model means the text of your documents is sent to OpenAI's API. File conversion, index construction, and storage of faiss.index happen locally on your machine.

The script also logs any errors that occur while processing a document. These warnings appear in your command line console output.
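
One way such per-document error logging might look is sketched below. It is an assumption about the general pattern, not a copy of main.py: the logger name indexer, the function process_all, and the process_one callback are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("indexer")  # hypothetical logger name


def process_all(paths, process_one):
    """Apply process_one to each path; log failures and keep going."""
    failed = []
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:  # broad on purpose: one bad file should not stop the run
            logger.warning("Skipping %s: %s", path, exc)
            failed.append(path)
    return failed
```

Returning the list of failed paths makes it easy to print a summary at the end of a run or to retry only the documents that could not be processed.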

License

This project is released under the Unlicense, which dedicates it to the public domain: you are free to use, modify, and distribute it without restriction.