A benchmark for gene embeddings from pretrained models.
The package includes three main sections:
- Descriptions - Retrieval of textual descriptions from pre-existing files or databases.
- Encoders - Encoding of the textual data.
- Tasks - Evaluation of the efficacy of these encodings in downstream tasks.
Retrieval of textual descriptions from pre-existing files or NCBI. Currently, the package can extract descriptions for a single entity type (for example, gene symbols) or for multiple entity types (for example, gene symbols and diseases). The descriptions can be retrieved from the NCBI database or from a pre-existing CSV file with the text descriptions.
NCBIDescriptor
: Creates descriptions for gene symbols based on data from NCBI. The descriptions are built as follows: "Gene symbol {symbol} full name {name} with the summary {summary}". The user can also decide whether to allow partial descriptions (missing gene name or summary); if not, None will be returned for that gene.

CSVDescriptions
: Retrieves any entity description (gene, disease, etc.) from a CSV file. The CSV file needs to include a column with the entity names (default is 'id').

MultiEntityTypeDescriptor
: Takes a `description_dict` whose keys are the entity types and whose values are instances of the desired single-entity-type descriptor classes.
After creating an instance of your desired class, use the `describe` method to generate the descriptions. For all classes, pass a pandas Series or DataFrame with the entity names, and set `allow_missing` to True if you don't mind missing entity descriptions, or to False if you would like the extraction to stop when entities are missing. The result is the corresponding DataFrame/Series with the descriptions in place of the entity names.
All descriptor classes have a `summary` attribute that prints a short summary of the description retrieval (see the example below); it includes:

- Whether partial descriptions were allowed
- Whether missing entities were allowed
- The number of missing entities
- The list of missing entities
For example, using `NCBIDescriptor`:

```python
import pandas as pd

from gene_benchmark.descriptor import NCBIDescriptor

descriptor = NCBIDescriptor(allow_partial=False)
descriptions = descriptor.describe(
    pd.DataFrame(
        columns=["symbol"], index=[1, 2, 3], data=["BRCA1", "FOXP2", "BRCA1"]
    )
)
```
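After a retrieval, the summary can be inspected; a minimal sketch, assuming `summary` behaves as a printable attribute:

```python
# Reports whether partial/missing entities were allowed,
# plus the number and list of missing entities
print(descriptor.summary)
```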
Using `CSVDescriptions` (here `csv_path` points to the descriptions CSV):

```python
from gene_benchmark.descriptor import CSVDescriptions  # assuming the same module as NCBIDescriptor

descriptor = CSVDescriptions(csv_file_path=csv_path, index_col="id")
descriptions = descriptor.describe(
    pd.Series(["BRCA1", "FOXP2", "NOTAGENENAME", "PLAC4"]), allow_missing=True
)
```
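For reference, a minimal CSV compatible with the call above might look like the sketch below; the `description` column name is hypothetical, only the entity-name column (`id` by default) is required:

```python
import pandas as pd

# Hypothetical layout: "id" holds the entity names, the other column(s)
# hold the textual descriptions that describe() will return.
pd.DataFrame(
    {
        "id": ["BRCA1", "FOXP2", "PLAC4"],
        "description": ["BRCA1 is ...", "FOXP2 is ...", "PLAC4 is ..."],
    }
).to_csv("descriptions.csv", index=False)
```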
Using `MultiEntityTypeDescriptor`:

```python
description_dict = {
    "symbol": NCBIDescriptor(allow_partial=False),
    "disease": CSVDescriptions(csv_file_path=csv_path, index_col="id"),
}
descriptor = MultiEntityTypeDescriptor(description_dict=description_dict)
descriptions = descriptor.describe(
    pd.DataFrame(
        columns=["symbol", "disease", "symbol"],
        data=[("BRCA1", "cancer", "FOXP2"), ("PLAC4", "als", "IAMNOTAGENE")],
    )
)
```
The `PreComputedEncoder` class enables working with DataFrames whose index contains the encoded elements (usually symbols) and whose columns contain the encodings themselves. The following is a way to load the Gene2Vec CSV of encodings as an encoder:
```python
from gene_benchmark.encoder import PreComputedEncoder  # assuming the same module as SentenceTransformerEncoder

enc = PreComputedEncoder(encoder_model_name="/path/to/gene2vec.csv")
```
The precomputed encoder maps specific strings to specific encodings. Hence it is limited in its encoding scope and will usually work with gene symbols or disease IDs.
In addition, we enable encoding using the HuggingFace sentence encoders:

```python
from gene_benchmark.encoder import SentenceTransformerEncoder

enc = SentenceTransformerEncoder(encoder_model_name="BAAI/bge-large-en-v1.5")
```
Note that the models are downloaded and might need a lot of storage space. You can make use of the environment variable `SENTENCE_TRANSFORMERS_HOME` to control where the models are downloaded to.
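For example, from Python (set it before the encoder triggers the download):

```python
import os

# Store downloaded sentence-transformers models under a custom directory
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/path/to/model_cache"
```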
`SentenceTransformerEncoder` can encode any string, so it is not limited to a predefined vocabulary, and it is very useful when coupled with descriptions.
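A minimal sketch of that coupling, assuming the encoders expose an `encode` entry point (check the class for the exact method name and signature):

```python
import pandas as pd

from gene_benchmark.descriptor import NCBIDescriptor
from gene_benchmark.encoder import SentenceTransformerEncoder

# Build textual descriptions for the symbols, then embed them
descriptions = NCBIDescriptor(allow_partial=True).describe(
    pd.Series(["BRCA1", "FOXP2"]), allow_missing=True
)
# `encode` is an assumed entry point; adjust to the actual API
embeddings = SentenceTransformerEncoder(
    encoder_model_name="BAAI/bge-large-en-v1.5"
).encode(descriptions)
```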
The package has the means to create, load, and manipulate task definitions. The list of available tasks is provided here. For each task we provide Python scripts that can create the task either from a local file or by downloading the data directly.
One can easily load the task definitions using the `load_task_definition` method and the task name (according to the list above):

```python
from gene_benchmark.tasks import load_task_definition

tasks = load_task_definition(task_name='TF vs non-TF')
```
If the user wants to use a task but wishes to exclude certain symbols, they can easily do so using `exclude_symbols`:

```python
tasks = load_task_definition(task_name='TF vs non-TF', exclude_symbols=['BRCA1'])
```
If the user wants to use one of the labels of a multi-label task, it can be done by:

```python
tasks = load_task_definition(task_name='Pathways', sub_task='Diseases')
```
Or, if the user wishes to load a multi-label task with only the labels above a certain label rate:

```python
tasks = load_task_definition(task_name='Pathways', multi_label_th=0.1)
```
When running tasks that are not inside the main task repository, they can be loaded from a different data directory by setting `data_dir` to the alternative task directory.
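For example, with a placeholder path:

```python
tasks = load_task_definition(task_name='TF vs non-TF', data_dir='/path/to/alternative/tasks')
```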
The package includes a pipeline object that, given a task, prompt maker, and encoder, can perform the entire task from loading to predictions.
```python
from gene_benchmark.descriptor import NCBIDescriptor
from gene_benchmark.encoder import SentenceTransformerEncoder
from gene_benchmark.tasks import EntitiesTask

task = EntitiesTask(
    task_name='TF vs non-TF',
    encoder=SentenceTransformerEncoder(),
    prompt_builder=NCBIDescriptor(),
)
_ = task.run()
print(task.summary())
```
The script enables running multiple tasks on multiple models. Each model is defined by an encoder and a prompt maker. The script uses YAML files to define each model; the `model_name` field is used for the report. The following is a model that uses NCBI prompts and a HuggingFace sentence encoder:
```yaml
descriptor:
  class_name: NCBIDescriptor
encoder:
  class_name: SentenceTransformerEncoder
  class_args:
    encoder_model_name: "sentence-transformers/all-mpnet-base-v2"
model_name: mpnet
```
We can create a model that encodes only the symbols:

```yaml
encoder:
  class_name: SentenceTransformerEncoder
  class_args:
    encoder_model_name: "sentence-transformers/all-mpnet-base-v2"
```
We also support Gene2Vec using the precomputed encoders:

```yaml
encoder:
  class_name: PreComputedEncoder
  class_args:
    encoder_model_name: "/path/to/gene2vec.csv"
```
See scGPT. The encoding was extracted from the pre-trained "blood" model. The weights file can be generated using the extraction script:

```yaml
encoder:
  class_name: PreComputedEncoder
  class_args:
    encoder_model_name: "/path/to/ScGPT_weights/blood_model_embedding.csv"
```