`foldseek result2msa` on a clustering result creates an a3m database where cluster ids are inconsistent with `createtsv` output #401

shiraz-shah · 2024-12-17T18:54:24Z

The fact that foldseek is even able to make MSAs based on the clustering result itself is extremely impressive. However, the following issue makes the feature not very useful:

Expected Behavior

If the ids in the result2msa output a3m database matched the clusters in createtsv output, then it would be easy to fetch MSAs for one's cluster of choice.

Current Behavior

Now, one has to do:

ffindex_get result2msa_output result2msa_output.index -n 1
... repeat n times ...
ffindex_get result2msa_output result2msa_output.index -n n

iteratively, where n is the total number of clusters in createtsv_output.tsv, and then scan every extracted a3m file for sequence ids to map clusters to a3m files, because there is no straightforward link between the cluster ids in createtsv_output.tsv and the database ids.
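The scanning step can be done in one pass without writing each entry to disk. Below is a minimal sketch, not a foldseek feature. It assumes the output database uses the usual MMseqs2-style layout: an `.index` file with whitespace-separated key, byte offset and length, and null-terminated plain-text entries in the data file. It also assumes the first header in each a3m entry is the cluster representative, which I have not verified. The function name `first_headers` is made up for illustration.

```python
def first_headers(data_path, index_path):
    """Map each database key to the id in the first '>' header of its entry.

    Assumed layout (not verified against foldseek): index columns are
    key, byte offset, length; entries are plain a3m text ending in a null byte.
    """
    mapping = {}
    with open(data_path, "rb") as data, open(index_path) as index:
        for line in index:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed index lines
            key, offset, length = fields[0], int(fields[1]), int(fields[2])
            data.seek(offset)
            raw = data.read(length).rstrip(b"\x00")
            entry = raw.decode("utf-8", errors="replace")
            for row in entry.splitlines():
                if row.startswith(">"):
                    tokens = row[1:].split()
                    if tokens:
                        # Keep only the identifier, the first token after '>'.
                        mapping[key] = tokens[0]
                    break  # only the first header is of interest
    return mapping
```

Calling `first_headers("result2msa_output", "result2msa_output.index")` would return a dict from database key to sequence id, which can then be compared with the first column of createtsv_output.tsv. If the database was written compressed or split across several files, this sketch will not work as is.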

Steps to Reproduce (for bugs)

Perform foldseek clustering with foldseek cluster
Create clustering tsv file with foldseek createtsv on clustering result db
Create a3m db with foldseek result2msa on clustering result db
Navigate a3m db with ffindex_get and compare it with clustering tsv file

Context

For large clusterings, one might have hundreds of thousands of clusters, most of which are singletons, so a3m files for those are not needed. One would only want a3m files for clusters that contain multiple members. But right now one is forced to write out all a3m files to disk, including singletons, and then examine each file's contents to figure out which a3m file corresponds to which cluster, before discarding most of them anyway.
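To illustrate which clusters are actually of interest, here is a rough sketch that picks the non-singleton clusters out of the tsv file. It assumes createtsv_output.tsv has two tab-separated columns, representative then member, and that each representative is also listed as a member of its own cluster, so a singleton appears exactly once. The function name `multi_member_clusters` and the `min_size` parameter are made up for illustration.

```python
from collections import defaultdict


def multi_member_clusters(tsv_path, min_size=2):
    """Return {representative: [members]} for clusters with >= min_size members.

    Assumes a two-column, tab-separated file: representative, member.
    """
    clusters = defaultdict(list)
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            representative, member = fields[0], fields[1]
            clusters[representative].append(member)
    # Drop singletons (or anything below the requested size).
    return {
        rep: members
        for rep, members in clusters.items()
        if len(members) >= min_size
    }
```

The keys of the returned dict are the representatives for which an a3m would be wanted. What is missing is a direct way to look those up in the result2msa output.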

Your Environment

foldseek Version: 9.427df8a
