The fact that foldseek is even able to make MSAs based on the clustering result itself is extremely impressive. However, the following issue makes the feature not very useful:
Expected Behavior
If the ids in the result2msa output a3m database matched the clusters in createtsv output, then it would be easy to fetch MSAs for one's cluster of choice.
Current Behavior
Now, one has to do:
ffindex_get result2msa_output result2msa_output.index -n 1
... repeat n times ...
ffindex_get result2msa_output result2msa_output.index -n n
iteratively, where n is the total number of clusters in createtsv_output.tsv, and then scan every extracted a3m file for sequence ids to map clusters to a3m files, because there is no straightforward link between the cluster ids in createtsv_output.tsv and the database ids.
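As a possible workaround, the mapping can be built in one pass over the database instead of calling ffindex_get once per entry. The following is only a sketch under several assumptions that are not confirmed in this issue: the a3m database is uncompressed, its index file holds tab-separated `key`, `offset`, `length` columns, each entry is terminated by a null byte, and the first `>` header of each a3m entry names the cluster representative used in createtsv_output.tsv. The function names are hypothetical.

```python
def read_index(index_path):
    """Yield (key, offset, length) tuples from a tab-separated index file.

    Assumption: the index has the three columns key, offset, length.
    """
    with open(index_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            yield fields[0], int(fields[1]), int(fields[2])


def first_header_per_entry(data_path, index_path):
    """Map each index key to the id on the first '>' line of its a3m entry.

    Assumption: the first header in an entry is the cluster representative.
    """
    mapping = {}
    with open(data_path, "rb") as data:
        for key, offset, length in read_index(index_path):
            data.seek(offset)
            raw = data.read(length).rstrip(b"\x00")  # drop the trailing null byte
            entry = raw.decode("utf-8", errors="replace")
            for line in entry.splitlines():
                if line.startswith(">"):
                    tokens = line[1:].split()
                    mapping[key] = tokens[0] if tokens else ""
                    break
    return mapping
```

Calling `first_header_per_entry("result2msa_output", "result2msa_output.index")` would then return a dictionary from index keys to header ids, which can be compared against the first column of createtsv_output.tsv without writing any a3m files to disk.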
Steps to Reproduce (for bugs)
1. Perform foldseek clustering with foldseek cluster
2. Create clustering tsv file with foldseek createtsv on clustering result db
3. Create a3m db with foldseek result2msa on clustering result db
4. Navigate a3m db with ffindex_get and compare it with clustering tsv file
Context
For large clusterings, one might have hundreds of thousands of clusters, most of which are singleton clusters, so a3m files for those are not needed. One would only want a3m files for clusters that contain multiple members. But right now one is forced to write out all a3m files to disk, including singletons, then examine each file's contents to figure out which a3m file corresponds to which cluster, before having to discard most of them anyway.
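To illustrate which clusters are actually of interest, here is a minimal sketch that reads the clustering tsv and keeps only clusters with more than one member. It assumes the createtsv output has two tab-separated columns, representative id then member id, with one line per member; the function name and the `min_size` parameter are hypothetical.

```python
from collections import defaultdict


def multi_member_clusters(tsv_path, min_size=2):
    """Return {representative: [members]} for clusters with >= min_size members.

    Assumption: each line is "<representative>\t<member>", and a singleton
    cluster appears as a single line where both columns are identical.
    """
    clusters = defaultdict(list)
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            clusters[fields[0]].append(fields[1])
    return {
        rep: members
        for rep, members in clusters.items()
        if len(members) >= min_size
    }
```

The keys of the returned dictionary are the representatives whose MSAs would be worth fetching, so singletons never need to be extracted.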
Your Environment
foldseek Version: 9.427df8a