foldseek result2msa on a clustering result creates an a3m database where cluster ids are inconsistent with createtsv output #401

Open
shiraz-shah opened this issue Dec 17, 2024 · 0 comments

The fact that foldseek is even able to make MSAs based on the clustering result itself is extremely impressive. However, the following issue makes the feature not very useful:

Expected Behavior

If the ids in the result2msa output a3m database matched the clusters in createtsv output, then it would be easy to fetch MSAs for one's cluster of choice.

Current Behavior

Now, one has to do:

ffindex_get result2msa_output result2msa_output.index -n 1
... repeat n times ...
ffindex_get result2msa_output result2msa_output.index -n n

iteratively, where n is the total number of clusters in createtsv_output.tsv, and then scan every extracted a3m file for sequence ids to map clusters to a3m files, because there is no straightforward link between the cluster ids in createtsv_output.tsv and the database ids.
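The scan described above can be sketched without writing one file per entry. This is only an illustration, not a foldseek feature: it assumes the index file has tab-separated lines of key, byte offset, and length, that each entry in the data file ends with a null byte, and that the first `>` header of each a3m entry names the cluster representative. The function names are made up for this example.

```python
def read_index(index_path):
    """Yield (key, offset, length) from a tab-separated index file (assumed layout)."""
    with open(index_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            key, offset, length = line.rstrip("\n").split("\t")[:3]
            yield key, int(offset), int(length)


def first_header_per_entry(data_path, index_path):
    """Map each index key to the id in the first '>' header of its entry.

    Assumption: the first sequence of each a3m entry is the cluster
    representative, so its id can be matched against the createtsv output.
    """
    mapping = {}
    with open(data_path, "rb") as data:
        for key, offset, length in read_index(index_path):
            data.seek(offset)
            # Drop the trailing null byte that terminates the entry.
            entry = data.read(length).rstrip(b"\x00").decode("utf-8", "replace")
            for line in entry.splitlines():
                if line.startswith(">"):
                    fields = line[1:].split()
                    mapping[key] = fields[0] if fields else ""
                    break
    return mapping
```

Calling `first_header_per_entry("result2msa_output", "result2msa_output.index")` would return a dictionary from database key to the first sequence id of that entry, which can then be compared with the first column of createtsv_output.tsv.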

Steps to Reproduce (for bugs)

  • Perform foldseek clustering with foldseek cluster
  • Create clustering tsv file with foldseek createtsv on clustering result db
  • Create a3m db with foldseek result2msa on clustering result db
  • Navigate a3m db with ffindex_get and compare it with clustering tsv file

Context

For large clusterings, one might have hundreds of thousands of clusters, most of which are singletons, so a3m files for those are not needed. One would only want a3m files for clusters that contain multiple members. But right now one is forced to write out all a3m files to disk, including singletons, and then examine each file's contents to figure out which a3m file corresponds to which cluster, only to discard most of them anyway.
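To illustrate which clusters would actually need an MSA, here is a minimal sketch that picks out the non-singleton clusters from the tsv. It assumes createtsv_output.tsv has two tab-separated columns (representative id, member id) and that each representative is also listed as a member of its own cluster. The function name and the `min_size` parameter are hypothetical.

```python
import csv
from collections import defaultdict


def multi_member_clusters(tsv_path, min_size=2):
    """Return {representative: [members]} for clusters with at least min_size members.

    Assumption: each row is 'representative<TAB>member'.
    """
    clusters = defaultdict(list)
    with open(tsv_path, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            rep, member = row[0], row[1]
            clusters[rep].append(member)
    return {rep: members for rep, members in clusters.items()
            if len(members) >= min_size}
```

With consistent ids, the keys of this dictionary would be the only entries one needs to fetch from the a3m database.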

Your Environment

foldseek Version: 9.427df8a
