Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

createtsv output from foldseek search does not match the alignments in result2msa #405

Open
shiraz-shah opened this issue Jan 4, 2025 · 6 comments

Comments

@shiraz-shah
Copy link

Foldseek has this game-chaning feature where it can generate structural MSAs!

With the caveat that these are query-centered, meaning they're based on underlying pairwise alignments to your query. Quite acceptable for a lot of downstream applications. However:

Expected Behavior

One would expect that the MSAs generated by result2msa should match those given by e.g. createtsv, or other foldseek alignment output modalities.

Current Behavior and steps to reproduce:

They do not match. Here is output from createtsv:

> foldseek createtsv proteins proteins proteins_a proteins_a.tsv
> cat proteins_a.tsv | grep ^F1y_010103_1_131 | cut -f2- | column -t
F1y_010103_1_131  1546  1.00   2.807E-38  0  207  208  0   207  208
F1y_010102_1_61   423   0.190  1.147E-09  0  199  208  39  229  242
F1y_010108_25_20  417   0.201  1.631E-09  4  201  208  56  249  262
F1y_010104_22_5   362   0.184  4.108E-08  1  207  208  45  252  253
F1y_010108_14_18  354   0.162  6.569E-08  1  207  208  45  234  236
F1y_010104_8_5    351   0.164  7.833E-08  1  204  208  45  241  245
F1y_010108_33_23  327   0.146  3.202E-07  1  200  208  50  244  255
F1y_010101_6_2    321   0.128  4.553E-07  3  200  208  47  232  233
F1y_010108_18_33  307   0.155  1.035E-06  2  199  208  53  242  243
F1y_010108_5_3    299   0.143  1.655E-06  4  201  208  50  251  252
F1y_010108_1_47   274   0.155  7.173E-06  0  204  208  58  268  274
F1y_010102_1_64   263   0.292  1.368E-05  3  145  208  49  183  245
F1y_010108_13_55  248   0.107  3.297E-05  0  201  208  48  242  244
F1y_010108_18_23  241   0.485  4.971E-05  3  72   208  48  115  222
F1y_010101_6_18   240   0.274  5.272E-05  3  142  208  50  163  217

And here is the output som result2msa (using ffindex to fetch the alignment for query number 602, same as above):

> foldseek result2msa proteins proteins proteins_a proteins_msa --msa-format-mode 6
> ffindex_get proteins_msa proteins_msa.index -n 602 | head -2
>F1y_010103_1_131
MKNMFEKTTYISSRGREENEYLMDRDGFSILVMGFTGKKALEWKLKYIEAFNNMEEKLKSSNILTDEEKLKLQLFSKDPAEVAYAHNKLVEIATAPLIAENTVMKPKADYHDEVLNKDDLINTTVIAKDLGLRSAAKLNNIMHSNNIIYKNRSGTWCPYADYEWLITENYADYKSYNVENFNMCLKWTEKGRKWIIENYGKWICNSKN
>F1y_010102_1_61	33	0.273	1.490E-02	37	140	208	122	240	242
-------------------------------------KKQLEqqnAKLqpkadFADAAFAtddkvdiGMSAKiLKlgfGRNTLF--EKL-----RKAG--VFFANRnepKQRFIdAGY---FE---MKEKFiernNHPGFVVTK-VL----VTQKGLaylnHLfggkPSDGKLARI-------------------------------------------------------------------

It's evident that the first alignment is not the same. All statistics including fident, bitscore, and coordinates are different. Let's grep out the headers to check the remaining alignments:

> ffindex_get proteins_msa proteins_msa.index -n 602 | grep '^>' | cut -c2- | column -t
F1y_010103_1_131                                                 
F1y_010102_1_61   33  0.273  1.490E-02  37   140  208  122  240  242
F1y_010108_25_20  27  0.333  1.314E+00  51   71   208  159  187  262
F1y_010104_22_5   37  0.322  1.151E-03  90   138  208  121  168  253
F1y_010108_14_18  21  0.179  8.394E+01  18   54   208  170  190  236
F1y_010104_8_5    35  0.209  4.142E-03  39   183  208  92   191  245
F1y_010108_33_23  36  0.225  2.183E-03  7    71   208  106  168  255
F1y_010101_6_2    25  0.291  3.430E+00  123  188  208  141  191  233
F1y_010108_18_33  34  0.226  1.082E-02  35   73   208  72   143  243
F1y_010108_5_3    26  0.216  2.491E+00  44   130  208  141  238  252
F1y_010108_1_47   38  0.226  6.065E-04  15   143  208  1    144  274
F1y_010102_1_64   77  0.486  4.414E-16  2    63   208  48   117  245
F1y_010108_13_55  24  0.315  8.954E+00  49   81   208  127  159  244
F1y_010108_18_23  72  0.507  2.113E-14  20   65   208  65   128  222
F1y_010101_6_18   45  0.326  2.615E-06  13   75   208  43   128  217

As seen here all of the same targets were found, but none of the alignments are identical to those within the createtsv output. Most have worse coverages and even non-significant E-values.

Your Environment

Latest foldseek version 0d8d966cfa50b07c5ee83aa9060d795f5ee186a4

Thanks again, Martin, for your amazing work!

@martin-steinegger
Copy link
Collaborator

this looks odd. Do you have data and commands to reproduce it?

@shiraz-shah
Copy link
Author

I have both. However, I've tried this on multiple data sets, workflows and versions of foldseek and the error is always similar. So I imagine you may be able to reproduce it on your own.

Specifically, I've tried foldseek versions 0d8d966cfa50b07c5ee83aa9060d795f5ee186a4 and 9.427df8a and both produce this error.

I've tried both on curstom tsv2db databases where 3di's were based off ProstT5, as well as proper createdb databases made from PDB files.

I've tried on result2msa outputs from both search and cluster result dbs.

Finally, I've tried extracting a3ms from result2msa output using both ffindex_get and unpackdb, and the error is exactly the same.

I'll try to send you a minimal data plus commands example later.

@martin-steinegger
Copy link
Collaborator

martin-steinegger commented Jan 5, 2025

Did you set the -a parameter in the search?
-a is needed to includes the cigar string, which is required to reconstruct the MSA.

Independent, you could also give Foldmason a try https://github.com/steineggerlab/foldmason
Here the MSAs are not query centered.

@shiraz-shah
Copy link
Author

shiraz-shah commented Jan 6, 2025

Martin, the -a flag did the trick!

The MSAs look good now and the alignment statistics are identical to those within the createtsv output!

I've also tried to get the -a flag to work for cluster but the cluster MSAs still look bad. Do you have any thoughts on this? foldseek cluster -h indeed does mention -a as a valid option, but I couldn't get it to fix the MSAs.

Foldmason is indeed promising, but it requires proper PDB/mCIF files, right? If one only has ProstT5 3Di + aa files, then I think that's not enough for Foldmason, is it? I'm working with viral proteins, so I don't have good Alphafold PDBs.

I've been able to modify FAMSA like they did in the Puente-Lelievre preprint to align 3di files with a modified substitution matrix, then replace 3Di chars in the MSA with the original input AAs. It works, but it's not better than default FAMSA. It's not worse either. But I imagine Foldmason would be much better than this hack.

@shiraz-shah
Copy link
Author

shiraz-shah commented Jan 10, 2025

TL;DR:
-a fixes result2msa MSAs for search result DBs.

-a does not fix result2msa MSAs for cluster result DBs.

Thanks, Martin!

@martin-steinegger
Copy link
Collaborator

Yes, cluster databases do not contain alignment information and therefore have no cigar strings. You need to realign them, for example, using the structuralign module with the -a option.

foldseek cluster seq clu tmp
foldseek structueralign seq seq clu aln -a
foldseek result2msa seq seq aln msa

For more documentation regarding Foldseek, check the MMseqs2 Wiki, which explains most of the internals of Foldseek, as Foldseek is built on MMseqs2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants