Clustering performance issue: Foldseek exceeds expected runtime #403

Open
YFeriel opened this issue Dec 19, 2024 · 2 comments

Comments

@YFeriel

YFeriel commented Dec 19, 2024

Hello Foldseek team, @martin-steinegger @milot-mirdita

I am using Foldseek to cluster a large structure database, and the run is taking much longer than expected. Here is my setup and the steps I followed:

  1. I downloaded the AlphaFold/UniProt database using the foldseek databases command.
  2. I concatenated this database with my own protein database, which contains approximately 700,000 structures.
  3. I ran the clustering on a compute node with 64 CPUs using the following command:
    foldseek cluster /data/foldseek/concat_db /data/cluster_results /localscratch/yferiel.38388460.0/tmp_clusters -k 7 --threads 64

Even on a node with 64 CPUs, the clustering has been running for more than 7 days and has not finished. According to the Foldseek article, clustering on 64 CPUs takes about 5 days.

Additionally, on the compute cluster I am using, the maximum runtime per job is 7 days.

My questions are:

  1. Is there a way to speed up the clustering given my setup?
  2. Does Foldseek support parallelism, or is there a specific configuration I could try to use more CPUs effectively?
  3. I tried prefixing the command with mpirun -np 64, but it did not work. Does Foldseek support MPI-based parallelism, or is there another way to get better performance?

Any advice or suggestions would be greatly appreciated!

Thank you in advance for your help.

Foldseek Output (for bugs)

Create directory /localscratch/yferiel.38388460.0/tmp_clusters
cluster /data/foldseek/concat_db /data/cluster_results /localscratch/yferiel.38388460.0/tmp_clusters -k 7 --threads 64

MMseqs Version: 0dd4b7f
Substitution matrix aa:3di.out,nucl:3di.out
Seed substitution matrix aa:3di.out,nucl:3di.out
Sensitivity 4
k-mer length 7
Target search mode 0
k-score seq:2147483647,prof:2147483647
Max sequence length 65535
Max results per query 1000
Split database 0
Split mode 2
Split memory limit 0
Coverage threshold 0.8
Coverage mode 0
Compositional bias 0
Compositional bias 1
Diagonal scoring true
Exact k-mer matching 0
Mask residues 0
Mask residues probability 0.9
Mask lower case residues 1
Minimum diagonal score 30
Selected taxa
Spaced k-mers 1
Preload mode 0
Spaced k-mer pattern
Local temporary path
Threads 64
Compressed 0
Verbosity 3
TMscore threshold 0
TMscore threshold mode 0
LDDT threshold 0
Sort by structure bit score 0
Alignment type 2
Exact TMscore 0
Add backtrace false
Alignment mode 3
Alignment mode 0
E-value threshold 0.01
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max reject 2147483647
Max accept 2147483647
Gap open cost aa:10,nucl:10
Gap extension cost aa:1,nucl:1
TMalign hit order 0
TMalign fast 1
Cluster mode 0
Max connected component depth 1000
Similarity type 2
Weight file name
Cluster Weight threshold 0.9
Single step clustering false
Cascaded clustering steps 3
Cluster reassign false
Remove temporary files false
Force restart with latest tmp false
MPI runner
k-mers per sequence 300
Scale k-mers per sequence aa:0.000,nucl:0.200
Adjust k-mer length false
Shift hash 67
Include only extendable false
Skip repeating k-mers false
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0

Set cluster sensitivity to -s 8.000000
Set cluster mode SET COVER
Set cluster iterations to 3
kmermatcher /data/foldseek/concat_db_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 7 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

kmermatcher /data/foldseek/concat_db_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref --sub-mat 'aa:3di.out,nucl:3di.out' --alph-size aa:21,nucl:5 --min-seq-id 0 --kmer-per-seq 300 --spaced-kmer-mode 1 --kmer-per-seq-scale aa:0.000,nucl:0.200 --adjust-kmer-len 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --cov-mode 0 -k 7 -c 0.8 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Database size: 215346985 type: Aminoacid

Not enough memory to process at once need to split
[=================================================================] 215.35M 3m 3s 870ms
Process file into 4 parts
Generate k-mers list for 1 split
[=================================================================] 215.35M 2m 23s 573ms
Sort kmer 0h 5m 47s 225ms
Sort by rep. sequence 0h 0m 7s 440ms
Generate k-mers list for 2 split
[=================================================================] 215.35M 2m 49s 529ms
Sort kmer 0h 5m 37s 749ms
Sort by rep. sequence 0h 0m 8s 618ms
Generate k-mers list for 3 split
[=================================================================] 215.35M 2m 51s 697ms
Sort kmer 0h 5m 15s 714ms
Sort by rep. sequence 0h 0m 13s 360ms
Generate k-mers list for 4 split
[=================================================================] 215.35M 2m 29s 413ms
Sort kmer 0h 1m 42s 200ms
Sort by rep. sequence 0h 0m 11s 510ms
Merge splits ... Time for fill: 0h 10m 4s 90ms
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 51m 0s 760ms
structurerescorediagonal /data/foldseek/concat_db /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_rescore1 --exact-tmscore 0 --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --alignment-type 2 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 215.35M 31h 16m 46s 508ms
Time for merging to pref_rescore1: 0h 2m 12s 27ms
Time for processing: 31h 20m 26s 320ms
clust /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_rescore1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover
[=================================================================] 215.35M 3m 51s 449ms
Sort entries
Find missing connections
Found 184261107 new connections.
Reconstruct initial order
[=================================================================] 215.35M 3m 48s 30ms
Add missing connections
[=================================================================] 215.35M 32s 481ms

Time for read in: 0h 8m 49s 690ms
Total time: 0h 10m 14s 537ms

Size of the sequence database: 215346985
Size of the alignment database: 215346985
Number of clusters: 149526165

Writing results 0h 0m 16s 890ms
Time for merging to pre_clust: 0h 0m 0s 1ms
Time for processing: 0h 11m 6s 481ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/order_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 0ms
Time for processing: 0h 1m 14s 873ms
filterdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter2 --filter-file /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/order_redundancy --threads 64 --compressed 0 -v 3

Filtering using file(s)
[=================================================================] 149.53M 2m 3s 200ms
Time for merging to pref_filter2: 0h 1m 4s 417ms
Time for processing: 0h 3m 50s 865ms
structurealign /data/foldseek/concat_db /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_filter2 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln.linclust --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 2 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 149.53M 4h 56m 21s 768ms
Time for merging to aln.linclust: 0h 1m 26s 176ms
Time for processing: 5h 38m 5s 930ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/order_redundancy /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clustered_seqs -v 3 --subdb-mode 1

Time for merging to pre_clustered_seqs: 0h 0m 0s 0ms
Time for processing: 0h 1m 45s 100ms
clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clustered_seqs /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln.linclust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clust.linclust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover
[=================================================================] 149.53M 1m 55s 192ms
Sort entries
Find missing connections
Found 173934104 new connections.
Reconstruct initial order
[=================================================================] 149.53M 1m 51s 311ms
Add missing connections
[=================================================================] 149.53M 32s 135ms

Time for read in: 0h 4m 47s 830ms
Total time: 0h 5m 54s 232ms

Size of the sequence database: 149526165
Size of the alignment database: 149526165
Number of clusters: 111807234

Writing results 0h 0m 12s 545ms
Time for merging to clust.linclust: 0h 0m 0s 0ms
Time for processing: 0h 6m 30s 662ms
mergeclusters /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pre_clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clust.linclust --threads 64 --compressed 0 -v 3

Clustering step 1
[=================================================================] 149.53M 34s 732ms
Clustering step 2
[=================================================================] 111.81M 1m 0s 239ms
Write merged clustering
[=================================================================] 215.35M 1m 15s 523ms
Time for merging to clu_redundancy: 0h 0m 51s 558ms
Time for processing: 0h 2m 55s 62ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /data/foldseek/concat_db_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ss: 0h 0m 0s 0ms
Time for processing: 0h 1m 8s 838ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /data/foldseek/concat_db_ca /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ca -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ca: 0h 0m 0s 0ms
Time for processing: 0h 1m 15s 672ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_redundancy /data/foldseek/concat_db /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 0ms
Time for processing: 0h 1m 8s 961ms
prefilter /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step0 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 1 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 100 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 0 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3

Query database size: 111807234 type: Aminoacid
Target split mode. Searching through 3 splits
Estimated memory consumption: 182G
Target database size: 111807234 type: Aminoacid
Process prefiltering step 1 of 3

Index table k-mer threshold: 185 at k-mer size 7
Index table: counting k-mers
[=================================================================] 37.38M 53s 721ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 37.38M 6s 375ms
Index statistics
Entries: 420189516
DB size: 12169 MB
Avg k-mer size: 0.328273
Top 10 k-mers
GQYYGNY 124540
AAEEEDP 93786
KIIIWDP 90430
LFEEAPS 68917
IWWDDKI 63080
WDDQKTK 60986
LFEEEPS 59563
FEEEAPV 52757
YEEQDSQ 51063
EYYAALV 48027
Time for index table init: 0h 1m 9s 963ms
k-mer similarity threshold: 185
Starting prefiltering scores calculation (step 1 of 3)
Query db start 1 to 111807234
Target db start 1 to 37383649
[=================================================================] 111.81M 2h 19m 3s 792ms

2.034511 k-mers per position
261116 DB matches per sequence
33622 overflows
24 sequences passed prefiltering per query sequence
1 median result list length
39281587 sequences with 0 size result lists
Time for merging to pref_step0_tmp_0: 0h 0m 55s 144ms
Time for merging to pref_step0_tmp_0_tmp: 0h 1m 39s 845ms
Process prefiltering step 2 of 3

Index table k-mer threshold: 185 at k-mer size 7
Index table: counting k-mers
[=================================================================] 37.22M 1m 3s 106ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 37.22M 5s 855ms
Index statistics
Entries: 420927068
DB size: 12174 MB
Avg k-mer size: 0.328849
Top 10 k-mers
GQYYGNY 124192
AAEEEDP 94425
KIIIWDP 91462
LFEEAPS 69744
IWWDDKI 63734
WDDQKTK 60864
LFEEEPS 59588
FEEEAPV 53786
YEEQDSQ 51530
EYYAALV 47833
Time for index table init: 0h 1m 19s 949ms
k-mer similarity threshold: 185
Starting prefiltering scores calculation (step 2 of 3)
Query db start 1 to 111807234
Target db start 37383650 to 74608124
[=================================================================] 111.81M 2h 13m 12s 977ms

2.034511 k-mers per position
264176 DB matches per sequence
35256 overflows
24 sequences passed prefiltering per query sequence
1 median result list length
39536708 sequences with 0 size result lists
Time for merging to pref_step0_tmp_1: 0h 0m 54s 780ms
Time for merging to pref_step0_tmp_1_tmp: 0h 2m 8s 350ms
Process prefiltering step 3 of 3

Index table k-mer threshold: 185 at k-mer size 7
Index table: counting k-mers
[=================================================================] 37.20M 31s 830ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 37.20M 7s 945ms
Index statistics
Entries: 422442059
DB size: 12182 MB
Avg k-mer size: 0.330033
Top 10 k-mers
GQYYGNY 125930
AAEEEDP 94585
KIIIWDP 91220
LFEEAPS 69973
IWWDDKI 63895
DDQIKIK 61549
WDDQKTK 61179
LFEEEPS 60583
FEEEAPV 53838
YEEQDSQ 51198
Time for index table init: 0h 0m 51s 326ms
k-mer similarity threshold: 185
Starting prefiltering scores calculation (step 3 of 3)
Query db start 1 to 111807234
Target db start 74608125 to 111807234
[=================================================================] 111.81M 2h 5m 42s 192ms

2.034511 k-mers per position
264625 DB matches per sequence
35306 overflows
24 sequences passed prefiltering per query sequence
1 median result list length
39606615 sequences with 0 size result lists
Time for merging to pref_step0_tmp_2: 0h 0m 58s 263ms
Time for merging to pref_step0_tmp_2_tmp: 0h 1m 35s 906ms
Merging 3 target splits to pref_step0
Preparing offsets for merging: 0h 0m 45s 262ms
[=================================================================] 111.81M 8m 42s 849ms
Time for merging to pref_step0: 0h 1m 3s 947ms
Time for merging target splits: 0h 10m 45s 55ms
Time for merging to pref_step0_tmp: 0h 5m 27s 737ms
Time for processing: 7h 22m 37s 479ms
structurealign /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step0 --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 2 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 111.81M 22h 20m 10s 590ms
Time for merging to aln_step0: 0h 1m 19s 609ms
Time for processing: 22h 32m 41s 836ms
clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover
[=================================================================] 111.81M 28m 42s 739ms
Sort entries
Find missing connections
Found 2135109749 new connections.
Reconstruct initial order
[=================================================================] 111.81M 56m 48s 981ms
Add missing connections
[=================================================================] 111.81M 50m 12s 798ms

Time for read in: 2h 44m 46s 139ms
Total time: 2h 53m 15s 547ms

Size of the sequence database: 111807234
Size of the alignment database: 111807234
Number of clusters: 68722297

Writing results 0h 0m 7s 955ms
Time for merging to clu_step0: 0h 0m 0s 3ms
Time for processing: 2h 53m 43s 479ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss -v 3 --subdb-mode 1

Time for merging to input_step1_ss: 0h 0m 0s 0ms
Time for processing: 0h 0m 28s 846ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy_ca /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ca -v 3 --subdb-mode 1

Time for merging to input_step1_ca: 0h 0m 0s 0ms
Time for processing: 0h 0m 29s 516ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step0 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step_redundancy /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 -v 3 --subdb-mode 1

Time for merging to input_step1: 0h 0m 0s 0ms
Time for processing: 0h 0m 28s 718ms
prefilter /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step1 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 4.5 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 200 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3

Query database size: 68722297 type: Aminoacid
Target split mode. Searching through 2 splits
Estimated memory consumption: 162G
Target database size: 68722297 type: Aminoacid
Process prefiltering step 1 of 2

Index table k-mer threshold: 146 at k-mer size 7
Index table: counting k-mers
[=================================================================] 34.53M 44s 682ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 34.53M 11s 255ms
Index statistics
Entries: 1194650611
DB size: 16601 MB
Avg k-mer size: 0.933321
Top 10 k-mers
VLLLLLL 2225745
VSLSLSL 1571417
VSSSSSS 1553687
SSSSSSS 1224579
NVSVSSS 1215989
LLLLLLV 971880
NVSSSSS 712330
SVNVSSS 674536
SSSLLLV 652511
SSVSNSV 633430
Time for index table init: 0h 1m 10s 884ms
k-mer similarity threshold: 146
Starting prefiltering scores calculation (step 1 of 2)
Query db start 1 to 68722297
Target db start 1 to 34532558
[=================================================================] 68.72M 13h 43m 12s 619ms

15.006364 k-mers per position
967821 DB matches per sequence
8753 overflows
82 sequences passed prefiltering per query sequence
140 median result list length
11830310 sequences with 0 size result lists
Time for merging to pref_step1_tmp_0: 0h 0m 39s 754ms
Time for merging to pref_step1_tmp_0_tmp: 0h 3m 44s 305ms
Process prefiltering step 2 of 2

Index table k-mer threshold: 146 at k-mer size 7
Index table: counting k-mers
[=================================================================] 34.19M 3m 10s 175ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 34.19M 11s 187ms
Index statistics
Entries: 1203088172
DB size: 16649 MB
Avg k-mer size: 0.939913
Top 10 k-mers
VLLLLLL 2225385
VSLSLSL 1577075
VSSSSSS 1550918
SSSSSSS 1221518
NVSVSSS 1216445
LLLLLLV 971123
NVSSSSS 712243
SVNVSSS 679810
SSSLLLV 658947
SSVSNSV 633800
Time for index table init: 0h 3m 36s 725ms
k-mer similarity threshold: 146
Starting prefiltering scores calculation (step 2 of 2)
Query db start 1 to 68722297
Target db start 34532559 to 68722297
[=================================================================] 68.72M 17h 48m 4s 387ms

15.006364 k-mers per position
972847 DB matches per sequence
9403 overflows
82 sequences passed prefiltering per query sequence
140 median result list length
12322936 sequences with 0 size result lists
Time for merging to pref_step1_tmp_1: 0h 0m 35s 622ms
Time for merging to pref_step1_tmp_1_tmp: 0h 3m 50s 81ms
Merging 2 target splits to pref_step1
Preparing offsets for merging: 0h 0m 26s 484ms
[=================================================================] 68.72M 14m 6s 739ms
Time for merging to pref_step1: 0h 0m 37s 358ms
Time for merging target splits: 0h 15m 28s 932ms
Time for merging to pref_step1_tmp: 0h 7m 8s 47ms
Time for processing: 32h 31m 28s 338ms
structurealign /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step1 --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 0 --alignment-type 2 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.01 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 64 --compressed 0 -v 3

[=================================================================] 68.72M 21h 10m 0s 517ms
Time for merging to aln_step1: 0h 0m 50s 864ms
Time for processing: 25h 45m 11s 582ms
clust /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/aln_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 64 --compressed 0 -v 3 --cluster-weight-threshold 0.9

Clustering mode: Set Cover
[=================================================================] 68.72M 43m 19s 831ms
Sort entries
Find missing connections
Found 2071414253 new connections.
Reconstruct initial order
[=================================================================] 68.72M 49m 16s 343ms
Add missing connections
[=================================================================] 68.72M 44m 28s 611ms

Time for read in: 3h 2m 24s 369ms
Total time: 3h 10m 50s 583ms

Size of the sequence database: 68722297
Size of the alignment database: 68722297
Number of clusters: 34529705

Writing results 0h 0m 4s 223ms
Time for merging to clu_step1: 0h 0m 0s 24ms
Time for processing: 3h 11m 8s 448ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ss -v 3 --subdb-mode 1

Time for merging to input_step2_ss: 0h 0m 0s 0ms
Time for processing: 0h 0m 16s 694ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1_ca /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ca -v 3 --subdb-mode 1

Time for merging to input_step2_ca: 0h 0m 0s 0ms
Time for processing: 0h 0m 17s 274ms
createsubdb /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/clu_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step1 /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2 -v 3 --subdb-mode 1

Time for merging to input_step2: 0h 0m 0s 0ms
Time for processing: 0h 0m 16s 632ms
prefilter /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/input_step2_ss /localscratch/yferiel.38388460.0/tmp_clusters/4547329476816437654/pref_step2 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 8 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 64 --compressed 0 -v 3

Query database size: 34529705 type: Aminoacid
Estimated memory consumption: 150G
Target database size: 34529705 type: Aminoacid
Index table k-mer threshold: 107 at k-mer size 7
Index table: counting k-mers
[=================================================================] 34.53M 10m 17s 308ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 34.53M 19s 801ms
Index statistics
Entries: 2473273389
DB size: 23917 MB
Avg k-mer size: 1.932245
Top 10 k-mers
DDDDDDD 14776321
DDDDDDP 13093909
DDDDDPP 11500036
DDDDPDD 9107859
DDDPDDD 8270776
DDDDPPP 7786765
DDDPPPP 6854484
DDPPPPP 5727000
VLVLVVV 5555350
SVSVVVV 5227077
Time for index table init: 0h 11m 18s 790ms
Hard disk might not have enough free space (343G left).The prefilter result might need up to 1T.
Process prefiltering step 1 of 1

k-mer similarity threshold: 107
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 34529705
Target db start 1 to 34529705
[=============================
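
The log above is long, so a small script can help show which stages consumed the time. The following is a minimal Python sketch and not part of Foldseek: the helper names `to_seconds` and `time_per_module` are made up for illustration, and it assumes the log format shown above, where each module prints its command line first and a `Time for processing:` line when it finishes. The sample lines are copied from the log with the command arguments omitted.

```python
import re
from collections import defaultdict

# Durations in the log look like "31h 20m 26s 320ms".
DURATION = re.compile(r"(\d+)h (\d+)m (\d+)s (\d+)ms")

# Module names that open a command line in the log above.
MODULES = {"kmermatcher", "structurerescorediagonal", "clust", "createsubdb",
           "filterdb", "structurealign", "mergeclusters", "prefilter"}


def to_seconds(text):
    """Convert the first 'Hh Mm Ss MSms' duration in text to seconds, or return None."""
    match = DURATION.search(text)
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def time_per_module(log_lines):
    """Sum 'Time for processing' values under the most recently started module."""
    totals = defaultdict(float)
    current = None
    for line in log_lines:
        first_token = line.split(" ", 1)[0]
        if first_token in MODULES:
            current = first_token
        elif line.startswith("Time for processing:") and current is not None:
            seconds = to_seconds(line)
            if seconds is not None:
                totals[current] += seconds
    return dict(totals)


# A few lines taken from the log above (arguments omitted for brevity).
sample_log = [
    "kmermatcher <arguments omitted>",
    "Time for processing: 0h 51m 0s 760ms",
    "structurerescorediagonal <arguments omitted>",
    "Time for processing: 31h 20m 26s 320ms",
    "structurealign <arguments omitted>",
    "Time for processing: 5h 38m 5s 930ms",
]

for module, seconds in sorted(time_per_module(sample_log).items(),
                              key=lambda item: -item[1]):
    print(f"{module}: {seconds / 3600:.1f} h")
```

Run on the full log, a summary like this makes it easy to see that structurerescorediagonal, structurealign and prefilter account for nearly all of the completed runtime, while clust, createsubdb, filterdb and mergeclusters are comparatively cheap.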

@martin-steinegger
Collaborator

Yes, this kind of clustering takes time. It seems the prefilter has already processed quite a bit.

[=============================
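
One rough way to put a number on "quite a bit" is to compare the truncated progress bar with a completed one from the same log. The sketch below is only an illustration under two assumptions: that the bar fills linearly with the number of processed queries, and that every bar in the log has the same full width. The function name `bar_fraction` is hypothetical, and the bars in the example are toy strings rather than lines from the log.

```python
def bar_fraction(partial_bar, complete_bar):
    """Estimate progress as the ratio of '=' characters in two progress bars."""
    done = partial_bar.count("=")
    total = complete_bar.count("=")
    if total == 0:
        raise ValueError("complete_bar contains no '=' characters")
    # Cap at 1.0 in case the partial bar carries stray '=' characters.
    return min(done / total, 1.0)


# Toy bars for illustration; paste the real lines from the log for an estimate.
print(bar_fraction("[=====", "[==========]"))  # prints 0.5
```

The result is only as precise as the bar's resolution, and it says nothing about how long the remaining queries will take, since query cost is not uniform.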

If you want to speed up the process, you could pre-cluster it with MMseqs2 first and then cluster the representatives using Foldseek.
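
The log in the first comment shows why clustering representatives is cheaper: every stage of the cascade shrinks the set that the next, more sensitive stage has to search. The following Python sketch tabulates the counts reported in that log (the "Database size" line and the successive "Number of clusters" lines). The stage labels are the output names from the log, and the function name `reductions` is made up for illustration. Whether a sequence-based pre-clustering would shrink this particular dataset by a useful amount is not something these numbers can answer.

```python
# Counts copied from the log: the input database size, then the
# "Number of clusters" reported after each clust call, in log order.
stages = [
    ("input database", 215_346_985),
    ("pre_clust", 149_526_165),
    ("clust.linclust", 111_807_234),
    ("clu_step0", 68_722_297),
    ("clu_step1", 34_529_705),
]


def reductions(stages):
    """Return (label, count, fraction kept relative to the previous stage)."""
    result = []
    for (_, previous), (label, count) in zip(stages, stages[1:]):
        result.append((label, count, count / previous))
    return result


for label, count, kept in reductions(stages):
    print(f"{label}: {count:,} representatives ({kept:.1%} of previous stage)")

overall = stages[-1][1] / stages[0][1]
print(f"overall: {overall:.1%} of the input remains before the final step")
```

Any step that removes redundant entries before the expensive prefilter and structurealign stages reduces both the number of queries and the size of the target index they run against.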

@YFeriel
Author

YFeriel commented Dec 30, 2024

Dear @martin-steinegger,

Thank you for your suggestion. I understand the rationale behind using MMseqs2 for pre-clustering, but I am facing a particular challenge with my dataset. The proteins I am studying have very low sequence homology, which is precisely why I opted to use Foldseek—to explore structural homology instead of sequence similarity.

Given this, would it still be meaningful to use MMseqs2 for pre-clustering, knowing that the sequence homology is negligible? Would the pre-clustering step provide any advantage when applied to such a dataset?
Your insight on this matter would be greatly appreciated.

Best regards
