Skip to content

Latest commit

 

History

History
223 lines (159 loc) · 9.09 KB

Readme.md

File metadata and controls

223 lines (159 loc) · 9.09 KB

Testing carnivore diet analysis on Oxford Nanopore Flongle

Execute on server

  1. Merge all sequences in one file

mkdir Data

cat fastq_pass/*.fastq > Data/fastq_pass.fastq

File size is 269Mo

  1. Calculate stats with seqkit, filter out long reads, convert to fasta

seqkit fx2tab Data/fastq_pass.fastq -l -q -n -i -H -j 32 > JR_flongle_m12s_stats.txt

seqkit stats Data/fastq_pass.fastq

file | format | type | num_seqs | sum_len | min_len | avg_len | max_len

Data/fastq_pass.fastq |FASTQ | DNA | 295,234 | 109,867,923 | 84 | 372.1 | 4,367

seqkit seq --min-len 100 --max-len 400 Data/fastq_pass.fastq > Data/fastq_pass_100-400bp.fastq

seqkit fq2fa Data/fastq_pass_100-400bp.fastq -o Data/fasta_pass.fasta

seqkit stats Data/fasta_pass.fasta

file format type num_seqs sum_len min_len avg_len max_len
Data/fasta_pass.fasta FASTA DNA 230,394 67,470,630 101 292.8 400

3.1. Cluster sequences with usearch, then sort by size and remove rare clusters

../usearch/usearch -cluster_fast Data/fasta_pass.fasta -id 0.80 -threads 32 -sizeout -centroids Data/centroids.fasta -uc Data/clusters.uc -consout Data/jr_m12s_jm_consensus.fasta

Seqs|230394 (230.4k) Clusters|115250 (115.2k) Max size|5989 Avg size|2.0 Min size|1 Singletons|102113 (102.1k), 44.3% of seqs, 88.6% of clusters Max mem|730Mb Time|24:47 Throughput|154.9 seqs/sec.

../usearch/usearch -sortbysize Data/jr_m12s_jm_consensus.fasta -fastaout Data/jr_m12s_jm_consensus_min5.fasta -minsize 5

seqkit stats Data/jr_m12s_jm_consensus_min5.fasta

file format type num_seqs sum_len min_len avg_len max_len
Data/jr_m12s_jm_consensus_min5.fasta FASTA DNA 4,296 1,232,430 178 286.9 336

https://drive5.com/usearch/manual/cmd_cluster_fast.html

https://drive5.com/usearch/manual/identity.html

3.2. Blastn against vertebrate mitochondrial database

mkdir blast

blastn -db ../JR_SoilF1/db_ncbi_vrtmito/ncbi_vrtmito.fasta -query Data/jr_m12s_jm_consensus_min5.fasta -out blast/jr_flongle_m12s_clust_mito.txt -max_target_seqs 1 -perc_identity 80 -outfmt "6 qseqid sseqid sacc staxids evalue pident nident slen qstart qend length mismatch"

3.3. Analysing Blastn results In this notebook. But not pretty, will work in R markdown in the future

3.4. Cutting adaptors Our PCR products were designed for illumina sequencing, they contain adaptors at each end that we need to remove before processing in ecotag

1st PCR F R Ref Mean length target sequence Barcode length w/o primers Adaptor N Primer Full Primer
MiMammal-U (forward) F Ushio 2017 219 169 ACACTCTTTCCCTACACGACGCTCTTCCGATCT NNNNNN GGGTTGGTAAATTTCGTGCCAGC ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNGGGTTGGTAAATTTCGTGCCAGC
MiMammal-U (reverse) R Ushio 2017 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT NNNNNN CATAGTGGGGTATCTAATCCCAGTTTG GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNCATAGTGGGGTATCTAATCCCAGTTTG

Their reverse needs to be searched for as well, as I believe nanopores have no preferential direction. Rev Forward: GCTGGCACGAAATTTACCAACCCNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Rev Reverse: CAAACTGGGATTAGATACCCCACTATGNNNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

I started from the raw fastq, converted to fasta seqkit fq2fa Data/fastq_pass.fastq -o Data/raw_pass.fasta

Forward first cutadapt -o Data/fasta_pass_cutFOR.fasta Data/raw_pass.fasta -e 0.15 -g ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNGGGTTGGTAAATTTCGTGCCAGC...CAAACTGGGATTAGATACCCCACTATGNNNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -j 4 --discard-untrimmed -O 20 > Data/fasta_pass_cutFOR.report.txt

Reverse first cutadapt -o Data/fasta_pass_cutREV.fasta Data/raw_pass.fasta -e 0.15 -g GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNCATAGTGGGGTATCTAATCCCAGTTTG...GCTGGCACGAAATTTACCAACCCNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -j 4 --discard-untrimmed -O 20 > Data/fasta_pass_cutREV.report.txt

seqkit seq --min-len 150 --max-len 300 -j 16 Data/fasta_pass_cutREV.fasta > Data/fasta_pass_cutREV_min150.fasta seqkit seq --min-len 150 --max-len 300 -j 16 Data/fasta_pass_cutFOR.fasta > Data/fasta_pass_cutFOR_min150.fasta seqkit seq -r -p -v -j 16 Data/fasta_pass_cutFOR_min150.fasta > Data/fasta_pass_cutFOR_min150_rc.fasta cat Data/fasta_pass_cutFOR_min150_rc.fasta Data/fasta_pass_cutREV_min150.fasta > Data/fasta_pass_cut.fasta seqkit stat -j 16 Data/*.fasta

We lose more than half of the reads through this step (fasta_pass_cut.fasta) compared to fasta_pass

file format type num_seqs sum_len min_len avg_len max_len
Data/fasta_pass_cutFOR.fasta FASTA DNA 104,470 18,944,808 3 181.3 2,352
Data/fasta_pass_cutFOR_min150.fasta FASTA DNA 97,902 16,125,229 150 164.7 300
Data/fasta_pass_cutREV.fasta FASTA DNA 27,955 5,047,612 14 180.6 1,774
Data/fasta_pass_cutREV_min150.fasta FASTA DNA 26,684 4,454,156 150 166.9 285
Data/fasta_pass.fasta FASTA DNA 230,394 67,470,630 101 292.8 400
Data/raw_pass.fasta FASTA DNA 295,234 109,867,923 84 372.1 4,367
Data/fasta_pass_cut.fasta FASTA DNA 124,586 20,579,385 150 165.2 300

3.5 Clustering on cut sequences ../usearch/usearch -cluster_fast Data/fasta_pass_cut.fasta -id 0.80 -threads 32 -sizeout -centroids Data/centroids_cut.fasta -uc Data/clusters_cut.uc -consout Data/jr_m12s_cut_consensus.fasta

Clustering works much better on these cut sequences

Seqs|124583 (124.6k) Clusters|23342 (23.3k) Max size|2805 Avg size|5.3 Min size|1 Singletons|16636 (16.6k), 13.4% of seqs, 71.3% of clusters Max mem|425Mb Time|02:15 Throughput|922.8 seqs/sec.

../usearch/usearch -sortbysize Data/jr_m12s_cut_consensus.fasta -fastaout Data/jr_m12s_cut_consensus_min2.fasta -minsize 2

seqkit stats -j 16 Data/jr_m12s_cut_consensus_min2.fasta

file format type num_seqs sum_len min_len avg_len max_len
Data/jr_m12s_cut_consensus_min5.fasta FASTA DNA 6,706 1,118,271 148 166.8 308

3.6. Ecotag For comparison, I used ecotag to identify the taxa corresponding to OTUs

ecotag -d ../../db/EMBL_191121 -R ../../db/MiMammal-U/M12s_v1_clean_uniqID_uniq_length.fasta Data/jr_m12s_cut_consensus_min2.fasta > ecotag/jr_flongle_m12s_clust_ecotag.fasta obitab -d -o ecotag/jr_flongle_m12s_clust_ecotag.fasta > ecotag/jr_flongle_m12s_clust_ecotag.txt obiclean -r 1 -d 1 -H ecotag/jr_flongle_m12s_clust_ecotag.fasta > ecotag/jr_flongle_m12s_clust_ecotag_CleanH.fasta obitab -d -o ecotag/jr_flongle_m12s_clust_ecotag_CleanH.fasta > ecotag/jr_flongle_m12s_clust_ecotag_CleanH.txt

Results from Ecotag on the clusters

fam=data %>% group_by(family_name) %>% tally() family_name|n :-----:|:-----: | 1 Canidae|327 2 Cervidae|2 3 Columbidae|4 4 Cricetidae|117 5 Felidae|408 6 Leporidae|1 7 Muridae|1 8 NA|2946

genus=data %>% group_by(genus_name) %>% tally() genus_name|n :-----:|:-----: 1 Canis|1 2 Neotoma|2 3 Patagioenas|2 4 Rattus|1 5 Urocyon|7 6 NA|3793

scientific_name n
1 Arvicolinae 19
2 Canidae 319
3 Caniformia 153
4 Canis 1
5 Carnivora 1286
6 Cervidae 1
7 Columbidae 2
8 Cricetidae 91
9 Felidae 394
10 Feliformia 782

4.1. Blasting all sequences against mitochondrial database Given that the clustering is not optimal, we can also check what the results look like if we blass all 230,394 sequences

blastn -db ../JR_SoilF1/db_ncbi_vrtmito/ncbi_vrtmito.fasta -query Data/fasta_pass.fasta -out blast/jr_flongle_m12s_allseq_mito.txt -max_target_seqs 1 -perc_identity 80 -outfmt "6 qseqid sseqid sacc staxids evalue pident nident slen qstart qend length mismatch"

blastn -db ../JR_SoilF1/db_ncbi_vrtmito/ncbi_vrtmito.fasta -query Data/fasta_pass_cut.fasta -out blast/jr_flongle_m12s_allseqcut_mito.txt -max_target_seqs 1 -perc_identity 80 -num_threads 32 -outfmt "6 qseqid sseqid sacc staxids evalue pident nident slen qstart qend length mismatch"

4.2. Consensus sequences

Comparing Families from Miseq and Nanopore Found 11 out of 14 (except Geomyidae, Equidae, Passerellidae)

Found 7 families with at least 10 reads in Nanopore run. 7 families in total

Family sum(count) 1 Felidae 58365 2 Canidae 17552 3 Cricetidae 16287 4 Viverridae 62 5 Cervidae 25 6 Muridae 14 7 Columbidae 13

Geting consensus Viverridae matched to Puma concolor mitochondrion, complete genome at 97% Phocidae, only 8 reads, matched to Canis latrans at 93% Mustelidae, only 5 reads, matched to Mustela frenata but only at 82%. Interesting because not found in Miseq! Muridae, 14 sequences, matched to Microtus multiplex at 91%

Did it in R and geneious at first. Can probably be done with seqkit, mafft/muscle and a consensus package

Seqkit, extract all reads for a given otu (sacc or genus, family..) cat Data/fasta_pass_cut.fasta | seqkit grep -f margay_seqID.txt > margay.fasta

Alignment https://www.drive5.com/muscle/

Consensus https://www.bioinformatics.nl/cgi-bin/emboss/help/cons http://www.stevekellylab.com/software/mergealign https://www.mothur.org/wiki/Consensus.seqs