GitHub - ThorntonLab/Cridland2013AnnotPipeline: TE annotation pipeline from Cridland et al. (2013) doi: 10.1093/molbev/mst129

This README discuss the scripts needed to annotate TEs detected with the TE detection pipeline.

These scripts require the following output files

fasta files generated by the detection pipeline have been assembled with Phrap (-forcelevel 10 -minscore 10 -minmatch 10)
these contigs have been aligned back to the reference genome with BLAST () #NOTE - the way the annotation script recognizes filenames these output files need to have the same name as the contig file with the addition of a .chromosome.output at the end of the file name
these contigs have been aligned to a database containing all of the reference TE sequence only

annotate_TEs.pl

The input for this script is as follows

the directory where the contig files and the blast output files are located - these need to be in the same directory
the name of the lefthand contig file - this is the file that is created for upstream of the breakpoint - the script will parse this filename and then use it to reconstruct the file names of the right contig file and the blast output files associated with these
the estimate positions file generated by the TE detection pipeline - the line*_te_pos_estimate file - the pipeline checks both that there is a TE insertion and that the contig reconstructed aligns to the region of the genome estimated
the chrom_num file - this is the file with the numeric conversion for chromosome arms
the output file

The output file of this script has the following fields - this file can generate duplicate TE identifications for TEs shared with the reference - this is often the case when there are several small TEs very near each other in the reference and multiple contigs are created in this region - therefore this file needs to be edited for redundancy after it is generated

chromosome
upstream estimate
downstream estimate
a string with an id number, then a :, then a string with a preliminary description of what TE is located there - this was a first attempt at TE annotation, but was not what was used in the final paper
the word shared or novel to designate a TE shared with the refernce sequence (same TE at same location) or not
if novel - a number - this is for the upstream contig 0 = complete reconstruction - one contig that contains the TE breakpoint 1 = 2 contigs - one that is TE and one that aligns to the estimated genomic region, but do not have the exact breakpoint 2 = did not resolve a contig
if novel - same as 6 except for the downstream contig

trim_repetition.pl

this script takes the output of the annotate_TEs.pl script and removes repetitive entries

the input for this script is

the output of annotate_TEs.pl
the output file - same format as input

###In the paper the data is then restricted to certain chromosome regions - this was done at this point in the process #########

make_best_te.pl

This script determines the best match for each TE for each contig based on length of match and e-value

the input for this script is

the blast output file for the alignment to the TE database (this processes one file - I write a perl wrapper to go through each file in a directory)
the output file - this file is set up to append so you can run many instances and write to a single output file

the output file for this script is 1) contig (contig name field can be parsed to chromosome number . ID number . left/right . fasta . Contig#) 2) TE 3) length of TE match

make_best_te_contig.pl

This script determines which contig has the resolved TE breakpoing - this is only applicable for situations where a breakpoint is fully reconstructed

the input for this file is

the blast output file for the alignment to the reference sequence
the output file
the output file from annotate_TEs.pl following removal of redundant records

the output of this file is 1) which contig contains the breakpoint 2) the total length of this contig

With the output files from make_best_te.pl and make_best_te_contig.pl the user can determine which TE has been annotated - In the paper we compare which TE is annotated with the upstream and downstream estimate for each event (left/right)

evaluate_shared.pl

This script identifies presence/absence calls for TEs shared with the reference

the input for this script is

the blast output file from the alignment back to the reference genome
the output file
the te_pos_estimate file
the file with the positions of the TEs in the reference
distance for absence breakpoints

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md
TE_positions		TE_positions
annotate_TEs.pl		annotate_TEs.pl
chrom_num		chrom_num
evaluate_novel.pl		evaluate_novel.pl
evaluate_shared.pl		evaluate_shared.pl
make_best_te.pl		make_best_te.pl
make_best_te_contig.pl		make_best_te_contig.pl
transposon_all.fasta		transposon_all.fasta
trim_repetition.pl		trim_repetition.pl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

ThorntonLab/Cridland2013AnnotPipeline

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages