Merge branch 'release/v3.3.0'
ACEnglish committed May 25, 2022
2 parents 1d2f42b + 757c2eb commit b9a3c59
Showing 98 changed files with 23,733 additions and 21,169 deletions.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
[![pylint](imgs/pylint.svg)](https://github.com/spiralgenetics/truvari/actions/workflows/pylint.yml)
[![FuncTests](https://github.com/spiralgenetics/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/spiralgenetics/truvari/actions/workflows/func_tests.yml)
[![coverage](imgs/coverage.svg)](https://github.com/spiralgenetics/truvari/actions/workflows/func_tests.yml)
[![develop](https://img.shields.io/github/commits-since/spiralgenetics/truvari/v3.1.0)](https://github.com/spiralgenetics/truvari/commits/develop)
[![develop](https://img.shields.io/github/commits-since/spiralgenetics/truvari/v3.2.0)](https://github.com/spiralgenetics/truvari/commits/develop)
[![Downloads](https://pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)

Toolkit for benchmarking, merging, and annotating Structural Variants
12 changes: 12 additions & 0 deletions docs/api/truvari.rst
Expand Up @@ -160,6 +160,10 @@ sizesim
^^^^^^^
.. autofunction:: sizesim

vcf_ranges
^^^^^^^^^^
.. autofunction:: vcf_ranges

weighted_score
^^^^^^^^^^^^^^
.. autofunction:: weighted_score
Expand All @@ -174,6 +178,10 @@ help_unknown_cmd
^^^^^^^^^^^^^^^^
.. autofunction:: help_unknown_cmd

make_temp_filename
^^^^^^^^^^^^^^^^^^
.. autofunction:: make_temp_filename

optimize_df_memory
^^^^^^^^^^^^^^^^^^
.. autofunction:: optimize_df_memory
Expand All @@ -182,6 +190,10 @@ restricted_float
^^^^^^^^^^^^^^^^
.. autofunction:: restricted_float

restricted_int
^^^^^^^^^^^^^^^^
.. autofunction:: restricted_int

setup_logging
^^^^^^^^^^^^^
.. autofunction:: setup_logging
31 changes: 31 additions & 0 deletions docs/v3.3.0/Citations.md
@@ -0,0 +1,31 @@
# Citing Truvari

Pre-print on Biorxiv while in submission:

Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity
doi: https://doi.org/10.1101/2022.02.21.481353

# Citations

List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list.
* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8)
* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract)
* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true)
* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728)
* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full)
* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0)
* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078)
* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5)
* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract)
* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0)
* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9)
* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452)
* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true)
* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x)
* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full)
* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y)
* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246)
* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7)
* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9)
* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract)
* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract)
63 changes: 63 additions & 0 deletions docs/v3.3.0/Comparing-two-SV-programs.md
@@ -0,0 +1,63 @@
A frequent application of comparing SVs is to perform a 'bakeoff' of performance
between two SV programs against a single set of base calls.

Beyond looking at the Truvari results/report, you may like to investigate what calls
are different between the programs.

Below is a set of scripts that may help you generate those results. For our examples,
we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs.

*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf.

Generate the Truvari report for Asvs and Bsvs
=============================================

```bash
truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ...
truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ...
```

Combine the TPs within each report
==================================

```bash
cd cmp_A/
paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt
cd ../cmp_B/
paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt
```

Grab the FNs missed by only one program
=======================================

```bash
(grep -v "#" cmp_A/fn.vcf; grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | awk '$1 == 1 {print $2}' > missed_names.txt
```
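
The same selection can be sketched in Python. This is only an illustration of the logic, not part of Truvari: the helper name `ids_missed_by_one` and the sample records are made up, and it assumes (as noted above) that every record in Gsvs has a unique ID in the third VCF column.

```python
from collections import Counter


def ids_missed_by_one(fn_a_lines, fn_b_lines):
    """Return base-call IDs present in exactly one of the two FN sets."""
    def record_ids(lines):
        # Skip blank and header lines; the ID is the third tab-separated column.
        return [line.split("\t")[2] for line in lines
                if line.strip() and not line.startswith("#")]

    counts = Counter(record_ids(fn_a_lines) + record_ids(fn_b_lines))
    return sorted(name for name, seen in counts.items() if seen == 1)


# Made-up records standing in for cmp_A/fn.vcf and cmp_B/fn.vcf.
fn_a = ["##fileformat=VCFv4.2", "chr1\t100\tsv1\tA\t<DEL>", "chr1\t500\tsv2\tA\t<INS>"]
fn_b = ["##fileformat=VCFv4.2", "chr1\t500\tsv2\tA\t<INS>", "chr2\t900\tsv3\tA\t<DEL>"]
print(ids_missed_by_one(fn_a, fn_b))  # ['sv1', 'sv3']
```

Here `sv2` is missed by both programs, so it is dropped, while `sv1` and `sv3` are each missed by only one.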

Pull the TP sets' difference
============================

```bash
cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt
cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt
```
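
If `grep -w` turns out to be too loose (it matches the name anywhere on the line), a stricter filter can be sketched in Python. The function name `tps_for_ids` and the sample lines are hypothetical; the sketch assumes each line of `combined_tps.txt` starts with the base record, so the base ID is the third tab-separated column.

```python
def tps_for_ids(combined_lines, wanted_ids):
    """Keep combined TP lines whose base-call ID (third column) is wanted."""
    wanted = set(wanted_ids)
    kept = []
    for line in combined_lines:
        fields = line.rstrip("\n").split("\t")
        # Ignore malformed or empty lines that have no ID column.
        if len(fields) > 2 and fields[2] in wanted:
            kept.append(line)
    return kept


# Made-up stand-ins for combined_tps.txt lines and missed_names.txt entries.
combined = [
    "chr1\t100\tsv1\tA\t<DEL>\tchr1\t101\tcallX\tA\t<DEL>",
    "chr1\t500\tsv2\tA\t<INS>\tchr1\t498\tcallY\tA\t<INS>",
]
print(len(tps_for_ids(combined, ["sv1"])))  # 1
```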

To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`.

To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`.

Calculate the overlap
=====================

One may wish for summary numbers of how many calls are shared/unique between the two programs.
Truvari has a program to help. See [[Consistency-report|consistency]] for details.

Shared FPs between the programs
===============================

All of the work above has been about how to analyze the true positives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again.

```bash
bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz
bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz
truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ...
```
90 changes: 90 additions & 0 deletions docs/v3.3.0/Development.md
@@ -0,0 +1,90 @@
# Truvari API
Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/).

Documentation can also be seen using
```python
import truvari
help(truvari)
```

# docker

A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run
```bash
docker build -t truvari .
```

You can then run Truvari through docker using
```bash
docker run -v `pwd`:/data -it truvari
```
Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point.
```bash
docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz
```

If you'd like to interact within the docker container for things like running the CI/CD scripts
```bash
docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari
```
You'll now be inside the container and can run FuncTests or run Truvari directly
```bash
bash repo_utils/truvari_ssshtests.sh
truvari anno svinfo -i example.vcf.gz
```

# CI/CD

Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts.

```bash
pip install pylint anybadge coverage
```

Check code formatting with
```bash
python repo_utils/pylint_maker.py
```
We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting.

Test the code and generate a coverage report with
```bash
bash repo_utils/truvari_ssshtests.sh
```

Truvari leverages GitHub Actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, the workflow uploads an artifact of the `coverage html` report, which you can download to see a line-by-line accounting of test coverage.

# git flow

To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/<name>`

When contributing to the code, be sure you're working off of develop and have run `git flow init`.

# versioning

Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPI, no suffix is placed on the version.
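
As a rough illustration of this scheme, the sketch below classifies a version string by its suffix. It is not code from Truvari: `describe_version` is a hypothetical helper, and it assumes the suffix is exactly `-dev` or `-rc` with nothing after it.

```python
import re

# MAJOR.MINOR.PATCH with an optional -dev or -rc suffix (assumed format).
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-(dev|rc))?$")

STAGES = {"dev": "development", "rc": "release candidate", None: "release"}


def describe_version(version):
    """Split a version string into its numeric tuple and release stage."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, suffix = match.groups()
    return (int(major), int(minor), int(patch)), STAGES[suffix]


print(describe_version("3.1.0-dev"))  # ((3, 1, 0), 'development')
print(describe_version("3.3.0"))      # ((3, 3, 0), 'release')
```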

# docs

The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`.

# Creating a release
Follow these steps to create a release

0) Bump release version
1) Run tests locally
2) Update API Docs
3) Freeze the Wiki
4) Ensure all code is checked in
5) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/)
6) Use GitHub action to make a testpypi release
7) Check test release
```bash
python3 -m venv test_truvari
source test_truvari/bin/activate
python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari
```
8) Use GitHub action to make a pypi release
9) Change Updates Wiki
10) Download release-tarball.zip from step #8’s action
11) Create release (include #9) from the tag
12) Check out develop, bump to the dev version, and update the README 'commits since' badge
55 changes: 55 additions & 0 deletions docs/v3.3.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md
@@ -0,0 +1,55 @@
By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures differ from the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm).

To show this difference, consider the following two sequences:

```
     AGATACAGGAGTACGAACAGTACAGTACGA
     |||||||||||||||*||||||||||||||
ATCACAGATACAGGAGTACGTACAGTACAGTACGA
30bp Aligned
1bp Mismatched (96% similarity)
5bp Left-Trimmed (~14% of the bottom sequence)
```

The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.


```python
import swalign
import Levenshtein
import edlib

seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA"
seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA"

scoring = swalign.NucleotideScoringMatrix(2, -1)
alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5)
aln = alner.align(seq1, seq2)
mat_tot = aln.matches
mis_tot = aln.mismatches
denom = float(mis_tot + mat_tot)
if denom == 0:
    ident = 0
else:
    ident = mat_tot / denom
scr = edlib.align(seq1, seq2)
totlen = len(seq1) + len(seq2)

print('swalign', ident)
# swalign 0.966666666667
print('levedit', Levenshtein.ratio(seq1, seq2))
# levedit 0.892307692308
print('edlib', (totlen - scr["editDistance"]) / totlen)
# edlib 0.9076923076923077
```

Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio.

If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%.
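
The arithmetic behind that figure can be checked directly. The counts below come from the alignment shown earlier (29 matches, 1 mismatch, 5 trimmed bases); the variable names are just for illustration.

```python
matches = 29      # aligned bases that agree
mismatches = 1    # the single mismatched base
trimmed = 5       # left-trimmed bases of the longer sequence

# Similarity over the aligned region only, as swalign reports it.
aligned_only = matches / (matches + mismatches)
# Similarity if every trimmed base is counted as a mismatch.
with_trimmed = matches / (matches + mismatches + trimmed)

print(round(aligned_only, 3))  # 0.967
print(round(with_trimmed, 3))  # 0.829
```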

[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed.
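
To make the difference between the two edit-distance numbers concrete, here is a small pure-Python sketch. It is not how edlib or python-Levenshtein are implemented; it is a plain dynamic-programming edit distance with a configurable substitution cost. The function names are made up. A substitution cost of 1 gives the standard edit distance, and a cost of 2 gives the weighting the linked post describes for the Levenshtein ratio.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insert/delete cost and a chosen substitution cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def similarity(a, b, sub_cost=1):
    """(total length - distance) / total length, as in the example above."""
    total = len(a) + len(b)
    return (total - edit_distance(a, b, sub_cost)) / total


seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA"
seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA"

print(round(similarity(seq1, seq2, sub_cost=1), 4))  # 0.9077
print(round(similarity(seq1, seq2, sub_cost=2), 4))  # 0.8923
```

The single mismatch costs 1 in the first case and 2 in the second, which accounts for the gap between the edlib and Levenshtein figures.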

The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference.

However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`.
17 changes: 17 additions & 0 deletions docs/v3.3.0/GIAB-stratifications.sh.md
@@ -0,0 +1,17 @@
As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details.

The stratifications.sh script automates the procedure described in that discussion. It takes a Truvari bench output directory as the first argument, along with the TR annotations (found in the link above), and creates the following files in the output directory.

Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment.


| FileName | Description |
|---------------------|---------------------------------------|
| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions |
| fn_ins_TR.vcf.gz | FN insertions found in TR regions |
| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions |
| fp_ins_TR.vcf.gz | FP insertions found in TR regions |
| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions |
| fn_del_TR.vcf.gz | FN deletions found in TR regions |
| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions |
| fp_del_TR.vcf.gz | FP deletions found in TR regions |
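
The naming scheme can be illustrated with a short Python sketch. This is not the script itself, which relies on `bedtools`; the helper names, the interval convention (half-open `(start, end)` pairs per chromosome), and the overlap rule are all assumptions made for the example.

```python
def overlaps_any(start, end, regions):
    """True if the span start..end overlaps any (r_start, r_end) region."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


def output_name(state, svtype, chrom, start, end, tr_regions):
    """Build a file name such as fn_del_TR.vcf.gz for one variant."""
    in_tr = overlaps_any(start, end, tr_regions.get(chrom, []))
    return f"{state}_{svtype.lower()}_{'TR' if in_tr else 'nonTR'}.vcf.gz"


# Made-up TR annotation: a single region on chr1.
tr_regions = {"chr1": [(100, 200)]}
print(output_name("fn", "DEL", "chr1", 150, 300, tr_regions))  # fn_del_TR.vcf.gz
print(output_name("fp", "INS", "chr2", 150, 151, tr_regions))  # fp_ins_nonTR.vcf.gz
```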
35 changes: 35 additions & 0 deletions docs/v3.3.0/Home.md
@@ -0,0 +1,35 @@
The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs)

Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity
doi: https://doi.org/10.1101/2022.02.21.481353

# Before you start
VCFs aren't always created with a strong adherence to the format's specification.

Truvari expects input VCFs to be valid so that it will only output valid VCFs.

We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF.

Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble.

Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below.

# Index

- [[Updates|Updates]]
- [[Installation|Installation]]
- Truvari Commands:
  - [[bench|bench]]
    - [[The multimatch parameter|The--multimatch-parameter]]
    - [[Edit Distance Ratio vs Sequence Similarity|Edit-Distance-Ratio-vs-Sequence-Similarity]]
    - [[Multi-allelic VCFs|Multi-allelic-VCFs]]
    - [[stratifications.sh|GIAB-stratifications.sh]]
    - [[Comparing two SV programs|Comparing-two-SV-programs]]
  - [[consistency|consistency]]
  - [[anno|anno]]
  - [[collapse|collapse]]
  - [[vcf2df|vcf2df]]
  - [[segment|segment]]
- [[Development|Development]]
- [[Citations|Citations]]
- [[Resources|Resources]]
47 changes: 47 additions & 0 deletions docs/v3.3.0/Installation.md
@@ -0,0 +1,47 @@
Recommended
===========
For stable versions of Truvari, use pip
```
python3 -m pip install truvari
```
Specific versions can be installed via
```
python3 -m pip install truvari==3.0.0
```
See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases.
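
After installing, one way to confirm which version ended up in the active environment is to ask Python's packaging metadata. This is a generic sketch using the standard library (Python 3.8+), not a Truvari feature; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package="truvari"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when truvari is installed, otherwise None.
print(installed_version())
```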

Manual Installation
===================
To build Truvari directly, clone the repository and switch to a specific tag.
```
git clone https://github.com/spiralgenetics/truvari.git
cd truvari
git checkout tags/v3.0.0
python3 setup.py install
```
To see a list of all available tags, run:
```
git tag -l
```
If you have an older clone of the repository, but don't see the version you're looking for in tags, make sure to pull the latest changes:
```
git pull
git fetch --all --tags
```

Mamba
=====
Users can follow the instructions [here](https://mamba.readthedocs.io/en/latest/installation.html) to install mamba, a faster, conda-compatible package manager.

Create an environment with Truvari and its dependencies:
```
mamba create -c conda-forge -c bioconda -n truvari truvari
```
Note: as of May 4, 2022 mamba is up to date with Truvari's v3.2.0 release. Until we implement an auto-deployment to our github actions, mamba official releases may lag behind.

Building from develop
=====================
The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled.

Docker
======
See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container.