Skip to content

Commit

Permalink
Merge pull request #30 from Illumina/GT-808
Browse files Browse the repository at this point in the history
GT-808 v2.3 release
  • Loading branch information
traxexx authored Jul 25, 2019
2 parents 9648877 + 353c74c commit cd9a580
Show file tree
Hide file tree
Showing 6 changed files with 40 additions and 15 deletions.
33 changes: 22 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,10 @@
* [Introduction](#Introduction)
* [Installation](#Installation)
* [Run Paragraph from VCF](#RunParagraphFromVCF)
* [Example](#Example)
* [Test example](#TestExample)
* [Input requirements](#InputRequirements)
* [Run time](#RunTime)
* [Population-scale genotyping](#PopulationScaleGenotyping)
* [Run Paragraph on complex variants](#RunParagraphOnComplexVariants)
* [Further Information](#FurtherInformation)
* [Documentation](#Documentation)
Expand Down Expand Up @@ -37,7 +39,7 @@ Genotyping calls in this paper can be found at [paper-data/download-instructions
Please check [doc/Installation.md](doc/Installation.md) for system requirements and installation instructions.

## <a name='RunParagraphFromVCF'></a>Run Paragraph from VCF
### <a name='Example'></a>Example
### <a name='TestExample'></a>Test example
After installation, run `multigrmpy.py` script from the build/bin directory on an example dataset as follows:

```bash
Expand Down Expand Up @@ -72,9 +74,6 @@ If successful, the last 3 lines of genotypes.vcf.gz will the same as in [expecte
### VCF format
paraGRAPH will independently genotype each entry of the input VCF. You can use either indel-style representation (full REF and ALT allele sequence in 4th and 5th columns) or symbolic alleles, as long as they meet the format requirement of VCF 4.0+.

It typically takes up to a few seconds to genotype a single event in one sample (single-threaded).
It took us 30 minutes to genotype ~20,000 SVs using 20 CPU cores (with I/O).

Currently we support 4 symbolic alleles:
- `<DEL>` for deletion
- Must have END key in INFO field.
Expand All @@ -90,20 +89,32 @@ Currently we support 4 symbolic alleles:
Must be tab-deliemited.

Required columns:
- ID: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest
- id: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest
- path: Path to the BAM/CRAM file.
- depth: Average depth across the genome. Can be calculated with bin/idxdepth or samtools.
- depth: Average depth across the genome. Can be calculated with bin/idxdepth (faster than samtools).
- read length: Average read length (bp) across the genome.

Optional columns:

- depth sd: Specify standard deviation for genome depth. Used for the normal test of breakpoint read depth. Default is sqrt(5*depth).

- depth variance: Square of depth sd.
- sex: Affects chrX and chrY genotyping. Allow "male" or "M", "female" or "F", and "unknown" (quotes shouldn't be included in the manifest). If not specified, the sample will be treated as unknown.

## <a name='RunTime'></a>Run time

- On a 30x HiSeqX sample, Paragraph typically takes 1-2 seconds to genotype a simple SV in confident regions.

- If the SV is in a low-complexity region with abnormal read pileups, the running time could vary.

- For efficiency, it is recommended to manually set the "-M" option (maximum allowed read count for a variant) to skip these high-depth regions. We recommend "-M" as 20 times of your mean sample depth.

## <a name='PopulationScaleGenotyping'></a>Population-scale genotyping

- sex: Affects chrX and chrY genotyping.
Allow "male" or "M", "female" or "F", and "unknown" (quotes shouldn't be included in the manifest).
If not specified, the sample will be treated as unknown.
To efficiently genotype SVs across a population, we recommend doing single-sample mode as follows:
- Create a manifest for each single sample
- Run `multigrmpy.py` for each manifest. Be sure to set "-M" option for each sample according to its depth.
- Multithreading (option "-t") is highly recommended for population-scale genotyping
- Merge all `genotypes.vcf.gz` to create a big VCF of all samples. You can use either `bcftools merge` or your custom script.

## <a name='RunParagraphOnComplexVariants'></a>Run Paragraph on complex variants
For more complicated events (e.g. genotype a deletion together with its nearby SNP), you can provide a custimized JSON to paraGRAPH:
Expand Down
5 changes: 4 additions & 1 deletion src/c++/include/idxdepth/Parameters.hh
Original file line number Diff line number Diff line change
Expand Up @@ -27,17 +27,19 @@ namespace idxdepth
class Parameters
{
public:
explicit Parameters(std::string bam_path, std::string bam_index_path, std::string reference_path)
explicit Parameters(std::string bam_path, std::string bam_index_path, std::string reference_path, bool altcontig)
: bam_path_(std::move(bam_path))
, bam_index_path_(std::move(bam_index_path))
, reference_path_(std::move(reference_path))
, alt_contig_(altcontig)
, threads_(1)
{
}

const std::string& bam_path() const { return bam_path_; }
const std::string& bam_index_path() const { return bam_index_path_; }
const std::string& reference_path() const { return reference_path_; }
bool include_alt_contig() const { return alt_contig_; }

const std::string& include_regex() const { return include_regex_; }
void set_include_regex(std::string const& ar) { include_regex_ = ar; }
Expand All @@ -53,6 +55,7 @@ private:
std::string bam_path_;
std::string bam_index_path_;
std::string reference_path_;
bool alt_contig_;
std::string include_regex_;
std::string autosome_regex_;
std::string sex_chromosome_regex_;
Expand Down
1 change: 1 addition & 0 deletions src/c++/lib/common/KlibImpl.hh
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ extern "C" {
#include "ksw.h"
}

#include <boost/core/noncopyable.hpp>
#include <cstdlib>
#include <cstring>
#include <memory>
Expand Down
6 changes: 6 additions & 0 deletions src/c++/lib/idxdepth/DepthEstimation.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,12 @@ Json::Value estimateDepths(Parameters const& parameters)
for (int i = 0; i < header->n_targets; ++i)
{
const std::string contig = std::string(header->target_name[i]);

if (!parameters.include_alt_contig() && contig.length() > 5)
{
continue;
}

if (parameters.include_regex().empty() || std::regex_match(contig, include_regex))
{
bam_chromosomes.insert(contig);
Expand Down
3 changes: 2 additions & 1 deletion src/c++/main/idxdepth.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ int main(int argc, char const* argv[])
("output,o", po::value<string>(), "Output file name. Will output to stdout if omitted.")
("output-bins,O", po::value<string>(), "Output binned coverage in tsv format.")
("reference,r", po::value<string>(), "FASTA with reference genome")
("altcontig",po::value<bool>()->default_value(false), "Include ALT contigs in estimation")
("include-regex,I", po::value<string>()->default_value(""), "Regex to identify contigs to include")
("autosome-regex", po::value<string>()->default_value("(chr)?[1-9][0-9]?"),
"Regex to identify autosome chromosome names (default: '(chr)?[1-9][0-9]?'")
Expand Down Expand Up @@ -131,7 +132,7 @@ int main(int argc, char const* argv[])
logger->info("Output path for binned coverage: {}", output_path_bins);
}

Parameters parameters(bam_path, bam_index_path, reference_path);
Parameters parameters(bam_path, bam_index_path, reference_path, vm["altcontig"].as<bool>());
parameters.set_include_regex(vm["include-regex"].as<string>());
parameters.set_autosome_regex(vm["autosome-regex"].as<string>());
parameters.set_sex_chromosome_regex(vm["sex-chromosome-regex"].as<string>());
Expand Down
7 changes: 5 additions & 2 deletions src/python/lib/grm/vcfgraph/vcfupdate.py
Original file line number Diff line number Diff line change
Expand Up @@ -157,8 +157,11 @@ def update_vcf_from_grmpy(inVcfFilename, grmpyOutput, outVcfFilename, sample_nam
# start at the same position but come in different records. This could be fixed
# by collapsing these into single VCF records as alleles before running
# through paragraph.
record = header.new_record(contig=raw_record.chrom, start=raw_record.start, stop=raw_record.stop,
alleles=raw_record.alleles, id=raw_record.id, qual=raw_record.qual, filter=raw_record.filter, info=raw_record.info)
try:
record = header.new_record(contig=raw_record.chrom, start=raw_record.start, stop=raw_record.stop,
alleles=raw_record.alleles, id=raw_record.id, qual=raw_record.qual, filter=raw_record.filter, info=raw_record.info)
except:
raise Exception("Format error in vcf line: " + str(raw_record))

varIdCounts = defaultdict(int)
varId = VCFGraph.generate_variant_id(record, varIdCounts)
Expand Down

0 comments on commit cd9a580

Please sign in to comment.