GATK

Modules included in this section

GATK_CatVariants
GATK_gvcf
GATK_hard_filters
GATK_merge_gvcf
GATK_pre_processing
GATK_SelectVariants
GATK_VQSR
GenotypeGVCFs
Picard_CollectAlignmentSummaryMatrics
Picard_CollectVariantCalling
VEP

`GATK_CatVariants`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for concatenating per-chromosome VCF files into one VCF file per sample.

Attention

The module generates a script for each sample-chromosome combination.

The programs included in the module are the following:

CatVariants (GATK)

Requires

self.sample_data[sample][chr]["GATK_vcf"]

Output

self.sample_data[sample]["vcf"]

Parameters that can be set

Parameter	Values	Comments
genome_reference
chrom_list	Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GATK_CatVariants1:
    module: GATK_CatVariants
    base: GATK_SelectVariants_VEPfiltered
    script_path:     /path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
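
As a rough illustration of what the concatenation step amounts to, the sketch below splits a `chrom_list` string and assembles a CatVariants command line with one `-V` input per chromosome, in list order. The helper names (`parse_chrom_list`, `cat_variants_command`) and the per-chromosome file names are hypothetical, and the command actually generated by the module may differ.

```python
def parse_chrom_list(chrom_list):
    """Split a comma-separated chrom_list string into clean chromosome names."""
    return [name.strip() for name in chrom_list.split(",") if name.strip()]


def cat_variants_command(script_path, genome_reference, per_chrom_vcfs, chroms, out_vcf):
    """Build a CatVariants command joining per-chromosome VCFs in chrom_list order."""
    parts = [script_path, "-R", genome_reference]
    for chrom in chroms:
        # One -V argument per chromosome; order follows chrom_list
        parts += ["-V", per_chrom_vcfs[chrom]]
    parts += ["-out", out_vcf]
    return " ".join(parts)


chroms = parse_chrom_list("1, 2, X, MT")
vcfs = {c: f"sample1.chr{c}.vcf" for c in chroms}  # hypothetical file names
cmd = cat_variants_command(
    "/path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants",
    "/path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta",
    vcfs, chroms, "sample1.vcf")
print(cmd)
```

Keeping the inputs in `chrom_list` order matters because CatVariants concatenates files rather than sorting their records.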

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the Genome Analysis Toolkit best practices pipeline.” Current Protocols in Bioinformatics 43.1 (2013): 11.10.1-11.10.33.

`GATK_gvcf`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating a gVCF file from a BAM file.

Attention

The module generates a script for each sample-chromosome combination.

The programs included in the module are the following:

HaplotypeCaller (GATK)

Requires

self.sample_data[sample]["bam"]

Output

self.sample_data[sample][chr]["GATK_g.vcf"]

Parameters that can be set

Parameter	Values	Comments
genome_reference
chrom_list	Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GATK_gvcf:  # TODO: check -nct for parallelization and address the memory problem
    module: GATK_gvcf
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    qsub_params:
        -pe:      shared 15
    redirects:
        -nct: 15
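
To make the "one script per sample-chromosome" behaviour concrete, here is a minimal sketch that builds one HaplotypeCaller command per chromosome for a single sample, restricting each run with `-L` and emitting a gVCF. The function name and the output file naming are assumptions for illustration only; the module's real scripts may contain additional arguments.

```python
def haplotype_caller_commands(script_path, genome_reference, bam, sample, chroms, nct=None):
    """Return {chromosome: command} for per-chromosome gVCF calling of one sample."""
    commands = {}
    for chrom in chroms:
        parts = [
            script_path, "-T", "HaplotypeCaller",
            "-R", genome_reference,
            "-I", bam,
            "-L", chrom,                      # restrict the run to one chromosome
            "--emitRefConfidence", "GVCF",    # produce a gVCF rather than a plain VCF
            "-o", f"{sample}.{chrom}.g.vcf",  # hypothetical output naming
        ]
        if nct is not None:
            parts += ["-nct", str(nct)]       # CPU threads, as in the redirects above
        commands[chrom] = " ".join(parts)
    return commands


cmds = haplotype_caller_commands(
    "/path/to/java -jar /path/to/GenomeAnalysisTK.jar",
    "/path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta",
    "sample1.bam", "sample1", ["1", "X"], nct=15)
for chrom, cmd in cmds.items():
    print(chrom, cmd)
```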

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the Genome Analysis Toolkit best practices pipeline.” Current Protocols in Bioinformatics 43.1 (2013): 11.10.1-11.10.33.

`GATK_hard_filters`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for applying hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

SelectVariants and VariantFiltration (GATK)

Requires

self.sample_data[chr]["vcf"]

Output

self.sample_data[chr]["vcf"]

Parameters that can be set

Parameter	Values	Comments
genome_reference
chrom_list		Comma-separated list of chromosome names as mentioned in the BAM file
filterExpression_SNP		Filter expression for SNPs
filterExpression_INDEL		Filter expression for INDELs

Lines for parameter file

GATK_hard_filters1:
    module: GATK_hard_filters 
    base: GenotypeGVCFs1
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    filterExpression_SNP: '"QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"'
    filterExpression_INDEL: '"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8"'
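
The SNP filter expression above is a logical OR of per-annotation thresholds: a site is flagged as soon as any one condition holds. The sketch below re-expresses that logic in Python for a dictionary of annotation values. It is only an illustration of the expression's meaning, not how GATK evaluates it, and it assumes that a missing annotation simply does not trigger its condition.

```python
def fails_snp_hard_filter(annotations):
    """Return True if any condition of the example SNP filter expression holds."""
    checks = [
        ("QD", lambda v: v < 2.0),
        ("MQ", lambda v: v < 40.0),
        ("FS", lambda v: v > 60.0),
        ("SOR", lambda v: v > 3.0),
        ("MQRankSum", lambda v: v < -12.5),
        ("ReadPosRankSum", lambda v: v < -8.0),
    ]
    # Assumption: an annotation absent from the record never triggers the filter
    return any(name in annotations and test(annotations[name])
               for name, test in checks)


good_site = {"QD": 25.0, "MQ": 60.0, "FS": 1.2, "SOR": 0.8}
low_qd_site = {"QD": 1.5, "MQ": 60.0}
print(fails_snp_hard_filter(good_site))    # False
print(fails_snp_hard_filter(low_qd_site))  # True
```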

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the Genome Analysis Toolkit best practices pipeline.” Current Protocols in Bioinformatics 43.1 (2013): 11.10.1-11.10.33.

`GATK_merge_gvcf`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for combining g.vcf files into cohorts.

Attention

The module generates a script for each sample-chromosome combination.

The programs included in the module are the following:

CombineGVCFs (GATK)

Requires

self.sample_data[sample][chr]["GATK_g.vcf"]

Output

self.sample_data["cohorts"]

Parameters that can be set

Parameter	Values	Comments
genome_reference
chrom_list	Comma-separated list of chromosome names as mentioned in the BAM file
cohort_size		Number of g.vcf files in each cohort

Lines for parameter file

gatk_merge_gvcf:
    module: GATK_merge_gvcf
    base: GATK_gvcf
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    cohort_size: 10
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
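
One plausible reading of `cohort_size` is that the g.vcf files of a chromosome are split into consecutive groups of at most that many files, each group being combined into one cohort. The sketch below shows that grouping; the function name and file names are hypothetical, and the module's actual assignment of samples to cohorts is not specified here.

```python
def split_into_cohorts(gvcf_files, cohort_size):
    """Split a list of g.vcf paths into consecutive cohorts of at most cohort_size."""
    if cohort_size < 1:
        raise ValueError("cohort_size must be at least 1")
    return [gvcf_files[i:i + cohort_size]
            for i in range(0, len(gvcf_files), cohort_size)]


# 25 hypothetical per-sample g.vcf files for chromosome 1
gvcfs = [f"sample{n}.1.g.vcf" for n in range(1, 26)]
cohorts = split_into_cohorts(gvcfs, 10)
print([len(c) for c in cohorts])  # [10, 10, 5]
```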

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the Genome Analysis Toolkit best practices pipeline.” Current Protocols in Bioinformatics 43.1 (2013): 11.10.1-11.10.33.

`GATK_pre_processing`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating GATK-ready BAM files from fastq files.

Attention

The module does not include the base quality score recalibration (BQSR) step.

The programs included in the module are the following:

FastqToSam: Picard tool to generate a uBAM from fastq files
MarkIlluminaAdapters: Picard tool to mark Illumina adapters
SamToFastq: Picard tool to convert the uBAM back to fastq
MergeBamAlignment: Picard tool to merge the aligned BAM with the uBAM
MarkDuplicates: Picard tool to remove PCR duplicates
BWA MEM: read mapping with BWA MEM

Requires

Fastq files in the following locations:
- self.sample_data[sample]["fastq.F"]
- self.sample_data[sample]["fastq.R"]

Output

self.sample_data[sample]["bam"]

Parameters that can be set

Parameter	Values	Comments
picard_path	path to PICARD	Full path to the PICARD .jar file
bwa_mem_path
genome_reference

Lines for parameter file

GATK_pre_processing:
    module: GATK_pre_processing
    base: fQC_trim
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    picard_path:     /path/to/picard.jar
    bwa_mem_path:    /path/to/bwa mem
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    threads: 20
    qsub_params:
        -pe: shared 20

References

http://broadinstitute.github.io/picard/

`GATK_SelectVariants`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for splitting a per-chromosome multi-VCF into one VCF per sample per chromosome.

Attention

The module generates a script for each sample/chromosome.

The programs included in the module are the following:

SelectVariants (GATK)

Requires

self.sample_data[chr]["vcf"]

Output

self.sample_data[sample][chr]["GATK_vcf"]

Parameters that can be set

Parameter	Values	Comments
genome_reference		path to reference genome
chrom_list		Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GATK_SelectVariants_VEPfiltered:
    module: GATK_SelectVariants
    base: VEP1
    script_path: /path/to/GenomeAnalysisTK.jar        
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    redirects:
        --setFilteredGtToNocall: null

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the Genome Analysis Toolkit best practices pipeline.” Current Protocols in Bioinformatics 43.1 (2013): 11.10.1-11.10.33.

`GATK_VQSR`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for applying VQSR filters.

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

VariantRecalibrator and ApplyRecalibration (GATK)

Requires

self.sample_data[chr]["vcf"]

Output

self.sample_data[chr]["vcf"]

Parameters that can be set

Parameter	Values	Comments
genome_reference
chrom_list		Comma-separated list of chromosome names as mentioned in the BAM file
ts_filter_level_SNP		Truth sensitivity filter level for SNPs
ts_filter_level_INDEL		Truth sensitivity filter level for INDELs
resource_SNP
resource_INDEL

Lines for parameter file

GATK_VQSR1:
    module: GATK_VQSR 
    base: GenotypeGVCFs1
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:   /path/to/bundle/b37/human_g1k_v37_decoy.fasta
    resource_SNP: 
        - hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/bundle/b37/hapmap_3.3.b37.vcf
        - omni,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/1000G_omni2.5.b37.vcf
        - 1000G,known=false,training=true,truth=false,prior=10.0 /path/to/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    resource_INDEL: 
        - mills,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf 
    ts_filter_level_SNP: 99.0
    ts_filter_level_INDEL: 99.0
    maxGaussians: 4
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
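
Each `resource_SNP` / `resource_INDEL` entry packs a resource name, four settings and a file path into one string. The sketch below parses such an entry and rebuilds the `-resource:` argument form used by GATK 3 VariantRecalibrator. The helper names are hypothetical, and how the module itself passes these entries to GATK is an assumption.

```python
def parse_resource(entry):
    """Parse 'name,key=value,... /path/to/file.vcf' into a dictionary."""
    spec, path = entry.strip().split(None, 1)
    name, *pairs = spec.split(",")
    flags = dict(pair.split("=", 1) for pair in pairs)
    return {
        "name": name,
        "path": path,
        "known": flags["known"] == "true",
        "training": flags["training"] == "true",
        "truth": flags["truth"] == "true",
        "prior": float(flags["prior"]),
    }


def resource_argument(entry):
    """Turn a resource entry into a '-resource:<spec> <path>' argument string."""
    spec, path = entry.strip().split(None, 1)
    return f"-resource:{spec} {path}"


hapmap = "hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/bundle/b37/hapmap_3.3.b37.vcf"
print(parse_resource(hapmap))
print(resource_argument(hapmap))
```

Parsing the entries up front is a cheap way to catch a malformed line (for example a missing `prior=`) before a long recalibration job is submitted.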

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the Genome Analysis Toolkit best practices pipeline.” Current Protocols in Bioinformatics 43.1 (2013): 11.10.1-11.10.33.

`GenotypeGVCFs`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for performing joint genotyping on gVCF files produced by HaplotypeCaller.

Attention

The module generates a script for each cohort-chromosome combination.

The programs included in the module are the following:

GenotypeGVCFs (GATK)

Requires

self.sample_data["cohorts"]

Output

self.sample_data[chr]["vcf"]

Parameters that can be set

Parameter	Values	Comments
genome_reference
chrom_list		Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GenotypeGVCFs1:
    module: GenotypeGVCFs
    base: gatk_merge_gvcf
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the Genome Analysis Toolkit best practices pipeline.” Current Protocols in Bioinformatics 43.1 (2013): 11.10.1-11.10.33.

`Picard_CollectAlignmentSummaryMatrics`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for collecting mapping statistics with CollectAlignmentSummaryMetrics from Picard tools.

The programs included in the module are the following:

CollectAlignmentSummaryMetrics from Picard tools.

Requires

A BAM file in the following location:
- self.sample_data[sample]["bam"]

Output

Parameters that can be set

Parameter	Values	Comments
genome_reference

Lines for parameter file

Picard_CollectAlignmentSummaryMatrics1:
    module: Picard_CollectAlignmentSummaryMatrics
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/picard-1.139/dist/picard.jar
    genome_reference:    /path/to/bundle/b37/human_g1k_v37_decoy.fasta

References

http://broadinstitute.github.io/picard/

`Picard_CollectVariantCalling`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating SNP and indel statistics information

The programs included in the module are the following:

CollectVariantCallingMetrics: Picard tool that generates a collection of metrics relating to SNPs and indels within a variant-calling file (VCF)

Requires

A VCF file in the following location:
- self.sample_data[chr]["vcf"]

Output

Lines for parameter file

Picard_CollectVariantCalling1:
    module: Picard_CollectVariantCalling 
    base: GATK_hard_filters1
    script_path: /path/to/java -jar /path/to/picard.jar
    DBSNP: /path/to/bundle/b37/dbsnp_138.b37.vcf
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"

References

http://broadinstitute.github.io/picard/

`VEP`

Authors: Michal Gordon
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for annotating the multi-VCF file.

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

VEP (Variant Effect Predictor)

Requires

self.sample_data[chr]["vcf"]

Output

self.sample_data[chr]["vcf"] - annotated multi-VCF per chromosome

Parameters that can be set

Parameter	Values	Comments
chrom_list	Comma-separated list of chromosome names as mentioned in the BAM file

Note

VEP parameters can be passed via redirects

Lines for parameter file

VEP1:
    module: VEP 
    base: GATK_hard_filters1
    script_path: /path/to/vep
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    redirects:
        --format: vcf
        --offline: null
        --species: homo_sapiens
        --fork: 10
        --assembly: GRCh37
        --max_af: null
        --pick: null
        --dir: /path/to/VEP/ensembl-vep-release-88.10/cache
        --check_existing: null
        --symbol: null
        --force_overwrite: null
        --vcf: null
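
In the `redirects` block above, some options carry a value (`--fork: 10`) while others are `null`. A natural interpretation, inferred from the example rather than stated in this section, is that a `null` value means the option is passed as a bare flag. The sketch below applies that interpretation to a dictionary as it would look after loading the YAML; the function name is hypothetical.

```python
def redirects_to_args(redirects):
    """Render a redirects mapping as a command-line argument string.

    Assumption: a value of None (YAML null) means the option is a bare flag.
    """
    parts = []
    for flag, value in redirects.items():
        parts.append(flag)
        if value is not None:
            parts.append(str(value))
    return " ".join(parts)


vep_redirects = {
    "--format": "vcf",
    "--offline": None,
    "--species": "homo_sapiens",
    "--fork": 10,
    "--assembly": "GRCh37",
    "--vcf": None,
}
print("/path/to/vep " + redirects_to_args(vep_redirects))
```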

References

McLaren, William, et al. “The Ensembl Variant Effect Predictor.” Genome Biology 17.1 (2016): 122.