GATK

GATK_CatVariants

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for concatenating per-chromosome VCF files into one VCF file per sample.

Attention

The module generates a script for each sample-chromosome combination.

The programs included in the module are the following:

  • CatVariants (GATK)

Requires

  • self.sample_data[sample][chr]["GATK_vcf"]

Output

  • self.sample_data[sample]["vcf"]
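
The Requires and Output entries above suggest a nested dictionary keyed by sample and then by chromosome. The following is a minimal sketch of that layout, assuming `sample_data` behaves like a plain Python dict. The sample name, the file paths and the helper `vcfs_in_order` are hypothetical and are not part of the module.

```python
# Hypothetical stand-in for self.sample_data; names and paths are invented for illustration.
sample_data = {
    "sampleA": {
        "1": {"GATK_vcf": "/data/sampleA.chr1.vcf"},
        "2": {"GATK_vcf": "/data/sampleA.chr2.vcf"},
        "X": {"GATK_vcf": "/data/sampleA.chrX.vcf"},
    }
}


def vcfs_in_order(sample_data, sample, chroms):
    """Return the per-chromosome VCF paths of one sample, in the order given by chroms."""
    return [sample_data[sample][chrom]["GATK_vcf"] for chrom in chroms]


inputs = vcfs_in_order(sample_data, "sampleA", ["1", "2", "X"])

# The documented output slot for the concatenated file is sample_data[sample]["vcf"].
sample_data["sampleA"]["vcf"] = "/data/sampleA.vcf"
print(inputs)
```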

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |

Lines for parameter file

GATK_CatVariants1:
    module: GATK_CatVariants
    base: GATK_SelectVariants_VEPfiltered
    script_path:     /path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
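
The `chrom_list` value is a single quoted string, so it has to be split into individual chromosome names before use. The sketch below shows one way to do that. The function name `parse_chrom_list` is hypothetical, and how the module itself parses the string is an assumption.

```python
def parse_chrom_list(chrom_list):
    """Split a comma-separated chrom_list string and strip surrounding whitespace."""
    return [name.strip() for name in chrom_list.split(",") if name.strip()]


chroms = parse_chrom_list("1, 2, 3, X, Y, MT")
print(chroms)  # ['1', '2', '3', 'X', 'Y', 'MT']
```

Stripping whitespace matters here because the example value separates names with a comma followed by a space, while the names in the BAM header contain no spaces.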

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.

GATK_gvcf

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating a gVCF file from a BAM file.

Attention

The module generates a script for each sample-chromosome combination.

The programs included in the module are the following:

  • HaplotypeCaller (GATK)

Requires

  • self.sample_data[sample]["bam"]

Output

  • self.sample_data[sample][chr]["GATK_g.vcf"]

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |

Lines for parameter file

GATK_gvcf:  # check about -nct for parallelization and deal with memory problems
    module: GATK_gvcf
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    qsub_params:
        -pe:      shared 15
    redirects:
        -nct: 15
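
Because a script is generated for each sample-chromosome combination, the number of jobs is the number of samples times the number of entries in `chrom_list`. The sketch below enumerates those combinations. The sample names and the `jobs` structure are hypothetical, and only the `GATK_g.vcf` output key comes from the module description.

```python
from itertools import product

samples = ["sampleA", "sampleB"]  # hypothetical sample names
chroms = ["1", "2", "MT"]         # a shortened chrom_list for illustration

# One job per (sample, chromosome) pair; each job fills
# sample_data[sample][chrom]["GATK_g.vcf"] according to the Output section.
jobs = [
    {"sample": sample, "chrom": chrom, "output_key": (sample, chrom, "GATK_g.vcf")}
    for sample, chrom in product(samples, chroms)
]

print(len(jobs))  # 6
```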

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.

GATK_hard_filters

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for applying hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

  • SelectVariants and VariantFiltration (GATK)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[chr]["vcf"]

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |
| filterExpression_SNP | | Filter expression for SNPs |
| filterExpression_INDEL | | Filter expression for INDELs |

Lines for parameter file

GATK_hard_filters1:
    module: GATK_hard_filters 
    base: GenotypeGVCFs1
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    filterExpression_SNP: '"QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"'
    filterExpression_INDEL: '"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8"'
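
The filter expressions above join their conditions with `||`, so a variant is marked as filtered when any single condition is true. The sketch below mirrors the SNP expression in plain Python to make that logic explicit. It is an illustration only and does not call GATK. The names `SNP_RULES` and `fails_any` are hypothetical, and skipping missing annotations is a simplifying assumption that may differ from how GATK evaluates the expression.

```python
# Thresholds copied from filterExpression_SNP in the parameter file above.
SNP_RULES = [
    ("QD", "<", 2.0),
    ("MQ", "<", 40.0),
    ("FS", ">", 60.0),
    ("SOR", ">", 3.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]


def fails_any(annotations, rules):
    """Return True if at least one rule matches, i.e. the variant would be filtered."""
    for name, op, threshold in rules:
        value = annotations.get(name)
        if value is None:
            continue  # assumption: a missing annotation never triggers a rule
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


good = {"QD": 25.1, "MQ": 60.0, "FS": 1.2, "SOR": 0.8,
        "MQRankSum": 0.1, "ReadPosRankSum": 0.4}
bad = dict(good, FS=75.0)  # same variant, but with strong strand bias

print(fails_any(good, SNP_RULES), fails_any(bad, SNP_RULES))  # False True
```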

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.

GATK_merge_gvcf

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for combining g.vcf files into cohorts.

Attention

The module generates a script for each sample-chromosome combination.

The programs included in the module are the following:

  • CombineGVCFs (GATK)

Requires

  • self.sample_data[sample][chr]["GATK_g.vcf"]

Output

  • self.sample_data["cohorts"]

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |
| cohort_size | | Number of g.vcf files in each cohort |

Lines for parameter file

gatk_merge_gvcf:
    module: GATK_merge_gvcf
    base: GATK_gvcf
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    cohort_size: 10
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
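
The `cohort_size` parameter caps how many g.vcf files go into one cohort. The sketch below shows one plausible grouping, consecutive chunks of at most `cohort_size` files. The function `split_into_cohorts` and the file names are hypothetical, and the order in which the module actually assigns samples to cohorts is an assumption.

```python
def split_into_cohorts(gvcf_paths, cohort_size):
    """Group g.vcf paths into consecutive cohorts of at most cohort_size files."""
    if cohort_size < 1:
        raise ValueError("cohort_size must be at least 1")
    return [
        gvcf_paths[start:start + cohort_size]
        for start in range(0, len(gvcf_paths), cohort_size)
    ]


# 23 hypothetical per-sample g.vcf files for chromosome 1
gvcfs = ["sample%02d.chr1.g.vcf" % n for n in range(1, 24)]
cohorts = split_into_cohorts(gvcfs, 10)

print([len(cohort) for cohort in cohorts])  # [10, 10, 3]
```

With `cohort_size: 10`, 23 samples yield three cohorts, the last one holding the remaining three files.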

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.

GATK_pre_processing

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating GATK-ready BAM files from fastq files.

Attention

The module does not include the base quality score recalibration (BQSR) step.

The programs included in the module are the following:

  • FastqToSam (Picard): generates an unmapped BAM (uBAM)

  • MarkIlluminaAdapters (Picard): marks Illumina adapters

  • SamToFastq (Picard): converts the uBAM to fastq

  • MergeBamAlignment (Picard): merges the aligned BAM with the uBAM

  • MarkDuplicates (Picard): removes PCR duplicates

  • BWA MEM: maps the reads

Requires

  • Fastq files in the following locations:

    • self.sample_data[sample]["fastq.F"]

    • self.sample_data[sample]["fastq.R"]

Output

  • self.sample_data[sample]["bam"]

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| picard_path | Path to Picard | Full path to the Picard .jar file |
| bwa_mem_path | | |
| genome_reference | | |

Lines for parameter file

GATK_pre_processing:
    module: GATK_pre_processing
    base: fQC_trim
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    picard_path:     /path/to/picard.jar
    bwa_mem_path:    /path/to/bwa mem
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    threads: 20
    qsub_params:
        -pe: shared 20

References

http://broadinstitute.github.io/picard/

GATK_SelectVariants

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for splitting a per-chromosome multi-VCF into one VCF per sample per chromosome.

Attention

The module generates a script for each sample/chromosome.

The programs included in the module are the following:

  • SelectVariants (GATK)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[sample][chr]["GATK_vcf"]

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | Path to the reference genome |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |

Lines for parameter file

GATK_SelectVariants_VEPfiltered:
    module: GATK_SelectVariants
    base: VEP1
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    redirects:
        --setFilteredGtToNocall: null

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.

GATK_VQSR

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for applying VQSR filters.

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

  • VariantRecalibrator and ApplyRecalibration (GATK)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[chr]["vcf"]

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |
| ts_filter_level_SNP | | Truth sensitivity filter level for SNPs |
| ts_filter_level_INDEL | | Truth sensitivity filter level for INDELs |
| resource_SNP | | |
| resource_INDEL | | |

Lines for parameter file

GATK_VQSR1:
    module: GATK_VQSR 
    base: GenotypeGVCFs1
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:   /path/to/bundle/b37/human_g1k_v37_decoy.fasta
    resource_SNP: 
        - hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/bundle/b37/hapmap_3.3.b37.vcf
        - omni,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/1000G_omni2.5.b37.vcf
        - 1000G,known=false,training=true,truth=false,prior=10.0 /path/to/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    resource_INDEL: 
        - mills,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf 
    ts_filter_level_SNP: 99.0
    ts_filter_level_INDEL: 99.0
    maxGaussians: 4
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
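
Each `resource_SNP` and `resource_INDEL` entry packs a resource name, four settings and a file path into one string, which appears to follow the form of the GATK VariantRecalibrator `-resource` argument. The sketch below splits such an entry into its parts so the settings can be inspected or validated. The function `parse_resource` and the returned dictionary layout are hypothetical and are not part of the module.

```python
def parse_resource(entry):
    """Split 'name,key=value,... /path/to/file.vcf' into a dictionary."""
    tags, path = entry.split(None, 1)  # tag string and file path are separated by whitespace
    name, *pairs = tags.split(",")
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {
        "name": name,
        "known": fields["known"] == "true",
        "training": fields["training"] == "true",
        "truth": fields["truth"] == "true",
        "prior": float(fields["prior"]),
        "path": path.strip(),
    }


hapmap = parse_resource(
    "hapmap,known=false,training=true,truth=true,prior=15.0 "
    "/path/to/bundle/b37/hapmap_3.3.b37.vcf"
)
print(hapmap["name"], hapmap["truth"], hapmap["prior"])  # hapmap True 15.0
```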

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.

GenotypeGVCFs

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for performing joint genotyping on gVCF files produced by HaplotypeCaller.

Attention

The module generates a script for each cohort-chromosome combination.

The programs included in the module are the following:

  • GenotypeGVCFs (GATK)

Requires

  • self.sample_data["cohorts"]

Output

  • self.sample_data[chr]["vcf"]

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |

Lines for parameter file

GenotypeGVCFs1:
    module: GenotypeGVCFs
    base: gatk_merge_gvcf
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.

Picard_CollectAlignmentSummaryMatrics

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for collecting mapping statistics with CollectAlignmentSummaryMetrics from Picard tools.

The programs included in the module are the following:

  • CollectAlignmentSummaryMetrics (Picard)

Requires

  • A BAM file in the following location:

    • self.sample_data[sample]["bam"]

Output

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| genome_reference | | |

Lines for parameter file

Picard_CollectAlignmentSummaryMatrics1:
    module: Picard_CollectAlignmentSummaryMatrics
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/picard-1.139/dist/picard.jar
    genome_reference:    /path/to/bundle/b37/human_g1k_v37_decoy.fasta

References

http://broadinstitute.github.io/picard/

Picard_CollectVariantCalling

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating SNP and indel statistics information

The programs included in the module are the following:

  • CollectVariantCallingMetrics (Picard): generates a collection of metrics relating to SNPs and indels within a variant-calling file (VCF)

Requires

  • A VCF file in the following location:

    • self.sample_data[chr]["vcf"]

Output

Lines for parameter file

Picard_CollectVariantCalling1:
    module: Picard_CollectVariantCalling 
    base: GATK_hard_filters1
    script_path: /path/to/java -jar /path/to/picard.jar
    DBSNP: /path/to/bundle/b37/dbsnp_138.b37.vcf
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"

References

http://broadinstitute.github.io/picard/

VEP

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for annotating the multi-VCF file.

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

  • VEP (Ensembl Variant Effect Predictor)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[chr]["vcf"] - annotated multi-VCF per chromosome

Parameters that can be set

| Parameter | Values | Comments |
| --- | --- | --- |
| chrom_list | | Comma-separated list of chromosome names as mentioned in the BAM file |

Note

VEP parameters can be passed via redirects

Lines for parameter file

VEP1:
    module: VEP 
    base: GATK_hard_filters1
    script_path: /path/to/vep
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    redirects:
        --format: vcf
        --offline: null
        --species: homo_sapiens
        --fork: 10
        --assembly: GRCh37
        --max_af: null
        --pick: null
        --dir: /path/to/VEP/ensembl-vep-release-88.10/cache
        --check_existing: null
        --symbol: null
        --force_overwrite: null
        --vcf: null
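
In the `redirects` block above, some options carry a value (`--fork: 10`) and others are set to `null`. A YAML `null` loads as `None` in Python, and the sketch below assumes that such entries are passed to the program as bare flags without a value. The function `render_redirects` is hypothetical and only illustrates that assumed mapping from the parameter file to a command line.

```python
def render_redirects(redirects):
    """Render a redirects mapping as command-line arguments.

    Assumption: a value of None (YAML null) means the flag takes no argument.
    """
    parts = []
    for flag, value in redirects.items():
        parts.append(flag if value is None else "%s %s" % (flag, value))
    return " ".join(parts)


redirects = {
    "--format": "vcf",
    "--offline": None,
    "--species": "homo_sapiens",
    "--fork": 10,
}
print(render_redirects(redirects))
# --format vcf --offline --species homo_sapiens --fork 10
```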

References

McLaren, William, et al. “The Ensembl Variant Effect Predictor.” Genome biology 17.1 (2016): 122.