GATK¶
Modules included in this section
GATK_CatVariants
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for concatenating the per-chromosome VCF files into one VCF file for each sample.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
CatVariants
(GATK)
Requires¶
self.sample_data[sample][chr]["GATK_vcf"]
Output¶
self.sample_data[sample]["vcf"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
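The `chrom_list` value is a single comma-separated string, so it has to be split into individual chromosome names before use. A minimal sketch of such a split follows; the helper name `parse_chrom_list` is hypothetical and is not taken from the module's source.

```python
def parse_chrom_list(chrom_list):
    """Split a comma-separated chrom_list string into clean chromosome names.

    Hypothetical helper for illustration; not part of the module's code.
    """
    # Strip the whitespace around each name and drop empty entries
    # (for example, those produced by a trailing comma).
    return [name.strip() for name in chrom_list.split(",") if name.strip()]


chroms = parse_chrom_list("1, 2, 3, X, Y, MT")
print(chroms)
```

The names must match the contig names in the BAM header exactly, so `1` and `chr1` are different values.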
Lines for parameter file¶
GATK_CatVariants1:
    module: GATK_CatVariants
    base: GATK_SelectVariants_VEPfiltered
    script_path: /path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
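As a rough illustration of what a per-sample concatenation command could look like, the sketch below assembles a GATK 3 CatVariants call from the per-chromosome `GATK_vcf` entries of one sample. The function name, the file names and the exact command layout are assumptions made for this example; the script that the module actually writes may differ.

```python
def cat_variants_command(script_path, genome_reference, chrom_vcfs, chroms, out_vcf):
    """Build one CatVariants command line for a single sample.

    chrom_vcfs mimics sample_data[sample]: {chrom: {"GATK_vcf": path}}.
    Hypothetical helper; not the module's own implementation.
    """
    parts = [script_path, "-R", genome_reference]
    # Add the inputs in chrom_list order so the result follows reference order.
    for chrom in chroms:
        parts += ["-V", chrom_vcfs[chrom]["GATK_vcf"]]
    parts += ["-out", out_vcf]
    return " ".join(parts)


sample_vcfs = {"1": {"GATK_vcf": "s1_1.vcf"}, "2": {"GATK_vcf": "s1_2.vcf"}}
print(cat_variants_command(
    "/path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants",
    "/path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta",
    sample_vcfs, ["1", "2"], "s1.vcf"))
```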
References¶
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_gvcf
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for generating a gVCF file from a BAM file.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
HaplotypeCaller
(GATK)
Requires¶
self.sample_data[sample]["bam"]
Output¶
self.sample_data[sample][chr]["GATK_g.vcf"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file¶
GATK_gvcf: # check about -nct for parallelization and deal with memory problem
    module: GATK_gvcf
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    qsub_params:
        -pe: shared 15
    redirects:
        -nct: 15
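Because one script is produced per sample-chromosome pair, the number of jobs is the number of samples times the number of chromosomes. The sketch below enumerates those pairs and builds a GATK 3 HaplotypeCaller command in gVCF mode for each. The function name and the output file naming are hypothetical, and the real module may order or name things differently.

```python
from itertools import product


def haplotype_caller_commands(script_path, genome_reference, bams, chroms, nct=None):
    """Return {(sample, chrom): command} for every sample-chromosome pair.

    bams maps sample name -> BAM path. Hypothetical helper for illustration.
    """
    commands = {}
    for sample, chrom in product(sorted(bams), chroms):
        parts = [
            script_path, "-T", "HaplotypeCaller",
            "-R", genome_reference,
            "-I", bams[sample],
            "-L", chrom,                    # restrict calling to one chromosome
            "--emitRefConfidence", "GVCF",  # produce a gVCF rather than a plain VCF
            "-o", f"{sample}_{chrom}.g.vcf",  # assumed output naming
        ]
        if nct is not None:
            parts += ["-nct", str(nct)]     # CPU threads, as set under redirects
        commands[(sample, chrom)] = " ".join(parts)
    return commands


jobs = haplotype_caller_commands("gatk", "ref.fasta", {"s1": "s1.bam", "s2": "s2.bam"}, ["1", "X"], nct=15)
print(len(jobs))
```

When `-nct` is passed through `redirects`, it usually makes sense for the slot count requested in `qsub_params` to match it, as in the parameter lines above.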
References¶
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_hard_filters
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for applying hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
SelectVariants and VariantFiltration
(GATK)
Requires¶
self.sample_data[chr]["vcf"]
Output¶
self.sample_data[chr]["vcf"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
filterExpression_SNP | Filter expression for SNPs | |
filterExpression_INDEL | Filter expression for INDELs | |
Lines for parameter file¶
GATK_hard_filters1:
    module: GATK_hard_filters
    base: GenotypeGVCFs1
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    filterExpression_SNP: '"QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"'
    filterExpression_INDEL: '"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8"'
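To make the logic of the SNP expression concrete, the sketch below re-implements it as a plain Python predicate: a site is flagged as soon as any one of the `||` conditions holds. This is an illustration only. GATK evaluates the expression itself, and the choice here to skip annotations that are absent from a record is an assumption of the sketch, not a statement about GATK's behaviour.

```python
# Thresholds copied from filterExpression_SNP above.
SNP_RULES = [
    ("QD", "<", 2.0),
    ("MQ", "<", 40.0),
    ("FS", ">", 60.0),
    ("SOR", ">", 3.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]


def fails_hard_filter(annotations, rules=SNP_RULES):
    """Return True if any rule matches, i.e. the site would be flagged."""
    for name, op, threshold in rules:
        value = annotations.get(name)
        if value is None:
            continue  # assumption of this sketch: a missing annotation never triggers
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


good_site = {"QD": 12.0, "MQ": 60.0, "FS": 1.2, "SOR": 0.8,
             "MQRankSum": 0.1, "ReadPosRankSum": 0.3}
low_qd_site = dict(good_site, QD=1.5)
print(fails_hard_filter(good_site), fails_hard_filter(low_qd_site))
```

The INDEL expression can be checked the same way by passing a rule list built from `filterExpression_INDEL`.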
References¶
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_merge_gvcf
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for combining g.vcf files into cohorts.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
CombineGVCFs
(GATK)
Requires¶
self.sample_data[sample][chr]["GATK_g.vcf"]
Output¶
self.sample_data["cohorts"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
cohort_size | Number of g.vcf files in each cohort | |
Lines for parameter file¶
gatk_merge_gvcf:
    module: GATK_merge_gvcf
    base: GATK_gvcf
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    cohort_size: 10
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
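The effect of `cohort_size` can be pictured as chunking the list of samples into consecutive groups. The following sketch shows one simple way to do that; the function name is hypothetical, and the order in which the module actually assigns samples to cohorts is an assumption.

```python
def split_into_cohorts(samples, cohort_size):
    """Split a list of sample names into consecutive groups of cohort_size.

    The last cohort holds the remainder and may be smaller.
    """
    if cohort_size < 1:
        raise ValueError("cohort_size must be a positive integer")
    return [samples[i:i + cohort_size] for i in range(0, len(samples), cohort_size)]


samples = [f"sample{n}" for n in range(1, 26)]  # 25 made-up sample names
cohorts = split_into_cohorts(samples, 10)
print([len(cohort) for cohort in cohorts])
```

With 25 samples and `cohort_size: 10`, this grouping yields two cohorts of ten and one of five, each of which would be combined per chromosome.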
References¶
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_pre_processing
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for generating ready-to-GATK-use BAM files from fastq files.
Attention
The module does not include the base quality score recalibration (BQSR) step.
The programs included in the module are the following:
FastqToSam
Picard tool to generate uBAM
MarkIlluminaAdapters
Picard tool to mark Illumina adapters
SamToFastq
Picard tool to convert uBAM to fastq
MergeBamAlignment
Picard tool to merge BAM and uBAM
MarkDuplicates
Picard tool to remove PCR duplicates
BWA MEM
Mapping with BWA MEM
Requires¶
Fastq files in the following locations:
self.sample_data[sample]["fastq.F"]
self.sample_data[sample]["fastq.R"]
Output¶
self.sample_data[sample]["bam"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
picard_path | path to PICARD | Full path to the PICARD .jar file |
bwa_mem_path | | |
genome_reference | | |
Lines for parameter file¶
GATK_pre_processing:
    module: GATK_pre_processing
    base: fQC_trim
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    picard_path: /path/to/picard.jar
    bwa_mem_path: /path/to/bwa mem
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    threads: 20
    qsub_params:
        -pe: shared 20
References¶
GATK_SelectVariants
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for separating the per-chromosome multi-VCF into one VCF per sample per chromosome.
Attention
The module generates a script for each sample/chromosome.
The programs included in the module are the following:
SelectVariants
(GATK)
Requires¶
self.sample_data[chr]["vcf"]
Output¶
self.sample_data[sample][chr]["GATK_vcf"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | path to reference genome | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file¶
GATK_SelectVariants_VEPfiltered:
    module: GATK_SelectVariants
    base: VEP1
    script_path: /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    redirects:
        --setFilteredGtToNocall: null
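The change in data layout, from one VCF per chromosome to one VCF per sample per chromosome, can be sketched as a nested plan. The example below builds such a plan and the GATK 3 SelectVariants arguments that extract a single sample with `-sn`. The function name, the output file naming and the `args` entry are hypothetical and only illustrate the idea.

```python
def plan_select_variants(chrom_vcfs, samples):
    """Map {chrom: multi-sample VCF} to {sample: {chrom: {...}}}.

    Hypothetical helper; mirrors sample_data[sample][chr]["GATK_vcf"].
    """
    plan = {}
    for sample in samples:
        plan[sample] = {}
        for chrom, vcf in chrom_vcfs.items():
            out_vcf = f"{sample}_{chrom}.vcf"  # assumed output naming
            plan[sample][chrom] = {
                "GATK_vcf": out_vcf,
                # -sn keeps only the named sample from the multi-sample VCF
                "args": ["-T", "SelectVariants", "-V", vcf, "-sn", sample, "-o", out_vcf],
            }
    return plan


plan = plan_select_variants({"1": "joint_1.vcf", "X": "joint_X.vcf"}, ["A", "B"])
print(plan["A"]["X"]["GATK_vcf"])
```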
References¶
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_VQSR
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for applying VQSR filters.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
VariantRecalibrator and ApplyRecalibration
(GATK)
Requires¶
self.sample_data[chr]["vcf"]
Output¶
self.sample_data[chr]["vcf"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
ts_filter_level_SNP | Truth sensitivity filter level for SNPs | |
ts_filter_level_INDEL | Truth sensitivity filter level for INDELs | |
resource_SNP | | |
resource_INDEL | | |
Lines for parameter file¶
GATK_VQSR1:
    module: GATK_VQSR
    base: GenotypeGVCFs1
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/bundle/b37/human_g1k_v37_decoy.fasta
    resource_SNP:
        - hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/bundle/b37/hapmap_3.3.b37.vcf
        - omni,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/1000G_omni2.5.b37.vcf
        - 1000G,known=false,training=true,truth=false,prior=10.0 /path/to/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    resource_INDEL:
        - mills,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    ts_filter_level_SNP: 99.0
    ts_filter_level_INDEL: 99.0
    maxGaussians: 4
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
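Each `resource_SNP` and `resource_INDEL` entry has the form "tags path", which matches the GATK 3 VariantRecalibrator syntax `-resource:<tags> <path>`. A minimal sketch of that translation is given below; the function name is hypothetical, and how the module itself assembles these arguments is an assumption.

```python
def resource_args(resources):
    """Turn 'tags path' resource entries into -resource:<tags> <path> arguments."""
    args = []
    for entry in resources:
        # Split on the first run of whitespace: tags on the left, VCF path on the right.
        tags, path = entry.split(None, 1)
        args += [f"-resource:{tags}", path.strip()]
    return args


resource_INDEL = [
    "mills,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf",
    "dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf",
]
print(" ".join(resource_args(resource_INDEL)))
```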
References¶
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GenotypeGVCFs
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for performing joint genotyping on gVCF files produced by HaplotypeCaller.
Attention
The module generates a script for each cohort-chromosome combination.
The programs included in the module are the following:
GenotypeGVCFs
(GATK)
Requires¶
self.sample_data["cohorts"]
Output¶
self.sample_data[chr]["vcf"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file¶
GenotypeGVCFs1:
    module: GenotypeGVCFs
    base: gatk_merge_gvcf
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
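For orientation, a joint-genotyping call in GATK 3 takes one `-V` argument per input gVCF. The sketch below builds such a command for one chromosome from a list of combined cohort gVCFs. The function name and the file names are hypothetical, and the way the module groups cohort files per script is not confirmed by this page.

```python
def genotype_gvcfs_command(script_path, genome_reference, cohort_gvcfs, chrom, out_vcf):
    """Build a GenotypeGVCFs command for one chromosome. Hypothetical helper."""
    parts = [script_path, "-T", "GenotypeGVCFs", "-R", genome_reference, "-L", chrom]
    # One -V per combined cohort gVCF.
    for gvcf in cohort_gvcfs:
        parts += ["-V", gvcf]
    parts += ["-o", out_vcf]
    return " ".join(parts)


print(genotype_gvcfs_command(
    "/path/to/java -jar /path/to/GenomeAnalysisTK.jar",
    "/path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta",
    ["cohort1_21.g.vcf", "cohort2_21.g.vcf"], "21", "joint_21.vcf"))
```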
References¶
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
Picard_CollectAlignmentSummaryMatrics
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for collecting statistical information about the mapping, generated by CollectAlignmentSummaryMetrics from Picard tools.
The programs included in the module are the following:
CollectAlignmentSummaryMetrics
from Picard tools.
Requires¶
A BAM file in the following location:
self.sample_data[sample]["bam"]
Output¶
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
Lines for parameter file¶
Picard_CollectAlignmentSummaryMatrics1:
    module: Picard_CollectAlignmentSummaryMatrics
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/picard-1.139/dist/picard.jar
    genome_reference: /path/to/bundle/b37/human_g1k_v37_decoy.fasta
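Picard's CollectAlignmentSummaryMetrics accepts its inputs in the classic `KEY=value` form (`R=`, `I=`, `O=`). The sketch below shows how a per-sample command might be composed from the parameters above. The function name and the output file name are hypothetical, and the exact command the module writes is an assumption.

```python
def alignment_summary_command(script_path, genome_reference, bam, out_txt):
    """Build a Picard CollectAlignmentSummaryMetrics command. Hypothetical helper."""
    return " ".join([
        script_path,
        "CollectAlignmentSummaryMetrics",
        f"R={genome_reference}",  # reference sequence
        f"I={bam}",               # input BAM, from sample_data[sample]["bam"]
        f"O={out_txt}",           # metrics output file (assumed naming)
    ])


print(alignment_summary_command(
    "/path/to/java -jar /path/to/picard-1.139/dist/picard.jar",
    "/path/to/bundle/b37/human_g1k_v37_decoy.fasta",
    "sample1.bam", "sample1.alignment_summary_metrics.txt"))
```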
References¶
Picard_CollectVariantCalling
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for generating SNP and indel statistics.
The programs included in the module are the following:
CollectVariantCallingMetrics
Picard tool that generates a collection of metrics relating to SNPs and indels within a variant-calling file (VCF)
Requires¶
A VCF file in the following location:
self.sample_data[chr]["vcf"]
Output¶
Lines for parameter file¶
Picard_CollectVariantCalling1:
    module: Picard_CollectVariantCalling
    base: GATK_hard_filters1
    script_path: /path/to/java -jar /path/to/picard.jar
    DBSNP: /path/to/bundle/b37/dbsnp_138.b37.vcf
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
References¶
VEP
¶
Authors: | Michal Gordon |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A class that defines a module for annotating the multi-VCF file.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
VEP
(Variant Effect Predictor)
Requires¶
self.sample_data[chr]["vcf"]
Output¶
self.sample_data[chr]["vcf"]
- annotated multi-VCF per chromosome
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Note
VEP parameters can be passed via redirects
Lines for parameter file¶
VEP1:
    module: VEP
    base: GATK_hard_filters1
    script_path: /path/to/vep
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    redirects:
        --format: vcf
        --offline: null
        --species: homo_sapiens
        --fork: 10
        --assembly: GRCh37
        --max_af: null
        --pick: null
        --dir: /path/to/VEP/ensembl-vep-release-88.10/cache
        --check_existing: null
        --symbol: null
        --force_overwrite: null
        --vcf: null
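In the `redirects` block, some options carry a value (`--fork: 10`) while others are set to `null`, which fits options that are plain switches such as `--offline`. The sketch below shows one plausible way to flatten such a mapping into command-line arguments, with `None` standing in for YAML `null`. The function name is hypothetical, and the treatment of `null` as "flag without a value" is an assumption inferred from the example rather than taken from the module's code.

```python
def redirects_to_args(redirects):
    """Flatten a redirects mapping into a list of command-line arguments.

    A value of None (YAML null) emits the flag alone; any other value
    is emitted after the flag as a string.
    """
    args = []
    for flag, value in redirects.items():
        args.append(flag)
        if value is not None:
            args.append(str(value))
    return args


redirects = {"--format": "vcf", "--offline": None, "--fork": 10, "--vcf": None}
print(" ".join(redirects_to_args(redirects)))
```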
References¶
McLaren, William, et al. “The ensembl variant effect predictor.” Genome biology 17.1 (2016): 122.