GATK
Modules included in this section
GATK_CatVariants
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module that concatenates the per-chromosome VCF files into a single VCF file for each sample.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
CatVariants (GATK)
Requires
self.sample_data[sample][chr]["GATK_vcf"]
Output
self.sample_data[sample]["vcf"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GATK_CatVariants1:
    module: GATK_CatVariants
    base: GATK_SelectVariants_VEPfiltered
    script_path: /path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
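As a rough illustration of what these parameters feed into, the sketch below splits a chrom_list string into chromosome names and assembles one CatVariants command that joins the per-chromosome VCF files in that order. It is not the module's actual code: the helper names and the file names are hypothetical, and only the CatVariants options -R, -V, -out and -assumeSorted are taken from GATK 3.

```python
def parse_chrom_list(chrom_list):
    """Split a comma-separated chrom_list string into clean chromosome names."""
    return [name.strip() for name in chrom_list.split(",") if name.strip()]


def cat_variants_command(script_path, reference, vcf_by_chrom, chroms, out_vcf):
    """Build one CatVariants command joining per-chromosome VCFs in chrom_list order."""
    parts = [script_path, "-R", reference]
    for chrom in chroms:
        # One -V argument per chromosome, kept in the order given by chrom_list
        parts += ["-V", vcf_by_chrom[chrom]]
    parts += ["-out", out_vcf, "-assumeSorted"]
    return " ".join(parts)


chroms = parse_chrom_list("1, 2, X, MT")
# Hypothetical per-chromosome file names for a sample called "sample1"
vcfs = {c: "sample1.chr{}.vcf".format(c) for c in chroms}
cmd = cat_variants_command(
    "java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants",
    "human_g1k_v37_decoy.fasta", vcfs, chroms, "sample1.vcf")
print(cmd)
```

The order of names in chrom_list matters in this sketch, because the -assumeSorted option tells CatVariants that the input files are already given in genomic order.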
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_gvcf
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for generating a gVCF file from a BAM file.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
HaplotypeCaller (GATK)
Requires
self.sample_data[sample]["bam"]
Output
self.sample_data[sample][chr]["GATK_g.vcf"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GATK_gvcf: # check whether -nct helps with parallelization and how it affects memory usage
    module: GATK_gvcf
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    qsub_params:
        -pe: shared 15
    redirects:
        -nct: 15
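The one-script-per-sample-and-chromosome layout can be pictured with the sketch below, which enumerates every sample-chromosome pair and builds a HaplotypeCaller command for it. This is an assumption about the general shape of the generated commands, not the module's code: the function name and output file names are hypothetical, while -T, -R, -I, -L, --emitRefConfidence GVCF, -o and -nct are GATK 3 options.

```python
from itertools import product


def gvcf_jobs(sample_bams, chroms, script_path, reference, threads=None):
    """Return {(sample, chrom): command} with one HaplotypeCaller run per pair."""
    jobs = {}
    for sample, chrom in product(sample_bams, chroms):
        parts = [script_path, "-T", "HaplotypeCaller", "-R", reference,
                 "-I", sample_bams[sample],
                 "-L", chrom,  # restrict this run to a single chromosome
                 "--emitRefConfidence", "GVCF",
                 "-o", "{}.{}.g.vcf".format(sample, chrom)]  # hypothetical naming
        if threads:
            parts += ["-nct", str(threads)]
        jobs[(sample, chrom)] = " ".join(parts)
    return jobs


jobs = gvcf_jobs({"s1": "s1.bam", "s2": "s2.bam"}, ["1", "X"],
                 "java -jar GenomeAnalysisTK.jar", "ref.fasta", threads=15)
for key in sorted(jobs):
    print(key, jobs[key])
```

With two samples and the 25 chromosomes of the example chrom_list, this scheme would yield 50 independent scripts, which is why the -pe and -nct values should be chosen together.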
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_hard_filters
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for applying hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
SelectVariants and VariantFiltration (GATK)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[chr]["vcf"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
filterExpression_SNP | Filter expression for SNPs | |
filterExpression_INDEL | Filter expression for indels | |
Lines for parameter file
GATK_hard_filters1:
    module: GATK_hard_filters
    base: GenotypeGVCFs1
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    filterExpression_SNP: '"QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"'
    filterExpression_INDEL: '"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8"'
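To make the logic of the SNP expression concrete, the sketch below re-implements it in plain Python: the clauses are joined with || (logical OR), so a variant is flagged as soon as any single annotation crosses its threshold. This is only an illustration. GATK evaluates the expression itself with JEXL, and treating a missing annotation as "does not trigger the clause" is an assumption made here for simplicity; the function name and the example values are hypothetical.

```python
def fails_snp_filter(info):
    """Return True if any clause of the example SNP filter expression matches."""
    checks = [
        ("QD", lambda v: v < 2.0),
        ("MQ", lambda v: v < 40.0),
        ("FS", lambda v: v > 60.0),
        ("SOR", lambda v: v > 3.0),
        ("MQRankSum", lambda v: v < -12.5),
        ("ReadPosRankSum", lambda v: v < -8.0),
    ]
    # Assumption: an annotation absent from the INFO field does not trigger its clause
    return any(key in info and test(info[key]) for key, test in checks)


good = {"QD": 20.0, "MQ": 60.0, "FS": 1.2, "SOR": 0.8}
strand_biased = dict(good, FS=75.0)  # only FS exceeds its threshold
print(fails_snp_filter(good), fails_snp_filter(strand_biased))
```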
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_merge_gvcf
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for combining g.vcf files into cohorts.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
CombineGVCFs (GATK)
Requires
self.sample_data[sample][chr]["GATK_g.vcf"]
Output
self.sample_data["cohorts"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
cohort_size | Number of g.vcf files in each cohort | |
Lines for parameter file
gatk_merge_gvcf:
    module: GATK_merge_gvcf
    base: GATK_gvcf
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    cohort_size: 10
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
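The effect of cohort_size can be sketched as a simple chunking of the g.vcf files into consecutive groups. How the module actually orders samples and handles a final, smaller group is not stated here, so the behavior below (the remainder forms its own cohort) is an assumption, and the function and file names are hypothetical.

```python
def split_into_cohorts(gvcf_files, cohort_size):
    """Group g.vcf files into consecutive cohorts of at most cohort_size files."""
    if cohort_size < 1:
        raise ValueError("cohort_size must be a positive integer")
    return [gvcf_files[i:i + cohort_size]
            for i in range(0, len(gvcf_files), cohort_size)]


# 25 hypothetical samples for one chromosome, grouped with cohort_size = 10
files = ["sample{}.1.g.vcf".format(n) for n in range(1, 26)]
cohorts = split_into_cohorts(files, 10)
print([len(c) for c in cohorts])
```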
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_pre_processing
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for generating GATK-ready BAM files from fastq files.
Attention
The module does not include the base quality score recalibration (BQSR) step.
The programs included in the module are the following:
- FastqToSam: Picard tool to generate uBAM
- MarkIlluminaAdapters: Picard tool to mark Illumina adapters
- SamToFastq: Picard tool to convert uBAM to fastq
- MergeBamAlignment: Picard tool to merge BAM and uBAM
- MarkDuplicates: Picard tool to remove PCR duplicates
- BWA MEM: mapping with BWA MEM
Requires
Fastq files in the following locations:
self.sample_data[sample]["fastq.F"]
self.sample_data[sample]["fastq.R"]
Output
self.sample_data[sample]["bam"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
picard_path | path to PICARD | Full path to the PICARD .jar file |
bwa_mem_path | | |
genome_reference | | |
Lines for parameter file
GATK_pre_processing:
    module: GATK_pre_processing
    base: fQC_trim
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    picard_path: /path/to/picard.jar
    bwa_mem_path: /path/to/bwa mem
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    threads: 20
    qsub_params:
        -pe: shared 20
References
GATK_SelectVariants
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for splitting each per-chromosome multi-sample VCF file into one VCF file per sample per chromosome.
Attention
The module generates a script for each sample/chromosome.
The programs included in the module are the following:
SelectVariants (GATK)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[sample][chr]["GATK_vcf"]
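The change in data layout, from one multi-sample VCF per chromosome to one VCF per sample per chromosome, can be sketched as follows. The nested dictionary mirrors the slots named above, but the function, the output file names and the exact command shape are hypothetical; -T SelectVariants, -R, -V, -sn and -o are GATK 3 options.

```python
def split_by_sample(chrom_vcfs, samples, script_path, reference):
    """Return (sample_data, commands) with one SelectVariants run per sample and chromosome."""
    sample_data = {}
    commands = []
    for sample in samples:
        sample_data[sample] = {}
        for chrom, multi_vcf in chrom_vcfs.items():
            out_vcf = "{}.{}.vcf".format(sample, chrom)  # hypothetical naming
            sample_data[sample][chrom] = {"GATK_vcf": out_vcf}
            commands.append(" ".join([
                script_path, "-T", "SelectVariants", "-R", reference,
                "-V", multi_vcf,
                "-sn", sample,  # keep only this sample's genotypes
                "-o", out_vcf]))
    return sample_data, commands


data, cmds = split_by_sample({"1": "cohort.1.vcf", "2": "cohort.2.vcf"},
                             ["A", "B"], "java -jar GenomeAnalysisTK.jar", "ref.fasta")
print(data["B"]["2"]["GATK_vcf"])
```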
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | Path to the reference genome | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GATK_SelectVariants_VEPfiltered:
    module: GATK_SelectVariants
    base: VEP1
    script_path: /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    redirects:
        --setFilteredGtToNocall: null
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_VQSR
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for applying VQSR filters.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
VariantRecalibrator and ApplyRecalibration (GATK)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[chr]["vcf"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
ts_filter_level_SNP | Truth sensitivity filter level for SNPs | |
ts_filter_level_INDEL | Truth sensitivity filter level for indels | |
resource_SNP | | |
resource_INDEL | | |
Lines for parameter file
GATK_VQSR1:
    module: GATK_VQSR
    base: GenotypeGVCFs1
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/bundle/b37/human_g1k_v37_decoy.fasta
    resource_SNP:
        - hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/bundle/b37/hapmap_3.3.b37.vcf
        - omni,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/1000G_omni2.5.b37.vcf
        - 1000G,known=false,training=true,truth=false,prior=10.0 /path/to/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    resource_INDEL:
        - mills,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    ts_filter_level_SNP: 99.0
    ts_filter_level_INDEL: 99.0
    maxGaussians: 4
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
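Each resource_SNP and resource_INDEL entry holds the resource tags and the file path separated by whitespace. A plausible way to turn such entries into VariantRecalibrator arguments, using the GATK 3 -resource:NAME,tags PATH form, is sketched below. Whether the module does exactly this is an assumption, and the function name is hypothetical.

```python
def resource_args(resources):
    """Turn 'name,tags path' entries into GATK 3 style -resource arguments."""
    args = []
    for entry in resources:
        # Split on the first run of whitespace: tags on the left, VCF path on the right
        tags, path = entry.split(None, 1)
        args += ["-resource:" + tags, path.strip()]
    return args


indel_resources = [
    "mills,known=false,training=true,truth=true,prior=12.0 /path/to/mills.vcf",
    "dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/dbsnp.vcf",
]
print(" ".join(resource_args(indel_resources)))
```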
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GenotypeGVCFs
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for performing joint genotyping on gVCF files produced by HaplotypeCaller.
Attention
The module generates a script for each cohort-chromosome combination.
The programs included in the module are the following:
GenotypeGVCFs (GATK)
Requires
self.sample_data["cohorts"]
Output
self.sample_data[chr]["vcf"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GenotypeGVCFs1:
    module: GenotypeGVCFs
    base: gatk_merge_gvcf
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
Picard_CollectAlignmentSummaryMatrics
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for collecting mapping statistics with CollectAlignmentSummaryMetrics from Picard tools.
The programs included in the module are the following:
CollectAlignmentSummaryMetrics (Picard tools)
Requires
A BAM file in the following location:
self.sample_data[sample]["bam"]
Output
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
genome_reference | | |
Lines for parameter file
Picard_CollectAlignmentSummaryMatrics1:
    module: Picard_CollectAlignmentSummaryMatrics
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/picard-1.139/dist/picard.jar
    genome_reference: /path/to/bundle/b37/human_g1k_v37_decoy.fasta
References
Picard_CollectVariantCalling
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for generating SNP and indel statistics.
The programs included in the module are the following:
CollectVariantCallingMetrics: Picard tool that generates a collection of metrics relating to SNPs and indels within a variant-calling file (VCF)
Requires
A VCF file in the following location:
self.sample_data[chr]["vcf"]
Output
Lines for parameter file
Picard_CollectVariantCalling1:
    module: Picard_CollectVariantCalling
    base: GATK_hard_filters1
    script_path: /path/to/java -jar /path/to/picard.jar
    DBSNP: /path/to/bundle/b37/dbsnp_138.b37.vcf
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
References
VEP
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for annotating the multi-sample VCF file.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
VEP (Variant Effect Predictor)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[chr]["vcf"]
- annotated multi-VCF per chromosome
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Note
VEP parameters can be passed via redirects
Lines for parameter file
VEP1:
    module: VEP
    base: GATK_hard_filters1
    script_path: /path/to/vep
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    redirects:
        --format: vcf
        --offline: null
        --species: homo_sapiens
        --fork: 10
        --assembly: GRCh37
        --max_af: null
        --pick: null
        --dir: /path/to/VEP/ensembl-vep-release-88.10/cache
        --check_existing: null
        --symbol: null
        --force_overwrite: null
        --vcf: null
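In the redirects block, an entry with a value becomes an option followed by its argument, while an entry set to null becomes a bare switch. The sketch below illustrates that reading of the block. It is an interpretation of the convention, not the framework's own code, and the function name is hypothetical.

```python
def redirects_to_args(redirects):
    """Flatten a redirects mapping into a list of command-line arguments."""
    args = []
    for flag, value in redirects.items():
        args.append(flag)
        if value is not None:
            # A YAML null (None in Python) means the flag takes no argument
            args.append(str(value))
    return args


vep_redirects = {"--format": "vcf", "--offline": None, "--fork": 10, "--vcf": None}
print("/path/to/vep " + " ".join(redirects_to_args(vep_redirects)))
```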
References
McLaren, William, et al. “The ensembl variant effect predictor.” Genome biology 17.1 (2016): 122.