GATK
Modules included in this section
GATK_CatVariants
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module that concatenates per-chromosome VCF files into one VCF file for each sample.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
CatVariants (GATK)
Requires
self.sample_data[sample][chr]["GATK_vcf"]
Output
self.sample_data[sample]["vcf"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | | |
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GATK_CatVariants1:
    module: GATK_CatVariants
    base: GATK_SelectVariants_VEPfiltered
    script_path: /path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
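As a rough illustration of how `chrom_list` and the per-chromosome `GATK_vcf` slots could feed CatVariants, the Python sketch below builds an argument list for one sample. The helper names, the toy `sample_data` dictionary and the file names are hypothetical. The flags (`-R`, `-V`, `-out`, `-assumeSorted`) are those of the GATK 3 CatVariants tool; the command the module actually writes may differ.

```python
def parse_chrom_list(chrom_list):
    """Split a comma-separated chrom_list string into clean chromosome names."""
    return [name.strip() for name in chrom_list.split(",") if name.strip()]


def cat_variants_args(sample_data, sample, chrom_list, reference, out_vcf):
    """Build CatVariants arguments for one sample (hypothetical helper)."""
    args = ["-R", reference]
    for chrom in parse_chrom_list(chrom_list):
        # Inputs come from the slot named under "Requires".
        args += ["-V", sample_data[sample][chrom]["GATK_vcf"]]
    # -assumeSorted tells CatVariants the inputs are already in genomic order.
    args += ["-out", out_vcf, "-assumeSorted"]
    return args


# Toy data: the file names are invented for illustration only.
sample_data = {"S1": {c: {"GATK_vcf": f"S1.{c}.vcf"} for c in ["1", "2", "X"]}}
args = cat_variants_args(sample_data, "S1", "1, 2, X", "ref.fasta", "S1.vcf")
print(" ".join(args))
```

The inputs are appended in the order given by `chrom_list`, so the order of names in that parameter determines the order of records in the concatenated file in this sketch.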
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_gvcf
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for generating a gVCF file from a BAM file.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
HaplotypeCaller (GATK)
Requires
self.sample_data[sample]["bam"]
Output
self.sample_data[sample][chr]["GATK_g.vcf"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | | |
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GATK_gvcf: # check -nct for parallelization and to deal with memory problems
    module: GATK_gvcf
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    qsub_params:
        -pe: shared 15
    redirects:
        -nct: 15
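To make the "one script per sample-chromosome combination" idea concrete, the following Python sketch enumerates one HaplotypeCaller command line per pair. It is only an approximation under stated assumptions: the function name, the toy BAM dictionary and the output file names are hypothetical, and the GATK 3 arguments shown (`-T`, `-R`, `-I`, `-L`, `--emitRefConfidence GVCF`, `-o`, `-nct`) may not match exactly what the module emits.

```python
from itertools import product


def haplotypecaller_commands(script_path, reference, bams, chroms, nct=None):
    """Return one HaplotypeCaller command line per (sample, chromosome) pair."""
    commands = {}
    for (sample, bam), chrom in product(bams.items(), chroms):
        out_gvcf = f"{sample}.{chrom}.g.vcf"  # hypothetical output name
        parts = [
            script_path, "-T", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-L", chrom,  # restrict calling to a single chromosome
            "--emitRefConfidence", "GVCF",
            "-o", out_gvcf,
        ]
        if nct is not None:
            parts += ["-nct", str(nct)]  # mirrors the redirects entry above
        commands[(sample, chrom)] = " ".join(parts)
    return commands


# Toy stand-in for sample_data[sample]["bam"].
bams = {"S1": "S1.bam", "S2": "S2.bam"}
commands = haplotypecaller_commands(
    "java -jar GenomeAnalysisTK.jar", "ref.fasta", bams, ["1", "X"], nct=15
)
for key in sorted(commands):
    print(key, commands[key])
```

With two samples and two chromosomes the sketch yields four independent commands, which is what allows the work to be spread across cluster jobs.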
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_hard_filters
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for applying hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
SelectVariants and VariantFiltration (GATK)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[chr]["vcf"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | | |
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
| filterExpression_SNP | Filter expression for SNPs | |
| filterExpression_INDEL | Filter expression for indels | |
Lines for parameter file
GATK_hard_filters1:
    module: GATK_hard_filters
    base: GenotypeGVCFs1
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    filterExpression_SNP: '"QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"'
    filterExpression_INDEL: '"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8"'
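The SNP expression in the example above is a chain of OR-ed threshold tests. The Python sketch below restates those thresholds to show which records would be flagged. It is a plain re-statement for illustration, not the way GATK evaluates the expression; the function name and the toy INFO dictionaries are hypothetical, and skipping annotations that are absent from a record is a simplifying assumption of this sketch.

```python
def fails_snp_filter(info):
    """Return True if any SNP hard-filter condition from the example holds."""
    checks = [
        ("QD", lambda v: v < 2.0),
        ("MQ", lambda v: v < 40.0),
        ("FS", lambda v: v > 60.0),
        ("SOR", lambda v: v > 3.0),
        ("MQRankSum", lambda v: v < -12.5),
        ("ReadPosRankSum", lambda v: v < -8.0),
    ]
    # Assumption: an annotation missing from the record is simply skipped.
    return any(name in info and test(info[name]) for name, test in checks)


good = {"QD": 12.0, "MQ": 60.0, "FS": 1.2, "SOR": 0.8,
        "MQRankSum": 0.1, "ReadPosRankSum": 0.3}
low_qd = dict(good, QD=1.5)           # fails on QD < 2.0
strand_biased = dict(good, FS=75.0)   # fails on FS > 60.0

print(fails_snp_filter(good), fails_snp_filter(low_qd), fails_snp_filter(strand_biased))
```

Because the conditions are joined with `||`, a single failing annotation is enough to flag the variant.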
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_merge_gvcf
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for combining g.vcf files into cohorts.
Attention
The module generates a script for each sample-chromosome combination.
The programs included in the module are the following:
CombineGVCFs (GATK)
Requires
self.sample_data[sample][chr]["GATK_g.vcf"]
Output
self.sample_data["cohorts"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | | |
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
| cohort_size | Number of g.vcf files in each cohort | |
Lines for parameter file
gatk_merge_gvcf:
    module: GATK_merge_gvcf
    base: GATK_gvcf
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    cohort_size: 10
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
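The effect of `cohort_size` can be pictured as chunking the list of per-sample g.vcf files. The Python sketch below is one possible interpretation: the function name and file names are hypothetical, and placing the leftover files in a smaller final cohort is an assumption, since the handling of a remainder is not described here.

```python
def split_into_cohorts(gvcfs, cohort_size):
    """Chunk a list of g.vcf paths into consecutive groups of cohort_size."""
    if cohort_size < 1:
        raise ValueError("cohort_size must be a positive integer")
    return [gvcfs[i:i + cohort_size] for i in range(0, len(gvcfs), cohort_size)]


# 25 invented per-sample files for a single chromosome.
gvcfs = [f"sample{n}.1.g.vcf" for n in range(1, 26)]
cohorts = split_into_cohorts(gvcfs, 10)
print([len(cohort) for cohort in cohorts])
```

Each resulting group would correspond to one CombineGVCFs run per chromosome.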
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_pre_processing
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for generating GATK-ready BAM files from fastq files.
Attention
The module lacks the “base recalibration process (BQSR)” step.
The programs included in the module are the following:
- FastqToSam: Picard tool to generate uBAM
- MarkIlluminaAdapters: Picard tool to mark Illumina adapters
- SamToFastq: Picard tool to convert uBAM to fastq
- MergeBamAlignment: Picard tool to merge BAM and uBAM
- MarkDuplicates: Picard tool to remove PCR duplicates
- BWA MEM: mapping with BWA MEM
Requires
fastq files in the following locations:
self.sample_data[sample]["fastq.F"]
self.sample_data[sample]["fastq.R"]
Output
self.sample_data[sample]["bam"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| picard_path | path to PICARD | Full path to the PICARD .jar file |
| bwa_mem_path | | |
| genome_reference | | |
Lines for parameter file
GATK_pre_processing:
    module: GATK_pre_processing
    base: fQC_trim
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    picard_path: /path/to/picard.jar
    bwa_mem_path: /path/to/bwa mem
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    threads: 20
    qsub_params:
        -pe: shared 20
References
GATK_SelectVariants
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for separating each per-chromosome multi-VCF into one VCF per sample per chromosome.
Attention
The module generates a script for each sample/chromosome.
The programs included in the module are the following:
SelectVariants (GATK)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[sample][chr]["GATK_vcf"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | Path to the reference genome | |
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GATK_SelectVariants_VEPfiltered:
    module: GATK_SelectVariants
    base: VEP1
    script_path: /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    redirects:
        --setFilteredGtToNocall: null
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GATK_VQSR
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for applying VQSR filters.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
VariantRecalibrator and ApplyRecalibration (GATK)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[chr]["vcf"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | | |
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
| ts_filter_level_SNP | Truth sensitivity filter level for SNPs | |
| ts_filter_level_INDEL | Truth sensitivity filter level for indels | |
| resource_SNP | | |
| resource_INDEL | | |
Lines for parameter file
GATK_VQSR1:
    module: GATK_VQSR
    base: GenotypeGVCFs1
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference: /path/to/bundle/b37/human_g1k_v37_decoy.fasta
    resource_SNP:
        - hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/bundle/b37/hapmap_3.3.b37.vcf
        - omni,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/1000G_omni2.5.b37.vcf
        - 1000G,known=false,training=true,truth=false,prior=10.0 /path/to/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    resource_INDEL:
        - mills,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    ts_filter_level_SNP: 99.0
    ts_filter_level_INDEL: 99.0
    maxGaussians: 4
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
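Each `resource_SNP` and `resource_INDEL` entry packs a resource name, four settings and a file path into one string. The Python sketch below shows one way to take such an entry apart and to turn a list of entries into GATK 3 style `-resource:` arguments. The function names are hypothetical, and whether the module builds its VariantRecalibrator command this way is an assumption.

```python
def parse_resource(entry):
    """Split 'name,key=value,... /path/to/file.vcf' into its components."""
    spec, path = entry.split(None, 1)  # the first whitespace separates spec and path
    name, *pairs = spec.split(",")
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {
        "name": name,
        "known": fields["known"] == "true",
        "training": fields["training"] == "true",
        "truth": fields["truth"] == "true",
        "prior": float(fields["prior"]),
        "path": path,
    }


def resource_args(entries):
    """Render entries as '-resource:<spec> <path>' argument pairs."""
    args = []
    for entry in entries:
        spec, path = entry.split(None, 1)
        args += [f"-resource:{spec}", path]
    return args


entry = ("hapmap,known=false,training=true,truth=true,prior=15.0 "
         "/path/to/bundle/b37/hapmap_3.3.b37.vcf")
print(parse_resource(entry))
print(resource_args([entry]))
```

Parsing the entries before a run is a cheap way to catch a missing `prior` or a misspelled setting early.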
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
GenotypeGVCFs
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for performing joint genotyping on gVCF files produced by HaplotypeCaller.
Attention
The module generates a script for each cohort-chromosome combination.
The programs included in the module are the following:
GenotypeGVCFs (GATK)
Requires
self.sample_data["cohorts"]
Output
self.sample_data[chr]["vcf"]
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | | |
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Lines for parameter file
GenotypeGVCFs1:
    module: GenotypeGVCFs
    base: gatk_merge_gvcf
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    genome_reference: /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
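The `base` entries in the parameter-file examples of this section link the steps into a chain. As an informal aid for reading them, the Python sketch below copies those `base` values into a dictionary and walks upstream from a given step. The dictionary and function name are illustrative only and are not part of the modules.

```python
# step name -> base step, copied from the parameter-file examples in this section
BASES = {
    "GATK_pre_processing": "fQC_trim",
    "GATK_gvcf": "GATK_pre_processing",
    "gatk_merge_gvcf": "GATK_gvcf",
    "GenotypeGVCFs1": "gatk_merge_gvcf",
    "GATK_hard_filters1": "GenotypeGVCFs1",
    "GATK_VQSR1": "GenotypeGVCFs1",
    "VEP1": "GATK_hard_filters1",
    "GATK_SelectVariants_VEPfiltered": "VEP1",
    "GATK_CatVariants1": "GATK_SelectVariants_VEPfiltered",
    "Picard_CollectAlignmentSummaryMatrics1": "GATK_pre_processing",
    "Picard_CollectVariantCalling1": "GATK_hard_filters1",
}


def upstream_chain(step, bases):
    """Return the steps from the earliest base down to the requested step."""
    chain = [step]
    while chain[-1] in bases:
        parent = bases[chain[-1]]
        if parent in chain:
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
    return chain[::-1]


print(" -> ".join(upstream_chain("GATK_CatVariants1", BASES)))
```

Walking the chain this way makes it easy to see, for example, that both `GATK_hard_filters1` and `GATK_VQSR1` branch off `GenotypeGVCFs1`.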
References
Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.
Picard_CollectAlignmentSummaryMatrics
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for collecting statistical information about the mapping, generated by CollectAlignmentSummaryMetrics from Picard tools.
The programs included in the module are the following:
CollectAlignmentSummaryMetrics from Picard tools.
Requires
A BAM file in the following location:
self.sample_data[sample]["bam"]
Output
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| genome_reference | | |
Lines for parameter file
Picard_CollectAlignmentSummaryMatrics1:
    module: Picard_CollectAlignmentSummaryMatrics
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/picard-1.139/dist/picard.jar
    genome_reference: /path/to/bundle/b37/human_g1k_v37_decoy.fasta
References
Picard_CollectVariantCalling
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for generating SNP and indel statistics.
The programs included in the module are the following:
CollectVariantCallingMetrics: Picard tool that generates a collection of metrics relating to SNPs and indels within a variant-calling file (VCF).
Requires
A VCF file in the following location:
self.sample_data[chr]["vcf"]
Output
Lines for parameter file
Picard_CollectVariantCalling1:
    module: Picard_CollectVariantCalling
    base: GATK_hard_filters1
    script_path: /path/to/java -jar /path/to/picard.jar
    DBSNP: /path/to/bundle/b37/dbsnp_138.b37.vcf
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
References
VEP
- Authors
Michal Gordon
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for annotating the multi-VCF file.
Attention
The module generates a script for each chromosome.
The programs included in the module are the following:
VEP (Variant Effect Predictor)
Requires
self.sample_data[chr]["vcf"]
Output
self.sample_data[chr]["vcf"] - annotated multi-VCF per chromosome
Parameters that can be set
| Parameter | Values | Comments |
|---|---|---|
| chrom_list | Comma-separated list of chromosome names as mentioned in the BAM file | |
Note
VEP parameters can be passed via redirects
Lines for parameter file
VEP1:
    module: VEP
    base: GATK_hard_filters1
    script_path: /path/to/vep
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"
    redirects:
        --format: vcf
        --offline: null
        --species: homo_sapiens
        --fork: 10
        --assembly: GRCh37
        --max_af: null
        --pick: null
        --dir: /path/to/VEP/ensembl-vep-release-88.10/cache
        --check_existing: null
        --symbol: null
        --force_overwrite: null
        --vcf: null
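In the `redirects` block above, some options carry a value while others are set to `null`. A plausible reading, sketched in Python below, is that a `null` entry becomes a bare flag and any other entry becomes a flag followed by its value. The function name is hypothetical and this mapping is an assumption about how the redirects reach the VEP command line, not a description of the module's code.

```python
def redirects_to_args(redirects):
    """Render a redirects mapping as a flat list of command-line arguments."""
    args = []
    for flag, value in redirects.items():
        args.append(flag)
        if value is not None:  # a YAML null is loaded as None: flag only, no value
            args.append(str(value))
    return args


redirects = {
    "--format": "vcf",
    "--offline": None,
    "--species": "homo_sapiens",
    "--fork": 10,
    "--vcf": None,
}
print(" ".join(redirects_to_args(redirects)))
```

Values are converted with `str` so that numeric settings such as `--fork: 10` can be joined into a single command string.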
References
McLaren, William, et al. “The ensembl variant effect predictor.” Genome biology 17.1 (2016): 122.