SNP Genotype Calling and Quality Control

  In genetic analyses, the term ‘batch effect’ refers to systematic differences caused by batch heterogeneity. Controlling this unintended effect is the most important step in the quality control (QC) processes that precede analysis. Existing statistical approaches do not adequately control batch effects, and newer approaches are required.

  The proposed method was used to assess genotyping data obtained with an Affymetrix Human Mapping array from 3,619 subjects, consisting of 1,074 patients with Alzheimer’s disease, 296 with mild cognitive impairment (MCI), and 1,153 controls. The samples were gathered from seven different hospitals in Korea and were genotyped in five discrete sets. In addition, one hundred subjects were genotyped twice, and these data were used to evaluate performance. The data showed a batch effect, so the batches were grouped by K-medoid clustering on the average probe intensity measurements of each batch, and the subjects’ genotypes were called separately within each cluster. The proposed method improves the accuracy of the called genotypes without filtering out many subjects and SNPs, and is therefore a reasonable approach for controlling batch effects. The method is implemented as R functions (createUpstreamCode and downstreamQC), which can be downloaded from http://healthstat.snu.ac.kr/.

  We propose a new strategy that detects batch effects from probe intensity measurements and calls genotypes in the presence of batch effects. Application of the proposed method to real data shows that it balances genotype accuracy against the number of subjects and SNPs lost to filtering. Furthermore, the proposed method can be extended to various scenarios with a simple modification.
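The batch clustering step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it uses the simple alternating (assign, then update) variant of K-medoids rather than any particular library, and the batch names, the two-dimensional mean intensities, and the choice of k=2 are all hypothetical.

```python
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_medoids(points: List[Sequence[float]], k: int,
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Return (medoid indices, cluster label per point)."""
    # Deterministic start: the first k points serve as initial medoids.
    medoids = list(range(k))
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest medoid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, points[medoids[c]]))
            for p in points
        ]
        # Update step: in each cluster, choose the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(euclidean(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical per-batch averages of probe intensity (two summary values per batch).
batch_names = ["batch1", "batch2", "batch3", "batch4", "batch5"]
points = [(10.1, 9.8), (10.0, 9.9), (11.2, 10.9), (11.3, 11.0), (10.2, 9.7)]

medoids, labels = k_medoids(points, k=2)
for name, label in zip(batch_names, labels):
    print(name, "-> cluster", label)
```

Batches that share a cluster label would then be genotyped together, and each cluster would be called separately.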

createUpstreamCode

The createUpstreamCode function generates the scripts needed for Axiom SNP chip genotype calling and upstream QC. The generated scripts are not executed automatically; users run them manually. This is intended as a convenience, because it lets users inspect the code for each step.

Requirements

Linux
Axiom APT
R (with the SNPolisher package)
Python
Annotation File

Input Parameter

Parameter            Description
wd                   Directory where calling and QC are performed
celDir               Directory containing the CEL (*.cel) files
celName              Format of the CEL file names
aptDir               Location of the Axiom APT program
aptLibDir            Location of Axiom APT - R libraries
DQC_thr              Threshold for the DQC value
smpCallRate_thr      Threshold for the sample call rate
plateCallRate_thr    Threshold for the plate call rate
platePassRate_thr    Threshold for the plate pass rate
annoDir              Location of the annotation file

Output

The function creates the following subfolders under wd, and the generated scripts (code.sh, code.R, or code.py) are stored in them automatically.

Folder                  Code        Description
1.celfile_list          code.sh     List the *.cel files located in celDir
2.Generate_DQC          code.sh     Generate the DQC value of each sample using APT
3.Sample_QC_DQC         code.R      Remove samples whose DQC value is lower than DQC_thr
4.Generate_CallRate     code.sh     Generate sample call rates using APT
5.Sample_QC_CallRate    code.R      Remove samples whose call rate is lower than smpCallRate_thr
6.Palte_QC              code.R      Remove plates whose average call rate is lower than plateCallRate_thr or whose plate pass rate is lower than platePassRate_thr
7.Genotype_calling      code.sh     Genotype the passing samples and plates using APT
8.SNPolisher            code.R      Classify SNPs
9.Call2plink            code1.R     Prepare to convert APT outputs to PLINK format (*.ped/*.map)
                        code2.py    Convert APT outputs to PLINK format (*.ped/*.map)
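The filtering logic of steps 3, 5 and 6 can be sketched as follows. This is a minimal illustration rather than the generated code.R: the Sample record, the upstream_qc function, the sample data, and the threshold values are all hypothetical. It also assumes that the plate pass rate is the percentage of a plate's samples that pass both sample-level checks, and that the plate call rate is averaged over those passing samples.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Sample(NamedTuple):
    sample_id: str
    plate: str
    dqc: float        # DQC value reported by APT
    call_rate: float  # sample call rate in percent


def upstream_qc(samples: List[Sample], dqc_thr: float, smp_call_rate_thr: float,
                plate_call_rate_thr: float, plate_pass_rate_thr: float) -> Set[str]:
    """Return the IDs of samples that survive sample-level and plate-level QC."""
    by_plate: Dict[str, List[Sample]] = defaultdict(list)
    for s in samples:
        by_plate[s.plate].append(s)

    kept: Set[str] = set()
    for members in by_plate.values():
        # Steps 3 and 5: keep samples that reach both thresholds.
        passing = [s for s in members
                   if s.dqc >= dqc_thr and s.call_rate >= smp_call_rate_thr]
        if not passing:
            continue
        # Step 6: plate-level metrics (definitions are assumptions, see above).
        pass_rate = 100.0 * len(passing) / len(members)
        mean_call_rate = sum(s.call_rate for s in passing) / len(passing)
        if pass_rate >= plate_pass_rate_thr and mean_call_rate >= plate_call_rate_thr:
            kept.update(s.sample_id for s in passing)
    return kept


# Hypothetical data: plate P1 loses one sample, plate P2 fails as a whole.
samples = [
    Sample("s1", "P1", 0.95, 99.5), Sample("s2", "P1", 0.90, 99.0),
    Sample("s3", "P1", 0.93, 99.2), Sample("s4", "P1", 0.75, 99.1),
    Sample("s5", "P2", 0.70, 99.0), Sample("s6", "P2", 0.90, 95.0),
    Sample("s7", "P2", 0.92, 99.1), Sample("s8", "P2", 0.94, 99.3),
]

# Illustrative thresholds only; they are not taken from the source.
kept = upstream_qc(samples, dqc_thr=0.82, smp_call_rate_thr=97.0,
                   plate_call_rate_thr=98.5, plate_pass_rate_thr=70.0)
print(sorted(kept))
```

In the real pipeline the call rate is generated only for samples that already passed the DQC filter (step 4 follows step 3), so the combined check here is a simplification.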

downstreamQC

The downstreamQC function alone performs all of the following downstream QC steps.

Individual QC
1) Identification of individuals with discordant sex information
2) Identification of individuals with elevated missing data rates
3) Identification of individuals with outlying heterozygosity rate
4) Identification of duplicated or related individuals
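As an illustration of item 3, the sketch below flags individuals whose heterozygosity rate is far from the sample mean. It assumes the per-individual counts that PLINK --het reports (observed homozygous genotypes and non-missing genotypes) and a cutoff of 3 standard deviations. The cutoff, the function name, and the data are assumptions for illustration; the source does not state how ind_qc_het_thr is interpreted.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def heterozygosity_outliers(counts: Dict[str, Tuple[int, int]],
                            n_sd: float = 3.0) -> List[str]:
    """counts maps individual ID -> (observed homozygous count, non-missing count)."""
    # Heterozygosity rate = share of non-missing genotypes that are heterozygous.
    rates = {iid: (n_nm - o_hom) / n_nm for iid, (o_hom, n_nm) in counts.items()}
    mu = mean(rates.values())
    sd = stdev(rates.values())
    # Flag individuals more than n_sd standard deviations from the mean.
    return sorted(iid for iid, r in rates.items() if abs(r - mu) > n_sd * sd)


# Hypothetical counts: 19 typical individuals and one with excess heterozygosity.
counts = {f"ind{i:02d}": (6990 + 10 * (i % 3), 10000) for i in range(19)}
counts["ind_outlier"] = (6000, 10000)

outliers = heterozygosity_outliers(counts)
print(outliers)
```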

SNP QC
1) Identification of SNPs with significantly different missing genotype rates between cases and controls
2) Identification of SNPs demonstrating a significant deviation from Hardy-Weinberg equilibrium (HWE)
3) Removal of all markers with a very low minor allele frequency (MAF)
4) Identification of SNPs with significantly different MAF between plates
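To make items 2 and 3 concrete, the following sketch computes the minor allele frequency and a Hardy-Weinberg p-value from the three genotype counts of one SNP. Note the substitution: it uses the textbook 1-degree-of-freedom chi-square test, which is simpler than the exact test PLINK applies, so p-values will differ somewhat for rare alleles. The function name, the genotype counts, and the thresholds are hypothetical.

```python
import math
from typing import Tuple


def maf_and_hwe(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Return (minor allele frequency, chi-square HWE p-value) for one SNP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    maf = min(p, q)
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, p_value


# Hypothetical SNPs: one in equilibrium, one with no heterozygotes at all.
maf_ok, p_ok = maf_and_hwe(360, 480, 160)
maf_bad, p_bad = maf_and_hwe(500, 0, 500)

# Illustrative thresholds, standing in for snp_qc_maf_thr and snp_qc_hwe_thr.
maf_thr, hwe_thr = 0.01, 1e-6
for name, maf, pval in [("snp_ok", maf_ok, p_ok), ("snp_bad", maf_bad, p_bad)]:
    print(name, "keep" if maf >= maf_thr and pval >= hwe_thr else "remove")
```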

Requirements

PLINK

Input Parameter

Parameter                    Description
wd                           Working directory
plinkDir                     Location of the PLINK program
genotypeDir                  Directory of the genotype file
genotypeFile                 Name of the genotype file
phenoDir                     Directory of the phenotype file, if it exists separately
qcDir                        Directory where the results will be stored
ind_qc_sex                   Whether to include the 'Discordant sex information QC'
ind_qc_sex_thr               If ind_qc_sex=TRUE, set the threshold (refer to PLINK --check-sex)
ind_qc_miss                  Whether to include the 'Missing rate QC'
ind_qc_miss_thr              If ind_qc_miss=TRUE, set the threshold (refer to PLINK --missing)
ind_qc_het                   Whether to include the 'Heterozygosity QC'
ind_qc_het_thr               If ind_qc_het=TRUE, set the threshold (refer to PLINK --het)
ind_qc_relate                Whether to include the 'Relatedness QC'
ind_qc_relate_thr            If ind_qc_relate=TRUE, set the threshold (refer to PLINK --indep-pairwise)
snp_qc_miss_cs_cntrl         Whether to include the 'Missing genotype rates between cases and controls QC'
snp_qc_miss_cs_cntrol_thr    If snp_qc_miss_cs_cntrl=TRUE, set the threshold (refer to PLINK --test-missing)
snp_qc_miss                  Whether to include the 'Missing genotype rates QC'
snp_qc_miss_thr              If snp_qc_miss=TRUE, set the threshold (refer to PLINK --geno)
snp_qc_hwe                   Whether to include the 'HWE QC'
snp_qc_hwe_thr               If snp_qc_hwe=TRUE, set the threshold (refer to PLINK --hwe)
snp_qc_hwe_incl_nonctrl      If snp_qc_hwe=TRUE, whether to include cases when calculating HWE (refer to PLINK --hwe include-nonctrl)
snp_qc_maf                   Whether to include the 'MAF QC'
snp_qc_maf_thr               If snp_qc_maf=TRUE, set the threshold (refer to PLINK --maf)
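The SNP-level parameters map naturally onto PLINK filter flags. The sketch below shows one way such a mapping could be assembled into a single command line. It is not the actual downstreamQC code. The helper name and the keys plink, genotype_prefix and out_prefix are hypothetical. It assumes PLINK 1.9 syntax and a text (*.ped/*.map) input prefix. The PLINK flags themselves (--file, --geno, --hwe with the include-nonctrl modifier, --maf, --make-bed, --out) are standard.

```python
from typing import Dict, List, Union

ParamValue = Union[str, bool, float]


def build_snp_qc_command(params: Dict[str, ParamValue]) -> List[str]:
    """Translate SNP QC switches and thresholds into a PLINK argument list."""
    cmd = [str(params["plink"]), "--file", str(params["genotype_prefix"])]
    if params.get("snp_qc_miss"):
        # Drop SNPs whose missing genotype rate exceeds the threshold.
        cmd += ["--geno", str(params["snp_qc_miss_thr"])]
    if params.get("snp_qc_hwe"):
        # Drop SNPs whose HWE test p-value falls below the threshold.
        cmd += ["--hwe", str(params["snp_qc_hwe_thr"])]
        if params.get("snp_qc_hwe_incl_nonctrl"):
            cmd.append("include-nonctrl")
    if params.get("snp_qc_maf"):
        # Drop SNPs whose minor allele frequency is below the threshold.
        cmd += ["--maf", str(params["snp_qc_maf_thr"])]
    cmd += ["--make-bed", "--out", str(params["out_prefix"])]
    return cmd


# Hypothetical settings; paths and threshold values are placeholders.
params = {
    "plink": "plink",
    "genotype_prefix": "genotype/mydata",
    "out_prefix": "qc/clean_data",
    "snp_qc_miss": True, "snp_qc_miss_thr": 0.05,
    "snp_qc_hwe": True, "snp_qc_hwe_thr": 1e-6, "snp_qc_hwe_incl_nonctrl": False,
    "snp_qc_maf": True, "snp_qc_maf_thr": 0.01,
}

cmd = build_snp_qc_command(params)
print(" ".join(cmd))
```

Returning a list of arguments rather than one string means the command can be passed to subprocess.run without shell quoting issues.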

Output

Quality-controlled genotype data in PLINK binary format (clean_data.bed, clean_data.bim, clean_data.fam)
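A quick way to check the cleaned data is to count the individuals listed in clean_data.fam. The sketch below relies on the standard PLINK .fam layout (six whitespace-separated columns: family ID, individual ID, father, mother, sex, phenotype, with phenotype 1 = control and 2 = case). The function name and the example lines are hypothetical; in practice the lines would come from open("clean_data.fam").

```python
from typing import Dict, Iterable


def summarize_fam(lines: Iterable[str]) -> Dict[str, int]:
    """Count individuals, cases and controls in PLINK .fam content."""
    summary = {"samples": 0, "cases": 0, "controls": 0, "missing_phenotype": 0}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        summary["samples"] += 1
        phenotype = fields[5]
        if phenotype == "2":
            summary["cases"] += 1
        elif phenotype == "1":
            summary["controls"] += 1
        else:
            # PLINK treats 0 and -9 as missing phenotype codes.
            summary["missing_phenotype"] += 1
    return summary


# Hypothetical .fam content.
fam_lines = [
    "FAM1 IND1 0 0 1 2",
    "FAM2 IND2 0 0 2 1",
    "FAM3 IND3 0 0 2 2",
    "FAM4 IND4 0 0 0 -9",
]

summary = summarize_fam(fam_lines)
print(summary)
```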

 

Download

R code can be downloaded from CallingQC_Code.zip

 


Authors

·  Sujin Seo <sujin91kr@gmail.com>

·  Kyungtaek Park <qkrrudxor147@gmail.com>

·  Jang Jae Lee <jjjlee21@gmail.com>

·  Kyu Yeong Choi <khaser@gmail.com>

·  Kun Ho Lee <leekho@chosun.ac.kr>

·  Sungho Won <sunghow@gmail.com>