In genetic analyses, the term ‘batch effect’ refers to systematic differences caused by batch heterogeneity. Controlling this unintended effect is the most important step in the quality control (QC) process that precedes analysis. Existing statistical methods do not adequately control batch effects, and new approaches are required.
The proposed method was applied to genotyping data obtained using an Affymetrix Human Mapping array of 3,619 subjects consisting of 1,074 patients with Alzheimer’s disease, 296 with mild cognitive impairment (MCI), and 1,153 controls. The samples were gathered from seven different hospitals in Korea and were genotyped in five discrete sets. In addition, one hundred subjects were genotyped twice, and these duplicate data were used to evaluate performance. The data showed a batch effect, so the batches were clustered by K-medoid clustering on the per-batch averages of the probe intensity measurements, and the genotypes of the subjects in each cluster were called separately. The proposed method improves the accuracy of the called genotypes without filtering out a large number of subjects and SNPs, and is therefore a reasonable approach for controlling batch effects. We implemented two R functions (createUpstreamCode and downstreamQC), which can be downloaded from http://healthstat.snu.ac.kr/.
We proposed a new strategy that detects batch effects from probe intensity measurements and calls genotypes in the presence of batch effects. Applied to real data, the proposed method balances genotype accuracy against the number of subjects and SNPs that have to be filtered out. Furthermore, the proposed method can be extended to various scenarios with a simple modification.
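The batch clustering step described above can be illustrated with a small Python sketch. It is not the authors' R implementation: the function name `k_medoids`, the batch names, and the intensity values are all hypothetical, and the exhaustive search over medoid sets is used only because the number of batches is small (five in the study).

```python
from itertools import combinations


def k_medoids(points, k):
    """Exact k-medoids by exhaustive search over all sets of k medoids.

    Only practical for a handful of points, such as a few genotyping batches.
    Returns the indices of the chosen medoids and a cluster label per point.
    """
    def dist(a, b):
        # Euclidean distance between two summary vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    best_cost, best_medoids = float("inf"), None
    for medoids in combinations(range(len(points)), k):
        # Cost = sum of distances from each point to its nearest medoid
        cost = sum(min(dist(p, points[m]) for m in medoids) for p in points)
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids

    labels = [
        min(range(k), key=lambda j: dist(p, points[best_medoids[j]]))
        for p in points
    ]
    return list(best_medoids), labels


# Hypothetical per-batch averages of probe intensity (illustrative values only)
batch_means = {
    "batch1": (10.1, 9.8),
    "batch2": (10.0, 9.9),
    "batch3": (12.3, 11.7),
    "batch4": (12.1, 11.9),
    "batch5": (10.2, 9.7),
}
names = list(batch_means)
medoids, labels = k_medoids([batch_means[n] for n in names], k=2)

# Group batch names by cluster; genotypes would then be called per cluster
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```

With many batches an iterative algorithm such as PAM would replace the exhaustive search; the clustering criterion stays the same.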
The createUpstreamCode function generates the code needed for Axiom SNP chip genotype calling and upstream QC. The generated code is not executed automatically; users run it manually, which lets them inspect the code for each step.
Directory where calling and QC are performed
Directory where the CEL (*.cel) files are located
Format of the cel file
Location of Axiom APT program
Location of Axiom APT - R libraries
Threshold of DQC value
Threshold of sample call rate
Threshold of plate call rate
Threshold of plate pass rate
Location of annotation file
The function creates the following subfolders under wd, and the generated code (code.sh, code.R, or code.py) is automatically stored in them.
List the *.cel files located in the celDir
Generate DQC value of each sample using APT
Remove samples whose DQC value is lower than DQC_thr
Generate sample call rates using APT
Remove samples whose call rate is lower than smpCallRate_thr
Remove plates whose average call rate is lower than plateCallRate_thr or whose plate pass rate is lower than platePassRate_thr
Genotype passing samples and plates using APT
Prepare for converting APT outputs to PLINK format (*.ped/*.map)
Convert APT outputs to PLINK format (*.ped/*.map)
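The sample and plate filters in the steps above can be sketched in Python as follows. This is an illustration, not the generated code itself: the function name `upstream_filter`, the record layout, and the threshold values are hypothetical, and the sketch assumes that the plate pass rate is the fraction of a plate's samples passing both sample-level filters and that the plate call rate is averaged over those passing samples.

```python
def upstream_filter(samples, dqc_thr, smp_call_rate_thr,
                    plate_call_rate_thr, plate_pass_rate_thr):
    """Return the IDs of samples that survive sample- and plate-level QC.

    samples: list of dicts with keys 'id', 'plate', 'dqc', 'call_rate'.
    """
    # Sample-level filters: DQC first, then call rate
    passing = [
        s for s in samples
        if s["dqc"] >= dqc_thr and s["call_rate"] >= smp_call_rate_thr
    ]

    kept = []
    for plate in sorted({s["plate"] for s in samples}):
        on_plate = [s for s in samples if s["plate"] == plate]
        ok = [s for s in passing if s["plate"] == plate]
        if not ok:
            continue  # no passing samples on this plate
        # Assumed definitions of the two plate-level metrics
        pass_rate = len(ok) / len(on_plate)
        mean_call_rate = sum(s["call_rate"] for s in ok) / len(ok)
        if pass_rate >= plate_pass_rate_thr and mean_call_rate >= plate_call_rate_thr:
            kept.extend(s["id"] for s in ok)
    return kept


# Hypothetical records; b1 fails DQC, so its call rate is never computed
samples = [
    {"id": "a1", "plate": "A", "dqc": 0.95, "call_rate": 0.992},
    {"id": "a2", "plate": "A", "dqc": 0.90, "call_rate": 0.989},
    {"id": "a3", "plate": "A", "dqc": 0.88, "call_rate": 0.990},
    {"id": "b1", "plate": "B", "dqc": 0.60, "call_rate": None},
    {"id": "b2", "plate": "B", "dqc": 0.93, "call_rate": 0.995},
]

# Illustrative thresholds, not necessarily the package defaults
kept = upstream_filter(samples, dqc_thr=0.82, smp_call_rate_thr=0.97,
                       plate_call_rate_thr=0.985, plate_pass_rate_thr=0.95)
print(kept)
```

In this toy input, plate B is dropped because only half of its samples pass, so sample b2 is removed even though it passes the sample-level filters.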
The downstreamQC function is all that is needed to perform the following downstream QC steps.
1) Identification of individuals with discordant sex information
2) Identification of individuals with elevated missing data rates
3) Identification of individuals with outlying heterozygosity rate
4) Identification of duplicated or related individuals
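As an illustration of step 3, the sketch below flags individuals whose heterozygosity rate lies far from the sample mean. It assumes the common convention of flagging rates more than three standard deviations from the mean, which is not necessarily the rule used by downstreamQC. The function name and the input tuples are hypothetical; the counts correspond to the O(HOM) and N(NM) columns of a PLINK --het report.

```python
from statistics import mean, stdev


def het_outliers(records, n_sd=3.0):
    """Flag individuals with an outlying heterozygosity rate.

    records: iterable of (iid, observed_homozygous, non_missing) tuples.
    """
    # Heterozygosity rate = heterozygous calls / non-missing calls
    rates = {iid: (n_nm - o_hom) / n_nm for iid, o_hom, n_nm in records}
    mu = mean(rates.values())
    sd = stdev(rates.values())
    return sorted(iid for iid, r in rates.items() if abs(r - mu) > n_sd * sd)


# Hypothetical data: 19 typical individuals and one with excess heterozygosity
records = [(f"s{i}", 700 + (i % 3), 1000) for i in range(19)]
records.append(("s19", 550, 1000))

flagged = het_outliers(records)
print(flagged)
```

With very small samples a 3 SD rule can never trigger, because the largest possible standardized deviation is bounded by the sample size, so the example uses twenty individuals.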
1) Identification of SNPs with significantly different missing genotype rates between cases and controls
2) Identification of SNPs demonstrating a significant deviation from Hardy-Weinberg equilibrium (HWE)
3) Removal of all markers with a very low minor allele frequency (MAF)
4) Identification of SNPs with significantly different MAF between plates
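Steps 2 and 3 can be made concrete with a short Python sketch that computes the MAF and an HWE p-value from genotype counts. The function names are hypothetical, and the HWE test shown is the one-degree-of-freedom chi-square goodness-of-fit test, a simplification chosen for brevity; PLINK itself reports an exact test.

```python
import math


def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    return min(p, 1 - p)


def hwe_chisq_p(n_aa, n_ab, n_bb):
    """P-value of the chi-square test for Hardy-Weinberg equilibrium (1 df)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # Survival function of the chi-square distribution with 1 df
    return math.erfc(math.sqrt(chi2 / 2))


# Genotype counts exactly at HWE proportions versus a SNP with no heterozygotes
print(maf(25, 50, 25), hwe_chisq_p(25, 50, 25))
print(maf(50, 0, 50), hwe_chisq_p(50, 0, 50))
```

A SNP would be removed when its MAF falls below the chosen MAF threshold or its HWE p-value falls below the chosen HWE threshold.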
Location of PLINK program
Directory of genotype file
File name of genotype file
Directory of the phenotype file, if it exists separately
Specify a directory where the results will be stored
Whether to include 'Discordant sex information QC'
If ind_qc_sex=TRUE, set the threshold (refer to PLINK --check-sex)
Whether to include 'Missing rate QC'
If ind_qc_miss=TRUE, set the threshold (refer to PLINK --missing)
Whether to include 'Heterozygosity QC'
If ind_qc_het=TRUE, set the threshold (refer to PLINK --het)
Whether to include 'Relatedness QC'
If ind_qc_relate=TRUE, set the threshold (refer to PLINK --indep-pairwise)
Whether to include 'Missing genotype rates between cases and controls QC'
If snp_qc_miss_cs_cntrl=TRUE, set the threshold (refer to PLINK --test-missing)
Whether to include 'Missing genotype rates QC'
If 'Missing genotype rates QC' is included, set the threshold (refer to PLINK --geno)
Whether to include 'HWE QC'
If 'HWE QC' is included, set the threshold (refer to PLINK --hwe)
If 'HWE QC' is included, whether to include cases when calculating HWE (refer to PLINK --hwe include-nonctrl)
Whether to include 'MAF QC'
If 'MAF QC' is included, set the threshold (refer to PLINK --maf)
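To show how the SNP-level options above map onto PLINK flags, the following Python sketch assembles a PLINK command line without running it. The function name, argument names, paths, and threshold values are hypothetical and are not the arguments of downstreamQC. The sketch also assumes a single PLINK invocation on *.ped/*.map input, whereas the package may run its steps separately.

```python
def build_plink_snp_qc_command(plink_path, in_prefix, out_prefix,
                               geno_thr=None, hwe_thr=None,
                               hwe_include_nonctrl=False, maf_thr=None):
    """Assemble a PLINK argument list for SNP-level QC (nothing is executed).

    A threshold left as None means the corresponding QC step is skipped.
    """
    cmd = [plink_path, "--file", in_prefix]  # --file reads <prefix>.ped/.map
    if geno_thr is not None:
        cmd += ["--geno", str(geno_thr)]     # max missing genotype rate per SNP
    if hwe_thr is not None:
        cmd += ["--hwe", str(hwe_thr)]       # HWE p-value threshold
        if hwe_include_nonctrl:
            cmd.append("include-nonctrl")    # also use cases in the HWE test
    if maf_thr is not None:
        cmd += ["--maf", str(maf_thr)]       # minimum minor allele frequency
    cmd += ["--make-bed", "--out", out_prefix]
    return cmd


# Hypothetical paths and illustrative thresholds
cmd = build_plink_snp_qc_command(
    "/path/to/plink", "genotypes", "clean_data",
    geno_thr=0.05, hwe_thr=0.001, hwe_include_nonctrl=True, maf_thr=0.01,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the paths point to a real PLINK installation and data set.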
Quality controlled genotype data in PLINK format (clean_data.bed, clean_data.bim, clean_data.fam)
R code can be downloaded from CallingQC_Code.zip
· Sujin Seo <email@example.com>
· Kyungtaek Park <firstname.lastname@example.org>
· Jang Jae Lee <email@example.com>
· Kyu Yeong Choi <firstname.lastname@example.org>
· Kun Ho Lee <email@example.com>
· Sungho Won <firstname.lastname@example.org>