Admix-kit

an integrated toolkit and pipeline for genetic analyses of admixed populations

Hou, Kangcheng
Gogarten, Stephanie
Kim, Joohyun
Hua, Xing
Dias, Julie-Alexia
Sun, Quan
Wang, Ying
Tan, Taotao
Adebamowo, Sally
Adeyemo, Adebowale
Auer, Paul
Bensellak, Taoufik
Berndt, Sonja
Bhukar, Rohan
Cao, Hongyuan
Cario, Clinton
Chatterjee, Nilanjan
Chen, Jiawen
Chikowore, Tinashe
Choudhury, Ananyo
Conomos, Matthew
Conti, David
Cullina, Sinead
Darst, Burcu
Ding, Yi
Dong, Ruocheng
Duan, Rui
Fakim, Yasmina
Franceschini, Nora
Ge, Tian
Ghoorah, Anisah W
Gignoux, Chris
Gogarten, Stephanie
Hanchard, Neil
Hanisch, Rachel
Hauser, Michael
Hazelhurst, Scott
Hirbo, Jibril
Hornsby, Whitney
Hou, Kangcheng
Hua, Xing
Huerta, Alicia
Hysong, Micah
Jin, Jin
Johar, Angad
Judd, Jon
Kachuri, Linda
Kamiza, Abram Bunya
Kenny, Eimear
Khan, Alyna
Kharitonova, Elena
Kim, Joohyun
Konigsberg, Iain
Kooperberg, Charles
Kosel, Matt
Kullo, Iftikhar
Lange, Ethan
Li, Yun
Li, Qing
Liivrand, Maria
Lohmueller, Kirk
Lu, Kevin
Mandla, Ravi
Manning, Alisa
Martin, Iman
Martin, Alicia
McDonnell, Shannon
Mechanic, Leah
Mercader, Josep
Mester, Rachel
Ng, Maggie
Nguyen, Kevin
Norland, Kristján
Ockerman, Franklin
Loohuis, Loes Olde
Onyenobi, Ebuka
Pasaniuc, Bogdan
Patel, Aniruddh
Petter, Ella
Rice, Kenneth
Rothstein, Joseph
Rowan, Bryce
Rowley, Robb
Ruan, Yunfeng
Sankararaman, Sriram
Sartori, Ambra
Schaid, Dan
Shemirani, Ruhollah
Shortt, Jonathan
Sim, Xueling
Smith, Johanna L
Stanislawski, Maggie
Stram, Daniel
Sun, Quan
Tayo, Bamidele
Truong, Buu
Tsuo, Kristin
Urbut, Sarah
Wang, Ying
Wang, Wallace Minxian
Wilson, Riley
Witte, John
Wojcik, Genevieve
Zhang, Jingning
Zhang, Ruyue
Zhang, Haoyu
Zhang, Yuji
Zhong, Michael
Zhou, Laura
Atkinson, Elizabeth G
Martin, Alicia
Shortt, Jonathan
Hirbo, Jibril
Li, Yun
Pasaniuc, Bogdan
Zhang, Haoyu

Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, United States
Department of Biostatistics, University of Washington, Seattle, WA, 98195, United States
Vanderbilt Genetics Institute and Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, United States
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, United States
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02120, United States
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, United States
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, United States
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, United States
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States
Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, United States
Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37232, United States
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, United States
Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, United States
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, United States

Corresponding authors. Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90024, United States. E-mails: [email protected] (K.H.) and [email protected] (B.P.); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, 20892, United States. E-mail: [email protected] (H.Z.)

Members of the Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium Methods Working Group are listed in the Supplementary Data.

Received September 30, 2023

Received in revised form February 08, 2024

Accepted March 13, 2024

Revised March 29, 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Bioinformatics 40(4), April 2024. | DOI: 10.1093/bioinformatics/btae148

1 Introduction

Admixed individuals inherit a mosaic of ancestry segments originating from multiple continental ancestral populations, leading to their complex and diverse genetic backgrounds encompassing a wide spectrum of human genetic variation (). Admixed individuals carry an elevated number of genetic variants in the 1000 Genomes Project (). For example, African Americans contain genetic variants originating from both European and African ancestral populations, offering a unique opportunity to study genetic variation from multiple continental populations within a single population. Therefore, an understanding of such genetic ancestry mosaicism within admixed populations offers opportunities to gain insights into the origins and health implications of various genetic traits and diseases, contributing to a more comprehensive understanding of human genetics (^,).

Despite the genetic richness and crucial insights they can offer, admixed populations remain significantly underrepresented in current genetic studies (). This underrepresentation can be attributed to various challenges, including the complexity of analyzing diverse genetic backgrounds and the lack of efficient tools and standardized practices for handling the genetic data of admixed populations. This gap not only hinders progress in genetic research but also exacerbates health disparities. For example, findings with datasets from European ancestry groups for genetic risk prediction models can introduce bias to personalized risk prevention strategies (^,). Genetic admixture is key to understanding variations in phenotype and disease prevalence across populations (). A notable example is the lower white blood cell count observed in individuals of African ancestry (). Such genetic differences, if overlooked, can lead to clinical misinterpretations and unnecessary procedures, including bone marrow biopsies ().

To address these challenges, we introduce admix-kit, an integrated and flexible python toolkit along with workflows developed using Workflow Development Language (WDL), specifically designed for the simulation and analysis of genetic data from admixed populations. We anticipate that our proposed software packages and workflows will help overcome these analytical challenges, enabling the inclusion of admixed individuals in future genetic studies.

2 Results

2.1 Computational toolkit for analyzing admixed genotypes

We begin by outlining the data structures and computational tools in admix-kit for analyzing admixed genetic datasets. Both genotype and local ancestry data are organized as two matrices of shape N × M × 2 (N and M denote the number of individuals and SNPs respectively, and ‘2’ denotes the two haplotypes; Fig. 1a). Given that storage of these matrices often exceeds memory capacity (due to large N and/or M), we adopt a chunked array representation, implemented with the Dask python library (). Each chunk is loaded from disk on demand, thus conserving memory by loading data only when needed and facilitating large-scale analyses. We use pgenlib as an efficient python interface to read phased genotype. Local ancestry matrices are stored in a compressed format that leverages their contiguous nature (local ancestries for nearby SNPs are often identical within each individual). By translating genotype and local ancestry matrices into local-ancestry-specific (LAS) genotype dosages, we have also implemented a set of utility functions tailored for LAS genetic analysis, including LAS allele frequencies, polygenic scores, and phenotype modeling that allow for LAS genetic architecture (Fig. 1b).

Figure 1
Open multimedia modal
Overview of admix-kit’s data structure, functionality, and illustrative analyses using a simulated dataset. (a) Local ancestry and phased genotypes are stored in matrix format. Individual-specific and SNP-specific covariates are stored as two tables with matching orders. (b) Analysis based on ancestry-specific genotype dosage. Starting with a phased genotype for an individual (0/1 denotes presence of minor allele), genotypes are separated into ancestry-specific dosages. Local ancestry-informed downstream analyses can be subsequently performed. (c) visualization of local ancestry tracts. (d) Consistency of genome-wide genetic ancestry of simulated dataset using YRI and CEU in 1000 Genomes as reference populations. (e) Consistency of allele frequencies from the simulated admixed genotypes.

2.2 Workflow for simulating admixture genotypes

Genotype simulation is essential to facilitate testing and benchmarking genetic analysis methodologies. One of the significant challenges lies in simulating admixed genomes, which often becomes the most time-consuming step among common analyses involving admixture. We develop a workflow to specifically address this bottleneck (Supplementary Fig. S1a). We primarily focus on two-way admixture for demonstration while noting our software and pipeline are adaptable to various admixture scenarios. First, starting from a small reference panel such as 1000 Genomes Project, HAPGEN2 () is used to enhance the diversity and size of reference dataset by increasing number of unique haplotypes via recombining initial sets of haplotypes within each genetic ancestry group, such as European or African. This step increases the reference data sample size while preserving the minor allele frequency (MAF) and linkage disequilibrium (LD) structure. Second, using the expanded haplotype sets in both genetic ancestry groups, we simulate admixture process using haptools () with parameters for genetic ancestry proportion and the number of admixture generations. We are primarily interested in simulation scenarios involving an instantaneous admixture event, followed by generations of haplotype recombination (we note other scenarios such continuous admixture events can also be simulated). This process mimics random mating and recombination events to generate realistic distribution of local ancestry segments, MAF and LD structure for the generated genotypes. To make this simulation process more accessible, we have implemented these functionalities as command-line tools within admix-kit (Supplementary Fig. S1a). In details, admix hapgen2 --pfile ${src_plink2} --n-indiv ${n_indiv} --out ${expanded_pop} is used to expand the source population with HAPGEN2. And admix admix-simu --pfile-list "['pop1', 'pop2']" --admix-prop "[0.2,0.8]" --n-indiv ${n_indiv} --n-gen ${n_gen} --out ${admix} can be used to simulate the admixture process across source populations.

Furthermore, we included a number of functions to perform LAS genetic analysis. For genotype-phenotype association testing (admix assoc), we have implemented a suite of methods allowing for genetic effects heterogeneity across local ancestry backgrounds (^,^,^,). These approaches can enhance statistical power by modeling LAS-genetic architecture within admixed populations. We also include functionality to estimate genetic effects concordance across local ancestry backgrounds (admix genet-cor), which is crucial to understand and inform trait-specific optimal strategies for downstream analyses including polygenic score weight estimation and application (). For polygenic scoring (admix calc-partial-pgs), we implemented the calculation of partial polygenic scores, allowing scores to be computed separately for each ancestry background. Such an approach can improve polygenic scoring accuracy in admixed populations (^,^,^,). We also provided a user-friendly WDL-based workflow for genotype simulation that can be run on cloud-based computing platforms [e.g. AnVIL (https://anvilproject.org/), BioData Catalyst (https://biodatacatalyst.nhlbi.nih.gov/)] (Supplementary Fig. S1b). Users can input essential parameters to define the admixture process and provide the input genotype path of ancestral populations (a set of preprocessed 1000 genomes dataset is provided for default usage). The workflow will run through each aforementioned step and produce the simulated admixed genotype dataset. The admix-kit software is encapsulated in a publicly available docker image (URLs).

2.3 Example analysis of a simulated dataset

We demonstrate the practicality of admix-kit through analyses of a simulated dataset. All associated code and notebooks have been made publicly accessible (https://github.com/UW-GAC/admix-kit_workflow). This ensures our results are fully reproducible and can be seamlessly deployed in a cloud platform (e.g. AnVIL). We used the AnVIL workflow to simulate N = 1000 admixed individuals with M = 174K SNPs on chromosomes 1 and 2 presented in 1000 Genomes project, using a demographic model similar to African American individuals with over 8 generations of admixture and an average ancestry proportion of 80% African and 20% European () (ancestry proportion varies by individual). Notably, the genotype simulation took <30 minutes with scalability to a much larger number of individuals and SNPs. Using principal component analysis (PCA), we observed that individuals within the simulated dataset are positioned along a cline between individuals labeled as European and African in the 1000 Genomes reference dataset, suggesting high quality of the simulated genotype dataset (Fig. 1c and d). Allele frequencies computed within genotype segments corresponding to the respective local ancestry displayed high consistency with those computed in the reference population, indicating high preservation of MAF structure of the simulated genotype (Fig. 1e).

3 Discussion

Addressing the underrepresentation of admixed individuals in genetic studies is pivotal not only for scientific necessity but also as a commitment to equity. With this goal in mind, we introduce admix-kit, a comprehensive toolkit and workflow tailored for admixed populations. We anticipate that our software package and workflows will facilitate greater inclusion of admixed individuals in future genetic studies.

Development of software and methodology in genetic studies relies heavily on the use of simulated datasets. These datasets help benchmark performance and facilitate comparisons with existing software. Traditionally, simulated datasets are usually derived from publicly available reference populations. Often, these populations are selected based on a high degree of genetic similarity among individuals in the population (e.g. individuals having all four grandparents from a small geographic region.) For instance, HAPGEN2 has recently been widely used for simulating large-scale genetic datasets that mimic the LD structures of reference populations such as European, African, American, East Asian, and South Asian using data from the 1000 Genomes Project (^,^,^,). While these simulations can recreate datasets with similar LD as the reference populations, they cannot accurately reflect the genetic structure observed in admixed populations where ancestral segments mixing over generations (see example in Supplementary Fig. S2). Consequently, these sampling conditions are not representative of global human genetic variations. As a remedy, simulating admixture among reference populations can provide datasets that more rigorously test the performance of new software. For example, our simulation pipeline can be used to investigate factors that potentially impact accuracy of ancestry inference (including ancestry composition in reference panel, demographic model of simulated admixed population and error in inferred local ancestry) and to understand how errors in ancestry inference propagate to downstream disease mapping and prediction applications. In addition to the admixed genotype simulation provided by previously introduced admix-simu (see URLs) and haptools (), admix-kit provides a suite of methods for statistical genetic analysis of complex traits taking into account of the genetic effects heterogeneity across local ancestry backgrounds (we provide example notebooks illustrating each functionality; URLs).

Admix-kit holds significant potentials in the development of Polygenic Risk Scores (PRS). The efficacy of PRS is known to hinge on the similarity of the target population to the training population (). With the PRIMED consortium working on methods to improve the performance of PRS in diverse populations, simulations will be pivotal for method evaluation (). In this context, we expect that admix-kit will be an essential part of this effort.

Acknowledgements

We thank Ziqi Xu for testing an early version of software.

References

Atkinson EG, Maihofer AX, Kanai M et al Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet 2021;53:195–204.
Cited Here
Auton A, Brooks LD, Durbin RM et al; 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 2015;526:68–74.
Cited Here
Bitarello BD, Mathieson I. Polygenic scores for height in admixed populations. G3 (Bethesda) 2020;10:4027–36.
Cited Here
Cai M, Xiao J, Zhang S et al A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am J Hum Genet 2021;108:632–55.
Cited Here
Ding Y, Hou K, Xu Z et al Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature 2023;618:774–81.
Cited Here
Gurdasani D, Barroso I, Zeggini E et al Genomics of disease risk in globally diverse populations. Nat Rev Genet 2019;20:520–35.
Cited Here
Hou K, Bhattacharya A, Mester R et al On powerful GWAS in admixed populations. Nat Genet 2021;53:1631–3.
Cited Here
Hou K, Ding Y, Xu Z et al Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat Genet 2023;55:549–58.
Cited Here
Kachuri L, Chatterjee N, Hirbo J et al; Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium Methods Working Group. Principles and methods for transferring polygenic risk scores across global populations. Nat Rev Genet 2023;25:8–25.
Cited Here
Kidd JM, Gravel S, Byrnes J et al Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation. Am J Hum Genet 2012;91:660–71.
Cited Here
Marnetto D, Pärna K, Läll K et al Ancestry deconvolution and partial polygenic score can improve susceptibility predictions in recently admixed individuals. Nat Commun 2020;11:1628.
Cited Here
Martin AR, Kanai M, Kamatani Y et al Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet 2019;51:584–91.
Cited Here
Massarat AR, Lamkin M, Reeve C et al Haptools: a toolkit for admixture and haplotype analysis. Bioinformatics 2023;39. https://academic.oup.com/bioinformatics/article/39/3/btad104/7058928.
Cited Here
Mester R, Hou K, Ding Y et al Impact of cross-ancestry genetic architecture on GWASs in admixed populations. Am J Hum Genet 2023;110:927–39.
Cited Here
Miao J, Guo H, Song G et al Quantifying portable genetic effects and improving cross-ancestry genetic prediction with GWAS summary statistics. Nat Commun 2023;14:832–13.
Cited Here
Mills MC, Rahal C. The GWAS diversity monitor tracks diversity by disease in real time. Nat Genet 2020;52:242–3.
Cited Here
Pasaniuc B, Zaitlen N, Lettre G et al Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a breast cancer consortium. PLoS Genet 2011;7:e1001371.
Cited Here
Reich D, Nalls MA, Kao WHL et al Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet 2009;5:e1000360.
Cited Here
Rocklin M. Dask: parallel computation with blocked algorithms and task scheduling. In: Proceedings of the 14th Python in Science Conference, Austin, Texas, United States. Austin, Texas, United States: SciPy Organizers. 2015.
Cited Here
Ruan Y, Lin Y-F, Feng Y-CA et al; Stanley Global Asia Initiatives. Improving polygenic prediction in ancestrally diverse populations. Nat Genet 2022;54:573–80.
Cited Here
Seldin MF, Pasaniuc B, Price AL et al New approaches to disease mapping in admixed populations. Nat Rev Genet 2011;12:523–8.
Cited Here
Su Z, Marchini J, Donnelly P et al HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 2011;27:2304–5.
Cited Here
Sun Q, Rowland BT, Chen J et al Improving polygenic risk prediction in admixed populations by explicitly modeling ancestral-specific effects via GAUDI. Nat Comm 2024;15(1):1016.
Cited Here
Tan T, Atkinson EG. Strategies for the genomic analysis of admixed populations. Annu Rev Biomed Data Sci 2023;6:105–27.
Cited Here
Van Driest SL, Abul-Husn NS, Glessner JT et al Association between a common, benign genotype and unnecessary bone marrow biopsies among African American patients. JAMA Intern Med 2021;181:1100–5.
Cited Here
Wojcik GL, Graff M, Nishimura KK et al Genetic analyses of diverse populations improves discovery for complex traits. Nature 2019;570:514–8.
Cited Here
Zhang H, Zhan J, Jin J et al A new method for multi-ancestry polygenic prediction improves performance across diverse populations. Nat Genetic 2023;55(10):1757–68.
Cited Here