ERICA

Evolutionary Relationship Inference using a CNN-based Approach

Resolving the relationships among taxa is one of the basic tasks in evolutionary biology. Because processes such as incomplete lineage sorting (ILS), horizontal gene transfer, and hybridization can cause discordance between gene trees and the species tree, a bifurcating phylogenetic tree cannot represent the full evolutionary history of organisms.

ERICA is a deep learning-based approach for inferring complex evolutionary history from sequence data. In brief, ERICA resolves the local and global topological relationships of focal taxa using pre-trained convolutional neural networks. For flexibility, ERICA accepts both comparative and population genomic data for topology inference. The ERICA pipeline also provides a putative list of introgressed loci by identifying topological discordance across the genome.

An offline version of the software and detailed instructions are available at https://github.com/YuboZhangPKU/ERICA.

The datasets used in model training are available HERE.

The ERICA online pipeline consists of three steps with available demo data:

1. Uploading or generating custom sequence alignments
2. Evaluating evolutionary relationships using trained models
3. Post-processing and visualizing the results

ERICA first infers the topology of focal species or populations from DNA sequence alignments directly. For example, there are three topological structures in the four-taxon case, which contains three ingroup taxa and one outgroup taxon. Similarly, fifteen topologies exist when there are four ingroup taxa.
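The counts of three and fifteen follow from the standard formula for the number of rooted bifurcating topologies on n labeled ingroup taxa, (2n - 3)!!. The short sketch below only illustrates that arithmetic; the function name is hypothetical and is not part of ERICA.

```python
def count_rooted_topologies(n_ingroup):
    """Number of rooted bifurcating topologies for n labeled ingroup taxa.

    Uses the double factorial (2n - 3)!! = 3 * 5 * ... * (2n - 3).
    The outgroup only fixes the root, so it is not counted here.
    """
    if n_ingroup < 1:
        raise ValueError("need at least one ingroup taxon")
    count = 1
    for k in range(3, 2 * n_ingroup - 2, 2):  # 3, 5, ..., 2n - 3
        count *= k
    return count


if __name__ == "__main__":
    for n in (3, 4):
        print(n, "ingroup taxa:", count_rooted_topologies(n), "topologies")
```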

As different genomic regions or individuals may have heterogeneous evolutionary histories, we used a three-dimensional or fifteen-dimensional vector to represent the probability of each possible topology.
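As a minimal illustration of this representation in the four-taxon case, the sketch below encodes a single topology as a one-hot vector and reads the most probable topology from a three-dimensional probability vector. The labels "A", "B", "C" and the function names are illustrative assumptions, not ERICA's actual interface.

```python
TOPOLOGIES = ["A", "B", "C"]  # illustrative labels for the three four-taxon topologies


def one_hot(topology):
    """Encode a single topology as a one-hot vector."""
    if topology not in TOPOLOGIES:
        raise ValueError("unknown topology: %r" % topology)
    return [1.0 if t == topology else 0.0 for t in TOPOLOGIES]


def major_topology(probabilities, tol=1e-6):
    """Return the label with the highest probability in a probability vector."""
    if len(probabilities) != len(TOPOLOGIES):
        raise ValueError("expected one probability per topology")
    if abs(sum(probabilities) - 1.0) > tol:
        raise ValueError("probabilities should sum to 1")
    best = max(range(len(TOPOLOGIES)), key=lambda i: probabilities[i])
    return TOPOLOGIES[best]


if __name__ == "__main__":
    print(one_hot("B"))
    print(major_topology([0.7, 0.1, 0.2]))
```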

Sketch diagrams show possible topologies of the four-taxon model (a), regions with spatial heterogeneity (b), and individual heterogeneity (c). Major alleles and minor alleles in sequence alignment are labeled in light grey and dark yellow, respectively.

(a): the three rooted topologies are encoded in One-Hot format.
(b): a window with two segments that have different evolutionary histories.
(c): in the case of multiple samples per taxon, two haplotypes of taxon P2 are similar to P1, and one is similar to P3, likely due to ancestral polymorphism or gene flow.

Topology inference depends on aligned DNA sequences. ERICA accepts multiple sequence alignment (MSA) data as input and uses trained convolutional neural networks (CNNs) to predict topology probabilities for genomic data with a step window size of 5 kb. In the current version, because the model input dimension is fixed, 1 to 8 haplotypes can be sampled for each taxon, and the order of haplotypes does not affect the results. For a taxon with fewer than 8 sequences, the sequences are randomly resampled. For large population genomic datasets, representative samples can be selected manually for downstream analysis.
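The resampling step for a taxon with fewer than 8 haplotypes could look roughly like the sketch below. It assumes that the original sequences are kept and the remaining rows are drawn from them at random with replacement; the exact scheme used by ERICA may differ, and the function name is hypothetical.

```python
import random

HAPLOTYPES_PER_TAXON = 8  # model input dimension per taxon, as stated in the text


def fill_taxon(sequences, rng=None):
    """Pad a taxon to 8 haplotypes by random resampling (assumed scheme)."""
    rng = rng or random.Random()
    n = len(sequences)
    if not 1 <= n <= HAPLOTYPES_PER_TAXON:
        raise ValueError("each taxon needs between 1 and 8 haplotypes")
    # Keep the originals, then draw the missing rows with replacement.
    extra = [rng.choice(sequences) for _ in range(HAPLOTYPES_PER_TAXON - n)]
    return list(sequences) + extra


if __name__ == "__main__":
    taxon = ["ACGT", "ACGA", "ACGG"]
    for row in fill_taxon(taxon, random.Random(0)):
        print(row)
```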

SNP files in VCF format are also supported. A consensus sequence for each individual is generated from the reference genome and the variant calls recorded in the VCF file.
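The idea of building a consensus sequence from a reference and variant calls can be sketched as follows. This is a simplified illustration that handles single-base substitutions only and takes the calls as a plain dictionary instead of parsing a VCF file; the function name and input layout are assumptions, not ERICA's implementation. Positions are 1-based, as in the VCF POS column.

```python
def apply_snps(reference, snps):
    """Return a consensus sequence with SNP alleles substituted into the reference.

    reference: reference sequence for one chromosome or region.
    snps: dict mapping 1-based position -> alternate base (SNPs only).
    """
    consensus = list(reference)
    for position, alt in snps.items():
        if not 1 <= position <= len(consensus):
            raise ValueError("position %d is outside the reference" % position)
        if len(alt) != 1:
            raise ValueError("only single-base substitutions are handled here")
        consensus[position - 1] = alt  # convert 1-based position to 0-based index
    return "".join(consensus)


if __name__ == "__main__":
    print(apply_snps("ACGTACGT", {2: "T", 8: "A"}))
```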

Click HERE to view the demo results.

Gene flow between non-sister species can lead to a genealogy that differs from the species tree. For example, in the four-taxon case with topology A (((P1, P2), P3), O) as the species tree, gene flow between P1 and P3, or between P2 and P3, increases the probability of topology B or topology C, respectively. Similar patterns are observed for the other two species trees. For the five-taxon case with asymmetric and symmetric species trees, the topology changes caused by gene flow are listed below.
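For the four-taxon case, the rule can be written as a small lookup: gene flow between two ingroup taxa favors the topology in which those two taxa are sisters. The sketch below assumes that topology B groups P1 with P3 and topology C groups P2 with P3, which is how the example above reads; the names are illustrative only.

```python
# Sister pair of each four-taxon topology (B and C are inferred from the
# example in the text, not taken from ERICA's code).
SISTER_PAIRS = {
    "A": frozenset(["P1", "P2"]),
    "B": frozenset(["P1", "P3"]),
    "C": frozenset(["P2", "P3"]),
}


def topology_favored_by_gene_flow(taxon_x, taxon_y):
    """Return the topology whose sister pair matches the two exchanging taxa."""
    pair = frozenset([taxon_x, taxon_y])
    for label, sisters in SISTER_PAIRS.items():
        if sisters == pair:
            return label
    raise ValueError("expected two different taxa among P1, P2, P3")


if __name__ == "__main__":
    print(topology_favored_by_gene_flow("P1", "P3"))
    print(topology_favored_by_gene_flow("P2", "P3"))
```

If the returned label equals the species tree, the gene flow is between sister taxa and does not produce a discordant topology.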

Therefore, introgressed regions can be identified by evaluating the discordance between gene trees and the species tree, using the highly credible phylogeny predicted by the CNN models. More specifically, the species tree can be determined from the genome-wide major topology in the ERICA results, from other prior knowledge, or from both. The post-processing programs can calculate the mean value over given windows and filter regions that support alternative topologies. A default or user-defined threshold can be used to distinguish signatures of introgression from incomplete lineage sorting (ILS).
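The post-processing logic described above can be sketched as follows: average the per-window probabilities, take the genome-wide major topology as the species tree, and flag windows where an alternative topology passes a threshold. The threshold of 0.9 and all function names are hypothetical placeholders, not ERICA's defaults.

```python
def mean_probabilities(windows):
    """Mean probability of each topology over a list of per-window vectors."""
    n = len(windows)
    k = len(windows[0])
    return [sum(w[i] for w in windows) / n for i in range(k)]


def flag_discordant_windows(windows, species_index, threshold=0.9):
    """Return (window_index, topology_index) pairs that support an
    alternative topology with probability >= threshold (assumed cutoff)."""
    flagged = []
    for idx, probs in enumerate(windows):
        for topo, p in enumerate(probs):
            if topo != species_index and p >= threshold:
                flagged.append((idx, topo))
    return flagged


if __name__ == "__main__":
    windows = [
        [0.90, 0.05, 0.05],
        [0.80, 0.10, 0.10],
        [0.02, 0.03, 0.95],
        [0.85, 0.10, 0.05],
    ]
    means = mean_probabilities(windows)
    species_index = means.index(max(means))  # genome-wide major topology
    print(species_index, flag_discordant_windows(windows, species_index))
```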

The visualization module of the ERICA pipeline can show the probability of each topology along a chromosome or within focal genomic regions. Several kinds of diagrams can be plotted, such as line plots, area plots, and plots of the probability of the major topology.

An example line plot shows the topology probabilities of 10 kb sliding windows within a 1 Mb interval of a real genomic data set. Note that the major topology supports P1 and P2 as sister species (topo A), but there is a strong signal of introgression between P2 and P3 (topo C) at 0.71-0.83 Mb.

TIY

Data pre-processing note: generating custom sequence alignments

The prediction module of the ERICA pipeline uses DNA sequence alignments as input. Alignments containing 32 and 40 rows are required for the four-taxon and five-taxon analyses, respectively, with each row representing one aligned sequence. Each consecutive block of eight rows is treated as belonging to one taxon, and no additional annotation is required.
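The row layout can be checked with a few lines of code. The sketch below verifies that an alignment has eight rows per taxon and equal-length rows, and maps a row index to its taxon block. Taxa are referred to by block index only, since the expected order of taxa is not stated here; the function names are hypothetical.

```python
ROWS_PER_TAXON = 8  # each consecutive block of eight rows is one taxon


def check_alignment(rows, n_taxa):
    """Raise ValueError unless the alignment has 8 * n_taxa equal-length rows."""
    expected = n_taxa * ROWS_PER_TAXON  # 32 for four taxa, 40 for five taxa
    if len(rows) != expected:
        raise ValueError("expected %d rows, got %d" % (expected, len(rows)))
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same aligned length")


def taxon_of_row(row_index, n_taxa):
    """Return the 0-based taxon block that a 0-based row index belongs to."""
    if not 0 <= row_index < n_taxa * ROWS_PER_TAXON:
        raise IndexError("row index out of range")
    return row_index // ROWS_PER_TAXON


if __name__ == "__main__":
    alignment = ["ACGT"] * 32
    check_alignment(alignment, n_taxa=4)
    print(taxon_of_row(0, 4), taxon_of_row(31, 4))
```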

When too few samples are available for a taxon, sequences can be used multiple times to fit the required dimensions. Conversely, for large datasets, representative samples can be selected first by dimensionality reduction and clustering analysis.

The pipeline can also handle genotype data stored in variant call format (VCF) files, which are widely used in population-level studies. The program is designed for diploid genomes and is insensitive to the order of alleles. Both phased and unphased genotype data can be used. By randomly resampling a given number of individuals, the program can work properly with as few as one individual per population. Up to four or eight individuals can be provided, depending on whether both alleles of each individual are used.
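The limit of four or eight individuals follows from the eight haplotype rows per taxon: four diploid individuals contribute eight rows when both alleles are used, and eight individuals contribute eight rows when one allele is used. The sketch below illustrates this arithmetic. Picking one allele at random in the single-allele case is an assumption, and the function names are hypothetical.

```python
import random

MAX_HAPLOTYPES = 8  # rows available per taxon


def max_individuals(use_both_alleles):
    """Largest number of diploid individuals that fits into eight rows."""
    return MAX_HAPLOTYPES // (2 if use_both_alleles else 1)


def genotypes_to_haplotypes(individuals, use_both_alleles, rng=None):
    """Turn (allele_1_seq, allele_2_seq) pairs into haplotype rows.

    With one allele per individual, the allele is chosen at random
    (assumed behavior), so allele order does not matter.
    """
    rng = rng or random.Random()
    if not 1 <= len(individuals) <= max_individuals(use_both_alleles):
        raise ValueError("too few or too many individuals for this setting")
    haplotypes = []
    for allele_1, allele_2 in individuals:
        if use_both_alleles:
            haplotypes.extend([allele_1, allele_2])
        else:
            haplotypes.append(rng.choice([allele_1, allele_2]))
    return haplotypes


if __name__ == "__main__":
    people = [("ACGT", "ACGA")] * 4
    print(len(genotypes_to_haplotypes(people, use_both_alleles=True)))
```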

Last modified on October 11, 2022
Copyright © 2020-2023 by the Computational Center, CIBR.
For any comments or suggestions, please mail to: HPC
ICP registration number: 京ICP备18029179号