Estimating gametic frequencies in multilocus polymorphic systems from the numerical distribution of multilocus genotypes in a population sample ("analysis without pedigrees") is difficult because some gametes cannot be recognized in the data. Even in codominant systems, where every allele can be recognized from the genotype and allele frequencies can therefore be estimated directly ("complete data"), the frequencies of multilocus gametes sometimes cannot be estimated from multilocus genotype data, whether population data or even family data (collected to study genotypic segregation or linkage) are used ("incomplete data"). Such "incomplete data" are analyzed under the corresponding genetic models using the expectation-maximization (EM) algorithm. In this study, an EM algorithm based on the random-mating model for a nonsubdivided population was used to estimate gametic frequencies. The algorithm places no limits on the number of loci or on the number of alleles at each locus. Loci and alleles are identified by number, which makes it possible to process them in loops. For each combination of alleles at a given combination of m out of L loci (L is the total number of loci studied), the alleles in the combination are assigned the value 1 and all remaining alleles the value 0. The sum of these zeros and ones over a gamete is its gametic value (h), and the sum of the gametic values of the two gametes that form a genotype is the genotypic value (g) of that genotype. Gametes with the same h are then pooled into a single class, which reduces the number of parameters to be estimated. In the general case of m loci, this procedure yields m + 1 classes of gametes and 2m + 1 classes of genotypes, with genotypic values g = 0, 1, 2, ..., 2m.
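The scoring scheme described above can be illustrated with a minimal Python sketch. The data layout (dictionaries keyed by locus number, unphased genotypes stored as allele pairs) and all function names are assumptions made for illustration; they are not taken from the study. The sketch relies on one observation: g counts the selected alleles carried by an individual, so it can be computed from an unphased genotype even though the two gametic values cannot be observed separately.

```python
from collections import Counter


def gametic_value(gamete, selected):
    """Gametic value h: number of selected alleles carried by one gamete.

    gamete   -- dict mapping locus number -> allele number
    selected -- dict mapping each of the m chosen loci -> the allele scored as 1
    """
    return sum(1 for locus, allele in selected.items() if gamete[locus] == allele)


def genotypic_value(genotype, selected):
    """Genotypic value g, computed from an unphased genotype.

    genotype -- dict mapping locus number -> (allele, allele)
    The result equals h1 + h2 for any pair of gametes forming this genotype.
    """
    return sum(genotype[locus].count(allele) for locus, allele in selected.items())


def class_counts(genotypes, selected):
    """Numbers n_g of individuals in each class g = 0, 1, ..., 2m."""
    m = len(selected)
    counts = Counter(genotypic_value(gt, selected) for gt in genotypes)
    return [counts.get(g, 0) for g in range(2 * m + 1)]
```

With m = 2 selected loci, `class_counts` returns a list of five counts, one for each genotypic value from 0 to 4, regardless of how many alleles each locus has.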
The unknown frequencies of the m + 1 classes of gametes can be expressed as functions of the gametic frequencies whose maximum likelihood estimates (MLEs) were obtained in all previous EM procedures and of the single unknown frequency, Pm(m), that is to be estimated in the current EM procedure. At the expectation step, the expected frequencies Fm(g) of the genotypes with genotypic value g are expressed in terms of products of the frequencies of the m + 1 classes of gametes. The genotype data are the numbers (ng) of individuals with genotypic values g = 0, 1, 2, 3, ..., 2m. The maximization step maximizes the logarithm of the likelihood function (LLF) for the ng values. Thus, in each case the EM algorithm reduces to solving a single equation in one unknown parameter using the ng values, i.e., the numbers of individuals obtained after the corresponding regrouping of the individuals' genotype data. Analysis of Kurbatova's data on the MNSs system and the Rhesus system (alleles C, Cw, c, D, d, E, e) with Weir's EM algorithm and with the EM algorithm suggested in this study yielded similar results. However, the MLEs obtained with either algorithm often converged to a wrong solution: the sum of the frequencies of all gametes (4 and 12 gametes for MNSs and Rhesus, respectively) was not equal to 1.0, even when the global maximum of the LLF was reached for each of them (as it was for MNSs with Weir's EM algorithm) and each parameter fell within its admissible limits (e.g., [0, min(PN, Ps)] for PNs). The χ² function is proposed as a goodness-of-fit function for the distribution of genotypes in a sample, to be used for selecting acceptable solutions.
However, the minimum of this function guarantees that a solution is acceptable only if all constraints on the parameters are met: the estimates of the gametic frequencies sum to 1.0, each frequency falls within its admissible limits, and the "gametic algebra" is satisfied (none of the frequencies is negative).
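As a rough illustration of the one-parameter estimation and of the acceptance checks, the Python sketch below treats the two-locus case (m = 2) only. It rests on several assumptions that do not come from the study: the allele frequencies p_a and p_b are taken as already estimated; the gamete-class frequencies are written as P(0) = 1 - p_a - p_b + x, P(1) = p_a + p_b - 2x and P(2) = x, where x is the unknown frequency; and a plain grid search over the admissible interval stands in for the EM iteration. All function names are hypothetical.

```python
import math


def gamete_class_freqs(x, p_a, p_b):
    """Frequencies of gamete classes h = 0, 1, 2 for two loci (assumed form).

    x is the unknown frequency of the gamete carrying both selected alleles;
    p_a and p_b are allele frequencies estimated at the previous stage.
    """
    return [1.0 - p_a - p_b + x, p_a + p_b - 2.0 * x, x]


def genotype_class_freqs(gam):
    """Expected frequencies of classes g = 0..2m under random mating."""
    freqs = [0.0] * (2 * len(gam) - 1)
    for i, p_i in enumerate(gam):
        for j, p_j in enumerate(gam):
            freqs[i + j] += p_i * p_j  # gametes of classes i and j give g = i + j
    return freqs


def log_likelihood(x, counts, p_a, p_b):
    """Multinomial log-likelihood of the class counts n_g (constant omitted)."""
    expected = genotype_class_freqs(gamete_class_freqs(x, p_a, p_b))
    total = 0.0
    for n, f in zip(counts, expected):
        if n > 0:
            if f <= 0.0:
                return float("-inf")  # observed class has zero expected frequency
            total += n * math.log(f)
    return total


def estimate_x(counts, p_a, p_b, n_grid=2001):
    """Grid search over the admissible interval (a stand-in for EM)."""
    lo, hi = max(0.0, p_a + p_b - 1.0), min(p_a, p_b)
    grid = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]
    return max(grid, key=lambda x: log_likelihood(x, counts, p_a, p_b))


def chi_square(counts, expected_freqs):
    """Chi-square goodness of fit between observed and expected class counts."""
    n = sum(counts)
    return sum((c - n * f) ** 2 / (n * f)
               for c, f in zip(counts, expected_freqs) if f > 0.0)


def is_acceptable(gamete_freqs, upper_bounds, tol=1e-6):
    """Check non-negativity, upper limits, and that the frequencies sum to 1."""
    in_bounds = all(-tol <= p <= ub + tol
                    for p, ub in zip(gamete_freqs, upper_bounds))
    return in_bounds and abs(sum(gamete_freqs) - 1.0) <= tol
```

A grid search is used here because it does not depend on a starting value and so cannot stall at a local maximum inside the admissible interval; it says nothing about whether separately estimated gametic frequencies sum to 1.0, which is why `is_acceptable` is applied to the full set of estimates as a separate step.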