A simple way to improve multivariate analyses of paleoecological data sets
Abstract Multivariate methods such as cluster analysis and ordination are basic to paleoecology, but the messy nature of fossil occurrence data often makes it difficult to recover clear patterns. A recently described faunal similarity index based on the Forbes coefficient improves results when its complement is employed as a distance metric. This index involves adding terms to the Forbes equation and ignoring one of the counts it employs (that of species found in neither of the samples under consideration). Analyses of simulated data matrices demonstrate its advantages. These matrices include large and small samples from two partially overlapping species pools. In a cluster analysis, the widely used Dice coefficient and the Euclidean distance metric both create groupings that reflect sample size, the Simpson index suggests large differences that do not exist, and the corrected Forbes index creates groupings based strictly on true faunal overlap. In a principal coordinates analysis (PCoA) the Forbes index almost removes the sample-size signal but other approaches create a second axis strongly dominated by sample size. Meanwhile, species lists of late Pleistocene mammals from the United States capture biogeographic signals that standard ordination methods do recover, but the adjusted Forbes coefficient spaces the points out more sensibly. Finally, when biome-scale lists for living mammals are added to the data set and extinct species are removed, correspondence analysis misleadingly separates out the biome lists, and PCoA based on the Dice coefficient places them to the edge of the cloud of fossil assemblage data points. PCoA based on the Forbes index places them in more reasonable positions. Thus, only the adjusted Forbes index is able to recover true biological patterns. These results suggest that the index may be useful in analyzing not only paleontological data sets but any data set that includes species lists having highly variable lengths.
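The sample-size effect described above can be illustrated with a small sketch. It computes the standard Dice, Simpson, and uncorrected Forbes coefficients for two hypothetical species lists, where `a` is the number of shared species and `b` and `c` are the species unique to each list. The species names and list sizes are invented for illustration. The extra terms of the adjusted Forbes index are not reproduced here because the abstract does not give them; the sketch only follows the abstract in dropping the count of species found in neither sample (`d = 0`).

```python
def overlap_counts(sample_x, sample_y):
    """Return (a, b, c): shared species, species only in x, species only in y."""
    x, y = set(sample_x), set(sample_y)
    return len(x & y), len(x - y), len(y - x)


def dice(a, b, c):
    """Dice coefficient: 2a / (2a + b + c)."""
    return 2 * a / (2 * a + b + c)


def simpson(a, b, c):
    """Simpson coefficient: a / (a + min(b, c))."""
    return a / (a + min(b, c))


def forbes(a, b, c, d=0):
    """Uncorrected Forbes coefficient: a * n / ((a + b) * (a + c)).

    With d = 0 the species absent from both samples are ignored,
    as the abstract describes; the published correction terms are omitted.
    """
    n = a + b + c + d
    return a * n / ((a + b) * (a + c))


# Hypothetical data: a small list fully nested inside a large one,
# so any apparent dissimilarity comes from list length alone.
pool = [f"sp{i}" for i in range(40)]
large_sample = pool[:30]
small_sample = pool[:6]

a, b, c = overlap_counts(large_sample, small_sample)
print("a, b, c =", a, b, c)
print("Dice    =", round(dice(a, b, c), 3))
print("Simpson =", round(simpson(a, b, c), 3))
print("Forbes  =", round(forbes(a, b, c), 3))
```

In this toy case the Dice coefficient falls to about 0.33 only because the lists differ in length, while the Simpson and uncorrected Forbes values stay at 1. The uncorrected Forbes coefficient is not bounded above by 1 in general, so its complement is not used as a distance here. The functions assume non-empty samples with at least one shared or unique species.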
- Research Article
20
- 10.1152/jn.1993.70.6.2289
- Dec 1, 1993
- Journal of Neurophysiology
1. The responses of 32 taste neurons in the solitary nucleus of the rat to 12 stimuli were analyzed with multidimensional scaling (MDS) and cluster analysis (CA) procedures. These analyses of empirical taste data were compared with similar analyses of two model data sets of known configuration to help clarify the implications of these methods commonly used in forming conclusions about the organization of the taste system. 2. To relate to possible conclusions about groupings in taste, both model data sets were chosen as the best possible examples of ungrouped data, the first being completely regular (in the form of a checkerboard) across the taste space, the second randomly arranged. The analyses of the present empirical data appear similar to the present ungrouped models, more so the random than the regular model, in the sense that all are amenable to grouping. 3. Because of the similarity of these model MDS and CA solutions to the present empirical solutions and to most published analyses of this sort, the idea is suggested that the appearance of the plots per se for empirical data does not support the conclusion of grouping. And, technically, MDS and CA do not have the statistical power to provide conclusions about issues of neural organization. 4. MDS and CA analyses have two very powerful roles relating to their ability to disclose the hidden organization of complex data sets; they may lend support for or refute theories about the data sets developed from other considerations, and may help generate theories for further consideration. The question of groupings is only one of many such issues. 5. Because data in the present and other reports are quite adequately accounted for by MDS solutions of low dimensionality, it is suggested that their organization is characterized as continuous (i.e., rather than belonging to several disjoint spaces). 6. The use of correlations as distance measures in MDS and CA procedures distorts the spatial solutions, making analysis by visual inspection misleading. For example, using correlations, the true or natural spatial arrangements of data sets are probably less circular or spherical than shown in published MDS solutions. Also they are probably more evenly distributed across the space in the sense that the points are actually more concentrated toward the centers of the spaces; this may have strong influences on interpretations of the general form of the solutions. CA solutions can be influenced in analogous fashion. These problems of distortion of the solutions can be avoided with use of direct, linear estimates of distances. (ABSTRACT TRUNCATED AT 400 WORDS)
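One simple way a correlation-based distance can diverge from a direct, linear distance is sketched below. The two response profiles are hypothetical and differ only in magnitude. The sketch does not reproduce the specific distortion of MDS plots described above; it only shows that `1 - r` discards magnitude information that a Euclidean distance retains.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length profiles."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    norm_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    norm_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (norm_u * norm_v)


def correlation_distance(u, v):
    """Distance defined as 1 - r; zero for perfectly correlated profiles."""
    return 1.0 - pearson(u, v)


def euclidean_distance(u, v):
    """Direct, linear distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical responses of two units to four stimuli:
# same shape across stimuli, tenfold difference in magnitude.
weak_unit = [1.0, 2.0, 4.0, 3.0]
strong_unit = [10.0, 20.0, 40.0, 30.0]

print("correlation distance:", round(correlation_distance(weak_unit, strong_unit), 6))
print("euclidean distance:  ", round(euclidean_distance(weak_unit, strong_unit), 3))
```

Under the correlation measure the two units are effectively identical, while the Euclidean distance separates them clearly, so the choice of measure can change which points end up close together in an MDS or CA solution.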
- Dissertation
- 10.7907/z9d798df.
- Jan 1, 2017
The proliferation of sensors and advancement of technology has led to the production and collection of unprecedented amounts of data in recent years. The data are often noisy, non-linear, and high-dimensional, and the effectiveness of traditional tools may be limited. Thus, the technological advances that enable the ubiquitous collection of data from the cosmological scale to the subatomic scale also necessitate the development of complementary tools that address the new nature of the data. Recently, there has been much interest in and success with developing topologically-motivated techniques for data analysis. These approaches are especially useful when a topological method is sensitive to large- and small-scale features that might not be detected by methods that require a level of geometric detail that is not provided by the data or by methods that may obscure geometric features, such as principal component analysis (PCA), multi–dimensional scaling (MDS), and cluster analysis. Our work explores topological data analysis through two frameworks. In the first part, we provide a tool for detecting material coherence from a set of spatially sparse particle trajectories via the study of a map induced on homology by the braid corresponding to the motion of particles. While the theory of coherent structures has received a great deal of attention and benefited from many advances in recent years, many of these techniques are limited when the data are sparse. We demonstrate through various examples that our work provides a practical and scalable tool for identifying coherent sets from a sparse set of particle trajectories using eigenanalysis. In the second part, we formalize the local-to-global structure captured by topology in the setting of point clouds. 
We extend existing tools in topological data analysis and provide a theoretical framework for studying topological features of a point cloud over a range of resolutions, enabling the analysis of topological features using statistical methods. We apply our tools to the analysis of high-dimensional geospatial sensor data and provide a statistic for quantifying climate anomalies.
- Research Article
16
- 10.1016/0098-3004(78)90054-7
- Jan 1, 1978
- Computers and Geosciences
Methods for the quantification of assemblage zones based on multivariate analysis of weighted and unweighted data
- Research Article
111
- 10.1017/pab.2019.23
- Sep 1, 2019
- Paleobiology
The estimation of origination and extinction rates and their temporal variation is central to understanding diversity patterns and the evolutionary history of clades. The fossil record provides the only direct evidence of extinction and biodiversity changes through time and has long been used to infer the dynamics of diversity changes in deep time. The software PyRate implements a Bayesian framework to analyze fossil occurrence data to estimate the rates of preservation, origination, and extinction while incorporating several sources of uncertainty. Building upon this framework, we present a suite of methodological advances including more complex and realistic models of preservation and the first likelihood-based test to compare the fit across different models. Further, we develop a new reversible jump Markov chain Monte Carlo algorithm to estimate origination and extinction rates and their temporal variation, which provides more reliable results and includes an explicit estimation of the number and temporal placement of statistically significant rate changes. Finally, we implement a new C++ library that speeds up the analyses by orders of magnitude, therefore facilitating the application of the PyRate methods to large data sets. We demonstrate the new functionalities through extensive simulations and with the analysis of a large data set of Cenozoic marine mammals. We compare our analytical framework against two widely used alternative methods to infer origination and extinction rates, revealing that PyRate decisively outperforms them across a range of simulated data sets. Our analyses indicate that explicit statistical model testing, which is often neglected in fossil-based macroevolutionary analyses, is crucial to obtain accurate and robust results.
- Research Article
364
- 10.1186/s12859-022-04675-1
- May 31, 2022
- BMC Bioinformatics
Background: Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), “fuzzy” (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
Results: We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
Conclusions: Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
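The dependence of clustering outcomes on subgroup separation can be sketched with a very small simulation. This is a deliberately simplified stand-in for the study design, not a reproduction of it: it assumes a single feature, two equally sized subgroups with unit variance, and a plain two-cluster k-means. It reports classification accuracy rather than statistical power. The function names, the number of repeats, and the seed are illustrative choices.

```python
import random


def two_means_1d(values, iterations=50):
    """Two-cluster k-means on one feature; centroids start at the min and max."""
    low, high = min(values), max(values)
    labels = []
    for _ in range(iterations):
        # Assign each value to the nearer centroid (0 = low, 1 = high).
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group_low = [v for v, lab in zip(values, labels) if lab == 0]
        group_high = [v for v, lab in zip(values, labels) if lab == 1]
        new_low = sum(group_low) / len(group_low)
        new_high = sum(group_high) / len(group_high)
        if new_low == low and new_high == high:
            break  # assignments are stable
        low, high = new_low, new_high
    return labels


def mean_accuracy(delta, n_per_group=20, repeats=200, seed=0):
    """Average share of correctly separated points for centroid separation delta."""
    rng = random.Random(seed)
    truth = [0] * n_per_group + [1] * n_per_group
    total = 0.0
    for _ in range(repeats):
        values = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        values += [rng.gauss(delta, 1.0) for _ in range(n_per_group)]
        labels = two_means_1d(values)
        accuracy = sum(t == lab for t, lab in zip(truth, labels)) / len(truth)
        total += max(accuracy, 1.0 - accuracy)  # cluster labels are arbitrary
    return total / repeats


for delta in (1.0, 2.0, 3.0, 4.0):
    print(f"separation {delta}: mean accuracy {mean_accuracy(delta):.3f}")
```

With N = 20 per subgroup, accuracy in this toy setting rises with the separation, which is in line with the abstract's point that effect size matters more than adding participants.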
- Research Article
52
- 10.1093/sysbio/syy035
- May 18, 2018
- Systematic Biology
Time-calibrated phylogenies of living species have been widely used to study the tempo and mode of species diversification. However, it is increasingly clear that inferences about species diversification, extinction rates in particular, can be unreliable in the absence of paleontological data. We introduce a general framework based on the fossilized birth-death process for studying speciation-extinction dynamics on phylogenies of extant and extinct species. The model assumes that phylogenies can be modeled as a mixture of distinct evolutionary rate regimes and that a hierarchical Poisson process governs the number of such rate regimes across a tree. We implemented the model in BAMM, a computational framework that uses reversible jump Markov chain Monte Carlo to simulate a posterior distribution of macroevolutionary rate regimes conditional on the branching times and topology of a phylogeny. The implementation we describe can be applied to paleontological phylogenies, neontological phylogenies, and to phylogenies that include both extant and extinct taxa. We evaluate performance of the model on data sets simulated under a range of diversification scenarios. We find that speciation rates are reliably inferred in the absence of paleontological data. However, the inclusion of fossil observations substantially increases the accuracy of extinction rate estimates. We demonstrate that inferences are relatively robust to at least some violations of model assumptions, including heterogeneity in preservation rates and misspecification of the number of occurrences in paleontological data sets.
- Research Article
15
- 10.1097/tp.0000000000003316
- Aug 18, 2020
- Transplantation
A Primer on Machine Learning.
- Research Article
3
- 10.5897/ajb2018.16725
- Apr 3, 2019
- African Journal of Biotechnology
Fusaria are very diverse and destructive pathogens affecting different crops. However, their identity and diversity are unresolved in countries like Ethiopia, where various crop species are grown under differing environmental conditions. The objectives of this paper were to identify Fusarium spp. associated with sorghum stalk rot in Southern Ethiopia, and elucidate the genetic diversity within and between the species. For this purpose, Fusaria associated with sorghum from two locations in Southern Ethiopia were isolated. Sequencing of the elongation factor 1-alpha gene (EF-1α) was used for species identification. In addition, AFLP analysis was employed for further diversity studies within and between the Fusarium spp. Sequence analyses revealed the presence of two Fusarium spp. The first was identified as Fusarium andiyazi, while the identity of the second remains to be resolved. AFLP analysis clustered the isolates into two major groups. The Dice similarity coefficients ranged from 0.39 to 0.91 for isolates of F. andiyazi, while isolates within the new Fusarium spp. had Dice similarity coefficients varying between 0.69 and 0.96. Cluster analysis and principal coordinate analysis clearly indicated a genetic separation between the two species. Both groups were pathogenic to mature sorghum plants following a toothpick inoculation test. More research is required to identify the new species and elucidate the pathogenicity of the isolates. Key words: EF-1α, Fusarium andiyazi, genetic similarity, sequence analysis, Sorghum bicolor.
- Research Article
34
- 10.1016/j.ympev.2014.01.003
- Jan 17, 2014
- Molecular Phylogenetics and Evolution
Assessment of genetic diversity among Indian potato (Solanum tuberosum L.) collection using microsatellite and retrotransposon based marker systems
- Research Article
25
- 10.1007/s00122-003-1354-5
- Jul 24, 2003
- Theoretical and Applied Genetics
Genotypic diversity has been detected among aromatic grapevines (Vitis vinifera) by molecular markers (AFLPs). The 22 primer-pairs generated a total of 1,331 bands of which 564 (40%) were polymorphic over all the genotypes. The bootstrap analysis pointed out that a large number of polymorphic bands (200-400) has to be used for a better estimation of the genetic distances among genotypes; 383 polymorphic AFLP bands were used for the cluster and the principal coordinate analyses because they did not present missing data across all the genotypes. The cluster analysis (UPGMA), based on polymorphic AFLP markers, revealed no relationship between the Moscato and Malvasia grapevines. The Malvasias, unlike the Moscatos distinguished by their distinct muscat aroma, have to be considered a more complex group because it includes muscat and non-muscat grapevines. The principal coordinate analysis (PCO) confirmed the pattern of the cluster analysis only for those varieties which presented a low coefficient of dissimilarity, while for the other varieties there was no correspondence between the two analyses. The pattern of aggregation among aromatic grapevines in the cluster and principal coordinate analyses does not support any classification that might include an aromatic grapevine group in V. vinifera. Even though some synonyms and homonyms are present among aromatic grapevines (V. vinifera), genetic diversity exists among genotypes in AFLP markers.
- Research Article
- 10.1016/j.ajodo.2015.03.015
- Jun 1, 2015
- American Journal of Orthodontics and Dentofacial Orthopedics
Inference from a sample mean--Part 1.
- Research Article
4
- 10.22092/cbj.2014.109674
- Jan 15, 2014
Shiri, M. R., Choukan, R., and Aliyev, R. T. 2014. Study of genetic diversity among maize hybrids using SSR markers and morphological traits under two different irrigation conditions. Crop Breeding Journal 4 (1): 65-72. Genetic diversity of 38 maize hybrids was studied using 12 SSR primers and morphological traits under two different irrigation conditions. The 38 hybrids were evaluated in two trials, one under well-watered (WW) conditions and one under drought-stress (DS) conditions, using an RBCD design with three replications for two years (2008-09) in Moghan, Iran. The total number of PCR-amplified products was 40 bands, all of them polymorphic. Primer Phi031 generated the highest number of bands (6). Among the studied primers, UMC2359, PHI031 and UMC1862 showed the maximum polymorphism information content (PIC) and the greatest diversity. These were the most informative primers and thus could be used to assess the diversity of maize hybrids. To determine the genetic relationship among maize hybrids, cluster analysis was performed based on both morphological traits (using the Ward method) and SSR markers (using the CLINK method). Maize hybrids were divided into three main groups based on SSR markers. Principal coordinate analysis (PCoA) of a similarity matrix of hybrids showed that the first 13 coordinates explained 84.73% of the total variance, whereas the first two coordinates explained only 28.14% of total variance. Cluster analysis of morphological traits divided the maize hybrids into two groups under both WW and DS conditions. Grouping hybrids based on morphological data under WW and DS conditions yielded different groups. Generally, results indicated that SSR markers are able to more efficiently classify closely related maize hybrids than morphological traits.
- Research Article
- 10.22058/jpmb.2017.31701.1081
- Jul 1, 2017
Comparing different methods of estimating the genetic diversity could define their usefulness in plant breeding programs. In this study, a total of 18 morphological traits and 20 simple sequence repeat (SSR) loci were used to study the morphological and genetic diversity among 20 maize hybrids selected from different countries, and to classify the hybrids into groups based on molecular profiles and morphological traits. To collect morphological data, a field experiment was carried out using an RBCD design with three replications in Moghan, Ardabil, Iran. The highest estimates for genetic coefficients of variation were observed in anthesis-silking interval, followed by grain yields, leaf chlorophyll rates, kernel row numbers, and ear heights. The total number of PCR-amplified products was 84 bands, all of which were polymorphic. Among the studied primers, NC009, BNLG1108, BNLG1194, PHI026 and PHI057 showed the maximum polymorphism information content (PIC) and the greatest diversity. To determine the genetic relationship among maize hybrids, the cluster analysis was performed based on both morphological traits (using the Ward method) and SSR markers (using the CLINK method). The cluster analysis of morphological traits divided the maize hybrids into five groups. Furthermore, maize hybrids were divided into seven main groups based on SSR markers. Principal coordinate analysis (PCoA) of a similarity matrix of hybrids for SSR data showed that the first 15 coordinates explained 97.21% of the total variance, whereas the first two coordinates explained only 33.14% of the total variance. Generally, results indicated that SSR markers were able to classify closely related maize hybrids more efficiently than morphological traits.
- Research Article
9
- 10.3906/tar-1711-37
- Aug 7, 2018
- TURKISH JOURNAL OF AGRICULTURE AND FORESTRY
Terrestrial orchid species are natural sources of salep and a closely related group of plant species widely distributed throughout Turkey. The phylogenetic relationship among fourteen different tuber-producing orchid species was investigated after analyzing phenotypic and genetic variation within and among the natural population through fifteen morphometric traits and ten random amplified polymorphic DNA (RAPD) primer combinations. Statistical analyses (principal component analysis (PCA), principal coordinate analysis (PCoA), and cluster analysis) using the generated data identified taxonomic and genetic distance within the studied plant samples. The PCA results from morphological traits show no major groupings within or among species; instead, the species largely overlap, with only a few distinctly characterized. In addition, the UPGMA-based phenogram with Euclidean distance (0-1) produces five major clusters among the studied orchid species according to their taxonomic status, with few exceptions. On the other hand, PCoA and the phylogenetic dendrogram with the coefficient (0.56-0.79) from RAPD band profiles determine the true genetic diversity of those species. Although both combinations of genetic and phenotypic characteristics reveal the phylogenetic relationship of some of the studied species very effectively, they are not clear for others. These results suggest that in the natural population of terrestrial orchid species significant amounts of gene flow are ongoing at the intra- and interspecies level. Therefore, it is recommended that conservation studies of these groups of orchid species should be done as a geographical unit rather than according to taxonomic status.
- Research Article
9
- 10.1007/s00704-017-2319-y
- Dec 1, 2017
- Theoretical and Applied Climatology
The precipitation patterns of seventeen locations in Bangladesh from 1961 to 2014 were studied using a cluster analysis and metric multidimensional scaling. In doing so, the current research applies four major hierarchical clustering methods to precipitation in conjunction with different dissimilarity measures and metric multidimensional scaling. A variety of clustering algorithms were used to provide multiple clustering dendrograms for a mixture of distance measures. The dendrogram of pre-monsoon rainfall for the seventeen locations formed five clusters. The pre-monsoon precipitation data for the areas of Srimangal and Sylhet were located in two clusters across the combination of five dissimilarity measures and four hierarchical clustering algorithms. The single linkage algorithm with Euclidian and Manhattan distances, the average linkage algorithm with the Minkowski distance, and Ward’s linkage algorithm provided similar results with regard to monsoon precipitation. The results of the post-monsoon and winter precipitation data are shown in different types of dendrograms with disparate combinations of sub-clusters. The schematic geometrical representations of the precipitation data using metric multidimensional scaling showed that the post-monsoon rainfall of Cox’s Bazar was located far from those of the other locations. The results of a box-and-whisker plot, different clustering techniques, and metric multidimensional scaling indicated that the precipitation behaviour of Srimangal and Sylhet during the pre-monsoon season, Cox’s Bazar and Sylhet during the monsoon season, Maijdi Court and Cox’s Bazar during the post-monsoon season, and Cox’s Bazar and Khulna during the winter differed from those at other locations in Bangladesh.