Abstract

Data used to be hard to come by and even harder to analyze in any large-scale fashion. Advances in collection and storage capabilities have made it relatively convenient to produce and accumulate large volumes of data, often automatically archiving them in data repositories. The number of nucleotide bases in GenBank, the National Center for Biotechnology Information's genetic sequence database, has been doubling approximately every 18 months; a recent release contains more than 110 billion bases. On the other hand, data are still scarce in some domains. Microarray data, for example, are typically high dimensional and have a high variable-to-case ratio; it is not uncommon for a data sample to contain more variables than cases. The problem space remains large, as it is determined not by the sample size but by the number of variables. This situation further complicates analysis, because the available data often do not provide sufficient statistical support for many analysis methods. The volume of data, combined with the number of variables that must be considered, is far beyond what is amenable to manual inspection. Automated and semi-automated data analysis is thus essential to sift through the data for meaningful conclusions. Depending on the discipline, this process is variously called data mining, knowledge discovery, machine learning, or inductive inference.
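The following is a minimal sketch, not taken from the article, that illustrates the high variable-to-case ratio described above. It builds a synthetic "microarray-like" matrix with far more variables (genes) than cases (samples); the sample size of 50 and the 2,000 variables are arbitrary choices for illustration. Because the rank of the data matrix cannot exceed the number of cases, an ordinary linear model fits the data exactly in infinitely many ways, which is one concrete sense in which such data "do not provide sufficient statistical support" without further assumptions.

```python
# Illustrative sketch only: synthetic data, arbitrary dimensions.
import numpy as np

rng = np.random.default_rng(0)

n_cases, n_variables = 50, 2000               # p >> n, as is typical for microarray data
X = rng.normal(size=(n_cases, n_variables))   # simulated expression levels
y = rng.normal(size=n_cases)                  # simulated outcome per case

# rank(X) is at most n_cases, so the linear system y = X @ w is
# underdetermined: many coefficient vectors fit the observed cases
# exactly, leaving no basis for choosing among them without extra
# assumptions (regularization, feature selection, priors, ...).
rank = np.linalg.matrix_rank(X)
w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

print(f"rank(X) = {rank} <= n_cases = {n_cases} << n_variables = {n_variables}")
print(f"training residual of one exact-fit solution: {np.linalg.norm(X @ w - y):.2e}")
```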
