Superposition‐Based Abstractions for Quantum Data Encoding Verification
ABSTRACT Many quantum algorithms operate on classical data, which must first be encoded into the quantum domain by quantum data encoding circuits. To be effective on large data sets, the encoding circuits must themselves scale, but as data-set size increases they quickly become large, complex, and error prone. Errors in an encoding circuit feed incorrect inputs to the downstream quantum algorithm, rendering it ineffective. To address this problem, a formal method is proposed for the verification of encoding circuits. The key idea for achieving scalability is the use of abstractions that reduce the verification problem to bit-vector space. The major outcome of this work is that, using this approach, the authors were able to verify encoding circuits with up to 8191 qubits with very low memory (85 MB) and time (0.29 s), demonstrating that the proposed approach can readily be employed to verify even much larger encoding circuits. The results are significant because traditional verification approaches that model quantum circuits in Hilbert space have demonstrated scalability only up to about 250 qubits. This is also the first approach to tackle the verification of quantum encoding circuits.
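The abstract does not spell out the abstraction itself; the following hypothetical Python sketch only illustrates the general idea of working in bit-vector space. For a basis-encoding circuit built from X gates, the effect on |0...0⟩ is a single computational basis state, so correctness can be checked with integer (bit-vector) comparisons instead of 2^n-dimensional state vectors. All names here are illustrative, not the paper's.

```python
def encode(x_gate_positions, n_qubits):
    """Bit-vector abstraction of a basis-encoding circuit: each X gate
    flips one bit of the all-zeros initial state."""
    state = 0
    for q in x_gate_positions:
        state ^= 1 << q          # X gate on qubit q flips bit q
    return state

def verify(x_gate_positions, n_qubits, expected_data):
    """Check that the circuit prepares the intended classical value,
    without ever building a 2**n_qubits amplitude vector."""
    return encode(x_gate_positions, n_qubits) == expected_data

# Encoding the value 0b1011 on 4 qubits with X gates on qubits 0, 1, 3.
print(verify([0, 1, 3], 4, 0b1011))  # True
```

Because the state is just an n-bit integer, the check costs O(gates) time and O(n) memory, which is consistent with the very low resource figures reported.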
- Research Article
313
- 10.1074/mcp.m900317-mcp200
- Nov 1, 2009
- Molecular & cellular proteomics : MCP
Comprehensive characterization of a proteome is a fundamental goal in proteomics. To achieve saturation coverage of a proteome or a specific subproteome via tandem mass spectrometric identification of tryptic protein sample digests, proteomics data sets are growing dramatically in size and heterogeneity. The trend toward very large integrated data sets poses thus-far-unsolved challenges in controlling the uncertainty of protein identifications, going beyond well-established confidence measures for peptide-spectrum matches. We present MAYU, a novel strategy that reliably estimates false discovery rates for protein identifications in large-scale data sets. We validated and applied MAYU using various large proteomics data sets. The data show that the size of the data set has an important and previously underestimated impact on the reliability of protein identifications. In particular, we found that protein false discovery rates are significantly elevated compared with those of peptide-spectrum matches. The function provided by MAYU is critical for controlling the quality of proteome data repositories and thereby for enhancing any study relying on these data sources. The MAYU software is available as standalone software and is also integrated into the Trans-Proteomic Pipeline.
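A generic target-decoy sketch (not MAYU's actual protein-level model, whose details the abstract does not give) shows why the two FDR levels can diverge: decoy hits estimate the false positives among target hits, and identification errors concentrate disproportionately at the protein level.

```python
def target_decoy_fdr(n_target_hits, n_decoy_hits):
    """Estimated FDR = expected false targets / accepted targets
    (simple target-decoy estimate; counts below are illustrative)."""
    if n_target_hits == 0:
        return 0.0
    return n_decoy_hits / n_target_hits

# PSM level: 100000 target PSMs, 1000 decoy PSMs -> 1% PSM-level FDR.
print(target_decoy_fdr(100_000, 1_000))  # 0.01

# Protein level: the same data collapse to far fewer proteins, and
# errors pile up on single-hit proteins, so the protein-level FDR is
# typically much higher, e.g. 900 decoys among 10000 targets -> 9%.
print(target_decoy_fdr(10_000, 900))     # 0.09
```

The numbers are invented for illustration; the qualitative point matches the abstract's finding that protein FDRs are significantly elevated relative to peptide-spectrum matches.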
- Conference Article
14
- 10.1109/icdm.2001.989590
- Nov 29, 2001
In developed countries, especially over the last decade, there has been explosive growth in the capability to generate, collect, and use very large data sets. The objects of these data sets may be described simultaneously by quantitative and qualitative attributes. At present, algorithms have been developed that can process either very large data sets (in metric spaces) or mixed (qualitative and quantitative) incomplete data sets (with missing values), but not very large mixed incomplete data sets. In this paper we introduce a new clustering method, named GLC+, to process very large mixed incomplete data sets in order to obtain a partition into connected sets.
- Research Article
64
- 10.1016/j.radonc.2011.12.006
- Jan 20, 2012
- Radiotherapy and Oncology
Multivariate modeling of complications with data driven variable selection: Guarding against overfitting and effects of data set size
- Research Article
972
- 10.1111/j.1467-9868.2007.00633.x
- Jan 4, 2008
- Journal of the Royal Statistical Society Series B: Statistical Methodology
Summary: Spatial statistics for very large spatial data sets is challenging. The size of the data set, n, causes problems in computing optimal spatial predictors such as kriging, since its computational cost is of order n³. In addition, a large data set is often defined on a large spatial domain, so the spatial process of interest typically exhibits non-stationary behaviour over that domain. A flexible family of non-stationary covariance functions is defined by using a set of basis functions that is fixed in number, which leads to a spatial prediction method that we call fixed rank kriging. Specifically, fixed rank kriging is kriging within this class of non-stationary covariance functions. It relies on computational simplifications when n is very large, for obtaining the spatial best linear unbiased predictor and its mean-squared prediction error for a hidden spatial process. A method based on minimizing a weighted Frobenius norm yields best estimators of the covariance function parameters, which are then substituted into the fixed rank kriging equations. The new methodology is applied to a very large data set of total column ozone data, observed over the entire globe, where n is of the order of hundreds of thousands.
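A minimal numerical sketch of the computational idea (sizes and symbols here are illustrative, not the paper's notation): with a fixed number r of basis functions, the covariance matrix is low rank plus noise, so the n × n kriging solve can be done through the Woodbury identity with only an r × r system, roughly O(n r²) instead of O(n³).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 500, 20               # n observations, r fixed basis functions
S = rng.normal(size=(n, r))  # basis-function matrix (illustrative)
K = np.eye(r)                # r x r covariance of basis coefficients
s2 = 0.5                     # measurement-error variance
y = rng.normal(size=n)

# Direct O(n^3) solve of (S K S^T + s2 I) x = y.
Sigma = S @ K @ S.T + s2 * np.eye(n)
direct = np.linalg.solve(Sigma, y)

# Woodbury identity: only an r x r system is ever factorized.
M = np.linalg.inv(K) + S.T @ S / s2
woodbury = y / s2 - S @ np.linalg.solve(M, S.T @ y) / s2**2

print(np.allclose(direct, woodbury))  # True
```

Since r stays fixed as n grows, this is what makes prediction feasible when n reaches hundreds of thousands, as in the ozone application.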
- Research Article
13
- 10.1109/tgrs.2021.3110056
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
Quantum algorithms are designed to process quantum data (quantum bits) in a gate-based quantum computer. It has been rigorously proven that they reveal quantum advantages over conventional algorithms when their inputs are certain quantum data or some classical data mapped to quantum data. However, in practical domains, data are classical in nature and very big in dimension, size, and so on. Hence, mapping (embedding) classical data to quantum data is a challenge, and no quantum advantages of quantum algorithms have yet been demonstrated over conventional ones when processing mapped classical data in a gate-based quantum computer. For the practical domain of earth observation (EO), due to the different sensors on remote-sensing platforms, we can map some types of EO data directly to quantum data. In particular, we have polarimetric synthetic aperture radar (PolSAR) images characterized by polarized beams. A polarized state of the polarized beam and a quantum bit are Doppelgangers of a physical state. We map them to each other, and we name this direct mapping a natural embedding (otherwise, an artificial embedding). Furthermore, we process our naturally embedded data in a gate-based quantum computer by using a quantum algorithm, regardless of its quantum advantages over conventional techniques; namely, we use a QML network as a quantum algorithm to prove that we naturally embedded our data in the input qubits of a gate-based quantum computer. Therefore, we employed and directly processed PolSAR images in a QML network. Furthermore, we designed a QML network with an additional neural-network layer, namely a hybrid quantum-classical network, and demonstrate how to program (via optimization and backpropagation) this hybrid quantum-classical network when employing and processing PolSAR images. In this work, we used a gate-based quantum computer offered by IBM Quantum and a classical simulator for a gate-based quantum computer.
Our contribution is that we provided very specific EO data with a natural embedding feature, the Doppelganger of quantum bits, and processed them in a hybrid quantum-classical network. More importantly, in the future, these PolSAR data can be processed by future quantum algorithms and future quantum computing platforms to obtain (or demonstrate) quantum advantages over conventional techniques for EO problems.
- Research Article
23
- 10.1007/s11634-020-00412-9
- Aug 3, 2020
- Advances in Data Analysis and Classification
Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique for identifying extreme observations on the periphery of the data cloud, in both classical multivariate data and functional data. However, two questions remain open in this field: the use of ADA for outlier detection, and its scalability. We propose using robust functional archetypoids and the adjusted boxplot to pinpoint functional outliers. Furthermore, we present a new archetypoid algorithm for obtaining results from large data sets in reasonable time. Functional time series occur in many practical problems, so this paper focuses on functional data settings. The new algorithm for detecting functional anomalies, called CRO-FADALARA, can be used with both univariate and multivariate curves. Our proposal for outlier detection is compared with the state-of-the-art methods in a controlled study, showing good performance. Furthermore, CRO-FADALARA is applied to two large time-series data sets, where the outlier curves are discussed and the reduction in computational time is clearly stated. A third case study, with a small ECG data set, is discussed given its importance in functional data scenarios. All data, R code, and a new R package are freely available.
- Research Article
233
- 10.1093/molbev/msn104
- May 2, 2008
- Molecular Biology and Evolution
The ongoing generation of prodigious amounts of genomic sequence data from myriad vertebrates is providing unparalleled opportunities for establishing definitive phylogenetic relationships among species. The size and complexities of such comparative sequence data sets not only allow smaller and more difficult branches to be resolved but also present unique challenges, including large computational requirements and the negative consequences of systematic biases. To explore these issues and to clarify the phylogenetic relationships among mammals, we have analyzed a large data set of over 60 megabase pairs (Mb) of high-quality genomic sequence, which we generated from 41 mammals and 3 other vertebrates. All sequences are orthologous to a 1.9-Mb region of the human genome that encompasses the cystic fibrosis transmembrane conductance regulator gene (CFTR). To understand the characteristics and challenges associated with phylogenetic analyses of such a large data set, we partitioned the sequence data in several ways and utilized maximum likelihood, maximum parsimony, and Neighbor-Joining algorithms, implemented in parallel on Linux clusters. These studies yielded well-supported phylogenetic trees, largely confirming other recent molecular phylogenetic analyses. Our results provide support for rooting the placental mammal tree between Atlantogenata (Xenarthra and Afrotheria) and Boreoeutheria (Euarchontoglires and Laurasiatheria), illustrate the difficulty in resolving some branches even with large amounts of data (e.g., in the case of Laurasiatheria), and demonstrate the valuable role that very large comparative sequence data sets can play in refining our understanding of the evolutionary relationships of vertebrates.
- Book Chapter
- 10.1007/978-3-642-18993-7_22
- Jan 1, 2003
We discuss some features of an experimental system for visualization of large (medical) volume data sets. Input voxel data sets are subdivided into blocks first. Then each block is decomposed into a multiresolution data representation by applying a reversible 3D integer Haar wavelet transform (S-transform). The resulting transform coefficients are encoded using a Golomb-Rice algorithm. For volume visualization, we selectively load and decompress blocks to a proper resolution. Then they are rendered into a common view. From our experiments, we learned that large volume data sets should preferably be stored using a multiresolution data representation. Depending on the size of the volume data set and the rendering mode, we also found that a block-based data representation can provide some advantages, but it may not always be the best choice.
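The chapter names a reversible 3D integer Haar wavelet transform (S-transform); the one-dimensional pair step below sketches why it supports exact, lossless multiresolution storage of voxel data (the 3D version applies this step along each axis; function names here are illustrative).

```python
def s_transform_pair(a, b):
    # Forward reversible integer Haar (S-transform) on a voxel pair:
    # an integer average (low band) and a difference (high band).
    low = (a + b) // 2
    high = a - b
    return low, high

def inverse_s_transform_pair(low, high):
    # Exact integer reconstruction: no rounding information is lost.
    a = low + (high + 1) // 2
    b = a - high
    return a, b

# Round trip on sample voxel values, including negatives.
for a, b in [(7, 3), (2, 9), (-5, 4), (0, 0)]:
    assert inverse_s_transform_pair(*s_transform_pair(a, b)) == (a, b)
print("lossless")
```

The low band at each level is itself a half-resolution volume, which is what lets the system load and decompress blocks only to the resolution a given view requires.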
- Discussion
5
- 10.1148/radiol.2020201366
- May 12, 2020
- Radiology
Deep Learning and Lung Cancer: AI to Extract Information Hidden in Routine CT Scans.
- Conference Article
3
- 10.1109/bigdataservice.2015.24
- Mar 1, 2015
Collections of time-series data appear in a wide variety of contexts. To gain insight into the underlying phenomenon (that the data represents), one must analyze the time-series data. Analysis can quickly become challenging for very large data-sets (~terabytes or more), and it may be infeasible to scan the entire data-set on each query due to time limits or resource constraints. To avoid this problem, one might pre-compute partial results by scanning the data-set (usually as the data arrives). However, for complex queries, where the value of a new data record depends on all of the data previously seen, this might be infeasible because incorporating a large amount of historical data into a query requires a large amount of storage. We present an approach to performing complex queries over very large data-sets in a manner that is (i) practical, meaning that a query does not require a scan of the entire data-set, and (ii) fixed-cost, meaning that the amount of storage required depends only on the time-range spanned by the entire data-set (and not on the size of the data-set itself). We evaluate our approach with three different data-sets: (i) a 4-year commercial analytics data-set from a production content-delivery platform with over 15 million mobile users, (ii) an 18-year data-set from the Linux-kernel commit-history, and (iii) an 8-day data-set from Common Crawl HTTP logs. Our evaluation demonstrates the feasibility and practicality of our approach for a diverse set of complex queries on a diverse set of very large data-sets.
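A hypothetical sketch of the fixed-cost idea (the bucket size and all names are illustrative, not the paper's design): maintain one partial aggregate per time bucket as records stream in, so a later query over any time range touches only the buckets it spans, never the raw records, and storage grows with the time span rather than the record count.

```python
from collections import defaultdict

BUCKET = 3600  # one partial aggregate per hour (illustrative choice)

partials = defaultdict(lambda: [0, 0.0])  # bucket -> [count, total]

def ingest(timestamp, value):
    # Fold each arriving record into its bucket's partial aggregate.
    b = timestamp // BUCKET
    partials[b][0] += 1
    partials[b][1] += value

def query_mean(start, end):
    # Cost depends on the number of buckets in [start, end),
    # not on how many raw records were ingested.
    count, total = 0, 0.0
    for b in range(start // BUCKET, end // BUCKET):
        c, t = partials[b]
        count += c
        total += t
    return total / count if count else None

for ts, v in [(10, 4.0), (20, 6.0), (3700, 10.0)]:
    ingest(ts, v)
print(query_mean(0, 3600))  # 5.0 (mean over the first hour only)
```

Mean is decomposable, so per-bucket partials suffice; the paper's contribution addresses the harder case of complex queries whose value depends on all previously seen data.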
- Book Chapter
5
- 10.1007/978-3-319-72926-8_23
- Dec 21, 2017
Pre-processing large-scale data sets to ensure data quality is a very important task in data mining. One serious threat to data quality is data missing from field experiments, which negatively affects quality. Missing data usually have significant effects in many real-life pattern classification scenarios, especially when they lead to biased parameter estimates, and may even disqualify the data for analysis purposes. The process of filling in missing data based on the other valid values of the remaining variables of a data set is known as imputation. In this paper, we present a new data-driven machine learning approach for imputing missing data. Even though machine learning methods have been used to impute missing data in the literature, it is difficult to decide on a single method to apply to a given data set, because imputation is considered not a science but an art of choosing the method with the least biased values. For this reason, we compare different machine learning methods, such as decision trees (C4.5), Bayesian networks, clustering algorithms, and artificial neural networks. The comparison indicates that, for predicting categorical and numerical missing information in large survey data sets, the clustering method is the most efficient of the methods found in the literature. A hybrid method is introduced that combines unsupervised learning methods with supervised ones, based on the missing ratio, to achieve imputation with higher accuracy. Additionally, some statistical imputation methods, such as Mean/Mode and Hot-Deck, have been applied, emphasizing their limitations on large-scale data sets in comparison to the machine learning methods.
A comparison of all the above-mentioned methods, traditional statistical and machine learning alike, has been made, and conclusions are drawn for achieving higher-accuracy imputation in large-scale survey data sets. Another objective of these experiments is to discover the effect of balancing the training data set on classifier performance. All methods are tested on a real-world data set, a population and housing census.
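For concreteness, here is a minimal sketch of the Mean/Mode baseline the chapter compares the machine learning methods against (the function name and column layout are illustrative): numeric columns are filled with the mean of the observed values, categorical columns with the most frequent value.

```python
from statistics import mean, mode

def impute_column(values, numeric):
    """Baseline Mean/Mode imputation: replace None entries with the
    mean (numeric column) or mode (categorical column) of the
    observed values. This is the statistical baseline, not the
    chapter's hybrid clustering-based method."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

print(impute_column([1.0, None, 3.0], numeric=True))        # [1.0, 2.0, 3.0]
print(impute_column(["a", "b", None, "a"], numeric=False))  # fills with "a"
```

The limitation the chapter points to is visible even here: every missing entry in a column gets the same fill value, ignoring relationships between variables that the clustering-based and hybrid approaches exploit.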
- Conference Article
6
- 10.1145/3093338.3106387
- Jul 9, 2017
Macroscopes for Making Sense of Science
- Research Article
7
- 10.1002/env.2780
- Dec 2, 2022
- Environmetrics
There has been a great deal of recent interest in the development of spatial prediction algorithms for very large data sets and/or prediction domains. These methods have primarily been developed in the spatial statistics community, but there has been growing interest in such methods in the machine learning community, primarily driven by the success of deep Gaussian process regression approaches and deep convolutional neural networks. These methods are often computationally expensive to train and implement, and consequently there has been a resurgence of interest in random projections and deep learning models based on random weights, the so-called reservoir computing methods. Here, we combine several of these ideas to develop the random ensemble deep spatial (REDS) approach to predict spatial data. The procedure uses random Fourier features as inputs to an extreme learning machine (a deep neural model with random weights), and with calibrated ensembles of outputs from this model based on different random weights, it provides simple uncertainty quantification. The REDS method is demonstrated on simulated data and on a classic large satellite data set.
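A minimal sketch of the two named ingredients, random Fourier features feeding an extreme-learning-machine-style fit (the frequencies, sizes, and ridge penalty below are illustrative assumptions, not REDS itself): inputs are projected through random cosine features, and only the linear output weights are trained.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, D = 200, 2, 100                    # samples, input dim, feature count
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)  # toy spatial signal

W = rng.normal(scale=2.0, size=(d, D))   # random frequencies (untrained)
b = rng.uniform(0, 2 * np.pi, size=D)    # random phases (untrained)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Ridge regression on the random features: only these output weights
# are fitted, as in an extreme learning machine.
w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(D), Phi.T @ y)
mse = float(np.mean((Phi @ w - y) ** 2))
print(mse < 0.1)
```

REDS additionally repeats this with many independent random-weight draws and calibrates the ensemble of outputs to quantify uncertainty; a single draw, as here, gives only a point prediction.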
- Research Article
- 10.47839/ijc.11.3.565
- Aug 1, 2014
- International Journal of Computing
The advances in and use of technology in all walks of life result in tremendous growth of the data available for data mining. The large amount of available knowledge can be utilized to improve the decision-making process. The data contain noise or outliers to some extent, which hampers the classification performance of a classifier built on that training data. The learning process on a large data set becomes very slow, as it must be done serially on the available large data sets. It has been shown that random data reduction techniques can be used to build optimal decision trees. Thus, we can integrate data cleaning and data sampling techniques to overcome the problems of handling large data sets. In the proposed technique, outlier data are first filtered out to obtain clean data of improved quality, and a random sampling technique is then applied to this clean data set to obtain a reduced data set. This reduced data set is used to construct an optimal decision tree. Experiments performed on several data sets show that the proposed technique builds decision trees with enhanced classification accuracy compared to classification performance on the complete data set. The classification filter improves data quality, and sampling reduces the size of the data set. Thus, the proposed method constructs more accurate, optimally sized decision trees and avoids problems such as overloading memory and the processor with large data sets. In addition, the time required to build a model on clean data is significantly reduced, providing significant speedup.
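A hypothetical sketch of the two-stage pipeline the abstract describes (the filter criterion, helper names, and toy data are assumptions for illustration): a classification filter drops instances a preliminary model mislabels, treating them as noise, and random sampling then shrinks the clean set before the final decision tree is trained.

```python
import random

def classification_filter(data, preliminary_model):
    # Keep only instances the preliminary model labels correctly;
    # disagreements are treated as noise/outliers.
    return [(x, y) for x, y in data if preliminary_model(x) == y]

def reduce_data(data, preliminary_model, sample_frac, seed=0):
    # Stage 1: clean.  Stage 2: random sample of the clean data,
    # which is what the final decision tree would be trained on.
    clean = classification_filter(data, preliminary_model)
    k = max(1, int(len(clean) * sample_frac))
    random.seed(seed)
    return random.sample(clean, k)

# Toy data: true label is x > 0; the point (3, 0) is mislabeled noise.
data = [(-2, 0), (-1, 0), (1, 1), (2, 1), (3, 0)]
model = lambda x: int(x > 0)
reduced = reduce_data(data, model, sample_frac=0.5)
print(len(reduced))  # 2 (half of the 4 clean instances)
```

The filtered-and-sampled set is both cleaner and smaller, which is the source of the accuracy and speed gains the abstract reports.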
- Book Chapter
1
- 10.1007/978-4-431-65930-3_6
- Jan 1, 2000
Phylogeny is central to the understanding of biodiversity and evolutionary processes. However, elucidating phylogenetic relationships in many groups has remained problematic due to their sheer size. The feasibility of phylogenetic analyses of large data sets has been questioned on both theoretical and empirical grounds. Some have suggested that large data sets be broken into a series of smaller problems for phylogenetic analysis. However, recent empirical studies and critical developments in methods of data analysis indicate that large data sets are tractable. We have learned a great deal about the analysis of large data sets via the angiosperms, for which three large molecular data sets have been constructed (plastid atpB and rbcL and nuclear 18S rDNA). We discuss three approaches successfully applied in our analyses of these large data sets. Parsimony analyses of separate and combined data sets representing hundreds of taxa indicate that "bigger is better." That is, both empirical and simulation studies demonstrate that two solutions to the dilemmas posed by large data sets are the addition of taxa and the addition of characters. Recent developments in software also greatly facilitate the parsimony analysis of large data sets. Applications such as NONA and the RATCHET can retrieve shorter trees than those found by PAUP, and in much shorter run times. The recent development of "quick search" methods such as the fast bootstrap and fast jackknife is also of great utility in the analysis of large data sets. These methods are rapid and emphasize only those clades with strong support. All three of these approaches have recently been applied to a 567-taxon data set for angiosperms based on atpB, rbcL, and 18S rDNA sequences (a total of 4733 bp/taxon). Analyses of the combined three-gene data set have yielded the best-resolved and best-supported topology to date for angiosperms, with virtually all major clades, as well as the spine of the tree, well supported.
These developments indicate that the phylogenetic analysis of large data sets is not only feasible, but relatively straightforward.