Advances in data compression and pattern recognition

Luis Rueda

doi:10.22215/etd/2002-05091

Abstract

In this thesis, we present our contributions to two important areas of computer science, namely data compression and pattern recognition. In the area of data compression, we introduce an enhanced version of static Fano coding, based on the concept of the rearrangement of two lists. The algorithm is formally analyzed and rigorously tested empirically, and its superiority over similar algorithms is demonstrated. We introduce a greedy algorithm for adaptive Fano coding, and we formalize the partitioning algorithms for a multi-symbol input and a binary output. The empirical results on well-known benchmarks demonstrate the advantageous properties of our greedy algorithm. This greedy algorithm has also been extended to a multi-symbol output alphabet. We also present a more efficient algorithm that uses a new tree structure derived from the binary search tree, namely the Fano binary search tree. This new scheme, the corresponding tree-based operators, and the conditional shifting heuristic used to consistently maintain the Fano tree have been theoretically analyzed and tested on well-known benchmark files.

In the area of pattern recognition, we present the formal theory for optimal pairwise linear classifiers, and a formal analysis of why heuristic functions work. With regard to the theory of optimal pairwise linear classifiers, we derive the necessary and sufficient conditions for the classifier to be both optimal and a pair of straight lines in the two-dimensional case, when the underlying distributions are normal. This, in particular, resolves Minsky's paradox, which has been open since 1957. The corresponding classifiers have been empirically analyzed on synthetic and real-life data, showing their superiority over the traditional Fisher's approach. These results have also been extended to the case of multi-dimensional features.
We also present a formal analysis that relates the accuracy of heuristic functions to the optimality of the solutions that heuristic algorithms yield. We prove that, given a heuristic algorithm that can use either of two heuristic functions, the more accurate heuristic has a higher probability of yielding a superior solution. This result has also been tested on the database query optimization problem.
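For readers unfamiliar with the static Fano coding that the data compression contributions build on, the following is a minimal sketch of the textbook method only: sort the symbols by decreasing frequency, split the list into two parts of nearly equal total weight, assign 0 to one part and 1 to the other, and recurse. It is not the enhanced, adaptive, or tree-based algorithms of the thesis. The function name `fano_code`, the tie-breaking rule, and the sample frequencies are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def fano_code(freqs: Dict[str, int]) -> Dict[str, str]:
    """Textbook static Fano coding (illustrative sketch, not the thesis's algorithm)."""
    # Sort by decreasing frequency; ties broken by symbol name (an assumption).
    symbols: List[Tuple[str, int]] = sorted(
        freqs.items(), key=lambda kv: (-kv[1], kv[0])
    )
    codes: Dict[str, str] = {s: "" for s, _ in symbols}

    # A single-symbol alphabet still needs one bit per symbol.
    if len(symbols) == 1:
        codes[symbols[0][0]] = "0"
        return codes

    def split(items: List[Tuple[str, int]]) -> None:
        if len(items) < 2:
            return
        total = sum(w for _, w in items)
        running = 0
        best_cut, best_diff = 1, None
        # Find the cut that makes the two parts as close in weight as possible.
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        left, right = items[:best_cut], items[best_cut:]
        for s, _ in left:
            codes[s] += "0"
        for s, _ in right:
            codes[s] += "1"
        split(left)
        split(right)

    split(symbols)
    return codes


# Hypothetical symbol counts chosen for illustration.
example = fano_code({"a": 15, "b": 7, "c": 6, "d": 6, "e": 5})
```

With these sample counts, the sketch assigns two-bit codewords to `a`, `b`, and `c` and three-bit codewords to `d` and `e`. The resulting code is prefix-free, but Fano codes in general are not guaranteed to reach the minimum average length that Huffman coding achieves.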
