N-gram Matching Research Articles

String searching in documents has become a tedious task with the evolution of Big Data. The generation of large data sets demands a high-performance search algorithm in areas such as text mining and information retrieval. GPUs have become increasingly popular for general-purpose computing in a variety of applications, so it is of great interest to exploit the threading capability of a GPU to provide a high-performance search algorithm. This paper proposes an optimized new approach to the N-gram model for string search across a number of lengthy documents, together with its GPU implementation. The algorithm exploits GPGPUs to search for strings in many documents, employing character-level N-gram matching with a parallel Score Table approach and search using the CUDA API. The Score Table, which stores the frequencies of the N-grams in a document, makes the search independent of the document’s length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive threading available on a GPU is exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search across a huge number of documents, speeding up the whole search process even for a large pattern size. Experiments were carried out on many documents of varied length, with search strings taken from the standard Lorem Ipsum text, on NVIDIA’s GeForce GT 540M GPU with 96 cores. The results show that the parallel approach to Score Table creation and searching gives a good speedup over the same approach executed serially.
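The Score Table idea in the abstract above can be illustrated with a minimal serial sketch in Python. It is not the paper's CUDA implementation: the function names (`char_ngrams`, `build_score_table`, `score`) and the scoring rule (the fraction of the pattern's trigrams found in a document's table) are assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character-level n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_score_table(document, n=3):
    """Map each n-gram in the document to its frequency (the pre-processing step)."""
    return Counter(char_ngrams(document, n))


def score(pattern, table, n=3):
    """Fraction of the pattern's n-grams present in the table.

    Assumed scoring rule; each lookup is a hash access, so the cost
    depends on the pattern length, not on the document length.
    """
    grams = char_ngrams(pattern, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if table.get(g, 0) > 0)
    return hits / len(grams)


# Hypothetical documents; one table is built per document.
documents = {
    "doc_a": "lorem ipsum dolor sit amet",
    "doc_b": "consectetur adipiscing elit",
}
tables = {name: build_score_table(text) for name, text in documents.items()}

# Rank documents by how well they match the search string.
ranked = sorted(
    tables,
    key=lambda name: score("ipsum dolor", tables[name]),
    reverse=True,
)
```

Each table is built independently of the others, and each document is scored independently, which is the kind of per-document work that could be assigned to separate GPU threads.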

The rapid growth of information on the World Wide Web (WWW) has made it necessary to make this information available not only to people but also to machines. Ontologies and tokens are widely used to add semantics to data processing or information processing. A concept is formal in that the meaning of its specification is encoded in a logic-based language, explicit in that its concepts and properties are machine readable, and a conceptualization in that it models how people think about things in a particular subject area. More and more ontologies are being developed on a variety of topics, which results in increased heterogeneity of entities among the ontologies. Concept integration has become vital over the last decade as a tool to minimize heterogeneity and empower data processing. There are various techniques for integrating concepts from different input sources, based on semantic or syntactic match values. In this paper, an approach is proposed to integrate concepts (ontologies or tokens) using edit distance or n-gram match values between pairs of concepts, with concept frequency used to dominate the integration process. The performance of the proposed techniques is compared with semantic-similarity-based integration techniques on quality parameters such as Recall, Precision, F-Measure, and integration efficiency over different sizes of concepts. The analysis indicates that edit-distance-based integration outperformed n-gram integration and semantic similarity techniques.
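The two syntactic match values named in the abstract above can be sketched in Python as follows. This is a minimal illustration, not the paper's method: the abstract does not give the exact formulas, so the normalization of edit distance by the longer string and the use of the Dice coefficient over sets of character bigrams are assumptions, as are the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def ngram_similarity(a, b, n=2):
    """Dice coefficient over sets of character n-grams (assumed formula)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

A pair of concept labels whose match value exceeds a chosen threshold would be a candidate for integration; the threshold and the frequency-based choice of which concept prevails are left out because the abstract does not specify them.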

Articles published on N-gram Matching

Android malware dataset construction methodology to minimize bias–variance tradeoff

Gazetteer based unsupervised learning approach for location extraction from complaint tweets

A novel alignment-free DNA sequence similarity analysis approach based on top-k n-gram match-up

The Date of Alphonsus, Emperor of Germany: The Evidence of Unique N-Gram Matches

Wide-scope biomedical named entity recognition and normalization with CRFs, fuzzy matching and character level modeling.

GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

Automated Scoring of Students’ English-to-Chinese Translations of Three Text Types

Text-mining based localisation of player-specific events from a game-log of cricket

Keyphrase based Evaluation of Automatic Text Summarization

Concept Integration using Edit Distance and N-Gram Match

Modeling the scholars: Detecting intertextuality through enhanced word-level n-gram matching

Expected dependency pair match: predicting translation quality with expected syntactic structure

Comparison of Stemming and N-gram Matching for Term Conflation in Arabic Text

Character contiguity in N-gram-based word matching: the case for Arabic text searching

Applying query structuring in cross-language retrieval

A Comparison of Spelling-Correction Methods for the Identification of Word Forms in Historical Text Databases
