Levenshtein Edit Distance Research Articles

This paper applies several machine learning techniques to an anonymized policing dataset used in the EU SPIRIT Horizon 2020 project, with the aim of identifying fraudulent identities and helping Law Enforcement Agencies (LEAs) with identity resolution and the search for potential criminals. Research on fraudulent criminal identities is limited, largely because good-quality data and an appropriate methodology are lacking. The data is also highly sensitive, and even a minor inaccuracy in prediction can have a serious impact on society: innocent people could be questioned while criminals go free. This paper addresses both issues by working with 39 million records from the policing dataset and by prioritizing accuracy when building the model. Several machine learning approaches are trained on the dataset, and the research focuses on predicting the 5 suspected fraudulent identities among the 39 million records. One of the techniques applied is TensorFlow together with a Keras model, which has seldom been used by researchers for detection in criminal data. To compare results and test the accuracy of the TensorFlow model, three other machine learning techniques (Support Vector Machine, Naïve Bayes and K-nearest Neighbours) are applied to the same task. The goal of this research is to find fraudulent IDs among all the anonymized IDs in the criminal dataset using TensorFlow and the three other models, and to select the best-performing model. Because the model compares two names, the string-matching techniques Levenshtein edit distance, Hamming distance, Jaro-Winkler and Soundex were evaluated first to select an effective approach before building the model and analysing the results. The TensorFlow model demonstrated the highest accuracy with the lowest execution time of the models compared, and was the only model to successfully predict all 5 suspects in the policing dataset.
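
As an illustration of two of the string-matching techniques named in the abstract, the following is a minimal Python sketch of the Levenshtein edit distance and the Hamming distance applied to a pair of names. It is not the paper's implementation; the function names and the sample names are hypothetical and chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row is small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Hypothetical name pair, not taken from the policing dataset.
    print(levenshtein("Jon Smith", "John Smyth"))  # 2
    print(hamming("karolin", "kathrin"))           # 3
```

Hamming distance is only defined for strings of equal length, which is one reason an edit distance that allows insertions and deletions is often a better fit for comparing names.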

Problems that involve classifying sequences of symbols over some alphabet arise frequently in fields such as bioinformatics and natural language processing. Deep learning methods, in particular models based on recurrent neural networks, have in the last few years established themselves as the most effective way to solve such problems. However, existing approaches have a serious drawback: the results are hard to interpret. It is extremely difficult to determine which properties of the input sequence are responsible for its membership in a given class. Simplifying such models to improve their interpretability, in turn, reduces classification quality. These drawbacks limit the use of modern machine learning methods in many application domains. In this work we present a fundamentally new, interpretable neural network architecture based on searching for a set of short subsequences, called motifs, whose presence influences whether a sequence belongs to a particular class. The key component of the proposed solution is a differentiable alignment algorithm that we developed, which is a differentiable analogue of classical string comparison methods such as the Levenshtein edit distance and the Smith–Waterman algorithm. Unlike previous work on motif-based sequence classification, the new method makes it possible not only to search in an arbitrary part of the string but also to account for possible insertions.
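
The abstract does not describe the authors' algorithm, so the following is not their method. It is a generic Python sketch of one common way to make the Levenshtein recurrence smooth: replacing the hard `min` with a soft minimum controlled by an assumed temperature parameter `gamma`. The names `softmin` and `soft_edit_distance` are hypothetical.

```python
import math


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))).
    Shifted by the true minimum for numerical stability."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_edit_distance(a: str, b: str, gamma: float = 0.1) -> float:
    """Levenshtein dynamic program with `min` replaced by `softmin`.
    The result never exceeds the hard edit distance and approaches it
    as gamma goes to 0."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = float(i)  # delete i characters
    for j in range(1, cols):
        d[0][j] = float(j)  # insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            # Assumption: a 0/1 substitution cost. In a neural network this
            # would typically be a learned, continuous cost instead.
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = softmin(
                [d[i - 1][j] + 1.0, d[i][j - 1] + 1.0, d[i - 1][j - 1] + cost],
                gamma,
            )
    return d[-1][-1]


if __name__ == "__main__":
    # With a small gamma the value is close to the hard distance of 3.
    print(soft_edit_distance("kitten", "sitting", gamma=0.01))
```

Because every operation in the recurrence is smooth, gradients can flow through the table when the per-character costs are produced by trainable parameters, which is the general property a differentiable alignment relies on.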

Articles published on Levenshtein Edit Distance

Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

Comparisons of machine learning techniques for detecting fraudulent criminal identities

Using optimized clustering to identify students' science learning paths to knowledge integration

Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets.

Disclaimer effect of key audit matters in China: negative press coverage and boilerplate

Hybrid Tamil spell checker with combined character splitting

Analysis of End User Access of Warn-on-Forecast Guidance Products during an Experimental Forecasting Task

Guest Editorial Special Issue: “From Deletion-Correction to Graph Reconstruction: In Memory of Vladimir I. Levenshtein”

Unconstrained online handwritten Uyghur word recognition based on recurrent neural networks and connectionist temporal classification

Analysis and safety engineering of fuzzy string matching algorithms

Optimized SAT encoding of conformance checking artefacts

The bird's-eye view: A data-driven approach to understanding patient journeys from claims data.

Development of Density Functional Tight-Binding Parameters Using Relative Energy Fitting and Particle Swarm Optimization.

Protein function prediction from dynamic protein interaction network using gene expression data.

Automating Error Frequency Analysis via the Phonemic Edit Distance Ratio.

Spatial Analysis of New Testament Textual Emendations Utilizing Confusion Distances

Exploring Edit Distance for Normalising Out-of-Vocabulary Malay Words on Social Media

Sequence classification based on short motifs

Interactive feature selection for efficient customer recognition in contact centers: Dealing with common names

Implementation of Fuzzy Search for Detecting Foreign Words in Microsoft Word Documents
