An Efficient Minimal Text Segmentation Method for URL Domain Names

Yiqian Li, Shouning Qu, Lianjiang Zhu, Tao Du, Antonio J Peña

doi:10.1155/2021/9946729

Abstract

Text segmentation of URL domain names is a straightforward and convenient way to analyze users’ online behavior and is crucial for determining their areas of interest. However, popular word segmentation tools perform relatively poorly on website domain names because of their unique structure (extremely short length, irregular naming, and no contextual relationship). To address this issue, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining. We first design a targeted hierarchical task model to reduce noise interference in minimal texts. We then integrate a conflict game into the two-directional (bidirectional) maximum matching algorithm, so that words with higher weight and greater probability are selected, thereby improving recognition accuracy. Next, Chinese Pinyin and English mapping are embedded in the word segmentation rules. In addition, we incorporate a correction factor that accounts for text length into the F1-score to improve the performance evaluation of text segmentation. The experimental results show that EMTS improves on other word segmentation tools by around 20 percentage points in accuracy and topic extraction, providing high-quality data for subsequent text analysis.
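
To make the matching step concrete, the following is a minimal sketch of bidirectional maximum matching with a weight-based tie-break. It is not the EMTS implementation: the abstract does not specify the conflict game, so the rule used here (prefer the candidate with the higher mean word weight, then the one with fewer tokens) is an assumption standing in for it, and the lexicon, its weights, and all function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> weight (e.g. a relative frequency).
LEXICON: Dict[str, float] = {
    "no": 0.6, "now": 0.8, "here": 0.7, "where": 0.5,
    "news": 0.9, "paper": 0.8, "newspaper": 0.95,
}


def forward_mm(text: str, lexicon: Dict[str, float]) -> List[str]:
    """Greedy left-to-right matching, longest lexicon word first."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_mm(text: str, lexicon: Dict[str, float]) -> List[str]:
    """Greedy right-to-left matching, longest lexicon word first."""
    max_len = max(map(len, lexicon))
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def mean_weight(tokens: List[str], lexicon: Dict[str, float]) -> float:
    """Average lexicon weight; unknown tokens count as zero."""
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)


def segment(text: str, lexicon: Dict[str, float]) -> List[str]:
    """Run both directions and resolve disagreement by weight.

    Assumed stand-in for the paper's conflict game: higher mean
    weight wins, ties go to the segmentation with fewer tokens.
    """
    fwd = forward_mm(text, lexicon)
    bwd = backward_mm(text, lexicon)
    if fwd == bwd:
        return fwd
    return max((fwd, bwd), key=lambda c: (mean_weight(c, lexicon), -len(c)))


if __name__ == "__main__":
    print(segment("nowhere", LEXICON))
    print(segment("newspaper", LEXICON))
```

With this toy lexicon, the forward pass splits "nowhere" into "now" + "here" and the backward pass into "no" + "where"; the weights then decide between the two candidates, which is the kind of conflict the abstract says the method is designed to resolve.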

Highlights

In recent years, the internet has become one of the most important infrastructures in human society, having an increasingly broad and deep impact on people’s economic and social activities [1].
Since URL domain name text has no contextual semantic relationship, word segmentation is the first step in the rapid extraction of web page attribute information and provides fast and accurate data support for analyzing users’ network behaviors [4, 5].
To cope with the above challenge and promote research on the behavioral emotion contained in URLs, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names.

Summary

Introduction

The internet has become one of the most important infrastructures in human society, having an increasingly broad and deep impact on people’s economic and social activities [1]. Since URL domain name text has no contextual semantic relationship, word segmentation is the first step in the rapid extraction of web page attribute information and provides fast and accurate data support for analyzing users’ network behaviors [4, 5]. To cope with the above challenge and promote research on the behavioral emotion contained in URLs, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names. Its primary goal is to parse random website domain names and extract keywords with high accuracy, because quickly and accurately analyzing users’ online behavior preferences first requires extracting the emotional effects contained between samples [10].
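
As an illustration of the input such a method works on, the sketch below isolates the domain labels of a URL before any segmentation takes place. It is an assumed preprocessing step, not one described in this summary: the function name `domain_labels` is hypothetical, and dropping a leading "www" and the final label is a deliberate simplification (multi-part suffixes such as "co.uk" would need a public suffix list).

```python
from typing import List
from urllib.parse import urlsplit


def domain_labels(url: str) -> List[str]:
    """Return the host labels of a URL that are worth segmenting."""
    # urlsplit only fills in the host when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = [part for part in host.split(".") if part]
    # Simplification: a leading 'www' carries no topical information.
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Simplification: treat the last label as the top-level domain.
    if len(labels) > 1:
        labels = labels[:-1]
    return labels


if __name__ == "__main__":
    print(domain_labels("https://www.newspaper.com/path?x=1"))
    print(domain_labels("blog.example.org"))
```

Each returned label is a short, context-free string of the kind the paragraph above describes, and would be passed to the word segmentation step.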



Journal: Scientific Programming
Publication Date: Jul 2, 2021
Citations: 1
License type: CC BY 4.0


