Duplicate Web Pages Research Articles

Varicocele embolization is a growing treatment modality owing to its safety, efficacy, and the quick return to work following the procedure. For many patients, the internet is the dominant source of information. We aimed to assess the quality of information accessible to patients considering treatment. A list of applicable, commonly used search terms was generated. Each term was assessed across the five most-used English language search engines to determine the two most commonly used terms. These two terms were then investigated across each search engine, with the first 25 web pages returned by each engine included for analysis. Duplicate web pages, nontext content such as video or audio, and web pages behind paywalls were excluded. Web pages were analyzed for quality and readability using validated tools including DISCERN score, JAMA Benchmark Criteria, HONcode Certification, Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Gunning-Fog Index. Secondary features including age, rank, author, and publisher were recorded. The most common applicable terms were "Testicular embolization" (378,300 results) and "Varicocele embolization" (375,800 results). The mean DISCERN quality of information provided by websites was "fair"; adherence to JAMA Benchmark Criteria was 13.5%. Flesch-Kincaid readability tests demonstrated an average "9th grade" reading level. Scientific journals showed the highest quality scores, but were the least up to date, with an average web page age of 11.2 years. Web pages produced by "for-profit" organizations were the second most current (average age 2.7 years), but displayed the lowest quality of information scores. While the quality of online information available to patients is "fair," adherence to JAMA Benchmark Criteria is poor. "For-profit" organization websites are far more numerous and significantly more up to date, yet showed significantly lower quality of information scores. Scientific journals were unsurprisingly of higher quality, yet more challenging for the general public to read. These findings call for the production of high-quality and comprehensible content regarding interventional radiology, to which physicians can reliably direct their patients for information.
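The abstract names three readability tools without stating how they are computed. As a minimal sketch, the standard published formulas for the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Gunning-Fog Index can be written as below. The function names and the choice to pass pre-computed counts (words, sentences, syllables, complex words) are assumptions made for illustration; the study does not say how its counts were obtained.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning-Fog Index; complex_words counts words of three or more syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for a short passage, for illustration only.
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
    print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
    print(gunning_fog(words=100, sentences=5, complex_words=10))
```

All three scores depend only on average sentence length and a per-word difficulty ratio, so the difficult part in practice is counting syllables and sentences reliably, which this sketch deliberately leaves out.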

Search engine result pages often return web pages with nearly identical content, much of it produced by websites and blogs reprinting one another. To improve overall retrieval performance and user satisfaction, this paper proposes DWDCS (near-Duplicate Webpages Detection based on Concept and Semantic network), developed in the course of building a near-duplicate detection system for a multi-billion-page repository. The paper makes two research contributions. First, it improves the classic small-world-based keyword extraction algorithm by extracting and merging key concepts, rather than keyphrases, to build the Small World Network (SWN); this resolves the "expression difference" problem and reduces the complexity of the semantic network. Second, it analyzes the geometric features of the network structure and combines them with the syntactic and structural information of the web page to build feature vectors for computing document similarity. Because no corpus is required, the algorithm is inherently domain-independent. In large-scale experiments, DWDCS resisted noise well and outperformed both the classic I-Match deduplication algorithm and algorithms that rely solely on a lexical co-occurrence small-world model, achieving precision above 90% and recall above 85%. Its linear time and space complexity and its independence from any corpus make it effective for large-scale web page deduplication in practice.
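The final step described above, comparing documents through feature vectors, can be illustrated with a much simpler stand-in. The sketch below is not DWDCS: it skips concept extraction and the small-world network entirely and uses plain term-frequency vectors with cosine similarity. The function names and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    """Flag two pages as near-duplicates when similarity reaches the threshold."""
    return cosine_similarity(term_vector(text_a), term_vector(text_b)) >= threshold


if __name__ == "__main__":
    page_a = "Reprinting websites produces redundant pages"
    page_b = "Reprinting blogs produces redundant pages"
    print(cosine_similarity(term_vector(page_a), term_vector(page_b)))
    print(is_near_duplicate(page_a, page_b, threshold=0.75))
```

A purely lexical vector like this treats synonyms as unrelated terms, which is the "expression difference" problem that the paper's concept-based network is meant to address.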

Articles published on Duplicate Web Pages

A Study the review of Duplicate Data in Cloud Computing

Application of Artificial Intelligence to Patient-Targeted Health Information on Kidney Stone Disease

URL’S Phishing Detection Based on Machine Learning Approach

On the Commonly-Used Incorrect Visual Representation of Accuracy and Precision

Pelvic vein embolization: an assessment of the readability and quality of online information for patients

SETJoin: a novel top-k similarity join algorithm

Varicocele Embolization: An Assessment of the Quality and Readability of Online Patient Information

A NOVEL TWO-PHASE PAGE FEATURE AND KTH KEYPHRASE FINGERPRINT BASED DUPLICATE DETECTION TECHNIQUE

Natural Language Semantic Construction Based on Cloud Database

Detection and elimination of similar Web pages based on text structure and string of feature code

Near Duplicate Web Page Detection using NDupDet Algorithm

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance

Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling

An Efficient Approach for Finding Near Duplicate Web pages using Minimum Weight Overlapping Method

Near Duplicated Web Pages Detection Based on Concept and Semantic Network

Research of a Novel P2P Search Algorithm Based on Small-World Phenomena

Correlation Based Method to Detect and Remove Redundant Web Document

New algorithm based on repeat sequence deletion
