Abstract

Patent data from countries around the world captures the essence of humanity's scientific discoveries and technological innovations, but language differences remain a major obstacle to retrieving and sharing it. We aim to build a bridge from Chinese to English in the patent domain so that English speakers can make better use of Chinese patent data. Using natural language processing technologies such as optical character recognition, Chinese text processing, machine translation, and English text processing, we construct digital Chinese-English segment-aligned multi-field patent (CESMP) data from scanned Chinese patents. The current CESMP data consists of 610,310 patent documents in XML format. Each patent document contains six required fields (date, publication, ipc, title, abstract, and claim) and four optional fields (cpc, wipo, originalapplicant, and currentowner); the wipo, title, abstract, and claim fields are aligned as Chinese and English segments. This well-structured bilingual patent data supports two lines of work. First, resource construction algorithms can efficiently build a bilingual patent dictionary and a parallel patent segment bank. Second, deep natural language processing algorithms can be applied in practical intelligent applications such as cross-language patent retrieval, patent spam filtering, patent network analysis, and patent machine translation.
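
To make the field structure concrete, the following is a minimal sketch of how one CESMP-style document might be read. The abstract names the ten fields and says which four are segment-aligned, but it does not give the XML schema, so the `<patent>` root, the `<segment>`, `<zh>`, and `<en>` element names, the `parse_patent` helper, and every value in the sample are assumptions made up for illustration, not the dataset's actual layout or content.

```python
import xml.etree.ElementTree as ET

# Field names are taken from the abstract; everything else is hypothetical.
REQUIRED = ("date", "publication", "ipc", "title", "abstract", "claim")
OPTIONAL = ("cpc", "wipo", "originalapplicant", "currentowner")
ALIGNED = ("wipo", "title", "abstract", "claim")

# Invented placeholder document: the element layout and all values are
# assumptions, not an excerpt from the real CESMP data.
SAMPLE = """<patent>
  <date>2000-01-01</date>
  <publication>PLACEHOLDER-0001</publication>
  <ipc>PLACEHOLDER-IPC</ipc>
  <title><segment><zh>一种装置</zh><en>A device</en></segment></title>
  <abstract><segment><zh>本发明公开了一种装置。</zh><en>The invention discloses a device.</en></segment></abstract>
  <claim><segment><zh>一种装置，其特征在于</zh><en>A device, characterized in that</en></segment></claim>
</patent>"""


def parse_patent(xml_text):
    """Parse one document into a dict, checking that required fields exist."""
    root = ET.fromstring(xml_text)
    missing = [name for name in REQUIRED if root.find(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    record = {}
    for name in REQUIRED + OPTIONAL:
        node = root.find(name)
        if node is None:
            continue  # optional field not present in this document
        if name in ALIGNED:
            # Aligned fields become a list of (Chinese, English) segment pairs.
            record[name] = [
                (seg.findtext("zh", default=""), seg.findtext("en", default=""))
                for seg in node.findall("segment")
            ]
        else:
            record[name] = (node.text or "").strip()
    return record


record = parse_patent(SAMPLE)
for zh, en in record["claim"]:
    print(zh, "=>", en)
```

Keeping aligned fields as lists of pairs means that a parallel segment bank could be assembled by concatenating those lists across documents, under the same schema assumptions.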
