Abstract
In this research, we propose two machine-learning-based estimation algorithms for extracting cross-lingual news pairs from financial news articles. Every second, vast amounts of text data, including all kinds of news, reports, messages, reviews, comments, and tweets, are generated on the Internet, written not only in English but also in other languages such as Chinese, Japanese, and French. Taking advantage of the multilingual text resources provided by Thomson Reuters News, we developed two estimation algorithms for extracting cross-lingual news pairs from these resources. In the first method, we propose a novel structure that makes effective use of both word information and machine learning for this task. We also developed a bidirectional Long Short-Term Memory (LSTM) based method to calculate cross-lingual semantic text similarity for long text and short text. Thus, when an important news article is published, users can use our method to read similar news articles written in their native language.
Highlights
Text similarity, as its name suggests, refers to how similar a given text query is to others
The fundamental objective is to develop algorithms that estimate the semantic similarity of two pieces of text written in different languages, applicable to both long text and short text, by taking advantage of the vast untapped repository of text resources in Thomson Reuters economics news reports
We developed a new recurrent structure inspired by Manhattan LSTM (MaLSTM) by modifying the Siamese Long Short-Term Memory (LSTM) modules into "unbalanced" ones and adding a fully connected neural network layer after the output of the LSTM modules, which makes the structure more flexible and effective for text similarity tasks
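To make the last highlight more concrete, the following minimal sketch contrasts the fixed Manhattan similarity used by MaLSTM, exp(-||h_a - h_b||_1), with a learned fully connected layer placed on top of the two encoder outputs. It is an illustration under stated assumptions, not the authors' implementation: the vectors stand in for LSTM outputs (no LSTM is run here), the function names, weights, and dimensions are hypothetical, and reading "unbalanced" as "the two encoders may produce vectors of different sizes" is an interpretation that the text above does not spell out.

```python
import math


def manhattan_similarity(h_a, h_b):
    """MaLSTM-style similarity: exp(-L1 distance), a value in (0, 1]."""
    if len(h_a) != len(h_b):
        raise ValueError("Manhattan similarity needs equal-length vectors")
    return math.exp(-sum(abs(a - b) for a, b in zip(h_a, h_b)))


def dense_similarity(h_a, h_b, weights, bias):
    """Hypothetical head: one fully connected unit with a sigmoid,
    applied to the concatenation of the two encoder outputs.
    The two vectors are allowed to have different lengths."""
    features = list(h_a) + list(h_b)
    if len(weights) != len(features):
        raise ValueError("need one weight per concatenated feature")
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Stand-in encoder outputs (made-up numbers, not real LSTM states).
    h_en = [0.2, -0.1, 0.4]             # assumed English-side vector, size 3
    h_ja = [0.2, -0.1, 0.4, 0.3, -0.5]  # assumed Japanese-side vector, size 5

    # The fixed Manhattan score requires both sides to have the same size.
    print(manhattan_similarity(h_en, h_en))

    # The learned head accepts different sizes; weights here are arbitrary.
    weights = [0.5, -0.3, 0.8, 0.1, 0.2, -0.4, 0.6, 0.7]
    print(dense_similarity(h_en, h_ja, weights, bias=0.0))
```

The design point the sketch tries to show is that a fixed distance function forces both encoders into the same vector space and size, whereas a trainable layer over the concatenated outputs lets the model learn how to compare them. In a real model the weights would be trained jointly with the LSTM modules.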
Summary
Text similarity, as its name suggests, refers to how similar a given text query is to others. The text can be considered at the character, word, sentence, or paragraph level, or at the even longer document level. We mainly discuss text in the form of sentences (i.e., short text) and documents (i.e., long text). The fundamental objective is to develop algorithms that estimate the semantic similarity of two pieces of text written in different languages, applicable to both long text and short text, by taking advantage of the vast untapped repository of text resources in Thomson Reuters economics news reports. We mine cross-lingual resources from the enormous Thomson Reuters News database and build an effective cross-lingual system on top of this undeveloped resource