Abstract

Open access platforms and retail websites are both trying to present the most relevant offerings to their patrons. Retail websites deploy recommender systems that collect data about their customers. These systems are successful but intrude on privacy. As an alternative, this paper presents an algorithm that uses text mining techniques to find the most important themes of an open access book or chapter. By locating other publications that share one or more of these themes, it is possible to recommend closely related books or chapters.
The algorithm splits the full text into trigrams (sequences of three consecutive words). It removes all trigrams containing words that are commonly used in everyday language and in (open access) book publishing. The most frequently occurring remaining trigrams are distinctive to the publication and indicate the themes of the book. The next step is finding publications that share one or more of these trigrams. The strength of the connection can be measured by counting – and ranking – the number of shared trigrams. The algorithm was used to find connections between 10,997 titles: 67% in English, 29% in German and 6% in Dutch or a combination of languages. The algorithm is able to find connected books across languages.
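The steps described above can be illustrated with a minimal sketch. It is written in Python for illustration only; the algorithm in the paper is implemented in R, and this is not that implementation. The stop list `COMMON_WORDS`, the cut-off `k`, and the function names `tokenize`, `top_trigrams` and `rank_connections` are all assumptions made for this example, since the abstract does not give the actual word lists or thresholds.

```python
import re
from collections import Counter

# Hypothetical stop list: the word lists used in the paper are not given in
# the abstract. It mixes everyday words with book-publishing terms.
COMMON_WORDS = {
    "the", "of", "and", "a", "in", "to", "is", "this",
    "press", "university", "chapter",
}


def tokenize(text):
    """Lowercase the text and return its words (runs of Unicode letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_trigrams(text, common_words=COMMON_WORDS, k=10):
    """Return the k most frequent trigrams that contain no common word.

    k is an assumed cut-off, not a value taken from the paper.
    """
    words = tokenize(text)
    trigrams = zip(words, words[1:], words[2:])
    # Drop every trigram in which at least one word is on the stop list.
    kept = (t for t in trigrams if not any(w in common_words for w in t))
    return [trigram for trigram, _ in Counter(kept).most_common(k)]


def rank_connections(target_id, themes_by_title):
    """Rank other titles by the number of trigrams shared with the target.

    themes_by_title maps a title identifier to its list of top trigrams.
    Titles that share no trigram with the target are left out.
    """
    target = set(themes_by_title[target_id])
    scores = {}
    for title, themes in themes_by_title.items():
        if title == target_id:
            continue
        shared = len(target & set(themes))
        if shared:
            scores[title] = shared
    # Strongest connection first; ties are broken by title identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

In this sketch, `top_trigrams` would be called once per publication to build `themes_by_title`, after which `rank_connections` lists the related titles for any one of them. Because titles are compared only on the trigrams themselves, a connection can appear wherever two texts happen to share a trigram, regardless of the language of the rest of the text.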
The algorithm can serve several use cases, not just recommender systems. Creating benchmarks for publishers or creating a collection of connected titles for libraries are other possibilities. Apart from the OAPEN Library, the algorithm can be applied to other collections of open access books or even open access journal articles. Combining the results across multiple collections will enhance its effectiveness.

Highlights

  • Open access platforms and retail websites have one thing in common: they are trying to present the most relevant offerings possible to their patrons

  • Recommender systems based on personal data are successful but are not a viable option for those who want to protect the privacy of their users

  • Deploying an ngram-based algorithm is a good alternative for open access books, as it uses the contents of the publications

Summary

Introduction

Open access platforms and retail websites have one thing in common: they are trying to present the most relevant offerings possible to their patrons. Removing all trigrams that contain commonly used words brings the remaining number back to two. Applying this procedure to the complete text of a book still creates a large set of trigrams, hence the need for additional filtering using terms that are common in open access academic books. A text mining algorithm written in the R programming language uses the full text of the publications, filters the trigrams and creates an overview of closely related books and chapters. Different users may have different needs: a reader might be interested in finding a few select titles, while a library might want to download a larger collection of books around a certain topic.

Background
Libraries and Privacy
Recommender Systems
Ngrams
Other Experiments
Finding Related Titles by Algorithm
The Algorithm
The Data Set
Finding Connected Titles
Single Book
Groups
Finding Translations
Use Cases
Findings
Conclusion
