Within the broad field of Artificial Intelligence (AI), machine learning (ML) is concerned with improving the performance of computers in executing tasks for which they were not specifically pre-programmed. Applied to the field of Natural Language Processing (NLP), ML helps computers to autonomously learn tasks such as the recognition, understanding and generation of natural language (i.e. the language spoken by humans). In other words, ML applied to NLP refers to the ability of humans to interact with computers in the same way in which humans interact among themselves. On the part of the computer, this implies being able to recognise human language, to understand its meaning, and to interact with it through the generation of new language. Examples of these applications are very common in the current information society. Digital devices such as phones, tablets, watches and an increasing number of household appliances are nowadays equipped with personal assistants (generally called “AI”) which can be activated and communicate through voice. More often than one may think, when calling the customer support service of a growing number of companies – or more commonly when contacting them via social networks – it is not a human who answers the call or replies to the tweets or other messages.

This study focuses specifically on this element, i.e. how computers learn a language. The reason is straightforward: when humans learn a new language, they usually store the training information (e.g. the textbook used to learn it) as an electrochemical trace in the area of the brain dedicated to language. Humans do not need a copyright exception in order to store that copy: traditional copyright law and theory (in addition to common sense) hold that this activity lies outwith the copyright realm. However, it is far from clear that when a computer makes the corresponding digital copy of training material in order to learn a language, this activity is likewise excluded from the copyright domain. On the contrary, any digital copy, temporary or permanent, in whole or in part, direct or indirect, normally has the potential to infringe copyright.

Computers learning natural languages normally need to “train models” using specific ML algorithms. The trained models represent the “memory” of a machine which has learned a language. But how is this memory created? Or, in NLP parlance, how are the models trained? Usually, models are trained on corpora, that is to say on literary works often “available on the internet”. The question thus becomes the following: is the act of training a model for ML purposes a copyright-relevant activity? The answer to this question is relevant not only in terms of copyright law and theory, but more broadly in terms of innovation policy, as it has the potential to determine who has to ask whom for what permission in order to perform ML functions. In other words: who owns AI?

In more precise terms, this short contribution focuses on the act of training a model for ML/NLP purposes and attempts to answer the question of whether this act infringes copyright, in particular the right of reproduction. In addition, the contribution also explores whether there are other rights that may be infringed, in particular the right of adaptation, and thus whether an ML trained model can be considered a creative adaptation of the original corpora.
The reference legal framework will be EU copyright law, with occasional reference to domestic law when necessary.
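By way of illustration only, and not as a description of any particular system examined in the paper, the following minimal Python sketch trains a toy bigram language model on a text corpus; the file name corpus.txt is a hypothetical placeholder for a literary work “available on the internet”. It makes the technical point at issue visible: the full text of the work is copied into memory and processed during training, while the resulting trained model retains only statistics derived from that text.

```python
# Illustrative sketch only: a toy bigram language model trained on a corpus.
from collections import defaultdict

def train_bigram_model(corpus_text: str) -> dict:
    """Return bigram counts (the trained model's 'memory') learned from the corpus."""
    tokens = corpus_text.lower().split()           # the work, copied and tokenised in memory
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):      # the "training" step: counting co-occurrences
        counts[prev][nxt] += 1
    return counts                                  # statistics derived from, not a copy of, the text

if __name__ == "__main__":
    # "corpus.txt" is a hypothetical placeholder for the training corpus.
    with open("corpus.txt", encoding="utf-8") as f:
        model = train_bigram_model(f.read())
    # Print the five most frequent continuations of the word "the".
    print(sorted(model.get("the", {}).items(), key=lambda kv: -kv[1])[:5])
```

Whether making the in-memory copy during training engages the right of reproduction, and whether the resulting model can be regarded as an adaptation of the corpus, are precisely the questions the contribution addresses.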