This article presents a novel approach to constructing a citation network from Jewish Responsa literature, based on the automatic extraction of references from the texts. Jewish Responsa literature contains thousands of answers to questions related to Jewish law (Halachah), spanning over 1,300 years and written by authors from all over the world. This literature abounds with references, but their high lexical and format variability makes their automatic identification and extraction very challenging. We present a multi-layered approach that splits the reference extraction task into two main subtasks: (i) identifying reference boundaries and (ii) identifying the internal components of each reference. We experimented with several machine learning models: a CRF (Conditional Random Field) model, a BERT (Bidirectional Encoder Representations from Transformers) model, and a combined BERT-CRF approach. Additionally, we examined the influence of the training corpus on model accuracy by comparing the performance of models trained on Modern Hebrew with that of models trained on Rabbinic Hebrew. The best results were achieved by a BERT-CRF model trained on Rabbinic Hebrew. The constructed network can be used to build tools for analyzing trends and influences in the Jewish Halachic corpus, such as identifying the most influential authors, the authors' sources of authority, and how these evolve over time and place.