From science to law enforcement, many research questions are answerable only by poring over a large amount of unstructured text documents. While people can extract information from such documents with high accuracy, this is often too time-consuming to be practical. On the other hand, automated approaches produce nearly-immediate results, but are not reliable enough for applications where near-perfect precision is essential. Motivated by two use cases from criminal justice, we consider the benefits and drawbacks of various human-only, human–machine, and machine-only approaches. Finding no tool well suited for our use cases, we develop a human-in-the-loop method for fast but accurate extraction of structured data from unstructured text. The tool is based on automated extraction followed by human validation, and is particularly useful in cases where purely manual extraction is not practical. Testing on three criminal justice datasets, we find that the combination of the computer speed and human understanding yields precision comparable to manual annotation while requiring only a fraction of time, and significantly outperforms the precision of all fully automated baselines.
Read full abstract