The nation’s first three thematic collections networks (TCNs), part of the massive effort to digitize the nation’s biological collections, are dealing with significant technological and labor challenges. Each TCN digitizes millions of specimen records of a major group of organisms, unified around a significant scientific topic, such as the effects of climate change. “The TCNs are developing new methods of digitization, along with practiced workflows, that are enabling not only their own digitization efforts but those of later-funded TCNs and other institutions,” says Pamela Soltis, of the University of Florida, who is co-principal investigator (PI) and research director of iDigBio, the National Science Foundation–funded coordinating group of the digitization effort.

The North American Lichen and Bryophyte TCN, led by the University of Wisconsin–Madison, is on track to digitize some 4 million records of specimens. The TCN’s 75 collaborating collections have imaged 400,000 lichen and 350,000 bryophyte specimen labels, adding to more than 1.5 million records that had already been digitized and are now integrated into the TCN’s system. “What we’re trying to do now is read label images with optical character recognition [OCR] programs,” says PI Corinna Gries. The hope is that the software will accurately turn the images into words and then that “natural language processing will parse them automatically into a database,” she explains. The technology works fairly well for modern-era labels that were originally produced on a computer, but the OCR programs fail, she says, when they try to decipher labels that are handwritten; typed using old fonts or a mix of fonts; or include an image, such as a map. To make that information accessible online, the data must then be transcribed. “That will be a big effort of manpower,” says Gries.
“So we’re trying to put out [a call for] volunteers [to] help us to transcribe, and we’re hoping that people get interested in the lichens in their area” (http://lbcc1.acis.ufl.edu).

Perhaps the biggest technological challenge taken on by a TCN is that of digitizing the records of some 56 million insect specimens held in vials, pinned in drawers, or mounted on slides. That is the goal of InvertNet, which is based at the Illinois Natural History Survey. There, researchers hope to automate the process to a much higher level than has ever been done, explains PI Christopher Dietrich. “The problem with specimens in insect collections is you have tiny little labels pinned under the specimen,” he says. “So the system we’re developing is going to allow us to take images [from] multiple angles of the drawers so we can virtually tilt the specimens. It’s basically photography—but instead of taking the pictures of the drawer from the top down or front to back or left to right, we can do a partial 3-D reconstruction of the drawer.” InvertNet, aided by its computer science and engineering partners, is nearing completion of a robotic system prototype. The plan is for 15 of InvertNet’s largest collaborating collections to each have its own robot to take images of an entire drawer containing hundreds of specimens, at a clip of 5 minutes per drawer. “Technology... [has] turned out to be more of a challenge than we anticipated,” Dietrich says. “A lot of details in the workload are problematic, a lot of things take a long time to figure out. There are issues with software and with hardware—the whole process of putting this together in an automated way. We’re trying to reduce the possibilities of human error and make it as efficient as possible.”

Taking a different approach, the Plants, Herbivores, and Parasitoids TCN, led by the American Museum of Natural History, is relying on humans. “We can train a human in a couple of days to do quite accurate work,” says PI Randall T. Schuh.
“Machines are much harder to train. We believe humans not only do a pretty good job, but our ability to train them is something we have confidence in. They can make a lot of decisions for us that machines simply can’t.” Furthermore, he adds, using humans allows the TCN to give aspiring biologists and others direct scientific training through their participation in the data-capture process. The TCN’s 31 collections have captured over 400,000 insect records and a roughly equal number of host-plant specimens toward a goal of 1.38 million insects and 2.6 million plants. In his TCN’s case, says Schuh, the main challenge is not technological but “just the sheer number of specimens.” Schuh says that they’re close to being on track for the imaging and are already looking at analyzing the data. “We’re working on the question of host-plant–insect interaction: How are insects related to the plants that they feed on?” he says. “We’ve been working on analytic approaches to the data [that] we’re capturing. In the next couple months, we’ll have substantial results.”