Abstract

Advances in big data technologies are making it possible to analyze large amounts of data in near real time. These technologies offer great promise in the area of data-driven plant breeding. To fully realize this promise, disparate sources of genotype, environment, management, and socioeconomic data need to be integrated. Collectively, this data could be used to inform genetic predictive models for maize, wheat, and other crops. Some of the primary challenges to analyzing these disparate sources collectively are errors in location data, which include flipped latitude and longitude values, missing negative signs, and, in some cases, missing data. To address these challenges, we have developed an Integrated Tool for AgData Lat Long Imputation and Cleaning (ITALLIC), which detects and corrects errors in location data and imputes missing values for location-dependent data, such as region name.

Location information is considered valid if a multipolygon bounding its coordinates corresponds to the country label. This validation step easily detects common errors, such as missing negative signs or flipped latitude and longitude data. To suggest corrections, combinations of alternative latitude and longitude values are generated, and a query is used to determine the country for each of these possible coordinate pairs. If one of the coordinate pairs corresponds to a country, those coordinates are suggested as the putatively correct latitude and longitude values for that data entry. If this approach fails to correct the error, an open-source API is used to geocode the location.

In addition to identifying and correcting potential errors, ITALLIC includes a visualization tool that makes it easy for users to validate results. Illustratively, when used to analyze data from over 1,400 plant breeding stations around the world, ITALLIC enabled us to validate or correct errors in over 90% of the data. Being able to examine suggested corrections visually made the validation process seamless and convenient. In a few instances, latitude and longitude values were flipped, resulting in a plant breeding station being listed as located in the middle of the ocean. The visualization tool was able to plot both the location in the middle of the ocean and the station’s suggested correct location with a line connecting them. Being able to visualize both the erroneous data point and its suggested location helped us quickly identify and correct errors. ITALLIC is freely available for installation via the publicly accessible Anaconda package management system and the source code has been made available on GitHub. ITALLIC is under active development and has been integrated into the GEMS™ Agroinformatics platform.
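
The validate-then-suggest step described in the abstract can be illustrated with a minimal Python sketch. This is not ITALLIC's code: the function names (`country_at`, `candidate_pairs`, `suggest_correction`) are hypothetical, and the country lookup uses a few coarse, made-up bounding boxes as a stand-in for the multipolygon query the abstract mentions. The geocoding fallback is not shown.

```python
from itertools import product

# ASSUMPTION: coarse illustrative boxes (min_lat, max_lat, min_lon, max_lon).
# The abstract describes multipolygons; boxes only keep this sketch self-contained.
TOY_COUNTRY_BOXES = {
    "Brazil": (-34.0, 5.5, -74.0, -34.0),
    "Kenya": (-5.0, 5.5, 33.5, 42.0),
    "India": (6.0, 36.0, 68.0, 98.0),
}


def country_at(lat, lon):
    """Return the toy country whose box contains (lat, lon), or None."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None  # not a valid coordinate pair at all
    for name, (lat0, lat1, lon0, lon1) in TOY_COUNTRY_BOXES.items():
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return name
    return None


def candidate_pairs(lat, lon):
    """Yield alternative pairs: as given or swapped, with each sign optionally flipped."""
    seen = set()
    for a, b in ((lat, lon), (lon, lat)):
        for sign_a, sign_b in product((1, -1), repeat=2):
            pair = (sign_a * a, sign_b * b)
            if pair not in seen:
                seen.add(pair)
                yield pair


def suggest_correction(lat, lon, country):
    """Return the first candidate pair that falls in the labelled country, else None."""
    for cand in candidate_pairs(lat, lon):
        if country_at(*cand) == country:
            return cand
    return None  # a real pipeline would fall back to geocoding here


# Missing negative signs on a Brazilian station, and a flipped pair for an Indian one.
print(suggest_correction(15.8, 47.9, "Brazil"))
print(suggest_correction(77.2, 28.6, "India"))
```

Because the unmodified pair is the first candidate tried, an already valid record is returned unchanged. Returning the first match is a simplification; when several candidates fall inside the labelled country, a suggestion like this still needs the kind of visual check the abstract describes.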
