Outlier analyses of the Protein Data Bank archive using a probability-density-ranking approach

Chenghua Shao, Huanwang Yang, Zonghong Liu, Sijian Wang, Stephen K Burley

doi:10.1038/sdata.2018.293

Open Access

Journal: Scientific Data
Publication Date: Dec 1, 2018
Citations: 7
License type: open-access

Affiliation: Rutgers, The State University of New Jersey

Abstract

Outlier analyses are central to scientific data assessments. Conventional outlier identification methods do not work effectively for Protein Data Bank (PDB) data, which are characterized by heavy skewness and the presence of bounds and/or long tails. We have developed a data-driven nonparametric method to identify outliers in PDB data based on kernel probability density estimation. Unlike conventional outlier analyses based on location and scale, Probability Density Ranking can be used for robust assessments of distance from other observations. Analyzing PDB data from the vantage points of probability and frequency enables proper outlier identification, which is important for quality control during deposition-validation-biocuration of new three-dimensional structure data. Ranking of Probability Density also permits use of Most Probable Range as a robust measure of data dispersion that is more compact than Interquartile Range. The Probability-Density-Ranking approach can be employed to analyze outliers and data-spread on any large data set with continuous distribution.
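
As an illustration of the idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It estimates a one-dimensional Gaussian kernel density, ranks observations by their estimated density, flags the lowest-ranked observations as outlier candidates, and reports the span of the most probable observations as a dispersion measure. The function names, the normal-reference bandwidth rule, the 50% default fraction, and the use of a single min-max interval (reasonable only for a unimodal sample) are all assumptions made for this example.

```python
import math
import statistics


def gaussian_kde(data, bandwidth=None):
    """Return a function that estimates the probability density at x."""
    n = len(data)
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb; the source does not
        # state which bandwidth selector the authors used.
        bandwidth = 1.06 * statistics.stdev(data) * n ** (-0.2)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def density_ranking(data, bandwidth=None):
    """Return (density, value) pairs sorted from most to least probable."""
    density = gaussian_kde(data, bandwidth)
    return sorted(((density(x), x) for x in data), reverse=True)


def most_probable_range(data, fraction=0.5, bandwidth=None):
    """Span of the `fraction` of observations with the highest density.

    Hypothetical helper: returns one (low, high) interval, which is only
    meaningful when the sample is unimodal.
    """
    ranked = density_ranking(data, bandwidth)
    keep = [x for _, x in ranked[: math.ceil(fraction * len(ranked))]]
    return min(keep), max(keep)


def low_density_outliers(data, fraction=0.01, bandwidth=None):
    """Observations in the lowest `fraction` of the density ranking."""
    ranked = density_ranking(data, bandwidth)
    k = max(1, int(fraction * len(ranked)))
    return sorted(x for _, x in ranked[-k:])


# Toy data: a tight cluster plus one isolated value.
sample = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 10.0]
print(low_density_outliers(sample, fraction=0.1, bandwidth=0.5))  # [10.0]
print(most_probable_range(sample, fraction=0.5, bandwidth=0.5))   # (1.2, 1.7)
```

Because the ranking depends on how densely each observation is surrounded by others, and not on its distance from a mean or median, the isolated value is flagged without any assumption about the shape of the distribution. The direct summation costs O(n^2) kernel evaluations, so a sample of archive size would need a binned or library-based density estimate instead.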

Introduction

The Protein Data Bank (PDB) supports secure storage and dissemination of three-dimensional (3D) structures of large biological molecules (proteins, DNA, and RNA)[1,2]. Founded in 1971 as the first open-access digital data resource in biology, the PDB has developed into the single global archive of >140,000 3D structures deposited by researchers worldwide, using experimental methods including. Many PDB structures represent groundbreaking scientific discoveries, garnering numerous Nobel Prizes, including five Chemistry awards in the 21st century[3,4,5,6,7,8,9,10,11]. Since 2003, the PDB archive has been managed by the Worldwide Protein Data Bank partnership (wwPDB, pdb.org)[2]. Over the past five decades, PDB data have enabled scientific breakthroughs in fundamental biology, biomedicine, and energy research[16].

Methods

Results

Conclusion