Abstract

Background: As the need for sharing genomic data grows, privacy concerns arise, such as the ethics of data sharing and the disclosure of personal information.

Objective: The main purpose of this study was to verify whether genomic data are sufficient to predict a patient's personal information.

Methods: RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Model performance was measured by accuracy and area under the receiver operating characteristic curve. We selected five cancer types with large sample sizes (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) to verify whether predictive accuracy would differ between them. We also validated the four machine learning models on normal samples from 593 cancer patients.

Results: In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, the best-performing models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively. Among the tested machine learning methods, the support vector machine algorithm achieved the highest predictive accuracy (mean accuracy 0.77), while the random forest method achieved the lowest (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types.

Conclusions: We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that the personal information available from RNA expression data is limited and that this information cannot be used to identify specific patients.
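The two performance measures named in the Methods, accuracy and area under the receiver operating characteristic curve (AUC), can be illustrated with a minimal sketch. The function names, the toy labels, and the 0.5 decision threshold below are assumptions made for illustration and are not taken from the study; the AUC shown is the binary case only, and how the authors extended it to multi-class variables is not stated here.

```python
from typing import Sequence


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUC computed as the probability that a randomly chosen
    positive sample receives a higher score than a randomly chosen
    negative sample (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and classifier scores (illustrative only, not study data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

# Hypothetical decision threshold of 0.5 to turn scores into labels.
preds = [int(s >= 0.5) for s in scores]

print(f"accuracy = {accuracy(labels, preds):.3f}")
print(f"AUC      = {roc_auc(labels, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes how well the scores rank positives above negatives across all thresholds, which is one reason the two measures are often reported together.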

Highlights

  • High-throughput sequencing and array technologies, such as next-generation sequencing and microarrays, can be applied to personalized genomics and for medical purposes

  • This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients

  • Gender consisted of two groups, age was classified into nine groups, cancer type consisted of 32 groups, race consisted of five groups, and cancer stage consisted of four groups
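Because the number of groups differs so much between variables, a raw accuracy is easier to interpret next to a chance-level baseline. The sketch below computes the accuracy of uniform random guessing from the group counts listed above, plus a majority-class baseline for imbalanced labels. The helper names and the toy label list are hypothetical, and the uniform baseline assumes equally sized groups, which the source does not state.

```python
from collections import Counter
from typing import Hashable, Sequence

# Number of groups per variable, as listed in the highlight above.
N_CLASSES = {
    "gender": 2,
    "age": 9,
    "cancer type": 32,
    "race": 5,
    "cancer stage": 4,
}


def uniform_chance(n_classes: int) -> float:
    """Expected accuracy of guessing uniformly at random among n_classes,
    assuming (hypothetically) that all classes are equally frequent."""
    return 1.0 / n_classes


def majority_baseline(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


for name, k in N_CLASSES.items():
    print(f"{name:>12}: {k:2d} groups, uniform chance = {uniform_chance(k):.3f}")

# Hypothetical imbalanced labels (not study data): 8 of 10 in one group.
toy_labels = ["A"] * 8 + ["B"] * 2
print(f"majority baseline on toy labels = {majority_baseline(toy_labels):.2f}")
```

When one group dominates a variable, the majority-class baseline can be far above uniform chance, so a high accuracy on such a variable does not by itself show that the model learned anything from the expression data.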


Summary

Introduction

High-throughput sequencing and array technologies, such as next-generation sequencing and microarrays, can be applied to personalized genomics and for medical purposes. These technologies will enable comprehensive multiomics analysis at various levels, including genomics, transcriptomics, and proteomics. The ability to collect and store personal data has exploded, making genomic analysis a viable method for improving diagnostic accuracy and personalized medicine. These advances require both the collection and sharing of high-resolution genetic profiles among researchers and institutions. The Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) sets standards for the privacy and security of health records in the United States [3]. Public databases such as The Cancer Genome Atlas (TCGA) obtain patient consent to share their genetic data. As the need for sharing genomic data grows, privacy concerns arise, such as the ethics of data sharing and the disclosure of personal information.

