Large Annotated Datasets Research Articles

Background: With the large and growing volume of textual data, automated methods for identifying significant topics to classify textual documents have received increasing interest. Despite many efforts in this direction, the task remains a real challenge. It is harder still because full texts are not always freely available. Annotating documents using only partial information is therefore promising but remains an ambitious goal.

Methods: We propose two classification methods: a k-nearest neighbours (kNN)-based approach and an explicit semantic analysis (ESA)-based approach. Although kNN-based approaches are widely used in text classification, they need to be improved to perform well in this specific classification problem, which deals with partial information. Compared to existing kNN-based methods, our method uses classical machine learning (ML) algorithms for ranking the labels. Additional features are also investigated in order to improve the classifiers' performance. In addition, we combine several learning algorithms with various techniques for fixing the number of relevant topics. ESA, for its part, seems promising for this classification task because it has yielded interesting results on related problems, such as computing semantic relatedness between texts and text classification. Unlike existing works, which use ESA to enrich the bag-of-words approach with additional knowledge-based features, our ESA-based method builds a standalone classifier. Furthermore, we investigate whether the results of this method could be useful as a complementary feature of our kNN-based approach.

Results: Experimental evaluations performed on large standard annotated datasets, provided by the BioASQ organizers, show that the kNN-based method with the Random Forest learning algorithm performs well compared with current state-of-the-art methods, reaching a competitive F-measure of 0.55, while the ESA-based approach surprisingly yielded unsatisfactory results.

Conclusions: We have proposed simple classification methods suitable for annotating textual documents using only partial information. They are therefore adequate for large multi-label classification, particularly in the biomedical domain. Our work thus contributes to the extraction of relevant information from unstructured documents in order to facilitate their automated processing. Consequently, it could be used for various purposes, including document indexing and information retrieval.
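To make the kNN idea above concrete, here is a minimal sketch of neighbour-based label ranking for multi-label annotation. It is an illustration under stated assumptions, not the authors' method: the abstract says the labels are ranked with classical ML algorithms (such as Random Forest), whereas this sketch swaps in a plain similarity-weighted vote over the labels of the k nearest documents. The function names, the whitespace tokenizer, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def bag_of_words(text):
    """Lower-cased whitespace tokenization into term counts (an assumed, simplistic tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_labels(query, corpus, k=2):
    """Rank candidate labels for `query` from its k nearest annotated documents.

    `corpus` is a list of (text, labels) pairs. Each neighbour votes for its
    labels with a weight equal to its similarity to the query. This vote is a
    stand-in for the learned ranker described in the abstract.
    """
    q = bag_of_words(query)
    neighbours = sorted(
        ((cosine(q, bag_of_words(text)), labels) for text, labels in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for similarity, labels in neighbours:
        for label in labels:
            votes[label] += similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(votes, key=lambda label: (-votes[label], label))


# Hypothetical toy corpus of short texts (standing in for partial information).
corpus = [
    ("rna interference screen", ["Genetics"]),
    ("gene rna expression", ["Genetics", "Expression"]),
    ("face image recognition", ["Vision"]),
]
ranked = rank_labels("rna gene screen", corpus, k=2)
```

Deciding how many of the ranked labels to keep is a separate step; the abstract mentions several techniques for fixing that number, and none of them is reproduced here.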

Background: High-throughput RNA interference (RNAi) screening has become a widely used approach to elucidating gene functions. However, analysis and annotation of the large data sets generated from these screens have been a challenge for researchers without a programming background. Over the years, numerous data analysis methods have been developed for plate quality control and hit selection, and implemented in a few open-access software packages. Recently, strictly standardized mean difference (SSMD) has become a widely used method for RNAi screening analysis, mainly due to its better control of false negative and false positive rates and its ability to quantify RNAi effects on a statistical basis. We have developed GUItars to enable researchers without a programming background to use SSMD as both a plate quality metric and a hit selection metric when analyzing large data sets.

Results: The software provides an intuitive graphical user interface for an easy and rapid analysis workflow. SSMD analysis methods are provided alongside the traditionally used z-score, normalized percent activity, and t-test methods for hit selection. GUItars is capable of analyzing large-scale data sets from screens with or without replicates. The software is designed to automatically generate and save numerous graphical outputs known to be among the most informative high-throughput data visualization tools, capturing plate-wise and screen-wise performance. Graphical outputs are also written in HTML format for easy access, and a comprehensive summary of screening results is written to tab-delimited output files.

Conclusion: With GUItars, we demonstrated a robust SSMD-based analysis workflow on a 3840-gene small interfering RNA (siRNA) library and identified 200 siRNAs that increased and 150 siRNAs that decreased the assay activities with moderate to stronger effects. GUItars enables rapid analysis and illustration of data from large- or small-scale RNAi screens using SSMD and other traditional analysis methods. The software is freely available at http://sourceforge.net/projects/guitars/.
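For readers unfamiliar with the metrics named above, the following sketch shows how SSMD between two independent groups of wells and plate-wise z-scores are commonly computed. It is not GUItars code; the abstract does not show the tool's implementation. The sketch assumes the standard definition of SSMD for independent groups, (mean_a - mean_b) / sqrt(var_a + var_b), estimated with sample means and sample variances, and the function names and example values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def ssmd(group_a, group_b):
    """SSMD estimate for two independent groups.

    Uses sample means and sample variances:
    (mean_a - mean_b) / sqrt(var_a + var_b).
    """
    return (mean(group_a) - mean(group_b)) / sqrt(variance(group_a) + variance(group_b))


def z_scores(values):
    """Z-score of each value relative to the sample mean and sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical readouts: positive-control wells vs. negative-control wells.
positive_controls = [1.0, 2.0, 3.0]
negative_controls = [4.0, 5.0, 6.0]
quality = ssmd(positive_controls, negative_controls)
plate_z = z_scores([1.0, 2.0, 3.0])
```

An SSMD of larger magnitude indicates better separation between the two groups, which is why the same quantity can serve both as a plate quality metric (controls against controls) and as a hit selection metric (sample against negative reference).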

Articles published on Large Annotated Datasets

Unconstrained Still/Video-Based Face Verification with Deep Convolutional Neural Networks

Fine-tuning Convolutional Neural Networks for Biomedical Image Analysis: Actively and Incrementally.

Words Matter: Scene Text for Image Classification and Retrieval

Non-invasive Fetal ECG Signal Quality Assessment for Multichannel Heart Rate Estimation.

Memorable and rich video summarization

A Novel Approach for Detecting Emotion in Text

Correspondence Driven Saliency Transfer.

Large scale biomedical texts classification: a kNN and an ESA-based approaches

Improving semi-supervised learning through optimum connectivity

Improving Concept-Based Image Retrieval with Training Weights Computed from Tags

Learning to classify gender from four million images

HuPBA8k+: Dataset and ECOC-Graph-Cut based segmentation of human limbs

Nonparametric label propagation using mutual local similarity in nearest neighbors

Combinatorial Approach for Large-scale Identification of Linked Peptides from Tandem Mass Spectrometry Spectra

Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique

GUItars: A GUI Tool for Analysis of High-Throughput RNA Interference Screening Data

Markup SVG—An Online Content-Aware Image Abstraction and Annotation Tool

A framework for unsupervised training of object detectors from unlabeled surveillance video

A Compositional and Dynamic Model for Face Aging

Unsupervised and simultaneous training of multiple object detectors from unlabeled surveillance video
