GPTZoo: A Large-scale Dataset of GPTs for the Research Community
The rapid advancements in Large Language Models (LLMs) have revolutionized natural language processing, with GPTs, customized versions of ChatGPT available on the GPT Store, emerging as a prominent technology for specific domains and tasks. To support academic research on GPTs, we introduce GPTZoo, a large-scale dataset comprising 730,420 GPT instances. Each instance includes rich metadata with 21 attributes describing its characteristics, as well as instructions, knowledge files, and third-party services utilized during its development. GPTZoo aims to provide researchers with a comprehensive and readily available resource to study the real-world applications, performance, and potential of GPTs. To facilitate efficient retrieval and analysis of GPTs, we also developed an automated command-line interface (CLI) that supports keyword-based searching of the dataset. To promote open research and innovation, the GPTZoo dataset will undergo continuous updates, and we are granting researchers public access to GPTZoo and its associated tools.
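As a rough illustration of what keyword-based searching over such a dataset might look like, the sketch below filters a list of GPT records by a case-insensitive substring match. It is not the GPTZoo CLI: the function `search_gpts` and the field names `name`, `description`, and `instructions` are hypothetical, chosen only for this example, and the sample records are made up.

```python
# Minimal sketch of keyword search over GPT metadata records.
# Assumption: each record is a dict; the field names below are hypothetical
# and are not taken from the GPTZoo schema.

def search_gpts(records, keyword, fields=("name", "description", "instructions")):
    """Return the records whose selected fields contain the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [
        record
        for record in records
        if any(needle in str(record.get(field, "")).casefold() for field in fields)
    ]


# Made-up sample records, for illustration only.
sample = [
    {
        "name": "Recipe Helper",
        "description": "Suggests weeknight dinners",
        "instructions": "You are a cooking assistant.",
    },
    {
        "name": "Code Reviewer",
        "description": "Reviews Python pull requests",
        "instructions": "Act as a senior engineer.",
    },
]

matches = search_gpts(sample, "python")
print([record["name"] for record in matches])
```

A plain substring match keeps the example short; a real tool over several hundred thousand records would more likely rely on an index.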
- Research Article
15
- 10.1016/j.procs.2010.04.199
- May 1, 2010
- Procedia Computer Science
High-performance astrophysical visualization using Splotch
- Research Article
42
- 10.1146/annurev-ento-072121-075258
- Oct 13, 2021
- Annual Review of Entomology
Community (or citizen) science, the involvement of volunteers in scientific endeavors, has a long history. Over the past few centuries, the contributions of volunteers to our understanding of patterns and processes in entomology have been inspiring. From the collation of large-scale and long-term data sets, which have been instrumental in underpinning our knowledge of the status and trends of many insect groups, to action, including species management, whether for conservation or control, community scientists have played pivotal roles. Contributions, such as pest monitoring by farmers and species discoveries by amateur naturalists, set foundations for the research engaging entomologists today. The next decades will undoubtedly bring new approaches, tools, and technologies to underpin community science. The potential to increase inclusion within community science is providing exciting opportunities within entomology. An increase in the diversity of community scientists, alongside an increasing taxonomic and geographic breadth of initiatives, will bring enormous benefits globally for people and nature.
- Conference Article
5
- 10.1109/fg52635.2021.9667083
- Dec 15, 2021
In this paper, we explore the influence of some frequently used Convolutional Neural Networks (CNNs), training settings, and training set structures on Action Unit (AU) detection. Specifically, we first compare 10 different shallow and deep CNNs in AU detection. Second, we investigate how the different training settings (i.e., centering/normalizing the inputs, using different augmentation severities, and balancing the data) impact the performance in AU detection. Third, we explore the effect of increasing the number of labelled subjects and frames in the training set on the AU detection performance. These comparisons provide the research community with useful tips about the choice of different CNNs and training settings in AU detection. In our analysis, we use a large-scale naturalistic dataset, consisting of ~55K videos captured in the wild. To the best of our knowledge, no prior work has investigated the impact of such settings on a large-scale AU dataset.
- Research Article
3
- 10.1890/0012-9623-90.4.360
- Oct 1, 2009
- The Bulletin of the Ecological Society of America
Annual Reports To Council, Ecological Society of America, August 2009
- Book Chapter
6
- 10.1007/978-3-031-27077-2_40
- Jan 1, 2023
Fish species recognition is an integral part of sustainable marine biodiversity and aquaculture. The rapid emergence of deep learning methods has shown great potential on classification and recognition tasks when trained on a large-scale dataset. Nevertheless, some practical challenges remain for automating the task, e.g., the lack of appropriate methods applied to a complicated fish habitat. In addition, most publicly accessible fish datasets are small in scale and low in resolution, have imbalanced data distributions, or offer limited labels and annotations. In this work, we aim to overcome the aforementioned challenges. First, we construct the OceanFish database, with higher image quality and resolution, which covers a large and diverse set of marine fish species in the East China Sea. The current version covers 63,622 pictures of 136 fine-grained fish species. Accompanying the dataset, we propose a fish recognition testbed that incorporates two widely applied deep neural network based object detection models to exploit the enlarged dataset, and it achieves convincing performance in detection precision and speed. The scale and hierarchy of OceanFish can be further enlarged by enrolling new fish species and annotations. Interested readers may request access to this benchmark dataset and reuse it for their own classification tasks. We hope that the OceanFish database and the fish recognition testbed can serve as a generalized benchmark that motivates further development in related research communities.
- Conference Instance
1
- 10.1007/978-3-540-69139-6-71
- Dec 1, 2008
The enormous growth of public sequence databases and the continuing addition of fully sequenced genomes make this a fertile area for data mining. Clustering research is at the crossroads of several research communities, such as document retrieval, image segmentation, and artificial intelligence, especially machine learning and data mining, in which the data size is very large. In this paper, we surveyed the clustering aspects of protein sequence data sets by pointing out the problems that were encountered during the procedures. Challenges include identifying multidomain proteins, identifying remote homologues, identifying protein families, and dealing with large-scale data sets. We then analyzed the clustering techniques that have been developed by exploring how they addressed the issues. In this survey, we focused on the alignment method and clustering algorithm that were employed. We limit our study to the heuristic-based categories, namely hierarchical and partitional approaches. We concluded this paper with some research issues.
- Research Article
12
- 10.1371/journal.pone.0290779
- Aug 30, 2023
- PLOS ONE
Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language that has recently gained popularity online and is attracting a lot of attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study. In this paper, we address this challenge by collecting the first-ever large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of 1,140,821 tweets in the Urdu language. Manual labeling of such a large number of tweets would have been tedious, error-prone, and practically impossible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, in addition to SentiWordNet, are utilized to categorize extracted tweets into positive, negative, and neutral categories. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral and show a high correlation with each other. This is largely attributed to the fact that these models do not consider emoticons when assigning polarity.
- Research Article
30
- 10.1017/s0950268822000255
- Jan 1, 2022
- Epidemiology and Infection
In the present study, I explored the relationship between people's trust in different agents related to the prevention of the spread of coronavirus disease 2019 (COVID-19) and their compliance with pharmaceutical and non-pharmaceutical preventive measures. The COVIDiSTRESSII Global Survey dataset, which was collected from international samples, was analysed to examine the aforementioned relationship across different countries. For data-driven exploration, network analysis and Bayesian generalised linear model (GLM) analysis were performed. The result from network analysis demonstrated that trust in the scientific research community was most central in the network of trust and compliance. In addition, the outcome from Bayesian GLM analysis indicated that the same factor, trust in the scientific research community, was most fundamental in predicting participants' intent to comply with both pharmaceutical and non-pharmaceutical preventive measures. I briefly discussed the implications of the findings, in particular the importance of trust in the scientific research community in explaining people's compliance with measures to prevent the spread of COVID-19.
- Conference Instance
2
- 10.1145/2307836
- Jun 25, 2012
It is our great pleasure to welcome you to the 4th ACM International Workshop on Hot Topics in Planet-Scale Measurement -- HotPlanet'12. This year's workshop is motivated by the fact that successfully researching, designing and building new mobile, ad-hoc, mesh, and opportunistic networking systems and algorithms requires access to large-scale data on human mobility, encounter, and social network patterns. Unfortunately, the wireless and mobile research communities lack such large-scale data. We believe that large-scale datasets are important, not only in communication network design, but also for fundamental study in other academic disciplines, e.g., epidemiology, urban planning, and social science. Complex networks research has flourished since 1989 when the first large Internet (and later WWW) datasets became available. To achieve similar improvements in mobile networking and related fields, large-scale, and ideally planet-scale, datasets must be collected and made available. Following three previous successful editions of the workshop at ACM MobiSys 2009, 2010, and 2011, the fourth HotPlanet workshop will challenge the community to collect large-scale human mobility traces as well as to propose novel mobility data processing and knowledge discovery techniques. The program committee accepted 7 papers that cover a variety of topics including emerging applications involving large-scale human mobility data collection, human dynamics characterization and modelling, knowledge discovery from mobility data, and methods for choosing and collecting large-scale human mobility datasets. In addition, the program includes a keynote speech, "Through a Graph, Darkly", by Jon Crowcroft, who has extensive experience in collecting human mobility datasets and investigating social network patterns. We hope that these proceedings will serve as a valuable reference for large-scale wireless networking measurement.
- Book Chapter
4
- 10.1007/978-3-540-69139-6_71
- Dec 1, 2008
The enormous growth of public sequence databases and the continuing addition of fully sequenced genomes make this a fertile area for data mining. Clustering research is at the crossroads of several research communities, such as document retrieval, image segmentation, and artificial intelligence, especially machine learning and data mining, in which the data size is very large. In this paper, we surveyed the clustering aspects of protein sequence data sets by pointing out the problems that were encountered during the procedures. Challenges include identifying multidomain proteins, identifying remote homologues, identifying protein families, and dealing with large-scale data sets. We then analyzed the clustering techniques that have been developed by exploring how they addressed the issues. In this survey, we focused on the alignment method and clustering algorithm that were employed. We limit our study to the heuristic-based categories, namely hierarchical and partitional approaches. We concluded this paper with some research issues.
- Book Chapter
- 10.1007/978-3-031-19682-9_88
- Jan 1, 2022
In this paper, we report on the ImageNet Reannotation workshop, conducted with researchers who use ImageNet, to find doubtful data in ImageNet. The recent rapid growth of deep learning is supported by large-scale datasets collected by cloud working, such as ImageNet, but these datasets appear to contain a considerable amount of doubtful data for given tasks. We assume that professionals can find doubtful data efficiently and accurately because they know what kind of data would be better for learning classification tasks. Moreover, we adopted a group working scheme to make the process more efficient and accurate. This paper shows the re-annotation result, which clarifies the category of and reason for doubtfulness in the large-scale dataset constructed by cloud workers.
Keywords: Dataset, Annotation, Cloud sourcing
- Research Article
3869
- 10.1109/tpami.2013.248
- Jul 1, 2014
- IEEE Transactions on Pattern Analysis and Machine Intelligence
We introduce a new dataset, Human3.6M, of 3.6 Million accurate 3D Human poses, acquired by recording the performance of 5 female and 6 male subjects, under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state-of-the-art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time of flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed reality evaluation scenarios where 3D human models are animated using motion capture and inserted using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance compared to a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher capacity, more complex models with our large dataset, is substantially vaster and should stimulate future research. The dataset together with code for the associated large-scale learning models, features, visualization tools, as well as the evaluation server, is available online at http://vision.imar.ro/human3.6m.
- Research Article
- 10.1002/biot.201200055
- Aug 1, 2012
- Biotechnology Journal
BiotecVisions 2012, August
- Preprint Article
- 10.5194/egusphere-egu25-3721
- Mar 18, 2025
Model predictions are paramount to understanding climate and land management effects on soil organic carbon (SOC) stocks and greenhouse gas (GHG) emissions in forests. However, SOC models remain highly uncertain, and multi-model ensembles can be used to evaluate the level of uncertainty of the predictions due to model choice. One major barrier to the use of multiple models is data availability and the time-scale consistency across models. In this work, we present me4soc, a Multi-model Ensemble interface For Soil Organic Carbon predictions. This open-source software offers a complete environment to launch six SOC models widely used by the soil community to predict the dynamics of SOC stocks and GHG fluxes (CO2, CH4, and N2O) in forests. It allows users to explore the effect of nature-based climate solutions over multiple decades under climate and land-use changes. The models can be run with either user-provided observational data or data automatically extracted from large-scale open-source datasets for the European region. Available earth system model predictions are used to simulate climate and land-use change scenarios. The tool has been developed in Shiny, an R-based package for simple web application development. The obtained results showed the ability of me4soc to simulate the temporal dynamics of SOC stocks and GHG emissions at site-scale under different climate, land-use, and land management change scenarios. Employing multiple models based on different mathematical structures offers a unique opportunity to estimate the uncertainties in the predictions associated with differences in the model structure. This tool can be applied by the scientific community, forest managers, and policymakers to acquire scientifically-based information about the effects of forest management and disturbances on SOC stocks and GHG emissions.
It is an important step towards the use of state-of-the-art models and large-scale datasets to improve model predictions and assess their uncertainties. The software's systematic validation with observational data and parameter optimization to improve model fit are the key priorities of future work. Further software developments to cover other ecosystems (e.g., croplands and grasslands) and data-less sites outside of Europe are also foreseen.
- Research Article
8
- 10.1038/s41598-023-31180-z
- Mar 14, 2023
- Scientific Reports
To better understand the relationship between normal and neoplastic brain, we combined five publicly available large-scale datasets, correcting for batch effects and applying Uniform Manifold Approximation and Projection (UMAP) to RNA-Seq data. We assembled a reference Brain-UMAP including 702 adult gliomas, 802 pediatric tumors and 1409 healthy normal brain samples, which can be utilized to investigate the wealth of information obtained from combining several publicly available datasets to study a single organ site. Normal brain regions and tumor types create distinct clusters, and because the landscape is generated by RNA-Seq, comparative gene expression profiles and gene ontology patterns are readily evident. To our knowledge, this is the first meta-analysis that allows for comparison of gene expression and pathways of interest across adult gliomas, pediatric brain tumors, and normal brain regions. We provide access to this resource via the open source, interactive online tool Oncoscape, where the scientific community can readily visualize clinical metadata, gene expression patterns, gene fusions, mutations, and copy number patterns for individual genes and pathways over this reference landscape.