Towards Leveraging LLMs for Reducing Open Source Onboarding Information Overload
Consistent, diverse, and quality contributions are essential to the sustainability of the open source community. Therefore, it is important that there is infrastructure for effectively onboarding and retaining diverse newcomers to open source software projects. Most often, open source projects rely on onboarding documentation to support newcomers in making their first contributions. Unfortunately, prior studies suggest that information overload from available documentation, along with the predominantly monolingual nature of repositories, can have negative effects on newcomer experiences and the onboarding process. This, coupled with the effort involved in creating and maintaining onboarding documentation, suggests a need for support in creating more accessible documentation. Large language models (LLMs) have shown great potential in providing text transformation support in other domains, and have even shown promise in simplifying or generating other kinds of computing artifacts, such as source code and technical documentation. We contend that LLMs can also help make software onboarding documentation more accessible, thereby reducing the potential for information overload. Using ChatGPT (GPT-3.5 Turbo) and Gemini Pro as case studies, we assessed the effectiveness of LLMs for simplifying software onboarding documentation, one method for reducing information overload. We discuss a broader vision for using LLMs to support the creation of more accessible documentation and outline future research directions toward this vision.
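The simplification workflow the abstract describes can be sketched as a prompt-construction step feeding an LLM. The `call_llm` helper and its wiring are hypothetical placeholders, not the API or prompts used in the study; only the prompt-building logic is shown concretely.

```python
def build_simplification_prompt(doc_text: str, reading_level: str = "plain language") -> str:
    """Compose a prompt asking an LLM to simplify onboarding documentation
    while keeping the steps a newcomer needs for a first contribution."""
    return (
        f"Rewrite the following open source onboarding documentation in {reading_level}, "
        "preserving every step and technical term needed to make a first contribution:\n\n"
        f"{doc_text}"
    )

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call (e.g., GPT-3.5 Turbo or Gemini Pro)."""
    raise NotImplementedError("Wire this to your provider's chat-completion endpoint.")

# Example input: a terse CONTRIBUTING snippet a newcomer might face.
contributing_md = "Fork the repo, create a branch, run the test suite, then open a pull request."
prompt = build_simplification_prompt(contributing_md)
```

In practice the returned text would be reviewed by maintainers before replacing the original, since simplification can drop load-bearing detail.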
- Supplementary Content
2
- 10.0253/tuprints-00004055
- Jan 1, 2014
Handling hundreds of thousands of files is a major challenge in today’s digital forensics. In order to cope with this information overload, investigators often apply hash functions for automated input identification. Besides identifying exact duplicates, a problem mostly solved by running cryptographic hash functions, it is also necessary to cope with similar inputs (e.g., different versions of a file), embedded objects (e.g., a JPG within an office document), and fragments (e.g., network packets). Thus, the essential idea is to complement the use of cryptographic hash functions, which detect data objects with bytewise identical representations, with the capability to find objects with bytewise similar representations. Unlike cryptographic hash functions, which have a wide range of applications and have been studied and tested for a long time, approximate matching algorithms are still in their early stages of development. More precisely, the community currently lacks a definition, an evaluation methodology, and (additional) fields of application. Therefore, this thesis aims at establishing approximate matching in computer science, with a special focus on digital forensic investigations. One of our first steps was to develop, in collaboration with the National Institute of Standards and Technology (NIST), a generic definition for approximate matching that is applicable to its different levels, e.g., bytewise and semantic. A subsequent detailed analysis of existing approaches uncovers different strengths and weaknesses, for which we present improvements. To extend the range of algorithms, this work introduces three new algorithms based on well-known techniques from computer science. A core contribution of this thesis is the open source evaluation framework FRASH, which assesses tools against different criteria.
Besides traditional properties borrowed from hash functions, such as generation efficiency and space efficiency (compression), we conceive methods to determine precision and recall rates based on synthetic as well as real-world data. Since digital investigations are often time-critical, we improve the performance of automated file identification with a mechanism we call prefetching. Compared to a straightforward analysis, performance increases by almost 40% without additional hardware. In this context we also discuss the impact of different hashing/approximate matching algorithms on digital investigations and conclude that it is entirely reasonable to apply cryptographic hashing as well as bytewise/semantic approximate matching algorithms in a prosecution. To extend the fields of application, this thesis demonstrates the capabilities of approximate matching for network traffic analysis and biometric template protection. Our research shows that approximate matching is well suited for data leakage prevention and can also be applied to biometric template protection, biometric data compression, and efficient biometric identification.
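The contrast the thesis draws, exact matching via cryptographic digests versus bytewise similarity, can be illustrated with a minimal sketch. The n-gram Jaccard measure below is a toy stand-in chosen for clarity; it is not one of the thesis's algorithms, and production approximate-matching tools (e.g., ssdeep, sdhash) are far more robust.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Exact-duplicate detection: identical inputs yield identical digests,
    while a single changed byte yields a completely different digest."""
    return hashlib.sha256(data).hexdigest()

def ngram_jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Toy bytewise similarity: Jaccard overlap of the sets of byte n-grams.
    Returns 1.0 for identical inputs, values near 0 for unrelated ones."""
    def grams(d: bytes) -> set:
        return {d[i:i + n] for i in range(len(d) - n + 1)}
    ga, gb = grams(a), grams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0

# Two near-identical file versions: crypto hashing sees them as unrelated,
# a similarity measure still scores them as close.
v1 = b"The quick brown fox jumps over the lazy dog."
v2 = b"The quick brown fox jumped over the lazy dog."
```

This is exactly the gap the thesis targets: `sha256_digest` flags only bytewise-identical objects, while a similarity score can surface different versions of the same file.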
- Research Article
1
- 10.26483/ijarcs.v6i2.2450
- Jan 1, 2015
- International Journal of Advanced Research in Computer Science
Reality mining has emerged as one of the most active topics in the field of data mining. Rather than mining static data sets, reality mining focuses on data gathered from mobile phones and other open sources under consideration or surveillance. This proposal aims to show the utility of reality mining in fields that can not only enhance human living conditions but also predict interesting facts about people, helping to address serious problems related to health, road traffic congestion, human behavior, and disease. To address road traffic congestion, the idea is to collect data from mobile phones in the most traffic-prone areas of the city. The proposal takes a real-time approach to designing an application that reduces road traffic using the open source Android platform. Because Android mobile phones are available to the majority of people, this application can be used easily. Keywords: Reality Mining, Road Traffic Management, Android Application, Data Mining, Mobile
- Research Article
- 10.51983/ajist-2019.9.s1.232
- Feb 5, 2019
- Asian Journal of Information Science and Technology
There are many subscribed and open source resources used by students, research scholars, and faculty members in higher educational institutions. The objective of this study is to determine digital resource usage preferences in Arts & Science college libraries of Tamil Nadu, particularly Islamic Management Arts & Science colleges. This article also examines the usage of e-books, e-journals (both subscribed and open source), library websites, and abstracting databases, and reports on the status of colleges subscribing to digital resources. Research methodology: a systematically designed questionnaire was distributed to the selected colleges, and the responses were collected for analysis. Quite a few interesting facts emerged. Findings: the data reveal that undergraduate and postgraduate users preferred open access resources and Google as their search engine for quick access, while research scholars found commercial resources helpful and recommended increasing them. Faculty members likewise recommended more commercial resources for their study, research, and teaching. Users suggested improving infrastructure facilities, ensuring a regular power supply, increasing the bandwidth of internet connections, and conducting seminars, workshops, and orientations to create awareness and increase the use of both categories of digital resources.
- Book Chapter
12
- 10.3233/978-1-58603-898-4-331
- Jan 1, 2008
Open Source Intelligence can be defined as the retrieval, extraction and analysis of information from publicly available sources. Each of these three processes is the subject of ongoing research resulting in specialised techniques. Today the largest source of open source information is the Internet. Most newspapers and news agencies have web sites with live updates on unfolding events, opinions and perspectives on world events. Most governments monitor news reports to feel the pulse of public opinion, and for early warning and current awareness of emerging crises. The phenomenal growth in knowledge, data and opinions published on the Internet requires advanced software tools which allow analysts to cope with the overflow of information. Malicious use of the Internet has also grown rapidly, particularly on-line fraud, illegal content, virtual stalking, and various scams, all of which create major challenges for security and law enforcement agencies. Use of the Internet by extremist and terrorist groups has also increased alarmingly. The Joint Research Centre has developed significant experience in Internet content monitoring through its work on media monitoring (EMM) for the European Commission. EMM forms the core of the Commission's daily press monitoring service, and has also been adopted by the European Council Situation Centre for their ODIN system. This paper reviews this growing area of research using EMM as an example.
- Research Article
8
- 10.1109/mc.2007.349
- Oct 1, 2007
- Computer
Exposing students to real-world projects will encourage them to share, learn, and improve. The BOHKNet project began in 1998 with the universities of Eindhoven and Hong Kong. Each BOHKNet team typically consists of 8 to 10 students in two to four locations. The team is assigned a software-related topic, such as TV on mobile phones or open source and software patents, which students approach from different geographical perspectives. The different groups produce a Web site for discussing their topics, and all the groups' Web sites are integrated into an electronic book. The technologies used in the course include videoconferencing, e-mail, and an off-the-shelf learning-management system that supports both chat and forums. The students can freely use additional tools to create the Web site. BOHKNet classroom instructors address specific software management issues to accelerate the experiential learning, including planning, work breakdown, information overload, and knowledge management.
- Research Article
- 10.21045/2071-5021-2024-70-s5-26
- Jan 1, 2024
- Social Aspects of Population Health
Significance. Zemstvo medicine was a truly unique phenomenon of rural life in Russia. The work of a zemstvo doctor involved not only diagnosis and treatment, but also developing measures to prevent diseases, disseminating sanitary and hygienic knowledge among the rural population, monitoring the sanitary condition of schools, and studying conditions of nutrition, water supply, and local industries; as a result, the work became a national form of public service. The best traditions of zemstvo medicine, and its problems, are important for health care organizers to study with the aim of improving rural health care and the "Zemstvo Doctor" program, which has been in effect since 2012. The purpose of the study: to summarize information from open literary sources about the activities of the zemstvo doctor at the turn of the 19th and 20th centuries. Materials and methods. The study was conducted using the scientific electronic library eLIBRARY; a search of literary sources for 2013-2023 was performed with the keywords zemstvo doctor, rural doctor. Data from 45 literary sources were analyzed, of which 30 were included in the list of references. Results. The study revealed the following: a lack of human resources, uneven funding of zemstvo doctors across districts, and the need for doctors to provide medicines and improve the premises for receiving patients at the expense of their own salaries.
At the same time, as zemstvo medicine developed in the Russian Empire it demonstrated its effectiveness, and following its example the tsarist government organized rural medical districts. A number of positive changes followed: an increase in the number of zemstvo doctors, and the opportunity to transfer experience in the zemstvo paramedic school with an increase in salary, to conduct scientific research and study in leading clinics, to enjoy some privileges of government service, to serve in the leadership of zemstvo elected bodies, and to finance measures to protect the health of the population. Conclusion. The enormous contribution of zemstvo doctors to rural areas, through their universal professional activity in providing medical care and their ability to implement preventive and anti-epidemic measures, allows us to speak of the civilizational influence of zemstvo medicine on the development of rural Russia at the turn of the 19th and 20th centuries. This influence was reflected in the works of famous writers, who summarized the most important qualities of a good doctor, qualities that remain significant for the modern generation. At present, the Zemstvo Doctor program, following this historical legacy, is not only a tool for attracting medical personnel to rural areas, but also a source for preserving and developing human capital. Keywords: zemstvo doctor, historical excursion
- Research Article
7
- 10.1017/eis.2024.61
- Jan 7, 2025
- European Journal of International Security
This article challenges the perception of Open-Source Intelligence (OSINT) as a revolutionary shift driven by the explosion of publicly accessible data. Instead, we argue that the rise of OSINT reflects an evolution of traditional intelligence practices: the collection, processing, analysis and dissemination of vast amounts of information. While the exponential growth of open-source data is reshaping the intelligence landscape, it is neither revolutionizing nor democratizing intelligence. Rather, it is prompting both state and non-state actors to explore how best to integrate OSINT practices and enhance digital literacy within their communities. Core OSINT challenges (information overload, reliability, and legal and ethical concerns) remain consistent with broader intelligence issues. Addressing these challenges provides a foundation for consolidating OSINT as a community of practice, and linking it to debates on the disputed role of security expertise in the public debate.
- Research Article
29
- 10.1016/j.cose.2023.103430
- Aug 19, 2023
- Computers & Security
Multi-level fine-tuning, data augmentation, and few-shot learning for specialized cyber threat intelligence
- Book Chapter
4
- 10.4018/978-1-4666-0294-6.ch010
- Jan 1, 2012
Correct and timely access to business information is the key to success in industry. However, industrial data is generated daily and grows exponentially, so managing it is a challenging task for every organization. To deal with this information overload, organizations urgently need to find and set up effective means of analyzing raw industrial data (i.e., text) and drawing necessary information from it. This information can become knowledge, and knowledge leads toward wisdom, the essence of every business. This chapter is concerned with the use of knowledge management systems to address the information overload problems organizations face today. As a solution, a detailed study of currently existing open source data and knowledge management systems is conducted. The chapter discusses state-of-the-art tools and technologies in this domain, and highlights the need for and importance of semantic applications for industrial data processing.
- Conference Article
7
- 10.1109/icsc.2012.38
- Sep 1, 2012
Web portals are a major class of web-based content management systems. They can provide users with a single point of access to a multitude of content sources and applications. However, further analysis of content brokered through a portal is not supported by current portal systems, leaving it to their users to deal with information overload. We present the first work examining the integration of natural language processing into web portals to provide users with semantic assistance in analyzing and interpreting content. This integration is based on the portal standard JSR286 and open source NLP frameworks. Two application scenarios, news analysis and biocuration, highlight the feasibility and usefulness of our approach.
- Research Article
- 10.24949/njes.v10i2.236
- Aug 13, 2018
- SHILAP Revista de lepidopterología
Geospatial information overload has become an issue in recent years. It is fuelled in part by the widespread availability of mobility data from a variety of sources, such as ubiquitous mobile computing devices, geographic positioning systems and traces from digital map interactions. The article describes a data analysis technique for extracting knowledge from mobility data. Data from mouse movements over digital maps were analysed for their spatio-temporal content to reveal user behaviour. Although the trajectories come from mouse movements in the Human-Computer Interaction domain, they can also serve as a proxy for physical trajectories in the real world. The article presents a methodology to reduce information overload and convert raw trajectory data into useful knowledge. This geographic knowledge discovery process was realised using Secondo, a highly specialised open source tool that allows developing specific spatio-temporal queries to analyse trajectories. The results indicate that Secondo can be intelligently exploited to identify specific movement patterns and behaviour, and ultimately to extract knowledge which can be used in personalised web maps, spatial recommender systems, event detection and crime monitoring tasks.
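The kind of spatio-temporal feature extraction described, turning raw (x, y, t) mouse samples into behavioural measures, can be sketched in a few lines. This is an illustrative reduction only; the article's actual analysis is performed with Secondo's spatio-temporal query language, not this code.

```python
import math

def trajectory_features(points):
    """Reduce a trajectory of (x, y, t) samples to two summary features:
    total path length and mean speed. Assumes points are time-ordered."""
    total_dist = 0.0
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        total_dist += math.hypot(x1 - x0, y1 - y0)  # Euclidean segment length
    duration = points[-1][2] - points[0][2]
    return {
        "path_length": total_dist,
        "mean_speed": total_dist / duration if duration else 0.0,
    }

# Toy mouse trace: two 5-unit segments over 2 seconds.
trace = [(0, 0, 0.0), (3, 4, 1.0), (6, 8, 2.0)]
features = trajectory_features(trace)
```

Features like these (plus dwell time, direction changes, and stop detection) are the raw material for the movement-pattern queries the article runs at scale.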
- Research Article
- 10.3233/sw-243685
- Aug 19, 2024
- Semantic Web
Knowledge generated during the scientific process is still mostly stored in the form of scholarly articles. This lack of machine-readability hampers efforts to find, query, and reuse such findings efficiently and contributes to today’s information overload. While attempts have been made to semantify journal articles, widespread adoption of such approaches is still a long way off. One way to demonstrate the usefulness of such approaches to the scientific community is by showcasing the use of freely available, open-access knowledge graphs such as Wikidata as sustainable storage and representation solutions. Here we present an example from the life sciences in which knowledge items from scholarly literature are represented in Wikidata, linked to their exact position in open-access articles. In this way, they become part of a rich knowledge graph while maintaining clear ties to their origins. As example entities, we chose small regulatory RNAs (sRNAs) that play an important role in bacterial and archaeal gene regulation. These post-transcriptional regulators can influence the activities of multiple genes in various manners, forming complex interaction networks. We stored the information on sRNA molecule interaction taken from open-access articles in Wikidata and built an intuitive web interface called InteractOA, which makes it easy to visualize, edit, and query information. The tool also links information on small RNAs to their reference articles from PubMed Central on the statement level. InteractOA encourages researchers to contribute, save, and curate their own similar findings. InteractOA is hosted at https://interactoa.zbmed.de and its code is available under a permissive open source licence. In principle, the approach presented here can be applied to any other field of research.
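Querying interaction statements stored in Wikidata, as InteractOA does, typically means issuing SPARQL against the public endpoint. The sketch below only constructs such a query; the property ID `P000` and the item ID are placeholders, since the abstract does not specify InteractOA's actual data model.

```python
def sparql_interactions(rna_qid: str, interaction_prop: str = "P000") -> str:
    """Build a SPARQL query listing interaction targets of an sRNA item.
    rna_qid is a Wikidata item ID (e.g., 'Q123456'); interaction_prop is a
    placeholder property ID standing in for whichever property models the
    sRNA-target interaction."""
    return f"""
SELECT ?target ?targetLabel WHERE {{
  wd:{rna_qid} wdt:{interaction_prop} ?target .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}""".strip()

# Hypothetical usage: the query string would be POSTed to
# https://query.wikidata.org/sparql with Accept: application/sparql-results+json
query = sparql_interactions("Q123456")
```

Because the statements remain ordinary Wikidata triples, any SPARQL client can reuse them; the web interface is a convenience layer, not a silo.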
- Research Article
1
- 10.11591/telkomnika.v12i3.4489
- Mar 1, 2014
- TELKOMNIKA Indonesian Journal of Electrical Engineering
This article describes the challenges faced by agricultural production and marketing in the era of big data, then builds an agricultural market information matching platform based on HADOOP and NUTCH combined with cloud computing technology, and finally details its layers and key technologies: the use of an open source search engine to crawl market information across the whole web to build agricultural production and market data sources, a user interest model, and the combination of a matching algorithm with the HADOOP environment. The aim of this paper is to make the agricultural market information matching platform better suited to the Chinese agricultural production and marketing system, in order to solve bottleneck problems such as information overload, lack of storage space, scalability, and the efficiency of analysis and calculation. As a result, this article provides a useful reference and a new strategy for the analysis and mining of big data in agricultural production and marketing.
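The pairing of a user interest model with a matching algorithm can be illustrated with a simple content-based sketch. The term weights and listings below are hypothetical, and cosine similarity is a common generic choice, not necessarily the matching algorithm the platform uses.

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors (dicts).
    Terms absent from a vector are treated as weight 0."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical interest model: term weights inferred from a farmer's past queries.
user_interest = {"wheat": 0.8, "price": 0.6, "fertilizer": 0.2}

# Hypothetical crawled market listings, each reduced to a term-weight vector.
listings = {
    "wheat price forecast": {"wheat": 1.0, "price": 1.0},
    "tractor maintenance": {"tractor": 1.0, "maintenance": 1.0},
}
best = max(listings, key=lambda name: cosine(user_interest, listings[name]))
```

At platform scale this scoring would run as a distributed job over the crawled corpus rather than an in-memory loop, which is where the HADOOP environment comes in.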
- Book Chapter
2
- 10.1007/978-981-13-1498-8_36
- Sep 2, 2018
Stories help to communicate information and interpret knowledge. Once data is collected, analyzed, cleansed, and transformed, the subsequent step is to extract potential value from it. Value is realized only when business-centric insights are discovered and translated into time-bound actionable outcomes. To maximize potential value, data should be decoded into a storytelling medium via visualization, which can be either static or dynamic. Big data visualization reveals stories from the data tsunami generated at an alarming speed in diversified formats; these stories convey the data's vital characteristics to users. Self-service visualization empowers users to uncover unique patterns, interesting facts, and relationships from the underlying data by building their own stories without in-depth technical knowledge, with possibly a little hand-holding by the IT department. In this survey paper, we first introduce big data storytelling with visualization and its related concepts, and then look through general approaches to visualization. Going deeper, we discuss truthful data visualization in self-service mode, representing a real view of the business. This paper also presents the challenges and available technological solutions, covering open source options for representing a real-time view of the story.
- Conference Article
12
- 10.1117/12.2277970
- Oct 5, 2017
The information available online and offline, from open as well as from private sources, is growing at an exponential rate and places an increasing demand on the limited resources of Law Enforcement Agencies (LEAs). The absence of appropriate tools and techniques to collect, process, and analyze the volumes of complex and heterogeneous data has created a severe information overload. If a solution is not found, the impact on law enforcement will be dramatic, e.g. because important evidence is missed or the investigation time is too long. Furthermore, there is an uneven level of capability to deal with the large volumes of complex and heterogeneous data that come from multiple open and private sources at national level across the EU, which hinders cooperation and information sharing. Consequently, there is a pertinent need to develop tools, systems and processes which expedite online investigations. In this paper, we describe a suite of analysis tools to identify and localize generic concepts, instances of objects, and logos in images, which constitute a significant portion of everyday law enforcement data. We describe how incremental learning based on only a few examples and large-scale indexing are addressed in both concept detection and instance search. Our search technology allows querying of the database by visual examples and by keywords. Our tools are packaged in a Docker container to guarantee easy deployment on a system, and they exploit possibilities provided by open source toolboxes, contributing to the technical autonomy of LEAs.