Towards Extracting Ethical Concerns-related Software Requirements from App Reviews
As mobile applications become increasingly integral to our daily lives, ethical concerns about them have grown sharply. Users share their experiences, report bugs, and request new features in app reviews, often highlighting safety, privacy, and accountability concerns. Machine learning techniques have previously been used to identify these ethical concerns. However, understanding the underlying reasons behind them and extracting requirements that could address them is crucial for developing safer software. Thus, we propose a novel approach that leverages a knowledge graph (KG) model to extract software requirements from app reviews, capturing contextual data related to ethical concerns. Our framework consists of three main components: developing an ontology with relevant entities and relations, extracting key entities from app reviews, and creating connections between them. This study analyzes app reviews of the Uber mobile application (a popular ride-hailing app) and presents preliminary results from the proposed solution. Initial results show that a KG can effectively capture contextual data related to software ethical concerns, the underlying reasons behind these concerns, and the corresponding potential requirements.
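The knowledge-graph idea above can be illustrated with a minimal triple store linking a review to a concern, its cause, and a candidate requirement. This is a hedged sketch: the entity names, relation names, and example strings are invented for illustration and are not the paper's actual ontology.

```python
# Minimal sketch of a knowledge graph as (subject, relation, object) triples,
# linking a review to an ethical concern, its underlying reason, and a
# candidate requirement. All names here are illustrative, not the paper's.

def add_triple(kg, subj, rel, obj):
    kg.setdefault(subj, []).append((rel, obj))

def neighbors(kg, subj, rel):
    """Return all objects reachable from subj via relation rel."""
    return [obj for r, obj in kg.get(subj, []) if r == rel]

kg = {}
add_triple(kg, "review_17", "mentions_concern", "privacy")
add_triple(kg, "privacy", "caused_by", "location tracking after ride ends")
add_triple(kg, "privacy", "suggests_requirement",
           "stop collecting location data when the trip is complete")

print(neighbors(kg, "privacy", "suggests_requirement"))
```

Traversing from a concern node to its `suggests_requirement` neighbors is one simple way such a graph can surface requirements together with the context that motivated them.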
- Research Article
- 10.7717/peerj-cs.874
- Mar 15, 2022
- PeerJ. Computer science
Opinion mining for app reviews aims to analyze people’s comments from app stores to support data-driven requirements engineering activities, such as bug report classification, new feature requests, and usage experience. However, due to the large amount of textual data, manually analyzing these comments is challenging, and machine-learning-based methods have been used to automate opinion mining. Although recent methods have obtained promising results for extracting and categorizing requirements from users’ opinions, the main focus of existing studies is to help software engineers explore historical user behavior regarding software requirements. Thus, existing models are used to support corrective maintenance from app reviews, while we argue that this valuable user knowledge can be used for preventive software maintenance. This paper introduces the temporal dynamics of requirements analysis to answer the following question: how can we predict initial trends in defective requirements from users’ opinions before they negatively impact the app’s overall evaluation? We present the MAPP-Reviews (Monitoring App Reviews) method, which (i) extracts requirements with negative evaluation from app reviews, (ii) generates time series based on the frequency of negative evaluation, and (iii) trains predictive models to identify requirements with rising trends of negative evaluation. The experimental results from approximately 85,000 reviews show that opinions extracted from user reviews provide information about the future behavior of an app requirement, thereby allowing software engineers to anticipate the identification of requirements that may affect the app’s future ratings.
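Steps (ii) and (iii) above can be sketched in miniature: build a weekly time series of negative-opinion counts for one requirement and flag a rising trend via a least-squares slope. This is a stand-in for the paper's predictive models, and the feature name and counts are invented for illustration.

```python
# Sketch: weekly counts of negatively evaluated mentions of a requirement
# form a time series; a positive least-squares slope signals a worsening
# trend. A toy stand-in for MAPP-Reviews' trained predictive models.

def trend_slope(counts):
    """Least-squares slope of counts over time steps 0..n-1."""
    n = len(counts)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Weekly frequency of negative opinions about a hypothetical "login" feature.
weekly_negatives = [2, 3, 5, 8, 13]
slope = trend_slope(weekly_negatives)
print(f"trend slope: {slope:.2f}")  # positive slope -> worsening requirement
```

Ranking requirements by slope would let engineers triage the ones most likely to drag down future ratings, which is the preventive-maintenance angle the paper argues for.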
- Conference Article
- 10.1109/isriti54043.2021.9702793
- Dec 16, 2021
Documentation and fulfillment of software requirements are important aspects in measuring the success of a software development team. In requirements engineering, there are two types of requirements: functional requirements (FR) and non-functional requirements (NFR). Nowadays, requirements may also be found in app reviews, so this study was conducted to classify non-functional requirements collected from app reviews. We classify keywords into two categories, project specific (PS) and non-project specific (NPS), and propose an automatic method to extract them from app reviews and app descriptions. We then classify app reviews, augmented with the extracted keywords, into several categories of NFRs using a convolutional neural network (CNN) with word2vec vectorization. Our proposed method managed to extract several keywords and improve the performance of the classification algorithm used, achieving an average accuracy of 80%, precision of 71%, and recall of 63%. The results show that our proposed method performed better than a basic CNN and the other classification algorithms tested.
- Research Article
- 10.12962/j24068535.v24i1.a1333
- Jan 15, 2026
- JUTI: Jurnal Ilmiah Teknologi Informasi
User-generated reviews on mobile applications represent a valuable yet ambiguous resource for classifying software requirements, particularly when multiple aspects—such as bugs, feature requests, and user experiences—are embedded within a single review. Although prior studies have shown the potential of transformer-based and multi-label models in improving text classification accuracy and efficiency, explicit handling of semantic ambiguity in multi-aspect reviews has not been addressed. This study proposes a multi-label classification approach using BERT-based transfer learning to manage ambiguity in app reviews. Each review is manually annotated with one or more relevant requirement categories. Preprocessing involves text cleaning, normalization, and BERT tokenization to convert reviews into structured representations. The classification model categorizes reviews into four classes: bug reports, feature requests, user experiences, and ratings. Evaluation results demonstrate strong performance, with F1-scores of 0.96 for bug reports, 0.95 for feature requests, 0.97 for ratings, and 0.80 for user experiences, confirming the model’s capability in capturing overlapping labels in ambiguous reviews. This approach offers a scalable and automated solution for extracting software requirements, enabling developers to better identify, categorize, and prioritize user needs from unstructured review data.
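The core multi-label idea above is that each review gets one independent yes/no decision per category, so a single ambiguous review can carry several labels at once. The sketch below illustrates only that decision structure with a toy keyword scorer standing in for the BERT encoder; the keyword lists and example review are invented, not from the paper.

```python
# Toy sketch of multi-label review classification: one independent
# yes/no decision per requirement category, so one review can receive
# several labels. A keyword scorer stands in for the paper's BERT model.

KEYWORDS = {
    "bug report":      {"crash", "crashes", "error", "freezes", "broken"},
    "feature request": {"add", "wish", "please", "missing"},
    "user experience": {"love", "easy", "confusing", "slow"},
    "rating":          {"stars", "great", "terrible"},
}

def classify_multilabel(review):
    """Return every category whose keyword set intersects the review."""
    tokens = set(review.lower().split())
    return sorted(label for label, kws in KEYWORDS.items() if tokens & kws)

print(classify_multilabel("please add dark mode and fix the crash"))
```

The example review matches both "feature request" and "bug report", which is exactly the overlapping-label situation a single-label classifier cannot express.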
- Research Article
- 10.7717/peerj-cs.2401
- Nov 5, 2024
- PeerJ. Computer science
Mobile app reviews are valuable for gaining user feedback on features, usability, and areas for improvement. Analyzing these reviews manually is difficult due to volume and structure, leading to the need for automated techniques. This mapping study categorizes existing approaches for automated and semi-automated tools by analyzing 180 primary studies. Techniques include topic modeling, collocation finding, association rule-based, aspect-based sentiment analysis, frequency-based, word vector-based, and hybrid approaches. The study compares various tools for analyzing mobile app reviews based on performance, scalability, and user-friendliness. Tools like KEFE, MERIT, DIVER, SAFER, SIRA, T-FEX, RE-BERT, and AOBTM outperformed baseline tools like IDEA and SAFE in identifying emerging issues and extracting relevant information. The study also discusses limitations such as manual intervention, linguistic complexities, scalability issues, and interpretability challenges in incorporating user feedback. Overall, this mapping study outlines the current state of feature extraction from app reviews, suggesting future research and innovation opportunities for extracting software requirements from mobile app reviews, thereby improving mobile app development.
- Research Article
- 10.1007/s10462-023-10667-1
- Feb 15, 2024
- Artificial Intelligence Review
Requirements analysis is an essential sub-field of requirements engineering (RE). Over the last decade, numerous automatic techniques have been widely exploited in requirements analysis. In this context, requirements identification and classification are challenging for the RE community, especially for large corpora and app reviews. As a consequence, several Artificial Intelligence (AI) techniques, such as machine learning (ML), deep learning (DL), and transfer learning (TL), have been proposed to reduce the manual effort of requirements engineers. Although these approaches have reported more promising results than traditional automated techniques, knowledge of their applicability and actual use in real-life settings is still incomplete. The main objective of this paper is to systematically investigate and better understand the role of AI techniques in the identification and classification of software requirements. This study conducted a systematic literature review (SLR) and collected the primary studies on the use of AI techniques in requirements classification. (1) The study found 60 published studies that adopted automated techniques for requirements classification. The reported results indicate that transfer-learning-based approaches are extensively used in classification, yield the most accurate results, and outperform the other ML and DL techniques. (2) The data extraction process of the SLR indicates that Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) are widely used in the selected studies. (3) Precision and recall are the most commonly used metrics for evaluating the performance of automated techniques. This paper reveals that, while these AI approaches report promising classification results, their applicability in complex, real-world settings has not yet been demonstrated.
This SLR calls for a close alliance between RE and AI techniques to handle the open issues confronted in the development of real-world automated systems.
- Conference Article
- 10.1109/re63999.2025.00048
- Sep 1, 2025
Mobile app reviews are a large-scale data source for software improvements. A key task in this context is effectively extracting requirements from app reviews to analyze the users’ needs and support the software’s evolution. Recent studies show that existing methods fail at this task since app reviews usually contain informal language, grammatical and spelling errors, and a large amount of irrelevant information that might not have direct practical value for developers. To address this, we propose a novel reformulation of requirements extraction as a Named Entity Recognition (NER) task based on the sequence-to-sequence (Seq2seq) generation approach. With this aim, we propose a Seq2seq framework, incorporating a BiLSTM encoder and an LSTM decoder, enhanced with a self-attention mechanism, GloVe embeddings, and a CRF model. We evaluated our framework on two datasets: a manually annotated set of 1,000 reviews (Dataset 1) and a crowdsourced set of 23,816 reviews (Dataset 2). The quantitative evaluation of our framework showed that it outperformed existing state-of-the-art methods with an F1 score of 0.96 on Dataset 2, and achieved comparable performance on Dataset 1 with an F1 score of 0.47.
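Framing requirements extraction as NER means the model tags each token, and spans are then decoded from the tags. The sketch below shows only that decoding step under a standard BIO scheme (B-REQ begins a requirement span, I-REQ continues it, O is outside); the tag names and example are illustrative, and in the paper the tags would come from the Seq2seq model rather than being hand-written.

```python
# Sketch of the NER framing: given BIO tags over review tokens,
# recover the requirement phrases by scanning for B-REQ/I-REQ runs.

def decode_bio(tokens, tags):
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-REQ":          # a new requirement span begins
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-REQ" and current:  # continue the open span
            current.append(token)
        else:                        # O tag (or stray I-REQ) closes any span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["the", "app", "needs", "offline", "mode", "and", "dark", "theme"]
tags   = ["O",   "O",   "O",     "B-REQ",  "I-REQ", "O",  "B-REQ", "I-REQ"]
print(decode_bio(tokens, tags))  # ['offline mode', 'dark theme']
```

Decoding is the easy half; the paper's contribution is the BiLSTM/LSTM Seq2seq model with self-attention, GloVe embeddings, and a CRF that produces high-quality tags in the first place.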
- Book Chapter
- 10.1007/978-3-319-49094-6_39
- Jan 1, 2016
The Kano model is a frequently used method to classify user preferences according to their importance and thereby support requirements prioritization. To apply the Kano model, a representative set of users must answer a functional and a dysfunctional question for each feature under evaluation. Unfortunately, finding and interviewing users is difficult and time-consuming. Thus, the core idea of our proposed approach is to automatically extract opinions about product features from open online sources (e.g., Q&A sites, app reviews, etc.) and feed them into the Kano questionnaire to prioritize software requirements following the principles of the Kano model. One problem with this approach is how to turn input extracted from the internet into paired answers to the functional and dysfunctional questions: the reviews and comments from online sources that we plan to transform into answers to either question are usually unpaired. Therefore, the aim of this study is to find a method that produces results resembling those of the traditional Kano model even though we retrieve only partial information. We propose two Kano-like models, the Half-Kano and the Deformed-Kano model, for unpaired answers to functional and dysfunctional questions. To analyze the performance of the two proposed models compared to the traditional Kano model, we run several simulations with synthetic data and compare the results to see which Kano-like model behaves most like the traditional one. The simulation results show that, on average, both the Half-Kano and the Deformed-Kano model generate feature categorizations similar to those of the traditional Kano model. However, only the Deformed-Kano model generates the same range of categorizations as the traditional Kano model.
The Deformed-Kano model can thus be used as an approximation of the traditional Kano model when the input is unpaired or partly missing.
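For context, the traditional Kano categorization that the Half- and Deformed-Kano models approximate is a table lookup over the paired answers. The sketch below encodes the standard Kano evaluation table; it shows the baseline the paper's unpaired variants are measured against, not the Half- or Deformed-Kano models themselves.

```python
# Sketch of the traditional (paired-answer) Kano categorization: each
# (functional, dysfunctional) answer pair maps to a category via the
# standard Kano evaluation table. A = Attractive, O = One-dimensional,
# M = Must-be, I = Indifferent, R = Reverse, Q = Questionable.

ANSWERS = ["like", "must-be", "neutral", "live-with", "dislike"]
TABLE = [  # rows: functional answer, columns: dysfunctional answer
    ["Q", "A", "A", "A", "O"],
    ["R", "I", "I", "I", "M"],
    ["R", "I", "I", "I", "M"],
    ["R", "I", "I", "I", "M"],
    ["R", "R", "R", "R", "Q"],
]

def kano_category(functional, dysfunctional):
    return TABLE[ANSWERS.index(functional)][ANSWERS.index(dysfunctional)]

print(kano_category("like", "dislike"))     # O: one-dimensional feature
print(kano_category("neutral", "dislike"))  # M: must-be feature
```

The unpaired setting studied in the paper is precisely the case where one of the two lookup coordinates is missing, so a direct table lookup is impossible.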
- Conference Article
- 10.5753/eniac.2020.12144
- Oct 20, 2020
Popular mobile applications receive millions of user reviews. These reviews contain relevant information, such as problem reports and improvement suggestions. This information is a valuable knowledge source for software requirements engineering, since analyzing review feedback helps make strategic decisions to improve app quality. However, due to the large volume of texts, manual extraction of the relevant information is impracticable. In this paper, we investigate and compare textual representation models for app review classification. We discuss different aspects of and approaches to review representation, ranging from classic Bag-of-Words models to the most recent state-of-the-art pre-trained neural language models. Our findings show that the classic Bag-of-Words model, combined with a careful analysis of text pre-processing techniques, is still competitive. However, pre-trained neural language models proved more advantageous, since they obtain good classification performance, provide significant dimensionality reduction, and deal more adequately with semantic proximity between review texts, especially the multilingual neural language models.
- Video Transcripts
- 10.48448/pv64-2974
- Oct 15, 2020
- Underline Science Inc.
Popular mobile applications receive millions of user reviews. These reviews contain relevant information, such as problem reports and improvement suggestions. This information is a valuable knowledge source for software requirements engineering, since analyzing review feedback helps make strategic decisions to improve app quality. However, due to the large volume of texts, manual extraction of the relevant information is impracticable. In this paper, we investigate and compare textual representation models for app review classification. We discuss different aspects of and approaches to review representation, ranging from classic Bag-of-Words models to the most recent state-of-the-art pre-trained neural language models. Our findings show that the classic Bag-of-Words model, combined with a careful analysis of text pre-processing techniques, is still competitive. However, pre-trained neural language models proved more advantageous, since they obtain good classification performance, provide significant dimensionality reduction, and deal more adequately with semantic proximity between review texts, especially the multilingual neural language models.
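The Bag-of-Words baseline compared above can be sketched in a few lines: each review becomes a vector of term counts over a shared vocabulary. The example reviews are invented for illustration; a real pipeline would add the pre-processing (normalization, stop-word handling, etc.) the paper stresses.

```python
# Minimal sketch of the classic Bag-of-Words representation: each review
# is mapped to a vector of term counts over the corpus vocabulary.

def build_vocab(reviews):
    """Sorted vocabulary of all whitespace tokens in the corpus."""
    return sorted({tok for r in reviews for tok in r.lower().split()})

def bow_vector(review, vocab):
    tokens = review.lower().split()
    return [tokens.count(term) for term in vocab]

reviews = ["app crashes often", "great app"]
vocab = build_vocab(reviews)
print(vocab)                                 # ['app', 'crashes', 'great', 'often']
print(bow_vector("great great app", vocab))  # [1, 0, 2, 0]
```

The trade-off the paper measures is visible even here: the vector length grows with the vocabulary, whereas a neural language model produces a fixed-size dense embedding.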
- Conference Article
- 10.1145/3412841.3442006
- Mar 22, 2021
Traditionally, developers restricted themselves to collecting opinions from a small group of users through techniques such as interviews, questionnaires, and meetings. With the popularization of social media and mobile applications, these professionals now have to deal with the opinions of crowd users, who want a voice in the software's evolution. In this context, one of the main related tasks is the automatic identification of software requirements from app reviews. Recent studies show that existing methods fail at this task, since review texts usually contain informal language, grammatical and spelling errors, and irrelevant information with no practical value for developers that is difficult to filter out. In this paper, we present RE-BERT (Requirements Engineering using Bidirectional Encoder Representations from Transformers). Our method innovates by using pre-trained neural language models to generate semantic textual representations with contextual word embeddings. RE-BERT fine-tunes the BERT model with a focus on the local context of the software requirement tokens. A statistical analysis of the experimental results involving eight different apps showed that RE-BERT outperforms existing state-of-the-art methods.
- Conference Article
- 10.1109/iemtronics55184.2022.9795770
- Jun 1, 2022
Mobile app developers are always looking for ways to use the reviews provided by their app’s users to improve their application (e.g., adding a new functionality that a user mentioned in a review). Usually, thousands of user reviews are available for each mobile app, and isolating software requirements manually from such a big dataset can be difficult and time-consuming. The primary objective of the current research is to automate the process of extracting functional requirements and filtering out non-requirements from user app reviews, helping app developers better meet the wants and needs of their users. This paper proposes and evaluates machine-learning-based models to identify and classify software requirements from both formal Software Requirements Specification (SRS) documents and mobile app reviews written by users, using machine learning (ML) algorithms combined with natural language processing (NLP) techniques. Initial evaluation of our ML-based models shows that they can help classify user app reviews and software requirements as Functional Requirements (FR), Non-Functional Requirements (NFR), or Non-Requirements (NR).
- Conference Article
- 10.1109/snams.2019.8931820
- Oct 1, 2019
In one year, more than 6.5 million mobile applications were listed for download on the application stores; that is, they are used by millions (or billions) of users across the world. Users express their daily experience with applications as reviews on those stores. This experience may include reporting bugs, demanding new features, posting feedback about performance, reporting security issues, demanding user interface enhancements, and other needs. Interestingly, reviews can contain valuable information of interest to application vendors and developers. However, the volume of such data is so huge that traditional search algorithms may not be efficient at extracting the useful information. Machine learning and data mining techniques are among the most popular approaches for efficiently extracting significant information for software requirements engineering, a key phase in the software engineering life cycle. In this paper, we apply machine learning algorithms and natural language processing techniques to classify a set of reviews about healthcare-domain applications into multiple categories, such as bug reports, new feature requests, application performance, and user interface. For this purpose, we extracted more than 7,500 reviews of ten different health-related mobile applications. More importantly, those reviews were annotated manually by software experts. In our experiments, we use the Weka tool with different machine learning algorithms. We also show which algorithms and features perform better, in terms of accuracy under different evaluation metrics, when classifying reviews about mobile apps into various classes: bugs, new features, sentiment, general bugs, usability, security, and performance. Moreover, the conducted experiments show that the overall performance improves when we use the data subset with highly confident labeling, i.e., when the two experts agree on the same class.
For the imbalanced-data problem, this research also shows the effect of applying resampling techniques on improving classification accuracy.
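The resampling remedy mentioned for the imbalanced-data problem can be sketched as simple random oversampling: duplicate minority-class examples until every class matches the majority class size. This is a generic stand-in written for illustration, not Weka's actual resampling filter, and the example data is invented.

```python
import random

# Sketch of random oversampling for imbalanced review classes: minority
# classes are padded with duplicated examples until all classes are the
# same size. A generic stand-in for Weka's resampling filters.

def oversample(dataset):
    """dataset: list of (review_text, label) pairs; returns a balanced copy."""
    by_label = {}
    for text, label in dataset:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

data = [("crashes on login", "bug")] * 4 + [("add dark mode", "feature")]
balanced = oversample(data)
print(len(balanced))  # 8: both classes now have 4 examples
```

Oversampling keeps all majority-class data at the cost of repeated minority examples; undersampling is the mirror-image choice, and which helps more is exactly the kind of effect the paper's experiments measure.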