An interpretable product demand forecasting framework incorporating domain knowledge: a case study of substitution effects in community group-buying

Abstract

With the rapid rise of social e-commerce, an increasing number of consumers now purchase goods through social community networks. However, substitution effects arising from limited inventories and stockouts can seriously impair demand forecasting accuracy. To address this challenge, we propose XAI-Sub, an interpretable demand forecasting framework that systematically incorporates domain knowledge via text mining and biclustering to improve both transparency and predictive performance. By aggregating sparse sales records into substitution-aware clusters and estimating scenario-specific substitution matrices using a novel alternating minimisation algorithm, XAI-Sub explicitly quantifies substitutive relationships while preserving interpretability. Validated on data from a large community group-buying (CGB) platform, our framework achieves a relative improvement of 38% in forecasting accuracy compared with conventional methods. The approach introduces three key advancements in explainable artificial intelligence (XAI): (1) domain-knowledge anchoring: text-derived semantic features anchor substitution patterns within business logic, facilitating human-AI alignment; (2) scenario-driven interpretability: biclustering decomposes demand dynamics into actionable substitution typologies; and (3) causal pathway visualisation: the substitution matrix serves as an interpretable interface, delineating demand redistribution pathways across clusters. This work demonstrates how formalising domain knowledge can bridge the ‘explanation gap’ in complex demand systems and provides CGB enterprises with practical tools to audit and exploit substitution mechanisms.
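The abstract does not give the model equations, so the following is only a minimal sketch of how a substitution matrix might be read as a "demand redistribution" map. It assumes that `S[i, j]` is the share of unmet demand for cluster `i` that shifts to cluster `j` during a stockout; the function name, the variables, and all numbers are hypothetical. It does not implement the alternating minimisation estimator.

```python
import numpy as np

def redistribute_demand(base_demand, in_stock, S):
    """Expected sales per cluster after one round of substitution.

    base_demand: shape (K,), demand each cluster would see with full stock.
    in_stock:    shape (K,), boolean availability flags.
    S:           shape (K, K), S[i, j] = assumed share of cluster i's unmet
                 demand that shifts to cluster j (each row sums to at most 1).
    """
    base_demand = np.asarray(base_demand, dtype=float)
    in_stock = np.asarray(in_stock, dtype=bool)
    S = np.asarray(S, dtype=float)
    # Demand that hits a stockout and therefore has to go somewhere else.
    unmet = np.where(in_stock, 0.0, base_demand)
    # Inflow into each cluster from all stocked-out clusters.
    spill = unmet @ S
    # Only in-stock clusters can absorb demand; stocked-out clusters sell nothing.
    return np.where(in_stock, base_demand + spill, 0.0)

# Hypothetical three-cluster example: cluster 0 is out of stock.
S = np.array([[0.0, 0.6, 0.1],
              [0.5, 0.0, 0.2],
              [0.1, 0.3, 0.0]])
sales = redistribute_demand([100.0, 80.0, 50.0], [False, True, True], S)
print(sales)
```

In this toy case, 60 units of the stocked-out cluster's demand move to cluster 1, 10 units move to cluster 2, and the remaining 30 are lost. Each row of `S` can be read directly as one such pathway.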

Similar Papers
  • Conference Article
  • Cite Count Icon 8
  • 10.1145/1562090.1562096
Protein sequence alignment and structural disorder
  • Jun 28, 2009
  • Uros Midic + 2 more

In protein sequence alignment algorithms, a substitution matrix of 20x20 alignment parameters is used to describe the rates of amino acid substitutions over time. Development and evaluation of most substitution matrices including the BLOSUM family [1] was based almost entirely on fully structured proteins. Structurally disordered proteins (i.e. proteins that lack structure, either in part or as a whole) that have been shown to be very common in nature [2] have a significantly different amino acid composition than ordered (i.e. structured) proteins [3]. Furthermore, the sequence evolution rate is higher in unstructured as compared to structured regions of proteins containing both structured and unstructured regions [4]. These results cast doubt on the appropriateness of the BLOSUM substitution matrices for alignment of structurally disordered proteins [5]. To address this problem, we take into account the concept of structural disorder by extending the alphabet for sequence representation from 20 to 2x20=40 symbols, 20 for amino acids in disordered regions and 20 for amino acids in ordered regions. A 40x40 substitution matrix is required for alignment of sequences represented in the extended alphabet. Such an expanded matrix contains 20x20 submatrices that correspond to matching ordered-ordered, ordered-disordered, and disordered-disordered pairs of residues. In this paper we describe an iterative procedure that we used to estimate such a 40x40 substitution matrix. The iterative procedure converged with stable results with respect to the choice of the sequences in the dataset. In the obtained 40x40 matrix we found substantial differences between the 20x20 submatrices corresponding to ordered-ordered, ordered-disordered, and disordered-disordered region matching. These differences provide evidence that for alignment of protein sequences that contain disordered segments, the discovered substitution matrix is more appropriate than the BLOSUM substitution matrices.
At the same time, the new substitution matrix is applicable for sequence alignment of fully ordered proteins as its order-order submatrix is very similar to a BLOSUM matrix.
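The block structure described above can be sketched as follows. This is an illustration only: the index layout (ordered residues first, disordered residues second), the helper names, and the placeholder scores are assumptions, not the values or procedure from the paper.

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues

def extended_index(residue, disordered):
    """Map (residue, disorder flag) to an index in the 40-symbol alphabet.

    Indices 0-19 are ordered residues and 20-39 are disordered residues;
    this ordering is an assumption of the sketch.
    """
    return AMINO_ACIDS.index(residue) + (20 if disordered else 0)

def assemble_40x40(oo, od, dd):
    """Build a symmetric 40x40 matrix from three 20x20 blocks.

    oo: ordered-ordered scores, od: ordered-disordered scores,
    dd: disordered-disordered scores. oo and dd are expected to be symmetric.
    """
    return np.block([[oo, od], [od.T, dd]])

def score(matrix, res_a, dis_a, res_b, dis_b):
    """Look up the score for aligning two residues with disorder labels."""
    return matrix[extended_index(res_a, dis_a), extended_index(res_b, dis_b)]

# Placeholder blocks (not real scores): identity-like matrices with different weights.
oo = 4 * np.eye(20, dtype=int)
od = 2 * np.eye(20, dtype=int)
dd = 3 * np.eye(20, dtype=int)
M = assemble_40x40(oo, od, dd)

print(M.shape)
print(score(M, "A", False, "A", True))
```

Keeping the three blocks separate makes it easy to compare, for example, the ordered-ordered block against a BLOSUM matrix, as the abstract describes.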

  • Research Article
  • Cite Count Icon 6
  • 10.1089/cmb.2007.0155
Sequence Alignment with an Appropriate Substitution Matrix
  • Mar 1, 2008
  • Journal of Computational Biology
  • Xiaoqiu Huang

A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.
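The selection rule described above (pick the matrix that maximises the local alignment score at fixed gap penalties) can be sketched as follows. This is not the SimDist implementation: it uses a linear gap penalty for brevity, and the three candidate "matrices" are toy match/mismatch scorers with hypothetical distance labels. It also assumes the candidates are on a comparable scale, which real matrix series have to ensure.

```python
def local_alignment_score(a, b, score, gap):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            cell = max(
                0,                           # start a new local alignment
                prev[j - 1] + score(x, y),   # align x with y
                prev[j] - gap,               # gap in b
                curr[j - 1] - gap,           # gap in a
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

def make_scorer(match, mismatch):
    """Toy stand-in for a substitution matrix: one match and one mismatch score."""
    return lambda x, y: match if x == y else mismatch

# Hypothetical candidates labelled by a made-up evolutionary distance.
CANDIDATES = {
    "close": make_scorer(5, -4),
    "medium": make_scorer(4, -2),
    "far": make_scorer(3, -1),
}

def select_matrix(a, b, gap=4):
    """Return the candidate with the highest local score, plus all scores."""
    scores = {name: local_alignment_score(a, b, s, gap)
              for name, s in CANDIDATES.items()}
    return max(scores, key=scores.get), scores

best, scores = select_matrix("HEAGAWGHEE", "HEAGAWGHEE")
print(best, scores)
```

For two identical sequences, the candidate with the largest match score wins, as expected for closely related sequences. Under the approach in the abstract, the distance label of the winning matrix would then be reported as the distance of the alignment.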

  • Book Chapter
  • Cite Count Icon 3
  • 10.1007/978-3-642-11169-3_19
Substitution Matrices and Mutual Information Approaches to Modeling Evolution
  • Jan 1, 2009
  • Stephan Kitchovitch + 3 more

Substitution matrices are at the heart of Bioinformatics: sequence alignment, database search, phylogenetic inference, protein family classification are all based on BLOSUM, PAM, JTT, mtREV24 and other matrices. These matrices provide means of computing models of evolution and assessing the statistical relationships amongst sequences. This paper reports two results; first we show how Bayesian and grid settings can be used to derive novel specific substitution matrices for fish and insects and we discuss their performances with respect to standard amino acid replacement matrices. Then we discuss a novel application of these matrices: a refinement of the mutual information formula applied to amino acid alignments by incorporating a substitution matrix into the calculation of the mutual information. We show that different substitution matrices provide qualitatively different mutual information results and that the new algorithm allows the derivation of better estimates of the similarity along a sequence alignment. We thus express an interesting procedure: generating ad hoc substitution matrices from a collection of sequences and combining the substitution matrices and mutual information for the detection of sequence patterns.

  • Research Article
  • Cite Count Icon 28
  • 10.1093/bioinformatics/17.8.686
Amino acid similarity matrices based on force fields.
  • Aug 1, 2001
  • Bioinformatics
  • Zsuzsanna Dosztányi + 1 more

We propose a general method for deriving amino acid substitution matrices from low resolution force fields. Unlike current popular methods, the approach does not rely on evolutionary arguments or alignment of sequences or structures. Instead, residues are computationally mutated and their contribution to the total energy/score is collected. The average of these values over each position within a set of proteins results in a substitution matrix. Example substitution matrices have been calculated from force fields based on different philosophies and their performance compared with conventional substitution matrices. Although this can produce useful substitution matrices, the methodology highlights the virtues, deficiencies and biases of the source force fields. It also allows a rather direct comparison of sequence alignment methods with the score functions underlying protein sequence to structure threading. Example substitution matrices are available from http://www.rsc.anu.edu.au/~zsuzsa/suppl/matrices.html. The list of proteins used for data collection and the optimized parameters for the alignment are given as supplementary material at http://www.rsc.anu.edu.au/~zsuzsa/suppl/matrices.html.

  • Research Article
  • Cite Count Icon 28
  • 10.1186/s12859-017-1703-z
PFASUM: a substitution matrix from Pfam structural alignments
  • Jun 5, 2017
  • BMC Bioinformatics
  • Frank Keul + 3 more

Background: Detecting homologous protein sequences and computing multiple sequence alignments (MSA) are fundamental tasks in molecular bioinformatics. These tasks usually require a substitution matrix for modeling evolutionary substitution events derived from a set of aligned sequences. Over the last years, the known sequence space increased drastically and several publications demonstrated that this can lead to significantly better performing matrices. Interestingly, matrices based on dated sequence datasets are still the de facto standard for both tasks even though their data basis may limit their capabilities. We address these aspects by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build upon expert ground truth data covering a large and diverse sequence space.
Results: We show results for two use cases: First, we tested the homology search performance of PFASUM matrices on up-to-date ASTRAL databases with varying sequence similarity. Our study shows that the usage of PFASUM matrices can lead to significantly better homology search results when compared to conventional matrices. PFASUM matrices with comparable relative entropies to the commonly used substitution matrices BLOSUM50, BLOSUM62, PAM250, VTML160 and VTML200 outperformed their corresponding counterparts in 93% of all test cases. A general assessment also comparing matrices with different relative entropies showed that PFASUM matrices delivered the best homology search performance in the test set. Second, our results demonstrate that the usage of PFASUM matrices for MSA construction improves their quality when compared to conventional matrices. On up-to-date MSA benchmarks, at least 60% of all MSAs were reconstructed in an equal or higher quality when using MUSCLE with PFASUM31, PFASUM43 and PFASUM60 matrices instead of conventional matrices. This rate even increases to at least 76% for MSAs containing similar sequences.
Conclusions: We present the novel PFASUM substitution matrices derived from manually curated MSA ground truth data covering the currently known sequence space. Our results imply that PFASUM matrices improve homology search performance as well as MSA quality in many cases when compared to conventional substitution matrices. Hence, we encourage the usage of PFASUM matrices and especially PFASUM60 for these specific tasks.

  • Research Article
  • Cite Count Icon 16
  • 10.1186/1471-2105-11-4
Amino acid "little Big Bang": representing amino acid substitution matrices as dot products of Euclidian vectors.
  • Jan 4, 2010
  • BMC Bioinformatics
  • Karel Zimmermann + 1 more

Background: Sequence comparisons make use of a one-letter representation for amino acids, the necessary quantitative information being supplied by the substitution matrices. This paper deals with the problem of finding a representation that provides a comprehensive description of amino acid intrinsic properties consistent with the substitution matrices.
Results: We present a Euclidian vector representation of the amino acids, obtained by the singular value decomposition of the substitution matrices. The substitution matrix entries correspond to the dot product of amino acid vectors. We apply this vector encoding to the study of the relative importance of various amino acid physicochemical properties upon the substitution matrices. We also characterize and compare the PAM and BLOSUM series substitution matrices.
Conclusions: This vector encoding introduces a Euclidian metric in the amino acid space, consistent with substitution matrices. Such a numerical description of the amino acid is useful when intrinsic properties of amino acids are necessary, for instance, building sequence profiles or finding consensus sequences, using machine learning algorithms such as Support Vector Machine and Neural Networks algorithms.
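The idea that matrix entries can be written as dot products of per-residue vectors can be sketched as follows. The sketch assumes a symmetric positive semidefinite matrix, for which the singular value decomposition and the eigendecomposition coincide. The 3x3 matrix is a toy example, not a real substitution matrix, and how the paper treats matrices that are not positive semidefinite is not stated in the abstract.

```python
import numpy as np

def vectors_from_matrix(M, tol=1e-9):
    """Return row vectors X such that X @ X.T reproduces M.

    Requires M to be symmetric positive semidefinite (up to `tol`).
    """
    M = np.asarray(M, dtype=float)
    if not np.allclose(M, M.T):
        raise ValueError("matrix must be symmetric")
    eigvals, eigvecs = np.linalg.eigh(M)  # eigendecomposition for symmetric input
    if eigvals.min() < -tol:
        raise ValueError("negative eigenvalues: plain Euclidean vectors cannot reproduce M")
    # Scale each eigenvector (column) by the square root of its eigenvalue.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Toy symmetric, diagonally dominant matrix over a made-up 3-letter alphabet.
M = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
X = vectors_from_matrix(M)

# Row i of X is the vector for symbol i; dot products recover the entries of M.
print(np.round(X @ X.T, 6))
```

Once each symbol has a vector, Euclidean distances between rows of `X` give a metric on the alphabet that is consistent with the matrix, which is the property the abstract highlights.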

  • Research Article
  • Cite Count Icon 2
  • 10.1016/j.immuno.2025.100051
Comparison of different substitution matrices for distance based T-cell receptor epitope predictions using tcrdist3
  • Sep 1, 2025
  • ImmunoInformatics
  • Marc Hoffstedt + 2 more

Various methods, differing in complexity, have been developed to predict T-cell receptor epitopes. tcrdist3, which implements an easy-to-interpret distance-based approach, has demonstrated performance comparable to the best feature-based methods. Here, a new substitution matrix for tcrdist3 is proposed and its performance is compared to various other substitution matrices. Small performance gains were possible; however, tcrdist3 was found to perform reliably well with most substitution matrices. Randomly generated substitution matrices were used as a baseline and resulted in good classification results. It was observed that the prediction quality was negatively correlated with the relative standard deviation of the matrix used (i.e. a larger variance of the weights resulted in poorer predictivity). The most important factor of the tcrdist3-distance between two sequences that could be singled out is the number of substitutions. tcrdist3 implicitly considers the number of substitutions and the type of substitution simultaneously. Using substitution matrices with larger variance penalizes certain substitutions more strongly, which blurs the clusters of sequences with the same number of substitutions. Since the number of substitutions was a key predictor, this resulted in decreased prediction performance.

  • Research Article
  • 10.1093/bioinformatics/btag188
Much ado about nothing: modeling amino acid replacement with predicted protein structures
  • Apr 26, 2026
  • Bioinformatics
  • Lukas Buschmann + 4 more

Motivation: Substitution matrices like BLOSUM62 model the likelihood of replacement of amino acids in evolution. Substitution matrices are used in protein sequence alignment tasks. Since the introduction of BLOSUM62 over three decades ago, many matrices have been released. Yet, to date, no effort uses large amounts of 3D structures predicted by AlphaFold.
Results: Here, we define AFSM, the AlphaFold Substitution Matrix derived from over 20 000 predicted 3D structures following the BLOSUM methodology. We benchmark AFSM against BLOSUM62 and 16 other matrices on five tasks in multiple sequence alignment (MSA) and protein homology search. Our analysis surprisingly reveals that all matrix families perform similarly. Only when there are few sequences in an MSA do BLOSUM62 and AFSM perform better than using no matrix. This suggests that substitution matrices were most beneficial when there was little sequence data. We corroborate this argument by showing that embeddings, which are computed from billions of sequences, perform better than substitution matrices, when sequence data is sparse. Taken together, this suggests that structural data does not improve BLOSUM62. But increased sequence data makes extrapolation with substitution matrices obsolete. Nonetheless, BLOSUM62 continues to capture chemists’ intuition on amino acids by providing numerical values implicitly reflecting physicochemical properties, and it remains indispensable for sparse MSAs and direct comparison of two sequences.
Availability and implementation: Data is available from doi.org/10.5281/zenodo.18777546

  • Research Article
  • Cite Count Icon 65
  • 10.1186/1471-2105-11-175
Protein sequences classification by means of feature extraction with substitution matrices.
  • Apr 8, 2010
  • BMC Bioinformatics
  • Rabie Saidi + 2 more

Background: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors so that well-known machine-learning classifiers, which require this format, can be applied. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step.
Results: To demonstrate the efficiency of this approach, we compare several encoding methods using several machine learning classifiers. The experimental results showed that our encoding method outperforms the others in terms of classification accuracy and number of generated attributes. We also compared the classifiers in terms of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of varying the substitution matrix on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the variance of the accuracies obtained with different substitution matrices is slight. However, the number of generated features varies from one substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works.
Conclusions: The outcomes of our comparative experiments confirm the efficiency of our encoding method to represent protein sequences in classification tasks.

  • Research Article
  • Cite Count Icon 7
  • 10.1109/tcbb.2022.3233856
Knowledge Adaptive Multi-Way Matching Network for Biomedical Named Entity Recognition via Machine Reading Comprehension.
  • May 1, 2023
  • IEEE/ACM Transactions on Computational Biology and Bioinformatics
  • Peng Chen + 4 more

Rapid and effective utilization of biomedical literature is paramount to combat diseases like COVID-19. Biomedical named entity recognition (BioNER) is a fundamental task in text mining that can help physicians accelerate knowledge discovery to curb the spread of the COVID-19 epidemic. Recent approaches have shown that casting entity extraction as a machine reading comprehension task can significantly improve model performance. However, two major drawbacks impede higher success in identifying entities: (1) ignoring the use of domain knowledge to capture the context beyond sentences and (2) lacking the ability to understand the intent of questions more deeply. In this paper, to remedy this, we introduce and explore external domain knowledge which cannot be implicitly learned from the text sequence. Previous works have focused more on the text sequence and explored little of the domain knowledge. To better incorporate domain knowledge, a multi-way matching reader mechanism is devised to model representations of interaction between sequence, question and knowledge retrieved from the Unified Medical Language System (UMLS). Benefiting from these, our model can better understand the intent of questions in complex contexts. Experimental results indicate that incorporating domain knowledge can help to obtain competitive results across 10 BioNER datasets, achieving an absolute improvement of up to 2.02% in the F1 score.

  • Research Article
  • Cite Count Icon 25
  • 10.1108/ecam-02-2019-0097
A domain knowledge incorporated text mining approach for capturing user needs on BIM applications
  • Sep 2, 2019
  • Engineering, Construction and Architectural Management
  • Shenghua Zhou + 4 more

Purpose: In the architecture, engineering and construction (AEC) industry, technology developers have difficulties in fully understanding user needs due to the high domain knowledge threshold and the lack of effective and efficient methods to minimise information asymmetry between technology developers and AEC users. The paper aims to discuss this issue.
Design/methodology/approach: A synthetic approach combining domain knowledge and text mining techniques is proposed to help capture user needs, which is demonstrated using building information modelling (BIM) apps as a case. The synthetic approach includes: collection and cleansing of BIM apps’ attribute data and users’ comments; incorporation of domain knowledge into the collected comments; performance of a sentiment analysis to distinguish positive and negative comments; exploration of the relationships between user sentiments and BIM apps’ attributes to unveil user preferences; and establishment of a topic model to identify problems frequently raised by users.
Findings: The results show that those BIM app categories with high user interest but low sentiments or supplies, such as “reality capture”, “interoperability” and “structural simulation and analysis”, deserve greater efforts and attention from developers. BIM apps with continual updates and of small size are preferred by users. Problems related to the “support for new Revit”, “import & export” and “external linkage” are the ones users complain about most frequently.
Originality/value: The main contributions of this work include: the innovative application of text mining techniques to identify user needs to drive BIM apps development; and the development of a synthetic approach to orchestrating domain knowledge, text mining techniques (i.e. sentiment analysis and topic modelling) and statistical methods in order to help extract user needs for promoting the success of emerging technologies in the AEC industry.

  • Research Article
  • Cite Count Icon 1
  • 10.21926/obm.neurobiol.2303180
Analysis of Interpersonal Relationships of Social Network Users Using Explainable Artificial Intelligence Methods
  • Aug 24, 2023
  • OBM Neurobiology
  • Pavel Ustin + 2 more

The emergence of the social networking phenomenon and the sudden spread of the coronavirus pandemic (COVID-19) around the world have significantly affected the transformation of the system of interpersonal relations, partly shifting them towards virtual reality. Online social networks have greatly expanded the boundaries of human interpersonal interaction and initiated processes of integration of different cultures. As a result, research into the possibilities of predicting human behavior through the characteristics of virtual communication in social networks has become more relevant. The aim of the study is to explore the possibilities of machine learning model interpretability methods for interpreting the success of social network users based on their profile data. This paper uses a specific method of explainable artificial intelligence, SHAP (SHapley Additive exPlanations), to analyze and interpret trained machine learning models. The research is based on Social Network Analysis (SNA), a modern line of research conducted to understand different aspects of the social network as a whole as well as its individual nodes (users). User accounts on social networks provide detailed information that characterizes a user's personality, interests, and hobbies and reflects their current status. Characteristics of a personal profile also make it possible to identify social graphs - mathematical models reflecting the characteristics of interpersonal relationships of social network users. Machine learning algorithms that make predictions based on sets of characteristics (social network data) are an important tool for social network analysis. However, most of today's powerful machine learning methods are "black boxes," and therefore the challenge of interpreting and explaining their results arises.
The study trained RandomForestClassifier and XGBClassifier models and showed the nature and degree of influence of the personal profile metrics of VKontakte social network users and indicators of their interpersonal relationship characteristics (graph metrics).

  • Conference Article
  • Cite Count Icon 13
  • 10.1109/icmla.2004.1383540
Substitution matrix based kernel functions for protein secondary structure prediction
  • Dec 16, 2004
  • B Vanschoenwinkel + 1 more

Different approaches to using substitution matrices in kernel functions for protein secondary structure prediction (PSSP) with support vector machines are investigated. This work introduces a number of kernel functions that calculate inner products between amino acid sequences based on the entries of a substitution matrix (SM), i.e. a matrix that contains evolutionary information about the substitutability of the different amino acids that make up proteins. The starting point is always the same, i.e. a pseudo inner product (PI) between amino acid sequences making use of a SM. It is shown what conditions a SM should satisfy in order for the PI to make sense and subsequently it is shown how a substitution distance (SD) based on the PI can be defined. Next, different ways of using both the PI and the SD in kernel functions for support vector machine (SVM) learning are discussed. In a series of experiments the different kernel functions are compared with each other and with other kernel functions that do not make use of a SM. The results show that the information contained in a SM can have a positive influence on the PSSP results, provided that it is employed in the correct way.
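The abstract does not give the formulas for the pseudo inner product (PI) or the substitution distance (SD). As a sketch only, the code below assumes the PI sums substitution-matrix entries over aligned positions of two equal-length sequences, and that the SD is the usual distance induced by an inner product. The 3-letter matrix is a toy example chosen to be symmetric and positive definite so that the distance is well defined.

```python
import math

# Toy symmetric, diagonally dominant matrix over a made-up alphabet {A, G, V}.
TOY_SM = {
    "A": {"A": 4, "G": 1, "V": 0},
    "G": {"A": 1, "G": 5, "V": -1},
    "V": {"A": 0, "G": -1, "V": 4},
}

def pseudo_inner_product(x, y, sm=TOY_SM):
    """Sum substitution-matrix entries over aligned positions (assumed form of the PI)."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(sm[a][b] for a, b in zip(x, y))

def substitution_distance(x, y, sm=TOY_SM):
    """Distance induced by the PI: sqrt(<x,x> - 2<x,y> + <y,y>) (assumed form of the SD)."""
    sq = (pseudo_inner_product(x, x, sm)
          - 2 * pseudo_inner_product(x, y, sm)
          + pseudo_inner_product(y, y, sm))
    # Guard against tiny negative values; a non-PSD matrix could make sq truly negative.
    return math.sqrt(max(sq, 0.0))

print(pseudo_inner_product("AGV", "AGA"))
print(substitution_distance("AGV", "AGA"))
```

If the matrix is not positive semidefinite, the squared quantity can be negative and the "distance" stops behaving like one. That is one plausible reading of the conditions on the SM mentioned in the abstract.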

  • Book Chapter
  • Cite Count Icon 1
  • 10.1093/hesc/9780190610548.003.0004
Substitution Matrices
  • Jun 7, 2017
  • Jamil Momand + 3 more

This chapter delves into the derivation of amino acid substitution matrices, the basis of sequence comparison programs, which help us connect molecular evolution to protein structure and function. It begins with a discussion on the notation conventions for mathematical matrices. The chapter then shows why natural selection is the foundation for the development of the PAM substitution matrices. It also assesses the importance of protein domain conservation to the development of the BLOSUM substitution matrices. The chapter next peeks into the frequency of occurrence of amino acids in relation to the creation of PAM and BLOSUM substitution matrices. Towards the end, the chapter discusses the percent identity, identity score, similarity, and similarity score given a sequence alignment and substitution matrix. It also considers the significance of sequence clustering to create different BLOSUM substitution matrices.

  • Research Article
  • Cite Count Icon 13
  • 10.1089/106652701753216495
Amino acid substitution matrices from an artificial neural network model.
  • Oct 1, 2001
  • Journal of computational biology : a journal of computational molecular cell biology
  • Kuang Lin + 2 more

An amino acid substitution matrix specifies probabilities of substitutions for each pair of the 20 amino acids. Log-odds scores transformed from the values in substitution matrices are widely used to construct protein sequence alignments. Any given substitution matrix is suited to matching sequences diverged by a specific evolutionary distance. However, for a given set of sequences, it is not always clear what matrix should be used. We used an artificial neural network model to predict probabilities of amino acid substitutions with alignment samples of different evolutionary distances. From this internal description, substitution matrices suitable for detecting relationships at any chosen evolutionary distance can be instantly generated. By using the additional information of evolutionary distances, the average cross entropy error of our neural network model is lower than that of a series of BLOSUM and PET matrices over all testing sets. Our model is more accurate on the prediction of amino acid substitution probabilities.
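The log-odds transformation mentioned above can be sketched with the standard form `s_ij = log2(q_ij / (p_i * p_j))`, where `q_ij` is the probability of seeing residues `i` and `j` aligned and `p_i` is the background frequency of residue `i`. The two-letter alphabet and the probabilities are made up for illustration, and the neural network model from the abstract is not reproduced here.

```python
import numpy as np

def log_odds(q):
    """Convert a symmetric joint pair-probability matrix into log-odds scores (bits).

    q[i, j] is the probability of observing residues i and j aligned;
    background frequencies are taken as the row sums of q.
    """
    q = np.asarray(q, dtype=float)
    if not np.isclose(q.sum(), 1.0):
        raise ValueError("pair probabilities must sum to 1")
    p = q.sum(axis=1)                      # background frequency of each residue
    return np.log2(q / np.outer(p, p))     # observed vs. expected-by-chance ratio

# Hypothetical two-letter alphabet: identical pairs are four times as likely as mismatches.
q = np.array([[0.4, 0.1],
              [0.1, 0.4]])
scores = log_odds(q)
print(np.round(scores, 3))
```

Positive scores mark pairs that occur more often than chance would predict and negative scores mark pairs that occur less often. This is why a matrix built from closely related sequences rewards identities more strongly than one built from distant sequences.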
