Abstract

Non-negative tensor factorization (NTF) is a widely used multi-way analysis approach that factorizes a high-order non-negative data tensor into several non-negative factor matrices. In NTF, the non-negative rank has to be predetermined to specify the model and it greatly influences the factorized matrices. However, its value is conventionally determined by specialists’ insights or trial and error. This paper proposes a novel rank selection criterion for NTF on the basis of the minimum description length (MDL) principle. Our methodology is unique in that (1) we apply the MDL principle on tensor slices to overcome a problem caused by the imbalance between the number of elements in a data tensor and that in factor matrices, and (2) we employ the normalized maximum likelihood (NML) code-length for histogram densities. We employ synthetic and real data to empirically demonstrate that our method outperforms other criteria in terms of accuracies for estimating true ranks and for completing missing values. We further show that our method can produce ranks suitable for knowledge discovery.

Highlights

  • We propose a method to select an appropriate value of rank for non-negative tensor factorization (NTF) based on the minimum description length (MDL) principle [7]

  • This paper has proposed a novel rank selection method for NTF on the basis of the MDL principle

  • Our method is unique in the tensor slice approach and the normalized maximum likelihood (NML)-based code-length calculation


Summary

Motivation

As the variety of data has grown rapidly, data often have more than two attributes and contain only non-negative values. For example, purchase data can be arranged as trading amounts indexed by customers × commodities × shops × times, and web access data can be organized as access times indexed by hosts × users × months. Following the terminology in [4], we refer to the non-negative rank simply as rank in the rest of this paper and denote its value by R. Note that this is not the same notion as rank in matrices. In the web access log analysis, three factorized matrices can represent influences of hosts, users and months, in which NTF can analyze users’. Even though the selection of rank is very important, in most studies on NTF the value of R is determined by trial and error or specialists’ insights, and there does not exist a good way to determine R automatically.
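To make the role of the rank concrete, the following is a minimal sketch of a three-way NTF in which the rank must be passed in before fitting. It is not the method proposed in this paper: it uses the standard multiplicative-update scheme for a non-negative CP model, and the function names (khatri_rao, ntf_cp), the tensor sizes, and the iteration count are assumptions chosen only for illustration.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def ntf_cp(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Illustrative non-negative CP factorization of a 3-way tensor.

    Uses multiplicative updates, which keep all factor entries
    non-negative. `rank` has to be fixed before fitting.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    # Mode-n unfoldings; column order matches khatri_rao above.
    X1 = X.reshape(I, J * K)                     # rows i, columns (j, k)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)  # rows j, columns (i, k)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)  # rows k, columns (i, j)
    for _ in range(n_iter):
        A *= (X1 @ khatri_rao(B, C)) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= (X2 @ khatri_rao(A, C)) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= (X3 @ khatri_rao(A, B)) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C

def relative_error(X, A, B, C):
    """Relative Frobenius error of the rank-R reconstruction."""
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Hypothetical hosts x users x months tensor built from 2 non-negative components.
rng = np.random.default_rng(1)
true_factors = [rng.random((n, 2)) for n in (6, 5, 4)]
X = np.einsum("ir,jr,kr->ijk", *true_factors)

A, B, C = ntf_cp(X, rank=2)
rel_err = relative_error(X, A, B, C)
print(f"rank=2 relative error: {rel_err:.4f}")
```

Rerunning ntf_cp with a different rank value returns different factor matrices, which is why the choice of R matters for interpretation. The reconstruction error alone is not a selection criterion, since a larger rank can generally fit the data at least as well.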

Contribution
Related Work
Proposed Method
Comparison Methods
Experiments on Synthetic Data
On-Line Retail Data Set
Web Access Data Set
App Usage Data Set
Conclusions