A Multi-Label Text Classification Algorithm for Labeling Risk Factors in SEC Form 10-K

Ke-Wei Huang, Zhuolun Li

doi:10.2139/ssrn.1916044

Abstract

This study develops, implements, and evaluates a multi-label text classification algorithm called the multi-label categorical K-nearest neighbor (ML-CKNN). The proposed algorithm is designed to automatically identify 25 types of risk factors with specific meanings reported in Section 1A of SEC Form 10-K. The idea of ML-CKNN is to compute a categorical similarity score for each label from the K nearest neighbors in that category. ML-CKNN is tailored to the task of extracting risk factors from 10-Ks. The proposed algorithm perfectly classifies 74.94% of risk factors and 98.75% of labels. Moreover, ML-CKNN is empirically shown to outperform ML-KNN and other multi-label algorithms. The extracted risk factors could be valuable to empirical studies in accounting or finance.
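The per-label scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the text representation, the similarity measure, how neighbor similarities are aggregated, or the decision rule. The sketch assumes bag-of-words counts, cosine similarity, a mean over the top K neighbors, and a fixed threshold. The function names, the toy training data, and the values of `k` and `threshold` are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorical_scores(query, train, k=2):
    """For each label, average the similarity between the query and the
    k most similar training texts that carry that label (assumed aggregation)."""
    q = vectorize(query)
    sims_by_label = {}
    for text, labels in train:
        sim = cosine(q, vectorize(text))
        for label in labels:
            sims_by_label.setdefault(label, []).append(sim)
    scores = {}
    for label, sims in sims_by_label.items():
        top = sorted(sims, reverse=True)[:k]
        scores[label] = sum(top) / len(top)
    return scores


def predict(query, train, k=2, threshold=0.45):
    """Assign every label whose categorical score reaches the threshold
    (the threshold rule is an assumption, not taken from the paper)."""
    scores = categorical_scores(query, train, k)
    return {label for label, score in scores.items() if score >= threshold}


# Hypothetical toy training set: (risk-factor text, set of labels).
train = [
    ("we face intense competition from larger rivals", {"competition"}),
    ("competition may reduce our market share", {"competition"}),
    ("changes in regulation could increase our costs", {"regulation"}),
    ("new government regulation may harm our business", {"regulation"}),
    ("competition and regulation may both affect our costs",
     {"competition", "regulation"}),
]

query = "intense competition may reduce our revenue"
scores = categorical_scores(query, train)
predicted = predict(query, train)
print(scores)
print(predicted)
```

Because each label is scored only against its own nearest neighbors, a rare label is not crowded out by neighbors from frequent categories, which is one plausible reading of why a categorical variant might suit the 25 risk-factor types.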
