This study uses the Oracle SQL certification exam questions to explore the design of automatic classifiers for exam questions containing code snippets. SQL's question classification assigns a class label in the exam topics to a question. With this classification, questions can be selected from the test bank according to the testing scope to assemble a more suitable test paper. Classifying questions containing code snippets is more challenging than classifying questions with general text descriptions. In this study, we use factorial experiments to identify the effects of the factors of the feature representation scheme and the machine learning method on the performance of the question classifiers. Our experiment results showed the classifier with the TF-IDF scheme and Logistics Regression model performed best in the weighted macro-average AUC and F1 performance indices. The classifier with TF-IDF and Support Vector Machine performed best in weighted macro-average Precision. Moreover, the feature representation scheme was the main factor affecting the classifier's performance, followed by the machine learning method, over all the performance indices.
Read full abstract