Lymph node metastasis is an important prognostic factor in oral squamous cell carcinoma. However, the lack of reliable biomarkers for lymph node metastasis can lead to inappropriate treatment and a poor prognosis. There is therefore a need to identify gene sets associated with lymph node metastasis. In this study, we used three expression datasets obtained from a public database and selected candidate gene sets related to lymph node metastasis from two individual datasets and from a combined dataset. We evaluated the selected gene sets using out-of-bag (OOB) error rates in a validation dataset. The gene set detected from the combined dataset classified lymph node status more accurately in the validation dataset than the gene sets from the individual datasets, and it showed clear expression patterns, based on chromosomal location, that distinguished lymph node status. The gene set derived from the combined dataset holds promise as a more accurate candidate for the diagnosis of lymph node metastasis, and it could be used for biological validation in further studies.