Numerous methods are available for the analysis of avian vocalizations, but few studies have compared recent methods for calculating and evaluating similarity among calls, particularly calls recorded in the field. This manuscript compares a suite of methodologies for analyzing flight calls of New World warblers, investigating the effectiveness of four techniques for calculating call similarity: (1) spectrographic cross-correlation, (2) dynamic time warping, (3) Euclidean distance between spectrogram-based feature measurements, and (4) random forest distance between spectrogram-based feature measurements. We tested these methods on flight calls, which are short, structurally simple vocalizations typically used during nocturnal migration, as these signals may contain important ecological or demographic information. Using the four techniques listed above, we classified flight calls from three datasets, one collected from captive birds and two collected from wild birds in the field. Each dataset contained an equal number of calls from four warbler species commonly recorded during acoustic monitoring: American Redstart, Chestnut-sided Warbler, Hooded Warbler, and Ovenbird. Using captive recordings to train the classification models, we created four similarity-based classifiers, which were then tested on the captive and field datasets. We show that these classification methods are limited in their ability to classify the calls of these warbler species, and that classification accuracy was lower on field recordings than on captive recordings for each of the tested methods. Of the four methods we compared, the random forest technique had the highest classification accuracy, correctly classifying 67.6% of field recordings. To compare the performance of the automated techniques with manual classification, the most common method used in flight call research, human experts were also asked to classify calls from each dataset. 
The experts correctly classified approximately 90% of field recordings, indicating that although the automated techniques are faster, they remain less accurate than manual classification. However, given the challenges inherent to these data, such as the structural similarity among the flight calls of the focal species and the presence of environmental noise in the field recordings, some of the tested automated classification techniques may be acceptable for real-world applications. We believe that this comparison of broadly applicable methodologies provides information that will prove useful for the analysis, detection, and classification of short-duration signals. Based on our results, we recommend using a combination of feature measurements and random forest classification to assign flight calls to species, with human experts overseeing the process.
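As an illustration of what a similarity-based classifier of the kind described above can look like, the following minimal sketch uses dynamic time warping (technique 2) to assign a query call to the label of its most similar template. It is not the implementation used in this study: the function names `dtw_distance` and `classify_nearest`, the absolute-difference local cost, the one-template-per-label design, and the toy frequency contours are all assumptions made for the example.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences.

    Uses absolute difference as the local cost (an assumption made
    for this sketch; other local costs are possible).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify_nearest(query, templates):
    """Return the label whose template has the smallest DTW cost to query."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


# Hypothetical peak-frequency contours in kHz. These are invented toy
# values with placeholder labels, not measurements from any dataset.
templates = {
    "rising": [6.0, 6.5, 7.0, 7.5, 8.0],
    "falling": [8.0, 7.5, 7.0, 6.5, 6.0],
}

# A longer, slightly irregular rising contour; DTW tolerates the
# difference in length and timing when aligning it to the templates.
query = [6.0, 6.1, 6.6, 7.0, 7.4, 7.9, 8.0]
label = classify_nearest(query, templates)
print(label)
```

Because the warping path can stretch or compress time, two calls with the same shape but different durations receive a low cost, which is the property that motivates dynamic time warping for short signals whose timing varies. The feature-based techniques (3 and 4) would instead compare fixed-length vectors of measurements taken from each spectrogram.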