Visual crowdsensing (VCS) is becoming predominant in mobile crowdsensing, but it still faces unique challenges, including the large size of visual data, multidimensional requirements, and intensive processing demands. As a key research problem in VCS, data selection filters out redundant data and retains only the most representative samples, which can effectively reduce the complexity and cost of VCS. In this paper, we study a phase-by-phase data selection approach, in which metadata are first used to pre-select collected photos, and only the selected photos are then sent to a backend server for further processing based on content features. As such, the initial selection can be completed on nearby edge servers in mobile edge computing (MEC), while the more intensive content processing can be done in a remote cloud. We evaluate different initial data selection algorithms using traditional performance measures as well as adapted clustering indices as quality metrics. Moreover, we formulate an integer linear programming (ILP) problem for the final data selection based on scale-invariant feature transform (SIFT) features. This content-based selection complements the initial data selection based on contextual metadata. The simulation results show the differences among these selection algorithms and provide guidance on how to choose an appropriate one according to application needs.
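
The phase-by-phase idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the formulation studied in this paper: the `Photo` record, the metadata distance (position plus weighted camera heading), the threshold-based pre-selection, and the greedy coverage rule, which stands in for the ILP. Content features are represented as sets of integer IDs rather than real SIFT descriptors.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    """Hypothetical photo record: contextual metadata plus content features."""
    photo_id: str
    x: float              # position in metres (assumed local coordinates)
    y: float
    heading: float        # camera direction in degrees
    features: frozenset   # stand-in for content features (not real SIFT)


def metadata_distance(a, b, heading_weight=0.1):
    """Assumed dissimilarity: Euclidean distance plus weighted heading gap."""
    d_pos = math.hypot(a.x - b.x, a.y - b.y)
    d_head = abs(a.heading - b.heading) % 360.0
    d_head = min(d_head, 360.0 - d_head)  # wrap around, e.g. 350 vs 10 degrees
    return d_pos + heading_weight * d_head


def preselect_by_metadata(photos, min_distance):
    """Phase 1 (edge): drop photos whose metadata is too close to a kept one."""
    kept = []
    for p in photos:
        if all(metadata_distance(p, q) >= min_distance for q in kept):
            kept.append(p)
    return kept


def select_by_content(photos, budget):
    """Phase 2 (cloud): greedy coverage of distinct features within a budget.

    This greedy rule is a simple substitute for an ILP, used only to keep
    the sketch free of solver dependencies.
    """
    chosen, covered = [], set()
    remaining = list(photos)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda p: len(p.features - covered))
        if not best.features - covered:
            break  # no remaining photo adds new content
        chosen.append(best)
        covered |= best.features
        remaining.remove(best)
    return chosen


photos = [
    Photo("a", 0, 0, 0, frozenset({1, 2, 3})),
    Photo("b", 1, 0, 5, frozenset({1, 2, 3})),   # near-duplicate of "a"
    Photo("c", 50, 0, 90, frozenset({3, 4})),
    Photo("d", 0, 60, 180, frozenset({1, 2})),
    Photo("e", 80, 80, 270, frozenset({5, 6, 7, 8})),
]

candidates = preselect_by_metadata(photos, min_distance=5.0)
final = select_by_content(candidates, budget=2)
print([p.photo_id for p in candidates])
print([p.photo_id for p in final])
```

In this sketch only the photos that survive the cheap metadata check reach the content-based step, which mirrors the division of work between edge servers and the remote cloud described above.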