With the rapid development of information technology, the volume of digital music is growing rapidly, and effectively integrating vocal input and recognition technology into digital music production can greatly improve production efficiency while maintaining musical quality. This paper focuses on the implementation and application of human voice input and recognition technology in digital music creation, enabling users to generate digital music simply by humming a melodic fragment into a microphone. The paper begins with an introduction to digital music and speech recognition technology and then describes the characteristics of common audio formats; the format chosen as the data source for digital music creation is selected for its advantages in retrieval. Next, the method of extracting musical information is described: the main melody is extracted from the multitrack file, along with the corresponding performance information. Feature extraction from the hummed input melody is then described in detail. The traditional speech-recognition approach of using short-time energy and short-time zero-crossing rate for speech endpoint detection is analyzed, and, taking the characteristics of humming into account, a two-stage note-segmentation scheme that combines an energy-saliency index, the zero-crossing rate, and pitch change is adopted, leading to a substantial improvement in segmentation performance. For emotion recognition, the algorithm obtains the melody line with a melody extraction algorithm, merges short segments of the melody line to reduce the recognition error rate, and uses the melody line to partition the music signal into segments; the features of each segment are then abstracted by a CNN-based model, whose output is concatenated with the melody-contour features of the corresponding segment and fed to a regressor to obtain the segment's valence/arousal (V/A) emotion values.
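As a minimal illustration of the endpoint-detection features mentioned above, the sketch below computes frame-wise short-time energy and zero-crossing rate for a hummed signal and marks voiced frames with simple thresholds. The frame length, hop size, and threshold values are illustrative assumptions, not the parameters used in this paper, and the two-stage note segmentation and pitch-change criterion are not reproduced here.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a mono signal into overlapping frames (assumed frame/hop sizes)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Short-time energy: mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Zero-crossing rate: fraction of sign changes per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def detect_endpoints(x, energy_thresh=1e-4, zcr_thresh=0.3):
    """Label frames as voiced humming when energy is high and ZCR is low
    (thresholds are illustrative, not the paper's values)."""
    frames = frame_signal(x)
    energy = short_time_energy(frames)
    zcr = zero_crossing_rate(frames)
    return (energy > energy_thresh) & (zcr < zcr_thresh)

# Usage example: a 440 Hz hum surrounded by silence, sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
hum = 0.5 * np.sin(2 * np.pi * 440 * t)
signal = np.concatenate([np.zeros(sr // 2), hum, np.zeros(sr // 2)])
voiced = detect_endpoints(signal)
print("first/last voiced frame:", np.flatnonzero(voiced)[[0, -1]])
```

In this simple form, energy separates humming from silence while the zero-crossing rate helps reject noisy, unvoiced frames; the paper's method refines such frame-level decisions with an energy-saliency index and pitch-change cues to place note boundaries.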