Many students use videos to supplement learning outside the classroom. This is particularly important for students with visual impairments, for whom seeing the board during lectures is difficult. For these students, we believe that recording the lectures they attend and providing effective video indexing and search tools will make it easier for them to learn course subject matter at their own pace. As a first step in this direction, we seek to help instructors create an index for their lecture videos using audio keyword search, with queries recorded by the instructor on a laptop and/or created from video excerpts. For this we have created an unsupervised within-speaker keyword spotting system. We represent audio data using de-noised, whitened and scale-normalized Mel-Frequency Cepstral Coefficient (MFCC) features, and locate queries using Segmental Dynamic Time Warping (SDTW) over feature sequences. We evaluate our system on introductory Linear Algebra lectures from instructors with different accents at two U.S. universities. For lectures recorded with a video camera at RIT, laptop-recorded queries obtain an average Precision at 10 of 71.5%, while within-lecture queries obtain 79.5%. For lectures recorded with a lapel microphone at MIT, a similar keyword set yields a much higher average Precision at 10 of 89.5%. Our results suggest that our system is robust to changes in environment, speaker and recording setup.
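To illustrate the general pipeline the abstract describes (normalized MFCC features scored against a spoken query with dynamic time warping), the following is a minimal sketch, not the authors' implementation: it uses librosa, simple per-coefficient normalization in place of the paper's de-noising and whitening, and full DTW over sliding windows rather than the paper's SDTW. All parameter values (sample rate, hop sizes, window stretch, top-k) are illustrative assumptions.

```python
# Minimal sketch of unsupervised keyword spotting via MFCCs + window-wise DTW.
# Hyperparameters and file names are illustrative assumptions, not the paper's setup.
import numpy as np
import librosa

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load audio and return per-coefficient mean/variance-normalized MFCCs (n_mfcc x T)."""
    y, sr = librosa.load(path, sr=sr)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Stand-in for the paper's de-noising / whitening / scale normalization.
    return (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + 1e-8)

def spot_keyword(query_path, lecture_path, hop_frames=20, top_k=10):
    """Slide the query over the lecture; score each window by length-normalized DTW cost."""
    q = mfcc_features(query_path)
    lec = mfcc_features(lecture_path)
    win = q.shape[1] * 2  # allow the matched region to stretch up to 2x the query length
    hits = []
    for start in range(0, max(1, lec.shape[1] - win), hop_frames):
        seg = lec[:, start:start + win]
        D, _ = librosa.sequence.dtw(X=q, Y=seg, metric='cosine')
        cost = D[-1, -1] / q.shape[1]  # alignment cost normalized by query length
        hits.append((cost, start))
    hits.sort()
    return hits[:top_k]  # (cost, start frame) of the top-k candidate regions

# Example usage: rank candidate locations of a spoken keyword in one lecture recording.
# Frame-to-time conversion assumes librosa's default hop of 512 samples at 16 kHz.
# for cost, frame in spot_keyword("eigenvalue_query.wav", "lecture03.wav"):
#     print(f"t = {frame * 512 / 16000:.1f}s  cost = {cost:.3f}")
```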