Abstract

AbstractThis paper discusses the creation and use of the TV Corpus (subtitles from 75,000 episodes, 325 million words, 6 English-speaking countries, 1950s-2010s) and the Movies Corpus (subtitles from 25,000 movies, 200 million words, 6 English-speaking countries, 1930s–2010s), which are available atEnglish-Corpora.org. The corpora compare well to the BNC-Conversation data in terms of informality, lexis, phraseology, and syntax. But at 525 million words in total size, they are more than 30 times as large as BNC-Conversation (both BNC1994 and BNC2014 combined), which means that they can be used to look at a wide range of linguistic phenomena. The TV and Movies corpora also allow useful comparisons of very informal language across time (containing texts from the 1930s and later for the movies, and from the 1950s onwards for TV shows) and between dialects of English (such as British and American English).

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.