Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling

dc.contributor.author: Marquard, Stephen
dc.date.accessioned: 2016-08-13T18:55:00Z
dc.date.available: 2016-08-13T18:55:00Z
dc.date.issued: 2012
dc.date.updated: 2016-08-13T18:25:02Z
dc.description.abstract: Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia. (en_ZA)
dc.identifier.apacitation: Marquard, S. (2012). Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling. (ThesesDissertation). University of Cape Town, Unknown, Computer Science. Retrieved from http://hdl.handle.net/11427/21226 (en_ZA)
dc.identifier.chicagocitation: Marquard, Stephen. "Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling." ThesesDissertation., University of Cape Town, Unknown, Computer Science, 2012. http://hdl.handle.net/11427/21226 (en_ZA)
dc.identifier.citation: Marquard, S. 2012. Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling. MPhil Thesis. University of Cape Town. (en_ZA)
dc.identifier.ris: TY - Thesis / Dissertation AU - Marquard, Stephen DA - 2012 DB - OpenUCT DP - University of Cape Town LK - https://open.uct.ac.za PB - University of Cape Town PY - 2012 T1 - Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling TI - Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling UR - http://hdl.handle.net/11427/21226 ER - (en_ZA)
dc.identifier.uri: http://hdl.handle.net/11427/21226
dc.identifier.vancouvercitation: Marquard S. Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling. [ThesesDissertation]. University of Cape Town, Unknown, Computer Science, 2012 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/21226 (en_ZA)
dc.language: eng (en_ZA)
dc.publisher.department: Computer Science (en_ZA)
dc.publisher.faculty: Unknown (en_ZA)
dc.publisher.institution: University of Cape Town (en_ZA)
dc.title: Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling (en_ZA)
dc.type: Master Thesis
dc.type.qualificationlevel: Masters
dc.type.qualificationname: MPhil (en_ZA)
uct.type.filetype: Text
uct.type.filetype: Image
uct.type.publication: Research (en_ZA)
uct.type.resource: ThesesDissertation (en_ZA)
Files
Original bundle
Name:
Marquard_Improving_Searchability_2012.pdf
Size:
2.71 MB
Format:
Adobe Portable Document Format
Description:
License bundle
Name:
license.txt
Size:
1.72 KB
Format:
Item-specific license agreed upon to submission
Description:
Collections