
Corpora of spoken language contain transcriptions of spontaneous or planned speech, such as broadcast news or elicited narratives and dialogues. They are often aligned with the accompanying recordings. They are an invaluable resource for various kinds of linguistic research, such as phonology, conversation analysis, and dialectology. Such corpora are carefully sampled and rich in sociodemographic metadata. They are also richly tagged, many with mark-up specific to speech corpora, such as phonemic and prosodic annotation.

There are 133 spoken corpora in the CLARIN infrastructure, 122 of which contain both transcriptions and the associated recordings, and 11 only the transcriptions. Most of the corpora are monolingual, accounting for the following 16 languages: Arabic, Czech, Dutch, Estonian, Finnish, French, German, Hungarian, Italian, Nepali, Norwegian, Polish, Skolt Saami, Slovenian, Spanish, and Swedish. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments.

Below, we first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated. For comments, changes to the existing content, or inclusion of new corpora, send us an email. This website was last updated on 9 March 2022.

Spoken corpora in the CLARIN infrastructure

Corpora with transcriptions and audio recordings

This corpus is available for download from the Oxford Text Archive. For the relevant publication, see Halabi (2016).

Corpus: Audioatlas Siebenbuergisch-Saechsischer Dialekte
Annotation: geomapping, orthographic/partial phonetic transcription, semantic labelling