Topics
TOPIC1: Compiling parliamentary corpora (Tomaž Erjavec, Andrej Pančur)
Lecture
The session introduces corpora of parliamentary proceedings, with a focus on building and especially encoding such corpora. We give the motivation for research on parliamentary proceedings, mention the formats in which they are typically available, sketch the tool-chain needed for their download, clean-up, and structural and linguistic annotation, and discuss existing and emerging encoding schemes for their mark-up. We concentrate on the Text Encoding Initiative (TEI) Guidelines: we first introduce the Guidelines and then demonstrate the mark-up of parliamentary corpora on several existing cases, discussing issues such as annotating sessions, speeches and interruptions, metadata on speakers and sittings, using typologies, and including linguistic mark-up.
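To give a flavour of such encoding, below is a minimal, illustrative TEI fragment of a single speech with an interruption. It follows common TEI practice for spoken data, but it is a simplified sketch rather than the exact scheme taught in the session, and the speaker, identifiers and wording are invented.

  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
      <!-- fileDesc with bibliographic metadata on the sitting omitted here -->
      <profileDesc>
        <particDesc>
          <listPerson>
            <!-- speaker metadata; the person and affiliation are invented -->
            <person xml:id="JanNovak">
              <persName>Jan Novak</persName>
              <affiliation>Member of Parliament</affiliation>
            </person>
          </listPerson>
        </particDesc>
      </profileDesc>
    </teiHeader>
    <text>
      <body>
        <div type="debateSection">
          <note type="speaker">Jan Novak:</note>
          <u who="#JanNovak" xml:id="u1">
            Honourable colleagues, I rise to speak on the budget
            <!-- an interruption, encoded as an incident within the utterance -->
            <incident><desc>applause in the hall</desc></incident>
            and I ask the government to reconsider.
          </u>
        </div>
      </body>
    </text>
  </TEI>

In real corpora, the typologies (e.g. the values of the type attributes) and the links from utterances to the speaker metadata are defined project-wide; the hands-on part below works with exactly this kind of structure.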
Hands-on
The session also includes a hands-on part. Before the workshop we will enquire about the expectations and technical skills of the participants, but the default scenario is that participants bring with them a short excerpt from a parliamentary debate as a Word file, which they first roughly annotate, then automatically convert to TEI, and finally annotate in detail in TEI, using the Oxygen XML editor (which can be used free of charge for one month). In this way, participants gain hands-on experience with the TEI structure of parliamentary corpora. We will also demonstrate the use of such corpora on some pre-existing ones with NoSketch Engine.
TOPIC2: Challenges in Corpus Building for the Romanian ELTeC Collection
Lecture (Dr. Roxana Patras)
In its first part, the lecture outlines the challenges of corpus building for the Romanian ELTeC collection. As shown below, some of these challenges originate in the indisputable linguistic and cultural specificity of Romanian texts, while others derive from post-communist policies concerning the digitisation and open-access treatment of cultural heritage:

a. scarcity of digitised resources from the period 1850-1920, which makes the extraction and checking of metadata difficult;

b. analysis and automation tools that are still unadjusted to the diachronic particularities of Romanian;

c. eligibility and composition principles, most of them derived from Western literary traditions (i.e. book length, number of editions, 30% canonical works, 10-30% female authors, sampling according to various time slots, etc.), which are largely inapplicable to Eastern European literary phenomena and institutions;

d. in the case of Romanian, the slow process of language standardisation, which raises difficult issues of clean-up and normalisation. For instance, novels published between 1850 and 1865 are printed in a mixed Cyrillic-Latin alphabet that regular OCR engines cannot read, while novels published after 1865, albeit in the Latin alphabet, still contain special glyphs that result in bad OCR output.

In the second part of the lecture, I introduce a few practical solutions that I have tested and that have proven effective in addressing the aforementioned issues: customised digitisation (focused on novel subgenres); customised datasets and DOI assignment on Zenodo; support repositories for different text formats on GitHub and Zenodo; and HTR models for specific prints, such as those using the Romanian Transition Alphabet.
Tutorial: Corpus Design Principles in the COST Action 'Distant Reading for European Literary History' (Dr. Carolin Odebrecht)
The tutorial introduces sampling and balancing criteria as well as encoding principles for the multilingual European Literary Text Collection (ELTeC). We will look at the ELTeC-TEI encoding principles and use text examples to work with the encoding schemas for metadata and markup. The tutorial will also present the Action's goals and our working environments.
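As a hedged illustration of the kind of metadata such encoding records, the fragment below sketches an ELTeC-style TEI header. The bibliographic details are invented, and the Action's own extension elements for collection-specific descriptors are only indicated in a comment rather than spelled out:

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Novel : ELTeC edition</title>
        <author>Doe, Jane (1850-1910)</author>
      </titleStmt>
      <publicationStmt>
        <p>Published as part of the ELTeC collection.</p>
      </publicationStmt>
      <sourceDesc>
        <bibl type="firstEdition">An Example Novel, 1871.</bibl>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <textDesc>
        <!-- collection-specific descriptors (e.g. author gender, size class,
             time slot, reprint count) are recorded with the Action's own
             extension elements, which the tutorial presents in detail -->
      </textDesc>
    </profileDesc>
  </teiHeader>

The sampling and balancing criteria (time slots, size classes, canonicity, author gender) are operationalised through precisely such header fields, which is what makes the collection comparable across languages.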
TOPIC3: A multidisciplinary approach to the use of technology in research: the case of interview data
Lecture (Louise Corti)
In the first part of this session, the lecture introduces different scholarly approaches to working with interview data as a primary or secondary data source. We set out some of the distinct traditions and differences in analytic practices and tool use across the disciplines. The wide CLARIN family of digital methods and tools used by linguists and speech technologists, such as automated speech recognition, annotation, text analysis and emotion recognition tools, is open to wider exploitation, for example by digital humanities scholars, historians and social scientists. We show how these tools can support different phases of the research process, from data preparation to analysis and presentation. Connecting tools to meet the needs of a researcher's analytic journey can also be beneficial. In this respect, we describe the work on the CLARIN Oral History 'Transcription Chain' (TChain), a tool that supports transcription, alignment and editing of audio and text in multiple languages.
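To make the idea of audio-text alignment concrete, the fragment below is a generic sketch, in the TEI vocabulary for transcribed speech, of an utterance linked to points on an audio timeline. It is not TChain's actual output format, and the content and timings are invented:

  <!-- invented example; times are offsets in seconds from #T0 -->
  <timeline unit="s" origin="#T0">
    <when xml:id="T0" absolute="00:00:00"/>
    <when xml:id="T1" interval="12.4" since="#T0"/>
    <when xml:id="T2" interval="17.9" since="#T0"/>
  </timeline>
  <u who="#interviewee" start="#T1" end="#T2">
    We moved to the city in the winter of 1963.
  </u>

Correcting ASR output then amounts to editing the text inside the utterance while its links to the audio timeline are preserved, which is what makes joint editing of audio and text possible.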
Hands-on (Christoph Draxler / Arjan van Hessen)
The second part of the session offers a hands-on workshop, giving participants the opportunity to work with the TChain: using a dedicated portal, they convert audio-visual material into a suitable format, run automatic speech recognition (ASR), correct the ASR results, and download them.