

Huber: Tri-Constituent Compounds

A Usage-Based Account of Complex Nominal Compounding

E-book, English, Volume 114, 282 pages

Series: Topics in English Linguistics [TiEL]

ISBN: 978-3-11-108211-0
Publisher: De Gruyter
Format: EPUB
Copy protection: Adobe DRM



This book provides a usage-based perspective on the study of multi-word compounding, analyzing the structural, functional and cognitive aspects of tripartite compounds (e.g. day care center, football game, hotel bedroom). It highlights the heterogeneity of these word-formation products, but also carves out surprising differences from two-word compounds. In order to trace the step from two-word compounding to multi-word compounding, the book explains why only some compounds are used productively in the formation of more complex compounds. Building on the idea of entrenchment, it provides a theoretical account that makes it possible to understand speakers’ ability to produce multi-word compounds.

Further information & material


Part II: Data acquisition and annotation
4 Compilation of a compound database
Unlike scholars such as Carstairs-McCarthy (2018: 85), who use examples that feel remote from linguistic reality (e.g. holiday car sightseeing trip or window oven cleaner marketing), this work embodies a descriptive, usage-based approach to complex compounding. I intend to gain insights into the different ways in which speakers utilize the resources of the English language to form tri-constituent compounds, which is why I will work exclusively with authentic material, i.e. compounds that have actually been formed by speakers of English. This project is grounded in a systematically compiled, corpus-based and exhaustive inventory of English three-noun compounds. It contains almost 60,000 instances and can be seen as a representative database of tri-constituent noun compounds for the English language. The data acquisition process will be documented in the following sections: Section 4.1 will introduce the material that was used as a basis; Section 4.2 will describe the data retrieval process in considerable detail to provide an insight into the intricacies involved in searching for compounds in corpora; Section 4.3 will present the setup of the database and point out its flaws.

4.1 Corpus data used in this project
This project is based on the Corpus of Contemporary American English (COCA), a computerized database provided by Mark Davies (2008). This corpus was chosen primarily because its material covers a broad variety of the English language: it is distributed over various genres such as fiction, magazines, academic texts and newspapers, and contains both written and spoken data. Moreover, COCA is remarkably large: it covers texts from 1990 onwards and encompasses more than one billion words taken from 485,202 different texts. With this scope, it is significantly bigger than other established corpora such as the British National Corpus (BNC). Due to its size and the variety of the material, the language in COCA can be considered an authentic and representative sample of American English. The focus on this variety of English was not motivated by the phenomenon under investigation (i.e. three-noun compounds are not assumed to be more relevant in American English than in other varieties of English), but by practical considerations: COCA is not only large enough to contain sufficient instances of complex compounds, but is also available in a downloadable format, which permits the use of computational methods. The specific demands posed by the different investigations performed in this project required working with three sets of material related to COCA: the online edition of the corpus and two sources of offline material. The latter consist of a downloadable n-gram package and an offline version of the corpus, both available for purchase on the corpus website. The n-gram package contains 2-, 3- and 4-grams which occur at least three times in that sequence; it comprises a total of 155 million strings, including part-of-speech specification. Its 2-gram and 3-gram sets were used as source material for the compilation of the list of three-noun compounds. The offline version of the corpus was used for counting the kinds of token frequencies that could not be extracted from the n-gram sets. The online source was only used for qualitative, small-scale experiments and was accessed via the web interface provided by English-Corpora.org. The word frequencies in these three sources do not match, as the time periods they cover differ slightly: the online edition is up to date and thus yields higher word frequencies than the n-gram lists, which only include material up to 2011, while the offline corpus extends to 2012. These differences, however, are unlikely to affect the overall results, as they are generally small (compare, for example, the three-noun compound weekend edition, with 3,013 tokens in the offline database as opposed to 3,033 tokens in the online edition).
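To give a concrete idea of how such an n-gram package can be processed, the following Python sketch reads n-gram records from a file. The tab-separated column layout assumed here (frequency first, then the word forms, then the POS tags) and the file name are illustrative assumptions; the files actually distributed for COCA may be structured differently.

```python
from collections import namedtuple

# A parsed n-gram record: word forms, POS tags, token frequency.
Ngram = namedtuple("Ngram", ["words", "tags", "freq"])

def read_ngrams(path, n):
    """Read a COCA-style n-gram file lazily.

    Assumed layout per tab-separated line:
    frequency, n word forms, n part-of-speech tags.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            words = tuple(fields[1:1 + n])
            tags = tuple(fields[1 + n:1 + 2 * n])
            yield Ngram(words, tags, int(fields[0]))

# Hypothetical usage: iterate over the 3-gram set.
# for gram in read_ngrams("coca_3grams.txt", n=3):
#     ...
```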
4.2 Data retrieval

The methodological work with complex compounds involves intricacies that are systematic enough to be worth sketching here, so that later approaches can take them into account. The aim of this section is, on the one hand, to point out the aspects that complicate and distort access to complex compounds in corpora and, on the other hand, to provide solutions by presenting the steps that were taken to overcome these difficulties. A major hurdle that impedes the identification of complex compounds in corpora is their spelling. For tri-constituent nominal compounds, the possible combinations of hyphens, spaces and solid spelling theoretically add up to nine different formats that need to be taken into account: N-N-N, N N N, NN N, N-N N, N N-N, N NN, NN-N, N-NN and NNN. This has serious implications for the search process, as the search for these compounds cannot be resolved efficiently through a single query of the format “Noun Noun Noun”. Orthography is also the source of a more serious problem, which significantly hampers the discovery of a particular type of three-noun compound. There is a high number of tri-constituent compounds in which the embedded two-noun compound forms an orthographic unit, i.e. sequences of the format NN N or N NN, such as football game, family network, sunflower seeds or research database. In the process of automated tagging in corpora, the embedded compound is generally not recognized as a complex word but tagged holistically as one single noun, e.g. footballN gameN. The relevant tri-constituent compounds are therefore “hiding” in what looks like a bi-constituent compound. This poses a major challenge for finding them in a corpus, as they cannot be identified through a standard search for a sequence of three nouns. Since the formats NN N and N NN are extremely common, excluding types of this format from the project was not an option, as it would not permit justified generalizations about the phenomenon of tri-constituent compounds. A manual inspection of two-noun compounds for hidden three-noun compounds is not feasible either, as the number of two-noun compounds in English is extremely high. The procedure undertaken to solve this problem and to find three-noun compounds in their different spelling formats in the corpus will be presented in the following sections.

4.2.1 Recall of noun sequences

Compiling a database of three-noun compounds clearly requires computational methods. A POS-tagged corpus lends itself to this purpose, which is why the n-gram sets of COCA were chosen as source material. The set of primary interest was obviously the 3-gram list, but the 2-gram list was relevant as well, as this is where the above-mentioned problematic type football game was expected to be found. The list of 2-grams contains 6.2 million types; the list of 3-grams is even more extensive, with 11.9 million types. These lists were imported into a MySQL database on the servers of the CIS of LMU Munich so that they could be searched for the sequences of interest. In order to reduce both n-gram lists to actual noun sequences, the different noun tags were identified with the help of the CQP query syntax. The sequences selected were those in which both nouns (2-gram set) or all three nouns (3-gram set) were labelled with a noun tag, only excluding the tag that marks proper names (“np1”). Not surprisingly, working with large data sets and automatically tagged corpora delivers results that include noise, which is why the lists of noun sequences needed to be refined thoroughly.
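The selection step just described can be approximated in code. The sketch below builds on the read_ngrams() helper from the sketch in Section 4.1 and assumes CLAWS-style tags as used in COCA, where noun tags begin with “n” and proper names are tagged “np1”; the actual work was carried out with CQP queries over the MySQL tables, so this is only an illustration of the filtering logic, not the original procedure.

```python
def has_noun_tag(tag):
    """True for noun tags, excluding the proper-name tag "np1"."""
    tag = tag.lower()
    return tag.startswith("n") and tag != "np1"

def noun_sequences(path, n):
    """Yield n-grams in which every constituent carries a noun tag.

    Mirrors the selection described above: both nouns (2-gram set) or
    all three nouns (3-gram set) must bear a noun tag, with "np1"
    excluded. Relies on read_ngrams() from the earlier sketch.
    """
    for gram in read_ngrams(path, n):
        if all(has_noun_tag(t) for t in gram.tags):
            yield gram

# Hypothetical usage for the two source lists:
# two_noun = noun_sequences("coca_2grams.txt", n=2)
# three_noun = noun_sequences("coca_3grams.txt", n=3)
```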
In what follows, I will describe the kinds of unwanted items that can be encountered in a search for noun compounds and propose solutions to eliminate them systematically. The following examples are an extract from the supposed noun sequences in the 2-gram data set and are intended to give an impression of the degree of noise that was prevalent in the data. In all of these instances, the constituents had been tagged as nouns:

-- beef
/ABC Mr.
precedent sensori
dritti riservati
depositati piu
Footage-alternates Mr.
??28;5463;TOOLONG account
tat violence
...
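Noise of this kind can largely be removed with simple surface heuristics. The rules in the following sketch (alphabetic characters only, an optional internal hyphen, a plausible word length) are illustrative assumptions, not the exact cleaning criteria applied in this project:

```python
import re

# Accept only alphabetic word forms, optionally hyphenated.
CLEAN_WORD = re.compile(r"^[A-Za-z]+(?:-[A-Za-z]+)*$")

def looks_clean(words):
    """Reject items such as "/ABC", "--" or "??28;5463;TOOLONG"."""
    return all(
        CLEAN_WORD.match(w) is not None  # no digits, punctuation or markup
        and 2 <= len(w) <= 25            # discard fragments and overlong strings
        for w in words
    )

# Hypothetical usage:
# looks_clean(("football", "game"))             # -> True
# looks_clean(("??28;5463;TOOLONG", "account"))  # -> False
```

Note that foreign-language material such as dritti riservati would pass surface filters of this kind and would require an additional step, for instance a check against an English lexicon.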


Elisabeth Huber, Ludwig-Maximilians-Universität München.


