Publications by Koenraad De Smedt

April 25, 2017

NB: This document has hyperlinks which are clickable on a digital platform. Most of the links are to a pdf with the full text of the publication; in some cases they point to the volume or the publisher's landing page for the publication. Some full texts provided through the links are identical to the published versions, others are prepublication versions or have postpublication corrections. Care must be taken when quoting from these texts; please refer to the published version or contact the author in case of doubt.


De Smedt, Koenraad; Samdal, Gunn Inger Lyse; Kyrkjebø, Rune; Al Ruwehy, Hemed Ali Hemed; Gjesdal, Øyvind Liland; Rosén, Victoria; Meurer, Paul (2016). The CLARINO Bergen Centre: Development and Deployment. In: Selected Papers from the CLARIN Annual Conference 2015, October 14–16, 2015, Wrocław, Poland. Linköping University Electronic Press.

The CLARINO Bergen Centre (Norway) provides a language resource repository, corpus and treebank services and metadata management services. We explain the motivation for using the LINDAT repository software as a model and describe the cloning and adaptation of that software for the CLARINO Bergen Repository. We also describe how the other centre services addressing CLARIN goals have been integrated into the centre, focusing on the steps taken to adapt the INESS treebanking service to CLARIN standards.

Dyvik, Helge J. Jakhelln; Meurer, Paul; Rosén, Victoria; De Smedt, Koenraad; Haugereid, Petter; Losnegaard, Gyri Smørdal; Samdal, Gunn Inger Lyse; Thunes, Martha (2016). NorGramBank: A ‘Deep’ Treebank for Norwegian. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association.

We present NorGramBank, a treebank for Norwegian with highly detailed LFG analyses. It is one of many treebanks made available through the INESS treebanking infrastructure. NorGramBank was constructed as a parsebank, i.e. by automatically parsing a corpus, using the wide coverage grammar NorGram. One part consisting of 350,000 words has been manually disambiguated using computer-generated discriminants. A larger part of 50 M words has been stochastically disambiguated. The treebank is dynamic: by global reparsing at certain intervals it is kept compatible with the latest versions of the grammar and the lexicon, which are continually further developed in interaction with the annotators. A powerful query language, INESS Search, has been developed for search across formalisms in the INESS treebanks, including LFG c- and f-structures. Evaluation shows that the grammar provides about 85% of randomly selected sentences with good analyses. Agreement among the annotators responsible for manual disambiguation is satisfactory, but also suggests desirable simplifications of the grammar.

Meurer, Paul; Rosén, Victoria; De Smedt, Koenraad (2016). Interactive Visualizations in the INESS Treebanking Infrastructure. In: Proceedings of the LREC 2016 Workshop VisLR II: Visualization as Added Value in the Development, Use and Evaluation of Language Resources.

The visualization of syntactic analyses may be challenging due to the number of readings, the size and detail of the structures, and the interrelations between levels of linguistic description. We present a range of interactive visualization techniques applied to complex syntactic analyses in INESS, an online infrastructure for parsing and the annotation and exploration of syntactically annotated corpora (treebanks). Although INESS caters to many syntactic formalisms, we focus on LFG, which allows for multiple levels of syntactic structure, in particular c-structures and f-structures. Interactive dynamic renderings of the relations between components of these structures are presented, with options on the level of detail to be displayed. Furthermore, the disambiguation of sentences with multiple possible parses needs techniques for visualizing the differences between readings. For this purpose, we present and discuss packed representations, the interactive visualization of discriminants, and the previewing of disambiguation choices. The interactive querying of treebanks benefits from appropriate ways of displaying search results. We present the highlighting of matching items in matching sentences. We also present tabular overviews with frequencies of obtained variable values, as well as the inspection of matching structures without having to navigate away from the overview.

Rosén, Victoria; Thunes, Martha; Haugereid, Petter; Losnegaard, Gyri Smørdal; Dyvik, Helge J. Jakhelln; Meurer, Paul; Samdal, Gunn Inger Lyse; De Smedt, Koenraad (2016). The enrichment of lexical resources through incremental parsebanking. Language Resources and Evaluation 50(2):291–319.

Automatic syntactic analysis of a corpus requires detailed lexical and morphological information that cannot always be harvested from traditional dictionaries. Therefore the development of a treebank presents an opportunity to simultaneously enrich the lexicon. In building NorGramBank, we use an incremental parsebanking approach, in which a corpus is parsed and disambiguated, and after improvements to the grammar and the lexicon, reparsed. In this context we have implemented a text preprocessing interface where annotators can enter unknown words or missing lexical information either before parsing or during disambiguation. The information added to the lexicon in this way may be of great interest both to lexicographers and to other language technology.

Rehm, Georg; Uszkoreit, Hans; Ananiadou, Sophia; Bel, Núria; Bielevičienė, Audronė; Borin, Lars; Branco, António; Budin, Gerhard; Calzolari, Nicoletta; Daelemans, Walter; Garabik, Radovan; Grobelnik, Marko; Garcia-Mateo, Carmen; van Genabith, Josef; Hajič, Jan; Hernaez, Inma; Judge, John; Koeva, Svetla; Krek, Simon; Krstev, Cvetana; Lindén, Krister; Magnini, Bernardo; Mariani, Joseph; McNaught, John; Melero, Maite; Monachini, Monica; Moreno, Asunción; Odijk, Jan; Ogrodniczuk, Maciej; Pęzik, Piotr; Piperidis, Stelios; Przepiórkowski, Adam; Rögnvaldsson, Eiríkur; Rosner, Mike; Pedersen, Bolette Sandford; Skadiņa, Inguna; De Smedt, Koenraad; Tadič, Marko; Thompson, Paul; Tufiş, Dan; Váradi, Tamás; Vasiļjevs, Andrejs; Vider, Kadri; Zabarskaitė, Jolanta (2016). The strategic impact of META-NET on the regional, national and international level. Language Resources and Evaluation 50(2):351–374.

This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. The article documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

Rosén, Victoria; De Smedt, Koenraad; Losnegaard, Gyri Smørdal; Bejček, Eduard; Savary, Agata; Osenova, Petya (2016). MWEs in treebanks: From survey to guidelines. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association.

By means of an online survey, we have investigated ways in which various types of multiword expressions are annotated in existing treebanks. The results indicate that there is considerable variation in treatments across treebanks and thereby also, to some extent, across languages and across theoretical frameworks. The comparison is focused on the annotation of light verb constructions and verbal idioms. The survey shows that the light verb constructions either get special annotations as such, or are treated as ordinary verbs, while VP idioms are handled through different strategies. Based on insights from our investigation, we propose some general guidelines for annotating multiword expressions in treebanks. The recommendations address the following application-based needs: distinguishing MWEs from similar but compositional constructions; searching distinct types of MWEs in treebanks; awareness of literal and nonliteral meanings; and normalization of the MWE representation. The cross-lingually and cross-theoretically focused survey is intended as an aid to accessing treebanks and an aid for further work in treebank annotation.

De Smedt, Koenraad; Rosén, Victoria; Meurer, Paul (2015). Studying consistency in UD treebanks with INESS-Search. In: Proceedings of the Fourteenth Workshop on Treebanks and Linguistic Theories (TLT14), pp. 258–267. Warsaw, Poland: Institute of Computer Science, Polish Academy of Sciences (IPI-PAN).

We demonstrate how treebank search may be helpful in examining the consistency with which annotations are applied, both within and across treebanks. Universal Dependency (UD) treebanks are used as examples.

Samdal, Gunn Inger Lyse; Meurer, Paul; De Smedt, Koenraad (2015). COMEDI: A component metadata editor. In: Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands, pp. 82–98. Linköping University Electronic Press.

The flexibility of component metadata (CMDI) brings about a need for editing tools which are equally flexible. Moreover, such tools should be as user friendly as possible in order to lower the threshold for beginners and to promote efficiency even for advanced users. The current paper presents COMEDI, a new web-based editor which handles any CMDI profile. We evaluate currently existing metadata editors and argue that the COMEDI editor is the first one to combine a good level of user-friendliness with sufficient support for CMDI. COMEDI also offers up to date support for CLARIN features such as current license types.

Victoria Rosén, Gyri Losnegaard, Koenraad De Smedt, Eduard Bejček, Agata Savary, Adam Przepiórkowski, Petya Osenova and Verginica Barbu Mititelu (2015). A survey of multiword expressions in treebanks. In: Proceedings of the Fourteenth Workshop on Treebanks and Linguistic Theories (TLT14), pp. 179–193. Warsaw, Poland: Institute of Computer Science, Polish Academy of Sciences (IPI-PAN).

We present the methodology and results of a survey on the annotation of multiword expressions in treebanks. The survey was conducted using a wiki-like website filled out by people knowledgeable about various treebanks. The survey results were studied with a comparative focus on prepositional MWEs, verb-particle constructions and multiword named entities.

De Smedt, Koenraad; Hinrichs, Erhard; Meurers, Detmar; Skadiņa, Inguna; Pedersen, Bolette Sandford; Navarretta, Costanza; Bel, Núria; Lindén, Krister; Lopatková, Markéta; Hajič, Jan; Andersen, Gisle; Lenkiewicz, Przemysław (2014). CLARA: A new generation of researchers in common language resources and their applications. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 2166–2174. European Language Resources Association.

CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multimodal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.

Victoria Rosén, Petter Haugereid, Martha Thunes, Gyri Smørdal Losnegaard, Helge Dyvik, and Paul Meurer (2014). The interplay between lexical and syntactic resources in incremental parsebanking. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1617–1624. European Language Resources Association (ELRA), 2014.

Automatic syntactic analysis of a corpus requires detailed lexical and morphological information that cannot always be harvested from traditional dictionaries. In building the INESS Norwegian treebank, it is often the case that necessary lexical information is missing in the morphology or lexicon. The approach used to build the treebank is incremental parsebanking; a corpus is parsed with an existing grammar, and the analyses are efficiently disambiguated by annotators. When the intended analysis is unavailable after parsing, the reason is often that necessary information is not available in the lexicon. INESS has therefore implemented a text preprocessing interface where annotators can enter unrecognized words before parsing. This may concern words that are unknown to the morphology and/or lexicon, and also words that are known, but for which important information is missing. When this information is added, either during text preprocessing or during disambiguation, the result is that after reparsing the intended analysis can be chosen and stored in the treebank. The lexical information added to the lexicon in this way may be of great interest both to lexicographers and to other language technology efforts, and the enriched lexical resource being developed will be made available at the end of the project.

Sebastian Sulger, Miriam Butt, Tracy Holloway King, Paul Meurer, Tibor Laczkó, György Rákosi, Cheikh Bamba Dione, Helge Dyvik, Victoria Rosén, Koenraad De Smedt, Agnieszka Patejuk, Özlem Çetinoğlu, I Wayan Arka, and Meladel Mistica (2013). ParGramBank: The ParGram parallel treebank. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 550–560, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.

This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. The treebank is publicly available via the INESS treebanking environment, which also allows for the alignment of language pairs. We thus present a unique, multilayered parallel treebank that represents more and different types of languages than are available in other treebanks, that represents deep linguistic knowledge and that allows for the alignment of sentences at several levels: dependency structures, constituency structures and POS information.

Helge Dyvik, Martha Thunes, Petter Haugereid, Victoria Rosén, Paul Meurer, Koenraad De Smedt, and Gyri Smørdal Losnegaard (2013). Studying interannotator agreement in discriminant-based parsebanking. In Sandra Kübler, Petya Osenova, and Martin Volk, editors, Proceedings of the Twelfth Workshop on Treebanks and Linguistic Theories (TLT12), pages 37–48. Bulgarian Academy of Sciences, 2013.

This paper reports on a pilot study on interannotator agreement in discriminant-based parsebanking, especially with a view to uncovering linguistic issues in the grammar and lexicon. We classify the annotator discrepancies according to their causes, pointing to different strategies for avoiding them in the future, e.g. with regard to documentation or to grammar and lexicon development, and discuss a selection of examples.

Gyri Smørdal Losnegaard, Gunn Inger Lyse, Anje Müller Gjesdal, Koenraad De Smedt, Paul Meurer, and Victoria Rosén (2013). Linking Northern European infrastructures for improving the accessibility and documentation of complex resources. In Koenraad De Smedt, Lars Borin, Krister Lindén, Bente Maegaard, Eiríkur Rögnvaldsson, and Kadri Vider, editors, Proceedings of the workshop on Nordic language research infrastructure at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 20, number 89 in Linköping Electronic Conference Proceedings, pages 44–59. Linköping University Electronic Press, 2013.

This paper describes our integration efforts in two Northern European language infrastructures. Specifically, this work has been a collaboration between the META-NORD team at the University of Bergen and the INESS project, a large treebanking infrastructure project in Norway, in developing and documenting two complex resources, as well as making these accessible to the R&D community.

Paul Meurer, Helge Dyvik, Victoria Rosén, Koenraad De Smedt, Gunn Inger Lyse, Gyri Smørdal Losnegaard, and Martha Thunes (2013). The INESS treebanking infrastructure. In Stephan Oepen, Kristin Hagen, and Janne Bondi Johannessen, editors, Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22–24, 2013, Oslo University, Norway. NEALT Proceedings Series 16, number 85 in Linköping Electronic Conference Proceedings, pages 453–458. Linköping University Electronic Press.

This paper briefly describes the current state of the evolving INESS infrastructure in Norway which is developing treebanks as well as making treebanks more accessible to the R&D community. Recent work includes the hosting of more treebanks, including parallel treebanks, and increasing the number of parsed and disambiguated sentences in the Norwegian LFG treebank. Other recent improvements include the presentation of metadata and license handling for restricted treebanks. The infrastructure is fully operational and accessible, but will be further improved during the lifetime of the INESS project.

Inguna Skadiņa, Andrejs Vasiļjevs, Lars Borin, Krister Lindén, Gyri Losnegaard, Sussi Olsen, Bolette Sandford Pedersen, Roberts Rozis, and Koenraad De Smedt (2013). Baltic and Nordic parts of the European linguistic infrastructure. In Stephan Oepen, Kristin Hagen, and Janne Bondi Johannessen, editors, Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22–24, 2013, Oslo University, Norway. NEALT Proceedings Series 16, number 85 in Linköping Electronic Conference Proceedings, pages 195–211. Linköping University Electronic Press.

This paper describes scientific, technical, and legal work done on the creation of the linguistic infrastructure for the Nordic and Baltic countries. The paper describes the research on assessment of language technology support for the languages of the Baltic and Nordic countries, work on establishing a language resource sharing infrastructure, and the collection and description of linguistic resources. We present improvements necessary to ensure usability and interoperability of language resources, discuss issues related to intellectual property rights for complex resources, and describe the extension of the infrastructure through integration of language-resource specific repositories. Work on treebanks, wordnets, terminology resources, and finite-state technology is described in more detail. Finally, our approach to ensuring the sustainability of the infrastructure is discussed.

Kristiina Jokinen and Koenraad De Smedt (2012). Computational pragmatics. In Jan-Ola Östman and Jef Verschueren (editors), Handbook of Pragmatics, pages 1–39. Amsterdam: John Benjamins Publishing Company.

Computational pragmatics refers to the study by means of computers of how language is used in context. This chapter is structured along the different types of pragmatic components and various degrees of contextual information needed to model intelligent agents: starting from the sentential context and moving on to two-party interactions and finally to situated interaction with embodied agents.

Victoria Rosén, Paul Meurer, Gyri Smørdal Losnegaard, Gunn Inger Lyse, Koenraad De Smedt, Martha Thunes, and Helge Dyvik (2012). An integrated web-based treebank annotation system. In Iris Hendrickx, Sandra Kübler, and Kiril Simov (editors), Proceedings of the Eleventh International Workshop on Treebanks and Linguistic Theories (TLT11), pages 157–167. Lisbon, Portugal: Edições Colibri.

In recent years a development towards easier access to treebanks has been discernible. Fully online environments for the creation, annotation and exploration of treebanks have however been very scarce so far. In this paper we describe some user needs from the annotator perspective and report on our experience with the development and practical use of a web-based annotation system.

Koenraad De Smedt, Gunn Inger Lyse, Anje Müller Gjesdal, and Gyri Smørdal Losnegaard (2012). The Norwegian Language in the Digital Age / Norsk i den digitale tidsalderen (English/Bokmål). Berlin/Heidelberg: Springer.

This document is part of a series intended to promote knowledge about the status and potential of language technology. The target audience is journalists, politicians, language users, teachers and other interested parties. This language report forms an important part of META-NET's strategic action plan.

Koenraad De Smedt, Gunn Inger Lyse, Anje Müller Gjesdal, and Gyri Smørdal Losnegaard (2012). The Norwegian Language in the Digital Age / Norsk i den digitale tidsalderen (English/Nynorsk). Berlin/Heidelberg: Springer.

This document is part of a series intended to promote knowledge about the status and potential of language technology. The target audience is journalists, politicians, language users, teachers and other interested parties. This language report forms an important part of META-NET's strategic action plan.

Victoria Rosén, Koenraad De Smedt, Paul Meurer, and Helge Dyvik (2012). An Open Infrastructure for Advanced Treebanking. In Jan Hajič, Koenraad De Smedt, Marko Tadič and António Branco (eds.), META-RESEARCH Workshop on Advanced Treebanking at LREC2012 (pp. 22–29). Istanbul, Turkey.

Increases in the number and size of treebanks, and the complexity of their annotation, present challenges to their exploration by the research community. Adhering to different formalisms, lacking clear standards, and requiring specialized search and visualization and other services, treebanks have not been widely accessible to a broad audience and have remained underexploited. The INESS project is providing the first infrastructure integrating treebank annotation, analysis and distribution, bringing together treebanks for many different languages, spanning different annotation schemes and including parallel treebanks. The infrastructure offers a uniform interface, interactive visualizations, leading edge search capabilities and high performance computing.

Gyri Smørdal Losnegaard, Gunn Inger Lyse, Martha Thunes, Victoria Rosén, Koenraad De Smedt, Helge Dyvik, and Paul Meurer (2012). What We Have Learned from Sofie: Extending Lexical and Grammatical Coverage in an LFG Parsebank. In Jan Hajič, Koenraad De Smedt, Marko Tadič and António Branco (eds.), META-RESEARCH Workshop on Advanced Treebanking at LREC2012 (pp. 69–76). Istanbul, Turkey.

Constructing a treebank as a dynamically parsed corpus is an iterative process which may effectively lead to improvements of the grammar and lexicon. We show this from our experiences with semiautomatic disambiguation of a Norwegian LFG parsebank. The main types of grammar and lexicon changes necessary for achieving improved coverage are analyzed and discussed. We show that an important contributing factor to missing coverage is missing multiword expressions in the lexicon.

Gunn Inger Lyse, Carla Parra Escartín, and Koenraad De Smedt (2012). Applying current metadata initiatives: The META-NORD experience. In Victoria Arranz, Daan Broeder, Bertrand Gaiffe, Maria Gavrilidou, Monica Monachini, and Thorsten Trippel (eds.), Proceedings of the Workshop on Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LRs (pp. 13–20). Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).

In this paper we present our experiences with metadata in the Norwegian part of the META-NORD project, exemplifying issues related to the top-level description of language resources and tools (LRT). In recent years new initiatives have appeared as regards long-term accessibility plans for LRT. The META-NORD project, and the broader META-SHARE initiative in the META-NET network, are among the initiatives working on the standardization of the description of linguistic resources as well as on the creation of infrastructures that ensure the long-term curation and distribution of LRT. We present the use cases we have been dealing with in Norway as part of this effort. We also report on the importance of dealing with real use case scenarios to detect and solve potential problems concerning the construction of a larger open infrastructure for LRT.

Andrejs Vasiļjevs, Markus Forsberg, Tatiana Gornostay, Dorte H. Hansen, Kristín M. Jóhannsdóttir, Krister Lindén, Gunn I. Lyse, Lene Offersgaard, Ville Oksanen, Sussi Olsen, Bolette S. Pedersen, Eiríkur Rögnvaldsson, Roberts Rozis, Inguna Skadiņa, and Koenraad De Smedt (2012). Creation of an open shared language resource repository in the Nordic and Baltic countries. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis (editors), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 1076–1083, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). ISBN 978-2-9517408-7-7.

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, and main actors in academia, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; and documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. This paper gives an overview of the three horizontal multilingual actions in META-NORD: linking and validating Nordic and Baltic wordnets, the harmonisation of multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

Van Kesteren, Ron; Dijkstra, Ton; De Smedt, Koenraad (2012). Markedness effects in Norwegian–English bilinguals: Task-dependent use of language-specific letters and bigrams. Quarterly Journal of Experimental Psychology: 65. ISSN 1747-0218 (print), 1747-0226 (online). DOI 10.1080/17470218.2012.679946.

This study investigates how bilinguals use sublexical language membership information to speed up their word recognition process in different task situations. Norwegian–English bilinguals performed a Norwegian–English language decision task, a mixed English lexical decision task, or a mixed Norwegian lexical decision task. The mixed lexical decision experiments included words from the nontarget language that required a “no” response. The language specificity of the Bokmål (a Norwegian written norm) and English (non)words was varied by including language-specific letters (“smør”, “hawk”) or bigrams (“dusj”, “veal”). Bilinguals were found to use both types of sublexical markedness to facilitate their decisions, language-specific letters leading to larger effects than language-specific bigrams. A cross-experimental comparison indicates that the use of sublexical language information was strategically dependent on the task at hand and that decisions were based on language membership information derived directly from sublexical (bigram) stimulus characteristics instead of indirectly via their lexical representations. Available models for bilingual word recognition fail to handle the observed marker effects, because all consider language membership as a lexical property only.

De Smedt, Koenraad (2012): Ash compound frenzy: A case study in the Norwegian Newspaper Corpus. In Andersen, Gisle (ed.) Exploring Newspaper Language. Using the web to create and investigate a large corpus of modern Norwegian (Studies in Corpus Linguistics Vol. 49) (pp. 241–255). Amsterdam: John Benjamins. ISBN 9789027203540 (hardcover), 9789027274991 (ebook), ISSN 1388-0373.

The creation and use of Norwegian compounds with aske ‘ash’ in the media in the spring of 2010 was remarkable. The Norwegian Newspaper Corpus has provided suitable data for an empirical study concerning neology and ‘burstiness’ on this topic. The results indicate that the number of new types as well as token frequencies rose markedly and the variety of compounds in use remained elevated for over a month.

Vasiļjevs, Andrejs, Pedersen, Bolette Sandford, De Smedt, Koenraad, Borin, Lars and Skadiņa, Inguna (2011). META-NORD: Baltic and Nordic Branch of the European Open Linguistic Infrastructure. In: Sjur Nørstebø Moshagen and Per Langgård (eds.), Proceedings of the NODALIDA 2011 Workshop Visibility and Availability of LT Resources, volume 13 of NEALT Proceedings Series (pp. 18–22). Northern European Association for Language Technology (NEALT).

This position paper presents the META-NORD project, which develops the Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and applications. The goals of the project, the overall approach and specific focus lines on wordnets, terminology resources and treebanks are described.

Skadiņa, Inguna, Vasiļjevs, Andrejs, Borin, Lars, De Smedt, Koenraad, Lindén, Krister and Rögnvaldsson, Eiríkur (2011). META-NORD: Towards Sharing of Language Resources in Nordic and Baltic Countries. In: Nicoletta Calzolari, Toru Ishida, Stelios Piperidis and Virach Sornlertlamvanich (eds.), Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm, Chiang Mai, Thailand (pp. 107–114). Asian Federation of Natural Language Processing.

This paper introduces the META-NORD project, which develops the Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and applications. The goals of the project, the overall approach and specific action lines on wordnets, terminology resources and treebanks are described. Moreover, results achieved in the first five months of the project, i.e. language whitepapers, metadata specification and IPR management, are presented.

De Smedt, Koenraad and Rögnvaldsson, Eiríkur (2011). The META-NORD Language Reports. In: Sjur Nørstebø Moshagen and Per Langgård (eds.), Proceedings of the NODALIDA 2011 Workshop Visibility and Availability of LT Resources, volume 13 of NEALT Proceedings Series (pp. 23–27). Northern European Association for Language Technology (NEALT).

As part of the META-NORD project, the state of affairs in language technology in the Nordic and Baltic countries is being described in a set of eight reports. Each language report describes the situation of a language community and the position of the language service and language technology industry for that language. This position paper presents our methodology and preliminary findings. The final reports will be published in the META-NET series of white papers for all main languages of Europe.

Galen Gisler, Elena Celledoni, Trygve Ulf Helgaker, Trond Iversen, Kjetill Sigurd Jakobsen, Colin Jones, Anna Lipniacka, Arvid Lundervold, Nils Reidar B. Olsen, Koenraad De Smedt, Jacko Koster, and Gudmund Høst (2010). The scientific case for eInfrastructure in Norway. Report, Research Council of Norway.

Many Norwegian research groups use high-performance computing, fast data networks, archival storage, and associated services. To help these research groups maintain their internationally leading positions, the Research Council of Norway has invested in the development of the necessary infrastructure through the eVITA programme, especially in the programs Notur, NorGrid, and NorStore. While it is important to obtain and use the best available hardware, no less important are the human resources for supporting and maintaining them and the institutes necessary for housing them. All this comprises a new kind of societal infrastructure, an electronic infrastructure or eInfrastructure. This eInfrastructure is not associated with individual projects or institutions, but exists at a national level. It is a societal infrastructure, an indispensable part of a “well-functioning research system”. As research is vital to the future of any nation, so is eInfrastructure as vital as power lines and roads. The present eInfrastructure initiative, running from 2005 to 2014, is managed by the eVITA Programme Committee, which reports to the Research Council. Looking beyond the end of this initiative, the Programme Committee appointed the eInfrastructure Scientific Opportunities Panel to assess the future growth of scientific needs for eInfrastructure in Norway. This report is authored by that Panel. The intention of this assessment is to furnish the eVITA Programme Committee and the Research Council with arguments from the research community for the further development of eInfrastructure. It might be expected that scientists, given the opportunity to speculate on future needs, would simply ask for more resources. But the more careful and realistic assessment attempted in this report considers concrete problems that require solutions; those solutions entail eInfrastructure.

Rosén, Victoria and De Smedt, Koenraad (2010). Syntactic annotation of learner corpora. In: Hilde Johansen, Anne Golden, Jon Erik Hagen, and Ann-Kristin Helland (eds.) Systematisk, variert, men ikke tilfeldig (pp. 120–132). Novus forlag.

Syntactic annotation of learner corpora is useful for investigating the grammatical properties of learner language. We discuss two approaches to syntactic annotation based on different methodological choices. One approach, recently proposed in the literature, is the manual annotation of learner language with dependency relations. Another approach, which we present as an alternative, is based on automatic parsing of a ‘correct’ version with an L2 grammar.

Gillis, Steven; Daelemans, Walter; De Smedt, Koenraad (2009). Artificial Intelligence. In: Sandra, Dominiek; Östman, Jan-Ola; Verschueren, Jef (eds.) Cognition and Pragmatics (Handbook of Pragmatics Highlights, Vol. 3, pp. 16–40). Amsterdam: John Benjamins.

Artificial intelligence (AI) is a branch of computer science in which methods and techniques are developed that permit intelligent computer systems to be built. We briefly describe the foundations of AI and review the history of language processing research in AI. Then we explain general aspects of knowledge representation and the most influential knowledge-based paradigms. Finally, we show how linguistic symbol manipulation is applied in semantics and pragmatics.

Dyvik, Helge; Meurer, Paul; Rosén, Victoria; De Smedt, Koenraad (2009). Linguistically Motivated Parallel Parsebanks. In: Passarotti, Marco; Przepiórkowski, Adam; Raynaud, Sabine; Van Eynde, Frank (Eds.) Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (pp. 71-82). Milano: EDUCatt.

Parallel grammars and parallel treebanks can be a useful method for studying linguistic diversity and commonality. We use this approach to study how arguments to similar predicates are realized across languages. To that end, we formulate formal principles for aligning at phrase and word levels based on translational correspondences at predicate-argument level. A first version of a new tool for creating, storing, visualizing and searching treebank alignment at different levels has been constructed.

De Smedt, Koenraad (2009). NLP for writing: What has changed? In: Domeij, Rickard; Johansson Kokkinakis, Sofie; Knutsson, Ola; Sofkova Hashemi, Sylvana (Eds.) Proceedings of the Workshop on NLP for Reading and Writing – Resources, Algorithms and Tools (SLTC 2008). NEALT Proceedings Series, Vol. 3 (pp. 1–11). Tartu University Library.

It might appear that few advances have been made in proofreading technology since the 1980s. On the one hand, spelling and grammar checking have become standard features in many kinds of applications that involve writing. On the other hand, a number of advanced research ideas and results from the 1980s do not seem to have been applied or further pursued in newer research. The present moment is therefore an opportunity to look back and reflect on what has been done so far and what has changed.

Rosén, Victoria; Meurer, Paul; De Smedt, Koenraad (2009). LFG Parsebanker: A Toolkit for Building and Searching a Treebank as a Parsed Corpus. In: Van Eynde, Frank; Frank, Anette; De Smedt, Koenraad; Van Noord, Gertjan (Eds.) Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories, Groningen, The Netherlands, January 23–24, 2009 (pp. 127–133). Utrecht: LOT.

We present the LFG Parsebanker, a comprehensive toolkit for interactive incremental construction of a treebank as a parsed corpus. This web-based toolkit offers an environment for batch and interactive parsing, versioning, inspection of structures, discriminant-based disambiguation, and statistics. It has recently been extended with a structural search facility.

Heeringa, Wilbert; Gooskens, Charlotte; De Smedt, Koenraad (2008). What Role does Dialect Knowledge Play in the Perception of Linguistic Distances? International Journal of Humanities and Arts Computing 2, 243-259.

The present paper investigates to what extent subjects base their judgments of linguistic distances on actual dialect data presented in a listening experiment and to what extent they make use of previous knowledge of the dialects when making their judgments. The point of departure for our investigation were distances between 15 Norwegian dialects as perceived by Norwegian listeners. We correlated these perceptual distances with objective phonetic distances measured on the basis of the transcriptions of the recordings used in the perception experiment. In addition, we correlated the perceptual distances with objective distances based on other datasets. On the basis of the correlation results and multiple regression analyses we conclude that the listeners did not base their judgments solely on information that they heard during the experiments but also on their general knowledge of the dialects. This conclusion is confirmed by the fact that the effect is stronger for the group of listeners who recognised the dialects than for listeners who did not recognise the dialects on the tape.

De Smedt, Koenraad (2008). Processing and distributing language data. Meta 2008 (2), 12–14.

Rosén, Victoria; Meurer, Paul; De Smedt, Koenraad (2007). Designing and Implementing Discriminants for LFG Grammars. In: King, Tracy Holloway and Butt, Miriam (Eds.) The Proceedings of the LFG '07 Conference (pp. 397–417). Stanford: CSLI Publications.

We extend discriminant-based disambiguation techniques to LFG grammars. We present the design and implementation of lexical, morphological, c-structure and f-structure discriminants for an LFG-based parser. Chief considerations in the computation of discriminants are capturing all distinctions between analyses and relating linguistic properties to words in the string. Our work is mostly tested on Norwegian, but our approach is independent of the language and grammar.

Rosén, Victoria; De Smedt, Koenraad (2007). Theoretically motivated Treebank coverage. In: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit (Eds.) Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007 (pp. 152–159). Tartu: University of Tartu.

The question of grammar coverage in a treebank is addressed from the perspective of language description, not corpus description. We argue that a treebanking methodology based on parsing a corpus does not necessarily imply worse coverage than grammar induction based on a manually annotated corpus.

Lech, Till Christopher; De Smedt, Koenraad (2007). Ontology extraction for coreference chaining. In: Johansson, Christer (Ed.) Proceedings from the first Bergen Workshop on Anaphora Resolution (WAR I) (pp. 26-38). Newcastle upon Tyne: Cambridge Scholars Publishing.

The KunDoc project investigates coreference chaining with ontology-based methods. In this paper, we discuss knowledge-based methods for coreference chaining and in particular the use of ontologies and their acquisition from a corpus. We present the KunDoc methodology and its implementation. We use concepts and their interrelations extracted from a corpus of Norwegian newspaper articles to build up domain-specific ontologies which contribute selectional restrictions on possible co-referents. We expect to see an improvement over methods that do not employ any semantic knowledge.

Rosén, Victoria; De Smedt, Koenraad; Meurer, Paul (2006). Towards a Toolkit Linking Treebanking to Grammar Development. In: Hajič, J.; Nivre, J. (Eds.) TLT 2006: Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories, December 1-2, 2006, Prague, Czech Republic (pp. 55–66). Institute of formal and applied linguistics, Prague, Czech Republic.

We present advances in the construction of a treebanking toolkit that implements discriminants at several levels and we present improvements in its web-based interface. We will first outline our use of discriminants in the context of LFG-based parsing. Then, we will highlight some new features in our treebanking interface, including statistics. Finally, we will discuss the linking of treebanking and grammar development.

Lech, Till Christopher; De Smedt, Koenraad (2006). Dreistadt: A language enabled MOO for language learning. In: Proceedings of the combined Workshop on Language-Enabled Educational Technology and Development and Evaluation of Robust Spoken Dialogue Systems, Riva del Garda (Italy), Aug. 28, 2006 (pp. 38-44).

Dreistadt is an educational MOO (Multi User Domain, Object Oriented) for language learning. It presents a virtual world in which learners of German communicate with their fellow learners, teachers and native language users in other locations via the Internet. While the original Dreistadt had an artificial command language for interaction with the system, we have provided it with natural language processing capabilities, in order to allow a more seamless linguistic interaction. For this purpose, an NLP interface for controlled German has been added. The student's natural language commands are translated to system internal instructions by a set of syntactic, semantic and pragmatic analysis tools. The system is capable of handling pronouns and other referring expressions by applying domain knowledge and includes an inferencing component based on predicate logic.

Rosén, Victoria; De Smedt, Koenraad; Dyvik, Helge; Meurer, Paul (2005). TREPIL: Developing Methods and Tools for Multilevel Treebank Construction. In: Civit, Montserrat; Kübler, Sandra; Martí, Ma. Antònia (Eds.) Proceedings of TLT'05 (pp. 161-172).

Current trends in language technology require treebanks that do not stop at the level of constituent structure, but include deeper and richer levels of analysis, including appropriate meaning structures. Capturing sufficient detail at different levels of linguistic description is too complex a task to be practically achievable by manual annotation or shallow parsing; rather it requires sophisticated tools that help secure the consistency of parallel but different structures. We are constructing a multilevel treebanking tool that incorporates a deep parser and grammar for Norwegian. Thus, we are tightly linking our treebank to grammar development so as to achieve a sound embedding in grammatical theory and yield more useful results for applications.

Rosén, Victoria; Meurer, Paul; De Smedt, Koenraad (2005). Constructing a parsed corpus with a large LFG grammar. In: Butt, Miriam; King, Tracy (Eds.) Proceedings of the LFG'05 Conference, University of Bergen (pp. 371-387). CSLI Publications.

The TREPIL project (Norwegian treebank pilot project 2004-2008) is aimed at developing and testing methods for the construction of a Norwegian parsed corpus. Annotation of c-structures, f-structures and mrs-structures is based on automatic parsing with human validation and disambiguation. Parsing is done with a large LFG grammar and the XLE parser. We propose a method for efficient disambiguation based on discriminants and we have implemented a set of computational tools for this purpose.

Nivre, Joakim; De Smedt, Koenraad; Volk, Martin (2005). Treebanking in Northern Europe: A White Paper. In: Holmboe, Henrik (Ed.) Nordisk Sprogteknologi 2004: Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (pp. 97-112). København: Museum Tusculanums Forlag.

We present the case for an extensive scientific effort to build up large treebanks for the Nordic and Baltic languages, as a step towards developing advanced multilingual communication technologies for these languages in the future.

Lech, Till; De Smedt, Koenraad (2005). Enhancing Semantic Annotation through Coreference Chaining: An Ontology-based Approach. In: Handschuh, Siegfried; Declerck, Thierry; Koivunen, Marja-Riitta (Eds.) Proceedings of the 5th International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot 2005) (pp. 131-136). CEUR Workshop Proceedings, Vol. 185.

Semantic annotation of natural language text requires a certain degree of understanding of the document in question. Especially the resolution of unclear reference is a major challenge when detecting relevant information units in a document. The ongoing KunDoc project examines how domain specific ontologies can support the task of coreference chaining in order to enhance applications such as automatic annotation, information extraction or automatic summarization. In this paper, we present a robust methodology for acquisition of semantic contexts that does not depend on a thorough syntactic parsing as necessary tools often are unavailable for “smaller” languages. Based on a shallow corpus-analysis, verb-subject relations constitute the framework for the extraction of semantic contexts. Our approach either adds the semantic contexts to concepts and instances in an existing ontology or builds up the domain knowledge necessary for coreference chaining from scratch.

De Smedt, Koenraad; Liseth, Anja; Hassel, Martin; Dalianis, Hercules (2005). How short is good? An evaluation of automatic summarization. In: Holmboe, Henrik (Ed.) Nordisk Sprogteknologi 2004: Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (pp. 267-287). København: Museum Tusculanums Forlag.

The evaluation of automatic summarization is important and challenging, since in general it is difficult to agree on an ideal summary of a text. We report on research advances in summarization evaluation obtained in the context of ScandSum, a researcher network targeted at automatic summarization for the Scandinavian languages, supported by the Nordic Council of Ministers under its Language Technology programme (2000-2004).

De Smedt, Koenraad; Andersen, Gisle (2005). Linking a Norwegian web portal for Language Technology to its Nordic partner sites. In: Holmboe, Henrik (Ed.) Nordisk Sprogteknologi 2004: Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (pp. 35-42). København: Museum Tusculanums Forlag.

In order to achieve a truly Nordic perspective, the Norwegian Documentation Centre for Language Technology has during its four years of activities (2001-2004) intensively cooperated with the other Nordic documentation centres in Denmark, Finland, Iceland and Sweden. This cooperation has eventually crystallized into a system for linking together the various national web portals. The Norwegian portal is aimed at providing a news service as well as a comprehensive and updated survey of activities in the field, language resources, networks and contact information of the participants. This information has been made searchable across all the Nordic documentation centres.

Fersøe, Hanne; Rögnvaldsson, Eiríkur; De Smedt, Koenraad (2005). NorDokNet: Network of Nordic Documentation Centres - Contacts to Future Baltic Partners. In: Holmboe, Henrik (Ed.) Nordisk Sprogteknologi 2004: Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (pp. 13-27). København: Museum Tusculanums Forlag.

The Nordic network of national documentation centres for language technology, NorDokNet, has been operational since September 1st of 2001. The results of the network collaboration are visible on the web portals of the national documentation centres, where the organization and classification of the information follow a common framework for the data as decided in the network. The paper focuses on the visit of a delegation to the Baltic countries carried out from the 24th to the 29th of October, 2005.

Dalianis, Hercules; Hassel, Martin; De Smedt, Koenraad; Liseth, Anja; Lech, Till Christopher; Wedekind, Jürgen (2004). Porting and evaluation of automatic summarization. In: Holmboe, Henrik (Ed.) Nordisk Sprogteknologi 2003: Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (pp. 107-121). Københavns Universitet: Museum Tusculanums Forlag.

In the context of the Nordic research network on summarization SCANDSUM, a Swedish system for automatic summarization has been ported to Danish and Norwegian, in addition to a number of other languages including even Farsi. The principles and techniques of this research are described. Furthermore, the system is being extensively evaluated. It is argued that evaluation is an integral and important part of the research effort.

De Smedt, Koenraad; Black, William J. (2002). Humanities. In: Adelsberger, H. H.; Collis, B.; Pawlowski, J. M. (Eds.) Handbook on Information Technologies for Education and Training (pp. 495-522). Heidelberg: Springer.

Innovation in humanities education and research is stimulated by new technologies for the processing of language, speech, music, visual arts, and other expressions of the human mind. Three main topics will be discussed: the role of large scale resources such as text corpora and digital archives; advanced methods and tools for processing and simulation; and the use of courseware, multimedia and hypermedia in humanities teaching and learning.

De Smedt, Koenraad (2002). Some reflections on studies in humanities computing. Literary and Linguistic Computing 17, 89-101.

In any academic field, research advances tend to percolate naturally to higher education in that field. In recent years, there has been a slow but steady increase in the number of courses and degree programs in humanities computing. This article presents some reflections on the status of humanities computing in higher education, in terms of curricula, degrees, and international student and staff mobility. The most important issue is the question of what a humanities computing degree should offer, in view of the wide interdisciplinarity of the field. Different institutions have coped with this question in quite different ways. With potentially far-reaching consequences on methodology in the various relevant disciplines, humanities computing is bound to change both what and how humanities students learn. Curriculum innovation that aims to integrate computing in the humanities is a difficult process that requires reflection, cooperation, teacher training and other supporting actions.

Birkeland, Ingebjørg; De Smedt, Koenraad (2002). Thematic Networks in Higher Education: A Norwegian Perspective. Report to the Ministry of Education and Research, Norway.

This report presents a preliminary assessment of the SOCRATES/ERASMUS action "Thematic Network Projects" with special reference to Norwegian participation. Part I is a brief presentation of background information. Part II analyses some results, particularly from 16 projects in the period 1996-2000. In Part III, the impact of the action and its results is analysed and further discussed.

De Smedt, Koenraad; Rosén, Victoria (2000). Automatic proofreading for Norwegian: The challenges of lexical and grammatical variation. In: Nordgård, Torbjørn (Ed.) NODALIDA'99: Proceedings from the 12th "Nordiske datalingvistikkdager", Trondheim, December 9-10, 1999 (pp. 206-215). Trondheim: NTNU.

In this paper we present some techniques, experiences and results from the SCARRIE project, which has aimed at developing improved proofreading tools for the Scandinavian languages. The focus is on methods which were used for spelling and grammar checking and particularly some novel analyses and treatments dealing with the extensive lexical and grammar variation in Norwegian Bokmål. The major findings are that (1) since in Bokmål, lexical variants may differ with respect to grammatical features, stylistic replacement at the word level causes a need for grammar checking, and (2) the different systems for gender agreement in Bokmål can be handled in an economical way by a single grammar and lexicon if the features in the lexicon are interpreted dynamically depending on the subnorm or style preferred by the author.

Rosén, Victoria; De Smedt, Koenraad (2000). *Er korrekturlesningsevnen di god? Resultater fra SCARRIE. In: Jansen Westvik, O.; Swan, T.; Mørck, E.; Lorentz, O. (Eds.) Nordlyd: Tromsø University working papers on language & linguistics 28, 214-228. Tromsø: University of Tromsø.

SCARRIE is an EU-funded research project on automatic proofreading for the Scandinavian languages. The HIT centre was responsible for the Norwegian part of the project, which only had funding to cover Bokmål. In this article we describe in more detail some of the methods that were used and the results that were achieved. We conclude that the quality of the underlying lexical material is decisive for the performance of almost all components of the system. We have invested heavily in the SCARRIE word list and believe that it is currently among the best lexical sources for Bokmål. Style values are a valuable additional resource for stylistically appropriate correction of Bokmål. Grammatical correction has been developed so that it works reasonably well in vitro, but it is very difficult to carry out reliably in authentic texts. Advanced compound analysis, by contrast, can contribute to a dramatic improvement in word recognition and thereby to more satisfactory automatic proofreading.

De Smedt, Koenraad; Rosén, Victoria (1999). Datamaskinell skrivestøtte. In: Lindgren, Birgitta (Ed.) Språk i Norden 1999 (pp. 20-32). Oslo: Novus.

For smaller languages, including the Scandinavian languages, a smaller selection of tools is currently available. The newspaper and publishing industries nevertheless have a clear need for various forms of automated writing support. Proofreading is always normative, whether the process is carried out manually or with the help of computers. Many types of knowledge must be encoded and combined in order to handle the different tasks within proofreading. Within the Scarrie project, extensive work has been carried out on lexical and other variation related to subnorms in Bokmål.

De Smedt, Koenraad; Gardiner, Helen; Ore, Espen; Orlandi, Tito; Short, Harold; Souillot, Jacques; Vaughan, William (Eds.) (1999). Computing in humanities education: A European perspective. Bergen: University of Bergen, HIT centre.

The book presents an analysis of the status quo with respect to computing in humanities education, points out related developments in society, presents current innovations and plans for the future, identifies best practice and makes proposals and recommendations. The different chapters in this book address these issues from different perspectives and in different disciplines: formal methods; textual scholarship; computational linguistics and language engineering; non-European languages; and history of art, architecture and design. A common conclusion is that computing is strongly affecting all the humanities disciplines and is a catalyst for educational innovation and cooperation, while at the same time presenting great challenges for higher education systems.

Orlandi, Tito; Burnard, Lou; Buzzetti, Dino; De Smedt, Koenraad; Kropac, Ingo; Souillot, Jacques; Thaller, Manfred (1999). European studies on formal methods in the humanities. In: De Smedt, Koenraad; Gardiner, Helen; Ore, Espen; Orlandi, Tito; Short, Harold; Souillot, Jacques; Vaughan, William (Eds.) Computing in humanities education: A European perspective (pp. 13-62). University of Bergen: HIT centre.

This chapter presents an investigation of some methodological questions related to teaching and learning at humanities faculties in Europe, in particular those arising from the use of digital technologies.

De Smedt, Koenraad; Black, William; Van den Bosch, Antal; Lavid López, Julia; Mc Kevitt, Paul; Way, Andy (1999). European studies on computational linguistics. In: De Smedt, Koenraad; Gardiner, Helen; Ore, Espen; Orlandi, Tito; Short, Harold; Souillot, Jacques; Vaughan, William (Eds.) Computing in humanities education: A European perspective (pp. 89-154). University of Bergen: HIT centre.

This chapter assesses the situation of CL courses and programmes in the European educational landscape, and the use of educational CL tools for teaching and learning.

De Smedt, Koenraad (1998). Advanced computing in the humanities: A network approach. Proceedings of the BITE-conference (Bringing Information Technology to Education), Maastricht, March 25-27, 1998 (pp. 134-140).

Humanities faculties are faced with a challenge to innovate both the learning content and the delivery of courses. The thematic network project on Advanced Computing in the Humanities (ACO*HUM) investigates the impact of new information and communication technologies on humanities education. ACO*HUM brings together one hundred institutions for higher education to develop a common strategy. Attention is paid to curriculum innovation and to new methods for teaching and learning. The current pilot areas are computational linguistics, historical informatics, computing in history of art, and computing for non-European languages. In our experience, the willingness to change humanities education is real but must be supported by substantial international infrastructural measures to secure cooperation and efficiency. Among the necessary measures we name the establishment of an international repository of computational resources, a brokerage for competence in teaching expertise, and technical and organizational support for transnational distributed ODL.

De Smedt, Koenraad (1998). Teaching and learning computational linguistics in an international setting. In: Proceedings of NODALIDA'98, Copenhagen, January 28-29, 1998 (pp. 186-189).

Computational Linguistics has long been a forerunner in the use of humanities computing technology. However, there are many organisational problems to be addressed to maintain and improve the quality of teaching and learning of Computational Linguistics. Advanced Computing in the Humanities (ACO*HUM) is an international network which investigates the use of new technologies in Humanities teaching and learning. It promotes international co-operation, among other things for teaching and learning Computational Linguistics.

De Smedt, Koenraad (1998). Beyond courseware as giftpaper: Computers as exploratory learning tools for the humanities. In: Extended abstracts of The Future of the Humanities in the Digital Age, Bergen, September 25-28, 1998.

It is a common misconception that the mere introduction of multimedia in the classroom, or putting lecture notes on the web, will significantly change education. Most multimedia courseware does not offer more than book-style materials enriched with some multimedia and hypermedia as giftpaper wrapped on the outside. Real pedagogical benefits are to be expected from environments which let the humanities student creatively explore and construct. Examples of advanced simulation environments which meet those needs are Tarski's World and the LFG Workbench. The introduction of creative exploratory tools in the humanities deserves a more important place in educational innovation strategies.

Rosén, Victoria; De Smedt, Koenraad (1998). SCARRIE: Automatisk korrekturlesning for skandinaviske språk. In: Faarlund, Jan Terje; Mæhlum, Brit; Nordgård, Torbjørn (Eds.) Mons 7: Utvalde artiklar frå det 7. Møtet Om Norsk Språk i Trondheim 1997 (pp. 197-210). Oslo: Novus.

In this article we present SCARRIE, an EU-supported research project that addresses how linguistic knowledge can improve automatic proofreading. We first show what kinds of writing errors are found in Norwegian texts. We then illustrate how linguistic knowledge can lead to substantial improvements in programs for automatic proofreading in three areas: phonological analysis, morphological analysis and syntactic analysis.

De Smedt, Koenraad; Horacek, Helmut; Zock, Michael (1996). Architectures for natural language generation: Problems and perspectives. In: Adorni, G.; Zock, M. (Eds.), Trends in natural language generation: An artificial intelligence perspective (Springer lecture notes in artificial intelligence 1036) (pp. 17-46). Berlin: Springer.

Current research in natural language generation is situated in a computational linguistics tradition that was founded several decades ago. We critically analyse some of the architectural assumptions underlying existing systems and point out some problems in the domains of text planning and lexicalization. Guided by the identification of major generation challenges viewed from the angles of knowledge-based systems and cognitive psychology, we sketch some new directions for future research.

Dijkstra, Anton; De Smedt, Koenraad (Eds.) (1996). Computational psycholinguistics: AI and connectionist models of human language processing. London: Taylor & Francis, 1996.

This book is an advanced textbook giving a multidisciplinary overview of current computational models in the domain of human language processing. The first part of the book introduces the two basic paradigms for computational modelling: the Artificial Intelligence paradigm, using symbol manipulation, and the connectionist approach, using neural networks. The second part presents chapters on various subdomains of language comprehension, ranging from speech recognition to discourse comprehension. Part three presents chapters on structure for language production, ranging from discourse planning to articulation and handwriting. Each chapter explains and compares several representative computer models against the background of current experimental and theoretical work in psycholinguistics. The chapters can be read individually, giving a modular structure which allows for a selection of chapters depending on course load or thematic restrictions. Computational psycholinguistics is, therefore, an invaluable course text for advanced students in computational linguistics, psycholinguistics and cognitive science.

Dijkstra, Anton; De Smedt, Koenraad (1996). Computer modelling. In: Dijkstra, Anton; De Smedt, Koenraad (Eds.) (1996). Computational psycholinguistics: AI and connectionist models of human language processing (pp. 3-23). London: Taylor & Francis, 1996.

This chapter provides an introduction to computer modeling and simulation in the field of language understanding and production.

Daelemans, Walter; De Smedt, Koenraad (1996). Computational modelling in Artificial Intelligence. In: Dijkstra, Anton; De Smedt, Koenraad (Eds.) (1996). Computational psycholinguistics: AI and connectionist models of human language processing (pp. 24-48). London: Taylor & Francis, 1996.

The aim of this chapter is to support the readers’ understanding of AI-based computational models of language by introducing the essentials of the methods that underlie those models.

Andriessen, Jerry; De Smedt, Koenraad; Zock, Michael (1996). Discourse planning: Empirical research and computer models. In: Dijkstra, Anton; De Smedt, Koenraad (Eds.) (1996). Computational psycholinguistics: AI and connectionist models of human language processing (pp. 247-278). London: Taylor & Francis, 1996.

This chapter outlines theoretical issues in the production of multisentential discourse, reviews experimental evidence and presents some computer models for this task.

De Smedt, Koenraad (1996). Computational models of incremental grammatical encoding. In: Dijkstra, Anton; De Smedt, Koenraad (Eds.) (1996). Computational psycholinguistics: AI and connectionist models of human language processing (pp. 279-307). London: Taylor & Francis, 1996.

This chapter presents a selection of linguistic and psycholinguistic phenomena, leading to the formulation of some important problems and assumptions in grammatical encoding. Then the chapter focuses on the incremental mode of sentence production. A number of computational models which account for this processing mode are discussed in detail. A comparison is made between models developed for different languages (Dutch, German or English) and different ordering phenomena.

De Smedt, Koenraad; Kempen, Gerard (1996). Discontinuous constituency in Segment Grammar. In: Bunt, Harry; Van Horck, Arthur (Eds.), Discontinuous constituency (pp. 141-163). Berlin: Mouton de Gruyter.

Segment Grammar (SG) is a grammar formalism which is especially suited to model the incremental generation of sentences. SG is characterized by a dual level of syntactic description: f-structures, which are unordered functional structures composed out of syntactic segments, and c-structures, which represent left-to-right order of constituents. True discontinuities in SG are viewed as differences between immediate dominance (ID) relations in c-structures and those in corresponding f-structures. Constructions which are treated in this way include clause union, right dislocation, and fronting. Separable parts of words such as verbs and compound prepositions are not viewed as true discontinuities but as lexical entries consisting of separate syntactic segments.

De Smedt, Koenraad (1995). GeKnIPT: Op kennis gebaseerde informatiepresentatie (een projectvoorstel). In: Noordman, L. G. M.; De Vroomen, W. A. M. (Eds.), Informatiewetenschap 1994: Wetenschappelijke bijdragen aan de Derde StinfoN-Conferentie (pp. 157-165). Leiden: StinfoN.

This contribution outlines the design of a new project proposal in the field of information presentation. The aim of the project is to develop a generic technology for the presentation of information, with particular attention to (1) tailoring to individual users and (2) multimodal presentation. Two practice-oriented prototypes in the field of medical informatics play an important role in the proposed project.

De Smedt, Koenraad; Hovy, Eduard; McDonald, David; Meteer, Marie (1995). The Seventh International Workshop on Natural Language Generation (workshop report). AI Magazine 16 (3), 67-68.

The Seventh International Workshop on Natural Language Generation was held from 21 to 24 June 1994 in Kennebunkport, Maine. Sixty-seven people from 13 countries attended this 4-day meeting on the study of natural language generation in computational linguistics and AI. The goal of the workshop was to introduce new, cutting-edge work to the community and provide an atmosphere in which discussion and exchange would flourish.

Gillis, Steven; Daelemans, Walter; De Smedt, Koenraad (1995). Artificial Intelligence. In Verschueren, J.; Östman, J.-O.; Blommaert, J. (Eds.), Handbook of pragmatics (pp. 61-80). Amsterdam: John Benjamins.

Artificial intelligence (AI) is a branch of computer science in which methods and techniques are developed that permit intelligent computer systems to be built. The meaningful use of a natural language in order to communicate is considered to be a task requiring intelligence, even if the ability of people to speak and understand everyday language were not related to other cognitive abilities. We first briefly review the history of language processing research in AI and sketch the physical symbol system hypothesis which is the philosophical foundation for AI. Then we explain general aspects of knowledge representation and the most influential knowledge-based paradigms. Finally, we show how linguistic symbol manipulation is applied in semantics and pragmatics.

Daelemans, Walter; De Smedt, Koenraad (1994). Default inheritance in an object-oriented representation of linguistic categories. International Journal of Human-Computer Studies 41, 149-177.

We describe an object-oriented approach to the representation of linguistic knowledge. Rather than devising a dedicated grammar formalism, we explore the use of powerful but domain-independent object-oriented languages. We use default inheritance to organize regular and exceptional behavior of linguistic categories. Examples from our work in the areas of morphology, syntax and the lexicon are provided. Special attention is given to multiple inheritance, which is used for the composition of new categories out of existing ones, and to structured inheritance, which is used to predict, among other things, to which rule domain a word form belongs.
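As an illustration of the idea of default inheritance described in this abstract, the following is a minimal sketch in Python. It is not the authors' implementation (their work used other object-oriented languages), and the class names, the English example verbs and the `subcat_frame` method are hypothetical, chosen only to show regular behaviour being inherited by default, an exception overriding it, and multiple inheritance composing a new category.

```python
class Verb:
    """Regular behaviour, inherited by default by all verb categories."""

    def __init__(self, stem):
        self.stem = stem

    def past_tense(self):
        # Default rule: applies unless a more specific category overrides it.
        return self.stem + "ed"

    def present_participle(self):
        return self.stem + "ing"


class IrregularPastVerb(Verb):
    """Exceptional category: overrides only the past tense."""

    def __init__(self, stem, past):
        super().__init__(stem)
        self.past = past

    def past_tense(self):
        return self.past


class Transitive:
    """Hypothetical mixin describing a syntactic property."""

    def subcat_frame(self):
        return ["subject", "object"]


class TransitiveIrregularVerb(Transitive, IrregularPastVerb):
    """New category composed from existing ones by multiple inheritance."""


walk = Verb("walk")
eat = TransitiveIrregularVerb("eat", "ate")

print(walk.past_tense())         # default rule
print(eat.past_tense())          # exception overrides the default
print(eat.present_participle())  # regular behaviour is still inherited
print(eat.subcat_frame())        # property contributed by the mixin
```

Only the exceptional information is stated on the exceptional category; everything else comes from the more general one.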

De Smedt, Koenraad (1994). Parallelism in incremental sentence generation. In Adriaens, Geert; Hahn, Udo (Eds.), Parallel natural language processing (pp. 421-447). Norwood, NJ: Ablex. (Figures)

IPF (Incremental Parallel Formulator) is a computer model in which the formulation stage in sentence generation is distributed among a number of parallel processes. Each conceptual fragment which is passed on to the Formulator gives rise to a new process, which attempts to formulate only that fragment and then exits. The task of each formulation process consists basically of instantiating one or more syntactic segments and attaching these to the present syntactic structure by means of a general unification operation. A shared memory provides the necessary (and only) interaction between parallel processes and allows the integration of segments created by different processes into one syntactic structure. The race between parallel processes in time may partly explain some variations in word order and lexical choice.

Claassen, Wim; Bos, Edwin; Huls, Carla; De Smedt, Koenraad (1993). Commenting on action: A continuous linguistic feedback generator. In: Gray, W. D.; Hefley, W. E.; Murray, D. (Eds.), Proceedings of the International Workshop on Intelligent User Interfaces, Orlando, FL, January 4-7, 1993 (pp. 141-148). (Local copy; figures)

Action mode interfaces, in which users achieve their goals by manipulating representations, suffer from some fundamental disadvantages. In this paper, we present a working prototype of a system called Continuous Linguistic Feedback Generator (CLFG), a facility that addresses the major disadvantages. CLFG generates natural language descriptions of the actions the user is performing. These descriptions are presented in both the visual and audio channels. The knowledge sources and algorithm that enable CLFG to provide relevant and concise information are described in detail.

Zock, Michael; Carcagno, D.; Kay, M.; Namer, F.; Nogier, J. F.; Nossin, M.; De Smedt, Koenraad (1992). Automatic text generation: A tool for the business world? In: Proceedings of the 12th International Conference on Artificial Intelligence, Expert Systems and Natural Language, Avignon, 1-6 June 1992: Vol. 3. Natural Language Processing and its applications (pp. 281-294).

Text generation is a highly complex task, requiring different kinds of expertise. Despite this complexity, there are already a great number of systems in various languages, developed for various purposes. Natural language has a great industrial potential which, surprisingly, has been neglected for a long time. Even if computers are incapable of matching human performance, programs that generate natural language are invaluable tools for modeling human performance, and will eventually be applied in communicative tasks. Linguists, computer scientists and psychologists must work together to achieve advances in the field.

Daelemans, Walter; De Smedt, Koenraad; Gazdar, Gerald (1992). Inheritance in natural language processing. Computational Linguistics 18, 205-218.

In this introduction to the special issues, we begin by outlining a concrete example that indicates some of the motivations leading to the widespread use of inheritance networks in computational linguistics. This example allows us to illustrate some of the formal choices that have to be made by those who seek network solutions to natural language processing (NLP) problems. We provide some pointers into the extensive body of AI knowledge representation publications that have been concerned with the theory of inheritance over the last dozen years or so. We go on to identify the four rather separate traditions that have led to the current work in NLP. We then provide a fairly comprehensive literature survey of the use that computational linguists have made of inheritance networks over the last two decades, organized by reference to levels of linguistic description. In the course of this survey, we draw the reader's attention to each of the papers in these issues of Computational Linguistics and set them in the context of related work.

De Smedt, Koenraad; Kempen, Gerard (1991). Segment Grammar: A formalism for incremental sentence generation. In: Paris, Cecile L.; Swartout, William R.; Mann, William C. (Eds.), Natural Language Generation in Artificial Intelligence and Computational linguistics (pp. 329-349). Boston/Dordrecht/London: Kluwer Academic Publishers.

Incremental sentence generation imposes special constraints on the representation of the grammar and the design of the formulator (the module which is responsible for constructing the syntactic and morphological structure). In the model of natural speech production presented here, a formalism called Segment Grammar is used for the representation of linguistic knowledge. We give a definition of this formalism and present a formulator design which relies on it. Next, we present an object-oriented implementation of Segment Grammar. Finally, we compare Segment Grammar with other formalisms.

De Smedt, Koenraad (1991). Revisions during generation using non-destructive unification. In: Abstracts of the Third European Workshop on Natural Language Generation, Judenstein/Innsbruck, March 13-15, 1991 (pp. 53-70).

A realistic model for natural language generation must account for overt revisions of the syntactic structure (self-corrections) as well as covert revisions (backtracking on syntactic options). This paper presents the preliminaries of a hybrid architecture for grammatical encoding (the 'tactical' phase in sentence generation) which allows such revisions. This architecture combines the concept of activation with a non-destructive variant of the unification algorithm and views sentence generation as an optimization process.

Daelemans, Walter; De Smedt, Koenraad; De Graaf, Josje (1991). Default inheritance in an object-oriented representation of linguistic categories (Research Report No. 31). Tilburg: University of Tilburg, Institute for Language Technology and Artificial Intelligence.

We describe an object-oriented approach to the representation of linguistic knowledge. Rather than devising a dedicated grammar formalism, we explore the use of powerful but domain-independent object-oriented languages. We use default inheritance to organize regular and exceptional behavior of linguistic categories. Examples from our work in the areas of morphology, syntax and the lexicon are provided. Special attention is given to multiple inheritance, which is used for the composition of new categories out of existing ones, and to structured inheritance, which is used to predict, among other things, to which rule domain a word form belongs.

De Smedt, Koenraad; Schotel, Henk (1991). Review of: [Luger, George F.; Stubblefield, William F. (1989) Artificial intelligence and the design of expert systems. Benjamin/Cummings]. AISB Quarterly 78, 53-55.

Review.

De Smedt, Koenraad (1990). Incremental sentence generation: a computer model of grammatical encoding. Ph.D. dissertation, University of Nijmegen, Nijmegen Institute for Cognition and Information (NICI Technical Report No. 90-01). (Figures)

Spontaneous speech is characterized by the fact that the speaker often has not yet fully worked out the content of a sentence before the first words are uttered. A computer model called IPF is presented which accounts for this characteristic by operating in a parallel and incremental mode. Part One discusses psychological and linguistic aspects of IPF. The requirements for incremental generation are investigated. A grammar formalism called Segment Grammar is presented which fulfills these requirements. This unification-based grammar is organized such that for each utterance it constructs a functional structure as well as a (surface) constituent structure. Variations in word order and some discontinuities are accounted for from a perspective of incremental generation. Part Two discusses representational and computational aspects of IPF. Because the generation of natural language is a knowledge-intensive process, a representation of linguistic knowledge is presented which is hierarchically structured and captures generalizations while allowing exceptions. An object-oriented programming language, CommonORBIT, provides the necessary mechanisms for this purpose, as is illustrated with examples. Furthermore, the application of concurrent programming concepts in IPF is explained. Finally, the model is evaluated and some extensions and future research topics are proposed.

De Smedt, Koenraad (1990). Een objectgerichte taal gebaseerd op LISP. Informatie 32, 340-354.

Objects are representations of distinct entities in a computer model of reality. The knowledge about these objects may consist of data, but also of procedures that are applicable to these objects. In object-oriented languages, procedures and data are therefore encapsulated in objects. By means of inheritance, objects can share knowledge with others. In this way, new objects can be created as specializations or combinations of other objects. Inheritance also supports a style of programming by refinement and helps to avoid redundancy. The programming language CommonORBIT is an object-oriented extension of Common LISP. The features of this language are compared with those of other object-oriented languages in order to arrive at a balanced overview of some architectures within the object-oriented paradigm.

De Smedt, Koenraad (1990). IPF: An incremental parallel formulator. In: Dale, Robert; Mellish, Chris; Zock, Michael (Eds.), Current research in natural language generation (pp. 167-192). London: Academic Press. (Prepublication version; figures)

A computer simulation model of the human speaker is presented which generates sentences in a piecemeal way. The module responsible for Grammatical Encoding (the tactical component) is discussed in detail. Generation is conceptually and lexically guided and may proceed from the bottom of the syntactic structure upwards as well as from the top downwards. The construction of syntactic structures is based on unification of so-called syntactic segments.

Kempen, Gerard; De Smedt, Koenraad (1990). Tree Adjoining Grammar, Segment Grammar, and incremental sentence generation. In: Extended abstracts of the First International Workshop on Tree Adjoining Grammars: Formal Theory and Application, Schloss Dagstuhl, August 15-17, 1990 (pp. 61-63). (Prepublication version) Saarbrücken: Deutsches Forschungszentrum für Künstliche Intelligenz.

Segment Grammar and Tree Adjoining Grammar are similar in that they are lexically guided and fulfill some requirements of incremental generation. However, the distinction between a functional structure and a (surface) constituent structure in Segment Grammar allows more flexible processing in an incremental mode.

De Smedt, Koenraad; De Graaf, Josje (1990). Structured inheritance in frame-based representation of linguistic categories. In: Daelemans, Walter; Gazdar, Gerald (Eds.), Proceedings of the First Workshop on Inheritance in Natural Language Processing, Tilburg, August 16-18, 1990 (pp. 39-47). Tilburg: University of Tilburg, Institute for Language Technology and Artificial Intelligence.

Structured inheritance is a powerful mechanism which models a slot filler after one higher in the hierarchy. Provided by several general-purpose frame-based and object-oriented representation languages, it is also very useful for linguistic representation. Examples from morphology and syntax are provided in the context of a natural language generation task.

Van der Linden, Erik; Brinkkemper, Sjaak; De Smedt, Koenraad; Van Boven, P.; Van der Linden, M. (1990). The representation of lexical objects. In: Magay, T.; Zigány, J. (Eds.), BudaLEX '88 Proceedings: Papers from the EURALEX Third International Congress, Budapest, September 4-9, 1988 (pp. 287-295). Budapest: Akadémiai Kiadó. [Also published as Internal Report No. 88-ITI-B-33. Delft: TNO].

Information analysis methods developed in computer science for the construction of database systems can also be applied to computational lexicography. These methods deliver an abstract and concise description of the objects involved in a lexical information system, and reveal the considerations that are used when establishing the identity of the lexical units. Two underlying principles are introduced: the abstraction principle, positing that objects that do not occur in reality may nevertheless have to be represented in the lexicon, and the generalization principle, stating that the inclusion of these objects necessitates linguistic generalizations tied to the lexicon.

De Smedt, Koenraad (1989). Object-oriented knowledge representation in CommonORBIT (Internal Report No. 89-NICI-01). Nijmegen: University of Nijmegen, Nijmegen Institute for Cognition and Information. (Revised version)

Objects are representations of entities in a domain which is modeled in a computer. Each object encapsulates the knowledge relevant to one abstract concept or physical object in the real world. This knowledge may consist of data but also of procedures which are applicable to the object. Using the metaphor of a society of communicating entities, these procedures are activated by sending messages to objects. In an applicative view of object-oriented programming, procedures are called as generic functions. Object-oriented languages allow knowledge to be shared between several objects by a mechanism called inheritance. The knowledge representation language which is discussed in this report is CommonORBIT, an extension to Common LISP. It is an easy but powerful language, which offers a general-purpose object-oriented representation with similarities to semantic networks and frame-based systems.

De Smedt, Koenraad (1989). Distributed unification in parallel incremental syntactic tree formation. In: Extended abstracts presented at the Second European Natural Language Generation Workshop, Edinburgh, April 6-8, 1989.

For incremental syntactic tree formation, a grammar formalism called Segment Grammar has been proposed. This paper shows how tree formation with such a formalism can be seen as distributed unification which allows parallelism in syntactic tree formation.

Van Berkel, Brigit; De Smedt, Koenraad (1988). Triphone analysis: A combined method for the correction of orthographical and typographical errors. In: Proceedings of the Second Conference on Applied Natural Language Processing, Austin, February 9-12, 1988 (pp. 77-83). Association for Computational Linguistics.

Most existing systems for the correction of word level errors are oriented toward either typographical or orthographical errors. Triphone analysis is a new correction strategy which combines phonemic transcription with trigram analysis. It corrects both kinds of errors (also in combination) and is a superior method for the correction of orthographical errors.
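The combination of phonemic transcription with trigram analysis mentioned in this abstract can be sketched roughly as follows. This is a toy illustration only, not the method of the paper: the handful of rewrite rules in `transcribe`, the `#` boundary padding and the Dice-style overlap score are all assumptions made for the example.

```python
def transcribe(word):
    """Toy phonemic transcription using a few hypothetical rewrite rules."""
    rules = [("ph", "f"), ("ck", "k"), ("c", "k"), ("qu", "kw"), ("x", "ks")]
    s = word.lower()
    for graph, phon in rules:
        s = s.replace(graph, phon)
    return s


def trigrams(s):
    """Set of three-symbol sequences, with '#' marking the word boundaries."""
    padded = "#" + s + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Dice overlap between the trigram sets of the two transcriptions."""
    ta, tb = trigrams(transcribe(a)), trigrams(transcribe(b))
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def correct(word, lexicon):
    """Return the lexicon entry whose transcription best matches the input."""
    return max(lexicon, key=lambda entry: similarity(word, entry))


lexicon = ["photograph", "telegraph", "paragraph"]
print(correct("fotograf", lexicon))  # spelling based on pronunciation
print(correct("fotogarf", lexicon))  # same, plus a transposed letter pair
```

Because the comparison is made on transcriptions rather than on spellings, a misspelling that sounds like the intended word can match it exactly, while the trigram overlap still tolerates a limited amount of typographical noise.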

De Smedt, Koenraad (1988). Automatische correctie van spelling. Onze Taal 57, 136.

This article gives a brief overview of the possibilities and limitations of automatic spelling correction in Dutch texts.

De Smedt, Koenraad (1988). Knowledge representation techniques in artificial intelligence: An overview. In: Van der Veer, G. C.; Mulder, G. (Eds.) Human-computer interaction: Psychonomic aspects (pp. 207-222). Berlin: Springer.

The complexity of knowledge involved in Artificial Intelligence systems justifies the distinction of a separate programming level for the representation of knowledge. An overview is given of four styles of symbolic knowledge representation currently used in AI: (1) logic, (2) production rules, (3) procedures, and (4) semantic networks and frames. Each style is characterized briefly and some advantages and disadvantages of each style are mentioned. A number of languages for knowledge representation are mentioned.

De Smedt, Koenraad; Huls, Carla; Pijls, Fieny (1988). Taalkennis in tekstverwerking. Interdisciplinair Tijdschrift voor Taal- & Tekstwetenschap 8, 157-172.

The use of computers for natural language processing has in the past focused strongly on machine translation and question-answering systems. Yet there are many more application areas where natural language processing offers interesting possibilities. The application of linguistic knowledge in editorial tasks has so far been insufficiently appreciated and exploited. In this article we sketch an author environment that offers linguistic support for writing and editing texts. We discuss, among other things, automatic correction of typing and spelling errors (including grammatical ones such as d/t errors), reliable hyphenation and lexicon lookup. We also present some derived systems, namely a school word processor and a generator of semi-standard texts.

De Smedt, Koenraad; Kempen, Gerard (1987). Incremental sentence production, self-correction and coordination. In: Kempen, Gerard (Ed.), Natural language generation: New results in artificial intelligence, psychology and linguistics (pp. 365-376). Dordrecht: Nijhoff (Kluwer Academic Publishers).

In the generation of spontaneous speech, several stages of processing of the conceptual, lexico-syntactic, morpho-phonological and articulatory modules are customarily distinguished. It is not necessary for these modules to operate on input structures corresponding to whole sentences. Rather, the modules can work on different parts of the final utterance simultaneously in an incremental fashion. Such a framework can accommodate several performance phenomena such as hesitations within the sentence, changes of mind, self-corrections, afterthoughts, and dead ends, i.e. the fact that people sometimes "talk themselves into a corner" and have to restart the utterance. We present a general framework for incremental generation and describe how various conceptual activities during production may affect the partial syntactic structure.

Kempen, Gerard; Anbeek, Gert; Desain, Peter; Konst, Leo; De Smedt, Koenraad (1987). Author environments: Fifth generation text processors. In Directorate General XIII of the Commission of the European Communities (Ed.), ESPRIT '86: Results and achievements (pp. 365-372). Amsterdam: North-Holland.

Artificial Intelligence techniques for Natural Language Processing enable the construction of knowledge-based editorial software systems which greatly facilitate the preparation, manipulation and translation of full-text documents. They can offer many new forms of support which are far beyond the reach of present-day word processors. We propose Author Environment or Author System as collective terms for such text editing tools. After a somewhat principled discussion of what we mean by representing a natural-language text in a computer, we describe the goals, design, implementation and functionality of the Author Environment which we are developing as part of the ESPRIT OS-82 project which aims at the construction of an intelligent office workstation. We focus on the linguistic modules and the user interface.

Kempen, Gerard; Anbeek, Gert; Desain, Peter; Konst, Leo; De Smedt, Koenraad (1987). Auteursomgevingen: Tekstverwerkers van de vijfde generatie. Informatie 29, 988-993. (Translation of "Author environments: Fifth generation text processors")

Techniques developed within Artificial Intelligence (AI) for natural language processing make it possible to build 'intelligent' word processors that offer new forms of support for writing, editing and translating texts and documents. As a general name for such word processors we propose the terms author environment or author system. After a theoretical discussion of representing natural-language text in a computer, we describe in turn the goals, design, implementation and functionality of the author environment that is being developed at the Psychological Laboratory of the KUN as part of an ESPRIT project which aims at the construction of an intelligent office workstation. The focus will be on the linguistic modules and the user interface.

Van der Linden, Erik; De Smedt, Koenraad (1987). Computerlexica voor een auteursysteem. Toegepaste Taalwetenschap in Artikelen 27, 33-41.

A computer lexicon for an author system must be able to play a supporting role in, among others, the following tasks: detecting and correcting spelling and typing errors, hyphenating words at the right margin, indicating other inflected or conjugated forms of a word, flagging incorrect sentence construction, and transforming sentences. Each of these tasks places different demands on the content and structure of the dictionary. A number of lexica designed for these purposes in the Language Technology project in Nijmegen are discussed, and a first step is taken towards a design for a lexical database.

De Smedt, Koenraad (1987). Object-oriented programming in Flavors and CommonORBIT. In R. Hawley (Ed.), Artificial intelligence programming environments (pp. 157-176). Chichester: Ellis Horwood.

When writing a program, it is often useful to have a computational representation of the objects in the problem domain. Object-oriented programming environments strongly support such representations by organizing procedures as well as data in terms of objects. Two such programming environments are discussed: FLAVORS and CommonORBIT. A comparison is made with respect to underlying metaphors, the behavior of objects, specialization hierarchies, behavior sharing mechanisms, and integration in LISP. Attention is given to aspects of system design which affect programming style, computing efficiency, and modularity.

De Smedt, Koenraad; Geurts, Bart; Desain, Peter (1987). Waiting for the gift of sound and vision. In: Proceedings of the ESPRIT Natural Language and Dialogue Workshop, Brussels, October 1, 1987.

It is generally acknowledged that both the linguistic and the graphical modes of interaction possess specific advantages and disadvantages for man-machine interaction. We do not argue that a linguistic mode or a graphical mode is better, but merely that the two are complementary. We envisage a multi-modal system which is primarily based on direct manipulation on a graphical screen but which is supplemented by natural language communication in specific situations.

De Smedt, Koenraad (1986). De rol van het lexicon in de taaltechnologie. Lecture notes for the summer course "De computer en de taalkunde" at the University of Amsterdam, June 1986.

In software for natural language processing, such as is currently being developed in the Language Technology project in Nijmegen, the lexicon plays an important role. This paper first lists the objectives of the Language Technology project. Then the possibilities of a printed dictionary are contrasted with those of a computer dictionary. Next, an overview is given of how the lexica currently used within the project are structured. Finally, a view is given of the work still to be done on the construction of computer lexica for language technology applications.

De Smedt, Koenraad (1985). Some aspects of Dutch morphology and syntax in the context of the Language Technology project at the University of Nijmegen. Newsletter of the Belgian Association for Artificial Intelligence 2 (1).

An overview is given of the various modules of the Language Technology Project at the Psychology Laboratory of the University of Nijmegen, which aims at the development of dialogue systems as well as author systems. The article focuses on the role of the various knowledge sources which are involved in the tasks of generation and analysis on the word and sentence levels.

De Smedt, Koenraad (1984). Using object-oriented knowledge-representation techniques in morphology and syntax programming. In O'Shea, Tim (Ed.), ECAI-84: Proceedings of the Sixth European Conference on Artificial Intelligence (pp. 181-184). Amsterdam: Elsevier.

The class/subclass and rule/exception relations in natural language grammars can be captured elegantly by the inheritance mechanism in an object-oriented programming language. Some examples are given of how morphological and syntactic relations can be captured using role relationships between objects.

Martin, Willy; Tops, Guy A. J. (Eds.). (1984). Groot woordenboek Engels-Nederlands. Utrecht: Van Dale Lexicografie.

Lexicographic contributions were made to this comprehensive English-Dutch dictionary.

De Smedt, Koenraad (1984). Kennisrepresentatie. In Kempen, Gerard; Sprangers, Chris (Eds.), Kennis, mens en computer (pp. 79-88). Lisse: Swets & Zeitlinger. (Also published in Intermediair, 20(11), 27-31.)

Five important symbolic formalisms for knowledge representation in Artificial Intelligence are discussed: (1) predicate logic, (2) procedural representations, (3) semantic networks, (4) production systems and (5) frames.

De Smedt, Koenraad (1984). Orbit, an object-oriented extension of Lisp (Internal Report No. 84-FU-13). Nijmegen: University of Nijmegen, Psychology Department.

ORBIT is an applicative object-oriented programming language. It is implemented as an extension of FRANZ LISP and allows free intermixing of object-oriented and non-object-oriented code. ORBIT is a reasonably efficient tool for writing and running programs that operate on complex and dynamic data. Its features include multiple and selective inheritance, the ability to automatically store inherited or computed information, and reverse function application. The ORBIT programming environment provides special editing and debugging tools.

De Smedt, Koenraad (1984). Implementing an IPG generator in an applicative object-oriented programming language. In: Computers in Literary and Linguistic research: Proceedings of the XIth International Conference of the Association for Literary and Linguistic Computing, Louvain, April 2-6, 1984.

Language production can be viewed as the creation of objects and the activation of object-oriented (generic) functions in an applicative object-oriented programming language. The resulting linguistic structures are networks of objects with relations between them. In this way, constituents can be viewed as objects and syntactic functions as object-oriented (generic) functions. The paper discusses how this approach is compatible with the basic tenets of Incremental Procedural Grammar (IPG).

Kempen, Gerard; Konst, Leo; De Smedt, Koenraad (1984). Taaltechnologie voor het Nederlands: Vorderingen bij de bouw van een nederlandstalig dialoog- en auteursysteem. Informatie 26, 878-881.

This article gives an overview of the Taaltechnologie (Language Technology) project, which has been carried out at the Katholieke Universiteit Nijmegen since late 1982. After outlining the main objectives, we describe the state of affairs as of mid-1984. We explain the principles underlying the Dutch-language linguistic modules under development: word and sentence parsers, word and sentence generators, a linguistic database, and an object-oriented knowledge representation system. The aim is to integrate these modules into a dialogue system for natural-language communication with databases and expert systems, and into an authoring system for word processing based on built-in linguistic knowledge.

Steels, Luc; De Smedt, Koenraad (1983). Some examples of frame-based syntactic processing. In Daems, Frans; Goossens, Louis (Eds.), Een spyeghel voor G. Jo Steenbergen (pp. 293-305). Leuven: Acco.

Language processing is viewed as a problem solving process with two components: a problem solving mechanism, and a grammar as the body of knowledge needed to drive it. The grammar consists of associations of descriptions grouped in frames. Frames are organized in tangled hierarchies based on generalization and refinement relationships. Linguistic structures are collections of descriptions which are generalized and/or refined until a particular goal is reached. A small but illustrative example of this approach is presented.

De Smedt, Koenraad (1983). Implementing an IPG generator in an applicative object-oriented programming language. In: Documentation on the International Workshop on Language Generation, Burg Stettenfels, Germany, August 15-17, 1983. Stuttgart: Universität Stuttgart, Institut für Informatik.

Language production can be viewed as the creation of objects and the activation of object-oriented (generic) functions in an applicative object-oriented programming language. The resulting linguistic structures are networks of objects with relations between them. In this way, constituents can be viewed as objects and syntactic functions as object-oriented (generic) functions. The paper discusses how this approach is compatible with the basic tenets of Incremental Procedural Grammar (IPG).

De Smedt, Koenraad (1980). Using case to improve frame-based representation languages (Schlumberger A.I. Working Paper No. 3). Ridgefield, CT: Schlumberger-Doll Research.

Frame-based representation languages can benefit from the correspondence between slots in a frame and case relations in natural language. This view facilitates the incorporation of natural-language-like features such as verbs and adjectives in a frame description.