Abstract

As an engineering field, research on natural language processing (NLP) is much more constrained by currently available resources and technologies, compared with theoretical work on computational linguistics (CL). In today’s technology-driven society, it is almost impossible to imagine the degree to which computational resources, the capacity of secondary and main storage, and software technologies were restricted when I embarked upon my research career 50 years ago. While these restrictions inevitably shaped my early research into NLP, my subsequent work evolved in line with the significant progress made in associated technologies and related academic fields, particularly CL.

Figure 1 shows the research topics in which I have been engaged. My initial NLP research was concerned with a question answering system, which I worked on during my M.Eng. and D.Eng. degrees. The research focused on reasoning and language understanding, which I soon found was too ambitious and ill-defined. After receiving my D.Eng., I changed my direction of research and began to work on processing the forms of language expressions, with less commitment to language understanding, namely, machine translation (MT) and parsing. However, I returned to research into reasoning and language understanding in the later stage of my career, with clearer definitions of tasks and relevant knowledge, and equipped with access to more advanced supporting technologies.

In this article, I begin by briefly describing my views on the mutual relationships among disciplines related to CL and NLP, and then move on to discussing my own research.

Language is a complex topic to study, infinitely harder than I first imagined when I began to work in the field of NLP. There is a whole discipline devoted to the study of language, namely, linguistics. Linguistics is concerned not only with language per se, but must also deal with how humans model the world. The study of semantics, for example, must relate language expressions to their meanings, which reside in the mental models possessed by humans.

Apart from linguistics, there are two fields of science that are concerned with language, that is, brain science and psychology. These are concerned with how humans process language. Then, there are the two disciplines in which we are involved, namely, CL and NLP.

Figure 2 is a schematic view of these research disciplines. Both of the lower disciplines are concerned with processing language, that is, how language is processed in our minds or our brains, and how computer systems should be designed to process language efficiently and effectively. The top discipline, linguistics, on the other hand, is concerned with the rules that are followed by languages. That is to say, linguists study language as a system. This schematic view is certainly oversimplified, and there are subject fields in which these disciplines overlap. Psycholinguistics, for example, is a subfield of linguistics which is concerned with how the human mind processes language. A broader definition of CL may include NLP as its subfield. In this article, for the sake of discussion, I adopt narrower definitions of linguistics and CL.
In this narrower definition, linguistics is concerned with the rules followed by languages as a system, whereas CL, as a subfield of linguistics, is concerned with the formal or computational description of the rules that languages follow. CL, which focuses on the formal/computational description of languages as a system, is expected to bridge the broader fields of linguistics with the lower disciplines, which are concerned with the processing of language.

Given my involvement in NLP, I would like to address the question of whether the narrowly defined CL is relevant to NLP. The simple answer is yes. However, the answer is not so straightforward, and requires us to examine the degree to which the representations used to describe language as a system are relevant to the representations used for processing language. Although my colleagues and I have been engaged in diverse research areas, I pick out only a subset of these, to illustrate how I view the relationships between NLP and CL. Due to the nature of the article, I ignore technical details and focus instead on the motivation of the research and the lessons that I have learned through it.

Background and Motivation. Following the ALPAC report (Pierce et al. 1966), research into MT had been largely abandoned by academia, with the exception of a small number of institutes (notably, GETA at Grenoble, France, and Kyoto University, Japan). There were only a handful of commercial MT systems, which were being used for limited purposes. These commercial systems were legacy systems that had been developed over years and had become complicated collections of ad hoc programs. They had become too convoluted to allow for changes and improvements. To re-initiate MT research in academia, we had to have more systematic and disciplined design methodologies.

On the other hand, theoretical linguistics, initiated by Noam Chomsky (Chomsky 1957, 1965), had attracted linguists with a mathematical orientation, who were interested in formal frameworks for describing the rules followed by language. Those linguists with interests in formal ways of describing rules were the first generation of computational linguists. Although computational linguists did not necessarily follow the Chomskyan way of thinking, they shared the general view of treating language as a system of rules. They developed formal ways of describing the rules of language and showed that these rules consisted of different layers, such as morphology, syntax, and semantics, and that each layer required different formal frameworks with different computational powers. Their work also motivated work on how one could process language by computerizing its rules. This work constituted the beginning of NLP research, and resulted in the development of parsing algorithms for context-free languages, finite-state machines, and so forth. It was natural to use this work as the basis for designing the second generation of MT systems, which was initiated by an MT project (the MU project, 1982-1986) led by Prof. M. Nagao (Nagao, Tsujii, and Nakamura 1985).

Research Contributions. When I began research into MT in the late 1970s, there was a common view largely shared by the community, which had been advocated by the GETA group in France. The view was called the transfer approach to MT (Boitet 1987). The transfer approach viewed translation as a process consisting of three phases: analysis, transfer, and generation. According to linguists, a language is a system of rules.
The analysis and generation phases were monolingual phases that were concerned with the rules of a single language, the analysis phase using the rules of the source language and the generation phase using the rules of the target language. Only the transfer phase was a bilingual phase.

Another view shared by the community was an abstraction hierarchy of representation, called the triangle of translation. For example, Figure 3(a) shows the hierarchy of representation used in the Eurotra project, with their definition of each level (Figure 3(b)). As one climbed up such a hierarchy, the differences among languages would become increasingly small, so that the mapping (i.e., the transfer phase) from one language to another would become as simple as possible. Independently of the target language, the goal of the analysis phase was to climb up the hierarchy, while the aim of the generation phase was to climb down the hierarchy to generate surface expressions in the target language. Both phases were concerned only with the rules of single languages.

In the extreme view, the top of the hierarchy was taken as the language-independent representation of meaning. Proponents of the interlingual approach claimed that, if the analysis phase reached this level, then no transfer phase would be required. Rather, translation would consist only of the two monolingual phases (i.e., the analysis and generation phases).

However, in Tsujii (1986), I claimed, and still maintain, that this was a mistaken view of the nature of translation. In particular, this view assumed that a translation pair (consisting of the source and target sentences) encodes the same “information”. This assumption does not hold, in particular, for a language pair such as Japanese and English, which belong to very different language families. Although a good translation should preserve the information conveyed by the source sentence as much as possible in the target sentence, translation may lose some information or add extra information. Furthermore, the goal of translation may not be to preserve information but to convey the same pragmatic effects to readers of the translation. More seriously, an abstract level of representation such as the Interface Structure in Eurotra focused only on the propositional content encoded in language, and tended to abstract away other aspects of information, such as the speaker’s empathy, the distinction between old and new information, emphasis, and so on. Climbing up the hierarchy thus led to loss of the information carried at lower levels of representation.

In Tsujii (1986), instead of mapping at the abstract level, I proposed “transfer based on a bundle of features of all the levels”, in which the transfer would refer to all levels of representation in the source language to produce a corresponding representation in the target language (Figure 4). Because different levels of representation require different geometrical structures (i.e., different tree structures), the realization of this proposal had to wait for the development of a clear mathematical formulation of feature-based representation with reentrancy, which allowed multiple levels (i.e., multiple trees) to be represented together with their mutual relationships (see the next section).

Another idea we adopted to systematize the transfer phase was recursive transfer (Nagao and Tsujii 1986), which was inspired by the idea of compositional semantics in CL. According to the views of linguists at the time, a language is an infinite set of expressions which, in turn, is defined by a finite set of rules.
By applying this finite number of rules, one can generate infinitely many grammatical sentences of the language. Compositional semantics claimed that the meaning of a phrase was determined by combining the meanings of its subphrases, using the rules that generated the phrase. Compositional translation applied the same idea to translation. That is, the translation of a phrase was determined by combining the translations of its subphrases. In this way, translations of infinitely many sentences of the source language could be generated.

Using the compositional translation approach, the translation of a sentence would be undertaken by recursively tracing the tree structure of the source sentence. The translation of a phrase would then be formulated by combining the translations of its subphrases. That is, translation would be constructed in a bottom-up manner, from smaller units of translation to larger units. Furthermore, because the mapping of a phrase from the source to the target would be determined by the lexical head of the phrase, the lexical entry for the head word specified how to map a phrase to the target. In the MU project, we called this lexicon-driven, recursive transfer (Nagao and Tsujii 1986) (Figure 5). Compared with the first-generation MT systems, which replaced source expressions with target ones in an undisciplined and ad hoc order, the order of transfer in the MU project was clearly defined and systematically performed.

Lessons. Research and development of the second-generation MT systems benefitted from research into CL, allowing more clearly defined architectures and design principles than the first-generation MT systems. The MU project successfully delivered English-Japanese and Japanese-English MT systems within the space of four years. Without these CL-driven design principles, we could not have delivered these results in such a short period of time.

However, the differences between the objectives of the two disciplines also became clear. Whereas CL theories tend to focus on specific aspects of language (such as morphology, syntax, semantics, discourse, etc.), MT systems must be able to handle all aspects of the information conveyed by language. As discussed, climbing up a hierarchy that focuses on propositional content alone does not result in good translation.

A more serious discrepancy between CL and NLP is the treatment of ambiguities of various kinds. Disambiguation is the single most significant challenge in most NLP tasks; it requires processing the context in which the expressions to be disambiguated occur. In other words, it requires understanding of context. Typical examples of disambiguation are shown in Figure 6. The Japanese word asobu has a core meaning of “spend time without engaging in any specific useful tasks”, and would be translated into “to play”, “to have fun”, “to spend time”, “to hang around”, and so on, depending on the context (Tsujii 1986). Considering context for disambiguation conflicts with recursive transfer, since it requires larger units to be handled (i.e., the context in which the unit to be translated occurs). The nature of disambiguation made the process of recursive transfer clumsy.
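To make the shape of this procedure concrete, here is a minimal, hypothetical sketch of lexicon-driven, recursive transfer in Python; the node type, the transfer lexicon, and the toy entries (including the one for asobu) are invented for illustration and do not reproduce the MU project's actual rule formalism.

```python
# Hypothetical sketch of lexicon-driven, recursive transfer (illustration only).
# A source-language parse tree is traversed bottom-up; the entry for each
# phrase's lexical head decides how the translations of its subphrases are
# combined into a target-language expression.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Node:
    head: str                                   # lexical head of the phrase
    children: List["Node"] = field(default_factory=list)

# Transfer lexicon: for each source head word, a rule that builds the target
# expression from the already-transferred subphrases. Entries are invented.
TransferRule = Callable[[List[str]], str]
transfer_lexicon: Dict[str, TransferRule] = {
    # "asobu" is mapped to a single default choice here; choosing among
    # "play", "have fun", "hang around", etc. would need context that lies
    # outside the phrase, which is exactly what strains recursive transfer.
    "asobu": lambda subs: " ".join(["play"] + subs),
    "kouen": lambda subs: " ".join(["in the park"] + subs),
}

def transfer(node: Node) -> str:
    """Bottom-up transfer: translate the subphrases first, then let the head
    word's lexical entry combine them into the target-language expression."""
    translated_subs = [transfer(child) for child in node.children]
    rule = transfer_lexicon.get(
        node.head, lambda subs: " ".join(subs + [node.head])  # fallback: copy head
    )
    return rule(translated_subs)

# Toy example for "kouen de asobu" ("play in the park"), with the particle omitted.
print(transfer(Node("asobu", [Node("kouen")])))   # -> "play in the park"
```

The sketch also shows where disambiguation strains the scheme: the entry for asobu can only commit to a single default rendering, because the context needed to choose among “to play”, “to have fun”, or “to hang around” lies outside the phrase being transferred.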
Disambiguation was also a major problem in the analysis phase, which I discuss in the next section.

The major (although hidden) general limitation of CL or linguistics is that it tends to view language as an independent, closed system and avoids the problem of understanding, which requires reference to knowledge or non-linguistic context. However, many NLP tasks, including MT, require an understanding or interpretation of language expressions in terms of knowledge and context, which may involve other input modalities, such as visual stimuli, sound, and so forth. I discuss this in the section on the future of research.

Background and Motivation. At the time I was engaged in MT research, new developments took place in CL, namely, feature-based grammar formalisms (Kriege 1993). In its early stages, transformational grammar in theoretical linguistics, as developed by N. Chomsky, assumed that sequential stages of application of tree transformation rules linked the two levels of structure, that is, deep and surface structures. A similar way of thinking was also shared by the MT community. They assumed that climbing up the hierarchy would involve sequential stages of rule application, mapping the representation at one level to the representation at the next adjacent level. Because each level of the hierarchy required its own geometrical structure, it was not considered possible to have a unified, non-procedural representation in which the representations of all the levels co-exist.

This view was changed by the emergence of feature-based formalisms that used directed acyclic graphs (DAGs) to allow reentrancy. Instead of mappings from one level to another, they described the mutual relationships among different levels of representation in a declarative manner. This view was in line with our idea of description-based transfer, which used a bundle of features of different levels for transfer. Moreover, some grammar formalisms at the time emphasized the importance of lexical heads. That is, the local structures of all the levels are constrained by the lexical head of a phrase, and these constraints are encoded in the lexicon. This was also in line with our lexicon-driven transfer.

A further significant development in CL took place at the same time. Namely, a number of sizable treebank projects, most notably the Penn Treebank and the Lancaster/IBM Treebank, had reinvigorated corpus linguistics and started to have significant impacts on research into CL and NLP (Marcus et al. 1994). From the NLP point of view, the emergence of large treebanks led to the development of powerful tools (i.e., probabilistic models) for disambiguation.

We started research that would combine these two trends to systematize the analysis phase, that is, parsing based on feature-based grammar formalisms.

Research Contributions. It is often claimed that ambiguities occur because of insufficient constraints. In the analysis phase of the “climbing up the hierarchy” model, lower levels of processing could not refer to constraints at higher levels of representation. This was considered the main cause of the combinatorial explosion of ambiguities at the early stages of climbing up the hierarchy.
Syntactic analysis could not refer to semantic constraints, meaning that ambiguities in syntactic analysis would explode. On the other hand, because feature-based formalisms could describe constraints at all levels in a single unified framework, it became possible to refer to constraints at all levels to narrow down the set of possible interpretations.

However, in practice, the actual grammar was still vastly underconstrained. This was partly because we did not have effective ways of expressing semantic and pragmatic constraints. Computational linguists were interested in formal, declarative ways of relating the syntactic and semantic levels of representation, but not so much in how semantic constraints were to be expressed. To specify semantic or pragmatic constraints, one may have to refer to mental models of the world (i.e., how humans see the world), to discourse structures beyond single sentences, and so on. These fell outside the scope of CL research at the time, whose main focus was on grammar formalisms. Furthermore, it is questionable whether semantics or pragmatics can be used as constraints at all. They may be more concerned with the plausibility of an interpretation than with constraints that an interpretation should satisfy (for example, see the discussion in Wilks [1975]).

Therefore, even for parsing using feature-based formalisms, disambiguation and the handling of the explosion of ambiguities remained major issues for NLP. Probabilistic models were among the most powerful tools for disambiguation and for handling the plausibility of an interpretation. However, probabilistic models for simpler formalisms, such as regular and context-free grammars, had to be adapted to more complex grammar formalisms. Techniques for handling combinatorial explosion, such as packing, also had to be reformulated for feature-based formalisms. Furthermore, although feature-based formalisms were neat in terms of describing constraints in a declarative manner, the unification operation, which was the basic operation for treating feature-based descriptions, was computationally very expensive. To deliver practical NLP systems, we had to develop efficient implementation technologies and processing architectures for feature-based formalisms.

The team at the University of Tokyo started to study how we could transform a feature-based grammar (we chose HPSG) into effective and efficient representations for parsing. The research included:

- Design of an abstract machine for processing typed feature structures and development of a logic programming system, LiLFeS (Makino et al. 1998; Miyao et al. 2000).
- Transformation of the HPSG grammar into more processing-oriented representations, such as CFG skeletons (Torisawa and Tsujii 1996; Torisawa et al. 2000) and supertags extracted from the original HPSG.
- Packing of feature structures (feature forests) and log-linear probabilistic models (Miyao and Tsujii 2003, 2005, 2008).
- A staged architecture of parsing based on transformation of grammar formalisms and their probabilistic modeling (Matsuzaki, Miyao, and Tsujii 2007; Ninomiya et al. 2010).

A simplified representation of our parsing model is shown in Figure 7. Given a sentence, its representation at all the levels was constructed at the final stage by using the HPSG grammar. Disambiguation took place mainly in the first two phases. The first phase was a supertagger that would disambiguate the supertags assigned to the words in a sentence.
Supertags were derived from the original HPSG grammar, and a set of supertags was attached to each word in the lexicon. The supertagger would choose the most probable sequence of supertags for the given sequence of words. The task was a sequence labeling task, which could be carried out in a very efficient manner (Zhang, Matsuzaki, and Tsujii 2009). This means that the surface local context (i.e., local sequences of supertags) was used for disambiguation, without constructing actual DAGs of features.

The second phase was CFG filtering. A CFG skeleton, which was also derived from the HPSG grammar, was used to check whether the sequences of supertags chosen by the first phase could reach a successful derivation tree. The supertagger did not explicitly build parse trees to check whether a chosen sequence could reach a legitimate derivation tree; the second phase of CFG filtering would filter out supertag sequences that could not.

The final phase not only built the final representation of all the levels, but also checked the extra constraints specified in the original grammar. Because the first two phases used only part of the constraints specified in the HPSG grammar, the final phase would reject results produced by the first two phases if they failed to satisfy these extra constraints. In this case, the system would backtrack to the previous phases to obtain the next candidate. All of these research efforts collectively produced a practical, efficient parser based on HPSG (Enju).

Lessons. As in MT, CL theories were effective for the systematic development of NLP systems. Feature-based grammar formalisms drastically changed the view of parsing as “climbing up the hierarchy”. Moreover, mathematically well-defined formalisms enabled the systematic development of efficient implementations of unification, the transformation of the grammar into supertags and CFG skeletons, and so forth. These formalisms also provided solid ground for operations in NLP such as the packing of feature structures, which are essential for treating combinatorial explosion.

On the other hand, direct application of CL theories to NLP did not work, since this would result in extremely slow processing. We had to transform them into more processing-oriented formats, which required significant effort and time on the NLP side. For example, we had to transform the original HPSG grammar into processing-oriented forms, such as supertags and CFG skeletons. It is worth noting that, while the resultant architecture was similar to the climbing-up-the-hierarchy processing, each stage in the final architecture was clearly defined and related to the others through the single declarative grammar.

I also note that advances in the fields of computer science and engineering significantly changed what was possible to achieve in NLP. For example, the design of an abstract machine and its efficient implementation for unification in LiLFeS (Makino et al. 1998), effective support systems for maintaining large banks of parsed trees (Ninomiya, Makino, and Tsujii 2002; Ninomiya, Tsujii, and Miyao 2004), and so forth, would have been impossible without advances in the broader fields of computer science and engineering and without much improved computational power (Taura et al. 2010).

On the other hand, disambiguation remained the major issue in NLP. Probabilistic models enabled major breakthroughs in terms of solving the problem.
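For concreteness, disambiguation models of this general kind score candidate analyses with a log-linear (maximum entropy) form along the following lines; this is the standard textbook formulation rather than the exact feature design of the cited papers:

```latex
% Log-linear model over the candidate analyses T(s) of a sentence s.
% f_i are feature functions over (analysis, sentence) pairs; \lambda_i are learned weights.
P(t \mid s) = \frac{\exp\big(\sum_i \lambda_i f_i(t, s)\big)}
                   {\sum_{t' \in T(s)} \exp\big(\sum_i \lambda_i f_i(t', s)\big)}
```

In the feature forest approach mentioned above, the summation in the denominator is computed over a packed representation of the candidate analyses rather than by enumerating them one by one.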
Compared with the fairly clumsy rule-based disambiguation that we adopted for the MU project, probabilistic modeling provided the NLP community with systematic ways of handling ambiguities. Combined with large treebanks, objective quantitative comparison of different models also became feasible, which made the systematic development of NLP systems possible. However, the error rate in parsing remained (and still remains) high.

While reported error rates are getting lower, measuring the error rate in terms of the number of incorrectly recognized dependency relations is misleading. At the sentence level, the error rate remains high. That is, a sentence in which all dependency relations are correctly recognized remains very rare. Because most dependency relations are trivial (i.e., pairs of adjacent words or pairs of close neighbors), errors in semantically critical dependencies, such as PP-attachments, scopes of conjunctions, and so on, remain abundant (Hara, Miyao, and Tsujii 2009). Even using probabilistic models, there are obvious limits to disambiguation performance, unless a deeper understanding is taken into account. This leads me to the next research topic: language and knowledge.

Background and Motivation. I was interested in the topic of how to relate language with knowledge at the very beginning of my career. At the time, my naiveté led me to believe that a large collection of text could be used as a knowledge base, and I was engaged in research on a question-answering system based on a large text base (Nagao and Tsujii 1973, 1979). However, resources such as a large collection of text, storage capacity, processing speed of computer systems, and basic NLP technologies, such as parsing, were not available at the time.

I soon realized, however, that the research would involve a whole range of difficult research topics in artificial intelligence, such as the representation of common sense, human ways of reasoning, and so on. Moreover, these topics had to deal with uncertainty and the peculiarities of individual humans. The knowledge or world models that individual humans have may differ from one person to another. I felt that the research target was ill-defined.

However, through research in MT and parsing in the later stages of my career, I started to realize that NLP research is incomplete if it ignores how knowledge is involved in processing, and that challenging NLP problems are all related to issues of understanding and knowledge. At the same time, considering NLP as an engineering field, I took it to be essential to have a clear definition of the knowledge or information with which language is to be related. I wanted to avoid the vagueness of research into commonsense knowledge and reasoning and to restrict our research focus to the relationship between language and knowledge. As a research strategy, I chose to focus on biomedicine as the application domain. There were two reasons for this choice.

One reason was that microbiology colleagues at the two universities with which I was affiliated told me that, in order to understand life-related phenomena, it had become increasingly important for them to organize pieces of information scattered across a large collection of published papers in diverse subject fields such as microbiology, medical sciences, chemistry, and agriculture. In addition to the large collection of papers, they also had diverse databases that had to be linked with each other.
In other words, they had a solid body of knowledge shared by domain specialists that was to be linked with information in text.

The other reason was that there were colleagues at the University of Manchester who were interested in sublanguages. Given the discussion of information formats in a medical sublanguage by the NYU group (Sager 1978) and the research into medical terminology at the University of Manchester, focusing on relations between terms and concepts (Ananiadou 1994; Frantzi and Ananiadou 1996; Mima et al. 2002), the biomedical domain was a natural choice for sublanguage research. The important point here was that the information formats in a sublanguage and the terminology concepts were defined by the target domain, not by NLP researchers. Furthermore, domain experts had actual needs and concrete requirements to help solve their own problems in the target domains.

Research Contributions. Although there had been quite a large amount of research into information retrieval and text mining for the biomedical domain, there had been no serious efforts to apply structure-based NLP techniques to text mining in the domain. To address this, the teams at the University of Manchester and the University of Tokyo jointly launched a new research program in this direction. Because this was a novel research program, we first had to define concrete tasks to solve, to prepare resources, and to involve not only NLP researchers but also experts in the target domains.

Regarding the involvement of NLP researchers and domain experts, we found that a few groups around the world had also begun to be interested in similar research topics. In response to this, we organized a number of research gatherings in collaboration with colleagues around the world, which led to the establishment of a SIG (SIGBIOMED) at the ACL. The first workshop took place in 2002, co-located with the ACL conference (Workshop 2002). The SIG now organizes annual workshops and co-located shared tasks. It has been expanding rapidly and has become one of the most active SIGs in NLP applications. The field of applying structure-based NLP to text mining is broadening to cover the clinical/medical domains (Xu et al. 2012; Sohrab et al. 2020) as well as the chemistry and materials science domains (Kuniyoshi et al. 2019).

Research contributions by the two teams include the GENIA corpus (Kim et al. 2003; Thompson, Ananiadou, and Tsujii 2017), a large repository of acronyms with their original terms (Okazaki, Ananiadou, and Tsujii 2008, 2010), the GENIA POS tagger (Tsuruoka et al. 2005), the BRAT annotation tool (Stenetorp et al. 2012), a workflow design tool for information extraction (Kano et al. 2011), an intelligent search system based on entity association (Tsuruoka, Tsujii, and Ananiadou 2008), and a system for pathway construction (Kemper et al. 2010).

The GENIA annotated corpus is one of the most frequently used corpora in the biomedical domain. To see what information domain experts considered important in text and how it was encoded in language, we annotated 2,000 abstracts, not only from the linguistic point of view
