The use of grammatically annotated corpora for the display of textual patterns
This paper looks at how meanings unfold across the genre of Contemporary Slovenian Sermons under the application of tools developed within the fields of systemic functional linguistics and corpus linguistics. The theoretical construct for the analysis is provided by Hasan's (1984) concept of genre and Cloran's (1994) concept of message semantics. To determine consistency of textual patterns for the genre of sermons, lexico-grammatical patterns in sermons are examined. It is argued that genre analysis, supported by visual presentation of lexico-grammatical patterns (as suggested by Biber et al. 1998) extracted from a grammatically annotated corpus of Slovenian sermons, provides a fuller picture of crucial properties of genre.
- Research Article
16
- 10.6100/ir589992
- Nov 18, 2015
- Data Archiving and Networked Services (DANS)
An RDF model is similar to a directed labeled graph (DLG) [Lassila and Swick, 1999]. However, it differs from a classical DLG since its definition allows for multiple edges between
- Research Article
5
- 10.1515/jlt-2019-0009
- Sep 6, 2019
- Journal of Literary Theory
Genre history and its concept of genre, using the example of novellas (Gattungsgeschichte und ihr Gattungsbegriff am Beispiel der Novellen)
- Supplementary Content
4
- 10.17635/lancaster/thesis/831
- Sep 30, 2019
- University of Lancaster
Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (also known as a semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These tools identify or tag only part of the core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words), based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed for Urdu in order to provide comprehensive semantic analysis of Urdu text. This research reports on the development of an Urdu semantic tagging tool and discusses the challenges faced in this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of new tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech tagger. Results for these tools are as follows: the word tokenizer achieves an F1 of 94.01% and an accuracy of 97.21%, the sentence tokenizer an F1 of 92.59% and an accuracy of 93.15%, and the POS tagger an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques.
Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entities. A large multi-target annotated corpus is also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; this corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06% and an Accuracy of 0.94%. The best lexical coverage figures of 88.59%, 99.63%, 96.71% and 89.63% are obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
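The evaluation measures named in this abstract (F1 for the tokenizers, Hamming Loss for the multi-target classifiers) can be illustrated with a minimal Python sketch. The function names and the toy label matrices below are hypothetical and are not taken from the thesis; the sketch uses the standard textbook definitions, which may differ in detail from the exact variants the author computed.

```python
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label slots that were predicted wrongly.

    Each row is one instance; each column is one binary target label.
    """
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


# Hypothetical toy data: 2 instances, 3 semantic-field labels each.
gold = [[1, 0, 1], [0, 1, 0]]
predicted = [[1, 0, 0], [0, 1, 0]]

print(hamming_loss(gold, predicted))   # one wrong slot out of six
print(f1_score(tp=8, fp=2, fn=2))      # precision and recall both 0.8
```

A lower Hamming Loss is better, since it counts label-level mistakes rather than whole-instance mismatches; this is why it is commonly reported alongside accuracy for multi-target classifiers.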
- Research Article
62
- 10.6100/ir615072
- Nov 18, 2015
- Data Archiving and Networked Services (DANS)
The scenario in which the present research takes place is that of one or more online video repositories containing several hours of documentary footage and users possibly interested only in particular topics of that material. In such a setting it is not possible to craft a single version containing all possible topics the user might like to see, unless including all the material, which is clearly not feasible. The main motivation for this research is, therefore, to enable an alternative authoring process for film makers to make all their material dynamically available to users, without having to edit a static final cut that would select out possible informative footage. We developed a methodology to automatically organize video material in an edited video sequence with a rhetorical structure. This was enabled by defining an annotation schema for the material and a generation process with the following two requirements:
• the data repository used by the generation process could be extended by simply adding annotated material to it
• the final resulting structure of the video generation process would seem familiar to a video literate user
The first requirement was satisfied by developing an annotation schema that explicitly identifies rhetorical elements in the video material, and a generation process that assembles longer sequences of video by manipulating the annotations in a bottom-up fashion. The second requirement was satisfied by modelling the generation process according to documentary making and general film theory techniques, in particular making the role of rhetoric in video documentaries explicit. A specific case study was carried out using video material for video documentaries. These used an interview structure, where people are asked to make statements about subjective matters. This category is characterized by rich information encoded in the audio track and by the controversiality of the different opinions expressed in the interviews.
The approach was tested by implementing a system called Vox Populi that realizes a user-driven generation of rhetoric-based video sequences. Using the annotation schema, Vox Populi can be used to generate the story space and to allow the user to select and browse such a space. The user can specify the topic but also the characters of the rhetorical dialogue and the rhetoric form of the presentation. Presenting controversial topics can introduce some bias: Vox Populi tries to control that by modelling some rhetoric and film theory editing techniques that influence
- Research Article
- 10.22049/jalda.2018.26165.1051
- Mar 1, 2016
The purpose of the present study was to compare the PhD dissertations written by native and non-native English writers in the field of Applied Linguistics with regard to the use of self-mentions. To this end, 40 Applied Linguistics PhD dissertations (20 written by native English writers and 20 by non-native English writers) were selected randomly from academic texts written in 2007-2017. The present study analyzed only the introduction and discussion sections of these PhD dissertations. The results of the chi-square analyses revealed that native English writers used more self-mentions in the introduction and discussion sections of Applied Linguistics PhD dissertations than their non-native counterparts. In the light of the findings of the study, it was recommended that Iranian writers in general and PhD candidates in particular move away from positivist impersonalized writing presentation towards more socialist performance of knowledge claims and authors’ voice and stance.
- Conference Article
7
- 10.7916/d8zk5r0k
- May 1, 2002
- Columbia Academic Commons (Columbia University)
We have collected a large-scale corpus of electronic articles in the cardiology domain (85 million+ words) in the framework of a digital library project that tailors the presentation of online medical literature to both patients and healthcare providers. We describe the web-based and XML technologies we used for the collection, encoding and linguistic processing of the corpus. This resulted in a large-scale, high-quality, thoroughly marked-up resource which is used by many researchers in our project, in the areas of natural language processing, information retrieval and medical informatics. We show how the final use of the resource has influenced the design of its structural and linguistic encoding. The procedure we describe is general enough to be of use to researchers in a similar position wishing to compile, encode and linguistically annotate their own corpus from the web.
- Research Article
- 10.47405/mjssh.v4i5.256
- Sep 9, 2019
Language is one of the means that plays a large role in shaping public opinion and building a new perspective on development, both academically and non-academically. This relates to the Developing 4.0 thought system (Internet of Things, Artificial Intelligence, Human-Machine Interface, Robot Technology and Sensors, and 3D Printing Technology), one element of which is the use of Corpus Linguistics. One of the typical difficulties posed by modal verbs is their multiplicity of meaning: most modal verbs have more than one meaning or function. Using a descriptive method, this research describes the modal have to + verb in English and Indonesian modality based on its meaning. The data were taken from the Corpus of Contemporary American English (COCA). The results show that the types of meaning of the modal have to + verb are dynamic and deontic modality.
- Research Article
- 10.13133/2239-1983/14387
- Jan 1, 2017
- Università degli studi di Roma La Sapienza
This paper carries out a close reading of a short extract of an Edith Wharton story using the tools of stylistics. The objective is to demonstrate that Wharton’s fundamental aims are to investigate the mind and to show how mind and society are inextricably intertwined. This she does by employing subtle linguistic means to vary the ‘degree of focalisation’ on the character in order not simply to guarantee the faithfulness of the mode of speech and thought presentation employed at each point in the text, but also to unveil the nature of the character’s thought processes at that given point in the text, thereby distinguishing between the character’s presentation of the Self in everyday life and his private musings, the latter being of two types: unconscious or inchoate thought or conscious thought which the character is fully aware of. It will be shown that each ‘type of thought’ reveals different aspects of the character’s personality and that each type is identifiable through specific linguistic means which Wharton reserves for each thought type. Thus, in addition to providing an analysis of the character’s personality and how society has impinged on his personality, the paper also constitutes a theoretical investigation of the methodological tools employed by Wharton. Keywords: cognition, emotion, speech and thought presentation, conceptual metaphor, pragmatics, projection, mind and society.
- Research Article
- 10.22028/d291-25042
- Jan 1, 2004
- Publications of the UdS (Saarland University)
This report presents an approach to enriching flat and robust predicate argument structures with more fine-grained semantic information, extracted from underspecified semantic representations and encoded in Minimal Recursion Semantics (MRS). Such representations are provided by a hand-built HPSG grammar with a wide linguistic coverage. A specific semantic representation, called linked predicate argument structure (LPAS), has been worked out, which describes the explicit embedding relationships among predicate argument structures. LPAS can be used as a generic interface language for integrating semantic representations with different granularities. Some initial experiments have been conducted to convert MRS expressions into LPASs. A simple constraint solver is developed to resolve the underspecified dominance relations between the predicates and their arguments in MRS expressions. LPASs are useful for high-precision information extraction and question answering tasks because of their fine-grained semantic structures. In addition, I have attempted to extend the lexicon of the HPSG English Resource Grammar (ERG) exploiting WordNet and to disambiguate the readings of HPSG parsing with the help of a probabilistic parser, in order to process texts from application domains. Following the presented approach, the HPSG ERG grammar can be used for annotating some standard treebank, e.g., the Penn Treebank, with its fine-grained semantics. In this vein, I point out opportunities for a fruitful cooperation of the HPSG annotated Redwood Treebank and the Penn PropBank. In my current work, I exploit HPSG as an additional knowledge resource for the automatic learning of LPASs from dependency structures.
- Research Article
- 10.5281/zenodo.1211103
- Apr 12, 2018
- SHILAP Revista de lepidopterología
The author of this article emphasizes that the study of narrative discourse in English literature, and the definition of the psycholinguistic aspects of such discourse, is a highly topical problem, since this kind of discourse plays an important role in all genres of fiction. In this article, narrative discourse is analyzed using the example of graphic novels, and the psycholinguistic features of narrative discourse are determined on the basis of that analysis. The article emphasizes that the analysis of materials presented in the form of comics is important for defining the features of narrative discourse, because the person who is observing is only the object of visualization, and not its subject. When a character watches something, the reader is positioned exactly like this character. A certain type of positioning is therefore important for the narration of text material presented in the form of comics, because this type of positioning greatly affects the values created by the reader’s work. It is emphasized that the psycholinguistic narrative paradigm of comics should be considered in a context in which comics are an assembly of both words and images, so that the reader must carry out both visual and linguistic interpretation of the content. The psycholinguistic paradigm of comics means that artistic schemes (for example, point of view, symmetry, contextualization by means of contextual details) and areas of linguistics (for example, grammar, literature, syntax) overlap and create an integral frame, which strengthens the reader’s understanding of a given work. The article also analyzes the context of “focusing” from a psycholinguistic point of view.
The author describes his own version of comic comprehension processing, which is based on knowledge of various narrative schemes, constantly changing the source data and obtaining new information. Understanding in this case is represented as a series of transitions from the external to the internal focus. The latter often happens in a visual form, creating visual continuity through bottom-up processing of information, while changes in the time and space of the plot occur through top-down processing of text information, combining previous knowledge of the characters and their actions into new contextual scripts. Analyzing the work “Night Guards” by Alan Moore and Dave Gibbons, the author proposes the following psycholinguistic aspects of narrative discourse: visual accentuation, updating of information, meta-narrative presentation of the text, contrasting visual word-combinations, and actualization of the narrative potential.
- Research Article
- 10.26262/istal.v19i0.5478
- Apr 6, 2011
- Aristotle University of Thessaloniki
The idea of the “situatedness” of all scientific endeavour has been proven beyond the shadow of a doubt by the so-called ‘sociologists of knowledge’ and is today beginning to be recognized even by some hard-core, dyed-in-the-wool philosophers of science. Linguistics, like all other human and social sciences, cannot help being socio-historically situated. Neither can linguists. It is also no secret that the science of language came into being at a time when the world lived by a completely different set of rules. Today the world we live in is a far cry from what it used to be in those times and the phenomenon of globalization has changed it unrecognizably. It only stands to reason that our science is in dire need of being rehashed or, who knows, radically revamped, so as to bring it more in tune with the changing times. This presentation addresses the mind-boggling prospects ahead, including that of having to rethink some of the fundamental concepts and categories with which we have got used to working in the field of linguistics.
- Supplementary Content
2
- 10.22028/d291-23594
- Jan 1, 2012
- Publications of the UdS (Saarland University)
In this thesis we describe an incremental multi-layer rule-based methodology for the extraction of ontology schema components from German financial newspaper text. By Extraction of Ontology Schema Components we mean the detection of new concepts and relations between these concepts for ontology building. The process of detecting concepts and relations between these concepts corresponds to the intensional part of an ontology and is often referred to as ontology learning [1]. We present the process of rule generation for the extraction of ontology schema components as well as the application of the generated rules. Most of the research on ontology learning (Cimiano et al., 2005; Aguado de Cea et al., 2008) investigates the learning potential at sentential level, after the corpus has undergone a deep linguistic analysis [2]. In this thesis we present a bottom-up method for the extraction of ontology schema components, showing that the extraction process of new classes and relations can be initialized at a “lower” level using shallow and robust linguistic analysis. We start the investigation by extracting candidates for ontology classes and relations from plain text, by applying text-based and string-based patterns. Then we go one step further and apply the accumulated knowledge from the previous step on Part-of-Speech (PoS) and semantically annotated text, validating in this way
[1] Ontology learning is the process of semi-automatic support in ontology development (Buitelaar et al., 2005).
[2] By deep linguistic analysis we mean grammatical function analysis.
- Conference Article
- 10.46793/tie22.408q
- Aug 1, 2022
Genre analysis has become a prevalent approach in the linguistic analysis of various specialized genres. The concept of genre, emerging from literature, has received a broader dimension in the last decade, focusing on establishing recognized structures and language exponents of a specific genre in a particular discourse community. In addition, the expansion of ESP and the rise of subgenres in many rising professional vocations require users to have competence in the English language. Language researchers also need ‘to dig into’ the pragmatic context of genres. With this in mind and resting on the concept of genre and discourse communities, the paper sheds light on how the genre analysis approach can be applied in teaching different marine electrical genres to students and future ETO officers. The marine electrical engineering discourse community is specific and relatively novel. In this paper, the focus is placed on seafarers, future electro-technical officers and the analysis of genres they utilize in their professional work on board ships. The results of the paper can be inspiring to ESP teachers involved in teaching specialized and technical genres.
- Research Article
2
- 10.17863/cam.1576
- Jan 1, 2009
- Apollo (University of Cambridge)
This paper is intended to examine the Generic Structural Potential, semantic attributes and lexicogrammatical patterns of the event sections of Grimm's fairy tales. Generic Structural Potential is a description of all structural elements of a genre. Event sections include three obligatory elements of fairy tales, from Initiating Event via Sequent Event to Final Event. The semantic attributes of Initiating Event are Lack, Obligation and Ordeal. The semantic attributes of Sequent Event are Test and Solution. The semantic attributes of Final Event are Punishment and Victory. Each attribute is realized by distinct lexicogrammatical patterns. With regard to a certain genre, the relationship between Generic Structural Potential, semantic attribute and lexicogrammatical pattern is systematic. Semantic properties and lexicogrammatical resources have different distributions in different structural elements of the same genre.
- Research Article
- 10.9776/13453
- Feb 1, 2013
Ubiquitous access to the internet has resulted in more and more people going online to get their daily dose of news. In a 2010 survey conducted by the Pew Project for Excellence in Journalism, 41% of the respondents said they get most of their news online, 10% more than those who said they got most of their news from a newspaper. Many socio-technical factors have contributed to this phenomenal rise in adoption of online news in recent years. One of the biggest reasons why people are increasingly reading news online is that it facilitates discussion with peers (Nguyen 2010), offering different viewpoints which aid in forming a rounded personal opinion about the news story. The Pew survey found that 37% of online news users (and 51% of 18-29 year olds) think that commenting on news stories is an important feature to have. Many people tend to shape their opinion by reading discussion comments, reflective articles, blogs and even tweets about the news. Hence, an increasing number of people rely on online sources of news – be it news websites or news aggregator services like Digg, Reddit, Google Reader, Flipboard, Pulse etc. The problem with these news websites and aggregators is that the only way people can gather public opinion is by actively searching through the endless stream of comments and feeds, filtering out spam (which is a growing problem) and then reading the relevant posts. A top trending story on Twitter will typically see multiple tweets per second, and keeping up with the rapid flow of incoming tweets is quite cumbersome and cognitively taxing. Hence it becomes increasingly difficult and time-consuming for someone who wants to get the pulse of the people affected by a news story. Furthermore, in certain scenarios people might want to look at more fine-grained opinions. Currently, there is no elegant way to extract the geographic and demographic impact of a news story. What is the public sentiment in Indonesia about the Arab Spring?
How did the public opinion about the Wikileaks disclosures change as the story unfolded during the course of a year? It is very difficult and tedious to observe such patterns using the currently available news providers. This work attempts to solve these problems by proposing a news aggregator platform which pulls news stories from various sources and also aggregates public responses, reflections, opinions and sentiments associated with those stories. This data is presented in ways that are easily understandable so readers can make better sense of the stories unfolding across the globe. Such a news aggregator platform, which gathers and displays public opinion and sentiments about a story, must deal with various challenges:
1. Opinions are very subjective. Different people feel about a story in different ways. With such an enormous amount of diverse opinions and subjectivity, how can we possibly aggregate the responses into something that makes sense as a whole?
2. There isn’t really a unified web standard for expressing opinion (in textual form). Some people tweet in 140 characters, while others write elaborate blog posts. Some websites employ tags which a reader can use to define and classify their public opinion, while others rely on threaded comments and comment ranking systems. How can a platform be flexible enough to adapt to all these varied standards so that it can extract valuable data from various sources? Perhaps the platform can create a new standard of expression on the web which is flexible and comprehensive enough to be used to express diverse views about every news story in the world.
3. How to filter out spam while extracting public opinion?
4. Once the platform has access to the data it needs, how should it be displayed to the reader in a way that makes sense? What forms of visualizations, illustrations and graphical representations can be employed to give the reader a holistic view of how people feel about a story?
5. How can the platform determine and convey effects of geographical, demographic and temporal variations as the story unfolds?
These are just a few out of possibly many issues which must be dealt with.
Acknowledgements: Prof. Yardi, S. for the guidance. Sethi, P. (2013). Public opinion aggregation by annotation and tagging of online news stories. iConference 2013 Proceedings (pp. 891-894). doi:10.9776/13453 Copyright is held by the author. iConference 2013, February 12-15, 2013, Fort Worth, TX, USA.
Previous research on similar public opinion aggregation services has greatly focused on natural language processing, data mining and text categorization and clustering. Xiaojun (2010) proposed a framework for crawling the web for comments and applying various data mining algorithms on the data to extract relevant information. Diakopoulos and Shamma (2010) used tweets posted in conjunction with the live presidential debate between Barack Obama and John McCain to gauge public opinion. Brody and Diakopoulos (2011) studied the use of word lengthening to detect sentiment in microblogs. This research proposes a solution – The Opinionated Reader, which relies on sentiment tags and annotations associated with a news story. The essential idea is to create a commenting, discussion and sharing plug-in which can be used by news websites and aggregators as a commenting solution for their news pages. Users wanting to share or comment on a news story through the plug-in are asked to tag the news story with sentiment tags and annotate the story with their reaction (happy/positive or sad/negative). These tags and annotations are stored, aggregated and linked to each news story. A mobile application provides the front-end interface for users to access the news stories and the aggregated sentiment associated with each story.
The basic architecture is explained as follows:
The Opinionated Reader – Mobile/Tablet App
The app fetches news articles from various web sources, based on the interests and preferences configured by the user. In every news article, a portion of the screen real estate is reserved for Opinions which shows graphical visualizations and illustrations of the public opinion surrounding the news story. These visualizations include: A Sentiment Graph indicating the popular tags associated with the story (E.g.: “Shocking”, “Inspiring”, “Amusing” etc.). See Figure 1 for example visualization. A Positivity Graph which plots the level of positivity associated with the story on a time scale from when the news broke. See Figure 2 for example visualization.
Figure 1. Example Sentiment Graph
Figure 2. Example Positivity Graph
The user can choose to see these visualizations for a particular time period in the evolution history of the news story, or for a specific country. The app also facilitates people to tag and annotate news articles from within its interface.
The Opinionated Reader – Commenting and Sharing Web Plug-in
These days, a common way of adding discussion and commenting functionality to news websites is by using 3rd party services (like DISQUS). The Opinionated Reader is a similar service which can be embedded into the news articles of various news websites to enable commenting and sharing. When someone wishes to comment on an article, the comment is directed through this plug-in, which allows the users to annotate the article with the sentiment tags and reactions along with their comments. The Opinionated Reader saves this information along with the commenter’s location and date of comment (See Figure 3).
Figure 3. The Opinionated Reader Web Plug-in 'Add Comment' dialog mockup
The Opinionated Reader – Back-end
The Back-end maintains a database of news items extracted from RSS feeds of various news websites.
Each news article is linked with the sentiment tags and reaction/positivity annotations extracted from the comments and annotations gathered by the commenting plug-in. This data is used by the mobile/tablet app to generate visualizations (Sentiment Graph and Positivity Graph). The back-end also performs data mining on the tags and annotations for geographies and tracks the opinions across time. The back-end system responds to queries received from the mobile app with the news story and associated tags and annotations, which are then rendered by the mobile app for the user.
Discussion and Conclusion
This design idea is still in a nascent state and has long hours of research, brainstorming, designing and development to go before it can be realized into something tangible. Twitter has grown exponentially in importance as a news source and it would be vastly valuable to integrate Twitter with The Opinionated Reader. Possibilities include use of special hash tags and natural language processing of tweets to extract public sentiment. The current design supports only two reaction annotations – positive and negative. Not every news story fits this annotation paradigm. Further research about human reactions to news stories might unveil interesting insights which would help zero in on a more robust annotation rubric. Lastly, since this service is envisioned to be non-curated and non-moderated, the value served by the app depends on the users themselves. Greater adoption will lead to more annotations and tags, which translates into a more accurate public opinion as presented to the user.
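The back-end aggregation described in this abstract (linking sentiment tags and positive/negative reactions to a story, then filtering by country and tracking opinion over time) might be sketched as follows in Python. This is only an illustrative sketch under stated assumptions: the `Annotation` record, its field names, and the functions `sentiment_graph` and `positivity_graph` are hypothetical names chosen here, and the paper does not specify a data model or an implementation.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Annotation:
    """One reader reaction gathered by the commenting plug-in (hypothetical schema)."""
    story_id: str
    tag: str          # sentiment tag, e.g. "Shocking"
    positive: bool    # reaction annotation: True = happy/positive
    country: str      # commenter's location
    day: date         # date of the comment


def _select(annotations: Iterable[Annotation], story_id: str,
            country: Optional[str]) -> List[Annotation]:
    """Keep annotations for one story, optionally restricted to one country."""
    return [a for a in annotations
            if a.story_id == story_id and (country is None or a.country == country)]


def sentiment_graph(annotations: Iterable[Annotation], story_id: str,
                    country: Optional[str] = None) -> Counter:
    """Count how often each sentiment tag was attached to the story."""
    return Counter(a.tag for a in _select(annotations, story_id, country))


def positivity_graph(annotations: Iterable[Annotation], story_id: str,
                     country: Optional[str] = None) -> Dict[date, float]:
    """Share of positive reactions per day, in chronological order."""
    per_day: Dict[date, List[bool]] = defaultdict(list)
    for a in _select(annotations, story_id, country):
        per_day[a.day].append(a.positive)
    return {d: sum(v) / len(v) for d, v in sorted(per_day.items())}


# Hypothetical sample data, invented purely for illustration.
sample = [
    Annotation("story-1", "Shocking", False, "ID", date(2013, 2, 1)),
    Annotation("story-1", "Inspiring", True, "ID", date(2013, 2, 1)),
    Annotation("story-1", "Shocking", False, "US", date(2013, 2, 2)),
]
print(sentiment_graph(sample, "story-1").most_common())
print(positivity_graph(sample, "story-1", country="ID"))
```

Keeping each reaction as a flat record with location and date makes the per-country and per-period views simple filters over the same data, which matches the abstract's requirement that the app show either visualization for a specific country or time period.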