INFORMATION RETRIEVAL


Information retrieval systems (IRS) are designed to search for relevant information in large documentary databases. This information can be of various kinds, with the queries ranging from “Find all the documents containing the word conjugar” to “Find information on the conjugation of Spanish verbs”. Accordingly, various systems use different methods of search.

The earliest IRSs were developed to search for scientific articles on a specific topic. Usually, authors supply their papers with a set of keywords, i.e., the terms they consider most important and relevant to the topic of the paper. For example, español, verbos, subjuntivo might be the keyword set of the article “On means of expressing unreal conditions” in a Spanish scientific journal.

These sets of keywords are attached to the documents in the bibliographic database of the IRS and are physically stored either together with the corresponding documents or separately from them. In the simplest case, the query must explicitly contain one or more of these keywords as the condition under which the article can be found and retrieved from the database. Here is an example of a query: “Find the documents on verbos and español”. In a more elaborate system, a query can be a longer logical expression with the operators and, or, not, e.g.: “Find the documents on (sustantivos or adjetivos) and (not inglés)”.
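
The two example queries can be expressed as tests on keyword sets. The following is only a minimal sketch: the document ids, the keyword sets attached to them, and the helper `search` are hypothetical and chosen for illustration.

```python
# Hypothetical toy database: document id -> keyword set attached to the document.
DOCS = {
    "doc1": {"español", "verbos", "subjuntivo"},
    "doc2": {"español", "sustantivos"},
    "doc3": {"inglés", "adjetivos"},
    "doc4": {"inglés", "verbos"},
}


def search(predicate):
    """Return the ids of documents whose keyword set satisfies the predicate."""
    return sorted(doc_id for doc_id, kws in DOCS.items() if predicate(kws))


# "Find the documents on verbos and español"
simple = search(lambda kws: "verbos" in kws and "español" in kws)

# "Find the documents on (sustantivos or adjetivos) and (not inglés)"
boolean = search(
    lambda kws: ("sustantivos" in kws or "adjetivos" in kws) and "inglés" not in kws
)

print(simple)
print(boolean)
```

Writing each query as a predicate over a keyword set keeps the operators and, or, not identical to those of the query language.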

Nowadays, a simple but powerful query format is becoming popular in IRSs for non-professional users: the query is still a set of words, but the system first tries to find the documents containing all of these words, then all but one, etc., and finally those containing only one of the words. Thus, the set of keywords is processed in a step-by-step transition from the conjunction to the disjunction of their occurrences. The results are ordered by degree of relevance, which can be measured by the number of query keywords found in the document: the documents containing more keywords are presented to the user first.

In some systems the user can manually set a threshold for the number of keywords that must be present in a document, i.e., search for “at least m of n” keywords. With m = n, often too few documents, if any, are retrieved and many relevant documents are missed; with m = 1, too many unrelated ones are retrieved because of a high rate of false alarms.
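
Ranking by the number of matched keywords and the “at least m of n” threshold can both be sketched in a few lines. This is an illustration under stated assumptions, not the method of any particular IRS: the function names `rank_by_matches` and `at_least_m`, the document ids, and the keyword sets are hypothetical.

```python
# Hypothetical toy database: document id -> keyword set attached to the document.
DOCS = {
    "a": {"español", "verbos", "subjuntivo"},
    "b": {"español", "verbos"},
    "c": {"verbos", "inglés"},
    "d": {"física"},
}


def rank_by_matches(query, docs):
    """Order documents by the number of query keywords they contain, most first."""
    query = set(query)
    scores = {doc_id: len(query & kws) for doc_id, kws in docs.items()}
    hits = [doc_id for doc_id, score in scores.items() if score > 0]
    # Ties are broken by document id so that the order is deterministic.
    return sorted(hits, key=lambda doc_id: (-scores[doc_id], doc_id))


def at_least_m(query, docs, m):
    """Return the documents containing at least m of the query keywords."""
    query = set(query)
    return sorted(d for d, kws in docs.items() if len(query & kws) >= m)


QUERY = {"español", "verbos", "subjuntivo"}
ranked = rank_by_matches(QUERY, DOCS)   # three matches first, then two, then one
strict = at_least_m(QUERY, DOCS, m=3)   # m = n: pure conjunction
loose = at_least_m(QUERY, DOCS, m=1)    # m = 1: pure disjunction

print(ranked, strict, loose)
```

With m = n only document "a" survives, while m = 1 also admits "c", which shares a single keyword with the query.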

Usually, recall and precision are considered the main characteristics of IRSs. Recall is the number of relevant documents found divided by the total number of relevant documents in the database. Precision is the number of relevant documents found divided by the total number of documents found.

It is easy to see that in the general case these characteristics pull in opposite directions, i.e., the greater one of them is, the smaller the other tends to be, so that it is necessary to keep a proper balance between them.
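
The two measures and the tension between them can be illustrated with a small computation. The document ids and both result sets below are made up for the example, and returning 0.0 for an empty denominator is an assumed convention, not something the text prescribes.

```python
def recall(found, relevant):
    """Relevant documents found / all relevant documents in the database."""
    found, relevant = set(found), set(relevant)
    return len(found & relevant) / len(relevant) if relevant else 0.0


def precision(found, relevant):
    """Relevant documents found / all documents found."""
    found, relevant = set(found), set(relevant)
    return len(found & relevant) / len(found) if found else 0.0


# Hypothetical database in which four documents are relevant to the query.
RELEVANT = {"d1", "d2", "d3", "d4"}

# A strict search returns little; a loose search returns much, including noise.
STRICT_RESULT = {"d1"}
LOOSE_RESULT = {"d1", "d2", "d3", "d5", "d6", "d7"}

print(recall(STRICT_RESULT, RELEVANT), precision(STRICT_RESULT, RELEVANT))
print(recall(LOOSE_RESULT, RELEVANT), precision(LOOSE_RESULT, RELEVANT))
```

In this made-up case the strict search has perfect precision but low recall, and the loose search reverses the picture.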

In a specialized IRS, there usually exists an automated indexing subsystem, which works before the searches are executed. Given a set of keywords, it adds, using the or operator, other related keywords, based on a hierarchical system of scientific, technical or business terms. This kind of hierarchical system is usually called a thesaurus in the literature on IRSs, and it can be an integral part of the IRS. For instance, given the query “Find the documents on conjugación,” such a system could add the word morfología to both the query and the set of keywords in the example above, and hence find the requested article.

Thus, a sufficiently sophisticated IRS first enriches the set of keywords given in the query, and then compares this set with the previously enriched sets of keywords attached to each document in the database. The comparison is performed according to any of the criteria mentioned above. After the enrichment, the average recall of the IRS usually increases.
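
The enrichment step can be sketched as repeated lookup in a thesaurus. The thesaurus fragment below is invented for illustration (in particular, the links from verbos and conjugación to morfología are assumptions), and `enrich` is a hypothetical helper, not part of any real system.

```python
# Hypothetical thesaurus fragment: term -> broader or related terms.
THESAURUS = {
    "conjugación": {"morfología"},
    "verbos": {"morfología"},
    "subjuntivo": {"verbos"},
}


def enrich(keywords, thesaurus=THESAURUS):
    """Add related terms repeatedly until no new term appears."""
    enriched = set(keywords)
    frontier = set(keywords)
    while frontier:
        new_terms = set()
        for term in frontier:
            new_terms |= thesaurus.get(term, set())
        frontier = new_terms - enriched  # only terms not seen before
        enriched |= frontier
    return enriched


raw_query = {"conjugación"}
raw_document = {"español", "verbos", "subjuntivo"}

query = enrich(raw_query)
document = enrich(raw_document)

print(raw_query & raw_document)  # no keyword in common before enrichment
print(query & document)          # common keyword after enrichment
```

Before enrichment the query and the article share no keyword; afterwards both contain morfología, so a disjunctive comparison retrieves the article.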

Recently, systems have been created that can automatically build sets of keywords given just the full text of the document. Such systems do not require the authors of the documents to specifically provide the keywords. Some of the modern Internet search engines are essentially based on this idea.

Three decades ago, the problem of automatic extraction of keywords was called automatic abstracting. The problem is not simple, even when it is solved by purely statistical methods. Indeed, the most frequent words in any business, scientific or technical texts are purely auxiliary, like prepositions or auxiliary verbs. They do not reflect the essence of the text and are not usually taken for abstracting. However, the border between auxiliary and meaningful words cannot be strictly defined. Moreover, there exist many term-forming words like system, device, etc., which can seldom be used for information retrieval because their meaning is too general. Therefore, they are not useful for abstracts.
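
A purely statistical keyword extractor of the kind discussed here can be sketched as a frequency count with two exclusion lists. Both lists below are deliberately tiny and hypothetical, as are the function name `extract_keywords` and the sample text; as noted above, the border between auxiliary and meaningful words cannot be drawn this sharply in practice.

```python
import re
from collections import Counter

# Hypothetical exclusion lists; real systems would need far larger ones.
AUXILIARY = {"the", "of", "a", "is", "are", "in", "and", "to", "by", "for"}
TOO_GENERAL = {"system", "device"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def extract_keywords(text, k=3):
    """Return the k most frequent words that are neither auxiliary nor too general."""
    excluded = AUXILIARY | TOO_GENERAL
    counts = Counter(w for w in tokenize(text) if w not in excluded)
    # Sort by descending frequency, then alphabetically for a deterministic order.
    return [w for w, _ in sorted(counts.items(), key=lambda p: (-p[1], p[0]))[:k]]


TEXT = "The verbs of the system and the endings of the verbs: the system of the verbs."

most_frequent_raw = Counter(tokenize(TEXT)).most_common(1)[0][0]
keywords = extract_keywords(TEXT, k=2)

print(most_frequent_raw)
print(keywords)
```

In the sample text the most frequent raw word is an article, which shows why raw frequency alone is a poor indicator of content.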

The many existing IRSs are now regarded as an important class of applied software and, specifically, of applied linguistic systems. The period when they used only individual words as keys has passed. Developers now try to use word combinations and phrases, as well as more complicated search strategies. The limiting factors for the more sophisticated techniques turned out to be the same as those for grammar and style checkers: the absence of complete grammatical and semantic analysis of the text of documents. The methods now used even in the most sophisticated Internet search engines are not efficient for accurate information retrieval. This leads to a high level of information noise, i.e., the delivery of irrelevant documents, as well as to the frequent missing of relevant ones.

The results of retrieval operations directly depend on the quality and performance of the indexing and comparing subsystems, on the content of the terminological system or the thesaurus, and on other data and knowledge used by the system. Obviously, the main tools and data sets used by an IRS are linguistic in nature.


This topic belongs to the section:

THE ROLE OF NATURAL LANGUAGE PROCESSING


All topics in this section:

THE ROLE OF NATURAL LANGUAGE PROCESSING
We live in the age of information. It pours upon us from the pages of newspapers and magazines, radio loudspeakers, TV and computer screens. The main part of this information has the form of natura

LINGUISTICS AND ITS STRUCTURE
Linguistics is a science about natural languages. To be more precise, it covers a whole set of different related sciences (see Figure I.1). General linguistics is a nucleus [18, 36]

WHAT WE MEAN BY COMPUTATIONAL LINGUISTICS
Computational linguistics might be considered as a synonym of automatic processing of natural language, since the main task of computational linguistics is just the construction of computer

WORD, WHAT IS IT?
As it could be noticed, the term word was used in the previous sections very loosely. Its meaning seems obvious: any language operates with words and any text or utterance consists of them.

THE IMPORTANT ROLE OF THE FUNDAMENTAL SCIENCE
In the past few decades, many attempts to build language processing or language understanding systems have been undertaken by people without sufficient knowledge in theoretical linguistics. They ho

CURRENT STATE OF APPLIED RESEARCH ON SPANISH
In our books, the stress on Spanish language is made intentionally and purposefully. For historical reasons, the majority of the literature on natural languages processing is not only written in En

CONCLUSIONS
The twenty-first century will be the century of the total information revolution. The development of the tools for the automatic processing of the natural language spoken in a country or a whole gr

II. A HISTORICAL OUTLINE
A COURSE ON LINGUISTICS usually follows one of the general models, or theories, of natural language, as well as the corresponding methods of interpretation of the linguistic phenomena. A c

THE STRUCTURALIST APPROACH
At the beginning of the twentieth century, Ferdinand de Saussure had developed a new theory of language. He considered natural language as a structure of mutually linked elements, similar or

INITIAL CONTRIBUTION OF CHOMSKY
In the 1950’s, when the computer era began, the eminent American linguist Noam Chomsky developed some new formal tools aimed at a better description of facts in various languages [12].

A SIMPLE CONTEXT-FREE GRAMMAR
Let us consider an example of a context-free grammar for generating very simple English sentences. It uses the initial symbol S of a sentence to be generated and several oth

TRANSFORMATIONAL GRAMMARS
Further research revealed great generality, mathematical elegance, and wide applicability of generative grammars. They became used not only for description of natural languages, but also for specif

THE LINGUISTIC RESEARCH AFTER CHOMSKY: VALENCIES AND INTERPRETATION
After the introduction of the Chomskian transformations, many conceptions of language well known in general linguistics still stayed unclear. In the 1980’s, several grammatical theories different f

LINGUISTIC RESEARCH AFTER CHOMSKY: CONSTRAINTS
Another very valuable idea originated within the generative approach was that of using special features assigned to the constituents, and specifying constraints to characterize agreement or

HEAD-DRIVEN PHRASE STRUCTURE GRAMMAR
One of the direct followers of the GPSG was called Head-Driven Phrase Structure Grammar (HPSG). In addition to the advanced traits of the GPSG, it has introduced and intensively used the notion of

THE IDEA OF UNIFICATION
Having in essence the same initial idea of phrase structures and their context-free combining, the HPSG and several other new approaches within Chomskian mainstream select the general and very powe

THE MEANING ⇔ TEXT THEORY: MULTISTAGE TRANSFORMER AND GOVERNMENT PATTERNS
The European linguists went their own way, sometimes pointing out some oversimplifications and inadequacies of the early Chomskian linguistics. In late 1960’s, a new theory, the Mean

THE MEANING ⇔ TEXT THEORY: DEPENDENCY TREES
Another important feature of the MTT is the use of its dependency trees, for description of syntactic links between words in a sentence. Just the set of these links forms the representation

THE MEANING ⇔ TEXT THEORY: SEMANTIC LINKS
The dependency approach is not exclusively syntactic. The links between wordforms at the surface syntactic level determine links between corresponding labeled nodes at the deep syntactic level, and

CONCLUSIONS
In the twentieth century, syntax was in the center of the linguistic research, and the approach to syntactic issues determined the structure of any linguistic theory. There are two major approaches

III. PRODUCTS OF COMPUTATIONAL LINGUISTICS: PRESENT AND PROSPECTIVE
FOR WHAT PURPOSES do we need to develop computational linguistics? What practical results does it provide for society? Before we start discussing the methods and techniques of computational lingui

CLASSIFICATION OF APPLIED LINGUISTIC SYSTEMS
Applied linguistic systems are now widely used in business and scientific domains for many purposes. Some of the most important ones among them are the following: · Text preparation

AUTOMATIC HYPHENATION
Hyphenation is intended for the proper splitting of words in natural language texts. When a word occurring at the end of a line is too long to fit on that line within the accepted margins, a part o

SPELL CHECKING
The objective of spell checking is the detection and correction of typographic and orthographic errors in the text at the level of word occurrence considered out of its context. Nob

GRAMMAR CHECKING
Detection and correction of grammatical errors by taking into account adjacent words in the sentence or even the whole sentence are much more difficult tasks for computational linguists and softwar

STYLE CHECKING
The stylistic errors are those violating the laws of use of correct words and word combinations in language, in general or in a given literary genre. This application is the nearest in its

REFERENCES TO WORDS AND WORD COMBINATIONS
The references from any specific word give access to the set of words semantically related to the former, or to words, which can form combinations with the former in a text. This is a very importan

TOPICAL SUMMARIZATION
In many cases, it is necessary to automatically determine what a given document is about. This information is used to classify the documents by their main topics, to deliver by Internet the documen

AUTOMATIC TRANSLATION
Translation from one natural language to another is a very important task. The amount of business and scientific texts in the world is growing rapidly, and many countries are very productive in sci

NATURAL LANGUAGE INTERFACE
The task performed by a natural language interface to a database is to understand questions entered by a user in natural language and to provide answers—usually in natural language, but sometimes a

EXTRACTION OF FACTUAL DATA FROM TEXTS
Extraction of factual data from texts is the task of automatic generation of elements of a factographic database, such as fields, or parameters, based on on-line texts. Often the flows of the curre

TEXT GENERATION
The generation of texts from pictures and formal specifications is a comparatively new field; it arose about ten years ago. Some useful applications of this task have been found in recent years. Am

SYSTEMS OF LANGUAGE UNDERSTANDING
Natural language understanding systems are the most general and complex systems involving natural language processing. Such systems are universal in the sense that they can perform nearly all the t

RELATED SYSTEMS
There are other types of applications that are not usually considered systems of computational linguistics proper, but rely heavily on linguistic methods to accomplish their tasks. Of these we will

CONCLUSIONS
A short review of applied linguistic systems has shown that only very simple tasks like hyphenation or simple spell checking can be solved on a modest linguistic basis. All the other systems should

POSSIBLE POINTS OF VIEW ON NATURAL LANGUAGE
One could try to define natural language in one of the following ways: · The principal means for expressing human thoughts; · The principal means for text generation; · T

LANGUAGE AS A BI-DIRECTIONAL TRANSFORMER
The main purpose of human communication is transferring some information—let us call it Meaning[6]—from one person to the other. However, the direct transferring of thoughts is not possi

TEXT, WHAT IS IT?
The empirical reality for theoretical linguistics comprises, in the first place, the sounds of speech. Samples of speech, i.e., separate words, utterances, discourses, etc., are given to the resear

MEANING, WHAT IS IT?
Meanings, in contrast to texts, cannot be observed directly. As we mentioned above, we consider the Meaning to be the structures in the human brain which people experience as ideas and thoughts. Si

TWO WAYS TO REPRESENT MEANING
To represent the entities and relationships mentioned in the texts, the following two logically and mathematically equivalent formalisms are used: · Predicative formulas. Logical

DECOMPOSITION AND ATOMIZATION OF MEANING
Semantic representation in many cases turns out to be universal, i.e., common to different natural languages. Purely grammatical features of different languages are not usually reflected in

NOT-UNIQUENESS OF MEANING ⇒ TEXT MAPPING: SYNONYMY
Returning to the mapping of Meanings to Texts and vice versa, we should mention that, in contrast to common mathematical functions, this mapping is not unique in both directions, i.e., it is of the

NOT-UNIQUENESS OF TEXT ⇒ MEANING MAPPING: HOMONYMY
In the opposite direction—Texts to Meanings—a text or its fragment can exhibit two or more different meanings. That is, one element of the surface edge of the mapping (i.e. text) can correspond to

MORE ON HOMONYMY
In the field of computational linguistics, homonymous lexemes usually form separate entries in dictionaries. Linguistic analyzers must resolve the homonymy automatically, by choosing the correct op

MULTISTAGE CHARACTER OF THE MEANING ⇔ TEXT TRANSFORMER
FIGURE IV.10. Levels of linguistic representation.

TRANSLATION AS A MULTISTAGE TRANSFORMATION
FIGURE IV.13. The role of dictionaries and grammars in linguis

TWO SIDES OF A SIGN
The notion of sign, so important for linguistics, was first proposed in a science called semiotics. The sign was defined as an entity consisting of two components, the signifier

LINGUISTIC SIGN
The notion of linguistic sign was introduced by Ferdinand de Saussure. By linguistic signs, we mean the entities used in natural languages, such as morphs, lexemes, and phrases. Lin

LINGUISTIC SIGN IN THE MTT
In addition to the two well-known components of a sign, in the Meaning ⇔ Text Theory yet another, a third component of a sign, is considered essential: a record about its ability or inability

LINGUISTIC SIGN IN HPSG
In Head-driven Phrase Structure Grammar a linguistic sign, as usually, consists of two main components, a signifier and a signified. The signifier is defined as a phoneme string (or a sequence of s

ARE SIGNIFIERS GIVEN BY NATURE OR BY CONVENTION?
The notion of sign appeared rather recently. However, the notions equivalent to the signifier and the signified were discussed in science from the times of the ancient Greeks. For several centuries

GENERATIVE, MTT, AND CONSTRAINT IDEAS IN COMPARISON
In this book, three major approaches to linguistic description have been discussed till now, with different degree of detail: (1) generative approach developed by N. Chomsky, (2) the Meaning ⇔

CONCLUSIONS
The definition of language has been suggested as a transformer between the two equivalent representations of information, the Text, i.e., the surface textual representation, and the Meaning, i.e.,

V. LINGUISTIC MODELS
THROUGHOUT THE PREVIOUS CHAPTERS, you have learned, on the one hand, that for many computer applications, detailed linguistic knowledge is necessary and, on the other hand, that natural language ha

WHAT IS MODELING IN GENERAL?
In natural sciences, we usually consider the system A to be a model of the system B if A is similar to B in some important properties and exhibits somewhat simila

NEUROLINGUISTIC MODELS
Neurolinguistic models investigate the links between any external speech activity of human beings and the corresponding electrical and humoral activities of nerves in their brain. I

PSYCHOLINGUISTIC MODELS
Psycholinguistics is a science investigating the speech activity of humans, including perception and forming of utterances, via psychological methods. After creating its hypotheses and model

FUNCTIONAL MODELS OF LANGUAGE
In terms of cybernetics, natural language is considered as a black box for the researcher. A black box is a device with observable input and output but with a completely unobservable inner s

RESEARCH LINGUISTIC MODELS
There are still other models of interest for linguistics. They are called research models. At input, they take texts in natural language, maybe prepared or formatted in a special manner befo

COMMON FEATURES OF MODERN MODELS OF LANGUAGE
The modern models of language have turned out to possess several common features that are very important for the comprehension and use of these models. One of these models is given by the Meaning &

SPECIFIC FEATURES OF THE MEANING ⇔ TEXT MODEL
The Meaning ⇔ Text Model was selected for the most detailed study in these books, and it is necessary now to give a short synopsis of its specific features. · Orientation to synth

REDUCED MODELS
We can formulate the problem of selecting a good model for any specific linguistic application as follows. A holistic model of the language facilitates describing the language as a

DO WE REALLY NEED LINGUISTIC MODELS?
Now let us reason a little bit on whether computer scientists really need a generalizing (complete) model of language. In modern theoretical linguistics, certain researchers study phonolog

ANALOGY IN NATURAL LANGUAGES
Analogy is the prevalence of a pattern (i.e., one rule or a small set of rules) in the formal description of some linguistic phenomena. In the simplest case, the pattern can be represented with the

EMPIRICAL VERSUS RATIONALIST APPROACHES
In the recent years, the interest to empirical approach in linguistic research has livened. The empirical approach is based on numerous statistical observations gathered purely automatically

LIMITED SCOPE OF THE MODERN LINGUISTIC THEORIES
Even the most advanced linguistic theories cannot pretend to cover all computational problems, at least at present. Indeed, all of them evidently have the following limitations: · Only the

CONCLUSIONS
A linguistic model is a system of data (features, types, structures, levels, etc.) and rules, which, taken together, can exhibit a “behavior” similar to that of the human brain in understanding and

REVIEW QUESTIONS
THE FOLLOWING QUESTIONS can be used to check whether the reader has understood and remembered the main contents of the book. The questions are also recommended for t

PROBLEMS RECOMMENDED FOR EXAMS
IN THIS SECTION, each test question is supplied with a set of four variants of the answer, of which exactly one is correct and the others are not. 1. Why automatic natural language process

RECOMMENDED LITERATURE
1. Allen, J. Natural Language Understanding. The Benjamin / Cummings Publ., Amsterdam, Bonn, Sidney, Singapore, Tokyo, Madrid, 1995. 2. Cortés García, U., J. Bé

ADDITIONAL LITERATURE
10. Baeza-Yates, R., B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman and ACM Press, 1999. 11. Beristáin, Helena. Gramática estructural de la l

GENERAL GRAMMARS AND DICTIONARIES
20. Criado de Val, M. Gramática española. Madrid, 1958. 21. Cuervo, R. J. Diccionario de construcción y régimen de la lengua castellana. Instituto

REFERENCES
34. Apresian, Yu. D. et al. Linguistic support of the system ETAP-2 (in Russian). Nauka, Moscow, Russia, 1989. 35. Beekman, G. “Una mirada a la tecnología del ma&ntild

SOME SPANISH-ORIENTED GROUPS AND RESOURCES
HERE WE PRESENT a very short list of groups working on Spanish, with their respective URLs, especially the groups in Latin America. The members of the RITOS network (emilia.dc.fi.udc.es / Ritos2) a