Fragment of a thesaurus dictionary of corpus linguistics


The dictionary entries are divided into fields marked with the following labels: Term (English-language term); Trans (Russian-language term); Def (definition); Syn (synonym); Ant (antonym); Up (broader term); Down (narrower term); Cyt (citation).
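The labelled-field layout lends itself to simple machine processing. The following is a minimal sketch, not part of the original guide, of how such entries might be read into a data structure. It assumes one field per line with the label separated from its value by a space; the function name `parse_entries` is hypothetical, and the label `Look`, which occurs in one entry but is not in the list above, is accepted here as a cross-reference on that assumption.

```python
from collections import defaultdict

# Field labels listed in the text. "Look" is not in that list but occurs in
# one entry, so it is accepted here as a cross-reference (an assumption).
LABELS = ("Term", "Trans", "Def", "Syn", "Ant", "Up", "Down", "Cyt", "Look")


def parse_entries(text):
    """Split labelled lines into a list of entries, one dict per Term."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        label, _, value = line.partition(" ")
        if label not in LABELS:
            continue  # skip anything that is not a labelled field
        if label == "Term":
            # every Term line opens a new entry
            current = defaultdict(list)
            entries.append(current)
        if current is not None:
            # fields may repeat (several Down or Cyt lines), so keep lists
            current[label].append(value.strip())
    return [dict(entry) for entry in entries]


sample = """\
Term annotated corpus
Trans размеченный корпус
Def Corpus enhanced with additional linguistic information.
Syn tagged corpus
Ant unannotated corpus
Up corpus
Down phonetically transcribed corpus
Down parsed corpus
"""

entries = parse_entries(sample)
print(entries[0]["Term"])  # ['annotated corpus']
print(entries[0]["Down"])  # ['phonetically transcribed corpus', 'parsed corpus']
```

Every field is stored as a list because Down and Cyt can occur several times within one entry.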

Term aligned parallel corpus

Trans выровненный параллельный корпус

Def Parallel corpus where texts in one language and their translations into other languages are aligned, sentence by sentence, phrase by phrase.

Up parallel corpus

Cyt A type of multilingual corpus where texts in one language and their translations into other languages are aligned, sentence by sentence, preferably phrase by phrase. Sometimes reciprocate parallel corpora are set up, corpora containing authentic texts as well as translations in each of the languages involved. This allows double-checking translation equivalents.

Cyt A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to identify which sentences in the sub-corpora are translations of each other, and which words are translations of each other. A corpus which shows these identifications is known as an aligned corpus as it makes an explicit link between the elements which are mutual translations of each other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on the table" might be aligned to one another. At a further level, specific words might be aligned, e.g. "Das" with "The". This is not always a simple process, however, as often one word in one language might be equal to two words in another language, e.g. the German word "raucht" would be equivalent to "is smoking" in English.
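The alignment described in the citation above can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class name `AlignedPair` and its fields are hypothetical, the word links are entered by hand rather than computed, and the sentence pair "Er raucht" / "He is smoking" is invented to give the one-to-many case ("raucht" with "is smoking") a context.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedPair:
    """One sentence pair plus word-level links (source index -> target indices)."""
    source: list
    target: list
    links: dict = field(default_factory=dict)

    def translations(self, i):
        """Return the target words linked to source word number i."""
        return [self.target[j] for j in self.links.get(i, [])]


# Sentence pair taken from the citation; the word links are hand-made.
book = AlignedPair(
    source="Das Buch ist auf dem Tisch".split(),
    target="The book is on the table".split(),
    links={0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5]},
)

# One-to-many case: one German word linked to two English words.
# The sentence pair itself is invented for illustration.
smoke = AlignedPair(
    source="Er raucht".split(),
    target="He is smoking".split(),
    links={0: [0], 1: [1, 2]},
)

print(book.translations(0))   # ['The']
print(smoke.translations(1))  # ['is', 'smoking']
```

Storing a list of target indices for each source word is what lets a single word map to several words in the other language.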

Term aligned reciprocate parallel corpus

Trans выровненный двусторонний параллельный корпус

Def Reciprocate parallel corpus where texts and their translations are aligned, sentence by sentence, phrase by phrase.

Up reciprocate parallel corpus

Term annotated corpus

Trans размеченный корпус

Def Corpus enhanced with additional linguistic information.

Syn tagged corpus

Ant unannotated corpus

Up corpus

Down phonetically transcribed corpus

Down parsed corpus

Cyt A type of corpus enhanced with various types of linguistic information (or tagged corpus). An annotated corpus may be considered to be a repository of linguistic information, because the information which was implicit in the plain text has been made explicit through concrete annotation.

Cyt More difficult is the question of annotated corpora. It is proposed that this term is used for any corpus which includes codes that record extra information -- provenance, analytical marks, etc. Again the annotations should be separable from the plain text in a simple and agreed fashion. A set of conventions for removing, restoring and manipulating annotations is necessary, especially as the next few years will see a large growth in the provision of annotated corpora. It is naive to expect that big corpora will remain easy to manage if they are full of various annotations; retrieval times are already critical.

Cyt For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus.

Cyt Enriched data: Many corpora have already been enriched with additional linguistic information such as part-of-speech annotation, parsing and prosodic transcription. Hence data retrieval from annotated corpora can be easier and more specific than with unannotated data.
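The citations above note that annotation such as "gives_VVZ" makes retrieval more specific. A minimal sketch of that idea follows, assuming tokens of the form word_TAG separated by spaces. Only "gives_VVZ" comes from the citation; the other tokens and their tags, and the function names `parse_tagged` and `find_by_tag`, are illustrative assumptions.

```python
def parse_tagged(text, sep="_"):
    """Turn 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            # token carries no tag at all
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def find_by_tag(pairs, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


# Only "gives_VVZ" is from the citation; the rest is invented for illustration.
tagged = "she_PNP gives_VVZ talks_NN2 and_CJC talks_VVZ"
pairs = parse_tagged(tagged)

print(find_by_tag(pairs, "VV"))  # ['gives', 'talks']
print(find_by_tag(pairs, "NN"))  # ['talks']
```

The form "talks" appears once as a noun and once as a verb; a search on the plain text could not separate the two uses, while a search on the tags can.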

Term balanced corpus

Trans исчерпывающий корпус

Look saturated corpus

Cyt A type of corpus composed according to parameters such as text type, genre or domain.

Cyt The Core Corpus of 2 million words is intended to be a representative subset of the whole corpus, in the sense that it contains samples from all the major subdivisions of the whole BNC, and in approximately the same proportions as those found in the BNC as a whole. There is one major exception to this statement however: whereas in the whole BNC, only c.10 million words (10% of the corpus) consist of spoken data, the Core Corpus is divided approximately equally between written and spoken material (c.1 million words each). It will generally be felt that in an ideal balanced corpus of the language, at least half of the material should be spoken English. It was only the impracticality of collecting and transcribing 50 million words of the spoken language which led to abandonment of this goal of "ideal balance" in the case of the whole BNC.
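The proportions mentioned in the citation above can be checked with a few lines of arithmetic. In this sketch the total size of the whole BNC (100 million words) is not stated directly but is inferred from "10 million words (10% of the corpus)"; all figures are the approximate ones quoted, and the variable names are hypothetical.

```python
def share(part, whole):
    """Fraction of the corpus made up by one component."""
    return part / whole


bnc_total = 100_000_000   # inferred: 10 million words are said to be 10%
bnc_spoken = 10_000_000   # c.10 million words of spoken data
core_total = 2_000_000    # Core Corpus of 2 million words
core_spoken = 1_000_000   # c.1 million words of spoken material

print(f"{share(bnc_spoken, bnc_total):.0%}")    # 10%
print(f"{share(core_spoken, core_total):.0%}")  # 50%

# Spoken material needed if half of the whole BNC were to be spoken English.
ideal_spoken = bnc_total // 2
print(ideal_spoken)  # 50000000
```

The last figure matches the 50 million words of spoken language that the citation describes as impractical to collect and transcribe.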

Term corpus

Trans корпус

Def Body of texts.

Down opportunistic corpus

Down saturated corpus

Down balanced corpus

Down reference corpus

Down annotated corpus

Down tagged corpus

Down unannotated corpus

Down raw corpus

Down parsed corpus

Down treebank

Down phonetically transcribed corpus

Down monolingual corpus

Down multilingual corpus

Down monitor corpus

Down finite corpus

Down samples corpus

Down whole text corpus

Down synchronic corpus

Down diachronic corpus

Cyt Corpora are sources of quantitative information beyond compare.

Cyt Leech (1992) argues that the corpus is a more powerful methodology from the point of view of the scientific method, as it is open to objective verification of results.

Cyt Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.

Cyt However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.

Cyt In principle, any collection of more than one text can be called a corpus, (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus" when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition.

Cyt We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with an as accurate a picture as possible of the tendencies of that variety, as well as their proportions.

Cyt Nowadays the term "corpus" nearly always implies the additional feature "machine-readable". This was not always the case as in the past the word "corpus" was only used in reference to printed text.

Cyt There is often a tacit understanding that a corpus constitutes a standard reference for the language variety that it represents. This presupposes that it will be widely available to other researchers, which is indeed the case with many corpora – e.g. the Brown Corpus, the LOB corpus and the London-Lund corpus.

Cyt Part-of-speech annotation is useful because it increases the specificity of data retrieval from corpora, and also forms an essential foundation for further forms of analysis (such as syntactic parsing and semantic field annotation).

Cyt Problem-oriented tagging (as described by de Haan (1984)) is the phenomenon whereby users will take a corpus, either already annotated, or unannotated, and add to it their own form of annotation, oriented particularly towards their own research goal.

Cyt In this session we will examine a few of the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language.

Cyt It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety.

Cyt A linguist who has access to a corpus, or other (non-representative) collection of machine readable text can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Also, definitions can be more complete and precise since a larger number of natural examples are examined.

Cyt Because a corpus is sampled to maximally represent the population, any findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification because it can tell us about a variety of language, not just that which is being analysed.

Cyt Most European languages (not to mention Chinese, Japanese, Korean etc.) now have some sort of corpus already and there is a growing awareness that a good corpus can be put to many uses; hence their importance grows. Despite initial disapprovals voiced by some linguists, doubts are dispelled by obvious and indisputable facts: nobody has ever been able to manually collect and subsequently process so much data in his or her lifetime as the computer can in a very short time.

Cyt It may still be premature to try to mark out exhaustively what corpora may do for language studies and linguists; undoubtedly, many new options are still to come while the appetite of linguists is gradually whetted and new ways of corpus exploitation are offered by corpus linguists. In fact, it is hard to see a linguistic discipline not being able to profit from a corpus one way or another, both written and oral. It is increasingly clearer that new ways and methods for retrieving information from corpora will have to be given more thought.
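Several of the citations above stress that a machine-readable corpus lets a linguist call up all examples of a word in seconds. A minimal sketch of such a lookup, a keyword-in-context (KWIC) listing, is given below. The function name `kwic`, the tokenisation by a simple regular expression, and the three-sentence sample text (adapted from the citations) are assumptions made for illustration; real corpus managers are far more elaborate.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    # crude tokenisation: lower-case runs of word characters
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Sample text adapted from the citations above.
text = (
    "Corpora are sources of quantitative information beyond compare. "
    "A good corpus can be put to many uses. "
    "Nowadays the term corpus nearly always implies machine-readable text."
)

for left, word, right in kwic(text, "corpus", width=2):
    print(f"{left:>12} [{word}] {right}")
```

With `width=2` the sketch finds two hits, "a good [corpus] can be" and "the term [corpus] nearly always". The plural "corpora" is not matched, which shows why lemmatised or annotated corpora make retrieval more complete.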

Appendix 4

Mini-corpus of corpus terminology
(fragment)

Term	Citation
Corpus	Corpora are sources of quantitative information beyond compare.
Corpus	Leech (1992) argues that the corpus is a more powerful methodology from the point of view of the scientific method, as it is open to objective verification of results.
Corpus	Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.
Corpus	However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.
Corpus	In principle, any collection of more than one text can be called a corpus, (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus" when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition.
Corpus	We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with an as accurate a picture as possible of the tendencies of that variety, as well as their proportions.
Corpus	Nowadays the term "corpus" nearly always implies the additional feature "machine-readable". This was not always the case as in the past the word "corpus" was only used in reference to printed text.
Corpus	There is often a tacit understanding that a corpus constitutes a standard reference for the language variety that it represents. This presupposes that it will be widely available to other researchers, which is indeed the case with many corpora - e.g. the Brown Corpus, the LOB corpus and the London-Lund corpus.
Corpus	Part-of-speech annotation is useful because it increases the specificity of data retrieval from corpora, and also forms an essential foundation for further forms of analysis (such as syntactic parsing and semantic field annotation).
Corpus	Problem-oriented tagging (as described by de Haan (1984)) is the phenomenon whereby users will take a corpus, either already annotated, or unannotated, and add to it their own form of annotation, oriented particularly towards their own research goal.
Corpus	In this session we will examine a few of the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language.
Corpus	It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety.
Corpus	A linguist who has access to a corpus, or other (non-representative) collection of machine readable text can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Also, definitions can be more complete and precise since a larger number of natural examples are examined.
Corpus	Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types of research which have used corpora.
Corpus	Because a corpus is sampled to maximally represent the population, any findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification because it can tell us about a variety of language, not just that which is being analysed.
Corpus	Most European languages (not to mention Chinese, Japanese, Korean etc.) now have some sort of corpus already and there is a growing awareness that a good corpus can be put to many uses; hence their importance grows. Despite initial disapprovals voiced by some linguists, doubts are dispelled by obvious and indisputable facts: nobody has ever been able to manually collect and subsequently process so much data in his or her lifetime as the computer can in a very short time.
Corpus	It may still be premature to try to mark out exhaustively what corpora may do for language studies and linguists; undoubtedly, many new options are still to come while the appetite of linguists is gradually whetted and new ways of corpus exploitation are offered by corpus linguists. In fact, it is hard to see a linguistic discipline not being able to profit from a corpus one way or another, both written and oral. It is increasingly clearer that new ways and methods for retrieving information from corpora will have to be given more thought.
Corpus	Since any language needs a consistent, perpetual and next-to-exhaustive coverage of its data, it should have a corpus of corresponding qualities, although in practice it is a gradual business of taking many minor decisions in the course of its construction and maintenance. This is particularly important in the case of small languages, which, unlike English and other languages, cannot afford the luxury of having a variety and multitude of corpora for specific purposes, at least not at the moment. What is really needed is a steady increase and perpetual growth of even, by present standards, very large corpora of billions of words, which should be as much representative as possible.
Corpus	Although the degree of the coverage of language by a large corpus is considerable, it is by no means true that today's corpora reflect language as a whole. Moreover, some corpus linguists are becoming more and more susceptible to another challenge here, namely the degree of representativeness of this coverage, which is very much an open issue and matter of much dispute.
Corpus	As information is to be found coming from all fields of human life and activity, it is hard to imagine that corpora can be based on a collection of, perhaps, newspapers only. On the other hand, this diversity of sources suggests that a mapping of proportions in which various kinds of information occur should take place and be reflected in the design and structure of the corpus, should this be a general type of corpus. This raises the problem of the corpus representativeness, mentioned above.
Corpus	More generally, one may wonder where this trend actually fits in, in an attempt to pursue purely practical and utilitarian goals, or in one aiming at an exhaustive, systematic and non-eclectic description of one's language. Corpora definitely offer the latter possibility.
Corpus	Corpora are cross-sections of a discourse universe comprising all communication acts. The texts they monitor are principally transient communication acts.
Corpus	It is the task of the linguist to define and delimit the scope of the discourse universe she or he is interested in in such a way that it can be reduced to a corpus. Parameters can be language, time segment, region, situation, external and internal properties of texts, and many others.
Corpus collection	Corpus collection continued and diversified after the diary studies period: large sample studies covered the period roughly from 1927 to 1957 - analysis was gathered from a large number of children with the express aim of establishing norms of development.
Early corpus linguistics	All the work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions: The sentences of a natural language are finite. The sentences of a natural language can be collected and enumerated.


This topic belongs to the section:

Corpus Linguistics

Faculty of Philology... Department of Mathematical Linguistics... V. P. Zakharov...

Zakharov V. P.
З-38 Корпусная лингвистика [Corpus Linguistics]: Study guide. St. Petersburg, 2005. 48 pp. This guide describes the subject and main content of corpus linguistics, a new...

ББК 81.1
© V. P. Zakharov, 2005
