DISTRIBUTIONAL GRAMMAR. SEGMENTATION PROCEDURES

In §4 we defined distributional grammar as a description of a grammatical system that starts from the positional (syntagmatic) properties of units. In this section we shall discuss the procedures for extracting linguistically important information which may lead to the discovery of the grammatical system of a language.

The first procedure (which is in fact the first in all types of analysis) is segmenting speech into units. Within the framework of distributional grammar we can find several ways of segmenting, and none of them employs meaning, as is normal for other types of analysis. Meanings of units are used as late as possible, ideally never, because meaning is what should finally be discovered. As a result, the procedure of segmentation in distributional grammar is based on substitution.

This procedure may be applied in two variants.

The first variant utilises the fact that the smallest grammatically relevant units - morphs - can be substituted by other morphs or clusters of morphs within a bigger unit, producing meaningful units. So if we can substitute a sound or a group of sounds with an element which we recognise as a morph, and both the original and the resulting unit are meaningful, then the sound or group of sounds is a morph. E.g. in the sentence [ðə dog ba:ks] "The dog barks" the elements -ðə-, -dog-, -ba:k- and -s- can be substituted by a number of other elements and should be considered morphs. We may substitute [ðə] by [ə] and receive [ə dog ba:ks]. But [gd] cannot be substituted by any other element and cannot be considered a morph. This method was criticised, however, because it employs a semantic criterion: both structures, the one before substitution and the one after it, must be meaningful.
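The substitution test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the tuple representation of sounds and the set `meaningful` (which stands in for an informant's judgement that an utterance is meaningful) are all hypothetical and do not come from the procedure itself.

```python
from typing import Iterable

Sounds = tuple[str, ...]  # an utterance as a sequence of sounds (assumed representation)


def passes_substitution(utterance: Sounds, start: int, end: int,
                        known_morphs: Iterable[Sounds],
                        meaningful: set[Sounds]) -> bool:
    """True if utterance[start:end] can be swapped for a known morph
    and the result is still judged meaningful."""
    original = utterance[start:end]
    for morph in known_morphs:
        if morph == original:
            continue  # replacing a span by itself proves nothing
        candidate = utterance[:start] + morph + utterance[end:]
        if candidate in meaningful:
            return True
    return False


# Toy data: "The dog barks" and "A dog barks" are the only meaningful utterances.
the_dog_barks: Sounds = ("ð", "ə", "d", "o", "g", "b", "a:", "k", "s")
a_dog_barks: Sounds = ("ə", "d", "o", "g", "b", "a:", "k", "s")
meaningful = {the_dog_barks, a_dog_barks}
known_morphs = [("ə",)]  # the only element we already recognise as a morph

# The span [ðə] can be replaced by [ə], so it passes the test.
print(passes_substitution(the_dog_barks, 0, 2, known_morphs, meaningful))  # True
# The span [g b], which straddles two words, cannot be replaced.
print(passes_substitution(the_dog_barks, 4, 6, known_morphs, meaningful))  # False
```

The lookup in `meaningful` is exactly the point at which the semantic criterion enters the procedure.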

The other strategy of segmenting an utterance into morphs relies on the assumption that, at the limit of a morph, either the number of different sounds that can follow increases sharply or the end of an utterance can be inserted.

The analysis starts with the selection of a phrase, for example the same [ðə dog ba:ks]. Then we have to collect all phrases with the initial sound [ð] and count the different sounds that appear after [ð]. We receive the following list:

1. [ðə dog ba:ks].

2. [ðæt dog ba:ks].

3. [ði:z dogz ba:k].

4. [ðouz dogz ba:k].

5. [ðei klev].

6. [ðen a dog ba:kt].

7. [ðe dog ba:ks].

8. [ðou ðə dog ba:kt ai went on].

9. [ðs its tru:].

We may add to this number two archaic forms:

10. [ðau noust h].

11. [ðai dog ba:ks].

So the largest number of sounds met after [ð] is 11.

Then we should take all phrases beginning with [ðə] and count all the sounds possible after it. The list is as follows:

1. [ðə dog ba:ks].

2. [ðə bi: flaiz].

3. [ðə pai z swi:t].

4. [ðə f:st z bet].

5. [ðə viktri wz kmli:t].

6. [ðə ti: wz swi:t].

7. [ðə θ:t wz di:p].

8. [ðə siti gru: big].

9. [ðə zeroks w:kt kwikli].

10. [ðə ʃi:p greiz].

11. [ðə tʃips we teisti].

12. [ðə ʒ].

13. [ðə dʒu:s wz bit].

14. [ðə nait wz da:k].

15. [ðə lai wz brait].

16. [ðə haus wz wudn].

17. [ðə kout wz blu:].

18. [ðə geits wr oupn].

19. [ðə wind wz kould].

20. [ðə mir reflekts ounli gri:n reiz].

21. [ðə j:r wz hæpi wn].

22. [ðə ribn wz red].

Then we should repeat the same procedure for the sequences [ðə d], [ðə do], [ðə dog], [ðə dog b], [ðə dog ba:] and [ðə dog ba:k]. Finally we receive the following data: [ð 9 ə 21 d 14 o 5 g 40 b 12 a: 1 k 3 s], where each number is the count of different sounds found after the sound that precedes it. We see that in two of the three cases the peaks in the number of different sounds appear at the limits of morphs. This makes it possible to place the limits of the units in the following way: [ðə | dog | ba:k | s]. Some explanation is necessary for the last limit. We find only three different sounds in this position, but if we add the string [ai h:d] at the beginning of the utterance, the result is [ai h:d ðə dog ba:k], where the end of the utterance is found after [ba:k]. The end of an utterance is a universal limit, so if it can be inserted after a string of sounds, any limit, including the limit of a morph, can be placed there.
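The counting procedure described above can be sketched in Python. This is a minimal sketch under assumptions: the function names, the tuple representation of utterances and the toy corpus are hypothetical, and a "peak" is taken here to mean a count strictly greater than both of its neighbours, which is one possible reading of the text.

```python
from typing import Sequence

Sounds = tuple[str, ...]  # an utterance as a sequence of sounds (assumed representation)


def successor_counts(target: Sounds, corpus: Sequence[Sounds]) -> list[int]:
    """For every proper prefix of target, count the different sounds
    that follow that prefix somewhere in the corpus."""
    counts = []
    for i in range(1, len(target)):
        prefix = target[:i]
        followers = {u[i] for u in corpus if len(u) > i and u[:i] == prefix}
        counts.append(len(followers))
    return counts


def peak_positions(counts: list[int]) -> list[int]:
    """Indices whose count is strictly greater than both neighbours."""
    return [i for i in range(1, len(counts) - 1)
            if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]]


def utterance_can_end_after(target: Sounds, i: int, corpus: Sequence[Sounds]) -> bool:
    """True if some utterance in the corpus ends exactly with target[:i + 1]."""
    head = target[:i + 1]
    return any(u[-len(head):] == head for u in corpus)


def segment(target: Sounds, cut_after: list[int]) -> list[str]:
    """Split target after each index listed in cut_after."""
    pieces, start = [], 0
    for i in sorted(cut_after):
        pieces.append("".join(target[start:i + 1]))
        start = i + 1
    pieces.append("".join(target[start:]))
    return pieces


target: Sounds = ("ð", "ə", "d", "o", "g", "b", "a:", "k", "s")
corpus = [  # toy corpus, far smaller than the lists in the text
    target,
    ("ð", "ə", "d", "o", "g", "r", "ʌ", "n", "z"),
    ("ð", "ə", "k", "æ", "t", "s", "l", "i:", "p", "s"),
    ("ð", "æ", "t", "d", "o", "g", "b", "a:", "k", "s"),
    ("ð", "ə", "b", "i:", "f", "l", "ai", "z"),
    # "I heard the dog bark"; the vowel of "heard" is an assumed transcription
    ("ai", "h", "ə:", "d", "ð", "ə", "d", "o", "g", "b", "a:", "k"),
]

toy_counts = successor_counts(target, corpus)
print(toy_counts)                               # [2, 3, 1, 1, 2, 1, 1, 1]
print(peak_positions(toy_counts))               # [1, 4]

# The counts reported in the text give the same two peaks.
print(peak_positions([9, 21, 14, 5, 40, 12, 1, 3]))  # [1, 4]

cuts = peak_positions(toy_counts)
if utterance_can_end_after(target, 7, corpus):  # end of utterance after [ba:k]
    cuts.append(7)
print(segment(target, cuts))                    # ['ðə', 'dog', 'ba:k', 's']
```

The peaks give the limits after [ðə] and [dog]; the limit after [ba:k] comes only from the end-of-utterance check, as in the explanation above.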

This strategy of segmentation may be carried out in a more sophisticated manner. We may count not the simple numbers of different sounds appearing in certain positions but the probabilities of peaks. This variant gives more accurate data but is too time- and labour-consuming.
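The text does not specify how the probabilities are to be computed. One possible reading, which is an assumption made here and not the procedure itself, is to weigh each following sound by its relative frequency and to measure how unpredictable the next sound is, for instance with Shannon entropy. The function name and the toy corpus below are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence

Sounds = tuple[str, ...]  # an utterance as a sequence of sounds (assumed representation)


def successor_entropy(prefix: Sounds, corpus: Sequence[Sounds]) -> float:
    """Shannon entropy (in bits) of the sounds that follow prefix in the corpus.
    Higher values mean the next sound is harder to predict."""
    n = len(prefix)
    followers = Counter(u[n] for u in corpus if len(u) > n and u[:n] == prefix)
    total = sum(followers.values())
    if total == 0:
        return 0.0  # the prefix is never continued in this corpus
    return sum(-(c / total) * math.log2(c / total) for c in followers.values())


corpus = [  # toy corpus, for illustration only
    ("ð", "ə", "d", "o", "g", "b", "a:", "k", "s"),
    ("ð", "ə", "d", "o", "g", "r", "ʌ", "n", "z"),
    ("ð", "ə", "k", "æ", "t", "s", "l", "i:", "p", "s"),
    ("ð", "ə", "b", "i:", "f", "l", "ai", "z"),
]

print(successor_entropy(("ð", "ə"), corpus))                 # 1.5
print(successor_entropy(("ð", "ə", "d"), corpus))            # 0.0
print(successor_entropy(("ð", "ə", "d", "o", "g"), corpus))  # 1.0
```

In this toy corpus the values are highest after [ðə] and after [dog], that is, at the same positions where the simple counts have their peaks.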