OCOJ: Tagging Conventions

 


Tagging conventions

The corpus consists of Old Japanese texts annotated with XML tags that follow the conventions of the Text Encoding Initiative (TEI). The texts are romanized in a phonemic transcription and supplied with information about the original orthography; the original script is also retained. Lexemes and morphemes are given unique identifiers, and words are tagged for part of speech. Inflecting words are supplied with morphological information. Finally, syntactic constituency is encoded.


 

Romanization & Writing

The texts are romanized in a phonemic transcription which reflects the phonology of Old Japanese. The corpus employs the Frellesvig & Whitman system of transcription.

Throughout the Old Japanese (OJ) period, Japanese was written entirely in Chinese characters (kanji), which were used either logographically or phonographically. The romanization indicates whether each string of text is written logographically or phonographically in the original Japanese script. The romanized text is linked line by line with the original Japanese script.

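| Syllable type | Index notation | Ohno | Modified Mathias-Miller | Yale | Frellesvig & Whitman |
|---|---|---|---|---|---|
| Kō-rui | i1 | i | î | yi | i |
| Otsu-rui | i2 | ï | ï | iy | wi |
| neutral | i | i | i | i | i |
| Kō-rui | e1 | e | ê | ye | ye |
| Otsu-rui | e2 | ë | ë | ey | e |
| neutral | e | e | e | e | e |
| Kō-rui | o1 | o | ô | wo | wo |
| Otsu-rui | o2 | ö | ö | o | o |
| neutral | o | o | o | o | o |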

Table 1. OJ transcription systems

 

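| Gloss | NJ | Frellesvig & Whitman | Index notation | Yale | Modified Mathias-Miller | Ohno |
|---|---|---|---|---|---|---|
| 'fire' | hi | pwi | pi2 | piy | pï | pï |
| 'sun' | hi | pi | pi1 | pyi | pî | pi |
| 'blood' | chi | ti | ti | ti | ti | ti |
| 'woman' | me | mye | me1 | mye | mê | me |
| 'eye' | me | me | me2 | mey | më | më |
| 'hand' | te | te | te | te | te | te |
| 'child' | ko | kwo | ko1 | kwo | kô | ko |
| 'this' | ko | ko | ko2 | ko | kö | kö |
| 'ear (of rice)' | ho | po | po | po | po | po |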

Table 2. Examples
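
The correspondences in Table 1 are regular enough to be written as a simple lookup. The following is a minimal sketch, not part of the corpus tooling, that converts one syllable from index notation to the Frellesvig & Whitman transcription. The function name `index_to_fw` is hypothetical; the mapping itself is taken from Table 1, and the sample syllables come from Table 2.

```python
# Mapping from index notation to Frellesvig & Whitman, following Table 1.
# Syllables without an index digit (neutral syllables) are left unchanged.
INDEX_TO_FW = {
    "i1": "i", "i2": "wi",
    "e1": "ye", "e2": "e",
    "o1": "wo", "o2": "o",
}


def index_to_fw(syllable: str) -> str:
    """Convert one syllable in index notation (e.g. 'pi2') to Frellesvig & Whitman."""
    vowel = syllable[-2:]  # e.g. 'i2' in 'pi2'
    if vowel in INDEX_TO_FW:
        return syllable[:-2] + INDEX_TO_FW[vowel]
    return syllable  # neutral syllable such as 'ti'


for s in ["pi2", "pi1", "ti", "me1", "me2", "ko1", "ko2", "po"]:
    print(s, "->", index_to_fw(s))
```

The sketch handles a single syllable at a time; converting whole words would additionally require splitting them into syllables, which is not shown here.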

 

Part of speech; lexeme and morpheme identification

Words in the texts are tagged for part-of-speech. Each lexeme and morpheme in the corpus is assigned a unique identification number. This number is encoded in the text as an @ana attribute.

The information about the lexical item is stored in a dictionary file which was created using the TEI dictionaries module. This file contains the following information: (a) sound shape at various points in time, (b) part-of-speech, (c) function and/or definition, (d) conjugation class for inflecting words and other relevant morphological information, (e) related lexical items.
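
As an illustration of how the @ana identifier can link a word in the text to its dictionary entry, here is a minimal sketch. The XML fragment, the identifier values (`L0001`, `L0002`), and the `LEXICON` structure are all hypothetical and do not reproduce the real identifiers or the layout of the dictionary file; real TEI files would normally also declare the TEI namespace, which is omitted here for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the ana values are invented for illustration.
SAMPLE = """
<s>
  <w type="noun" ana="L0001">pwi</w>
  <w type="particle" subtype="case" function="acc" ana="L0002">wo</w>
</s>
"""

# Hypothetical stand-in for the dictionary file, keyed by identifier.
LEXICON = {
    "L0001": {"pos": "noun", "gloss": "fire"},
    "L0002": {"pos": "particle", "gloss": "accusative case particle"},
}


def lookup_words(xml_text, lexicon):
    """Return (form, identifier, entry) for every <w> that carries @ana."""
    root = ET.fromstring(xml_text)
    results = []
    for w in root.iter("w"):
        ident = w.get("ana")
        if ident is not None:
            # lexicon.get returns None when the identifier has no entry
            results.append((w.text, ident, lexicon.get(ident)))
    return results


for form, ident, entry in lookup_words(SAMPLE, LEXICON):
    print(form, ident, entry)
```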


 

Morphology

For inflecting words, full information is given about inflected forms, auxiliaries, auxiliary verbs, and compounding.


 

Syntax

Syntactic analysis in the corpus marks sentences, clauses, and phrases. Each clause is delimited so that it directly contains the elements corresponding to its root predicate, overt arguments and complements, modifying expressions, modal extensions, and conjunctional particles. Phrases are delimited around nominal expressions directly contained by clauses, and also around nominal expressions that take pre-nominal modifiers. Arguments are marked as such.


 

Tagging conventions

Here are some of our tagging conventions, including those for words (<w>) and morphemes (<m>):

<w> : Denotes a simple or complex word or word-like element.

@type: States part of speech.

adjNoun adjectival nouns
adjective adjectives
adverb adverbs
copula copula
extension verbal extensions
interjection interjections
modifier modifiers
noun nouns
number numbers
particle particles
pronoun pronouns
verb verbs

 

@subtype: Specifies the type of particle.

case case particle
comp complementizer
conj conjunctional particle
finl clause or sentence final particle
foc focus particle
intj interjectional particle
res restrictive particle
top topic particle

 

@function: Identifies the function of case particles and final particles, and marks the stative or progressive function of certain verbs.

abl ablative
acc accusative
all allative
com comitative
conj conjectural final particles
dat dative
emph emphatic particles
evd evidential final particles
excl exclamatory final particles
gen genitive
inst instrumental
intr interrogative final particles
nec necessitive final particles
negconj negative conjectural final particles
opt optative final particles
prb prohibitive final particles
progressive progressive function of verbs
stative stative function of verbs

 

@inflection: States the inflection of words.

adnconc The adnominal or conclusive form. This is used when the forms are identical in shape.
adnominal The adnominal form.
concessive The concessive form.
conclusive The conclusive form.
conditional The conditional form.
continuative The continuative form.
exclamatory The exclamatory form.
gerund The gerund form.
imperative The imperative form.
infconc The infinitive or conclusive form. This is used when the forms are identical in shape.
infinitive The infinitive form.
negConjectural The negative conjectural form.
nominal The nominal form.
optative The optative form.
provisional The provisional form.
sem The semblative forms of the copula.
stem The stem of a verb or adjective. For verbs, this indicates a non-final form; for adjectives, it marks a bare stem that is used as a modifier or occurs with the copula.
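
The value lists above lend themselves to a simple consistency check. The sketch below, which is an illustration and not part of the corpus tooling, reports `<w>` elements whose @type or @subtype is not among the values listed here. The function name `check_words` and the sample fragment are hypothetical, and because this page lists only some of the conventions, an unlisted value is not necessarily an error in the real corpus.

```python
import xml.etree.ElementTree as ET

# Values copied from the @type and @subtype lists for <w> above.
W_TYPES = {
    "adjNoun", "adjective", "adverb", "copula", "extension", "interjection",
    "modifier", "noun", "number", "particle", "pronoun", "verb",
}
PARTICLE_SUBTYPES = {"case", "comp", "conj", "finl", "foc", "intj", "res", "top"}


def check_words(xml_text):
    """Return messages for <w> elements whose values are not in the lists."""
    problems = []
    for w in ET.fromstring(xml_text).iter("w"):
        wtype = w.get("type")
        if wtype not in W_TYPES:
            problems.append(f"unknown @type: {wtype!r}")
        subtype = w.get("subtype")
        if subtype is not None and subtype not in PARTICLE_SUBTYPES:
            problems.append(f"unknown @subtype: {subtype!r}")
    return problems


# Hypothetical fragment; the third word has a deliberately misspelled type.
SAMPLE = (
    '<s><w type="noun">ko</w>'
    '<w type="particle" subtype="top">pa</w>'
    '<w type="nounn">te</w></s>'
)
print(check_words(SAMPLE))
```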

 


<m> : Denotes a morpheme.

@type: Describes the type and/or function of a morpheme.

adjcop adjectival copula
auxadj auxiliary adjective
auxiliary auxiliary
circumfix circumfix
copula copula
counter counter
nominalizer nominalizer
numeral numeral
prefix prefix
suffix suffix

 

@inflection: States the inflection of morphemes.

adnconc The adnominal or conclusive form. This is used when the forms are identical in shape.
adnominal The adnominal form.
concessive The concessive form.
conclusive The conclusive (final) form.
conditional The conditional form.
continuative The continuative form.
exclamatory The exclamatory form.
gerund The gerund form.
imperative The imperative form.
infconc The infinitive or conclusive form. This is used when the forms are identical in shape.
infinitive The infinitive form.
negConjectural The negative conjectural form.
negNominal The negative nominal form.
nominal The nominal form.
optative The optative form.
provisional The provisional form.
stem The stem of an adjectival copula or auxiliary.

 

@function: Describes the function of a morpheme.

causative causative
conjectural conjectural
desiderative desiderative
emph emphatic
evid evidential
grad gradual use
hon honorific
intense intensifying use
intent intentional
iterative iterative
modalPast modal past
neg negative
optative optative
passive passive
perf perfective
potential potential
prohibitive prohibitive
reciprocal reciprocal
respect respect
simplePast simple past
stative stative
subjunctive subjunctive
unknown verbal prefixes of unclear function
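
To show how the `<m>` attributes combine, here is a minimal sketch that lists the morphemes found in a word together with their @type, @function, and @inflection. The sample markup is hypothetical: in particular, the nesting of `<m>` inside `<w>` and the analysis of the example form are assumptions made for illustration, not a reproduction of corpus data.

```python
import xml.etree.ElementTree as ET

# Hypothetical word: a verb stem followed by a negative auxiliary.
SAMPLE = (
    '<w type="verb" inflection="stem">yuka'
    '<m type="auxiliary" function="neg" inflection="conclusive">zu</m>'
    '</w>'
)


def describe_morphemes(xml_text):
    """Return (form, type, function, inflection) for each <m> element."""
    root = ET.fromstring(xml_text)
    return [
        (m.text, m.get("type"), m.get("function"), m.get("inflection"))
        for m in root.iter("m")
    ]


for row in describe_morphemes(SAMPLE):
    print(row)
```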

 


<c> : Gives orthographical information.

@type: States how a word or morpheme was rendered.

logo Text written logographically.
noLogo Text that is not directly represented in the original script and is supplied on the basis of reading tradition.
phon Text written phonographically.
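
The @type values of `<c>` make it possible to separate logographically and phonographically written material. The following minimal sketch groups the romanized strings in a fragment by their `<c>` type. The function name `group_by_orthography` and the sample fragment, including the placement of `<c>` inside `<w>`, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: one logographic and one phonographic string.
SAMPLE = (
    '<s>'
    '<w type="noun"><c type="logo">pana</c></w>'
    '<w type="particle" subtype="top"><c type="phon">pa</c></w>'
    '</s>'
)


def group_by_orthography(xml_text):
    """Map each <c> @type (logo, noLogo, phon) to the strings it marks."""
    groups = defaultdict(list)
    for c in ET.fromstring(xml_text).iter("c"):
        groups[c.get("type")].append(c.text)
    return dict(groups)


print(group_by_orthography(SAMPLE))
```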

 
