Tagging conventions
The corpus consists of Old Japanese texts annotated with XML tags that follow the conventions of the Text Encoding Initiative (TEI). The texts are romanized in a phonemic transcription and annotated with information about the original orthography; the original script is also retained. Lexemes and morphemes are given unique identifiers, and words are tagged for part of speech. Inflecting words carry additional information about their morphology. Finally, syntactic constituency is encoded.
Romanization & Orthography
The texts are romanized in a phonemic transcription which reflects the phonology of Old Japanese. The corpus employs the Frellesvig & Whitman system of transcription.
Throughout the Old Japanese (OJ) period, Japanese was written entirely in Chinese characters (kanji), which were used either logographically or phonographically. The romanization indicates whether each string of text is written logographically or phonographically in the original Japanese script. The romanized text is linked line by line with the original script.
Syllable type | Index notation | Ohno | Modified Mathias-Miller | Yale | Frellesvig & Whitman |
---|---|---|---|---|---|
Kō-rui | i1 | i | î | yi | i |
Otsu-rui | i2 | ï | ï | iy | wi |
neutral | i | i | i | i | i |
Kō-rui | e1 | e | ê | ye | ye |
Otsu-rui | e2 | ë | ë | ey | e |
neutral | e | e | e | e | e |
Kō-rui | o1 | o | ô | wo | wo |
Otsu-rui | o2 | ö | ö | o | o |
neutral | o | o | o | o | o |
Table 1. OJ transcription systems
Gloss | NJ | Frellesvig & Whitman | Index notation | Yale | Modified Mathias-Miller | Ohno |
---|---|---|---|---|---|---|
'fire' | hi | pwi | pi2 | piy | pï | pï |
'sun' | hi | pi | pi1 | pyi | pî | pi |
'blood' | chi | ti | ti | ti | ti | ti |
'woman' | me | mye | me1 | mye | mê | me |
'eye' | me | me | me2 | mey | më | më |
'hand' | te | te | te | te | te | te |
'child' | ko | kwo | ko1 | kwo | kô | ko |
'this' | ko | ko | ko2 | ko | kö | kö |
'ear (of rice)' | ho | po | po | po | po | po |
Table 2. Examples
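The correspondences in Table 1 can be applied mechanically. The following is a minimal sketch, not part of the corpus tooling, that rewrites index notation into the Frellesvig & Whitman transcription. The function name `index_to_fw` is hypothetical, and the mapping is read off Table 1; neutral vowels are assumed to pass through unchanged, as the 'blood', 'hand' and 'ear (of rice)' rows of Table 2 suggest.

```python
import re

# Index-notation vowel -> Frellesvig & Whitman spelling, read off Table 1.
INDEX_TO_FW = {
    "i1": "i", "i2": "wi",
    "e1": "ye", "e2": "e",
    "o1": "wo", "o2": "o",
}

def index_to_fw(form: str) -> str:
    """Rewrite every indexed vowel (i1, e2, ...) in `form`.

    Unindexed (neutral) vowels and all consonants are left untouched.
    """
    return re.sub(r"[ieo][12]", lambda m: INDEX_TO_FW[m.group(0)], form)

# Index-notation forms taken from Table 2.
for index_form in ["pi2", "pi1", "ti", "me1", "me2", "te", "ko1", "ko2", "po"]:
    print(index_form, "->", index_to_fw(index_form))
```

The conversion only works in this direction. Frellesvig & Whitman `i`, `e` and `o` each correspond to more than one index-notation value, so going back would need information about which syllables maintain the distinction.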
Part of speech; lexeme and morpheme identification
Words in the texts are tagged for part-of-speech. Each lexeme and morpheme in the corpus is assigned a unique identification number. This number is encoded in the text as an @ana attribute.
The information about the lexical item is stored in a dictionary file which was created using the TEI dictionaries module. This file contains the following information: (a) sound shape at various points in time, (b) part-of-speech, (c) function and/or definition, (d) conjugation class for inflecting words and other relevant morphological information, (e) related lexical items.
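As an illustration of how the @ana identifier links a word in the text to its dictionary entry, here is a small sketch using Python's standard `xml.etree.ElementTree`. Everything concrete in it is an assumption made for the example: the identifier values and their format, the shape of the `LEXICON` dictionary (a stand-in for the real dictionary file), and the function name `gloss_words`. The sample also omits the TEI namespace that real files would normally declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the identifier values and their format are invented.
SAMPLE = (
    '<s>'
    '<w type="noun" ana="L000001">pwi</w>'
    '<w type="noun" ana="L000002">pi</w>'
    '</s>'
)

# Hypothetical stand-in for the dictionary file, keyed by identifier.
LEXICON = {
    "L000001": {"pos": "noun", "definition": "fire"},
    "L000002": {"pos": "noun", "definition": "sun"},
}

def gloss_words(xml_text, lexicon):
    """Return (form, identifier, definition) for every <w> that carries @ana."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter("w"):
        ident = w.get("ana")
        if ident is None:
            continue  # word without an identifier: nothing to look up
        entry = lexicon.get(ident, {})
        rows.append(("".join(w.itertext()), ident, entry.get("definition")))
    return rows

for row in gloss_words(SAMPLE, LEXICON):
    print(row)
```

Keeping lexical information in a separate dictionary and pointing to it by identifier means that homophonous forms can be told apart in the text without repeating their definitions at every occurrence.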
Morphology
For inflecting words, full information is given about inflected forms, auxiliaries, auxiliary verbs, and compounding.
Syntax
Syntactic analysis in the corpus marks sentences, clauses, and phrases. Each clause is delimited so that it directly contains the elements corresponding to a root predicate, its overt arguments and complements, modifying expressions, modal extensions, and conjunctional particles. Phrases are delimited around nominal expressions that are directly contained by clauses, and also around nominal expressions that take pre-nominal modifiers. Arguments are marked as such.
Tagging
Here are some of our tagging conventions, including those for words (<w>) and morphemes (<m>):
<w> : Denotes a simple or complex word or word-like element.
@type: States part of speech.
adjNoun | adjectival nouns |
adjective | adjectives |
adverb | adverbs |
copula | copula |
extension | verbal extensions |
interjection | interjections |
modifier | modifiers |
noun | nouns |
number | numbers |
particle | particles |
pronoun | pronouns |
verb | verbs |
@subtype: Specifies the type of particle.
case | case particle |
comp | complementizer |
conj | conjunctional particle |
finl | clause or sentence final particle |
foc | focus particle |
intj | interjectional particle |
res | restrictive particle |
top | topic particle |
@function: Identifies the function of case particles and final particles, and the stative or progressive function of certain verbs.
abl | ablative |
acc | accusative |
all | allative |
com | comitative |
conj | conjectural final particles |
dat | dative |
emph | emphatic particles |
evd | evidential final particles |
excl | exclamatory final particles |
gen | genitive |
inst | instrumental |
intr | interrogative final particles |
nec | necessitive final particles |
negconj | negative conjectural final particles |
opt | optative final particles |
prb | prohibitive final particles |
progressive | progressive function of verbs |
stative | stative function of verbs |
@inflection: States the inflection of words.
adnconc | The adnominal or conclusive form. This is used when the forms are identical in shape. |
adnominal | The adnominal form. |
concessive | The concessive form. |
conclusive | The conclusive form. |
conditional | The conditional form. |
continuative | The continuative form. |
exclamatory | The exclamatory form. |
gerund | The gerund form. |
imperative | The imperative form. |
infconc | The infinitive or conclusive form. This is used when the forms are identical in shape. |
infinitive | The infinitive form. |
negConjectural | The negative conjectural form. |
nominal | The nominal form. |
optative | The optative form. |
provisional | The provisional form. |
sem | The semblative forms of the copula. |
stem | The stem of a verb or adjective. For verbs this indicates a non-final form, for adjectives this indicates bare stem modifiers or occurs with the copula. |
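The value lists above lend themselves to a simple consistency check. The sketch below, which is not part of the corpus tooling, tests the @type, @subtype and @inflection attributes of a `<w>` element against the values listed in this section. The function name `check_word` is hypothetical, and the rule that @subtype should appear only on particles is our reading of the description above rather than a documented constraint.

```python
import xml.etree.ElementTree as ET

# Attribute values for <w>, copied from the lists in this section.
W_TYPES = {
    "adjNoun", "adjective", "adverb", "copula", "extension", "interjection",
    "modifier", "noun", "number", "particle", "pronoun", "verb",
}
PARTICLE_SUBTYPES = {"case", "comp", "conj", "finl", "foc", "intj", "res", "top"}
W_INFLECTIONS = {
    "adnconc", "adnominal", "concessive", "conclusive", "conditional",
    "continuative", "exclamatory", "gerund", "imperative", "infconc",
    "infinitive", "negConjectural", "nominal", "optative", "provisional",
    "sem", "stem",
}

def check_word(w):
    """Return a list of human-readable problems found on one <w> element."""
    problems = []
    pos = w.get("type")
    if pos not in W_TYPES:
        problems.append(f"unknown @type: {pos!r}")
    subtype = w.get("subtype")
    if subtype is not None:
        if subtype not in PARTICLE_SUBTYPES:
            problems.append(f"unknown @subtype: {subtype!r}")
        # Assumption: @subtype is only meaningful on particles.
        if pos != "particle":
            problems.append("@subtype used on a non-particle")
    inflection = w.get("inflection")
    if inflection is not None and inflection not in W_INFLECTIONS:
        problems.append(f"unknown @inflection: {inflection!r}")
    return problems

# Invented test elements; word forms are left out on purpose.
good = ET.fromstring('<w type="particle" subtype="top"/>')
bad = ET.fromstring('<w type="nuon" subtype="case" inflection="final"/>')
print(check_word(good))
print(check_word(bad))
```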
<m> : Denotes a morpheme.
@type: Describes the type and/or function of a morpheme.
adjcop | adjectival copula |
auxadj | auxiliary adjective |
auxiliary | auxiliary |
circumfix | circumfix |
copula | copula |
counter | counter |
nominalizer | nominalizer |
numeral | numeral |
prefix | prefix |
suffix | suffix |
@inflection: States the inflection of morphemes.
adnconc | The adnominal or conclusive form. This is used when the forms are identical in shape. |
adnominal | The adnominal form. |
concessive | The concessive form. |
conclusive | The conclusive form. |
conditional | The conditional form. |
continuative | The continuative form. |
exclamatory | The exclamatory form. |
gerund | The gerund form. |
imperative | The imperative form. |
infconc | The infinitive or conclusive form. This is used when the forms are identical in shape. |
infinitive | The infinitive form. |
negConjectural | The negative conjectural form. |
negNominal | The negative nominal form. |
nominal | The nominal form. |
optative | The optative form. |
provisional | The provisional form. |
stem | The stem of an adjectival copula or auxiliary. |
@function: Describes the function of a morpheme.
causative | causative |
conjectural | conjectural |
desiderative | desiderative |
emph | emphatic |
evid | evidential |
grad | gradual use |
hon | honorific |
intense | intensifying use |
intent | intentional |
iterative | iterative |
modalPast | modal past |
neg | negative |
optative | optative |
passive | passive |
perf | perfective |
potential | potential |
prohibitive | prohibitive |
reciprocal | reciprocal |
respect | respect |
simplePast | simple past |
stative | stative |
subjunctive | subjunctive |
unknown | verbal prefixes of unclear function |
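Because every morpheme carries its function as an attribute, simple frequency questions can be answered by walking the `<m>` elements. The following sketch counts @function values in a fragment. The fragment itself is invented, the nesting of `<m>` inside `<w>` is an assumption about the file layout, and `function_counts` is a hypothetical helper name.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical fragment; word and morpheme forms are deliberately left empty.
SAMPLE = (
    '<s>'
    '<w type="verb"><m type="auxiliary" function="neg" inflection="conclusive"/></w>'
    '<w type="verb"><m type="auxiliary" function="neg" inflection="adnominal"/></w>'
    '<w type="verb"><m type="prefix" function="unknown"/></w>'
    '</s>'
)

def function_counts(xml_text):
    """Count @function values over every <m> element in the fragment."""
    root = ET.fromstring(xml_text)
    return Counter(
        m.get("function") for m in root.iter("m") if m.get("function") is not None
    )

print(function_counts(SAMPLE).most_common())
```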
<c> : Gives orthographical information.
@type: States how a word or morpheme was rendered.
logo | text written logographically |
noLogo | Text that is not directly represented in the original script, and is rendered based on reading tradition. |
phon | text written phonographically |
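The `<c>` element makes it possible to separate logographically and phonographically written material. The sketch below groups the text of `<c>` elements by their @type value. The sample line is invented (including which word is written in which way), the placement of `<c>` inside `<w>` is an assumption, and `segments_by_script` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: which word is logographic or phonographic is invented.
SAMPLE = (
    '<s>'
    '<w type="noun"><c type="logo">pwi</c></w>'
    '<w type="particle" subtype="case" function="acc"><c type="phon">wo</c></w>'
    '</s>'
)

def segments_by_script(xml_text):
    """Group the text of all <c> elements by their @type value."""
    root = ET.fromstring(xml_text)
    groups = {"logo": [], "noLogo": [], "phon": []}
    for c in root.iter("c"):
        kind = c.get("type")
        if kind in groups:  # ignore values outside the documented list
            groups[kind].append("".join(c.itertext()))
    return groups

print(segments_by_script(SAMPLE))
```

Using `root.iter("c")` finds the elements at any depth, so the function does not depend on exactly where `<c>` sits relative to `<w>` and `<m>`.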