The Pentaglot

A Dictionary in Five Languages

A Preliminary HTML Dump of the Pentaglot Database,

Demonstrator based on Volume 1, Heaven

editors: Oliver Corff, Dorj Dorjpalam, Xieyan Hincha, Wolfgang Lipp, Kyôko Maezono, Ablet Semet, Töwshintögs, Aysima Mirsultan, G. Gerelee

December 01, 2008, Ulaanbaatar/Berlin

Abstract

The Pentaglot, or Wuti Qingwen Jian in Chinese, compiled during the Qing Dynasty, is a famous multi-lingual dictionary. It comprises five languages: Manju, Tibetan, Mongolian, Uighur and Chinese, showing all five languages in parallel, arranged in semasiological order.

Its contents can only be accessed through a huge table of contents covering all subjects; the original text provides no index for any of the languages.

The Pentaglot was not published until 1957, when the Beijing Palace Museum prepared a photographic reproduction of the original manuscript, of which only two or three copies are known to have survived into the 21st century.

Many features of the manuscript suggest that the authors of the original dictionary never completed their preparations for publication; too many errors persist that would have been purged from an officially approved imperial publication.

The Pentaglot Project aims to present all data of this dictionary as precisely as possible, preserving all idiosyncrasies of the original; in addition, the complete text is now fully indexed by every segment of every lemma in every language. Segments can be words, suffix elements or Chinese characters. A Chinese pronunciation (in pinyin) and a German translation based on Hauer's Manju-German dictionary have been added.

Material contained in this Demonstration
Data Structure

Browse the Top Level Table of Contents

Presentation

Presentation Levels
Special Presentation Features
Colour Codes

The Index

Browse the Index

Final Remarks

Nota Bene:

Because proofreading of the individual languages is still under way, this HTML demonstration of the Pentaglot database is by no means authoritative. Errors still exist in abundance; hence, this material may be quoted only after written request and confirmation.

All material presented here is encoded as UTF-8 Unicode. Unfortunately, the fonts used by a particular browser may be incomplete, in which case boxes are displayed instead of characters. This holds true for some accented characters (as used in Uighur and Hanyu Pinyin) as well as some Arabic and Chinese characters.

Material contained in this Demonstration

The Pentaglot Demonstration shows all material of Volume 1, Heaven. Except for Tibetan in Manju transcription, all language data are available and have undergone at least one proofreading cycle, which of course does not rule out that mistakes persist.

For the complete database, the editing status is as follows:

Language or Script	Data Acquisition Status	Proofreading Status	Level II Presentation
Manju	Complete	Done	Done
Tibetan	Complete	Cycle II in progress	In progress, not yet included
Tibetan transliteration	Complete	Cycle I in progress	In progress
Tibetan transcription	Complete	Cycle I in progress	In progress
Mongolian	Complete	Cycle III in progress	In progress, parts included
Uighur	Complete	Cycle II in progress	In progress, parts included
Uighur transcription	Complete	Cycle II in progress	In progress, parts included
Chinese	Complete	Cycle III in progress	Done
Pinyin	Complete	Cycle III in progress	Done
German	Complete	Done

Data Structure

The Pentaglot database follows the natural order of its semasiological system, which is presented via a top level directory (containing 32 volumes) leading to the multitude of thematic sections in a two-tier manner. Each volume of the Pentaglot may contain one or more categories (approximately 56 in total). Each category has its own table of contents; from here, the reader is guided to the lemmata, which are presented in tabular form.

The approx. 56 categories (šošohon in Manju, xuriyanggui in Mongolian, ser jama i in Uighur and bu in Chinese) comprise approx. 318 major sections (hacin in Manju, skor in Tibetan, züil in Mongolian, qismi in Uighur and lei in Chinese), which may in turn be divided into sub-sections (meyen in Manju, anggi in Mongolian, böläk in Uighur and ze in Chinese), resulting in approx. 636 sections. Some of these sections contain only a few lemmata (for example, three lemmata on page 4596 of the Beijing 1957 edition), others as many as 66 (for example, the section beginning on page 2670 of the Beijing 1957 edition).

After opening the Top Level Table of Contents (see below) in any desired language, one proceeds as follows:

Top Level Table of Contents lists categories
Category Table of Contents lists sections
Individual Section lists lemmata

Since each Category Table of Contents lists its full title in all languages, it is necessary to scroll one or two screens downwards until the list of sections is reached.

On all levels, it is possible to browse to the preceding or following entity, or to return to the next-higher level, by using the [Previous][Up][Next] buttons at the top and bottom of each page.
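The three-tier navigation described above can be pictured as a simple tree. The following Python sketch is only an illustration and not the project's actual data model: the class name `Node`, its methods, and the placeholder titles are assumptions, and it further assumes that [Previous] and [Next] move only between siblings under the same parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    """One page of the dump: top level, category, or section (hypothetical model)."""
    title: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, title: str) -> "Node":
        """Create a child page and link it back to this page."""
        child = Node(title, parent=self)
        self.children.append(child)
        return child

    def up(self) -> Optional["Node"]:
        """[Up]: return to the next-higher level, if any."""
        return self.parent

    def _sibling(self, offset: int) -> Optional["Node"]:
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = siblings.index(self) + offset
        return siblings[i] if 0 <= i < len(siblings) else None

    def previous(self) -> Optional["Node"]:
        """[Previous]: the preceding entity on the same level (assumed: same parent)."""
        return self._sibling(-1)

    def next(self) -> Optional["Node"]:
        """[Next]: the following entity on the same level (assumed: same parent)."""
        return self._sibling(+1)


# Placeholder titles; the real category and section names are not reproduced here.
top = Node("Top Level Table of Contents")
category = top.add("category 1")
first = category.add("section 1")
second = category.add("section 2")
```

In this sketch, `first.next()` returns the second section and `second.up()` returns the category page; whether the real pages link across category boundaries is not stated in the text.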

Browse the Top Level Table of Contents in:

Manju
Tibetan
Mongolian
Uighur
Chinese
German

All sectional tables of contents show their entries in all available languages in parallel.

Presentation

In each sectional file, all lemmata are listed in blocks, laid out as tables with the following columns:

Page.Column	Language Identifier	Lemma (Level I)
0002.1	ma	abka
	tib	gnam.
	tib.l	gnam.
	tib.s	nam.
	mo	tngri
	ui	āsman
	ui.a	آسمان
	ui.s	asman
	zh	天
	py	tiān
	de	Himmel

The Page.Column pair refers to the pages and columns of the Beijing 1957 edition. The Language Identifier is one of:

Language Identifier	Language
ma	Manju
tib	Tibetan
tib.l	Tibetan transliteration
tib.s	Tibetan transcription
mo	Mongolian
ui	Uighur
ui.a	Uighur in Arabic script
ui.s	Uighur transcription
zh	Chinese (in characters)
py	Chinese (in Hanyu Pinyin)
de	German (based on Hauer's dictionary)
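As an illustration of the block layout above, the following Python sketch reads tab-separated rows into lemma blocks. It is a minimal sketch under stated assumptions: the function name `parse_blocks`, the in-memory representation, and the tab-separated input format are hypothetical, and only the language identifiers and sample values are taken from the tables above.

```python
from typing import Dict, List, Tuple

# Language identifiers as listed in the table above.
LANGUAGE_IDS = {
    "ma": "Manju", "tib": "Tibetan", "tib.l": "Tibetan transliteration",
    "tib.s": "Tibetan transcription", "mo": "Mongolian", "ui": "Uighur",
    "ui.a": "Uighur in Arabic script", "ui.s": "Uighur transcription",
    "zh": "Chinese (in characters)", "py": "Chinese (in Hanyu Pinyin)",
    "de": "German (based on Hauer's dictionary)",
}


def parse_blocks(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Group rows of 'Page.Column<TAB>identifier<TAB>lemma' into blocks.

    A row with a non-empty first field starts a new block; rows with an
    empty first field belong to the block opened most recently.
    """
    blocks: List[Tuple[str, Dict[str, str]]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        ref, lang, lemma = line.split("\t")
        if ref:
            blocks.append((ref, {}))
        if not blocks:
            raise ValueError("row without a preceding Page.Column value")
        if lang not in LANGUAGE_IDS:
            raise ValueError(f"unknown language identifier: {lang}")
        blocks[-1][1][lang] = lemma
    return blocks


sample = (
    "0002.1\tma\tabka\n"
    "\ttib.l\tgnam.\n"
    "\tmo\ttngri\n"
    "\tui.s\tasman\n"
    "\tzh\t天\n"
    "\tpy\ttiān\n"
    "\tde\tHimmel\n"
)
blocks = parse_blocks(sample)
page, column = blocks[0][0].split(".")  # page and column of the Beijing 1957 edition
```

Keeping the Page.Column value as a string preserves the leading zeros used throughout the dump.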

Usually, the Lemma is presented in one column which is understood to be identical with Level I (see below, Presentation Levels). If a second or third presentation level is present, the table adds more columns to the right side as needed.

Within the Lemma column(s), each word or Chinese character is linked to the index table system (see also below, The Index).

Presentation Levels

The material may be presented on several levels. Level I represents the lemma as it is found in the source text, including both orthographical variants and mistakes. If the source text shows a form which can be considered standard usage (which has to be defined for each individual language separately), there is no Level II representation. If the form of Level I is, by any standard, either a variant or a misrepresentation, a supposed normal form is offered in Level II. Consider, e.g., the characters 囘 and 回, both pronounced hui. The first form is used in lemma no. 2 on page 8 of the Pentaglot, but is commonly considered a yitizi (variant character) of the second form. In such a case, the Level II column contains a normalized form which is also used for searching this lemma via the index. An attempt is also made to highlight the words or characters that differ between the two levels. Finally, Level III contains notes, annotations and local cross-references, and can be understood as the equivalent of footnotes in conventional texts.
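The relationship between the three levels can be sketched as a small record type. This is an illustrative assumption rather than the database's actual schema: the class `Presentation` and its methods are hypothetical, the single character 囘 stands in for the full lemma (which is not quoted here), and the position-wise comparison assumes that both levels have the same length.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Presentation:
    """Hypothetical record holding the presentation levels of one lemma."""
    level1: str                       # form as found in the source text
    level2: Optional[str] = None      # normalized form, only if Level I deviates
    level3: List[str] = field(default_factory=list)  # notes and cross-references

    def search_form(self) -> str:
        """Form used for index lookup: Level II if present, else Level I."""
        return self.level2 if self.level2 is not None else self.level1

    def changed_positions(self) -> List[int]:
        """Positions at which Level I and Level II differ (for highlighting).

        Assumes both forms have the same length, as in a one-for-one
        character substitution.
        """
        if self.level2 is None:
            return []
        return [i for i, (a, b) in enumerate(zip(self.level1, self.level2)) if a != b]


variant = Presentation(level1="囘", level2="回")   # stand-in for the lemma on page 8
standard = Presentation(level1="天")               # standard form, no Level II
```

Here `variant.search_form()` yields the normalized character while `variant.level1` still preserves the source spelling.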

Special Presentation Features

Besides the obvious semasiological superstructure, individual languages show further peculiarities. Mongolian entries frequently contain synonyms in the form of A, basa B kämämüi. Thus, one page/column pair of the original text effectively contains two Mongolian lemmata. There is a total of 841 such pairs, which are resolved in Level I into three lines within one lemma block: the first line contains the original entry, the next two lines contain the synonyms without context. These lines are marked with a neutral gray colour, as can be seen at lemma 0010.4.
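The resolution of such synonym pairs can be sketched with a regular expression. This is a hedged illustration, not the project's actual procedure: the function name `resolve_synonyms` and the pattern are assumptions, and the sketch returns the two parts exactly as they stand in the entry, which may differ from how the editors cut the synonyms. The identifiers `mo.a` and `mo.b` follow the `mo.{a,b}` notation of the colour code table.

```python
import re
from typing import List, Tuple

# Assumed surface pattern for entries of the form "A, basa B kämämüi".
SYNONYM = re.compile(r"^(?P<a>.+?),\s*basa\s+(?P<b>.+?)\s+kämämüi$")


def resolve_synonyms(entry: str) -> List[Tuple[str, str]]:
    """Return the lines of a Mongolian lemma block for one entry.

    The original entry always comes first; if it matches the synonym
    pattern, the two synonyms follow as separate lines.
    """
    lines = [("mo", entry)]
    match = SYNONYM.match(entry)
    if match:
        lines.append(("mo.a", match.group("a")))
        lines.append(("mo.b", match.group("b")))
    return lines


resolved = resolve_synonyms("naran manduba, basa dägzibä kämämüi")  # lemma 0010.4
plain = resolve_synonyms("naran")                                   # lemma 0006.2
```

Entries that do not match the pattern pass through unchanged as a single line.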

Chinese lemmata apparently show a lesser degree of differentiation; there are 554 cases where all other languages show different lemmata, but Chinese has only backreferences marked by the phrase hanyu tongshang or a variation thereof. In these cases, the original text is printed first, followed by the Chinese lemma of the preceding tuple, as can be seen at lemma 0053.4 and its predecessor. Such a resolved backreference has its own colour code; a purple shade of gray stands for the hanzi entry, a blue shade of gray stands for the repeated pinyin entry.
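Resolving such backreferences amounts to carrying the last real Chinese lemma forward. The following Python sketch makes several assumptions: the function name `resolve_backrefs` is hypothetical, the backreference test is passed in as a predicate because the exact wording in the source varies, and the sample strings are stand-ins (the romanized phrase replaces the Chinese original, and the single character 密 replaces the full lemma at 0053.3).

```python
from typing import Callable, Iterable, List, Optional, Tuple

Resolved = Tuple[str, str, Optional[str]]  # (Page.Column, original text, repeated lemma)


def resolve_backrefs(
    entries: Iterable[Tuple[str, str]],
    is_backref: Callable[[str], bool],
) -> List[Resolved]:
    """Attach the preceding Chinese lemma to every backreference entry."""
    resolved: List[Resolved] = []
    last_lemma: Optional[str] = None
    for ref, text in entries:
        if is_backref(text):
            if last_lemma is None:
                raise ValueError(f"backreference at {ref} has no predecessor")
            # Keep the original text and add the repeated lemma (zh.a).
            resolved.append((ref, text, last_lemma))
        else:
            last_lemma = text
            resolved.append((ref, text, None))
    return resolved


# Stand-in data: not the actual lemma texts of the Pentaglot.
MARKER = "hanyu tongshang"
rows = [("0053.3", "密"), ("0053.4", MARKER)]
result = resolve_backrefs(rows, lambda text: text.startswith(MARKER))
```

Because only real lemmata update `last_lemma`, a run of consecutive backreferences would all resolve to the same preceding lemma.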

Colour Codes

The remaining colour codes, which have not been demonstrated yet, indicate the editing status and are explained in the following table:

2991.2	ma	black indicates normal entry
	mo	needs further editing
	mo.{a,b}	Auto-resolved synonym entry
	ui	Entry missing in source
	ui.s	Ready for shipment
	zh	把子
	zh.a	Auto-reduplicated back reference
	py	bǎ zi
	py.a	Auto-reduplicated back reference

The Index

The original Pentaglot could only be accessed through its huge table of contents. There was no way of finding an unknown word in any language other than browsing the table of contents or, if need be, the entire corpus based on the presumed meaning of the unknown word.

The index to the Pentaglot presented here covers all available languages and all words within every single lemma. It is thus possible to locate any lemma by any of its components.

For alphabetic scripts, the index features a list of every initial letter found in the source; for Chinese, all characters are classified according to a very close approximation of the Kangxi radicals.

For each word found, the index shows the page and column (clicking here leads to the text corpus) as well as the Level I data of all occurrences of the word in question in tabular form, as shown by the example of Mongolian naran (English: sun):

	naran
0006.2	naran
0009.2	naran dulaxan
0010.3	naran γarba
0010.4	naran manduba, basa dägzibä kämämüi
0011.2	naran kälbäyibä
0011.3	naran kälbäribä
0011.4	naran tasiba
0012.1	naran singgäbä
0012.2	naran bürtäyizi
0012.3	naran xosilaba
0012.4	naran küriyäläbä
0013.1	naran barimui

The index is detailed enough to take into account orthographical variation between the Level I and Level II presentations of a given lemma. In the case of the above-mentioned 回, the lemma location will be found under 回 even if the Level I entry shows a 囘, not a 回.
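The index behaviour described above corresponds to an inverted index keyed by the normalized form but displaying the Level I data. The sketch below is an assumption about how this could be built, not the project's implementation: `build_index`, `split_words`, and the input tuples are hypothetical, and suffix elements are not handled. The Mongolian sample values come from the naran table above; the single character 囘 is a stand-in for the full lemma on page 8.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Lemma = Tuple[str, str, Optional[str]]  # (Page.Column, Level I, Level II or None)


def split_words(text: str) -> List[str]:
    """Segment an alphabetic lemma into words, ignoring commas."""
    return text.replace(",", " ").split()


def build_index(
    lemmata: Iterable[Lemma],
    segment: Callable[[str], List[str]] = split_words,
) -> Dict[str, List[Tuple[str, str]]]:
    """Map every segment to the (Page.Column, Level I) pairs containing it.

    Segments are taken from Level II when it exists, so that variants are
    found under their normalized form; the listed text stays Level I.
    """
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for ref, level1, level2 in lemmata:
        search_form = level2 if level2 is not None else level1
        seen = set()
        for seg in segment(search_form):
            if seg not in seen:  # list each lemma once per segment
                seen.add(seg)
                index[seg].append((ref, level1))
    return dict(index)


mongolian = [
    ("0006.2", "naran", None),
    ("0009.2", "naran dulaxan", None),
    ("0010.4", "naran manduba, basa dägzibä kämämüi", None),
]
mo_index = build_index(mongolian)

# For Chinese, every character is a segment, so list() serves as the segmenter.
zh_index = build_index([("0008.2", "囘", "回")], segment=list)
```

Looking up `zh_index["回"]` returns the page 8 location together with its Level I spelling 囘, which mirrors the lookup described in the text.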

Chinese backreferences are also covered by the index, as the character 密 shows: its index list contains both 0053.3, the first lemma, and the repeated lemma 0053.4.

It is possible to access the Pentaglot database via the top level index file for each language, but it is also possible to click every word and character in the corpus in order to jump to the entry in the index.

It is then possible to jump to the original location of each reference by clicking on the Page.Column pair in the left column, but it is also possible to look up each word in each lemma occurrence by clicking on it. The index is completely circular and should not have any dead end.
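The claim that the index is circular can be checked mechanically. The following Python sketch shows one possible consistency check under simplifying assumptions: the function name `find_dead_ends` and both input structures are hypothetical, with the corpus reduced to a mapping from Page.Column to its words and the index reduced to a mapping from word to Page.Column values.

```python
from typing import Dict, List, Set, Tuple


def find_dead_ends(
    corpus: Dict[str, List[str]],
    index: Dict[str, List[str]],
) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    """Report links that lead nowhere in either direction.

    Returns (unindexed, dangling):
    - unindexed: (Page.Column, word) pairs whose word has no index entry
      pointing back to that location;
    - dangling: (word, Page.Column) index references to a location that
      does not exist in the corpus or does not contain the word.
    """
    unindexed = {
        (ref, word)
        for ref, words in corpus.items()
        for word in words
        if ref not in index.get(word, [])
    }
    dangling = {
        (word, ref)
        for word, refs in index.items()
        for ref in refs
        if word not in corpus.get(ref, [])
    }
    return unindexed, dangling


corpus = {"0006.2": ["naran"], "0009.2": ["naran", "dulaxan"]}
good_index = {"naran": ["0006.2", "0009.2"], "dulaxan": ["0009.2"]}
broken_index = {"naran": ["0006.2", "0099.9"]}  # deliberately inconsistent

ok = find_dead_ends(corpus, good_index)
bad = find_dead_ends(corpus, broken_index)
```

An index without dead ends yields two empty sets, as `ok` does here.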

Browse the Index in:

Final Remarks

The purpose of this HTML Demonstrator is limited to presenting the general structure of the Pentaglot Database; this little demonstration is by no means indicative of the final layout in printed form, nor does it show the advanced capabilities of a genuine database which can be queried interactively. Unlike the present HTML dump, which is generated exclusively in Latin script and Chinese characters, the final and complete text will be shown in its original scripts, both in printed and electronic forms.

The editors welcome all comments and will do their best to improve the content and form of the Pentaglot database; meanwhile, we claim responsibility for all errors. Comments, questions and critical remarks may be sent to oliver(dot)corff(at)email(dot)de