The Pentaglot

A Dictionary in Five Languages

A Preliminary HTML Dump of the Pentaglot Database,

Demonstrator based on Volume 1, Heaven

editors: Oliver Corff, Dorj Dorjpalam, Xieyan Hincha, Wolfgang Lipp, Kyôko Maezono, Ablet Semet, Töwshintögs, Aysima Mirsultan, G. Gerelee

December 01, 2008, Ulaanbaatar/Berlin


The Pentaglot, or Wuti Qingwen Jian in Chinese, compiled during the Qing Dynasty, is a famous multi-lingual dictionary. It comprises five languages: Manju, Tibetan, Mongolian, Uighur and Chinese, showing all five languages in parallel, arranged in semasiological order.

Its contents can only be accessed by a huge table of contents covering all subjects; no index for any of the languages is available in the original text.

The Pentaglot was never published before 1957, when the Beijing Palace Museum prepared a photographic reproduction of the original manuscript, of which only two or three copies are known to have survived into the 21st century.

Many features of the manuscript suggest that the authors of the original dictionary never completed its preparation for publication; it retains too many errors that would have been purged from an officially approved imperial publication.

The Pentaglot Project aims to present all data of this dictionary as precisely as possible, preserving all idiosyncrasies of the original; in addition, the complete text is now fully indexed by every segment of every lemma in every language. Segments can be words, suffix elements or Chinese characters. A Chinese pronunciation (in pinyin) and a German translation based on Hauer's Manju-German dictionary have been added.


Nota Bene:

Due to the proofreading status of individual languages, this HTML demonstration of the Pentaglot database is by no means authoritative; errors still exist in abundance, and hence, this material is only quotable after written request and confirmation.

All material presented here is encoded as UTF-8 Unicode. Unfortunately, the fonts used by a particular browser may be incomplete, in which case boxes are displayed instead of characters. This holds true for some accented characters (as used in Uighur and Hanyu Pinyin) as well as some Arabic and Chinese characters.

Material contained in this Demonstration

The Pentaglot Demonstration shows all material of Volume 1, Heaven. Except for Tibetan in Manju transcription, all language data are available and have undergone at least one proofreading cycle, not precluding of course that mistakes persist.

For the complete database, the editing status presents itself as follows:

Language or Script       Data Acquisition Status  Proofreading Status    Level II Presentation
Manju                    Complete                 Done                   Done
Tibetan                  Complete                 Cycle II in progress   In progress, not yet included
Tibetan transliteration  Complete                 Cycle I in progress    In progress
Tibetan transcription    Complete                 Cycle I in progress    In progress
Mongolian                Complete                 Cycle III in progress  In progress, parts included
Uighur                   Complete                 Cycle II in progress   In progress, parts included
Uighur transcription     Complete                 Cycle II in progress   In progress, parts included
Chinese                  Complete                 Cycle III in progress  Done
Pinyin                   Complete                 Cycle III in progress  Done
German                   Complete                 Done

Data Structure

The Pentaglot database follows the natural order of its semasiological system, which is presented via a top level directory (containing 32 volumes) leading to the multitude of thematic sections in a two-tier manner. Each volume of the Pentaglot may contain one or more categories (approximately 56 in total). Each category has its own table of contents; from here, the reader is guided to the lemmata, which are presented in tabular form.

The total of approx. 56 categories (šošohon in Manju, xuriyanggui in Mongolian, ser jama i in Uighur and bu in Chinese) comprises approx. 318 major sections (hacin in Manju, skor in Tibetan, züil in Mongolian, qismi in Uighur and lei in Chinese) which may again be divided into sub-sections (meyen in Manju, anggi in Mongolian, böläk in Uighur and ze in Chinese), resulting in approx. 636 sections. These sections contain sometimes only a few lemmata (like three lemmata on page 4596 of the Beijing 1957 edition), sometimes as many as 66 lemmata (like the section beginning on page 2670 of the Beijing 1957 edition).

After opening the Top Level Table of Contents (see below) in any desired language, one proceeds as follows:

  1. Top Level Table of Contents lists categories
  2. Category Table of Contents lists sections
  3. Individual Section lists lemmata

Since each Category Table of Contents lists its full title in all languages, it is necessary to scroll one or two screens downwards until the list of sections is reached.

On all levels, it is possible to browse to the preceding or following entity or return to the next-higher level by using the [Previous][Up][Next] buttons at the top and bottom of each page.
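The three-tier navigation described above can be pictured as a small tree. The following Python fragment is only an illustrative sketch and not the code behind the HTML dump: the class name `Node`, its methods and the example titles are assumptions made for this illustration, and it assumes that [Previous] and [Next] move only among entities sharing the same parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison: two nodes are never "equal" by content
class Node:
    """Hypothetical entity: top level, a category, or a section."""
    title: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, title: str) -> "Node":
        """Create a child entity and return it."""
        child = Node(title, parent=self)
        self.children.append(child)
        return child

    def _sibling(self, offset: int) -> Optional["Node"]:
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = siblings.index(self) + offset
        return siblings[i] if 0 <= i < len(siblings) else None

    def previous(self) -> Optional["Node"]:
        """Target of the [Previous] button, or None at the first entity."""
        return self._sibling(-1)

    def up(self) -> Optional["Node"]:
        """Target of the [Up] button, or None at the top level."""
        return self.parent

    def next(self) -> Optional["Node"]:
        """Target of the [Next] button, or None at the last entity."""
        return self._sibling(+1)


# Invented titles, for illustration only.
top = Node("Top Level Table of Contents")
category = top.add("Category A")
first = category.add("Section A.1")
second = category.add("Section A.2")
print(first.next().title, "|", second.up().title)
```

Whether the real pages let [Next] cross from the last section of one category into the following category is not stated here; the sketch simply returns `None` at such a boundary.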

Browse the Top Level Table of Contents in:

All sectional tables of contents show their entries in all available languages in parallel.


In each sectional file, all lemmata are listed in blocks running in tables with the following columns:

Page.Column  Language Identifier  Lemma (Level I)
0002.1       ma                   abka

The Page.Column pair refers to the pages and columns of the Beijing 1957 edition. The Language Identifier is one of:

Language Identifier  Language
ma                   Manju
tib                  Tibetan
tib.l                Tibetan transliteration
tib.s                Tibetan transcription
mo                   Mongolian
ui                   Uighur
ui.a                 Uighur in Arabic script
ui.s                 Uighur transcription
zh                   Chinese (in characters)
py                   Chinese (in Hanyu Pinyin)
de                   German (based on Hauer's dictionary)
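As an illustration of how one table row could be read by a program, the following Python sketch parses a Page.Column reference together with a language identifier. It is a sketch under stated assumptions, not the project's own code: the names `Row`, `LANGUAGES` and `parse_row` are hypothetical, and the pattern assumes that a reference always consists of a four-digit page and a one-digit column, as in the examples shown on this page.

```python
import re
from typing import NamedTuple

# Language identifiers as listed in the table above.
LANGUAGES = {
    "ma": "Manju",
    "tib": "Tibetan",
    "tib.l": "Tibetan transliteration",
    "tib.s": "Tibetan transcription",
    "mo": "Mongolian",
    "ui": "Uighur",
    "ui.a": "Uighur in Arabic script",
    "ui.s": "Uighur transcription",
    "zh": "Chinese (in characters)",
    "py": "Chinese (in Hanyu Pinyin)",
    "de": "German (based on Hauer's dictionary)",
}

# Assumption: four-digit page, a dot, one-digit column (e.g. 0002.1).
_REFERENCE = re.compile(r"^(\d{4})\.(\d)$")


class Row(NamedTuple):
    page: int
    column: int
    language: str
    lemma: str


def parse_row(reference: str, language: str, lemma: str) -> Row:
    """Validate one table row and split its Page.Column reference."""
    match = _REFERENCE.match(reference)
    if match is None:
        raise ValueError(f"not a Page.Column reference: {reference!r}")
    if language not in LANGUAGES:
        raise ValueError(f"unknown language identifier: {language!r}")
    return Row(int(match.group(1)), int(match.group(2)), language, lemma)


row = parse_row("0002.1", "ma", "abka")
print(row.page, row.column, LANGUAGES[row.language], row.lemma)
```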

Usually, the Lemma is presented in one column which is understood to be identical with Level I (see below, Presentation Levels). If a second or third presentation level is present, the table adds more columns to the right side as needed.

Within the Lemma column(s), each word or Chinese character is linked to the index table system (see also below, The Index).

Presentation Levels

The material may be presented on several levels. Level I represents the lemma as it is found in the source text, including both orthographical variants and mistakes. If the source text shows a form which can be considered standard usage (which has to be defined for each individual language separately), there is no Level II representation. If the form of Level I is, by any standard, either a variant or a misrepresentation, a supposed normal form is offered in Level II. Consider, e.g., the characters 囘 and 回, both pronounced hui. The first form is used in lemma no. 2 on page 8 of the Pentaglot, but is commonly considered a yitizi (variant character) of the second form. In such a case, the Level II column contains a normalized form which is also used for searching this lemma via the index. An attempt is also made to highlight the words or characters that differ between the two levels. Finally, Level III contains notes, annotations and local cross-references, and can be understood as the equivalent of footnotes in conventional texts.
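The relationship between the three levels can be sketched as a small record type. This Python fragment is an illustration only; the class `Lemma`, its field names and both methods are assumptions and do not describe the actual database schema. The example uses only the single characters 囘 and 回, not the full text of the lemma in question.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lemma:
    """Hypothetical record holding the three presentation levels."""
    level1: str                   # form exactly as found in the source text
    level2: Optional[str] = None  # normalized form; absent if level1 is standard
    level3: Optional[str] = None  # notes, annotations, cross-references

    def search_form(self) -> str:
        """Form under which the lemma is looked up: Level II if present."""
        return self.level2 if self.level2 is not None else self.level1

    def differing_positions(self) -> List[int]:
        """Positions at which Level I and Level II differ (for highlighting).

        Simplification: compares position by position, which is only
        meaningful when both forms have the same length.
        """
        if self.level2 is None:
            return []
        return [i for i, (a, b) in enumerate(zip(self.level1, self.level2))
                if a != b]


variant = Lemma(level1="囘", level2="回")
standard = Lemma(level1="回")
print(variant.search_form(), standard.search_form(),
      variant.differing_positions())
```

Both records resolve to the same search form, which is the behaviour the Level II column is meant to provide for index lookups.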

Special Presentation Features

Besides the obvious semasiological superstructure, individual languages show further peculiarities. Mongolian entries frequently contain synonyms in the form of A, basa B kämämüi. Thus, one page/column pair of the original text effectively contains two Mongolian lemmata. There is a total of 841 such pairs, which are resolved in Level I into three lines within one lemma block: the first line contains the original entry, the next lines contain both synonyms without context. These lines are marked with a neutral gray colour, as can be seen at lemma 0010.4.

Chinese lemmata apparently show a lesser degree of differentiation; there are 554 cases where all other languages show different lemmata, but Chinese has only backreferences marked by the phrase hanyu tongshang or a variation thereof. In these cases, the original text is printed first, followed by the Chinese lemma of the preceding tuple, as can be seen at lemma 0053.4 and its predecessor. Such a resolved backreference has its own colour code; a purple shade of gray stands for the hanzi entry, a blue shade of gray stands for the repeated pinyin entry.
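The resolution of a Mongolian synonym entry into three lines can be sketched with a regular expression. This is a minimal illustration and not the procedure actually used by the editors: the function name `resolve_synonyms` and the pattern are assumptions, and the sketch assumes that the formula always reads exactly ", basa " and " kämämüi" as in lemma 0010.4.

```python
import re
from typing import List

# Assumed shape of a synonym entry: "A, basa B kämämüi".
_SYNONYM = re.compile(r"^(?P<a>.+?), basa (?P<b>.+?) kämämüi$")


def resolve_synonyms(entry: str) -> List[str]:
    """Return the lines of a lemma block for one Mongolian entry.

    A synonym entry yields three lines (original, A, B);
    any other entry is returned unchanged as a single line.
    """
    match = _SYNONYM.match(entry)
    if match is None:
        return [entry]
    return [entry, match.group("a"), match.group("b")]


for line in resolve_synonyms("naran manduba, basa dägzibä kämämüi"):
    print(line)
```

Spelling variants of the formula in the source text would need additional patterns; the sketch leaves such entries unresolved rather than guessing.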
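The resolution of a Chinese backreference can be sketched in the same spirit. The Python fragment below is an assumption-laden illustration: the function name `resolve_backreferences` is hypothetical, the marker is given in its pinyin form as quoted above although the source text itself uses Chinese characters, and the entry text "lemma-A" is a stand-in, not the actual content of lemma 0053.3. The sketch further assumes that a chain of consecutive backreferences resolves to the last entry that is not itself a backreference.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed marker; the source mentions "a variation" of it as well.
BACKREFERENCE_MARKERS = ("hanyu tongshang",)


def resolve_backreferences(
    entries: Iterable[Tuple[str, str]],
    markers: Tuple[str, ...] = BACKREFERENCE_MARKERS,
) -> List[Tuple[str, List[str]]]:
    """Expand backreferences in a sequence of (Page.Column, text) pairs.

    A backreference yields two lines: the original marker text first,
    then the text of the most recent non-backreference entry.
    """
    resolved: List[Tuple[str, List[str]]] = []
    last_real: Optional[str] = None
    for reference, text in entries:
        if text in markers and last_real is not None:
            resolved.append((reference, [text, last_real]))
        else:
            resolved.append((reference, [text]))
            last_real = text
    return resolved


# "lemma-A" is a placeholder, not real dictionary data.
sample = [("0053.3", "lemma-A"), ("0053.4", "hanyu tongshang")]
for reference, lines in resolve_backreferences(sample):
    print(reference, lines)
```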

Colour Codes

The remaining colour codes which have not been demonstrated yet give indications of the editing status, and are explained in the following table:

2991.2  ma        black indicates normal entry
        mo        needs further editing
        mo.{a,b}  Auto-resolved synonym entry
        ui        Entry missing in source
        ui.s      Ready for shipment
        zh.a      Auto-reduplicated back reference
        py        bă zi
        py.a      Auto-reduplicated back reference

The Index

The original Pentaglot could only be accessed by its huge table of contents. There was no way of finding an unknown word in any language other than browsing the table of contents or eventually the total corpus based on the presumed meaning of the unknown word.

The index to the Pentaglot presented here covers all available languages and all words within every single lemma. It is thus possible to locate any lemma by any of its components.

In alphabetical scripts, the index features a list of every initial letter found in the source; in Chinese, all characters are classified according to a very close approximation of the Kangxi radicals.

For each word found, the index shows the page and column (clicking here leads to the text corpus) as well as the Level I data of all occurrences of the word in question in tabular form, as shown by the example of Mongolian naran (English: sun):

0009.2  naran dulaxan
0010.3  naran γarba
0010.4  naran manduba, basa dägzibä kämämüi
0011.2  naran kälbäyibä
0011.3  naran kälbäribä
0011.4  naran tasiba
0012.1  naran singgäbä
0012.2  naran bürtäyizi
0012.3  naran xosilaba
0012.4  naran küriyäläbä
0013.1  naran barimui
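An index of this kind is essentially an inverted index from words to their occurrences. The following Python sketch shows the idea with three of the entries listed above; it is an illustration under assumptions, not the project's implementation. The function name `build_index` is hypothetical, and words are obtained simply by splitting on whitespace and commas, which ignores suffix elements and Chinese characters.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Occurrence = Tuple[str, str]  # (Page.Column, Level I text)


def build_index(lemmata: Iterable[Occurrence]) -> Dict[str, List[Occurrence]]:
    """Map every word to the list of lemmata in which it occurs."""
    index: Dict[str, List[Occurrence]] = defaultdict(list)
    for reference, text in lemmata:
        # Simplified segmentation: commas are treated as separators.
        for word in text.replace(",", " ").split():
            if (reference, text) not in index[word]:
                index[word].append((reference, text))
    return dict(index)


corpus = [
    ("0009.2", "naran dulaxan"),
    ("0010.3", "naran γarba"),
    ("0010.4", "naran manduba, basa dägzibä kämämüi"),
]
index = build_index(corpus)
for reference, text in index["naran"]:
    print(reference, text)
```

Because every word of a lemma becomes a key, a lookup of dulaxan leads to the same lemma 0009.2 as a lookup of naran, which mirrors the circular linking between corpus and index described in this section.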

The index also accounts for orthographical variation between the Level I and Level II presentations of a given lemma. In the case of the above-mentioned character hui, the lemma location will be found under 回 even if the Level I entry shows 囘 rather than 回.

Chinese backreferences are also covered by the index: the index list for a character of the Chinese lemma at 0053.3 contains both 0053.3, the first lemma, and the repeated lemma 0053.4.

It is possible to access the Pentaglot database via the top level index file for each language, but it is also possible to click every word and character in the corpus in order to jump to the entry in the index.

It is then possible to jump to the original location of each reference by clicking on the Page.Column pair in the left column, but it is also possible to look up each word in each lemma occurrence by clicking on it. The index is completely circular and should not have any dead end.

Browse the Index in:

Final Remarks

The purpose of this HTML Demonstrator is limited to presenting the general structure of the Pentaglot Database; this little demonstration is by no means indicative of the final layout in printed form, nor does it show the advanced capabilities of a genuine database which can be queried interactively. Unlike the present HTML dump, which is generated exclusively in Latin script and Chinese characters, the final and complete text will be shown in its original scripts, both in printed and electronic forms.

The editors welcome all comments and will do their best to improve contents and form of the Pentaglot database; meanwhile, we claim responsibility for all errors. Comments, questions and critical remarks are always welcome and may be sent to oliver(dot)corff(at)email(dot)de