Pinyin Collation

(Tom Bishop, 2005-9-20; revised thru 2006-4-27)

Introduction

This page describes the system used by Wenlin and the ABC Chinese-English Comprehensive Dictionary for alphabetizing Mandarin words according to their pinyin spelling. The system is described from a practical perspective in the front matter of the dictionary itself. This page takes a more technical perspective. Programming source code is provided for the benefit of anyone who might want to implement pinyin collation for other applications.

History

An important standard-setting book on pinyin collation is Hànyǔ Pīnyīn Cíhuì (汉语拼音词汇, 语文出版社, 1989, ISBN 7-80006-335-6), edited under the direction of Zhou Youguang 周有光. Hànyǔ Pīnyīn Cíhuì served as the basis for the system used in the ABC Chinese-English Dictionary, edited by John DeFrancis. (A footnote below describes a slight difference.) The original implementation for ABC was done by Bob Hsu using the SNOBOL language. The current implementations by Tom Bishop (used for the Comprehensive edition of ABC and for the Wenlin software) are written in C and Perl.

Basic Idea

The ordering is primarily alphabetical. Diacritical marks, punctuation, juncture, and capitalization are taken into account only when the strings being compared are otherwise identical. For example, píng'ān sorts before pīnyīn because pingan sorts before pinyin: g precedes y alphabetically.
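
The first-level comparison can be sketched in a few lines of Perl. This is only an illustration, not the code used by Wenlin; the helper name alpha_skeleton is hypothetical, and it relies on Unicode decomposition rather than the lookup table in the full script further down.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: reduce a pinyin string to its bare letters a-z.
sub alpha_skeleton {
    my ($s) = @_;
    $s = NFD($s);          # split precomposed letters into base letter + combining marks
    $s =~ s/\p{Mn}//g;     # drop combining marks (tone marks, umlaut dots)
    $s = lc $s;            # capitalization is ignored at this level
    $s =~ s/[^a-z]//g;     # drop apostrophes, hyphens, spaces, and anything else
    return $s;
}

my $first  = alpha_skeleton("píng'ān");   # "pingan"
my $second = alpha_skeleton("pīnyīn");    # "pinyin"
print "$first sorts before $second\n" if $first lt $second;
```

Only when two skeletons are equal do the other factors come into play.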

Only when two strings are alphabetically identical is non-alphabetical information taken into account. For example, pǐnxíng sorts before pǐnxìng, because the only difference is in the tone mark of the second syllable (xíng, 2nd tone, vs. xìng, 4th tone), and the standard order of the four Mandarin tones (ā=a1; á=a2; ǎ=a3; à=a4) is followed. The neutral tone (without a tone mark) is treated as tone zero, so that de (=de0) sorts before dé (=de2).
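
The two-level idea (letters first, tones only as a tie-breaker) can be illustrated with tone-digit input. This is a simplified sketch with hypothetical helper names; it assumes every syllable carries an explicit digit, including 0 for the neutral tone, and it ignores umlaut, juncture, capitalization, and homophone numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: split tone-digit pinyin such as "pin3xing2" into
# its letters ("pinxing") and its tone digits ("32").
sub letters_and_tones {
    my ($s) = @_;
    (my $letters = lc $s) =~ s/[^a-z]//g;
    my $tones = join '', $s =~ /([0-4])/g;
    return ($letters, $tones);
}

# Compare the letters first; fall back to the tone digits only on a tie.
sub compare_pinyin {
    my ($x, $y) = @_;
    my ($xl, $xt) = letters_and_tones($x);
    my ($yl, $yt) = letters_and_tones($y);
    return ($xl cmp $yl) || ($xt cmp $yt);
}

my @sorted = sort { compare_pinyin($a, $b) }
    qw(pin3xing4 de2 pin1yin1 pin3xing2 de0 ping2an1);
print "@sorted\n";   # de0 de2 ping2an1 pin3xing2 pin3xing4 pin1yin1
```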

Tones are extremely important in Mandarin. For example, the words mā 妈 'mother' and mǎ 马 'horse' are distinguished in speech only by their tones. The difference is just as important as that between the English words big and pig. Tones are challenging, not only for non-native learners of Chinese, but also for native speakers whose dialects are not exactly the same as what is described in a particular dictionary. Therefore, it's easier to look up words in a dictionary if words that differ only in tone are listed close together. One time I thought someone was asking me if I had gāngbǐ 钢笔, a pen; but they were really asking if I had Gǎngbì 港币, Hong Kong currency. (This was in Sichuan, where the Mandarin tones are all topsy-turvy.) This kind of near-homophone can be studied relatively easily using an alphabetically arranged dictionary like ABC; other kinds of dictionaries list them so far apart that there is no way to make the connection.

Details

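For strings that are alphabetically identical, the order is determined by the following factors, in order of decreasing priority:

  1. umlaut (u sorts before ü);
  2. juncture (a string without an apostrophe, or equivalent space or hyphen, sorts before one with it);
  3. tones (0 for neutral, then 1 through 4);
  4. capitalization (lower case sorts before upper case);
  5. homophone number (the leading number, as in ³xíng).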

Implementation with Collation Keys

Pinyin strings can be specified either with Unicode, as in nǐhǎo, or with ASCII, as in ni3hao3. (The ASCII equivalent of ü is v, so that nǚ and nv3 are treated the same.) For any given pinyin string a collation key is generated. The collation key is a string, using only the letters a-z, the digits 0-4, and a period (.) as a separator. (Compression can be used to produce shorter binary keys that sort the same way; compression won't be discussed further here.) Two collation keys can be compared as simple ASCII strings, using the default sort or cmp operators in Perl, or the strcmp() function in C.
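
As a small illustration of sorting by key, the following sketch sorts four words using Perl's cmp on precomputed keys. The first two keys are taken from the sample output in the Pinyin Collation Utility section; the other two were worked out by hand for this example and should be treated as assumptions. The hash name %key_for is hypothetical.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Word => collation key.
my %key_for = (
    "xiàn"  => 'xian...4..z',       # from the sample output
    "Xī'ān" => 'xian..x.11.z.z',    # from the sample output
    "xiān"  => 'xian...1..z',       # derived by hand (assumption)
    "xiǎng" => 'xiang...3..z',      # derived by hand (assumption)
);

# Plain string comparison of the keys gives the collation order.
my @sorted = sort { $key_for{$a} cmp $key_for{$b} } keys %key_for;
print join(' < ', @sorted), "\n";   # xiān < xiàn < Xī'ān < xiǎng
```

Because the period sorts before every digit and letter in ASCII, a shorter alphabetical part (xian) always precedes a longer one that extends it (xiang).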

A key consists of six parts, separated by periods:

  1. alpha (simply a-z, for example píng'ān becomes pingan);
  2. umlaut (empty unless ü occurs in the string);
  3. apostrophe (empty unless an apostrophe or equivalent space or hyphen occurs);
  4. tones (digits 0-4 for each syllable);
  5. capitalization (empty unless there is a capitalized letter);
  6. homophone (based on leading number as in ³xíng).
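
To inspect a key, it can be split back into these six parts. The sketch below uses the key for Xī'ān from the sample output in the next section; the helper name key_parts is hypothetical.

```perl
use strict;
use warnings;

my @PART_NAMES = qw(alpha umlaut apostrophe tones capitalization homophone);

# Hypothetical helper: split a collation key into its six named parts.
sub key_parts {
    my ($key) = @_;
    my @fields = split /\./, $key, -1;   # limit -1 keeps empty fields
    die "expected 6 parts in '$key'\n" unless @fields == @PART_NAMES;
    my %parts;
    @parts{@PART_NAMES} = @fields;
    return \%parts;
}

my $parts = key_parts('xian..x.11.z.z');
printf "%-15s '%s'\n", $_, $parts->{$_} for @PART_NAMES;
```

Here the umlaut part is empty, the apostrophe part x marks a juncture at letter position 2, the tones are 11, and the capitalization part z marks a capital at position 0.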

Pinyin Collation Utility

A Perl implementation of the algorithm is included in a utility whose source code appears in the following section. (The Wenlin program itself is written in the C programming language, not in Perl; I intend eventually to provide equivalent C code for key generation here as well, if anyone is interested. The Perl code, which started as a translation of the C code, is shorter and easier to make into a stand-alone program. The C code is wrapped up in the Wenlin source, and will take a bit of work to extract.) The Perl script can be run on the command line with two parameters, for the source and target pinyin strings to be compared. Sample output:

Source = [xiàn]
Target = [Xī'ān]
Source key = [xian...4..z]
Target key = [xian..x.11.z.z]
Source is less than target (xiàn < Xī'ān).

If you use the source code, I'd appreciate receiving credit and being informed of it. Please contact tbishop@wenlin.com.

Perl Source Code


#!/usr/bin/perl  -w
# Copyright (c) 2005-2006 Tom Bishop, Wenlin Institute, Inc.
use strict;
use utf8; use open ':utf8'; use open ':std';
use Unicode::Normalize; # for NFC() (April 27, 2006)

my %uniPrecomp = ( # for tone marks, use 1-4; ü=v0, Ü=V0
	'À' => 'A4', # U+00c0
	'Á' => 'A2', # U+00c1
	'È' => 'E4', # U+00c8
	'É' => 'E2', # U+00c9
	'Ì' => 'I4', # U+00cc
	'Í' => 'I2', # U+00cd
	'Ò' => 'O4', # U+00d2
	'Ó' => 'O2', # U+00d3
	'Ù' => 'U4', # U+00d9
	'Ú' => 'U2', # U+00da
	'Ü' => 'V0', # U+00dc
	'à' => 'a4', # U+00e0
	'á' => 'a2', # U+00e1
	'è' => 'e4', # U+00e8
	'é' => 'e2', # U+00e9
	'ì' => 'i4', # U+00ec
	'í' => 'i2', # U+00ed
	'ò' => 'o4', # U+00f2
	'ó' => 'o2', # U+00f3
	'ù' => 'u4', # U+00f9
	'ú' => 'u2', # U+00fa
	'ü' => 'v0', # U+00fc
	'Ā' => 'A1', # U+0100
	'ā' => 'a1', # U+0101
	'Ē' => 'E1', # U+0112
	'ē' => 'e1', # U+0113
	'Ě' => 'E3', # U+011a
	'ě' => 'e3', # U+011b
	'Ī' => 'I1', # U+012a
	'ī' => 'i1', # U+012b
	'Ń' => 'N2', # U+0143
	'ń' => 'n2', # U+0144
	'Ň' => 'N3', # U+0147
	'ň' => 'n3', # U+0148
	'Ō' => 'O1', # U+014c
	'ō' => 'o1', # U+014d
	'Ū' => 'U1', # U+016a
	'ū' => 'u1', # U+016b
	'Ǎ' => 'A3', # U+01cd
	'ǎ' => 'a3', # U+01ce
	'Ǐ' => 'I3', # U+01cf
	'ǐ' => 'i3', # U+01d0
	'Ǒ' => 'O3', # U+01d1
	'ǒ' => 'o3', # U+01d2
	'Ǔ' => 'U3', # U+01d3
	'ǔ' => 'u3', # U+01d4
	'Ǖ' => 'V1', # U+01d5
	'ǖ' => 'v1', # U+01d6
	'Ǘ' => 'V2', # U+01d7
	'ǘ' => 'v2', # U+01d8
	'Ǚ' => 'V3', # U+01d9
	'ǚ' => 'v3', # U+01da
	'Ǜ' => 'V4', # U+01db
	'ǜ' => 'v4', # U+01dc
	'Ǹ' => 'N4', # U+01f8
	'ǹ' => 'n4', # U+01f9
); # uniPrecomp

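# Command-line arguments arrive as raw bytes unless perl is run with -CA,
# so decode them as UTF-8 in place (left unchanged if not valid UTF-8).
utf8::decode($_) for @ARGV;
my $source = (defined $ARGV[0]) ? $ARGV[0] : "xiàn";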
my $target = (defined $ARGV[1]) ? $ARGV[1] : "Xī'ān";
my $sourceKey = ABCPinyinKey($source);
my $targetKey = ABCPinyinKey($target);
print "Source = [$source]\n";
print "Target = [$target]\n";
print "Source key = [$sourceKey]\n";
print "Target key = [$targetKey]\n";
if ($sourceKey lt $targetKey) {
	print "Source is less than target ($source < $target).\n";
}
elsif ($sourceKey gt $targetKey) {
	print "Source is greater than target ($source > $target).\n";
}
else {
	print "Source and target are equal ($source == $target).\n";
}

sub ABCPinyinKey {
	# Given pinyin string (which may start with homophone number), return ABC collation key.
	# Return '??' if bad input.
	my ($s) = @_;
	$s = NFC($s); # April 27, 2006
	my $sCopy = $s;
	my $homNumber = 0;

	if ($s =~ /^(\d+)(.*)$/) {
		$homNumber = $1;
		$s = $2;
	}
	if ($s =~ /\*$/) { # Trailing asterisk ('*')?
		$s =~ s/\*$//; # Remove *
		if ($homNumber > 1) {
			warn "ABCPinyinKey: asterisk illegal with homophone number > 1 ($sCopy)\n";
			return '??';
		}
		$homNumber = '*';
	}
	my $key = '';
	my $umlautBuf = ''; my $apostBuf= '';
	my $toneBuf = ''; my $capBuf = '';
	my $prevChar = ''; my $vowel = ''; my $tone = '';
	my $YES = 1; my $NO = 0;
	my $gotVowel = $NO; my $gotTone = $NO; my $digitAfterNGR = $NO;
	my $endOfString = '(end)';

	my $homBuf = HomNumABC($homNumber);
	$s =~ s/\([^)]*\)/-/g; # Replace anything in parentheses by a hyphen (in case followed by a/e/o)
	my @charArray = split(//, $s);
	my $arrayLength = push(@charArray, $endOfString);
	# Use index $i since need ability to look ahead.
	for (my $i = 0; $i < $arrayLength; $i++) {
		my $c = $charArray[$i];
		$tone = ''; # default, unless this char includes a tone
		$vowel = ''; # default, unless this char includes a vowel
		if ($c ne $endOfString) {
			my $ord = ord($c);
			if ($ord >= 0x80) {
				if ($key ne '' && ($tone = PinyinDiacritic($c)) ne '') {
					# tone mark, as a non-spacing diacritic; count it as being on previous character
					$c = ''; # to be ignored
					$gotTone = $YES; # this syllable has a non-neutral tone
				}
				else {
					my $ascii;
					($ascii, $tone) = UniPyChar($c);
					$tone = '' if $tone eq '0'; # plain ü/Ü (v0) carries no tone mark, so it must not add a tone digit
					if ($ascii ne '') {
						$c = $ascii;
					}
					else {
						my $uni = sprintf("U+%04x", $ord);
						warn "ABCPinyinKey: $c ($uni) ignored ($sCopy) (@charArray)\n";
					}
				}
			}
			if ($c =~ /[A-Z]/) { # capital letter
				$capBuf .= ZYXPosition(length($key));
				$c = lc($c); # convert to lower case
			}
			if ($c =~ /[a-z]/) { # lower-case letter
				if ($c eq 'v') {
					$umlautBuf .= ZYXPosition(length($key));
					$c = 'u';
				}
				if ($c =~ /[aeiou]/) {
					$vowel = $c;
					$gotVowel = $YES; # will remain $YES until we reach a non-vowel
					if ($tone ne '') {
						$gotTone = $YES; # this syllable has a non-neutral tone
					}
					# a/e/o preceded by apostrophe, space, or hyphen?
					if ($c =~ /[aeo]/ && $prevChar =~ /[-' ]/) {
						$apostBuf .= ZYXPosition(length($key));
					}
				}
				$key .= $c;
			}
			elsif ($c =~ /[0-4]/) { # support tone-digits as well as diacritics
				$gotTone = $YES; # this syllable (vowel string) has an explicit tone
				$digitAfterNGR = $NO;
				$tone = $c;
			}
		}
		if ($vowel eq '' && $digitAfterNGR == $NO) {
			# Do this check after each syllable, including end of string.
			# If we've just ended a string of vowels that did not include a tone mark,
			#  then treat it as a neutral tone syllable.
			if ($gotVowel == $YES && $gotTone == $NO) {
				# Must check for tone digit after -n, -ng, or -er.
				if ($c eq 'n' || ($c eq 'r' && $prevChar eq 'e')) {
					my $nextC = $charArray[$i + 1];
					if ($c eq 'n' && ($nextC eq 'g' || $nextC eq 'G')) {
						$nextC = $charArray[$i + 2];
					}
					if ($nextC =~ /[0-4]/) {
						$digitAfterNGR = $YES; # there is a tone digit after -n, -ng, or -er
					}
				}
				if ($digitAfterNGR == $NO) {
					# Syllable not followed by digit: treat as neutral tone
					# "dui4buqi3" = "dui4bu0qi3"
					$tone = '0';
				}
			}
			$gotVowel = $gotTone = $NO; # reset in preparation for next syllable
		}
		if ($tone ne '') {
			$toneBuf .= $tone; # Represent tones by digits '0' thru '4'
		}
		$prevChar = $c;
	}
	$key .= '.' . $umlautBuf . '.' . $apostBuf . '.' . $toneBuf . '.' . $capBuf . '.' . $homBuf;
	$key = '??' if ($key =~ /\?\?/);
	return $key;
} # ABCPinyinKey

sub PinyinDiacritic
# if input is one of the 4 Unicode diacritics for a tone mark, then
#	 return the tone number, as a char '1','2','3','4'. Else return '' (empty string).
{
	my ($c) = @_;
	my $ord = ord($c);
	if ($ord == 0x0304) {
		return '1';	# tone1 -- macron -- U+0304
	}
	if ($ord == 0x0301) {
		return '2';	# tone2 -- acute -- U+0301
	}
	if ($ord == 0x030c) {
		return '3';	# tone3 -- hacek -- U+030c
	}
	if ($ord == 0x0300) {
		return '4';	# tone4 -- grave -- U+0300
	}
	return '';	# not a diacritic
} # PinyinDiacritic

sub UniPyChar
# Translate a Unicode Pinyin Vowel or 'n' with tonemark
#	 into an ASCII letter, plus a number for the tone.
#	Return a list of two items:
#     (1) ASCII letter, or '' (empty string) if input isn't pinyin.
#     (2) tone: '0' (ü=v0), '1', '2', '3', '4', or ''
{
	my ($c) = @_;

	if (defined($uniPrecomp{$c})) {
		my $p = $uniPrecomp{$c};
		if ($p =~ /(.)(.)/) {
			return ($1, $2);
		}
	}
	return ('', '');
} # UniPyChar

sub ZYXPosition {
	# For compactness, we use letters rather than numbers, to indicate the
	# position where umlaut, apostrophe, or capitalization occurs. The
	# sooner these occur, the later the string should be sorted, so we assign
	# letters in reverse alphabetical order. For example:
	# "fRog" sorts before "Frog", which sorts before "FROG"; the corresponding
	# values for the capitalization key are "y", "z", and "zyxw".
	my ($position) = @_;
	# 0 through 24 use single letter 'z', 'y', ... , 'c', 'b'
	if ($position <= 24) {
		return chr(ord('z') - $position);
	}
	# 25 through 700 use three letters "azz", "azy", ... , "aaa"
	$position -= 25;
	if ($position < 26*26) {
		# int() is required: Perl's "/" is floating-point division
		return 'a' . chr(ord('z') - int($position / 26)) . chr(ord('z') - ($position % 26));
	}
	return '??';
} # ZYXPosition

sub HomNumABC {
	my ($n) = @_;
	if ($n eq '*') {
		return 'a';	# 'a' is for asterisk!
	}
	if ($n == 0) {
		# for homophone number zero (i.e. no homophone number), use 'z'
		return 'z';
	}
	if ($n <= 23) { # 1 thru 23 use single letters 'b' thru 'x'
		return chr(ord('a') + $n);
	}
	$n -= 24;
	if ($n < 26*26) { # for 24 thru 699 use "yaa", "yab", ... , "yzz"
		return 'y' . chr(ord('a') + int($n / 26)) . chr(ord('a') + ($n % 26));
	}
	return '??';
} # HomNumABC
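
The reverse-alphabetical position letters used by ZYXPosition can be tried out in isolation. The following is a simplified sketch that handles only the single-letter range (positions 0 through 24) and only the capitalization part of the key; the helper names position_letter and cap_part are hypothetical.

```perl
use strict;
use warnings;

# Simplified sketch: single-letter codes only (positions 0 through 24).
sub position_letter {
    my ($position) = @_;
    die "position out of range\n" if $position < 0 || $position > 24;
    return chr(ord('z') - $position);   # 0 => 'z', 1 => 'y', ...
}

# Hypothetical helper: capitalization part of a key for an ASCII word.
sub cap_part {
    my ($word) = @_;
    my $part = '';
    my @chars = split //, $word;
    for my $i (0 .. $#chars) {
        $part .= position_letter($i) if $chars[$i] =~ /[A-Z]/;
    }
    return $part;
}

my @words = sort { cap_part($a) cmp cap_part($b) } qw(FROG Frog frog fRog);
print "@words\n";   # frog fRog Frog FROG
```

An earlier capital gets a later letter, so the word capitalized sooner sorts later, and an all-lower-case word (empty part) sorts first.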

Future Direction

I wrote the current implementation without knowledge of the Unicode Collation Algorithm (UCA) or International Components for Unicode (ICU), which includes an implementation of UCA. (Beware, however, of a so-called "pinyin collation" algorithm that operates directly on Hanzi and falsely assumes each Hanzi has only one possible pronunciation!) Ideally UCA might be used to implement the same pinyin ordering, but in a more robust way so that non-pinyin characters could also be included. For example, if the English/French word naïve occurs in an index along with pinyin words, then it should be properly alphabetized between nǎitóuzuǐr and nàiwán. The current Wenlin implementation fails to do anything reasonable with characters (like ï in naïve) that aren't used in standard pinyin.
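
For comparison, here is a minimal sketch using Perl's Unicode::Collate module, which implements UCA. With default settings it places naïve between the two pinyin words, because base letters are compared before accents. It is not a replacement for the key generation above: the default UCA ordering of accents is not the Mandarin tone order, and the juncture and homophone rules are not covered.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
binmode(STDOUT, ':encoding(UTF-8)');

# Default UCA collator; no tailoring for pinyin tone order is attempted here.
my $collator = Unicode::Collate->new();

my @sorted = $collator->sort("nàiwán", "naïve", "nǎitóuzuǐr");
print join("\n", @sorted), "\n";   # nǎitóuzuǐr, naïve, nàiwán (one per line)
```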


Footnote about a difference between Hànyǔ Pīnyīn Cíhuì and the ABC Dictionary

The only difference I'm aware of is that HPC gave hyphens and spaces the same priority as apostrophes, so that lìgōng sorted before lǐ-gōng, in spite of the tones. Usage of hyphens and spaces in pinyin is still far from being fully standardized. (The same is true in English orthography.) Consequently, for collation it makes sense to give less weight to hyphens and spaces, and more weight to tones, thus sorting lǐ-gōng before lìgōng. In ABC, hyphens and spaces don't affect the sort order unless they change the pronunciation in the same way that an apostrophe would; for example, ¹míng-àn 明暗 and ²míng'àn 冥暗 are treated as homophones, and they sort after mǐngǎn 敏感.

