Chau Chee Yang Technical Blog: Windows: String Comparison and Sorting

The most common way to sort strings is by code point, which is culture insensitive. This kind of sorting ignores the ordering conventions of any particular culture, such as radical or stroke order, but it is the fastest.

For example:

Character 'E' has code point 0x45 and character 'a' has code point 0x61. If we compare or sort the characters by code point, 'E' comes before 'a'. This contradicts our expectation that 'a' should always come before 'E'.
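The difference can be checked with a few lines of Delphi. This is a minimal sketch, assuming a Windows build where SysUtils.CompareStr compares ordinal character values and SysUtils.AnsiCompareStr follows the user's locale; the program name is made up for illustration.

```delphi
program CaseOrderDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  // Code points of the two characters, in hexadecimal
  Writeln(IntToHex(Ord('E'), 2), ' ', IntToHex(Ord('a'), 2)); // prints 45 61

  // CompareStr is an ordinal comparison: a negative result means
  // 'E' sorts before 'a'
  Writeln(CompareStr('E', 'a') < 0); // prints TRUE

  // AnsiCompareStr is locale aware on Windows (assumption: a locale that
  // orders letters alphabetically first), so 'a' is expected before 'E'
  Writeln(AnsiCompareStr('E', 'a') > 0);
end.
```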

Another example is Chinese characters, whose sort order depends on phonetics or on the number of pen strokes. Sorting by code point doesn't make much sense for Chinese characters.

The following chart shows some Chinese characters sorted by Unicode code point, which is culture insensitive:

Ideograph	Hanyu Pinyin (Phonetic)	Stroke Count	Unicode Code Point
一	yi	1	0x4E00
丁	ding	2	0x4E01
上	shang	3	0x4E0A
且	qie	5	0x4E14
人	ren	2	0x4EBA
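The code point order in the chart can be reproduced with a small sketch. It assumes a Unicode-enabled Delphi (2009 or later), where string is UTF-16 and CompareStr compares character values. Note that TStringList.Sort is locale aware by default, so the sketch passes an ordinal comparer to CustomSort instead. CompareOrdinal and SortCharsByCodePoint are hypothetical helper names, not library routines.

```delphi
program CodePointSortDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Ordinal comparer for TStringList.CustomSort: CompareStr compares
// character values, not linguistic weights.
function CompareOrdinal(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareStr(List[Index1], List[Index2]);
end;

// Sorts the characters of Chars by code point and returns their code
// points as space-separated hexadecimal values.
function SortCharsByCodePoint(const Chars: string): string;
var
  List: TStringList;
  I: Integer;
begin
  Result := '';
  List := TStringList.Create;
  try
    for I := 1 to Length(Chars) do
      List.Add(Chars[I]);
    List.CustomSort(CompareOrdinal);
    for I := 0 to List.Count - 1 do
    begin
      if I > 0 then
        Result := Result + ' ';
      Result := Result + IntToHex(Ord(List[I][1]), 4);
    end;
  finally
    List.Free;
  end;
end;

begin
  // ren, yi, qie, ding, shang given in arbitrary order
  Writeln(SortCharsByCodePoint(#$4EBA#$4E00#$4E14#$4E01#$4E0A));
  // prints 4E00 4E01 4E0A 4E14 4EBA
end.
```

The characters are written as character codes rather than literals so that the source file encoding does not matter.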

We may use the Windows API function CompareString to perform the comparisons for a sorting operation.

var
  L: DWORD;
  R: Integer;
  Str1, Str2: string;
begin
  ...
  // CompareString returns CSTR_LESS_THAN (1), CSTR_EQUAL (2) or
  // CSTR_GREATER_THAN (3), and 0 on failure

  // For Stroke Count Order
  L := MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_PRC);
  R := CompareString(L, 0, PChar(Str1), Length(Str1), PChar(Str2), Length(Str2));

  // For Phonetic Order
  L := MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_PRCP);
  R := CompareString(L, 0, PChar(Str1), Length(Str1), PChar(Str2), Length(Str2));
  ...

  // For Ordinal Comparison (code point comparison, culture insensitive)
  // StrComp returns a negative, zero or positive value
  R := StrComp(PChar(Str1), PChar(Str2));
end;
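To sort a whole list rather than compare two strings, the CompareString result has to be adapted to the negative/zero/positive convention that most sort routines expect. The following is only a sketch under stated assumptions: a Unicode-enabled Delphi on Windows, and hypothetical names SortLocale, CompareByLocale and SortCharsByLocale introduced purely for illustration. The actual order depends on the sorting tables of the Windows version in use.

```delphi
program LocaleSortDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Classes;

var
  SortLocale: LCID; // hypothetical global read by the comparer below

// Adapts CompareString to the convention used by TStringList.CustomSort.
function CompareByLocale(List: TStringList; Index1, Index2: Integer): Integer;
const
  EqualResult = 2; // value of CSTR_EQUAL
var
  S1, S2: string;
begin
  S1 := List[Index1];
  S2 := List[Index2];
  // 1, 2, 3 becomes -1, 0, 1 (a failed call, 0, becomes -2)
  Result := CompareString(SortLocale, 0, PChar(S1), Length(S1),
    PChar(S2), Length(S2)) - EqualResult;
end;

// Sorts the characters of Chars under Locale and returns their code
// points as space-separated hexadecimal values.
function SortCharsByLocale(const Chars: string; Locale: LCID): string;
var
  List: TStringList;
  I: Integer;
begin
  Result := '';
  SortLocale := Locale;
  List := TStringList.Create;
  try
    for I := 1 to Length(Chars) do
      List.Add(Chars[I]);
    List.CustomSort(CompareByLocale);
    for I := 0 to List.Count - 1 do
    begin
      if I > 0 then
        Result := Result + ' ';
      Result := Result + IntToHex(Ord(List[I][1]), 4);
    end;
  finally
    List.Free;
  end;
end;

const
  Sample = #$4E00#$4E01#$4E0A#$4E14#$4EBA; // yi, ding, shang, qie, ren

begin
  // Stroke count order
  Writeln(SortCharsByLocale(Sample, MAKELCID(
    MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_PRC)));
  // Phonetic order
  Writeln(SortCharsByLocale(Sample, MAKELCID(
    MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_PRCP)));
end.
```

Using a global for the locale keeps the comparer compatible with the plain function pointer that CustomSort takes; it is not thread safe, so a real application may prefer a different design.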

Stroke Count Order:

Ideograph	Hanyu Pinyin (Phonetic)	Stroke Count	Unicode Code Point
一	yi	1	0x4E00
丁	ding	2	0x4E01
人	ren	2	0x4EBA
上	shang	3	0x4E0A
且	qie	5	0x4E14

Phonetic Order:

Ideograph	Hanyu Pinyin (Phonetic)	Stroke Count	Unicode Code Point
丁	ding	2	0x4E01
且	qie	5	0x4E14
人	ren	2	0x4EBA
上	shang	3	0x4E0A
一	yi	1	0x4E00



Wednesday, November 12, 2008
