[SCM] WebKit Debian packaging branch, webkit-1.2, updated. upstream/1.1.90-6072-g9a69373

darin at apple.com darin at apple.com
Thu Apr 8 00:59:53 UTC 2010


The following commit has been merged in the webkit-1.2 branch:
commit 9c2f9e674fbb03ccc81bdfac313466d1e6bd280f
Author: darin at apple.com <darin at apple.com@268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Date:   Mon Jan 11 16:30:49 2010 +0000

    REGRESSION: Japanese text search ignores small vs. large and voicing mark differences
    https://bugs.webkit.org/show_bug.cgi?id=30437
    rdar://problem/7214058
    
    Reviewed by Alexey Proskuryakov.
    
    WebCore:
    
    Test: fast/text/find-kana.html
    
    * editing/TextIterator.cpp:
    (WebCore::isKanaLetter): Added.
    (WebCore::isSmallKanaLetter): Added.
    (WebCore::composedVoicedSoundMark): Added.
    (WebCore::isCombiningVoicedSoundMark): Added.
    (WebCore::containsKanaLetters): Added.
    (WebCore::normalizeCharacters): Added.
    (WebCore::SearchBuffer::SearchBuffer): Initialize the data members
    m_targetRequiresKanaWorkaround and m_normalizedTarget.
    (WebCore::SearchBuffer::isBadMatch): Added. Checks for matches that
    ICU's default collation considers correct, but we consider incorrect.
    (WebCore::SearchBuffer::search): Added code to call isBadMatch and
    move to the next match with usearch_next if the result is true.
    
    LayoutTests:
    
    * fast/text/international/japanese-kana-letters-expected.txt: Removed.
    * fast/text/international/japanese-kana-letters.html: Removed.
    
    * fast/text/find-kana-expected.txt: Added.
    * fast/text/find-kana.html: Added.
    * fast/text/script-tests/find-kana.js: Added.
    This includes all the tests that were in the old test removed above, with three
    differences:
        1) Moved out of "international" directory because Mitz wants to phase that
           directory out.
        2) Added more tests to cover more cases involving things like decomposed
           characters and different voice marks.
        3) Used script-tests, so results list passing tests as well as failing tests.
    We could still test even more, but this should at least cover all the lines of
    code in the current bug fix patch.
    
    
    
    git-svn-id: http://svn.webkit.org/repository/webkit/trunk@53078 268f45cc-cd09-0410-ab3c-d52691b4dbfc
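
The commit message above describes isBadMatch as a filter for matches that ICU's primary-strength collation accepts but that differ in small vs. large kana or in voicing marks. The following is a minimal sketch of that idea, not the code from the patch. The names isSmallHiragana, composedVoicing, combiningVoicing, traitsOf and isBadKanaMatch are hypothetical, and the sketch assumes each string is a single hiragana letter optionally followed by one combining mark. It covers only the small hiragana letters and the GA, BA and PA letters used by the new test.

```cpp
#include <cassert>
#include <string>

enum class Voicing { None, Voiced, SemiVoiced };

// Hypothetical helper: small hiragana letters only (a subset of the
// code points the patch lists in isSmallKanaLetter).
static bool isSmallHiragana(char16_t c)
{
    switch (c) {
    case 0x3041: case 0x3043: case 0x3045: case 0x3047: case 0x3049:
    case 0x3063: case 0x3083: case 0x3085: case 0x3087: case 0x308E:
    case 0x3095: case 0x3096:
        return true;
    }
    return false;
}

// Hypothetical helper: voicing carried by a precomposed letter.
// Only the three letters exercised by find-kana.js are handled here.
static Voicing composedVoicing(char16_t c)
{
    switch (c) {
    case 0x304C: // HIRAGANA LETTER GA
    case 0x3070: // HIRAGANA LETTER BA
        return Voicing::Voiced;
    case 0x3071: // HIRAGANA LETTER PA
        return Voicing::SemiVoiced;
    }
    return Voicing::None;
}

// Hypothetical helper: voicing carried by a combining mark.
static Voicing combiningVoicing(char16_t c)
{
    if (c == 0x3099) // COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
        return Voicing::Voiced;
    if (c == 0x309A) // COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
        return Voicing::SemiVoiced;
    return Voicing::None;
}

struct KanaTraits {
    bool small;
    Voicing voicing;
};

// Describes the letter at s[0], taking a following combining mark into account.
static KanaTraits traitsOf(const std::u16string& s)
{
    assert(!s.empty());
    KanaTraits traits { isSmallHiragana(s[0]), composedVoicing(s[0]) };
    if (s.size() > 1 && combiningVoicing(s[1]) != Voicing::None)
        traits.voicing = combiningVoicing(s[1]);
    return traits;
}

// Returns true when a collation-level match should be rejected because the
// two strings differ in smallness or in voicing.
static bool isBadKanaMatch(const std::u16string& target, const std::u16string& match)
{
    KanaTraits a = traitsOf(target);
    KanaTraits b = traitsOf(match);
    return a.small != b.small || a.voicing != b.voicing;
}
```

In this sketch a search loop would call isBadKanaMatch on each candidate and advance to the next candidate whenever it returns true, which mirrors the usearch_next step the commit message describes for SearchBuffer::search.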

diff --git a/LayoutTests/ChangeLog b/LayoutTests/ChangeLog
index d69735f..69d303e 100644
--- a/LayoutTests/ChangeLog
+++ b/LayoutTests/ChangeLog
@@ -1,3 +1,27 @@
+2010-01-10  Darin Adler  <darin at apple.com>
+
+        Reviewed by Alexey Proskuryakov.
+
+        REGRESSION: Japanese text search ignores small vs. large and voicing mark differences
+        https://bugs.webkit.org/show_bug.cgi?id=30437
+        rdar://problem/7214058
+
+        * fast/text/international/japanese-kana-letters-expected.txt: Removed.
+        * fast/text/international/japanese-kana-letters.html: Removed.
+
+        * fast/text/find-kana-expected.txt: Added.
+        * fast/text/find-kana.html: Added.
+        * fast/text/script-tests/find-kana.js: Added.
+        This includes all the tests that were in the old test removed above, with three
+        differences:
+            1) Moved out of "international" directory because Mitz wants to phase that
+               directory out.
+            2) Added more tests to cover more cases involving things like decomposed
+               characters and different voice marks.
+            3) Used script-tests, so results list passing tests as well as failing tests.
+        We could still test even more, but this should at least cover all the lines of
+        code in the current bug fix patch.
+
 2010-01-11  Andras Becsi  <abecsi at inf.u-szeged.hu>
 
         Rubber-stamped by Holger Hans Peter Freyther.
diff --git a/LayoutTests/fast/text/find-kana-expected.txt b/LayoutTests/fast/text/find-kana-expected.txt
new file mode 100644
index 0000000..8bee3bf
--- /dev/null
+++ b/LayoutTests/fast/text/find-kana-expected.txt
@@ -0,0 +1,105 @@
+Tests find for strings with kana letters in them.
+
+On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE".
+
+
+Exact matches first as a baseline
+
+PASS canFind(decomposedHiraganaLetterGa, decomposedHiraganaLetterGa) is true
+PASS canFind(decomposedKatakanaLetterGa, decomposedKatakanaLetterGa) is true
+PASS canFind(decomposedLatinCapitalLetterAWithGrave, decomposedLatinCapitalLetterAWithGrave) is true
+PASS canFind(halfwidthKatakanaLetterA, halfwidthKatakanaLetterA) is true
+PASS canFind(halfwidthKatakanaLetterSmallA, halfwidthKatakanaLetterSmallA) is true
+PASS canFind(hiraganaLetterA, hiraganaLetterA) is true
+PASS canFind(hiraganaLetterA, hiraganaLetterA) is true
+PASS canFind(hiraganaLetterBa, hiraganaLetterBa) is true
+PASS canFind(hiraganaLetterGa, hiraganaLetterGa) is true
+PASS canFind(hiraganaLetterHa, hiraganaLetterHa) is true
+PASS canFind(hiraganaLetterKa, hiraganaLetterKa) is true
+PASS canFind(hiraganaLetterPa, hiraganaLetterPa) is true
+PASS canFind(katakanaLetterA, katakanaLetterA) is true
+PASS canFind(katakanaLetterSmallA, katakanaLetterSmallA) is true
+PASS canFind(latinCapitalLetterAWithGrave, latinCapitalLetterAWithGrave) is true
+
+Hiragana, katakana, and half width katakana: Must be treated as equal
+
+PASS canFind(decomposedHiraganaLetterGa, decomposedKatakanaLetterGa) is true
+PASS canFind(decomposedKatakanaLetterGa, decomposedHiraganaLetterGa) is true
+PASS canFind(hiraganaLetterA, halfwidthKatakanaLetterA) is true
+PASS canFind(hiraganaLetterA, katakanaLetterA) is true
+PASS canFind(katakanaLetterSmallA, hiraganaLetterSmallA) is true
+
+Composed and decomposed forms: Must be treated as equal
+
+PASS canFind(decomposedHiraganaLetterBa, hiraganaLetterBa) is true
+PASS canFind(decomposedHiraganaLetterGa, decomposedKatakanaLetterGa) is true
+PASS canFind(decomposedHiraganaLetterGa, hiraganaLetterGa) is true
+PASS canFind(decomposedHiraganaLetterGa, katakanaLetterGa) is true
+PASS canFind(decomposedHiraganaLetterPa, hiraganaLetterPa) is true
+PASS canFind(decomposedKatakanaLetterGa, decomposedHiraganaLetterGa) is true
+PASS canFind(decomposedLatinCapitalLetterAWithGrave, latinCapitalLetterAWithGrave) is true
+PASS canFind(hiraganaLetterBa, decomposedHiraganaLetterBa) is true
+PASS canFind(hiraganaLetterGa, decomposedHiraganaLetterGa) is true
+PASS canFind(hiraganaLetterPa, decomposedHiraganaLetterPa) is true
+PASS canFind(katakanaLetterGa, decomposedHiraganaLetterGa) is true
+PASS canFind(latinCapitalLetterAWithGrave, decomposedLatinCapitalLetterAWithGrave) is true
+
+Small and non-small kana letters: Must *not* be treated as equal
+
+PASS canFind(halfwidthKatakanaLetterA, hiraganaLetterSmallA) is false
+PASS canFind(halfwidthKatakanaLetterSmallA, halfwidthKatakanaLetterA) is false
+PASS canFind(hiraganaLetterA, hiraganaLetterSmallA) is false
+PASS canFind(hiraganaLetterSmallA, katakanaLetterA) is false
+PASS canFind(katakanaLetterA, halfwidthKatakanaLetterSmallA) is false
+PASS canFind(katakanaLetterSmallA, katakanaLetterA) is false
+
+Kana letters where the only difference is in voiced sound marks: Must *not* be treated as equal
+
+PASS canFind(decomposedHiraganaLetterBa, hiraganaLetterHa) is false
+PASS canFind(decomposedHiraganaLetterBa, hiraganaLetterPa) is false
+PASS canFind(decomposedHiraganaLetterGa, halfwidthKatakanaLetterKa) is false
+PASS canFind(decomposedHiraganaLetterGa, hiraganaLetterKa) is false
+PASS canFind(decomposedHiraganaLetterGa, hiraganaLetterKa) is false
+PASS canFind(decomposedHiraganaLetterPa, hiraganaLetterBa) is false
+PASS canFind(decomposedHiraganaLetterPa, hiraganaLetterHa) is false
+PASS canFind(halfwidthKatakanaLetterKa, decomposedHiraganaLetterGa) is false
+PASS canFind(hiraganaLetterBa, decomposedHiraganaLetterPa) is false
+PASS canFind(hiraganaLetterBa, hiraganaLetterHa) is false
+PASS canFind(hiraganaLetterBa, hiraganaLetterPa) is false
+PASS canFind(hiraganaLetterGa, hiraganaLetterKa) is false
+PASS canFind(hiraganaLetterHa, decomposedHiraganaLetterBa) is false
+PASS canFind(hiraganaLetterHa, decomposedHiraganaLetterPa) is false
+PASS canFind(hiraganaLetterHa, hiraganaLetterBa) is false
+PASS canFind(hiraganaLetterHa, hiraganaLetterPa) is false
+PASS canFind(hiraganaLetterKa, decomposedHiraganaLetterGa) is false
+PASS canFind(hiraganaLetterKa, decomposedHiraganaLetterGa) is false
+PASS canFind(hiraganaLetterKa, hiraganaLetterGa) is false
+PASS canFind(hiraganaLetterPa, decomposedHiraganaLetterBa) is false
+PASS canFind(hiraganaLetterPa, hiraganaLetterBa) is false
+PASS canFind(hiraganaLetterPa, hiraganaLetterHa) is false
+
+Composed/decomposed form differences before kana characters must have no effect
+
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterA, latinCapitalLetterAWithGrave + hiraganaLetterSmallA) is false
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA, latinCapitalLetterAWithGrave + halfwidthKatakanaLetterA) is false
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterA, latinCapitalLetterAWithGrave + hiraganaLetterSmallA) is false
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa, latinCapitalLetterAWithGrave + hiraganaLetterGa) is true
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa, latinCapitalLetterAWithGrave + hiraganaLetterKa) is false
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterKa, latinCapitalLetterAWithGrave + hiraganaLetterGa) is false
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterSmallA, latinCapitalLetterAWithGrave + katakanaLetterA) is false
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + katakanaLetterA, latinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA) is false
+PASS canFind(decomposedLatinCapitalLetterAWithGrave + katakanaLetterSmallA, latinCapitalLetterAWithGrave + katakanaLetterA) is false
+PASS canFind(latinCapitalLetterAWithGrave + halfwidthKatakanaLetterA, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterSmallA) is false
+PASS canFind(latinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA, decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterA) is false
+PASS canFind(latinCapitalLetterAWithGrave + hiraganaLetterA, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterSmallA) is false
+PASS canFind(latinCapitalLetterAWithGrave + hiraganaLetterGa, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa) is true
+PASS canFind(latinCapitalLetterAWithGrave + hiraganaLetterGa, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterKa) is false
+PASS canFind(latinCapitalLetterAWithGrave + hiraganaLetterKa, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa) is false
+PASS canFind(latinCapitalLetterAWithGrave + hiraganaLetterSmallA, decomposedLatinCapitalLetterAWithGrave + katakanaLetterA) is false
+PASS canFind(latinCapitalLetterAWithGrave + katakanaLetterA, decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA) is false
+PASS canFind(latinCapitalLetterAWithGrave + katakanaLetterSmallA, decomposedLatinCapitalLetterAWithGrave + katakanaLetterA) is false
+
+PASS successfullyParsed is true
+
+TEST COMPLETE
+
diff --git a/LayoutTests/fast/text/find-kana.html b/LayoutTests/fast/text/find-kana.html
new file mode 100644
index 0000000..5a3b8b8
--- /dev/null
+++ b/LayoutTests/fast/text/find-kana.html
@@ -0,0 +1,13 @@
+<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
+<html>
+<head>
+<link rel="stylesheet" href="../js/resources/js-test-style.css">
+<script src="../js/resources/js-test-pre.js"></script>
+</head>
+<body>
+<p id="description"></p>
+<div id="console"></div>
+<script src="script-tests/find-kana.js"></script>
+<script src="../js/resources/js-test-post.js"></script>
+</body>
+</html>
diff --git a/LayoutTests/fast/text/international/japanese-kana-letters-expected.txt b/LayoutTests/fast/text/international/japanese-kana-letters-expected.txt
deleted file mode 100644
index 95d4f0d..0000000
--- a/LayoutTests/fast/text/international/japanese-kana-letters-expected.txt
+++ /dev/null
@@ -1 +0,0 @@
-FAILURE: Found small hiragana A when searching for hiragana A. Found small katakana A when searching for katakana A. Found halfwidth small katakana A when searching for halfwidth katakana A. Found small hiragana A when searching for katakana A. Found katakana A when searching for halfwidth small katakana A. Found halfwidth katakana A when searching for small hiragana A. Found hiragana Ka when searching for hiragana Ga.
diff --git a/LayoutTests/fast/text/international/japanese-kana-letters.html b/LayoutTests/fast/text/international/japanese-kana-letters.html
deleted file mode 100644
index 1f0f39f..0000000
--- a/LayoutTests/fast/text/international/japanese-kana-letters.html
+++ /dev/null
@@ -1,115 +0,0 @@
-<html>
-<head>
-    <script>
-        function canFind(target, specimen)
-        {
-            getSelection().empty();
-            document.body.innerHTML = specimen;
-            document.execCommand("FindString", false, target);
-            var result = getSelection().rangeCount != 0;
-            getSelection().empty();
-            return result;
-        }
-
-        function runTests()
-        {
-            if (window.layoutTestController)
-                layoutTestController.dumpAsText();
-
-            var smallHiraganaA = String.fromCharCode(0x3041);
-            var hiraganaA = String.fromCharCode(0x3042);
-            var smallKatakanaA = String.fromCharCode(0x30a1);
-            var katakanaA = String.fromCharCode(0x30a2);
-            var halfwidthSmallKatakanaA = String.fromCharCode(0xff67);
-            var halfwidthKatakanaA = String.fromCharCode(0xff71);
-            var hiraganaKa = String.fromCharCode(0x304b);
-            var hiraganaGa = String.fromCharCode(0x304c);
-
-            var success = true;
-
-            var message = "FAILURE:";
-
-            if (!canFind(smallHiraganaA, smallHiraganaA)) {
-                success = false;
-                message += " Cannot find small hiragana A when searching for small hiragana A.";
-            }
-
-            if (!canFind(hiraganaA, hiraganaA)) {
-                success = false;
-                message += " Cannot find hiragana A when searching for hiragana A.";
-            }
-
-            if (!canFind(smallKatakanaA, smallKatakanaA)) {
-                success = false;
-                message += " Cannot find small katakana A when searching for small katakana A.";
-            }
-
-            if (!canFind(katakanaA, katakanaA)) {
-                success = false;
-                message += " Cannot find katakana A when searching for katakana A.";
-            }
-
-            if (!canFind(halfwidthSmallKatakanaA, halfwidthSmallKatakanaA)) {
-                success = false;
-                message += " Cannot find halfwidth small katakana A when searching for halfwidth small katakana A.";
-            }
-
-            if (!canFind(halfwidthKatakanaA, halfwidthKatakanaA)) {
-                success = false;
-                message += " Cannot find halfwidth katakana A when searching for halfwidth katakana A.";
-            }
-
-            if (!canFind(smallHiraganaA, smallKatakanaA)) {
-                success = false;
-                message += " Cannot find small katakana A when searching for small hiragana A.";
-            }
-
-            if (!canFind(hiraganaA, halfwidthKatakanaA)) {
-                success = false;
-                message += " Cannot find halfwidth katakana A when searching for hiragana A.";
-            }
-
-            if (canFind(smallHiraganaA, hiraganaA)) {
-                success = false;
-                message += " Found small hiragana A when searching for hiragana A.";
-            }
-
-            if (canFind(smallKatakanaA, katakanaA)) {
-                success = false;
-                message += " Found small katakana A when searching for katakana A.";
-            }
-
-            if (canFind(halfwidthSmallKatakanaA, halfwidthKatakanaA)) {
-                success = false;
-                message += " Found halfwidth small katakana A when searching for halfwidth katakana A.";
-            }
-
-            if (canFind(smallHiraganaA, katakanaA)) {
-                success = false;
-                message += " Found small hiragana A when searching for katakana A.";
-            }
-
-            if (canFind(katakanaA, halfwidthSmallKatakanaA)) {
-                success = false;
-                message += " Found katakana A when searching for halfwidth small katakana A.";
-            }
-
-            if (canFind(halfwidthKatakanaA, smallHiraganaA)) {
-                success = false;
-                message += " Found halfwidth katakana A when searching for small hiragana A.";
-            }
-
-            if (canFind(hiraganaKa, hiraganaGa)) {
-                success = false;
-                message += " Found hiragana Ka when searching for hiragana Ga.";
-            }
-
-            if (success)
-                message = "SUCCESS: Found hiragana and katakana correctly.";
-
-            document.body.innerHTML = message;
-        }
-    </script>
-</head>
-<body onload="runTests()"></body>
-</html>
diff --git a/LayoutTests/fast/text/script-tests/find-kana.js b/LayoutTests/fast/text/script-tests/find-kana.js
new file mode 100644
index 0000000..1bc9913
--- /dev/null
+++ b/LayoutTests/fast/text/script-tests/find-kana.js
@@ -0,0 +1,149 @@
+description("Tests find for strings with kana letters in them.");
+
+function canFind(target, specimen)
+{
+    getSelection().empty();
+    var textNode = document.createTextNode(specimen);
+    document.body.appendChild(textNode);
+    document.execCommand("FindString", false, target);
+    var result = getSelection().rangeCount != 0;
+    getSelection().empty();
+    document.body.removeChild(textNode);
+    return result;
+}
+
+var combiningGraveAccent = String.fromCharCode(0x0300);
+var combiningKatakanaHiraganaSemiVoicedSoundMark = String.fromCharCode(0x309A);
+var combiningKatakanaHiraganaVoicedSoundMark = String.fromCharCode(0x3099);
+var halfwidthKatakanaLetterA = String.fromCharCode(0xFF71);
+var halfwidthKatakanaLetterKa = String.fromCharCode(0xFF76);
+var halfwidthKatakanaLetterSmallA = String.fromCharCode(0xFF67);
+var hiraganaLetterA = String.fromCharCode(0x3042);
+var hiraganaLetterBa = String.fromCharCode(0x3070);
+var hiraganaLetterGa = String.fromCharCode(0x304C);
+var hiraganaLetterHa = String.fromCharCode(0x306F);
+var hiraganaLetterKa = String.fromCharCode(0x304B);
+var hiraganaLetterPa = String.fromCharCode(0x3071);
+var hiraganaLetterSmallA = String.fromCharCode(0x3041);
+var katakanaLetterA = String.fromCharCode(0x30A2);
+var katakanaLetterGa = String.fromCharCode(0x30AC);
+var katakanaLetterKa = String.fromCharCode(0x30AB);
+var katakanaLetterSmallA = String.fromCharCode(0x30A1);
+var latinCapitalLetterAWithGrave = String.fromCharCode(0x00C0);
+
+var decomposedHiraganaLetterBa = hiraganaLetterHa + combiningKatakanaHiraganaVoicedSoundMark;
+var decomposedHiraganaLetterGa = hiraganaLetterKa + combiningKatakanaHiraganaVoicedSoundMark;
+var decomposedHiraganaLetterPa = hiraganaLetterHa + combiningKatakanaHiraganaSemiVoicedSoundMark;
+var decomposedKatakanaLetterGa = katakanaLetterKa + combiningKatakanaHiraganaVoicedSoundMark;
+var decomposedLatinCapitalLetterAWithGrave = 'A' + combiningGraveAccent;
+
+debug('Exact matches first as a baseline');
+debug('');
+
+shouldBe('canFind(decomposedHiraganaLetterGa, decomposedHiraganaLetterGa)', 'true');
+shouldBe('canFind(decomposedKatakanaLetterGa, decomposedKatakanaLetterGa)', 'true');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave, decomposedLatinCapitalLetterAWithGrave)', 'true');
+shouldBe('canFind(halfwidthKatakanaLetterA, halfwidthKatakanaLetterA)', 'true');
+shouldBe('canFind(halfwidthKatakanaLetterSmallA, halfwidthKatakanaLetterSmallA)', 'true');
+shouldBe('canFind(hiraganaLetterA, hiraganaLetterA)', 'true');
+shouldBe('canFind(hiraganaLetterA, hiraganaLetterA)', 'true');
+shouldBe('canFind(hiraganaLetterBa, hiraganaLetterBa)', 'true');
+shouldBe('canFind(hiraganaLetterGa, hiraganaLetterGa)', 'true');
+shouldBe('canFind(hiraganaLetterHa, hiraganaLetterHa)', 'true');
+shouldBe('canFind(hiraganaLetterKa, hiraganaLetterKa)', 'true');
+shouldBe('canFind(hiraganaLetterPa, hiraganaLetterPa)', 'true');
+shouldBe('canFind(katakanaLetterA, katakanaLetterA)', 'true');
+shouldBe('canFind(katakanaLetterSmallA, katakanaLetterSmallA)', 'true');
+shouldBe('canFind(latinCapitalLetterAWithGrave, latinCapitalLetterAWithGrave)', 'true');
+
+debug('');
+debug('Hiragana, katakana, and half width katakana: Must be treated as equal');
+debug('');
+
+shouldBe('canFind(decomposedHiraganaLetterGa, decomposedKatakanaLetterGa)', 'true');
+shouldBe('canFind(decomposedKatakanaLetterGa, decomposedHiraganaLetterGa)', 'true');
+shouldBe('canFind(hiraganaLetterA, halfwidthKatakanaLetterA)', 'true');
+shouldBe('canFind(hiraganaLetterA, katakanaLetterA)', 'true');
+shouldBe('canFind(katakanaLetterSmallA, hiraganaLetterSmallA)', 'true');
+
+debug('');
+debug('Composed and decomposed forms: Must be treated as equal');
+debug('');
+
+shouldBe('canFind(decomposedHiraganaLetterBa, hiraganaLetterBa)', 'true');
+shouldBe('canFind(decomposedHiraganaLetterGa, decomposedKatakanaLetterGa)', 'true');
+shouldBe('canFind(decomposedHiraganaLetterGa, hiraganaLetterGa)', 'true');
+shouldBe('canFind(decomposedHiraganaLetterGa, katakanaLetterGa)', 'true');
+shouldBe('canFind(decomposedHiraganaLetterPa, hiraganaLetterPa)', 'true');
+shouldBe('canFind(decomposedKatakanaLetterGa, decomposedHiraganaLetterGa)', 'true');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave, latinCapitalLetterAWithGrave)', 'true');
+shouldBe('canFind(hiraganaLetterBa, decomposedHiraganaLetterBa)', 'true');
+shouldBe('canFind(hiraganaLetterGa, decomposedHiraganaLetterGa)', 'true');
+shouldBe('canFind(hiraganaLetterPa, decomposedHiraganaLetterPa)', 'true');
+shouldBe('canFind(katakanaLetterGa, decomposedHiraganaLetterGa)', 'true');
+shouldBe('canFind(latinCapitalLetterAWithGrave, decomposedLatinCapitalLetterAWithGrave)', 'true');
+
+debug('');
+debug('Small and non-small kana letters: Must *not* be treated as equal');
+debug('');
+
+shouldBe('canFind(halfwidthKatakanaLetterA, hiraganaLetterSmallA)', 'false');
+shouldBe('canFind(halfwidthKatakanaLetterSmallA, halfwidthKatakanaLetterA)', 'false');
+shouldBe('canFind(hiraganaLetterA, hiraganaLetterSmallA)', 'false');
+shouldBe('canFind(hiraganaLetterSmallA, katakanaLetterA)', 'false');
+shouldBe('canFind(katakanaLetterA, halfwidthKatakanaLetterSmallA)', 'false');
+shouldBe('canFind(katakanaLetterSmallA, katakanaLetterA)', 'false');
+
+debug('');
+debug('Kana letters where the only difference is in voiced sound marks: Must *not* be treated as equal');
+debug('');
+
+shouldBe('canFind(decomposedHiraganaLetterBa, hiraganaLetterHa)', 'false');
+shouldBe('canFind(decomposedHiraganaLetterBa, hiraganaLetterPa)', 'false');
+shouldBe('canFind(decomposedHiraganaLetterGa, halfwidthKatakanaLetterKa)', 'false');
+shouldBe('canFind(decomposedHiraganaLetterGa, hiraganaLetterKa)', 'false');
+shouldBe('canFind(decomposedHiraganaLetterGa, hiraganaLetterKa)', 'false');
+shouldBe('canFind(decomposedHiraganaLetterPa, hiraganaLetterBa)', 'false');
+shouldBe('canFind(decomposedHiraganaLetterPa, hiraganaLetterHa)', 'false');
+shouldBe('canFind(halfwidthKatakanaLetterKa, decomposedHiraganaLetterGa)', 'false');
+shouldBe('canFind(hiraganaLetterBa, decomposedHiraganaLetterPa)', 'false');
+shouldBe('canFind(hiraganaLetterBa, hiraganaLetterHa)', 'false');
+shouldBe('canFind(hiraganaLetterBa, hiraganaLetterPa)', 'false');
+shouldBe('canFind(hiraganaLetterGa, hiraganaLetterKa)', 'false');
+shouldBe('canFind(hiraganaLetterHa, decomposedHiraganaLetterBa)', 'false');
+shouldBe('canFind(hiraganaLetterHa, decomposedHiraganaLetterPa)', 'false');
+shouldBe('canFind(hiraganaLetterHa, hiraganaLetterBa)', 'false');
+shouldBe('canFind(hiraganaLetterHa, hiraganaLetterPa)', 'false');
+shouldBe('canFind(hiraganaLetterKa, decomposedHiraganaLetterGa)', 'false');
+shouldBe('canFind(hiraganaLetterKa, decomposedHiraganaLetterGa)', 'false');
+shouldBe('canFind(hiraganaLetterKa, hiraganaLetterGa)', 'false');
+shouldBe('canFind(hiraganaLetterPa, decomposedHiraganaLetterBa)', 'false');
+shouldBe('canFind(hiraganaLetterPa, hiraganaLetterBa)', 'false');
+shouldBe('canFind(hiraganaLetterPa, hiraganaLetterHa)', 'false');
+
+debug('');
+debug('Composed/decomposed form differences before kana characters must have no effect');
+debug('');
+
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterA, latinCapitalLetterAWithGrave + hiraganaLetterSmallA)', 'false');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA, latinCapitalLetterAWithGrave + halfwidthKatakanaLetterA)', 'false');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterA, latinCapitalLetterAWithGrave + hiraganaLetterSmallA)', 'false');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa, latinCapitalLetterAWithGrave + hiraganaLetterGa)', 'true');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa, latinCapitalLetterAWithGrave + hiraganaLetterKa)', 'false');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterKa, latinCapitalLetterAWithGrave + hiraganaLetterGa)', 'false');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + hiraganaLetterSmallA, latinCapitalLetterAWithGrave + katakanaLetterA)', 'false');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + katakanaLetterA, latinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA)', 'false');
+shouldBe('canFind(decomposedLatinCapitalLetterAWithGrave + katakanaLetterSmallA, latinCapitalLetterAWithGrave + katakanaLetterA)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + halfwidthKatakanaLetterA, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterSmallA)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA, decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterA)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + hiraganaLetterA, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterSmallA)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + hiraganaLetterGa, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa)', 'true');
+shouldBe('canFind(latinCapitalLetterAWithGrave + hiraganaLetterGa, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterKa)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + hiraganaLetterKa, decomposedLatinCapitalLetterAWithGrave + hiraganaLetterGa)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + hiraganaLetterSmallA, decomposedLatinCapitalLetterAWithGrave + katakanaLetterA)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + katakanaLetterA, decomposedLatinCapitalLetterAWithGrave + halfwidthKatakanaLetterSmallA)', 'false');
+shouldBe('canFind(latinCapitalLetterAWithGrave + katakanaLetterSmallA, decomposedLatinCapitalLetterAWithGrave + katakanaLetterA)', 'false');
+
+debug('');
+
+var successfullyParsed = true;
diff --git a/WebCore/ChangeLog b/WebCore/ChangeLog
index d5560d0..c537527 100644
--- a/WebCore/ChangeLog
+++ b/WebCore/ChangeLog
@@ -1,3 +1,27 @@
+2010-01-10  Darin Adler  <darin at apple.com>
+
+        Reviewed by Alexey Proskuryakov.
+
+        REGRESSION: Japanese text search ignores small vs. large and voicing mark differences
+        https://bugs.webkit.org/show_bug.cgi?id=30437
+        rdar://problem/7214058
+
+        Test: fast/text/find-kana.html
+
+        * editing/TextIterator.cpp:
+        (WebCore::isKanaLetter): Added.
+        (WebCore::isSmallKanaLetter): Added.
+        (WebCore::composedVoicedSoundMark): Added.
+        (WebCore::isCombiningVoicedSoundMark): Added.
+        (WebCore::containsKanaLetters): Added.
+        (WebCore::normalizeCharacters): Added.
+        (WebCore::SearchBuffer::SearchBuffer): Initialize the data members
+        m_targetRequiresKanaWorkaround and m_normalizedTarget.
+        (WebCore::SearchBuffer::isBadMatch): Added. Checks for matches that
+        ICU's default collation considers correct, but we consider incorrect.
+        (WebCore::SearchBuffer::search): Added code to call isBadMatch and
+        move to the next match with usearch_next if the result is true.
+
 2010-01-11  Joanmarie Diggs  <joanmarie.diggs at gmail.com>
 
         Reviewed by Xan Lopez.
diff --git a/WebCore/editing/TextIterator.cpp b/WebCore/editing/TextIterator.cpp
index 1a75fcf..b7beefb 100644
--- a/WebCore/editing/TextIterator.cpp
+++ b/WebCore/editing/TextIterator.cpp
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2004, 2005, 2006, 2007, 2008, 2009 Apple Inc. All rights reserved.
+ * Copyright (C) 2004, 2005, 2006, 2007, 2008, 2009, 2010 Apple Inc. All rights reserved.
  * Copyright (C) 2005 Alexey Proskuryakov.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -73,11 +73,17 @@ public:
 #if USE(ICU_UNICODE) && !UCONFIG_NO_COLLATION
 
 private:
+    bool isBadMatch(const UChar*, size_t length) const;
+
     String m_target;
     Vector<UChar> m_buffer;
     size_t m_overlap;
     bool m_atBreak;
 
+    bool m_targetRequiresKanaWorkaround;
+    Vector<UChar> m_normalizedTarget;
+    mutable Vector<UChar> m_normalizedMatch;
+
 #else
 
 private:
@@ -1489,9 +1495,213 @@ static inline void unlockSearcher()
 #endif
 }
 
+// ICU's search ignores the distinction between small kana letters and ones
+// that are not small, and also characters that differ only in the voicing
+// marks when considering only primary collation strength differences.
+// This is not helpful for end users, since these differences make words
+// distinct, so for our purposes we need these to be considered.
+// The Unicode folks do not think the collation algorithm should be
+// changed. To work around this, we would like to tailor the ICU searcher,
+// but we can't get that to work yet. So instead, we check for cases where
+// these differences occur, and skip those matches.
+
+// We refer to the above technique as the "kana workaround". The next few
+// functions are helper functions for the kana workaround.
+
+static inline bool isKanaLetter(UChar character)
+{
+    // Hiragana letters.
+    if (character >= 0x3041 && character <= 0x3096)
+        return true;
+
+    // Katakana letters.
+    if (character >= 0x30A1 && character <= 0x30FA)
+        return true;
+    if (character >= 0x31F0 && character <= 0x31FF)
+        return true;
+
+    // Halfwidth katakana letters.
+    if (character >= 0xFF66 && character <= 0xFF9D && character != 0xFF70)
+        return true;
+
+    return false;
+}
+
+static inline bool isSmallKanaLetter(UChar character)
+{
+    ASSERT(isKanaLetter(character));
+
+    switch (character) {
+    case 0x3041: // HIRAGANA LETTER SMALL A
+    case 0x3043: // HIRAGANA LETTER SMALL I
+    case 0x3045: // HIRAGANA LETTER SMALL U
+    case 0x3047: // HIRAGANA LETTER SMALL E
+    case 0x3049: // HIRAGANA LETTER SMALL O
+    case 0x3063: // HIRAGANA LETTER SMALL TU
+    case 0x3083: // HIRAGANA LETTER SMALL YA
+    case 0x3085: // HIRAGANA LETTER SMALL YU
+    case 0x3087: // HIRAGANA LETTER SMALL YO
+    case 0x308E: // HIRAGANA LETTER SMALL WA
+    case 0x3095: // HIRAGANA LETTER SMALL KA
+    case 0x3096: // HIRAGANA LETTER SMALL KE
+    case 0x30A1: // KATAKANA LETTER SMALL A
+    case 0x30A3: // KATAKANA LETTER SMALL I
+    case 0x30A5: // KATAKANA LETTER SMALL U
+    case 0x30A7: // KATAKANA LETTER SMALL E
+    case 0x30A9: // KATAKANA LETTER SMALL O
+    case 0x30C3: // KATAKANA LETTER SMALL TU
+    case 0x30E3: // KATAKANA LETTER SMALL YA
+    case 0x30E5: // KATAKANA LETTER SMALL YU
+    case 0x30E7: // KATAKANA LETTER SMALL YO
+    case 0x30EE: // KATAKANA LETTER SMALL WA
+    case 0x30F5: // KATAKANA LETTER SMALL KA
+    case 0x30F6: // KATAKANA LETTER SMALL KE
+    case 0x31F0: // KATAKANA LETTER SMALL KU
+    case 0x31F1: // KATAKANA LETTER SMALL SI
+    case 0x31F2: // KATAKANA LETTER SMALL SU
+    case 0x31F3: // KATAKANA LETTER SMALL TO
+    case 0x31F4: // KATAKANA LETTER SMALL NU
+    case 0x31F5: // KATAKANA LETTER SMALL HA
+    case 0x31F6: // KATAKANA LETTER SMALL HI
+    case 0x31F7: // KATAKANA LETTER SMALL HU
+    case 0x31F8: // KATAKANA LETTER SMALL HE
+    case 0x31F9: // KATAKANA LETTER SMALL HO
+    case 0x31FA: // KATAKANA LETTER SMALL MU
+    case 0x31FB: // KATAKANA LETTER SMALL RA
+    case 0x31FC: // KATAKANA LETTER SMALL RI
+    case 0x31FD: // KATAKANA LETTER SMALL RU
+    case 0x31FE: // KATAKANA LETTER SMALL RE
+    case 0x31FF: // KATAKANA LETTER SMALL RO
+    case 0xFF67: // HALFWIDTH KATAKANA LETTER SMALL A
+    case 0xFF68: // HALFWIDTH KATAKANA LETTER SMALL I
+    case 0xFF69: // HALFWIDTH KATAKANA LETTER SMALL U
+    case 0xFF6A: // HALFWIDTH KATAKANA LETTER SMALL E
+    case 0xFF6B: // HALFWIDTH KATAKANA LETTER SMALL O
+    case 0xFF6C: // HALFWIDTH KATAKANA LETTER SMALL YA
+    case 0xFF6D: // HALFWIDTH KATAKANA LETTER SMALL YU
+    case 0xFF6E: // HALFWIDTH KATAKANA LETTER SMALL YO
+    case 0xFF6F: // HALFWIDTH KATAKANA LETTER SMALL TU
+        return true;
+    }
+    return false;
+}
+
+enum VoicedSoundMarkType { NoVoicedSoundMark, VoicedSoundMark, SemiVoicedSoundMark };
+
+static inline VoicedSoundMarkType composedVoicedSoundMark(UChar character)
+{
+    ASSERT(isKanaLetter(character));
+
+    switch (character) {
+    case 0x304C: // HIRAGANA LETTER GA
+    case 0x304E: // HIRAGANA LETTER GI
+    case 0x3050: // HIRAGANA LETTER GU
+    case 0x3052: // HIRAGANA LETTER GE
+    case 0x3054: // HIRAGANA LETTER GO
+    case 0x3056: // HIRAGANA LETTER ZA
+    case 0x3058: // HIRAGANA LETTER ZI
+    case 0x305A: // HIRAGANA LETTER ZU
+    case 0x305C: // HIRAGANA LETTER ZE
+    case 0x305E: // HIRAGANA LETTER ZO
+    case 0x3060: // HIRAGANA LETTER DA
+    case 0x3062: // HIRAGANA LETTER DI
+    case 0x3065: // HIRAGANA LETTER DU
+    case 0x3067: // HIRAGANA LETTER DE
+    case 0x3069: // HIRAGANA LETTER DO
+    case 0x3070: // HIRAGANA LETTER BA
+    case 0x3073: // HIRAGANA LETTER BI
+    case 0x3076: // HIRAGANA LETTER BU
+    case 0x3079: // HIRAGANA LETTER BE
+    case 0x307C: // HIRAGANA LETTER BO
+    case 0x3094: // HIRAGANA LETTER VU
+    case 0x30AC: // KATAKANA LETTER GA
+    case 0x30AE: // KATAKANA LETTER GI
+    case 0x30B0: // KATAKANA LETTER GU
+    case 0x30B2: // KATAKANA LETTER GE
+    case 0x30B4: // KATAKANA LETTER GO
+    case 0x30B6: // KATAKANA LETTER ZA
+    case 0x30B8: // KATAKANA LETTER ZI
+    case 0x30BA: // KATAKANA LETTER ZU
+    case 0x30BC: // KATAKANA LETTER ZE
+    case 0x30BE: // KATAKANA LETTER ZO
+    case 0x30C0: // KATAKANA LETTER DA
+    case 0x30C2: // KATAKANA LETTER DI
+    case 0x30C5: // KATAKANA LETTER DU
+    case 0x30C7: // KATAKANA LETTER DE
+    case 0x30C9: // KATAKANA LETTER DO
+    case 0x30D0: // KATAKANA LETTER BA
+    case 0x30D3: // KATAKANA LETTER BI
+    case 0x30D6: // KATAKANA LETTER BU
+    case 0x30D9: // KATAKANA LETTER BE
+    case 0x30DC: // KATAKANA LETTER BO
+    case 0x30F4: // KATAKANA LETTER VU
+    case 0x30F7: // KATAKANA LETTER VA
+    case 0x30F8: // KATAKANA LETTER VI
+    case 0x30F9: // KATAKANA LETTER VE
+    case 0x30FA: // KATAKANA LETTER VO
+    case 0x30FE: // KATAKANA VOICED ITERATION MARK
+        return VoicedSoundMark;
+    case 0x3071: // HIRAGANA LETTER PA
+    case 0x3074: // HIRAGANA LETTER PI
+    case 0x3077: // HIRAGANA LETTER PU
+    case 0x307A: // HIRAGANA LETTER PE
+    case 0x307D: // HIRAGANA LETTER PO
+    case 0x30D1: // KATAKANA LETTER PA
+    case 0x30D4: // KATAKANA LETTER PI
+    case 0x30D7: // KATAKANA LETTER PU
+    case 0x30DA: // KATAKANA LETTER PE
+    case 0x30DD: // KATAKANA LETTER PO
+        return SemiVoicedSoundMark;
+    }
+    return NoVoicedSoundMark;
+}
+
+static inline bool isCombiningVoicedSoundMark(UChar character)
+{
+    switch (character) {
+    case 0x3099: // COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
+    case 0x309A: // COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
+        return true;
+    }
+    return false;
+}
+
+static inline bool containsKanaLetters(const String& pattern)
+{
+    const UChar* characters = pattern.characters();
+    unsigned length = pattern.length();
+    for (unsigned i = 0; i < length; ++i) {
+        if (isKanaLetter(characters[i]))
+            return true;
+    }
+    return false;
+}
+
+static void normalizeCharacters(const UChar* characters, unsigned length, Vector<UChar>& buffer)
+{
+    ASSERT(length);
+
+    buffer.resize(length);
+
+    UErrorCode status = U_ZERO_ERROR;
+    size_t bufferSize = unorm_normalize(characters, length, UNORM_NFC, 0, buffer.data(), length, &status);
+    ASSERT(status == U_ZERO_ERROR || status == U_STRING_NOT_TERMINATED_WARNING || status == U_BUFFER_OVERFLOW_ERROR);
+    ASSERT(bufferSize);
+
+    buffer.resize(bufferSize);
+
+    if (status == U_ZERO_ERROR || status == U_STRING_NOT_TERMINATED_WARNING)
+        return;
+
+    status = U_ZERO_ERROR;
+    unorm_normalize(characters, length, UNORM_NFC, 0, buffer.data(), bufferSize, &status);
+    ASSERT(status == U_STRING_NOT_TERMINATED_WARNING);
+}
+
 inline SearchBuffer::SearchBuffer(const String& target, bool isCaseSensitive)
     : m_target(target)
     , m_atBreak(true)
+    , m_targetRequiresKanaWorkaround(containsKanaLetters(m_target))
 {
     ASSERT(!m_target.isEmpty());
 
@@ -1521,6 +1731,10 @@ inline SearchBuffer::SearchBuffer(const String& target, bool isCaseSensitive)
     UErrorCode status = U_ZERO_ERROR;
     usearch_setPattern(searcher, m_target.characters(), targetLength, &status);
     ASSERT(status == U_ZERO_ERROR);
+
+    // The kana workaround requires a normalized copy of the target string.
+    if (m_targetRequiresKanaWorkaround)
+        normalizeCharacters(m_target.characters(), m_target.length(), m_normalizedTarget);
 }
 
 inline SearchBuffer::~SearchBuffer()
@@ -1558,6 +1772,59 @@ inline void SearchBuffer::reachedBreak()
     m_atBreak = true;
 }
 
+inline bool SearchBuffer::isBadMatch(const UChar* match, size_t matchLength) const
+{
+    // This function implements the kana workaround. If usearch treats
+    // it as a match, but we do not want to, then it's a "bad match".
+    if (!m_targetRequiresKanaWorkaround)
+        return false;
+
+    // Normalize into a match buffer. We reuse a single buffer rather than
+    // creating a new one each time.
+    normalizeCharacters(match, matchLength, m_normalizedMatch);
+
+    const UChar* a = m_normalizedTarget.begin();
+    const UChar* aEnd = m_normalizedTarget.end();
+
+    const UChar* b = m_normalizedMatch.begin();
+    const UChar* bEnd = m_normalizedMatch.end();
+
+    while (true) {
+        // Skip runs of non-kana-letter characters. This is necessary so we can
+        // correctly handle strings where the target and match have different-length
+        // runs of characters that match, while still double checking the correctness
+        // of matches of kana letters with other kana letters.
+        while (a != aEnd && !isKanaLetter(*a))
+            ++a;
+        while (b != bEnd && !isKanaLetter(*b))
+            ++b;
+
+        // If we reached the end of either the target or the match, we should have
+        // reached the end of both; both should have the same number of kana letters.
+        if (a == aEnd || b == bEnd) {
+            ASSERT(a == aEnd);
+            ASSERT(b == bEnd);
+            return false;
+        }
+
+        // Check for differences in the kana letter character itself.
+        if (isSmallKanaLetter(*a) != isSmallKanaLetter(*b))
+            return true;
+        if (composedVoicedSoundMark(*a) != composedVoicedSoundMark(*b))
+            return true;
+        ++a;
+        ++b;
+
+        // Check for differences in combining voiced sound marks found after the letter.
+        while (a != aEnd && b != bEnd && isCombiningVoicedSoundMark(*a) && isCombiningVoicedSoundMark(*b)) {
+            if (*a != *b)
+                return true;
+            ++a;
+            ++b;
+        }
+    }
+}
+
 inline size_t SearchBuffer::search(size_t& start)
 {
     size_t size = m_buffer.size();
@@ -1577,6 +1844,8 @@ inline size_t SearchBuffer::search(size_t& start)
 
     int matchStart = usearch_first(searcher, &status);
     ASSERT(status == U_ZERO_ERROR);
+
+nextMatch:
     if (!(matchStart >= 0 && static_cast<size_t>(matchStart) < size)) {
         ASSERT(matchStart == USEARCH_DONE);
         return 0;
@@ -1591,12 +1860,22 @@ inline size_t SearchBuffer::search(size_t& start)
         return 0;
     }
 
+    size_t matchedLength = usearch_getMatchedLength(searcher);
+    ASSERT(matchStart + matchedLength <= size);
+
+    // If this match is "bad", move on to the next match.
+    if (isBadMatch(m_buffer.data() + matchStart, matchedLength)) {
+        matchStart = usearch_next(searcher, &status);
+        ASSERT(status == U_ZERO_ERROR);
+        goto nextMatch;
+    }
+
     size_t newSize = size - (matchStart + 1);
     memmove(m_buffer.data(), m_buffer.data() + matchStart + 1, newSize * sizeof(UChar));
     m_buffer.shrink(newSize);
 
     start = size - matchStart;
-    return usearch_getMatchedLength(searcher);
+    return matchedLength;
 }
 
 #else // !ICU_UNICODE
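
For readers who want to try the comparison logic outside WebKit, the following is a minimal standalone sketch of the walk that isBadMatch performs. It is not the code from the patch: the names foldToHiragana, isKana, isSmallKana, composedMark, isCombiningMark and isBadKanaMatch are hypothetical, std::u16string stands in for WebKit's String and Vector<UChar>, halfwidth katakana and the U+31F0 block are left out, and both inputs are assumed to be NFC-normalized already (the patch does that step with unorm_normalize).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Mark { None, Voiced, SemiVoiced };

// Assumption of this sketch: katakana U+30A1..U+30F6 mirror hiragana
// U+3041..U+3096 at an offset of 0x60, so one table can serve both.
static char16_t foldToHiragana(char16_t c)
{
    return (c >= 0x30A1 && c <= 0x30F6) ? static_cast<char16_t>(c - 0x60) : c;
}

static bool isKana(char16_t c)
{
    return (c >= 0x3041 && c <= 0x3096) || (c >= 0x30A1 && c <= 0x30FA);
}

static bool isSmallKana(char16_t c)
{
    assert(isKana(c));
    switch (foldToHiragana(c)) {
    case 0x3041: case 0x3043: case 0x3045: case 0x3047: case 0x3049: // small a i u e o
    case 0x3063: case 0x3083: case 0x3085: case 0x3087: case 0x308E: // small tu ya yu yo wa
    case 0x3095: case 0x3096:                                        // small ka ke
        return true;
    }
    return false;
}

static Mark composedMark(char16_t c)
{
    assert(isKana(c));
    if (c >= 0x30F7 && c <= 0x30FA) // katakana va vi ve vo (no hiragana twin)
        return Mark::Voiced;
    switch (foldToHiragana(c)) {
    case 0x304C: case 0x304E: case 0x3050: case 0x3052: case 0x3054: // ga gi gu ge go
    case 0x3056: case 0x3058: case 0x305A: case 0x305C: case 0x305E: // za zi zu ze zo
    case 0x3060: case 0x3062: case 0x3065: case 0x3067: case 0x3069: // da di du de do
    case 0x3070: case 0x3073: case 0x3076: case 0x3079: case 0x307C: // ba bi bu be bo
    case 0x3094:                                                     // vu
        return Mark::Voiced;
    case 0x3071: case 0x3074: case 0x3077: case 0x307A: case 0x307D: // pa pi pu pe po
        return Mark::SemiVoiced;
    }
    return Mark::None;
}

static bool isCombiningMark(char16_t c)
{
    return c == 0x3099 || c == 0x309A;
}

// Returns true when the collator-reported match differs from the target in
// kana size or voicing. Both strings are assumed to be NFC-normalized.
static bool isBadKanaMatch(const std::u16string& target, const std::u16string& match)
{
    std::size_t a = 0, b = 0;
    while (true) {
        // Skip runs of non-kana characters; their lengths may differ.
        while (a < target.size() && !isKana(target[a]))
            ++a;
        while (b < match.size() && !isKana(match[b]))
            ++b;
        if (a == target.size() || b == match.size())
            return false;

        // Compare the kana letters themselves.
        if (isSmallKana(target[a]) != isSmallKana(match[b]))
            return true;
        if (composedMark(target[a]) != composedMark(match[b]))
            return true;
        ++a;
        ++b;

        // Compare combining sound marks that survived normalization.
        while (a < target.size() && b < match.size()
               && isCombiningMark(target[a]) && isCombiningMark(match[b])) {
            if (target[a] != match[b])
                return true;
            ++a;
            ++b;
        }
    }
}
```

With this sketch, a target of U+304B (hiragana KA) against a match of U+304C (hiragana GA) is reported as a bad match, while U+304B against U+30AB (katakana KA) is not, since only size and voicing differences are rejected.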

-- 
WebKit Debian packaging


