[Dict-common-dev] Huge Catalan dictionary

Agustin Martin agmartin at debian.org
Sun Mar 26 22:14:21 UTC 2006


On Sat, Mar 25, 2006 at 05:17:38PM +0100, Jordi Mallach wrote:
> Hi,
> 
> I've spent some hours thinking about how to solve:
> #345242: aspell-ca: reports hyphenated and apostrophed words as
> mispellings.
> 
> It's not trivial. Many versions ago, I asked Agustín about ideas to solve
> the big size of my generated dictionaries. He suggested that I could
> remove a few rules from my .aff file, and that indeed did generate a
> reasonably-sized dictionary. Unfortunately, the stuff that was removed
> from the resulting dictionary is quite annoying.
> 
> I tried adding some of the rules again, but the dictionary still grows
> quite a bit. I've been discussing with my upstream dictionary
> maintainer, and he suggests I remove some rules from the aff file and
> then hack around the generated wordlist to make things work, although
> they suck a bit.
> 
> The "100% correct" aspell dictionary is nearly 200 megabytes, as it
> includes a lot of variations for hyphenated and apostrophed words, which
> is mainly what was getting removed in the past.

Hi, Jordi

I think you need to use affix compression, but first upstream (or you)
need to fix the myspell affix file so that aspell accepts it. Some things
in it are not accepted by aspell; see my experiments with affix
compression in aspell-ca:

http://bugs.debian.org/311391

Since I filed that bug report against the source package ispellcat instead of
aspell-ca, it probably went unnoticed, as I never received any reply. I quote
the good news from it here:

-------------------------------------------------------------------------
  EPILOG) And the good news ;-)
  =============================

  And now, after the lo...oong report, the good news, building the catalan
  dict unstripped but with affix compression produces a 3.7Mb hash file,
  instead of the >100Mb file that was previously needed for the unstripped
  version.
-------------------------------------------------------------------------
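
To make the size difference concrete, here is a toy sketch of why the
expanded list grows so fast when apostrophed and hyphenated variants are
included. The stems, the clitic lists and the expand_words helper are all
made up for illustration; they are not taken from the real Catalan affix
file, and many of the generated combinations are not valid Catalan. The only
point is the multiplication: with affix compression the dictionary keeps one
entry per stem plus the rules, while the expanded list keeps every
combination.

```sh
#!/bin/sh
# Toy illustration only: prefixes, suffixes and stems are hand-picked
# assumptions, not rules from the real Catalan affix file.

# expand_words: read one stem per line on stdin and print every
# prefix/stem/suffix combination, including the bare stem.
expand_words() {
    awk -v prefixes="l' d' m' t' s' n'" \
        -v suffixes="-me -te -lo -la -ho -hi" '
    BEGIN {
        np = split(prefixes, pre, " ")
        ns = split(suffixes, suf, " ")
        pre[0] = ""   # index 0 means "no prefix"
        suf[0] = ""   # index 0 means "no suffix"
    }
    {
        for (i = 0; i <= np; i++)
            for (j = 0; j <= ns; j++)
                print pre[i] $0 suf[j]
    }'
}

stems="escoltar
obrir
enviar"

# Entries needed when each stem carries affix flags: one per stem.
compressed=$(printf '%s\n' "$stems" | wc -l | tr -d ' ')
# Entries needed when every variant is written out.
expanded=$(printf '%s\n' "$stems" | expand_words | wc -l | tr -d ' ')

echo "entries with affix flags: $compressed"
echo "entries fully expanded:   $expanded"
```

With six prefixes and six suffixes each stem yields 7 x 7 = 49 forms, so
three stems already become 147 lines. The real rule set is much larger,
which is consistent with the sizes quoted above.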

If upstream is too busy to deal with this right now, I suggest you use the
aspell affix file from aspell6-ca (on the aspell site), which has been
fine-tuned by Kevin Atkinson to work with affix compression. I hope it will
also work well for myspell, but if not, keep two myspell-type affix files, one
for myspell and the other for aspell, as a temporary fix.

I am attaching a patch with what I think should work for using affix
compression once the myspell affix file is fixed (the patch also fixes a
couple of other problems). Note that it will not work for aspell with the
current affix file.
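
The debian/rules change in the patch feeds the myspell .dic file into the
compressed word list instead of the pre-expanded one. A rough sketch of just
the cleanup stages of that pipeline follows. It assumes that the first line
of the .dic file is the usual myspell entry count and that the file has DOS
line endings; tr -d '\r' stands in for fromdos, the sample entries and the
dic_to_wordlist name are invented, and the prezip and gzip stages are left
out so that the sketch runs without aspell installed.

```sh
#!/bin/sh
# dic_to_wordlist: turn a myspell-style .dic stream into a plain word list.
#   tr -d '\r'  strips carriage returns (stand-in for fromdos)
#   sed '1d'    drops the first line, assumed to be the entry count
dic_to_wordlist() {
    tr -d '\r' | sed '1d'
}

# Hypothetical three-entry sample with DOS line endings.
printf '3\r\ncasa/A\r\ngat/B\r\nobrir/C\r\n' | dic_to_wordlist
```

The word/FLAGS entries are passed through untouched. That is the point of
the change: the flags stay in the word list and aspell applies the rules
from the affix file, rather than every variant being written out.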

Salut,

-- 
Agustin
-------------- next part --------------
diff -u ispellcat-0.4/debian/ca.dat ispellcat-0.4/debian/ca.dat
--- ispellcat-0.4/debian/ca.dat
+++ ispellcat-0.4/debian/ca.dat
@@ -4,2 +4,5 @@
-special ' -*- · -*- - -*-
+special ' -*- · -*- - -*- . --*
 soundslike generic
+affix          ca
+affix-compress true
+repl-table     ca_affix.dat
diff -u ispellcat-0.4/debian/changelog ispellcat-0.4/debian/changelog
--- ispellcat-0.4/debian/changelog
+++ ispellcat-0.4/debian/changelog
@@ -1,3 +1,11 @@
+ispellcat (0.4-6.1) unstable; urgency=low
+
+  * debian/ca.dat, debian/rules: Use affix compression
+  * debian/ca.dat: Allow . at end of words
+  * debian/rules: Make sure no cruft is left on purge
+
+ -- Agustin Martin Domingo <agmartin at debian.org>  Sun, 26 Mar 2006 23:53:04 +0200
+
 ispellcat (0.4-6) unstable; urgency=low
 
   * debian/control:
diff -u ispellcat-0.4/debian/rules ispellcat-0.4/debian/rules
--- ispellcat-0.4/debian/rules
+++ ispellcat-0.4/debian/rules
@@ -32,9 +32,7 @@
 #	cat catala.words.debian | \
 #		aspell --local-data-dir=$(CURDIR) --lang=ca \
 #			create master ./ca.rws
-	cp catala.words.debian ca.wl
-	prezip ca.wl
-	gzip ca.cwl
+	cat catalan-m.dic | fromdos | sed '1d'  | prezip | gzip -c > ca.cwl.gz
 
 	echo "add ca.rws" > ca.multi
 	echo "add ca.multi" > catalan.alias
@@ -71,14 +69,14 @@
 	# aspell-ca stuff
 	install -m 644 ca.cwl.gz $(ADICT_DIR)/usr/share/aspell
 	install -m 644 debian/ca.dat $(ADICT_DIR)/usr/lib/aspell/ca.dat
-	touch $(ADICT_DIR)/usr/lib/aspell/ca.rws
-	touch $(ADICT_DIR)/usr/lib/aspell/ca.compat
+	install -m 644 catalan-m.aff $(ADICT_DIR)/usr/lib/aspell/ca_affix.dat
 	install -m 644 ca.multi $(ADICT_DIR)/usr/lib/aspell/ca.multi
 	install -m 644 catalan.alias $(ADICT_DIR)/usr/lib/aspell/catalan.alias
 	install -m 644 catala.alias $(ADICT_DIR)/usr/lib/aspell/catala.alias
 	install -m 644 català.alias $(ADICT_DIR)/usr/lib/aspell/català.alias
+	touch $(ADICT_DIR)/var/lib/aspell/ca.rws
 	touch $(ADICT_DIR)/var/lib/aspell/ca.compat
-	
+
 #	install -m 644 ca_phonet.dat $(ADICT_DIR)/usr/lib/aspell/ca_phonet.dat
 
 	# myspell-ca stuff

