[ucto] 01/03: New upstream version 0.9.8
Maarten van Gompel
proycon-guest at moszumanska.debian.org
Thu Nov 2 20:15:47 UTC 2017
This is an automated email from the git hooks/post-receive script.
proycon-guest pushed a commit to branch master
in repository ucto.
commit 98fae82a0f2e7e7ed8335832ee9cdff8c4a5f0a7
Author: proycon <proycon at anaproy.nl>
Date: Thu Nov 2 21:07:31 2017 +0100
New upstream version 0.9.8
---
ChangeLog | 491 +++++++++++++++++++++++++++++++++++++++++
INSTALL | 370 +++++++++++++++++++++++++++++++
Makefile.am | 2 +-
Makefile.in | 30 +--
NEWS | 31 +++
README | 113 ----------
aclocal.m4 | 1 -
bootstrap.sh | 3 -
config.guess | 165 ++++++++------
config.h.in | 3 -
config.sub | 56 +++--
config/Makefile.in | 10 +-
configure | 403 ++++++++++++++++------------------
configure.ac | 63 ++----
docs/Makefile.in | 10 +-
docs/ucto.1 | 31 ++-
include/Makefile.in | 10 +-
include/ucto/Makefile.in | 10 +-
include/ucto/setting.h | 1 +
include/ucto/textcat.h | 2 +-
include/ucto/tokenize.h | 22 +-
install-sh | 23 +-
ltmain.sh | 39 ++--
m4/Makefile.in | 10 +-
m4/ax_icu_check.m4 | 86 --------
m4/libtool.m4 | 27 ++-
m4/ltsugar.m4 | 7 +-
m4/lt~obsolete.m4 | 7 +-
m4/pkg.m4 | 217 +++++++-----------
src/Makefile.am | 5 +-
src/Makefile.in | 15 +-
src/setting.cxx | 39 +++-
src/textcat.cxx | 4 +-
src/tokenize.cxx | 560 +++++++++++++++++++++++++++++++++++------------
src/ucto.cxx | 268 +++++++++++++++++------
src/unicode.cxx | 17 +-
tests/Makefile.in | 10 +-
ucto.pc.in | 1 -
38 files changed, 2116 insertions(+), 1046 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index 4cfc5fa..25afd34 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,490 @@
+2017-10-23 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* tests/testutt, tests/testutt.ok, tests/utt2.xml: added another
+ utterance test.
+
+2017-10-22 Maarten van Gompel <proycon at anaproy.nl>
+
+ * src/tokenize.cxx: Attempted fix for utterance/sentence problem #37
+
+2017-10-22 Maarten van Gompel <proycon at anaproy.nl>
+
+ * src/tokenize.cxx: another related comment
+
+2017-10-22 Maarten van Gompel <proycon at anaproy.nl>
+
+	* src/tokenize.cxx: just added a comment/suggestion on detecting
+	  structure elements
+
+2017-10-19 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * NEWS: small folia ==> FoLiA edit
+
+2017-10-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: bumped version after release
+
+2017-10-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * NEWS: some typos in NEWS
+
+2017-10-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * NEWS: Updated NEWS with old news from 23-01-2017
+
+2017-10-17 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * NEWS: some news
+
+2017-10-11 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx, tests/testfoliain.ok: fixed
+	  textredundancy="full". Now it adds text up to the highest level.
+
+2017-10-11 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testfoliain, tests/testfoliain.ok, tests/textproblem.xml:
+ added and modified tests, after change in FoLiA parser
+
+2017-10-11 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/tokenize.cxx: added a
+ setTextRedundancy member
+
+2017-10-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/partest2_folia.nl.xml, tests/partest_folia.nl.xml,
+ tests/testfolia.ok, tests/testfolia2.ok, tests/testfoliain.ok,
+ tests/testlang.ok, tests/testutt.ok: adapted tests to changed
+ textredundancy level
+
+2017-10-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx, src/ucto.cxx: changed textredundancy default to
+ 'minimal'
+
+2017-10-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testfoliain.ok: adapted test to changed <br/> handling
+
+2017-10-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: for now, disable the <br/> handling. It is too
+ complicated.
+
+2017-10-02 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testfolia2, tests/testfolia2.ok, tests/testfoliain.ok: fixed
+ tests
+
+2017-10-02 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx,
+ tests/testfolia, tests/testfoliain, tests/testfoliain.ok:
+ implemented --textredundancy option (replaces --noredundanttext)
+
+2017-10-02 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/tokenize.cxx: removed an unused
+ function. Give a warning when attempting to set language on metadata
+ of non-native type
+
+2017-10-02 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: re-instated --with-icu in configure.ac
+
+2017-09-28 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: added safeguards around set_metadata
+
+2017-09-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: the default is doRedundantText == true
+
+2017-09-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testfoliain: adapted test to check automagically detecting
+ folia
+
+2017-09-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/ucto.cxx: automatically switch to -F or -X when input or
+	  output files have a '.xml' extension
+
+2017-09-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testfolia2, tests/testfolia2.ok: modified test to also test
+ -T option
+
+2017-09-26 Maarten van Gompel <proycon at anaproy.nl>
+
+ * src/ucto.cxx: added CLST, Nijmegen to --version
+
+2017-09-26 Maarten van Gompel <proycon at anaproy.nl>
+
+ * src/ucto.cxx: Added shortcut option for --noredundanttext (-T) and
+ changed help text a bit #31
+
+2017-09-26 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testfolia.ok: add updated file, missing from previous commit
+
+2017-09-26 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx,
+ tests/testfolia, tests/testfoliain, tests/testfoliain.ok:
+	  implemented a --noredundanttext option and added tests
+
+2017-09-12 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: be sure to use recent libfolia
+
+2017-09-12 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx, tests/testfoliain.ok: set textclass on <w> when
+ outputclass != inputclass
+
+2017-09-11 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: use C++!
+
+2017-08-30 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * ucto.pc.in: removed icu requirement
+
+2017-08-30 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * : commit 5ee40601de62c8612f4660a7748151fee7ea9929 Author: Ko van
+ der Sloot <K.vanderSloot at let.ru.nl> Date: Wed Aug 30 16:24:06 2017
+ +0200
+
+2017-08-30 Maarten van Gompel <proycon at anaproy.nl>
+
+ * docs/ucto_manual.tex: typo fix (and automatic trailing space
+ stuff)
+
+2017-08-21 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/folia9a.xml, tests/folia9b.xml, tests/testfoliain,
+ tests/testfoliain.ok: added test documents with embedded tabs,
+ newlines and multiple spaces.
+
+2017-08-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/folia8.xml: new file
+
+2017-08-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac, tests/testfoliain, tests/testfoliain.ok: added a
+	  test with an xml comment inside a <t>
+
+2017-08-17 Maarten van Gompel <proycon at anaproy.nl>
+
+ * src/tokenize.cxx, src/ucto.cxx: language fix
+
+2017-08-15 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: added some more debug lines
+
+2017-08-14 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* src/tokenize.cxx: try to generate IDs based on the parent's ID or
+	  their parent's ID.
+
+2017-07-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: add libtar-dev too
+
+2017-07-25 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * : commit 00c3b9e94e36331b756f67110c0fc940ff83075d Author: Ko van
+ der Sloot <K.vanderSloot at let.ru.nl> Date: Tue Jul 25 10:45:38 2017
+ +0200
+
+2017-07-20 Maarten van Gompel <proycon at anaproy.nl>
+
+ * tests/testall: use python2 explicitly
+
+2017-07-20 Maarten van Gompel <proycon at anaproy.nl>
+
+ * tests/test.py: use python 2 explicitly
+
+2017-07-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx, tests/testutt.ok: fixed utterance handling
+ (quite hacky)
+
+2017-07-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testall, tests/testutt, tests/utt.xml: added a (yet failing)
+ test
+
+2017-07-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: attempt to fix clang test on travis
+
+2017-07-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: disable filtering in XML files in more cases
+
+2017-06-28 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: attempt to fix build
+
+2017-06-28 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* tests/testfoliain.ok: adapted test, now newline handling is fixed
+
+2017-06-28 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/tokenize.cxx: added code to handle
+ embedded newlines in FoLiA documents.
+
+2017-06-26 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* src/tokenize.cxx: adapted to changed libfolia
+
+2017-06-01 Maarten van Gompel <proycon at anaproy.nl>
+
+ * : commit 2037878fff5e9bb47911c1a0c54b9c79291754fc Author: Maarten
+ van Gompel <proycon at anaproy.nl> Date: Thu Jun 1 21:30:05 2017
+ +0200
+
+2017-05-22 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/setting.cxx, src/tokenize.cxx, src/ucto.cxx,
+ tests/testfiles2.ok, tests/testfoliain.ok, tests/testlang.ok,
+ tests/testoption2.ok, tests/testslash.ok: sorted out logging and
+ such a bit.
+
+2017-05-22 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/testfoliain.ok, tests/testlang.ok, tests/testslash.ok:
+	  adapted tests
+
+2017-05-22 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/ucto.cxx: No longer SILENTLY set --filter=NO for FoLiA with
+	  equal input and output class
+
+2017-05-22 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* src/ucto.cxx, tests/testnormalisation: added a --filter option. It
+	  supersedes -f (which could only switch filtering OFF)
+
+2017-05-17 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/folia1.xml, tests/testfoliain, tests/testfoliain.ok:
+ enhanced and extended folia testing
+
+2017-05-17 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx, src/ucto.cxx, tests/testfoliain.ok: Disable
+ filtering of characters on FoLiA input with same inputclass and
+ outputclass
+
+2017-05-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/filter.xml, tests/testfoliain.ok, tests/testtext,
+	  tests/testtext.ok: added a test, and adapted to changed results
+
+2017-05-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: now we adapt text on <s> and <p> to the lower
+ layers
+
+2017-05-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: simplified configuration
+
+2017-05-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: added IRC notification
+
+2017-05-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* tests/testlang.ok: adapted test after fix in libfolia
+
+2017-05-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * docs/ucto.1, src/ucto.cxx: update manpage. Fixed typo.
+
+2017-05-09 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * Makefile.am, configure.ac, ucto.pc.in: more configuration cleanup.
+
+2017-05-08 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * bootstrap.sh, configure.ac: modernized build system
+
+2017-05-03 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: still a leak was left. plugging...
+
+2017-05-03 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/setting.cxx, src/tokenize.cxx: fixed a memory leak
+
+2017-04-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: added some comment
+
+2017-04-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: better debug output
+
+2017-04-10 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * tests/folia7.xml, tests/testfolia, tests/testfoliain,
+ tests/testfoliain.ok: added a test
+
+2017-04-04 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: revert back to default g++
+
+2017-03-30 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: numb edits
+
+2017-03-28 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/tokenize.cxx, src/ucto.cxx,
+ tests/folia-lang-2.xml, tests/testlang: started implementing
+ language detection in FoLiA input too. Not done, nothing broke (yet)
+
+2017-03-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: fixed a problem with log token detection
+
+2017-03-14 Maarten van Gompel <proycon at anaproy.nl>
+
+ * : Merge pull request #17 from sanmai-NL/speed_up_CI_build Limit network transfers, add `ccache`
+
+2017-03-01 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: Oops. A function got lost... :{
+
+2017-02-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* src/ucto.cxx: removed redundant mention of the configfile (it is
+	  empty > 90% of the time)
+
+2017-02-27 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/tokenize.cxx: in case of problems in
+	  tokenizeLine(), we display the offending line number OR the FoLiA
+ element ID.
+
+2017-02-26 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* src/tokenize.cxx: for extremely long 'words' display a part of the
+	  offending input. Also corrected a typo.
+
+2017-02-21 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/setting.cxx, src/ucto.cxx: give better information when
+ language is missing or wrong
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/ucto.cxx: updated usage()
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * docs/ucto.1: updated ucto man page
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: another final attempt :{
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: final attempt
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: getting closer?
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* .travis.yml: wow, this is tricky
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: next try
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: another attempt
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: attempt to fix
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: modernized Travis config
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * .travis.yml: added dependency for travis
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/ucto.cxx: Warn about use of unsupported languages. Don't use
+ 'generic' by default.
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/ucto.cxx: check specified languages against the installed ones
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/setting.h, src/setting.cxx, src/ucto.cxx: use a set
+	  to store results, not a vector
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/setting.h, src/setting.cxx, src/ucto.cxx: added a
+ function to search for installed languages
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/tokenize.cxx: typo corrected
+
+2017-02-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* src/tokenize.cxx: choke on words of 2500 characters or more
+
+2017-02-08 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* include/ucto/tokenize.h, src/tokenize.cxx: some more repair work
+	  concerning outputclass
+
+2017-02-08 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+	* src/tokenize.cxx, src/ucto.cxx: when using the --textclass option,
+	  make sure --inputclass and --outputclass are not used.
+
+2017-02-07 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/tokenize.h, src/Makefile.am, src/tokenize.cxx:
+ attempt to speed up some stuff
+
+2017-02-02 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * src/Makefile.am, src/tokenize.cxx: minor changes
+
+2017-01-24 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * include/ucto/textcat.h, src/Makefile.am, src/setting.cxx,
+ src/textcat.cxx, src/tokenize.cxx, src/ucto.cxx, src/unicode.cxx:
+ some refactoring to satisfy static checkers
+
+2017-01-23 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: bumped version after release
+
2017-01-23 Maarten van Gompel <proycon at anaproy.nl>
* configure.ac: rely on uctodata 0.4
@@ -38,6 +525,10 @@
	* config/Makefile.am, src/Makefile.am: install and look for
datafiles in $PREFIX/share/ucto
+2017-01-18 Sander Maijers <S.N.Maijers at gmail.com>
+
+ * .travis.yml: Speed up CI builds
+
2017-01-18 Ko van der Sloot <K.vanderSloot at let.ru.nl>
* tests/test.nl.tok.V, tests/test.nl.txt: added more DATE testcases
diff --git a/INSTALL b/INSTALL
new file mode 100644
index 0000000..2099840
--- /dev/null
+++ b/INSTALL
@@ -0,0 +1,370 @@
+Installation Instructions
+*************************
+
+Copyright (C) 1994-1996, 1999-2002, 2004-2013 Free Software Foundation,
+Inc.
+
+ Copying and distribution of this file, with or without modification,
+are permitted in any medium without royalty provided the copyright
+notice and this notice are preserved. This file is offered as-is,
+without warranty of any kind.
+
+Basic Installation
+==================
+
+ Briefly, the shell command `./configure && make && make install'
+should configure, build, and install this package. The following
+more-detailed instructions are generic; see the `README' file for
+instructions specific to this package. Some packages provide this
+`INSTALL' file but do not implement all of the features documented
+below. The lack of an optional feature in a given package is not
+necessarily a bug. More recommendations for GNU packages can be found
+in *note Makefile Conventions: (standards)Makefile Conventions.
+
+ The `configure' shell script attempts to guess correct values for
+various system-dependent variables used during compilation. It uses
+those values to create a `Makefile' in each directory of the package.
+It may also create one or more `.h' files containing system-dependent
+definitions. Finally, it creates a shell script `config.status' that
+you can run in the future to recreate the current configuration, and a
+file `config.log' containing compiler output (useful mainly for
+debugging `configure').
+
+ It can also use an optional file (typically called `config.cache'
+and enabled with `--cache-file=config.cache' or simply `-C') that saves
+the results of its tests to speed up reconfiguring. Caching is
+disabled by default to prevent problems with accidental use of stale
+cache files.
+
+ If you need to do unusual things to compile the package, please try
+to figure out how `configure' could check whether to do them, and mail
+diffs or instructions to the address given in the `README' so they can
+be considered for the next release. If you are using the cache, and at
+some point `config.cache' contains results you don't want to keep, you
+may remove or edit it.
+
+ The file `configure.ac' (or `configure.in') is used to create
+`configure' by a program called `autoconf'. You need `configure.ac' if
+you want to change it or regenerate `configure' using a newer version
+of `autoconf'.
+
+ The simplest way to compile this package is:
+
+ 1. `cd' to the directory containing the package's source code and type
+ `./configure' to configure the package for your system.
+
+ Running `configure' might take a while. While running, it prints
+ some messages telling which features it is checking for.
+
+ 2. Type `make' to compile the package.
+
+ 3. Optionally, type `make check' to run any self-tests that come with
+ the package, generally using the just-built uninstalled binaries.
+
+ 4. Type `make install' to install the programs and any data files and
+ documentation. When installing into a prefix owned by root, it is
+ recommended that the package be configured and built as a regular
+ user, and only the `make install' phase executed with root
+ privileges.
+
+ 5. Optionally, type `make installcheck' to repeat any self-tests, but
+ this time using the binaries in their final installed location.
+ This target does not install anything. Running this target as a
+ regular user, particularly if the prior `make install' required
+ root privileges, verifies that the installation completed
+ correctly.
+
+ 6. You can remove the program binaries and object files from the
+ source code directory by typing `make clean'. To also remove the
+ files that `configure' created (so you can compile the package for
+ a different kind of computer), type `make distclean'. There is
+ also a `make maintainer-clean' target, but that is intended mainly
+ for the package's developers. If you use it, you may have to get
+ all sorts of other programs in order to regenerate files that came
+ with the distribution.
+
+ 7. Often, you can also type `make uninstall' to remove the installed
+ files again. In practice, not all packages have tested that
+ uninstallation works correctly, even though it is required by the
+ GNU Coding Standards.
+
+ 8. Some packages, particularly those that use Automake, provide `make
+ distcheck', which can by used by developers to test that all other
+ targets like `make install' and `make uninstall' work correctly.
+ This target is generally not run by end users.
+
+Compilers and Options
+=====================
+
+ Some systems require unusual options for compilation or linking that
+the `configure' script does not know about. Run `./configure --help'
+for details on some of the pertinent environment variables.
+
+ You can give `configure' initial values for configuration parameters
+by setting variables in the command line or in the environment. Here
+is an example:
+
+ ./configure CC=c99 CFLAGS=-g LIBS=-lposix
+
+ *Note Defining Variables::, for more details.
+
+Compiling For Multiple Architectures
+====================================
+
+ You can compile the package for more than one kind of computer at the
+same time, by placing the object files for each architecture in their
+own directory. To do this, you can use GNU `make'. `cd' to the
+directory where you want the object files and executables to go and run
+the `configure' script. `configure' automatically checks for the
+source code in the directory that `configure' is in and in `..'. This
+is known as a "VPATH" build.
+
+ With a non-GNU `make', it is safer to compile the package for one
+architecture at a time in the source code directory. After you have
+installed the package for one architecture, use `make distclean' before
+reconfiguring for another architecture.
+
+ On MacOS X 10.5 and later systems, you can create libraries and
+executables that work on multiple system types--known as "fat" or
+"universal" binaries--by specifying multiple `-arch' options to the
+compiler but only a single `-arch' option to the preprocessor. Like
+this:
+
+ ./configure CC="gcc -arch i386 -arch x86_64 -arch ppc -arch ppc64" \
+ CXX="g++ -arch i386 -arch x86_64 -arch ppc -arch ppc64" \
+ CPP="gcc -E" CXXCPP="g++ -E"
+
+ This is not guaranteed to produce working output in all cases, you
+may have to build one architecture at a time and combine the results
+using the `lipo' tool if you have problems.
+
+Installation Names
+==================
+
+ By default, `make install' installs the package's commands under
+`/usr/local/bin', include files under `/usr/local/include', etc. You
+can specify an installation prefix other than `/usr/local' by giving
+`configure' the option `--prefix=PREFIX', where PREFIX must be an
+absolute file name.
+
+ You can specify separate installation prefixes for
+architecture-specific files and architecture-independent files. If you
+pass the option `--exec-prefix=PREFIX' to `configure', the package uses
+PREFIX as the prefix for installing programs and libraries.
+Documentation and other data files still use the regular prefix.
+
+ In addition, if you use an unusual directory layout you can give
+options like `--bindir=DIR' to specify different values for particular
+kinds of files. Run `configure --help' for a list of the directories
+you can set and what kinds of files go in them. In general, the
+default for these options is expressed in terms of `${prefix}', so that
+specifying just `--prefix' will affect all of the other directory
+specifications that were not explicitly provided.
+
+ The most portable way to affect installation locations is to pass the
+correct locations to `configure'; however, many packages provide one or
+both of the following shortcuts of passing variable assignments to the
+`make install' command line to change installation locations without
+having to reconfigure or recompile.
+
+ The first method involves providing an override variable for each
+affected directory. For example, `make install
+prefix=/alternate/directory' will choose an alternate location for all
+directory configuration variables that were expressed in terms of
+`${prefix}'. Any directories that were specified during `configure',
+but not in terms of `${prefix}', must each be overridden at install
+time for the entire installation to be relocated. The approach of
+makefile variable overrides for each directory variable is required by
+the GNU Coding Standards, and ideally causes no recompilation.
+However, some platforms have known limitations with the semantics of
+shared libraries that end up requiring recompilation when using this
+method, particularly noticeable in packages that use GNU Libtool.
+
+ The second method involves providing the `DESTDIR' variable. For
+example, `make install DESTDIR=/alternate/directory' will prepend
+`/alternate/directory' before all installation names. The approach of
+`DESTDIR' overrides is not required by the GNU Coding Standards, and
+does not work on platforms that have drive letters. On the other hand,
+it does better at avoiding recompilation issues, and works well even
+when some directory options were not specified in terms of `${prefix}'
+at `configure' time.
+
+Optional Features
+=================
+
+ If the package supports it, you can cause programs to be installed
+with an extra prefix or suffix on their names by giving `configure' the
+option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'.
+
+ Some packages pay attention to `--enable-FEATURE' options to
+`configure', where FEATURE indicates an optional part of the package.
+They may also pay attention to `--with-PACKAGE' options, where PACKAGE
+is something like `gnu-as' or `x' (for the X Window System). The
+`README' should mention any `--enable-' and `--with-' options that the
+package recognizes.
+
+ For packages that use the X Window System, `configure' can usually
+find the X include and library files automatically, but if it doesn't,
+you can use the `configure' options `--x-includes=DIR' and
+`--x-libraries=DIR' to specify their locations.
+
+ Some packages offer the ability to configure how verbose the
+execution of `make' will be. For these packages, running `./configure
+--enable-silent-rules' sets the default to minimal output, which can be
+overridden with `make V=1'; while running `./configure
+--disable-silent-rules' sets the default to verbose, which can be
+overridden with `make V=0'.
+
+Particular systems
+==================
+
+ On HP-UX, the default C compiler is not ANSI C compatible. If GNU
+CC is not installed, it is recommended to use the following options in
+order to use an ANSI C compiler:
+
+ ./configure CC="cc -Ae -D_XOPEN_SOURCE=500"
+
+and if that doesn't work, install pre-built binaries of GCC for HP-UX.
+
+ HP-UX `make' updates targets which have the same time stamps as
+their prerequisites, which makes it generally unusable when shipped
+generated files such as `configure' are involved. Use GNU `make'
+instead.
+
+ On OSF/1 a.k.a. Tru64, some versions of the default C compiler cannot
+parse its `<wchar.h>' header file. The option `-nodtk' can be used as
+a workaround. If GNU CC is not installed, it is therefore recommended
+to try
+
+ ./configure CC="cc"
+
+and if that doesn't work, try
+
+ ./configure CC="cc -nodtk"
+
+ On Solaris, don't put `/usr/ucb' early in your `PATH'. This
+directory contains several dysfunctional programs; working variants of
+these programs are available in `/usr/bin'. So, if you need `/usr/ucb'
+in your `PATH', put it _after_ `/usr/bin'.
+
+ On Haiku, software installed for all users goes in `/boot/common',
+not `/usr/local'. It is recommended to use the following options:
+
+ ./configure --prefix=/boot/common
+
+Specifying the System Type
+==========================
+
+ There may be some features `configure' cannot figure out
+automatically, but needs to determine by the type of machine the package
+will run on. Usually, assuming the package is built to be run on the
+_same_ architectures, `configure' can figure that out, but if it prints
+a message saying it cannot guess the machine type, give it the
+`--build=TYPE' option. TYPE can either be a short name for the system
+type, such as `sun4', or a canonical name which has the form:
+
+ CPU-COMPANY-SYSTEM
+
+where SYSTEM can have one of these forms:
+
+ OS
+ KERNEL-OS
+
+ See the file `config.sub' for the possible values of each field. If
+`config.sub' isn't included in this package, then this package doesn't
+need to know the machine type.
+
+ If you are _building_ compiler tools for cross-compiling, you should
+use the option `--target=TYPE' to select the type of system they will
+produce code for.
+
+ If you want to _use_ a cross compiler, that generates code for a
+platform different from the build platform, you should specify the
+"host" platform (i.e., that on which the generated programs will
+eventually be run) with `--host=TYPE'.
+
+Sharing Defaults
+================
+
+ If you want to set default values for `configure' scripts to share,
+you can create a site shell script called `config.site' that gives
+default values for variables like `CC', `cache_file', and `prefix'.
+`configure' looks for `PREFIX/share/config.site' if it exists, then
+`PREFIX/etc/config.site' if it exists. Or, you can set the
+`CONFIG_SITE' environment variable to the location of the site script.
+A warning: not all `configure' scripts look for a site script.
+
+Defining Variables
+==================
+
+ Variables not defined in a site shell script can be set in the
+environment passed to `configure'. However, some packages may run
+configure again during the build, and the customized values of these
+variables may be lost. In order to avoid this problem, you should set
+them in the `configure' command line, using `VAR=value'. For example:
+
+ ./configure CC=/usr/local2/bin/gcc
+
+causes the specified `gcc' to be used as the C compiler (unless it is
+overridden in the site shell script).
+
+Unfortunately, this technique does not work for `CONFIG_SHELL' due to
+an Autoconf limitation. Until the limitation is lifted, you can use
+this workaround:
+
+ CONFIG_SHELL=/bin/bash ./configure CONFIG_SHELL=/bin/bash
+
+`configure' Invocation
+======================
+
+ `configure' recognizes the following options to control how it
+operates.
+
+`--help'
+`-h'
+ Print a summary of all of the options to `configure', and exit.
+
+`--help=short'
+`--help=recursive'
+ Print a summary of the options unique to this package's
+ `configure', and exit. The `short' variant lists options used
+ only in the top level, while the `recursive' variant lists options
+ also present in any nested packages.
+
+`--version'
+`-V'
+ Print the version of Autoconf used to generate the `configure'
+ script, and exit.
+
+`--cache-file=FILE'
+ Enable the cache: use and save the results of the tests in FILE,
+ traditionally `config.cache'. FILE defaults to `/dev/null' to
+ disable caching.
+
+`--config-cache'
+`-C'
+ Alias for `--cache-file=config.cache'.
+
+`--quiet'
+`--silent'
+`-q'
+ Do not print messages saying which checks are being made. To
+ suppress all normal output, redirect it to `/dev/null' (any error
+ messages will still be shown).
+
+`--srcdir=DIR'
+ Look for the package's source code in directory DIR. Usually
+ `configure' can determine that directory automatically.
+
+`--prefix=DIR'
+ Use DIR as the installation prefix. *note Installation Names::
+ for more details, including other options available for fine-tuning
+ the installation locations.
+
+`--no-create'
+`-n'
+ Run the configure checks, but stop before creating any output
+ files.
+
+`configure' also accepts some other, not widely useful, options. Run
+`configure --help' for more details.
diff --git a/Makefile.am b/Makefile.am
index 76d6153..72104ba 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -5,7 +5,7 @@ SUBDIRS = src include m4 config docs tests
EXTRA_DIST = bootstrap.sh AUTHORS TODO NEWS ucto.pc.in ucto-icu.pc.in
pkgconfigdir = $(libdir)/pkgconfig
-pkgconfig_DATA = ucto.pc ucto-icu.pc
+pkgconfig_DATA = ucto.pc
ChangeLog: NEWS
git pull; git2cl > ChangeLog
diff --git a/Makefile.in b/Makefile.in
index 0fb55a4..d6652da 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -90,8 +90,7 @@ build_triplet = @build@
host_triplet = @host@
subdir = .
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -104,7 +103,7 @@ am__CONFIG_DISTCLEAN_FILES = config.status config.cache config.log \
configure.lineno config.status.lineno
mkinstalldirs = $(install_sh) -d
CONFIG_HEADER = config.h
-CONFIG_CLEAN_FILES = ucto.pc ucto-icu.pc
+CONFIG_CLEAN_FILES = ucto.pc
CONFIG_CLEAN_VPATH_FILES =
AM_V_P = $(am__v_P_ at AM_V@)
am__v_P_ = $(am__v_P_ at AM_DEFAULT_V@)
@@ -193,9 +192,9 @@ CTAGS = ctags
CSCOPE = cscope
DIST_SUBDIRS = $(SUBDIRS)
am__DIST_COMMON = $(srcdir)/Makefile.in $(srcdir)/config.h.in \
- $(srcdir)/ucto-icu.pc.in $(srcdir)/ucto.pc.in AUTHORS COPYING \
- ChangeLog NEWS README TODO compile config.guess config.sub \
- depcomp install-sh ltmain.sh missing
+ $(srcdir)/ucto.pc.in AUTHORS COPYING ChangeLog INSTALL NEWS \
+ TODO compile config.guess config.sub depcomp install-sh \
+ ltmain.sh missing
DISTFILES = $(DIST_COMMON) $(DIST_SOURCES) $(TEXINFOS) $(EXTRA_DIST)
distdir = $(PACKAGE)-$(VERSION)
top_distdir = $(distdir)
@@ -269,13 +268,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -366,6 +359,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
@@ -382,7 +376,7 @@ ACLOCAL_AMFLAGS = -I m4 --install
SUBDIRS = src include m4 config docs tests
EXTRA_DIST = bootstrap.sh AUTHORS TODO NEWS ucto.pc.in ucto-icu.pc.in
pkgconfigdir = $(libdir)/pkgconfig
-pkgconfig_DATA = ucto.pc ucto-icu.pc
+pkgconfig_DATA = ucto.pc
all: config.h
$(MAKE) $(AM_MAKEFLAGS) all-recursive
@@ -437,8 +431,6 @@ distclean-hdr:
-rm -f config.h stamp-h1
ucto.pc: $(top_builddir)/config.status $(srcdir)/ucto.pc.in
cd $(top_builddir) && $(SHELL) ./config.status $@
-ucto-icu.pc: $(top_builddir)/config.status $(srcdir)/ucto-icu.pc.in
- cd $(top_builddir) && $(SHELL) ./config.status $@
mostlyclean-libtool:
-rm -f *.lo
@@ -641,7 +633,7 @@ distdir: $(DISTFILES)
! -type d ! -perm -444 -exec $(install_sh) -c -m a+r {} {} \; \
|| chmod -R a+r "$(distdir)"
dist-gzip: distdir
- tardir=$(distdir) && $(am__tar) | eval GZIP= gzip $(GZIP_ENV) -c >$(distdir).tar.gz
+ tardir=$(distdir) && $(am__tar) | GZIP=$(GZIP_ENV) gzip -c >$(distdir).tar.gz
$(am__post_remove_distdir)
dist-bzip2: distdir
@@ -667,7 +659,7 @@ dist-shar: distdir
@echo WARNING: "Support for shar distribution archives is" \
"deprecated." >&2
@echo WARNING: "It will be removed altogether in Automake 2.0" >&2
- shar $(distdir) | eval GZIP= gzip $(GZIP_ENV) -c >$(distdir).shar.gz
+ shar $(distdir) | GZIP=$(GZIP_ENV) gzip -c >$(distdir).shar.gz
$(am__post_remove_distdir)
dist-zip: distdir
@@ -685,7 +677,7 @@ dist dist-all:
distcheck: dist
case '$(DIST_ARCHIVES)' in \
*.tar.gz*) \
- eval GZIP= gzip $(GZIP_ENV) -dc $(distdir).tar.gz | $(am__untar) ;;\
+ GZIP=$(GZIP_ENV) gzip -dc $(distdir).tar.gz | $(am__untar) ;;\
*.tar.bz2*) \
bzip2 -dc $(distdir).tar.bz2 | $(am__untar) ;;\
*.tar.lz*) \
@@ -695,7 +687,7 @@ distcheck: dist
*.tar.Z*) \
uncompress -c $(distdir).tar.Z | $(am__untar) ;;\
*.shar.gz*) \
- eval GZIP= gzip $(GZIP_ENV) -dc $(distdir).shar.gz | unshar ;;\
+ GZIP=$(GZIP_ENV) gzip -dc $(distdir).shar.gz | unshar ;;\
*.zip*) \
unzip $(distdir).zip ;;\
esac
diff --git a/NEWS b/NEWS
index b95d3ab..e747015 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,34 @@
+0.9.8 2017-10-23
+[Ko vd Sloot]
+Bugfix release.
+ * fixed utterance handling in FoLiA input. Don't try sentence detection!
+
+0.9.7 2017-10-17
+[Ko van der Sloot]
+ * added textredundancy option, default is 'minimal'
+ * small adaptations to work with FoLiA 1.5 specs
+ - set textclass on words when outputclass != inputclass
+ - DON'T filter special characters when inputclass == outputclass
+ * -F (folia input) is automatically set for .xml files
+ * more robust against texts with embedded tabs, etc.
+ * more and better tests added
+ * better logging and error messaging
+ * improved language handling. TODO: Language detection in FoLiA
+ * bug fixes:
+ - correctly handle xml-comment inside a <t>
+ - better id generation when parent has no id
+   - better reaction to overly long 'words'
+
+0.9.6 2017-01-23
+[Maarten van Gompel]
+* Moving data files from etc/ to share/, as they are more data files than
+ configuration files that should be edited.
+* Requires uctodata >= 0.4.
+* Should solve debian packaging issues (#18)
+* Minor updates to the manual (#2)
+* Some refactoring/code cleanup, temper expectations regarding ucto's
+ date-tagging abilities (#16, thanks also to @sanmai-NL)
+
0.9.5 2017-01-06
[Ko van der Sloot]
Bug fix release:
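
To make the 0.9.7 items above concrete: the following is a minimal command-line sketch of the new --textredundancy option combined with the automatic FoLiA handling for .xml files. The file names are made up, and the exact value syntax is an assumption that should be checked against `ucto --help`.

    # FoLiA input/output is inferred from the .xml extensions (0.9.7 and later);
    # 'full' keeps redundant text up to the highest structure level, the default is 'minimal'
    $ ucto -L eng --textredundancy=full input.folia.xml output.folia.xml
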
diff --git a/README b/README
deleted file mode 100644
index 1cfd7f8..0000000
--- a/README
+++ /dev/null
@@ -1,113 +0,0 @@
-[![Build Status](https://travis-ci.org/LanguageMachines/ucto.svg?branch=master)](https://travis-ci.org/LanguageMachines/ucto) [![Language Machines Badge](http://applejack.science.ru.nl/lamabadge.php/ucto)](http://applejack.science.ru.nl/languagemachines/)
-
-================================
-Ucto - A rule-based tokeniser
-================================
-
- Centre for Language and Speech technology, Radboud University Nijmegen
- Induction of Linguistic Knowledge Research Group, Tilburg University
-
-Website: https://languagemachines.github.io/ucto/
-
-Ucto tokenizes text files: it separates words from punctuation, and splits
-sentences. This is one of the first tasks for almost any Natural Language
-Processing application. Ucto offers several other basic preprocessing steps
-such as changing case that you can all use to make your text suited for further
-processing such as indexing, part-of-speech tagging, or machine translation.
-
-Ucto comes with tokenisation rules for several languages (packaged separately)
-and can be easily extended to suit other languages. It has been incorporated
-for tokenizing Dutch text in Frog (https://languagemachines.github.io/frog),
-our Dutch morpho-syntactic processor.
-
-The software is intended to be used from the command-line by researchers in
-Natural Language Processing or related areas, as well as software developers.
-An [Ucto python binding](https://github.com/proycon/python-ucto) is also available
-separately.
-
-Features:
-
-- Comes with tokenization rules for English, Dutch, French, Italian, Turkish,
- Spanish, Portuguese and Swedish; easily extendible to other languages. Rules
-  consist of regular expressions and lists. They are
- packaged separately as [uctodata](https://github.com/LanguageMachines/uctodata).
-- Recognizes units, currencies, abbreviations, and simple dates and times like dd-mm-yyyy
-- Recognizes paired quote spans, sentences, and paragraphs.
-- Produces UTF8 encoding and NFC output normalization, optionally accepting
- other input encodings as well.
-- Ligature normalization (can undo for instance fi,fl as single codepoints).
-- Optional conversion to all lowercase or uppercase.
-- Supports [FoLiA XML](https://proycon.github.io/folia)
-
-Ucto was written by Maarten van Gompel and Ko van der Sloot. Work on Ucto was
-funded by NWO, the Netherlands Organisation for Scientific Research, under the
-Implicit Linguistics project, the CLARIN-NL program, and the CLARIAH project.
-
-This software is available under the GNU Public License v3 (see the file
-COPYING).
-
-------------------------------------------------------------
-Installation
-------------------------------------------------------------
-
-To install ucto, first check whether your distribution's package manager has an up-to-date package for it.
-If not, for easy installation of ucto and all dependencies, it is included as part of our software
-distribution [LaMachine](https://proycon.github.io/LaMachine).
-
-To compile and install manually from source, provided you have all the
-dependencies installed:
-
- $ bash bootstrap.sh
- $ ./configure
- $ make
- $ sudo make install
-
-You will need current versions of the following dependencies of our software:
-
-* [ticcutils](https://github.com/LanguageMachine/ticcutils) - A shared utility library
-* [libfolia](https://github.com/LanguageMachines/libfolia) - A library for the FoLiA format.
-* [uctodata](https://github.com/LanguageMachines/uctodata) - Data files for ucto, packaged separately
-
-As well as the following 3rd party dependencies:
-
-* ``icu`` - A C++ library for Unicode and Globalization support. On Debian/Ubuntu systems, install the package libicu-dev.
-* ``libxml2`` - An XML library. On Debian/Ubuntu systems install the package libxml2-dev.
-* A sane build environment with a C++ compiler (e.g. gcc or clang), autotools, libtool, pkg-config
-
-------------------------------------------------------------
-Usage
-------------------------------------------------------------
-
-Tokenize an English text file to standard output; tokens will be
-space-separated, sentences delimited by ``<utt>``:
-
- $ ucto -L eng yourfile.txt
-
-The -L flag specifies the language (as a three letter iso-639-3 code), provided
-a configuration file exists for that language. The configurations are provided
-separately, for various languages, in the
-[uctodata](https://github.com/LanguageMachines/uctodata) package. Note that
-older versions of ucto used different two-letter codes, so you may need to
-update the way you invoke ucto.
-
-To output to file instead of standard output, just add another
-positional argument with the desired output filename.
-
-If you want each sentence on a separate line (i.e. newline delimited rather than delimited by
-``<utt>``), then pass the ``-n`` flag. If each sentence is already on one line
-in the input and you want to leave it at that, pass the ``-m`` flag.
-
-Tokenize plaintext to [FoLiA XML](https://proycon.github.io/folia) using the ``-X`` flag; you can specify an ID
-for the FoLiA document using the ``--id=`` flag.
-
- $ ucto -L eng -X --id=hamlet hamlet.txt hamlet.folia.xml
-
-Note that in the FoLiA XML output, ucto encodes the class of the token (date, url, smiley, etc...) based
-on the rule that matched.
-
-For further documentation, consult the [ucto
-manual](https://github.com/LanguageMachines/ucto/blob/master/docs/ucto_manual.pdf).
-
-
-
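
The usage section of the removed README above introduces the -L, -n, -X and --id flags one at a time; a short sketch of how they combine on the command line, with hypothetical file and document names:

    # one sentence per line (-n) instead of <utt>-delimited output
    $ ucto -L eng -n input.txt sentences.txt
    # plain text to FoLiA XML (-X) with an explicit document ID
    $ ucto -L eng -X --id=mydoc input.txt mydoc.folia.xml
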
diff --git a/aclocal.m4 b/aclocal.m4
index 0ce58dc..b5923f1 100644
--- a/aclocal.m4
+++ b/aclocal.m4
@@ -1150,7 +1150,6 @@ AC_SUBST([am__tar])
AC_SUBST([am__untar])
]) # _AM_PROG_TAR
-m4_include([m4/ax_icu_check.m4])
m4_include([m4/ax_lib_readline.m4])
m4_include([m4/libtool.m4])
m4_include([m4/ltoptions.m4])
diff --git a/bootstrap.sh b/bootstrap.sh
index 8a5b8bc..de12d31 100644
--- a/bootstrap.sh
+++ b/bootstrap.sh
@@ -1,6 +1,3 @@
-# $Id$
-# $URL$
-
# bootstrap - script to bootstrap the distribution rolling engine
# usage:
diff --git a/config.guess b/config.guess
index 6c32c86..2e9ad7f 100755
--- a/config.guess
+++ b/config.guess
@@ -1,8 +1,8 @@
#! /bin/sh
# Attempt to guess a canonical system name.
-# Copyright 1992-2014 Free Software Foundation, Inc.
+# Copyright 1992-2016 Free Software Foundation, Inc.
-timestamp='2014-11-04'
+timestamp='2016-10-02'
# This file is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
@@ -27,7 +27,7 @@ timestamp='2014-11-04'
# Originally written by Per Bothner; maintained since 2000 by Ben Elliston.
#
# You can get the latest version of this script from:
-# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
+# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess
#
# Please send patches to <config-patches at gnu.org>.
@@ -50,7 +50,7 @@ version="\
GNU config.guess ($timestamp)
Originally written by Per Bothner.
-Copyright 1992-2014 Free Software Foundation, Inc.
+Copyright 1992-2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE."
@@ -168,19 +168,29 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
# Note: NetBSD doesn't particularly care about the vendor
# portion of the name. We always set it to "unknown".
sysctl="sysctl -n hw.machine_arch"
- UNAME_MACHINE_ARCH=`(/sbin/$sysctl 2>/dev/null || \
- /usr/sbin/$sysctl 2>/dev/null || echo unknown)`
+ UNAME_MACHINE_ARCH=`(uname -p 2>/dev/null || \
+ /sbin/$sysctl 2>/dev/null || \
+ /usr/sbin/$sysctl 2>/dev/null || \
+ echo unknown)`
case "${UNAME_MACHINE_ARCH}" in
armeb) machine=armeb-unknown ;;
arm*) machine=arm-unknown ;;
sh3el) machine=shl-unknown ;;
sh3eb) machine=sh-unknown ;;
sh5el) machine=sh5le-unknown ;;
+ earmv*)
+ arch=`echo ${UNAME_MACHINE_ARCH} | sed -e 's,^e\(armv[0-9]\).*$,\1,'`
+ endian=`echo ${UNAME_MACHINE_ARCH} | sed -ne 's,^.*\(eb\)$,\1,p'`
+ machine=${arch}${endian}-unknown
+ ;;
*) machine=${UNAME_MACHINE_ARCH}-unknown ;;
esac
# The Operating System including object format, if it has switched
- # to ELF recently, or will in the future.
+ # to ELF recently (or will in the future) and ABI.
case "${UNAME_MACHINE_ARCH}" in
+ earm*)
+ os=netbsdelf
+ ;;
arm*|i386|m68k|ns32k|sh3*|sparc|vax)
eval $set_cc_for_build
if echo __ELF__ | $CC_FOR_BUILD -E - 2>/dev/null \
@@ -197,6 +207,13 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
os=netbsd
;;
esac
+ # Determine ABI tags.
+ case "${UNAME_MACHINE_ARCH}" in
+ earm*)
+ expr='s/^earmv[0-9]/-eabi/;s/eb$//'
+ abi=`echo ${UNAME_MACHINE_ARCH} | sed -e "$expr"`
+ ;;
+ esac
# The OS release
# Debian GNU/NetBSD machines have a different userland, and
# thus, need a distinct triplet. However, they do not need
@@ -207,13 +224,13 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
release='-gnu'
;;
*)
- release=`echo ${UNAME_RELEASE}|sed -e 's/[-_].*/\./'`
+ release=`echo ${UNAME_RELEASE} | sed -e 's/[-_].*//' | cut -d. -f1,2`
;;
esac
# Since CPU_TYPE-MANUFACTURER-KERNEL-OPERATING_SYSTEM:
# contains redundant information, the shorter form:
# CPU_TYPE-MANUFACTURER-OPERATING_SYSTEM is used.
- echo "${machine}-${os}${release}"
+ echo "${machine}-${os}${release}${abi}"
exit ;;
*:Bitrig:*:*)
UNAME_MACHINE_ARCH=`arch | sed 's/Bitrig.//'`
@@ -223,6 +240,10 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
UNAME_MACHINE_ARCH=`arch | sed 's/OpenBSD.//'`
echo ${UNAME_MACHINE_ARCH}-unknown-openbsd${UNAME_RELEASE}
exit ;;
+ *:LibertyBSD:*:*)
+ UNAME_MACHINE_ARCH=`arch | sed 's/^.*BSD\.//'`
+ echo ${UNAME_MACHINE_ARCH}-unknown-libertybsd${UNAME_RELEASE}
+ exit ;;
*:ekkoBSD:*:*)
echo ${UNAME_MACHINE}-unknown-ekkobsd${UNAME_RELEASE}
exit ;;
@@ -235,6 +256,9 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
*:MirBSD:*:*)
echo ${UNAME_MACHINE}-unknown-mirbsd${UNAME_RELEASE}
exit ;;
+ *:Sortix:*:*)
+ echo ${UNAME_MACHINE}-unknown-sortix
+ exit ;;
alpha:OSF1:*:*)
case $UNAME_RELEASE in
*4.0)
@@ -251,42 +275,42 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
ALPHA_CPU_TYPE=`/usr/sbin/psrinfo -v | sed -n -e 's/^ The alpha \(.*\) processor.*$/\1/p' | head -n 1`
case "$ALPHA_CPU_TYPE" in
"EV4 (21064)")
- UNAME_MACHINE="alpha" ;;
+ UNAME_MACHINE=alpha ;;
"EV4.5 (21064)")
- UNAME_MACHINE="alpha" ;;
+ UNAME_MACHINE=alpha ;;
"LCA4 (21066/21068)")
- UNAME_MACHINE="alpha" ;;
+ UNAME_MACHINE=alpha ;;
"EV5 (21164)")
- UNAME_MACHINE="alphaev5" ;;
+ UNAME_MACHINE=alphaev5 ;;
"EV5.6 (21164A)")
- UNAME_MACHINE="alphaev56" ;;
+ UNAME_MACHINE=alphaev56 ;;
"EV5.6 (21164PC)")
- UNAME_MACHINE="alphapca56" ;;
+ UNAME_MACHINE=alphapca56 ;;
"EV5.7 (21164PC)")
- UNAME_MACHINE="alphapca57" ;;
+ UNAME_MACHINE=alphapca57 ;;
"EV6 (21264)")
- UNAME_MACHINE="alphaev6" ;;
+ UNAME_MACHINE=alphaev6 ;;
"EV6.7 (21264A)")
- UNAME_MACHINE="alphaev67" ;;
+ UNAME_MACHINE=alphaev67 ;;
"EV6.8CB (21264C)")
- UNAME_MACHINE="alphaev68" ;;
+ UNAME_MACHINE=alphaev68 ;;
"EV6.8AL (21264B)")
- UNAME_MACHINE="alphaev68" ;;
+ UNAME_MACHINE=alphaev68 ;;
"EV6.8CX (21264D)")
- UNAME_MACHINE="alphaev68" ;;
+ UNAME_MACHINE=alphaev68 ;;
"EV6.9A (21264/EV69A)")
- UNAME_MACHINE="alphaev69" ;;
+ UNAME_MACHINE=alphaev69 ;;
"EV7 (21364)")
- UNAME_MACHINE="alphaev7" ;;
+ UNAME_MACHINE=alphaev7 ;;
"EV7.9 (21364A)")
- UNAME_MACHINE="alphaev79" ;;
+ UNAME_MACHINE=alphaev79 ;;
esac
# A Pn.n version is a patched version.
# A Vn.n version is a released version.
# A Tn.n version is a released field test version.
# A Xn.n version is an unreleased experimental baselevel.
# 1.2 uses "1.2" for uname -r.
- echo ${UNAME_MACHINE}-dec-osf`echo ${UNAME_RELEASE} | sed -e 's/^[PVTX]//' | tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz'`
+ echo ${UNAME_MACHINE}-dec-osf`echo ${UNAME_RELEASE} | sed -e 's/^[PVTX]//' | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz`
# Reset EXIT trap before exiting to avoid spurious non-zero exit code.
exitcode=$?
trap '' 0
@@ -359,16 +383,16 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
exit ;;
i86pc:SunOS:5.*:* | i86xen:SunOS:5.*:*)
eval $set_cc_for_build
- SUN_ARCH="i386"
+ SUN_ARCH=i386
# If there is a compiler, see if it is configured for 64-bit objects.
# Note that the Sun cc does not turn __LP64__ into 1 like gcc does.
# This test works for both compilers.
- if [ "$CC_FOR_BUILD" != 'no_compiler_found' ]; then
+ if [ "$CC_FOR_BUILD" != no_compiler_found ]; then
if (echo '#ifdef __amd64'; echo IS_64BIT_ARCH; echo '#endif') | \
- (CCOPTS= $CC_FOR_BUILD -E - 2>/dev/null) | \
+ (CCOPTS="" $CC_FOR_BUILD -E - 2>/dev/null) | \
grep IS_64BIT_ARCH >/dev/null
then
- SUN_ARCH="x86_64"
+ SUN_ARCH=x86_64
fi
fi
echo ${SUN_ARCH}-pc-solaris2`echo ${UNAME_RELEASE}|sed -e 's/[^.]*//'`
@@ -393,7 +417,7 @@ case "${UNAME_MACHINE}:${UNAME_SYSTEM}:${UNAME_RELEASE}:${UNAME_VERSION}" in
exit ;;
sun*:*:4.2BSD:*)
UNAME_RELEASE=`(sed 1q /etc/motd | awk '{print substr($5,1,3)}') 2>/dev/null`
- test "x${UNAME_RELEASE}" = "x" && UNAME_RELEASE=3
+ test "x${UNAME_RELEASE}" = x && UNAME_RELEASE=3
case "`/bin/arch`" in
sun3)
echo m68k-sun-sunos${UNAME_RELEASE}
@@ -618,13 +642,13 @@ EOF
sc_cpu_version=`/usr/bin/getconf SC_CPU_VERSION 2>/dev/null`
sc_kernel_bits=`/usr/bin/getconf SC_KERNEL_BITS 2>/dev/null`
case "${sc_cpu_version}" in
- 523) HP_ARCH="hppa1.0" ;; # CPU_PA_RISC1_0
- 528) HP_ARCH="hppa1.1" ;; # CPU_PA_RISC1_1
+ 523) HP_ARCH=hppa1.0 ;; # CPU_PA_RISC1_0
+ 528) HP_ARCH=hppa1.1 ;; # CPU_PA_RISC1_1
532) # CPU_PA_RISC2_0
case "${sc_kernel_bits}" in
- 32) HP_ARCH="hppa2.0n" ;;
- 64) HP_ARCH="hppa2.0w" ;;
- '') HP_ARCH="hppa2.0" ;; # HP-UX 10.20
+ 32) HP_ARCH=hppa2.0n ;;
+ 64) HP_ARCH=hppa2.0w ;;
+ '') HP_ARCH=hppa2.0 ;; # HP-UX 10.20
esac ;;
esac
fi
@@ -663,11 +687,11 @@ EOF
exit (0);
}
EOF
- (CCOPTS= $CC_FOR_BUILD -o $dummy $dummy.c 2>/dev/null) && HP_ARCH=`$dummy`
+ (CCOPTS="" $CC_FOR_BUILD -o $dummy $dummy.c 2>/dev/null) && HP_ARCH=`$dummy`
test -z "$HP_ARCH" && HP_ARCH=hppa
fi ;;
esac
- if [ ${HP_ARCH} = "hppa2.0w" ]
+ if [ ${HP_ARCH} = hppa2.0w ]
then
eval $set_cc_for_build
@@ -680,12 +704,12 @@ EOF
# $ CC_FOR_BUILD="cc +DA2.0w" ./config.guess
# => hppa64-hp-hpux11.23
- if echo __LP64__ | (CCOPTS= $CC_FOR_BUILD -E - 2>/dev/null) |
+ if echo __LP64__ | (CCOPTS="" $CC_FOR_BUILD -E - 2>/dev/null) |
grep -q __LP64__
then
- HP_ARCH="hppa2.0w"
+ HP_ARCH=hppa2.0w
else
- HP_ARCH="hppa64"
+ HP_ARCH=hppa64
fi
fi
echo ${HP_ARCH}-hp-hpux${HPUX_REV}
@@ -790,14 +814,14 @@ EOF
echo craynv-cray-unicosmp${UNAME_RELEASE} | sed -e 's/\.[^.]*$/.X/'
exit ;;
F30[01]:UNIX_System_V:*:* | F700:UNIX_System_V:*:*)
- FUJITSU_PROC=`uname -m | tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz'`
- FUJITSU_SYS=`uname -p | tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/\///'`
+ FUJITSU_PROC=`uname -m | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz`
+ FUJITSU_SYS=`uname -p | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz | sed -e 's/\///'`
FUJITSU_REL=`echo ${UNAME_RELEASE} | sed -e 's/ /_/'`
echo "${FUJITSU_PROC}-fujitsu-${FUJITSU_SYS}${FUJITSU_REL}"
exit ;;
5000:UNIX_System_V:4.*:*)
- FUJITSU_SYS=`uname -p | tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/\///'`
- FUJITSU_REL=`echo ${UNAME_RELEASE} | tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz' | sed -e 's/ /_/'`
+ FUJITSU_SYS=`uname -p | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz | sed -e 's/\///'`
+ FUJITSU_REL=`echo ${UNAME_RELEASE} | tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz | sed -e 's/ /_/'`
echo "sparc-fujitsu-${FUJITSU_SYS}${FUJITSU_REL}"
exit ;;
i*86:BSD/386:*:* | i*86:BSD/OS:*:* | *:Ascend\ Embedded/OS:*:*)
@@ -879,7 +903,7 @@ EOF
exit ;;
*:GNU/*:*:*)
# other systems with GNU libc and userland
- echo ${UNAME_MACHINE}-unknown-`echo ${UNAME_SYSTEM} | sed 's,^[^/]*/,,' | tr '[A-Z]' '[a-z]'``echo ${UNAME_RELEASE}|sed -e 's/[-(].*//'`-${LIBC}
+ echo ${UNAME_MACHINE}-unknown-`echo ${UNAME_SYSTEM} | sed 's,^[^/]*/,,' | tr "[:upper:]" "[:lower:]"``echo ${UNAME_RELEASE}|sed -e 's/[-(].*//'`-${LIBC}
exit ;;
i*86:Minix:*:*)
echo ${UNAME_MACHINE}-pc-minix
@@ -902,7 +926,7 @@ EOF
EV68*) UNAME_MACHINE=alphaev68 ;;
esac
objdump --private-headers /bin/sh | grep -q ld.so.1
- if test "$?" = 0 ; then LIBC="gnulibc1" ; fi
+ if test "$?" = 0 ; then LIBC=gnulibc1 ; fi
echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
exit ;;
arc:Linux:*:* | arceb:Linux:*:*)
@@ -933,6 +957,9 @@ EOF
crisv32:Linux:*:*)
echo ${UNAME_MACHINE}-axis-linux-${LIBC}
exit ;;
+ e2k:Linux:*:*)
+ echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
+ exit ;;
frv:Linux:*:*)
echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
exit ;;
@@ -945,6 +972,9 @@ EOF
ia64:Linux:*:*)
echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
exit ;;
+ k1om:Linux:*:*)
+ echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
+ exit ;;
m32r*:Linux:*:*)
echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
exit ;;
@@ -970,6 +1000,9 @@ EOF
eval `$CC_FOR_BUILD -E $dummy.c 2>/dev/null | grep '^CPU'`
test x"${CPU}" != x && { echo "${CPU}-unknown-linux-${LIBC}"; exit; }
;;
+ mips64el:Linux:*:*)
+ echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
+ exit ;;
openrisc*:Linux:*:*)
echo or1k-unknown-linux-${LIBC}
exit ;;
@@ -1002,6 +1035,9 @@ EOF
ppcle:Linux:*:*)
echo powerpcle-unknown-linux-${LIBC}
exit ;;
+ riscv32:Linux:*:* | riscv64:Linux:*:*)
+ echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
+ exit ;;
s390:Linux:*:* | s390x:Linux:*:*)
echo ${UNAME_MACHINE}-ibm-linux-${LIBC}
exit ;;
@@ -1021,7 +1057,7 @@ EOF
echo ${UNAME_MACHINE}-dec-linux-${LIBC}
exit ;;
x86_64:Linux:*:*)
- echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
+ echo ${UNAME_MACHINE}-pc-linux-${LIBC}
exit ;;
xtensa*:Linux:*:*)
echo ${UNAME_MACHINE}-unknown-linux-${LIBC}
@@ -1100,7 +1136,7 @@ EOF
# uname -m prints for DJGPP always 'pc', but it prints nothing about
# the processor, so we play safe by assuming i586.
# Note: whatever this is, it MUST be the same as what config.sub
- # prints for the "djgpp" host, or else GDB configury will decide that
+ # prints for the "djgpp" host, or else GDB configure will decide that
# this is a cross-build.
echo i586-pc-msdosdjgpp
exit ;;
@@ -1249,6 +1285,9 @@ EOF
SX-8R:SUPER-UX:*:*)
echo sx8r-nec-superux${UNAME_RELEASE}
exit ;;
+ SX-ACE:SUPER-UX:*:*)
+ echo sxace-nec-superux${UNAME_RELEASE}
+ exit ;;
Power*:Rhapsody:*:*)
echo powerpc-apple-rhapsody${UNAME_RELEASE}
exit ;;
@@ -1262,9 +1301,9 @@ EOF
UNAME_PROCESSOR=powerpc
fi
if test `echo "$UNAME_RELEASE" | sed -e 's/\..*//'` -le 10 ; then
- if [ "$CC_FOR_BUILD" != 'no_compiler_found' ]; then
+ if [ "$CC_FOR_BUILD" != no_compiler_found ]; then
if (echo '#ifdef __LP64__'; echo IS_64BIT_ARCH; echo '#endif') | \
- (CCOPTS= $CC_FOR_BUILD -E - 2>/dev/null) | \
+ (CCOPTS="" $CC_FOR_BUILD -E - 2>/dev/null) | \
grep IS_64BIT_ARCH >/dev/null
then
case $UNAME_PROCESSOR in
@@ -1286,7 +1325,7 @@ EOF
exit ;;
*:procnto*:*:* | *:QNX:[0123456789]*:*)
UNAME_PROCESSOR=`uname -p`
- if test "$UNAME_PROCESSOR" = "x86"; then
+ if test "$UNAME_PROCESSOR" = x86; then
UNAME_PROCESSOR=i386
UNAME_MACHINE=pc
fi
@@ -1317,7 +1356,7 @@ EOF
# "uname -m" is not consistent, so use $cputype instead. 386
# is converted to i386 for consistency with other x86
# operating systems.
- if test "$cputype" = "386"; then
+ if test "$cputype" = 386; then
UNAME_MACHINE=i386
else
UNAME_MACHINE="$cputype"
@@ -1359,7 +1398,7 @@ EOF
echo i386-pc-xenix
exit ;;
i*86:skyos:*:*)
- echo ${UNAME_MACHINE}-pc-skyos`echo ${UNAME_RELEASE}` | sed -e 's/ .*$//'
+ echo ${UNAME_MACHINE}-pc-skyos`echo ${UNAME_RELEASE} | sed -e 's/ .*$//'`
exit ;;
i*86:rdos:*:*)
echo ${UNAME_MACHINE}-pc-rdos
@@ -1370,23 +1409,25 @@ EOF
x86_64:VMkernel:*:*)
echo ${UNAME_MACHINE}-unknown-esx
exit ;;
+ amd64:Isilon\ OneFS:*:*)
+ echo x86_64-unknown-onefs
+ exit ;;
esac
cat >&2 <<EOF
$0: unable to guess system type
-This script, last modified $timestamp, has failed to recognize
-the operating system you are using. It is advised that you
-download the most up to date version of the config scripts from
+This script (version $timestamp), has failed to recognize the
+operating system you are using. If your script is old, overwrite
+config.guess and config.sub with the latest versions from:
- http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
+ http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess
and
- http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD
+ http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub
-If the version you run ($0) is already up to date, please
-send the following data and any information you think might be
-pertinent to <config-patches at gnu.org> in order to provide the needed
-information to handle your system.
+If $0 has already been updated, send the following data and any
+information you think might be pertinent to config-patches at gnu.org to
+provide the necessary information to handle your system.
config.guess timestamp = $timestamp
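
(For orientation: config.guess is the helper that configure runs to detect the build host. A minimal sketch of its effect, assuming an ordinary x86_64 glibc Linux build machine; the exact triplet naturally depends on the host:

    $ ./config.guess
    x86_64-pc-linux-gnu

With the x86_64:Linux hunk above, the vendor field on such hosts is now reported as 'pc' instead of 'unknown'.)
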
diff --git a/config.h.in b/config.h.in
index 57d94b1..31c5bb2 100644
--- a/config.h.in
+++ b/config.h.in
@@ -6,9 +6,6 @@
/* Define to 1 if you have the <history.h> header file. */
#undef HAVE_HISTORY_H
-/* we want to use ICU */
-#undef HAVE_ICU
-
/* Define to 1 if you have the <inttypes.h> header file. */
#undef HAVE_INTTYPES_H
diff --git a/config.sub b/config.sub
index 7ffe373..dd2ca93 100755
--- a/config.sub
+++ b/config.sub
@@ -1,8 +1,8 @@
#! /bin/sh
# Configuration validation subroutine script.
-# Copyright 1992-2014 Free Software Foundation, Inc.
+# Copyright 1992-2016 Free Software Foundation, Inc.
-timestamp='2014-12-03'
+timestamp='2016-11-04'
# This file is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
@@ -33,7 +33,7 @@ timestamp='2014-12-03'
# Otherwise, we print the canonical config type on stdout and succeed.
# You can get the latest version of this script from:
-# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD
+# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub
# This file is supposed to be the same for all GNU packages
# and recognize all the CPU types, system types and aliases
@@ -53,8 +53,7 @@ timestamp='2014-12-03'
me=`echo "$0" | sed -e 's,.*/,,'`
usage="\
-Usage: $0 [OPTION] CPU-MFR-OPSYS
- $0 [OPTION] ALIAS
+Usage: $0 [OPTION] CPU-MFR-OPSYS or ALIAS
Canonicalize a configuration name.
@@ -68,7 +67,7 @@ Report bugs and patches to <config-patches at gnu.org>."
version="\
GNU config.sub ($timestamp)
-Copyright 1992-2014 Free Software Foundation, Inc.
+Copyright 1992-2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE."
@@ -117,8 +116,8 @@ maybe_os=`echo $1 | sed 's/^\(.*\)-\([^-]*-[^-]*\)$/\2/'`
case $maybe_os in
nto-qnx* | linux-gnu* | linux-android* | linux-dietlibc | linux-newlib* | \
linux-musl* | linux-uclibc* | uclinux-uclibc* | uclinux-gnu* | kfreebsd*-gnu* | \
- knetbsd*-gnu* | netbsd*-gnu* | \
- kopensolaris*-gnu* | \
+ knetbsd*-gnu* | netbsd*-gnu* | netbsd*-eabi* | \
+ kopensolaris*-gnu* | cloudabi*-eabi* | \
storm-chaos* | os2-emx* | rtmk-nova*)
os=-$maybe_os
basic_machine=`echo $1 | sed 's/^\(.*\)-\([^-]*-[^-]*\)$/\1/'`
@@ -255,12 +254,13 @@ case $basic_machine in
| arc | arceb \
| arm | arm[bl]e | arme[lb] | armv[2-8] | armv[3-8][lb] | armv7[arm] \
| avr | avr32 \
+ | ba \
| be32 | be64 \
| bfin \
| c4x | c8051 | clipper \
| d10v | d30v | dlx | dsp16xx \
- | epiphany \
- | fido | fr30 | frv \
+ | e2k | epiphany \
+ | fido | fr30 | frv | ft32 \
| h8300 | h8500 | hppa | hppa1.[01] | hppa2.0 | hppa2.0[nw] | hppa64 \
| hexagon \
| i370 | i860 | i960 | ia64 \
@@ -301,11 +301,12 @@ case $basic_machine in
| open8 | or1k | or1knd | or32 \
| pdp10 | pdp11 | pj | pjl \
| powerpc | powerpc64 | powerpc64le | powerpcle \
+ | pru \
| pyramid \
| riscv32 | riscv64 \
| rl78 | rx \
| score \
- | sh | sh[1234] | sh[24]a | sh[24]aeb | sh[23]e | sh[34]eb | sheb | shbe | shle | sh[1234]le | sh3ele \
+ | sh | sh[1234] | sh[24]a | sh[24]aeb | sh[23]e | sh[234]eb | sheb | shbe | shle | sh[1234]le | sh3ele \
| sh64 | sh64le \
| sparc | sparc64 | sparc64b | sparc64v | sparc86x | sparclet | sparclite \
| sparcv8 | sparcv9 | sparcv9b | sparcv9v \
@@ -376,12 +377,13 @@ case $basic_machine in
| alphapca5[67]-* | alpha64pca5[67]-* | arc-* | arceb-* \
| arm-* | armbe-* | armle-* | armeb-* | armv*-* \
| avr-* | avr32-* \
+ | ba-* \
| be32-* | be64-* \
| bfin-* | bs2000-* \
| c[123]* | c30-* | [cjt]90-* | c4x-* \
| c8051-* | clipper-* | craynv-* | cydra-* \
| d10v-* | d30v-* | dlx-* \
- | elxsi-* \
+ | e2k-* | elxsi-* \
| f30[01]-* | f700-* | fido-* | fr30-* | frv-* | fx80-* \
| h8300-* | h8500-* \
| hppa-* | hppa1.[01]-* | hppa2.0-* | hppa2.0[nw]-* | hppa64-* \
@@ -427,13 +429,15 @@ case $basic_machine in
| orion-* \
| pdp10-* | pdp11-* | pj-* | pjl-* | pn-* | power-* \
| powerpc-* | powerpc64-* | powerpc64le-* | powerpcle-* \
+ | pru-* \
| pyramid-* \
+ | riscv32-* | riscv64-* \
| rl78-* | romp-* | rs6000-* | rx-* \
| sh-* | sh[1234]-* | sh[24]a-* | sh[24]aeb-* | sh[23]e-* | sh[34]eb-* | sheb-* | shbe-* \
| shle-* | sh[1234]le-* | sh3ele-* | sh64-* | sh64le-* \
| sparc-* | sparc64-* | sparc64b-* | sparc64v-* | sparc86x-* | sparclet-* \
| sparclite-* \
- | sparcv8-* | sparcv9-* | sparcv9b-* | sparcv9v-* | sv1-* | sx?-* \
+ | sparcv8-* | sparcv9-* | sparcv9b-* | sparcv9v-* | sv1-* | sx*-* \
| tahoe-* \
| tic30-* | tic4x-* | tic54x-* | tic55x-* | tic6x-* | tic80-* \
| tile*-* \
@@ -518,6 +522,9 @@ case $basic_machine in
basic_machine=i386-pc
os=-aros
;;
+ asmjs)
+ basic_machine=asmjs-unknown
+ ;;
aux)
basic_machine=m68k-apple
os=-aux
@@ -638,6 +645,14 @@ case $basic_machine in
basic_machine=m68k-bull
os=-sysv3
;;
+ e500v[12])
+ basic_machine=powerpc-unknown
+ os=$os"spe"
+ ;;
+ e500v[12]-*)
+ basic_machine=powerpc-`echo $basic_machine | sed 's/^[^-]*-//'`
+ os=$os"spe"
+ ;;
ebmon29k)
basic_machine=a29k-amd
os=-ebmon
@@ -1017,7 +1032,7 @@ case $basic_machine in
ppc-* | ppcbe-*)
basic_machine=powerpc-`echo $basic_machine | sed 's/^[^-]*-//'`
;;
- ppcle | powerpclittle | ppc-le | powerpc-little)
+ ppcle | powerpclittle)
basic_machine=powerpcle-unknown
;;
ppcle-* | powerpclittle-*)
@@ -1027,7 +1042,7 @@ case $basic_machine in
;;
ppc64-*) basic_machine=powerpc64-`echo $basic_machine | sed 's/^[^-]*-//'`
;;
- ppc64le | powerpc64little | ppc64-le | powerpc64-little)
+ ppc64le | powerpc64little)
basic_machine=powerpc64le-unknown
;;
ppc64le-* | powerpc64little-*)
@@ -1373,18 +1388,18 @@ case $os in
| -hpux* | -unos* | -osf* | -luna* | -dgux* | -auroraux* | -solaris* \
| -sym* | -kopensolaris* | -plan9* \
| -amigaos* | -amigados* | -msdos* | -newsos* | -unicos* | -aof* \
- | -aos* | -aros* \
+ | -aos* | -aros* | -cloudabi* | -sortix* \
| -nindy* | -vxsim* | -vxworks* | -ebmon* | -hms* | -mvs* \
| -clix* | -riscos* | -uniplus* | -iris* | -rtu* | -xenix* \
| -hiux* | -386bsd* | -knetbsd* | -mirbsd* | -netbsd* \
- | -bitrig* | -openbsd* | -solidbsd* \
+ | -bitrig* | -openbsd* | -solidbsd* | -libertybsd* \
| -ekkobsd* | -kfreebsd* | -freebsd* | -riscix* | -lynxos* \
| -bosx* | -nextstep* | -cxux* | -aout* | -elf* | -oabi* \
| -ptx* | -coff* | -ecoff* | -winnt* | -domain* | -vsta* \
| -udi* | -eabi* | -lites* | -ieee* | -go32* | -aux* \
| -chorusos* | -chorusrdb* | -cegcc* \
| -cygwin* | -msys* | -pe* | -psos* | -moss* | -proelf* | -rtems* \
- | -mingw32* | -mingw64* | -linux-gnu* | -linux-android* \
+ | -midipix* | -mingw32* | -mingw64* | -linux-gnu* | -linux-android* \
| -linux-newlib* | -linux-musl* | -linux-uclibc* \
| -uxpv* | -beos* | -mpeix* | -udk* | -moxiebox* \
| -interix* | -uwin* | -mks* | -rhapsody* | -darwin* | -opened* \
@@ -1393,7 +1408,8 @@ case $os in
| -os2* | -vos* | -palmos* | -uclinux* | -nucleus* \
| -morphos* | -superux* | -rtmk* | -rtmk-nova* | -windiss* \
| -powermax* | -dnix* | -nx6 | -nx7 | -sei* | -dragonfly* \
- | -skyos* | -haiku* | -rdos* | -toppers* | -drops* | -es* | -tirtos*)
+ | -skyos* | -haiku* | -rdos* | -toppers* | -drops* | -es* \
+ | -onefs* | -tirtos* | -phoenix* | -fuchsia*)
# Remember, each alternative MUST END IN *, to match a version number.
;;
-qnx*)
@@ -1525,6 +1541,8 @@ case $os in
;;
-nacl*)
;;
+ -ios)
+ ;;
-none)
;;
*)
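
(config.sub is the companion script that canonicalizes the triplet or alias a user passes to configure; its usage text above now reads "CPU-MFR-OPSYS or ALIAS". An illustrative pair of calls, with arbitrary example inputs not taken from this patch:

    $ ./config.sub amd64-linux
    x86_64-pc-linux-gnu
    $ ./config.sub x86_64-pc-linux-gnu
    x86_64-pc-linux-gnu

An alias is expanded to a full cpu-manufacturer-os triple; an already canonical name is printed back unchanged.)
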
diff --git a/config/Makefile.in b/config/Makefile.in
index 788454b..b71e2eb 100644
--- a/config/Makefile.in
+++ b/config/Makefile.in
@@ -90,8 +90,7 @@ build_triplet = @build@
host_triplet = @host@
subdir = config
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -185,13 +184,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -282,6 +275,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
diff --git a/configure b/configure
index 5876020..7920b2e 100755
--- a/configure
+++ b/configure
@@ -1,6 +1,6 @@
#! /bin/sh
# Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.69 for ucto 0.9.6.
+# Generated by GNU Autoconf 2.69 for ucto 0.9.8.
#
# Report bugs to <lamasoftware at science.ru.nl>.
#
@@ -590,8 +590,8 @@ MAKEFLAGS=
# Identity of this package.
PACKAGE_NAME='ucto'
PACKAGE_TARNAME='ucto'
-PACKAGE_VERSION='0.9.6'
-PACKAGE_STRING='ucto 0.9.6'
+PACKAGE_VERSION='0.9.8'
+PACKAGE_STRING='ucto 0.9.8'
PACKAGE_BUGREPORT='lamasoftware at science.ru.nl'
PACKAGE_URL=''
@@ -644,17 +644,11 @@ folia_LIBS
folia_CFLAGS
XML2_LIBS
XML2_CFLAGS
+ICU_LIBS
+ICU_CFLAGS
PKG_CONFIG_LIBDIR
PKG_CONFIG_PATH
PKG_CONFIG
-ICU_IOLIBS
-ICU_LIBS
-ICU_LIBPATH
-ICU_VERSION
-ICU_CPPSEARCHPATH
-ICU_CXXFLAGS
-ICU_CFLAGS
-ICU_CONFIG
CXXCPP
CPP
LT_SYS_LIBRARY_PATH
@@ -757,6 +751,7 @@ infodir
docdir
oldincludedir
includedir
+runstatedir
localstatedir
sharedstatedir
sysconfdir
@@ -790,8 +785,6 @@ with_gnu_ld
with_sysroot
enable_libtool_lock
with_icu
-with_folia
-with_ticcutils
'
ac_precious_vars='build_alias
host_alias
@@ -810,6 +803,8 @@ CXXCPP
PKG_CONFIG
PKG_CONFIG_PATH
PKG_CONFIG_LIBDIR
+ICU_CFLAGS
+ICU_LIBS
XML2_CFLAGS
XML2_LIBS
folia_CFLAGS
@@ -856,6 +851,7 @@ datadir='${datarootdir}'
sysconfdir='${prefix}/etc'
sharedstatedir='${prefix}/com'
localstatedir='${prefix}/var'
+runstatedir='${localstatedir}/run'
includedir='${prefix}/include'
oldincludedir='/usr/include'
docdir='${datarootdir}/doc/${PACKAGE_TARNAME}'
@@ -1108,6 +1104,15 @@ do
| -silent | --silent | --silen | --sile | --sil)
silent=yes ;;
+ -runstatedir | --runstatedir | --runstatedi | --runstated \
+ | --runstate | --runstat | --runsta | --runst | --runs \
+ | --run | --ru | --r)
+ ac_prev=runstatedir ;;
+ -runstatedir=* | --runstatedir=* | --runstatedi=* | --runstated=* \
+ | --runstate=* | --runstat=* | --runsta=* | --runst=* | --runs=* \
+ | --run=* | --ru=* | --r=*)
+ runstatedir=$ac_optarg ;;
+
-sbindir | --sbindir | --sbindi | --sbind | --sbin | --sbi | --sb)
ac_prev=sbindir ;;
-sbindir=* | --sbindir=* | --sbindi=* | --sbind=* | --sbin=* \
@@ -1245,7 +1250,7 @@ fi
for ac_var in exec_prefix prefix bindir sbindir libexecdir datarootdir \
datadir sysconfdir sharedstatedir localstatedir includedir \
oldincludedir docdir infodir htmldir dvidir pdfdir psdir \
- libdir localedir mandir
+ libdir localedir mandir runstatedir
do
eval ac_val=\$$ac_var
# Remove trailing slashes.
@@ -1358,7 +1363,7 @@ if test "$ac_init_help" = "long"; then
# Omit some internal or obsolete options to make the list less imposing.
# This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF
-\`configure' configures ucto 0.9.6 to adapt to many kinds of systems.
+\`configure' configures ucto 0.9.8 to adapt to many kinds of systems.
Usage: $0 [OPTION]... [VAR=VALUE]...
@@ -1398,6 +1403,7 @@ Fine tuning of the installation directories:
--sysconfdir=DIR read-only single-machine data [PREFIX/etc]
--sharedstatedir=DIR modifiable architecture-independent data [PREFIX/com]
--localstatedir=DIR modifiable single-machine data [PREFIX/var]
+ --runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
--libdir=DIR object code libraries [EPREFIX/lib]
--includedir=DIR C header files [PREFIX/include]
--oldincludedir=DIR C header files for non-gcc [/usr/include]
@@ -1428,7 +1434,7 @@ fi
if test -n "$ac_init_help"; then
case $ac_init_help in
- short | recursive ) echo "Configuration of ucto 0.9.6:";;
+ short | recursive ) echo "Configuration of ucto 0.9.8:";;
esac
cat <<\_ACEOF
@@ -1459,13 +1465,7 @@ Optional Packages:
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
--with-sysroot[=DIR] Search for dependent libraries within DIR (or the
compiler's sysroot if not specified).
- --with-icu=DIR use ICU installed in <DIR>
- --with-folia=DIR use libfolia installed in <DIR>;
- note that you can install folia in a non-default directory with
- ./configure --prefix=<DIR> in the folia installation directory
- --with-ticcutils=DIR use ticcutils installed in <DIR>;
- note that you can install ticcutils in a non-default directory with
- ./configure --prefix=<DIR> in the ticcutils installation directory
+ --with-icu=DIR use icu installed in <DIR>
Some influential environment variables:
CXX C++ compiler command
@@ -1486,6 +1486,8 @@ Some influential environment variables:
directories to add to pkg-config's search path
PKG_CONFIG_LIBDIR
path overriding pkg-config's built-in search path
+ ICU_CFLAGS C compiler flags for ICU, overriding pkg-config
+ ICU_LIBS linker flags for ICU, overriding pkg-config
XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
XML2_LIBS linker flags for XML2, overriding pkg-config
folia_CFLAGS
@@ -1566,7 +1568,7 @@ fi
test -n "$ac_init_help" && exit $ac_status
if $ac_init_version; then
cat <<\_ACEOF
-ucto configure 0.9.6
+ucto configure 0.9.8
generated by GNU Autoconf 2.69
Copyright (C) 2012 Free Software Foundation, Inc.
@@ -2186,7 +2188,7 @@ cat >config.log <<_ACEOF
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.
-It was created by ucto $as_me 0.9.6, which was
+It was created by ucto $as_me 0.9.8, which was
generated by GNU Autoconf 2.69. Invocation command line was
$ $0 $@
@@ -3049,7 +3051,7 @@ fi
# Define the identity of the package.
PACKAGE='ucto'
- VERSION='0.9.6'
+ VERSION='0.9.8'
cat >>confdefs.h <<_ACEOF
@@ -3168,7 +3170,7 @@ if test -z "$CXX"; then
CXX=$CCC
else
if test -n "$ac_tool_prefix"; then
- for ac_prog in g++ c++
+ for ac_prog in c++
do
# Extract the first word of "$ac_tool_prefix$ac_prog", so it can be a program name with args.
set dummy $ac_tool_prefix$ac_prog; ac_word=$2
@@ -3212,7 +3214,7 @@ fi
fi
if test -z "$CXX"; then
ac_ct_CXX=$CXX
- for ac_prog in g++ c++
+ for ac_prog in c++
do
# Extract the first word of "$ac_prog", so it can be a program name with args.
set dummy $ac_prog; ac_word=$2
@@ -5897,7 +5899,7 @@ linux* | k*bsd*-gnu | kopensolaris*-gnu | gnu*)
lt_cv_deplibs_check_method=pass_all
;;
-netbsd*)
+netbsd* | netbsdelf*-gnu)
if echo __ELF__ | $CC -E - | $GREP __ELF__ > /dev/null; then
lt_cv_deplibs_check_method='match_pattern /lib[^/]+(\.so\.[0-9]+\.[0-9]+|_pic\.a)$'
else
@@ -9601,6 +9603,9 @@ $as_echo_n "checking whether the $compiler linker ($LD) supports shared librarie
openbsd* | bitrig*)
with_gnu_ld=no
;;
+ linux* | k*bsd*-gnu | gnu*)
+ link_all_deplibs=no
+ ;;
esac
ld_shlibs=yes
@@ -9855,7 +9860,7 @@ _LT_EOF
fi
;;
- netbsd*)
+ netbsd* | netbsdelf*-gnu)
if echo __ELF__ | $CC -E - | $GREP __ELF__ >/dev/null; then
archive_cmds='$LD -Bshareable $libobjs $deplibs $linker_flags -o $lib'
wlarc=
@@ -10525,6 +10530,7 @@ $as_echo "$lt_cv_irix_exported_symbol" >&6; }
if test yes = "$lt_cv_irix_exported_symbol"; then
archive_expsym_cmds='$CC -shared $pic_flag $libobjs $deplibs $compiler_flags $wl-soname $wl$soname `test -n "$verstring" && func_echo_all "$wl-set_version $wl$verstring"` $wl-update_registry $wl$output_objdir/so_locations $wl-exports_file $wl$export_symbols -o $lib'
fi
+ link_all_deplibs=no
else
archive_cmds='$CC -shared $libobjs $deplibs $compiler_flags -soname $soname `test -n "$verstring" && func_echo_all "-set_version $verstring"` -update_registry $output_objdir/so_locations -o $lib'
archive_expsym_cmds='$CC -shared $libobjs $deplibs $compiler_flags -soname $soname `test -n "$verstring" && func_echo_all "-set_version $verstring"` -update_registry $output_objdir/so_locations -exports_file $export_symbols -o $lib'
@@ -10546,7 +10552,7 @@ $as_echo "$lt_cv_irix_exported_symbol" >&6; }
esac
;;
- netbsd*)
+ netbsd* | netbsdelf*-gnu)
if echo __ELF__ | $CC -E - | $GREP __ELF__ >/dev/null; then
archive_cmds='$LD -Bshareable -o $lib $libobjs $deplibs $linker_flags' # a.out
else
@@ -11661,6 +11667,18 @@ fi
dynamic_linker='GNU/Linux ld.so'
;;
+netbsdelf*-gnu)
+ version_type=linux
+ need_lib_prefix=no
+ need_version=no
+ library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}'
+ soname_spec='${libname}${release}${shared_ext}$major'
+ shlibpath_var=LD_LIBRARY_PATH
+ shlibpath_overrides_runpath=no
+ hardcode_into_libs=yes
+ dynamic_linker='NetBSD ld.elf_so'
+ ;;
+
netbsd*)
version_type=sunos
need_lib_prefix=no
@@ -14555,7 +14573,7 @@ lt_prog_compiler_static_CXX=
;;
esac
;;
- netbsd*)
+ netbsd* | netbsdelf*-gnu)
;;
*qnx* | *nto*)
# QNX uses GNU C++, but need to define -shared option too, otherwise
@@ -14930,6 +14948,9 @@ $as_echo_n "checking whether the $compiler linker ($LD) supports shared librarie
;;
esac
;;
+ linux* | k*bsd*-gnu | gnu*)
+ link_all_deplibs_CXX=no
+ ;;
*)
export_symbols_cmds_CXX='$NM $libobjs $convenience | $global_symbol_pipe | $SED '\''s/.* //'\'' | sort | uniq > $export_symbols'
;;
@@ -15623,6 +15644,18 @@ fi
dynamic_linker='GNU/Linux ld.so'
;;
+netbsdelf*-gnu)
+ version_type=linux
+ need_lib_prefix=no
+ need_version=no
+ library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}'
+ soname_spec='${libname}${release}${shared_ext}$major'
+ shlibpath_var=LD_LIBRARY_PATH
+ shlibpath_overrides_runpath=no
+ hardcode_into_libs=yes
+ dynamic_linker='NetBSD ld.elf_so'
+ ;;
+
netbsd*)
version_type=sunos
need_lib_prefix=no
@@ -16281,12 +16314,6 @@ done
fi
-# ugly hack when PKG_CONFIG_PATH isn't defined.
-# couldn't get it to work otherwise
-if test "x$PKG_CONFIG_PATH" = x; then
- export PKG_CONFIG_PATH=""
-fi
-#AC_MSG_NOTICE( [pkg-config search path:$PKG_CONFIG_PATH dus] )
for ac_header in libexttextcat/textcat.h
do :
ac_fn_cxx_check_header_mongrel "$LINENO" "libexttextcat/textcat.h" "ac_cv_header_libexttextcat_textcat_h" "$ac_includes_default"
@@ -16391,153 +16418,8 @@ $as_echo "$as_me: Unable to find textcat library. textcat support not available"
fi
-useICU=1;
-# inspired by feh-1.3.4/configure.ac. Tnx Tom Gilbert and feh hackers.
-
-# Check whether --with-icu was given.
-if test "${with_icu+set}" = set; then :
- withval=$with_icu; if test "$with_icu" = "no"; then
- useICU=0
- else
- CXXFLAGS="$CXXFLAGS -I$withval/include"
- LIBS="-L$withval/lib $LIBS"
- fi
-fi
-
-
-if test "$useICU" = "1"; then
-
- succeeded=no
-
- if test -z "$ICU_CONFIG"; then
- # Extract the first word of "icu-config", so it can be a program name with args.
-set dummy icu-config; ac_word=$2
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5
-$as_echo_n "checking for $ac_word... " >&6; }
-if ${ac_cv_path_ICU_CONFIG+:} false; then :
- $as_echo_n "(cached) " >&6
-else
- case $ICU_CONFIG in
- [\\/]* | ?:[\\/]*)
- ac_cv_path_ICU_CONFIG="$ICU_CONFIG" # Let the user override the test with a path.
- ;;
- *)
- as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
-for as_dir in $PATH
-do
- IFS=$as_save_IFS
- test -z "$as_dir" && as_dir=.
- for ac_exec_ext in '' $ac_executable_extensions; do
- if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then
- ac_cv_path_ICU_CONFIG="$as_dir/$ac_word$ac_exec_ext"
- $as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5
- break 2
- fi
-done
- done
-IFS=$as_save_IFS
-
- test -z "$ac_cv_path_ICU_CONFIG" && ac_cv_path_ICU_CONFIG="no"
- ;;
-esac
-fi
-ICU_CONFIG=$ac_cv_path_ICU_CONFIG
-if test -n "$ICU_CONFIG"; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICU_CONFIG" >&5
-$as_echo "$ICU_CONFIG" >&6; }
-else
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
-$as_echo "no" >&6; }
-fi
-
-
- fi
-
- if test "$ICU_CONFIG" = "no" ; then
- echo "*** The icu-config script could not be found. Make sure it is"
- echo "*** in your path, and that taglib is properly installed."
- echo "*** Or see http://www.icu-project.org/"
- else
- ICU_VERSION=`$ICU_CONFIG --version`
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ICU >= 5.2" >&5
-$as_echo_n "checking for ICU >= 5.2... " >&6; }
- VERSION_CHECK=`expr $ICU_VERSION \>\= 5.2`
- if test "$VERSION_CHECK" = "1" ; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
-$as_echo "yes" >&6; }
- succeeded=yes
-
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking ICU_CFLAGS" >&5
-$as_echo_n "checking ICU_CFLAGS... " >&6; }
- ICU_CFLAGS=`$ICU_CONFIG --cflags`
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICU_CFLAGS" >&5
-$as_echo "$ICU_CFLAGS" >&6; }
-
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking ICU_CPPSEARCHPATH" >&5
-$as_echo_n "checking ICU_CPPSEARCHPATH... " >&6; }
- ICU_CPPSEARCHPATH=`$ICU_CONFIG --cppflags-searchpath`
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICU_CPPSEARCHPATH" >&5
-$as_echo "$ICU_CPPSEARCHPATH" >&6; }
-
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking ICU_CXXFLAGS" >&5
-$as_echo_n "checking ICU_CXXFLAGS... " >&6; }
- ICU_CXXFLAGS=`$ICU_CONFIG --cxxflags`
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICU_CXXFLAGS" >&5
-$as_echo "$ICU_CXXFLAGS" >&6; }
-
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking ICU_LIBS" >&5
-$as_echo_n "checking ICU_LIBS... " >&6; }
- ICU_LIBS=`$ICU_CONFIG --ldflags-libsonly`
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICU_LIBS" >&5
-$as_echo "$ICU_LIBS" >&6; }
-
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking ICU_LIBPATH" >&5
-$as_echo_n "checking ICU_LIBPATH... " >&6; }
- ICU_LIBPATH=`$ICU_CONFIG --ldflags-searchpath`
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICU_LIBPATH" >&5
-$as_echo "$ICU_LIBPATH" >&6; }
-
- { $as_echo "$as_me:${as_lineno-$LINENO}: checking ICU_IOLIBS" >&5
-$as_echo_n "checking ICU_IOLIBS... " >&6; }
- ICU_IOLIBS=`$ICU_CONFIG --ldflags-icuio`
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICU_IOLIBS" >&5
-$as_echo "$ICU_IOLIBS" >&6; }
- else
- ICU_CFLAGS=""
- ICU_CXXFLAGS=""
- ICU_CPPSEARCHPATH=""
- ICU_LIBPATH=""
- ICU_LIBS=""
- ICU_IOLIBS=""
- ## If we have a custom action on failure, don't print errors, but
- ## do set a variable so people can do so.
-
- fi
-
-
-
-
-
-
-
-
- fi
-
- if test $succeeded = yes; then
- CXXFLAGS="$CXXFLAGS $ICU_CPPSEARCHPATH"
- LIBS="$ICU_LIBPATH $ICU_LIBS $ICU_IOLIBS $LIBS"
- else
- { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
-$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
-as_fn_error $? "\"No ICU development environment found. Please check if libicu-dev or the like is installed\"
-See \`config.log' for more details" "$LINENO" 5; }
- fi
-
-
-$as_echo "#define HAVE_ICU 1" >>confdefs.h
-
-else
- as_fn_error $? "\"ICU support is required\"" "$LINENO" 5
+if test $prefix = "NONE"; then
+ prefix="$ac_default_prefix"
fi
@@ -16661,6 +16543,114 @@ $as_echo "no" >&6; }
fi
fi
+if test "x$PKG_CONFIG_PATH" = x; then
+ export PKG_CONFIG_PATH="$prefix/lib/pkgconfig"
+else
+ export PKG_CONFIG_PATH="$prefix/lib/pkgconfig:$PKG_CONFIG_PATH"
+fi
+
+
+# Check whether --with-icu was given.
+if test "${with_icu+set}" = set; then :
+ withval=$with_icu; PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$withval/lib/pkgconfig"
+fi
+
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for ICU" >&5
+$as_echo_n "checking for ICU... " >&6; }
+
+if test -n "$ICU_CFLAGS"; then
+ pkg_cv_ICU_CFLAGS="$ICU_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"icu-uc >= 50 icu-io \""; } >&5
+ ($PKG_CONFIG --exists --print-errors "icu-uc >= 50 icu-io ") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_ICU_CFLAGS=`$PKG_CONFIG --cflags "icu-uc >= 50 icu-io " 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$ICU_LIBS"; then
+ pkg_cv_ICU_LIBS="$ICU_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"icu-uc >= 50 icu-io \""; } >&5
+ ($PKG_CONFIG --exists --print-errors "icu-uc >= 50 icu-io ") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_ICU_LIBS=`$PKG_CONFIG --libs "icu-uc >= 50 icu-io " 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ ICU_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "icu-uc >= 50 icu-io " 2>&1`
+ else
+ ICU_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "icu-uc >= 50 icu-io " 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$ICU_PKG_ERRORS" >&5
+
+ as_fn_error $? "Package requirements (icu-uc >= 50 icu-io ) were not met:
+
+$ICU_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables ICU_CFLAGS
+and ICU_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables ICU_CFLAGS
+and ICU_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+ ICU_CFLAGS=$pkg_cv_ICU_CFLAGS
+ ICU_LIBS=$pkg_cv_ICU_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+CXXFLAGS="$CXXFLAGS $ICU_CFLAGS"
+LIBS="$ICU_LIBS $LIBS"
+
+
pkg_failed=no
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for XML2" >&5
$as_echo_n "checking for XML2... " >&6; }
@@ -16755,15 +16745,6 @@ CXXFLAGS="$CXXFLAGS $XML2_CFLAGS"
LIBS="$LIBS $XML2_LIBS"
-# Check whether --with-folia was given.
-if test "${with_folia+set}" = set; then :
- withval=$with_folia; PKG_CONFIG_PATH="$withval/lib/pkgconfig:$PKG_CONFIG_PATH"
-else
- PKG_CONFIG_PATH="$prefix/lib/pkgconfig:$PKG_CONFIG_PATH"
-fi
-
-#AC_MSG_NOTICE( [pkg-config search path: $PKG_CONFIG_PATH] )
-
pkg_failed=no
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for folia" >&5
$as_echo_n "checking for folia... " >&6; }
@@ -16772,12 +16753,12 @@ if test -n "$folia_CFLAGS"; then
pkg_cv_folia_CFLAGS="$folia_CFLAGS"
elif test -n "$PKG_CONFIG"; then
if test -n "$PKG_CONFIG" && \
- { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"folia >= 1.0 \""; } >&5
- ($PKG_CONFIG --exists --print-errors "folia >= 1.0 ") 2>&5
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"folia >= 1.10 \""; } >&5
+ ($PKG_CONFIG --exists --print-errors "folia >= 1.10 ") 2>&5
ac_status=$?
$as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
test $ac_status = 0; }; then
- pkg_cv_folia_CFLAGS=`$PKG_CONFIG --cflags "folia >= 1.0 " 2>/dev/null`
+ pkg_cv_folia_CFLAGS=`$PKG_CONFIG --cflags "folia >= 1.10 " 2>/dev/null`
test "x$?" != "x0" && pkg_failed=yes
else
pkg_failed=yes
@@ -16789,12 +16770,12 @@ if test -n "$folia_LIBS"; then
pkg_cv_folia_LIBS="$folia_LIBS"
elif test -n "$PKG_CONFIG"; then
if test -n "$PKG_CONFIG" && \
- { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"folia >= 1.0 \""; } >&5
- ($PKG_CONFIG --exists --print-errors "folia >= 1.0 ") 2>&5
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"folia >= 1.10 \""; } >&5
+ ($PKG_CONFIG --exists --print-errors "folia >= 1.10 ") 2>&5
ac_status=$?
$as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
test $ac_status = 0; }; then
- pkg_cv_folia_LIBS=`$PKG_CONFIG --libs "folia >= 1.0 " 2>/dev/null`
+ pkg_cv_folia_LIBS=`$PKG_CONFIG --libs "folia >= 1.10 " 2>/dev/null`
test "x$?" != "x0" && pkg_failed=yes
else
pkg_failed=yes
@@ -16815,14 +16796,14 @@ else
_pkg_short_errors_supported=no
fi
if test $_pkg_short_errors_supported = yes; then
- folia_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "folia >= 1.0 " 2>&1`
+ folia_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "folia >= 1.10 " 2>&1`
else
- folia_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "folia >= 1.0 " 2>&1`
+ folia_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "folia >= 1.10 " 2>&1`
fi
# Put the nasty error message in config.log where it belongs
echo "$folia_PKG_ERRORS" >&5
- as_fn_error $? "Package requirements (folia >= 1.0 ) were not met:
+ as_fn_error $? "Package requirements (folia >= 1.10 ) were not met:
$folia_PKG_ERRORS
@@ -16858,15 +16839,6 @@ CXXFLAGS="$folia_CFLAGS $CXXFLAGS"
LIBS="$folia_LIBS $LIBS"
-# Check whether --with-ticcutils was given.
-if test "${with_ticcutils+set}" = set; then :
- withval=$with_ticcutils; PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$withval/lib/pkgconfig"
-else
- PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$prefix/lib/pkgconfig"
-fi
-
-# AC_MSG_NOTICE( [pkg-config search path: $PKG_CONFIG_PATH] )
-
pkg_failed=no
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for ticcutils" >&5
$as_echo_n "checking for ticcutils... " >&6; }
@@ -17129,7 +17101,7 @@ fi
fi
# Checks for library functions.
-ac_config_files="$ac_config_files Makefile ucto.pc ucto-icu.pc m4/Makefile config/Makefile docs/Makefile src/Makefile tests/Makefile include/Makefile include/ucto/Makefile"
+ac_config_files="$ac_config_files Makefile ucto.pc m4/Makefile config/Makefile docs/Makefile src/Makefile tests/Makefile include/Makefile include/ucto/Makefile"
cat >confcache <<\_ACEOF
# This file is a shell script that caches the results of configure
@@ -17665,7 +17637,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
# report actual input values of CONFIG_FILES etc. instead of their
# values after options handling.
ac_log="
-This file was extended by ucto $as_me 0.9.6, which was
+This file was extended by ucto $as_me 0.9.8, which was
generated by GNU Autoconf 2.69. Invocation command line was
CONFIG_FILES = $CONFIG_FILES
@@ -17731,7 +17703,7 @@ _ACEOF
cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
ac_cs_version="\\
-ucto config.status 0.9.6
+ucto config.status 0.9.8
configured by $0, generated by GNU Autoconf 2.69,
with options \\"\$ac_cs_config\\"
@@ -18246,7 +18218,6 @@ do
"libtool") CONFIG_COMMANDS="$CONFIG_COMMANDS libtool" ;;
"Makefile") CONFIG_FILES="$CONFIG_FILES Makefile" ;;
"ucto.pc") CONFIG_FILES="$CONFIG_FILES ucto.pc" ;;
- "ucto-icu.pc") CONFIG_FILES="$CONFIG_FILES ucto-icu.pc" ;;
"m4/Makefile") CONFIG_FILES="$CONFIG_FILES m4/Makefile" ;;
"config/Makefile") CONFIG_FILES="$CONFIG_FILES config/Makefile" ;;
"docs/Makefile") CONFIG_FILES="$CONFIG_FILES docs/Makefile" ;;
diff --git a/configure.ac b/configure.ac
index ca95513..604a15d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ(2.59)
-AC_INIT([ucto], [0.9.6], [lamasoftware at science.ru.nl])
+AC_INIT([ucto], [0.9.8], [lamasoftware at science.ru.nl])
AM_INIT_AUTOMAKE([foreign])
AC_CONFIG_SRCDIR([configure.ac])
AC_CONFIG_MACRO_DIR([m4])
@@ -19,7 +19,7 @@ else
fi
# Checks for programs.
-AC_PROG_CXX( [g++ c++] )
+AC_PROG_CXX( [c++] )
if $cxx_flags_were_set; then
CXXFLAGS=$CXXFLAGS
@@ -50,12 +50,6 @@ AC_TYPE_INT32_T
AX_LIB_READLINE
-# ugly hack when PKG_CONFIG_PATH isn't defined.
-# couldn't get it to work otherwise
-if test "x$PKG_CONFIG_PATH" = x; then
- export PKG_CONFIG_PATH=""
-fi
-#AC_MSG_NOTICE( [pkg-config search path:$PKG_CONFIG_PATH dus] )
AC_CHECK_HEADERS([libexttextcat/textcat.h],
[CXXFLAGS="$CXXFLAGS -I$prefix/include"],
[AC_CHECK_HEADERS([libtextcat/textcat.h],
@@ -67,49 +61,35 @@ AC_CHECK_HEADERS([libexttextcat/textcat.h],
AC_SEARCH_LIBS([textcat_Init],[exttextcat-2.0 exttextcat textcat],[AC_DEFINE(HAVE_TEXTCAT_LIB, 1, textcat_lib)],
[AC_MSG_NOTICE([Unable to find textcat library. textcat support not available])])
-useICU=1;
-# inspired by feh-1.3.4/configure.ac. Tnx Tom Gilbert and feh hackers.
-AC_ARG_WITH(icu,
- [ --with-icu=DIR use ICU installed in <DIR>],
- [if test "$with_icu" = "no"; then
- useICU=0
- else
- CXXFLAGS="$CXXFLAGS -I$withval/include"
- LIBS="-L$withval/lib $LIBS"
- fi] )
-
-if test "$useICU" = "1"; then
- AX_ICU_CHECK( [5.2],
- [CXXFLAGS="$CXXFLAGS $ICU_CPPSEARCHPATH"
- LIBS="$ICU_LIBPATH $ICU_LIBS $ICU_IOLIBS $LIBS"],
- [AC_MSG_FAILURE( "No ICU development environment found. Please check if libicu-dev or the like is installed" )] )
- AC_DEFINE(HAVE_ICU, 1, we want to use ICU )
+if test $prefix = "NONE"; then
+ prefix="$ac_default_prefix"
+fi
+
+PKG_PROG_PKG_CONFIG
+
+if test "x$PKG_CONFIG_PATH" = x; then
+ export PKG_CONFIG_PATH="$prefix/lib/pkgconfig"
else
- AC_MSG_ERROR("ICU support is required")
+ export PKG_CONFIG_PATH="$prefix/lib/pkgconfig:$PKG_CONFIG_PATH"
fi
+AC_ARG_WITH(icu,
+ [ --with-icu=DIR use icu installed in <DIR>],
+ [PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$withval/lib/pkgconfig"],
+ [])
+
+PKG_CHECK_MODULES([ICU], [icu-uc >= 50 icu-io] )
+CXXFLAGS="$CXXFLAGS $ICU_CFLAGS"
+LIBS="$ICU_LIBS $LIBS"
+
PKG_CHECK_MODULES([XML2], [libxml-2.0 >= 2.6.16] )
CXXFLAGS="$CXXFLAGS $XML2_CFLAGS"
LIBS="$LIBS $XML2_LIBS"
-AC_ARG_WITH(folia,
- [ --with-folia=DIR use libfolia installed in <DIR>;
- note that you can install folia in a non-default directory with
- ./configure --prefix=<DIR> in the folia installation directory],
- [PKG_CONFIG_PATH="$withval/lib/pkgconfig:$PKG_CONFIG_PATH"],
- [PKG_CONFIG_PATH="$prefix/lib/pkgconfig:$PKG_CONFIG_PATH"])
-#AC_MSG_NOTICE( [pkg-config search path: $PKG_CONFIG_PATH] )
-PKG_CHECK_MODULES([folia], [folia >= 1.0] )
+PKG_CHECK_MODULES([folia], [folia >= 1.10] )
CXXFLAGS="$folia_CFLAGS $CXXFLAGS"
LIBS="$folia_LIBS $LIBS"
-AC_ARG_WITH(ticcutils,
- [ --with-ticcutils=DIR use ticcutils installed in <DIR>;
- note that you can install ticcutils in a non-default directory with
- ./configure --prefix=<DIR> in the ticcutils installation directory],
- [PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$withval/lib/pkgconfig"],
- [PKG_CONFIG_PATH="$PKG_CONFIG_PATH:$prefix/lib/pkgconfig"])
-# AC_MSG_NOTICE( [pkg-config search path: $PKG_CONFIG_PATH] )
PKG_CHECK_MODULES([ticcutils], [ticcutils >= 0.6] )
CXXFLAGS="$CXXFLAGS $ticcutils_CFLAGS"
LIBS="$LIBS $ticcutils_LIBS"
@@ -135,7 +115,6 @@ PKG_CHECK_MODULES(
AC_OUTPUT([
Makefile
ucto.pc
- ucto-icu.pc
m4/Makefile
config/Makefile
docs/Makefile
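
(The net effect of this configure.ac change is that ICU is now located through pkg-config, requiring icu-uc >= 50 plus icu-io, instead of the old icu-config/AX_ICU_CHECK route, and --with-icu=DIR merely appends DIR/lib/pkgconfig to PKG_CONFIG_PATH. A sketch of the resulting user-facing knobs, assuming a hypothetical ICU installed under /opt/icu:

    # let pkg-config find the .pc files of a non-standard ICU
    $ ./configure --with-icu=/opt/icu
    # equivalent, without the option:
    $ PKG_CONFIG_PATH=/opt/icu/lib/pkgconfig ./configure
    # or bypass pkg-config entirely, as the generated help/error text suggests
    # (the -l names below are the usual ICU library names; adjust to the local install):
    $ ./configure ICU_CFLAGS="-I/opt/icu/include" \
                  ICU_LIBS="-L/opt/icu/lib -licuio -licuuc -licudata"

The same mechanism now covers libfolia and ticcutils: the old --with-folia and --with-ticcutils options are gone, and a non-standard install is reached by putting its lib/pkgconfig directory on PKG_CONFIG_PATH.)
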
diff --git a/docs/Makefile.in b/docs/Makefile.in
index f1784e5..29cf52a 100644
--- a/docs/Makefile.in
+++ b/docs/Makefile.in
@@ -92,8 +92,7 @@ build_triplet = @build@
host_triplet = @host@
subdir = docs
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -189,13 +188,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -286,6 +279,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
diff --git a/docs/ucto.1 b/docs/ucto.1
index ef02a3a..3c2c3c4 100644
--- a/docs/ucto.1
+++ b/docs/ucto.1
@@ -1,4 +1,4 @@
-.TH ucto 1 "2014 december 2"
+.TH ucto 1 "2017 may 10"
.SH NAME
ucto \- Unicode Tokenizer
@@ -40,11 +40,17 @@ disable filtering of special characters
.BR \-L " language"
.RS
- Automatically selects a configuration file by language code.
+ Automatically selects a configuration file by language code.
The language code is generally a three-letter iso-639-3 code.
For example, 'fra' will select the file tokconfig\(hyfra from the installation directory
.RE
+.BR \-\-detectlanguages =<lang1,lang2,...,langn>
+.RS
+try to detect all the specified languages. The default language will be 'lang1'.
+(only useful for FoLiA output)
+.RE
+
.BR \-l
.RS
Convert to all lowercase
@@ -60,6 +66,11 @@ Convert to all uppercase
Emit one sentence per line on output
.RE
+.BR \-\-normalize=class1,class2,...,classn
+.RS
+map all occurrences of tokens of class1,...,classn to their generic names, e.g. \-\-normalize=DATE will map all dates to the word {{DATE}}. Very useful to normalize tokens like URLs, dates, e\-mail addresses and so on.
+.RE
+
.BR \-m
.RS
Assume one sentence per line on input
@@ -72,7 +83,7 @@ Don't tokenize, but perform input decoding and simple token role detection
.BR \-\-filterpunct
.RS
-remove most of the punctuation from the output. (not from abreviations!)
+remove most of the punctuation from the output. (not from abbreviations and embedded punctuation like John's)
.RE
.B \-P
@@ -111,11 +122,23 @@ set Verbose mode
Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: \-nulPQvsS)
.RE
-.BR \-\-textclass "cls"
+.B \-\-inputclass "cls"
.RS
When tokenizing a FoLiA XML document, search for text nodes of class 'cls'
.RE
+.B \-\-outputclass "cls"
+.RS
+When tokenizing a FoLiA XML document, output the tokenized text in text nodes of class 'cls'
+.RE
+
+.B \-\-textclass "cls" (obsolete)
+.RS
+use 'cls' for input and output of text from FoLiA. (Equivalent to both \-\-inputclass='cls' and \-\-outputclass='cls')
+
+This option is obsolete and NOT recommended. Please use the separate \-\-inputclass= and \-\-outputclass options.
+.RE
+
.B \-X
.RS
Output FoLiA XML. (this disables usage of most other options: \-nulPQvsS)
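
(A few illustrative command lines for the new and renamed options documented above; the file names are placeholders, and the language codes assume the corresponding tokconfig files are installed:

    # select the French config and map all dates to the generic {{DATE}} token
    $ ucto -L fra --normalize=DATE input.txt output.txt
    # tokenize an existing FoLiA document, reading text of class 'OCR' and
    # writing the tokenized text into class 'current'
    $ ucto -F --inputclass=OCR --outputclass=current in.folia.xml out.folia.xml
    # produce FoLiA output and let ucto label sentences as Dutch or English,
    # with Dutch as the default language
    $ ucto -X --detectlanguages=nld,eng input.txt out.folia.xml
)
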
diff --git a/include/Makefile.in b/include/Makefile.in
index 4a6c158..69b750f 100644
--- a/include/Makefile.in
+++ b/include/Makefile.in
@@ -91,8 +91,7 @@ build_triplet = @build@
host_triplet = @host@
subdir = include
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -217,13 +216,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -314,6 +307,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
diff --git a/include/ucto/Makefile.in b/include/ucto/Makefile.in
index cda6ffd..489b330 100644
--- a/include/ucto/Makefile.in
+++ b/include/ucto/Makefile.in
@@ -90,8 +90,7 @@ build_triplet = @build@
host_triplet = @host@
subdir = include/ucto
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -204,13 +203,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -301,6 +294,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
diff --git a/include/ucto/setting.h b/include/ucto/setting.h
index 92cf1ac..6c630ca 100644
--- a/include/ucto/setting.h
+++ b/include/ucto/setting.h
@@ -94,6 +94,7 @@ namespace Tokenizer {
void add_rule( const UnicodeString&, const std::vector<UnicodeString>& );
void sortRules( std::map<UnicodeString, Rule *>&,
const std::vector<UnicodeString>& );
+ static std::set<std::string> installed_languages();
UnicodeString eosmarkers;
std::vector<Rule *> rules;
std::map<UnicodeString, Rule *> rulesmap;
diff --git a/include/ucto/textcat.h b/include/ucto/textcat.h
index 807a4dc..6a57720 100644
--- a/include/ucto/textcat.h
+++ b/include/ucto/textcat.h
@@ -55,7 +55,7 @@ extern "C" {
class TextCat {
public:
- TextCat( const std::string& cf );
+ explicit TextCat( const std::string& cf );
TextCat( const TextCat& in );
~TextCat();
bool isInit() const { return TC != 0; };
diff --git a/include/ucto/tokenize.h b/include/ucto/tokenize.h
index f2938e3..da70e1b 100644
--- a/include/ucto/tokenize.h
+++ b/include/ucto/tokenize.h
@@ -51,8 +51,7 @@ namespace Tokenizer {
BEGINQUOTE = 16,
ENDQUOTE = 32,
TEMPENDOFSENTENCE = 64,
- LISTITEM = 128, //reserved for future use
- TITLE = 256 //reserved for future use
+ LINEBREAK = 128
};
std::ostream& operator<<( std::ostream&, const TokenRole& );
@@ -149,6 +148,7 @@ namespace Tokenizer {
//return the sentence with the specified index in a Token vector;
std::vector<Token> getSentence( int );
+ void extractSentencesAndFlush( int, std::vector<Token>&, const std::string& );
//Get all sentences as a vector of strings (UTF-8 encoded)
std::vector<std::string> getSentences();
@@ -185,6 +185,10 @@ namespace Tokenizer {
bool setQuoteDetection( bool b=true ) { bool t = detectQuotes; detectQuotes = b; return t; }
bool getQuoteDetection() const { return detectQuotes; }
+ //Enable language detection
+ bool setLangDetection( bool b=true ) { bool t = doDetectLang; doDetectLang = b; return t; }
+ bool getLangDetection() const { return doDetectLang; }
+
//Enable filtering
bool setFiltering( bool b=true ) {
bool t = doFilter; doFilter = b; return t;
@@ -196,6 +200,8 @@ namespace Tokenizer {
}
bool getPunctFilter() const { return doPunctFilter; };
+ std::string setTextRedundancy( const std::string& );
+
// set normalization mode
std::string setNormalization( const std::string& s ) {
return normalizer.setMode( s );
@@ -227,6 +233,7 @@ namespace Tokenizer {
const std::string setTextClass( const std::string& cls) {
std::string res = inputclass;
inputclass = cls;
+ outputclass = cls;
return res;
}
const std::string getInputClass( ) const { return inputclass; }
@@ -261,6 +268,9 @@ namespace Tokenizer {
bool,
const std::string&,
const UnicodeString& ="" );
+ int tokenizeLine( const UnicodeString&,
+ const std::string&,
+ const std::string& );
bool detectEos( size_t, const UnicodeString&, const Quoting& ) const;
void detectSentenceBounds( const int offset,
@@ -276,7 +286,6 @@ namespace Tokenizer {
bool u_isquote( UChar32,
const Quoting& ) const;
std::string checkBOM( std::istream& );
- void outputTokensDoc( folia::Document&, const std::vector<Token>& ) const;
void outputTokensDoc_init( folia::Document& ) const;
int outputTokensXML( folia::FoliaElement *,
@@ -321,6 +330,13 @@ namespace Tokenizer {
//has a paragraph been signaled?
bool paragraphsignal;
+ //do we attempt to assign languages?
+ bool doDetectLang;
+
+ //do we percolate text up from <w> to <s> and <p> nodes? (FoLiA)
+ // values should be: 'full', 'minimal' or 'none'
+ std::string text_redundancy;
+
//one sentence per line output
bool sentenceperlineoutput;
bool sentenceperlineinput;
diff --git a/install-sh b/install-sh
index 0b0fdcb..59990a1 100755
--- a/install-sh
+++ b/install-sh
@@ -1,7 +1,7 @@
#!/bin/sh
# install - install a program, script, or datafile
-scriptversion=2013-12-25.23; # UTC
+scriptversion=2014-09-12.12; # UTC
# This originates from X11R5 (mit/util/scripts/install.sh), which was
# later released in X11R6 (xc/config/util/install.sh) with the
@@ -324,34 +324,41 @@ do
# is incompatible with FreeBSD 'install' when (umask & 300) != 0.
;;
*)
+ # $RANDOM is not portable (e.g. dash); use it when possible to
+ # lower collision chance
tmpdir=${TMPDIR-/tmp}/ins$RANDOM-$$
- trap 'ret=$?; rmdir "$tmpdir/d" "$tmpdir" 2>/dev/null; exit $ret' 0
+ trap 'ret=$?; rmdir "$tmpdir/a/b" "$tmpdir/a" "$tmpdir" 2>/dev/null; exit $ret' 0
+ # As "mkdir -p" follows symlinks and we work in /tmp possibly; so
+ # create the $tmpdir first (and fail if unsuccessful) to make sure
+ # that nobody tries to guess the $tmpdir name.
if (umask $mkdir_umask &&
- exec $mkdirprog $mkdir_mode -p -- "$tmpdir/d") >/dev/null 2>&1
+ $mkdirprog $mkdir_mode "$tmpdir" &&
+ exec $mkdirprog $mkdir_mode -p -- "$tmpdir/a/b") >/dev/null 2>&1
then
if test -z "$dir_arg" || {
# Check for POSIX incompatibilities with -m.
# HP-UX 11.23 and IRIX 6.5 mkdir -m -p sets group- or
# other-writable bit of parent directory when it shouldn't.
# FreeBSD 6.1 mkdir -m -p sets mode of existing directory.
- ls_ld_tmpdir=`ls -ld "$tmpdir"`
+ test_tmpdir="$tmpdir/a"
+ ls_ld_tmpdir=`ls -ld "$test_tmpdir"`
case $ls_ld_tmpdir in
d????-?r-*) different_mode=700;;
d????-?--*) different_mode=755;;
*) false;;
esac &&
- $mkdirprog -m$different_mode -p -- "$tmpdir" && {
- ls_ld_tmpdir_1=`ls -ld "$tmpdir"`
+ $mkdirprog -m$different_mode -p -- "$test_tmpdir" && {
+ ls_ld_tmpdir_1=`ls -ld "$test_tmpdir"`
test "$ls_ld_tmpdir" = "$ls_ld_tmpdir_1"
}
}
then posix_mkdir=:
fi
- rmdir "$tmpdir/d" "$tmpdir"
+ rmdir "$tmpdir/a/b" "$tmpdir/a" "$tmpdir"
else
# Remove any dirs left behind by ancient mkdir implementations.
- rmdir ./$mkdir_mode ./-p ./-- 2>/dev/null
+ rmdir ./$mkdir_mode ./-p ./-- "$tmpdir" 2>/dev/null
fi
trap '' 0;;
esac;;
diff --git a/ltmain.sh b/ltmain.sh
index 0f0a2da..a736cf9 100644
--- a/ltmain.sh
+++ b/ltmain.sh
@@ -31,7 +31,7 @@
PROGRAM=libtool
PACKAGE=libtool
-VERSION=2.4.6
+VERSION="2.4.6 Debian-2.4.6-2"
package_revision=2.4.6
@@ -2068,12 +2068,12 @@ include the following information:
compiler: $LTCC
compiler flags: $LTCFLAGS
linker: $LD (gnu? $with_gnu_ld)
- version: $progname (GNU libtool) 2.4.6
+ version: $progname $scriptversion Debian-2.4.6-2
automake: `($AUTOMAKE --version) 2>/dev/null |$SED 1q`
autoconf: `($AUTOCONF --version) 2>/dev/null |$SED 1q`
Report bugs to <bug-libtool at gnu.org>.
-GNU libtool home page: <http://www.gnu.org/software/libtool/>.
+GNU libtool home page: <http://www.gnu.org/s/libtool/>.
General help using GNU software: <http://www.gnu.org/gethelp/>."
exit 0
}
@@ -7272,10 +7272,13 @@ func_mode_link ()
# -tp=* Portland pgcc target processor selection
# --sysroot=* for sysroot support
# -O*, -g*, -flto*, -fwhopr*, -fuse-linker-plugin GCC link-time optimization
+ # -specs=* GCC specs files
# -stdlib=* select c++ std lib with clang
+ # -fsanitize=* Clang/GCC memory and address sanitizer
-64|-mips[0-9]|-r[0-9][0-9]*|-xarch=*|-xtarget=*|+DA*|+DD*|-q*|-m*| \
-t[45]*|-txscale*|-p|-pg|--coverage|-fprofile-*|-F*|@*|-tp=*|--sysroot=*| \
- -O*|-g*|-flto*|-fwhopr*|-fuse-linker-plugin|-fstack-protector*|-stdlib=*)
+ -O*|-g*|-flto*|-fwhopr*|-fuse-linker-plugin|-fstack-protector*|-stdlib=*| \
+ -specs=*|-fsanitize=*)
func_quote_for_eval "$arg"
arg=$func_quote_for_eval_result
func_append compile_command " $arg"
@@ -7568,7 +7571,10 @@ func_mode_link ()
case $pass in
dlopen) libs=$dlfiles ;;
dlpreopen) libs=$dlprefiles ;;
- link) libs="$deplibs %DEPLIBS% $dependency_libs" ;;
+ link)
+ libs="$deplibs %DEPLIBS%"
+ test "X$link_all_deplibs" != Xno && libs="$libs $dependency_libs"
+ ;;
esac
fi
if test lib,dlpreopen = "$linkmode,$pass"; then
@@ -7887,19 +7893,19 @@ func_mode_link ()
# It is a libtool convenience library, so add in its objects.
func_append convenience " $ladir/$objdir/$old_library"
func_append old_convenience " $ladir/$objdir/$old_library"
+ tmp_libs=
+ for deplib in $dependency_libs; do
+ deplibs="$deplib $deplibs"
+ if $opt_preserve_dup_deps; then
+ case "$tmp_libs " in
+ *" $deplib "*) func_append specialdeplibs " $deplib" ;;
+ esac
+ fi
+ func_append tmp_libs " $deplib"
+ done
elif test prog != "$linkmode" && test lib != "$linkmode"; then
func_fatal_error "'$lib' is not a convenience library"
fi
- tmp_libs=
- for deplib in $dependency_libs; do
- deplibs="$deplib $deplibs"
- if $opt_preserve_dup_deps; then
- case "$tmp_libs " in
- *" $deplib "*) func_append specialdeplibs " $deplib" ;;
- esac
- fi
- func_append tmp_libs " $deplib"
- done
continue
fi # $pass = conv
@@ -8823,6 +8829,9 @@ func_mode_link ()
revision=$number_minor
lt_irix_increment=no
;;
+ *)
+ func_fatal_configuration "$modename: unknown library version type '$version_type'"
+ ;;
esac
;;
no)
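
(The flag pass-through hunk above matters mainly for sanitizer and spec-file builds: libtool used to strip link flags it did not recognize, so something like the following, a hedged example assuming a GCC or Clang toolchain with ASan available, would previously link ucto and libucto without the sanitizer runtime:

    $ ./configure CXXFLAGS="-g -fsanitize=address" LDFLAGS="-fsanitize=address"
    $ make

With -specs=* and -fsanitize=* on the accepted list, both are now forwarded to the compiler driver at link time.)
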
diff --git a/m4/Makefile.in b/m4/Makefile.in
index 6fec252..823f8bc 100644
--- a/m4/Makefile.in
+++ b/m4/Makefile.in
@@ -92,8 +92,7 @@ build_triplet = @build@
host_triplet = @host@
subdir = m4
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -158,13 +157,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -255,6 +248,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
diff --git a/m4/ax_icu_check.m4 b/m4/ax_icu_check.m4
deleted file mode 100644
index 3ffe425..0000000
--- a/m4/ax_icu_check.m4
+++ /dev/null
@@ -1,86 +0,0 @@
-dnl @synopsis AX_ICU_CHECK([version], [action-if], [action-if-not])
-dnl
-dnl Test for ICU support
-dnl
-dnl This will define ICU_LIBS, ICU_CFLAGS, ICU_CXXFLAGS, ICU_IOLIBS.
-dnl
-dnl Based on ac_check_icu (http://autoconf-archive.cryp.to/ac_check_icu.html)
-dnl by Akos Maroy <darkeye at tyrell.hu>.
-dnl
-dnl Portions Copyright 2005 Akos Maroy <darkeye at tyrell.hu>
-dnl Copying and distribution of this file, with or without modification,
-dnl are permitted in any medium without royalty provided the copyright
-dnl notice and this notice are preserved.
-dnl
-dnl @author Hunter Morris <huntermorris at gmail.com>
-dnl @version 2008-03-18
-AC_DEFUN([AX_ICU_CHECK], [
- succeeded=no
-
- if test -z "$ICU_CONFIG"; then
- AC_PATH_PROG(ICU_CONFIG, icu-config, no)
- fi
-
- if test "$ICU_CONFIG" = "no" ; then
- echo "*** The icu-config script could not be found. Make sure it is"
- echo "*** in your path, and that taglib is properly installed."
- echo "*** Or see http://www.icu-project.org/"
- else
- ICU_VERSION=`$ICU_CONFIG --version`
- AC_MSG_CHECKING(for ICU >= $1)
- VERSION_CHECK=`expr $ICU_VERSION \>\= $1`
- if test "$VERSION_CHECK" = "1" ; then
- AC_MSG_RESULT(yes)
- succeeded=yes
-
- AC_MSG_CHECKING(ICU_CFLAGS)
- ICU_CFLAGS=`$ICU_CONFIG --cflags`
- AC_MSG_RESULT($ICU_CFLAGS)
-
- AC_MSG_CHECKING(ICU_CPPSEARCHPATH)
- ICU_CPPSEARCHPATH=`$ICU_CONFIG --cppflags-searchpath`
- AC_MSG_RESULT($ICU_CPPSEARCHPATH)
-
- AC_MSG_CHECKING(ICU_CXXFLAGS)
- ICU_CXXFLAGS=`$ICU_CONFIG --cxxflags`
- AC_MSG_RESULT($ICU_CXXFLAGS)
-
- AC_MSG_CHECKING(ICU_LIBS)
- ICU_LIBS=`$ICU_CONFIG --ldflags-libsonly`
- AC_MSG_RESULT($ICU_LIBS)
-
- AC_MSG_CHECKING(ICU_LIBPATH)
- ICU_LIBPATH=`$ICU_CONFIG --ldflags-searchpath`
- AC_MSG_RESULT($ICU_LIBPATH)
-
- AC_MSG_CHECKING(ICU_IOLIBS)
- ICU_IOLIBS=`$ICU_CONFIG --ldflags-icuio`
- AC_MSG_RESULT($ICU_IOLIBS)
- else
- ICU_CFLAGS=""
- ICU_CXXFLAGS=""
- ICU_CPPSEARCHPATH=""
- ICU_LIBPATH=""
- ICU_LIBS=""
- ICU_IOLIBS=""
- ## If we have a custom action on failure, don't print errors, but
- ## do set a variable so people can do so.
- ifelse([$3], ,echo "can't find ICU >= $1",)
- fi
-
- AC_SUBST(ICU_CFLAGS)
- AC_SUBST(ICU_CXXFLAGS)
- AC_SUBST(ICU_CPPSEARCHPATH)
- AC_SUBST(ICU_VERSION)
- AC_SUBST(ICU_LIBPATH)
- AC_SUBST(ICU_LIBS)
- AC_SUBST(ICU_IOLIBS)
- fi
-
- if test $succeeded = yes; then
- ifelse([$2], , :, [$2])
- else
- ifelse([$3], , AC_MSG_ERROR([Library requirements (ICU) not met.]), [$3])
- fi
-])
-
diff --git a/m4/libtool.m4 b/m4/libtool.m4
index a3bc337..10ab284 100644
--- a/m4/libtool.m4
+++ b/m4/libtool.m4
@@ -2887,6 +2887,18 @@ linux* | k*bsd*-gnu | kopensolaris*-gnu | gnu*)
dynamic_linker='GNU/Linux ld.so'
;;
+netbsdelf*-gnu)
+ version_type=linux
+ need_lib_prefix=no
+ need_version=no
+ library_names_spec='${libname}${release}${shared_ext}$versuffix ${libname}${release}${shared_ext}$major ${libname}${shared_ext}'
+ soname_spec='${libname}${release}${shared_ext}$major'
+ shlibpath_var=LD_LIBRARY_PATH
+ shlibpath_overrides_runpath=no
+ hardcode_into_libs=yes
+ dynamic_linker='NetBSD ld.elf_so'
+ ;;
+
netbsd*)
version_type=sunos
need_lib_prefix=no
@@ -3546,7 +3558,7 @@ linux* | k*bsd*-gnu | kopensolaris*-gnu | gnu*)
lt_cv_deplibs_check_method=pass_all
;;
-netbsd*)
+netbsd* | netbsdelf*-gnu)
if echo __ELF__ | $CC -E - | $GREP __ELF__ > /dev/null; then
lt_cv_deplibs_check_method='match_pattern /lib[[^/]]+(\.so\.[[0-9]]+\.[[0-9]]+|_pic\.a)$'
else
@@ -4424,7 +4436,7 @@ m4_if([$1], [CXX], [
;;
esac
;;
- netbsd*)
+ netbsd* | netbsdelf*-gnu)
;;
*qnx* | *nto*)
# QNX uses GNU C++, but need to define -shared option too, otherwise
@@ -4936,6 +4948,9 @@ m4_if([$1], [CXX], [
;;
esac
;;
+ linux* | k*bsd*-gnu | gnu*)
+ _LT_TAGVAR(link_all_deplibs, $1)=no
+ ;;
*)
_LT_TAGVAR(export_symbols_cmds, $1)='$NM $libobjs $convenience | $global_symbol_pipe | $SED '\''s/.* //'\'' | sort | uniq > $export_symbols'
;;
@@ -4998,6 +5013,9 @@ dnl Note also adjust exclude_expsyms for C++ above.
openbsd* | bitrig*)
with_gnu_ld=no
;;
+ linux* | k*bsd*-gnu | gnu*)
+ _LT_TAGVAR(link_all_deplibs, $1)=no
+ ;;
esac
_LT_TAGVAR(ld_shlibs, $1)=yes
@@ -5252,7 +5270,7 @@ _LT_EOF
fi
;;
- netbsd*)
+ netbsd* | netbsdelf*-gnu)
if echo __ELF__ | $CC -E - | $GREP __ELF__ >/dev/null; then
_LT_TAGVAR(archive_cmds, $1)='$LD -Bshareable $libobjs $deplibs $linker_flags -o $lib'
wlarc=
@@ -5773,6 +5791,7 @@ _LT_EOF
if test yes = "$lt_cv_irix_exported_symbol"; then
_LT_TAGVAR(archive_expsym_cmds, $1)='$CC -shared $pic_flag $libobjs $deplibs $compiler_flags $wl-soname $wl$soname `test -n "$verstring" && func_echo_all "$wl-set_version $wl$verstring"` $wl-update_registry $wl$output_objdir/so_locations $wl-exports_file $wl$export_symbols -o $lib'
fi
+ _LT_TAGVAR(link_all_deplibs, $1)=no
else
_LT_TAGVAR(archive_cmds, $1)='$CC -shared $libobjs $deplibs $compiler_flags -soname $soname `test -n "$verstring" && func_echo_all "-set_version $verstring"` -update_registry $output_objdir/so_locations -o $lib'
_LT_TAGVAR(archive_expsym_cmds, $1)='$CC -shared $libobjs $deplibs $compiler_flags -soname $soname `test -n "$verstring" && func_echo_all "-set_version $verstring"` -update_registry $output_objdir/so_locations -exports_file $export_symbols -o $lib'
@@ -5794,7 +5813,7 @@ _LT_EOF
esac
;;
- netbsd*)
+ netbsd* | netbsdelf*-gnu)
if echo __ELF__ | $CC -E - | $GREP __ELF__ >/dev/null; then
_LT_TAGVAR(archive_cmds, $1)='$LD -Bshareable -o $lib $libobjs $deplibs $linker_flags' # a.out
else
diff --git a/m4/ltsugar.m4 b/m4/ltsugar.m4
index 48bc934..9000a05 100644
--- a/m4/ltsugar.m4
+++ b/m4/ltsugar.m4
@@ -1,7 +1,6 @@
# ltsugar.m4 -- libtool m4 base layer. -*-Autoconf-*-
#
-# Copyright (C) 2004-2005, 2007-2008, 2011-2015 Free Software
-# Foundation, Inc.
+# Copyright (C) 2004, 2005, 2007, 2008 Free Software Foundation, Inc.
# Written by Gary V. Vaughan, 2004
#
# This file is free software; the Free Software Foundation gives
@@ -34,7 +33,7 @@ m4_define([_lt_join],
# ------------
# Manipulate m4 lists.
# These macros are necessary as long as will still need to support
-# Autoconf-2.59, which quotes differently.
+# Autoconf-2.59 which quotes differently.
m4_define([lt_car], [[$1]])
m4_define([lt_cdr],
[m4_if([$#], 0, [m4_fatal([$0: cannot be called without arguments])],
@@ -45,7 +44,7 @@ m4_define([lt_unquote], $1)
# lt_append(MACRO-NAME, STRING, [SEPARATOR])
# ------------------------------------------
-# Redefine MACRO-NAME to hold its former content plus 'SEPARATOR''STRING'.
+# Redefine MACRO-NAME to hold its former content plus `SEPARATOR'`STRING'.
# Note that neither SEPARATOR nor STRING are expanded; they are appended
# to MACRO-NAME as is (leaving the expansion for when MACRO-NAME is invoked).
# No SEPARATOR is output if MACRO-NAME was previously undefined (different
diff --git a/m4/lt~obsolete.m4 b/m4/lt~obsolete.m4
index c6b26f8..c573da9 100644
--- a/m4/lt~obsolete.m4
+++ b/m4/lt~obsolete.m4
@@ -1,7 +1,6 @@
# lt~obsolete.m4 -- aclocal satisfying obsolete definitions. -*-Autoconf-*-
#
-# Copyright (C) 2004-2005, 2007, 2009, 2011-2015 Free Software
-# Foundation, Inc.
+# Copyright (C) 2004, 2005, 2007, 2009 Free Software Foundation, Inc.
# Written by Scott James Remnant, 2004.
#
# This file is free software; the Free Software Foundation gives
@@ -12,7 +11,7 @@
# These exist entirely to fool aclocal when bootstrapping libtool.
#
-# In the past libtool.m4 has provided macros via AC_DEFUN (or AU_DEFUN),
+# In the past libtool.m4 has provided macros via AC_DEFUN (or AU_DEFUN)
# which have later been changed to m4_define as they aren't part of the
# exported API, or moved to Autoconf or Automake where they belong.
#
@@ -26,7 +25,7 @@
# included after everything else. This provides aclocal with the
# AC_DEFUNs it wants, but when m4 processes it, it doesn't do anything
# because those macros already exist, or will be overwritten later.
-# We use AC_DEFUN over AU_DEFUN for compatibility with aclocal-1.6.
+# We use AC_DEFUN over AU_DEFUN for compatibility with aclocal-1.6.
#
# Anytime we withdraw an AC_DEFUN or AU_DEFUN, remember to add it here.
# Yes, that means every name once taken will need to remain here until
diff --git a/m4/pkg.m4 b/m4/pkg.m4
index 82bea96..c5b26b5 100644
--- a/m4/pkg.m4
+++ b/m4/pkg.m4
@@ -1,60 +1,29 @@
-dnl pkg.m4 - Macros to locate and utilise pkg-config. -*- Autoconf -*-
-dnl serial 11 (pkg-config-0.29.1)
-dnl
-dnl Copyright © 2004 Scott James Remnant <scott at netsplit.com>.
-dnl Copyright © 2012-2015 Dan Nicholson <dbn.lists at gmail.com>
-dnl
-dnl This program is free software; you can redistribute it and/or modify
-dnl it under the terms of the GNU General Public License as published by
-dnl the Free Software Foundation; either version 2 of the License, or
-dnl (at your option) any later version.
-dnl
-dnl This program is distributed in the hope that it will be useful, but
-dnl WITHOUT ANY WARRANTY; without even the implied warranty of
-dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
-dnl General Public License for more details.
-dnl
-dnl You should have received a copy of the GNU General Public License
-dnl along with this program; if not, write to the Free Software
-dnl Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
-dnl 02111-1307, USA.
-dnl
-dnl As a special exception to the GNU General Public License, if you
-dnl distribute this file as part of a program that contains a
-dnl configuration script generated by Autoconf, you may include it under
-dnl the same distribution terms that you use for the rest of that
-dnl program.
-
-dnl PKG_PREREQ(MIN-VERSION)
-dnl -----------------------
-dnl Since: 0.29
-dnl
-dnl Verify that the version of the pkg-config macros are at least
-dnl MIN-VERSION. Unlike PKG_PROG_PKG_CONFIG, which checks the user's
-dnl installed version of pkg-config, this checks the developer's version
-dnl of pkg.m4 when generating configure.
-dnl
-dnl To ensure that this macro is defined, also add:
-dnl m4_ifndef([PKG_PREREQ],
-dnl [m4_fatal([must install pkg-config 0.29 or later before running autoconf/autogen])])
-dnl
-dnl See the "Since" comment for each macro you use to see what version
-dnl of the macros you require.
-m4_defun([PKG_PREREQ],
-[m4_define([PKG_MACROS_VERSION], [0.29.1])
-m4_if(m4_version_compare(PKG_MACROS_VERSION, [$1]), -1,
- [m4_fatal([pkg.m4 version $1 or higher is required but ]PKG_MACROS_VERSION[ found])])
-])dnl PKG_PREREQ
-
-dnl PKG_PROG_PKG_CONFIG([MIN-VERSION])
-dnl ----------------------------------
-dnl Since: 0.16
-dnl
-dnl Search for the pkg-config tool and set the PKG_CONFIG variable to
-dnl first found in the path. Checks that the version of pkg-config found
-dnl is at least MIN-VERSION. If MIN-VERSION is not specified, 0.9.0 is
-dnl used since that's the first version where most current features of
-dnl pkg-config existed.
+# pkg.m4 - Macros to locate and utilise pkg-config. -*- Autoconf -*-
+# serial 1 (pkg-config-0.24)
+#
+# Copyright © 2004 Scott James Remnant <scott at netsplit.com>.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+#
+# As a special exception to the GNU General Public License, if you
+# distribute this file as part of a program that contains a
+# configuration script generated by Autoconf, you may include it under
+# the same distribution terms that you use for the rest of that program.
+
+# PKG_PROG_PKG_CONFIG([MIN-VERSION])
+# ----------------------------------
AC_DEFUN([PKG_PROG_PKG_CONFIG],
[m4_pattern_forbid([^_?PKG_[A-Z_]+$])
m4_pattern_allow([^PKG_CONFIG(_(PATH|LIBDIR|SYSROOT_DIR|ALLOW_SYSTEM_(CFLAGS|LIBS)))?$])
@@ -76,19 +45,18 @@ if test -n "$PKG_CONFIG"; then
PKG_CONFIG=""
fi
fi[]dnl
-])dnl PKG_PROG_PKG_CONFIG
-
-dnl PKG_CHECK_EXISTS(MODULES, [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
-dnl -------------------------------------------------------------------
-dnl Since: 0.18
-dnl
-dnl Check to see whether a particular set of modules exists. Similar to
-dnl PKG_CHECK_MODULES(), but does not set variables or print errors.
-dnl
-dnl Please remember that m4 expands AC_REQUIRE([PKG_PROG_PKG_CONFIG])
-dnl only at the first occurence in configure.ac, so if the first place
-dnl it's called might be skipped (such as if it is within an "if", you
-dnl have to call PKG_CHECK_EXISTS manually
+])# PKG_PROG_PKG_CONFIG
+
+# PKG_CHECK_EXISTS(MODULES, [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
+#
+# Check to see whether a particular set of modules exists. Similar
+# to PKG_CHECK_MODULES(), but does not set variables or print errors.
+#
+# Please remember that m4 expands AC_REQUIRE([PKG_PROG_PKG_CONFIG])
+# only at the first occurence in configure.ac, so if the first place
+# it's called might be skipped (such as if it is within an "if", you
+# have to call PKG_CHECK_EXISTS manually
+# --------------------------------------------------------------
AC_DEFUN([PKG_CHECK_EXISTS],
[AC_REQUIRE([PKG_PROG_PKG_CONFIG])dnl
if test -n "$PKG_CONFIG" && \
@@ -98,10 +66,8 @@ m4_ifvaln([$3], [else
$3])dnl
fi])
-dnl _PKG_CONFIG([VARIABLE], [COMMAND], [MODULES])
-dnl ---------------------------------------------
-dnl Internal wrapper calling pkg-config via PKG_CONFIG and setting
-dnl pkg_failed based on the result.
+# _PKG_CONFIG([VARIABLE], [COMMAND], [MODULES])
+# ---------------------------------------------
m4_define([_PKG_CONFIG],
[if test -n "$$1"; then
pkg_cv_[]$1="$$1"
@@ -113,11 +79,10 @@ m4_define([_PKG_CONFIG],
else
pkg_failed=untried
fi[]dnl
-])dnl _PKG_CONFIG
+])# _PKG_CONFIG
-dnl _PKG_SHORT_ERRORS_SUPPORTED
-dnl ---------------------------
-dnl Internal check to see if pkg-config supports short errors.
+# _PKG_SHORT_ERRORS_SUPPORTED
+# -----------------------------
AC_DEFUN([_PKG_SHORT_ERRORS_SUPPORTED],
[AC_REQUIRE([PKG_PROG_PKG_CONFIG])
if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
@@ -125,17 +90,19 @@ if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
else
_pkg_short_errors_supported=no
fi[]dnl
-])dnl _PKG_SHORT_ERRORS_SUPPORTED
-
-
-dnl PKG_CHECK_MODULES(VARIABLE-PREFIX, MODULES, [ACTION-IF-FOUND],
-dnl [ACTION-IF-NOT-FOUND])
-dnl --------------------------------------------------------------
-dnl Since: 0.4.0
-dnl
-dnl Note that if there is a possibility the first call to
-dnl PKG_CHECK_MODULES might not happen, you should be sure to include an
-dnl explicit call to PKG_PROG_PKG_CONFIG in your configure.ac
+])# _PKG_SHORT_ERRORS_SUPPORTED
+
+
+# PKG_CHECK_MODULES(VARIABLE-PREFIX, MODULES, [ACTION-IF-FOUND],
+# [ACTION-IF-NOT-FOUND])
+#
+#
+# Note that if there is a possibility the first call to
+# PKG_CHECK_MODULES might not happen, you should be sure to include an
+# explicit call to PKG_PROG_PKG_CONFIG in your configure.ac
+#
+#
+# --------------------------------------------------------------
AC_DEFUN([PKG_CHECK_MODULES],
[AC_REQUIRE([PKG_PROG_PKG_CONFIG])dnl
AC_ARG_VAR([$1][_CFLAGS], [C compiler flags for $1, overriding pkg-config])dnl
@@ -189,40 +156,16 @@ else
AC_MSG_RESULT([yes])
$3
fi[]dnl
-])dnl PKG_CHECK_MODULES
-
-
-dnl PKG_CHECK_MODULES_STATIC(VARIABLE-PREFIX, MODULES, [ACTION-IF-FOUND],
-dnl [ACTION-IF-NOT-FOUND])
-dnl ---------------------------------------------------------------------
-dnl Since: 0.29
-dnl
-dnl Checks for existence of MODULES and gathers its build flags with
-dnl static libraries enabled. Sets VARIABLE-PREFIX_CFLAGS from --cflags
-dnl and VARIABLE-PREFIX_LIBS from --libs.
-dnl
-dnl Note that if there is a possibility the first call to
-dnl PKG_CHECK_MODULES_STATIC might not happen, you should be sure to
-dnl include an explicit call to PKG_PROG_PKG_CONFIG in your
-dnl configure.ac.
-AC_DEFUN([PKG_CHECK_MODULES_STATIC],
-[AC_REQUIRE([PKG_PROG_PKG_CONFIG])dnl
-_save_PKG_CONFIG=$PKG_CONFIG
-PKG_CONFIG="$PKG_CONFIG --static"
-PKG_CHECK_MODULES($@)
-PKG_CONFIG=$_save_PKG_CONFIG[]dnl
-])dnl PKG_CHECK_MODULES_STATIC
+])# PKG_CHECK_MODULES
-dnl PKG_INSTALLDIR([DIRECTORY])
-dnl -------------------------
-dnl Since: 0.27
-dnl
-dnl Substitutes the variable pkgconfigdir as the location where a module
-dnl should install pkg-config .pc files. By default the directory is
-dnl $libdir/pkgconfig, but the default can be changed by passing
-dnl DIRECTORY. The user can override through the --with-pkgconfigdir
-dnl parameter.
+# PKG_INSTALLDIR(DIRECTORY)
+# -------------------------
+# Substitutes the variable pkgconfigdir as the location where a module
+# should install pkg-config .pc files. By default the directory is
+# $libdir/pkgconfig, but the default can be changed by passing
+# DIRECTORY. The user can override through the --with-pkgconfigdir
+# parameter.
AC_DEFUN([PKG_INSTALLDIR],
[m4_pushdef([pkg_default], [m4_default([$1], ['${libdir}/pkgconfig'])])
m4_pushdef([pkg_description],
@@ -233,18 +176,16 @@ AC_ARG_WITH([pkgconfigdir],
AC_SUBST([pkgconfigdir], [$with_pkgconfigdir])
m4_popdef([pkg_default])
m4_popdef([pkg_description])
-])dnl PKG_INSTALLDIR
+]) dnl PKG_INSTALLDIR
-dnl PKG_NOARCH_INSTALLDIR([DIRECTORY])
-dnl --------------------------------
-dnl Since: 0.27
-dnl
-dnl Substitutes the variable noarch_pkgconfigdir as the location where a
-dnl module should install arch-independent pkg-config .pc files. By
-dnl default the directory is $datadir/pkgconfig, but the default can be
-dnl changed by passing DIRECTORY. The user can override through the
-dnl --with-noarch-pkgconfigdir parameter.
+# PKG_NOARCH_INSTALLDIR(DIRECTORY)
+# -------------------------
+# Substitutes the variable noarch_pkgconfigdir as the location where a
+# module should install arch-independent pkg-config .pc files. By
+# default the directory is $datadir/pkgconfig, but the default can be
+# changed by passing DIRECTORY. The user can override through the
+# --with-noarch-pkgconfigdir parameter.
AC_DEFUN([PKG_NOARCH_INSTALLDIR],
[m4_pushdef([pkg_default], [m4_default([$1], ['${datadir}/pkgconfig'])])
m4_pushdef([pkg_description],
@@ -255,15 +196,13 @@ AC_ARG_WITH([noarch-pkgconfigdir],
AC_SUBST([noarch_pkgconfigdir], [$with_noarch_pkgconfigdir])
m4_popdef([pkg_default])
m4_popdef([pkg_description])
-])dnl PKG_NOARCH_INSTALLDIR
+]) dnl PKG_NOARCH_INSTALLDIR
-dnl PKG_CHECK_VAR(VARIABLE, MODULE, CONFIG-VARIABLE,
-dnl [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
-dnl -------------------------------------------
-dnl Since: 0.28
-dnl
-dnl Retrieves the value of the pkg-config variable for the given module.
+# PKG_CHECK_VAR(VARIABLE, MODULE, CONFIG-VARIABLE,
+# [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
+# -------------------------------------------
+# Retrieves the value of the pkg-config variable for the given module.
AC_DEFUN([PKG_CHECK_VAR],
[AC_REQUIRE([PKG_PROG_PKG_CONFIG])dnl
AC_ARG_VAR([$1], [value of $3 for $2, overriding pkg-config])dnl
@@ -272,4 +211,4 @@ _PKG_CONFIG([$1], [variable="][$3]["], [$2])
AS_VAR_COPY([$1], [pkg_cv_][$1])
AS_VAR_IF([$1], [""], [$5], [$4])dnl
-])dnl PKG_CHECK_VAR
+])# PKG_CHECK_VAR
diff --git a/src/Makefile.am b/src/Makefile.am
index 74ff1ae..32693e7 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -1,8 +1,5 @@
-# $Id$
-# $URL $
-
AM_CPPFLAGS = -I at top_srcdir@/include
-AM_CXXFLAGS = -DSYSCONF_PATH=\"$(datadir)\" -std=c++0x # -Weffc++
+AM_CXXFLAGS = -DSYSCONF_PATH=\"$(datadir)\" -std=c++0x -W -Wall -pedantic -O3 -g
bin_PROGRAMS = ucto
diff --git a/src/Makefile.in b/src/Makefile.in
index 29a9f70..d73786d 100644
--- a/src/Makefile.in
+++ b/src/Makefile.in
@@ -14,9 +14,6 @@
@SET_MAKE@
-# $Id$
-# $URL $
-
VPATH = @srcdir@
am__is_gnu_make = { \
@@ -95,8 +92,7 @@ host_triplet = @host@
bin_PROGRAMS = ucto$(EXEEXT)
subdir = src
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -423,13 +419,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -520,6 +510,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
@@ -533,7 +524,7 @@ top_srcdir = @top_srcdir@
uctodata_CFLAGS = @uctodata_CFLAGS@
uctodata_LIBS = @uctodata_LIBS@
AM_CPPFLAGS = -I at top_srcdir@/include
-AM_CXXFLAGS = -DSYSCONF_PATH=\"$(datadir)\" -std=c++0x # -Weffc++
+AM_CXXFLAGS = -DSYSCONF_PATH=\"$(datadir)\" -std=c++0x -W -Wall -pedantic -O3 -g
LDADD = libucto.la
ucto_SOURCES = ucto.cxx
lib_LTLIBRARIES = libucto.la
diff --git a/src/setting.cxx b/src/setting.cxx
index e87f466..9d7ec69 100644
--- a/src/setting.cxx
+++ b/src/setting.cxx
@@ -117,7 +117,7 @@ namespace Tokenizer {
class uLogicError: public std::logic_error {
public:
- uLogicError( const string& s ): logic_error( "ucto: logic error:" + s ){};
+ explicit uLogicError( const string& s ): logic_error( "ucto: logic error:" + s ){};
};
ostream& operator<<( ostream& os, const Quoting& q ){
@@ -230,12 +230,27 @@ namespace Tokenizer {
delete rule;
}
rulesmap.clear();
- delete theErrLog;
+ }
+
+ set<string> Setting::installed_languages() {
+ // we only return 'languages' which are installed as 'tokconfig-*'
+ //
+ vector<string> files = TiCC::searchFilesMatch( defaultConfigDir, "tokconfig-*" );
+ set<string> result;
+ for ( auto const& f : files ){
+ string base = TiCC::basename(f);
+ size_t pos = base.find("tokconfig-");
+ if ( pos == 0 ){
+ string lang = base.substr( 10 );
+ result.insert( lang );
+ }
+ }
+ return result;
}
bool Setting::readrules( const string& fname ){
if ( tokDebug > 0 ){
- *theErrLog << "%include " << fname << endl;
+ LOG << "%include " << fname << endl;
}
ifstream f( fname );
if ( !f ){
@@ -248,7 +263,7 @@ namespace Tokenizer {
line.trim();
if ((line.length() > 0) && (line[0] != '#')) {
if ( tokDebug >= 5 ){
- *theErrLog << "include line = " << rawline << endl;
+ LOG << "include line = " << rawline << endl;
}
const int splitpoint = line.indexOf("=");
if ( splitpoint < 0 ){
@@ -266,14 +281,14 @@ namespace Tokenizer {
bool Setting::readfilters( const string& fname ){
if ( tokDebug > 0 ){
- *theErrLog << "%include " << fname << endl;
+ LOG << "%include " << fname << endl;
}
return filter.fill( fname );
}
bool Setting::readquotes( const string& fname ){
if ( tokDebug > 0 ){
- *theErrLog << "%include " << fname << endl;
+ LOG << "%include " << fname << endl;
}
ifstream f( fname );
if ( !f ){
@@ -286,7 +301,7 @@ namespace Tokenizer {
line.trim();
if ((line.length() > 0) && (line[0] != '#')) {
if ( tokDebug >= 5 ){
- *theErrLog << "include line = " << rawline << endl;
+ LOG << "include line = " << rawline << endl;
}
int splitpoint = line.indexOf(" ");
if ( splitpoint == -1 )
@@ -314,7 +329,7 @@ namespace Tokenizer {
bool Setting::readeosmarkers( const string& fname ){
if ( tokDebug > 0 ){
- *theErrLog << "%include " << fname << endl;
+ LOG << "%include " << fname << endl;
}
ifstream f( fname );
if ( !f ){
@@ -327,7 +342,7 @@ namespace Tokenizer {
line.trim();
if ((line.length() > 0) && (line[0] != '#')) {
if ( tokDebug >= 5 ){
- *theErrLog << "include line = " << rawline << endl;
+ LOG << "include line = " << rawline << endl;
}
if ( ( line.startsWith("\\u") && line.length() == 6 ) ||
( line.startsWith("\\U") && line.length() == 10 ) ){
@@ -346,7 +361,7 @@ namespace Tokenizer {
bool Setting::readabbreviations( const string& fname,
UnicodeString& abbreviations ){
if ( tokDebug > 0 ){
- *theErrLog << "%include " << fname << endl;
+ LOG << "%include " << fname << endl;
}
ifstream f( fname );
if ( !f ){
@@ -359,7 +374,7 @@ namespace Tokenizer {
line.trim();
if ((line.length() > 0) && (line[0] != '#')) {
if ( tokDebug >= 5 ){
- *theErrLog << "include line = " << rawline << endl;
+ LOG << "include line = " << rawline << endl;
}
if ( !abbreviations.isEmpty())
abbreviations += '|';
@@ -661,7 +676,7 @@ namespace Tokenizer {
}
break;
default:
- throw uLogicError("unhandled case in switch");
+ throw uLogicError( "unhandled case in switch" );
}
}
}
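For reference, a minimal usage sketch (not part of the patch) for the new
Setting::installed_languages() member added above: it returns the language
codes for which a "tokconfig-<lang>" file is installed in the default
configuration directory. The include path and build flags are assumptions.

    // sketch only: list the tokconfig languages ucto can find
    #include <iostream>
    #include <set>
    #include <string>
    #include "ucto/setting.h"

    int main() {
      std::set<std::string> langs = Tokenizer::Setting::installed_languages();
      for ( const auto& lang : langs ) {
        std::cout << lang << std::endl;
      }
      return 0;
    }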
diff --git a/src/textcat.cxx b/src/textcat.cxx
index ae5d97d..3fa0040 100644
--- a/src/textcat.cxx
+++ b/src/textcat.cxx
@@ -71,11 +71,11 @@ string TextCat::get_language( const string& in ) const {
#else
TextCat::~TextCat() {}
-TextCat::TextCat( const std::string& cf ) {
+TextCat::TextCat( const std::string& cf ): TC(0) {
throw runtime_error( "TextCat::TextCat(" + cf + "): TextCat Support not available" );
}
-TextCat::TextCat( const TextCat& in ) {
+TextCat::TextCat( const TextCat& in ): TC(0) {
throw runtime_error( "TextCat::TextCat(): TextCat Support not available" );
}
diff --git a/src/tokenize.cxx b/src/tokenize.cxx
index 502078a..274e1b8 100644
--- a/src/tokenize.cxx
+++ b/src/tokenize.cxx
@@ -74,17 +74,17 @@ namespace Tokenizer {
class uRangeError: public std::out_of_range {
public:
- uRangeError( const string& s ): out_of_range( "ucto: out of range:" + s ){};
+ explicit uRangeError( const string& s ): out_of_range( "ucto: out of range:" + s ){};
};
class uLogicError: public std::logic_error {
public:
- uLogicError( const string& s ): logic_error( "ucto: logic error:" + s ){};
+ explicit uLogicError( const string& s ): logic_error( "ucto: logic error:" + s ){};
};
class uCodingError: public std::runtime_error {
public:
- uCodingError( const string& s ): runtime_error( "ucto: coding problem:" + s ){};
+ explicit uCodingError( const string& s ): runtime_error( "ucto: coding problem:" + s ){};
};
@@ -153,18 +153,21 @@ namespace Tokenizer {
doPunctFilter(false),
detectPar(true),
paragraphsignal(true),
+ doDetectLang(false),
+ text_redundancy("minimal"),
sentenceperlineoutput(false),
sentenceperlineinput(false),
lowercase(false),
uppercase(false),
xmlout(false),
+ xmlin(false),
passthru(false),
inputclass("current"),
outputclass("current"),
tc( 0 )
{
- theErrLog = new TiCC::LogStream(cerr);
- theErrLog->setstamp( NoStamp );
+ theErrLog = new TiCC::LogStream(cerr, "ucto" );
+ theErrLog->setstamp( StampMessage );
#ifdef ENABLE_TEXTCAT
string textcat_cfg = string(SYSCONF_PATH) + "/ucto/textcat.cfg";
tc = new TextCat( textcat_cfg );
@@ -172,8 +175,21 @@ namespace Tokenizer {
}
TokenizerClass::~TokenizerClass(){
- // delete setting;
+ Setting *d = 0;
+ for ( const auto& s : settings ){
+ if ( s.first == "default" ){
+ // the 'default' may also return as a real 'language'
+ // avoid deleting it twice
+ d = s.second;
+ delete d;
+ }
+ if ( s.second != d ){
+ delete s.second;
+ }
+
+ }
delete theErrLog;
+ delete tc;
}
bool TokenizerClass::reset( const string& lang ){
@@ -204,6 +220,18 @@ namespace Tokenizer {
return old;
}
+ string TokenizerClass::setTextRedundancy( const std::string& tr ){
+ if ( tr == "none" || tr == "minimal" || tr == "full" ){
+ string s = text_redundancy;
+ text_redundancy = tr;
+ return s;
+ }
+ else {
+ throw runtime_error( "illegal value '" + tr + "' for textredundancy. "
+ "expected 'full', 'minimal' or 'none'." );
+ }
+ }
+
void stripCR( string& s ){
string::size_type pos = s.rfind( '\r' );
if ( pos != string::npos ){
@@ -211,6 +239,63 @@ namespace Tokenizer {
}
}
+ void TokenizerClass::extractSentencesAndFlush( int numS,
+ vector<Token>& outputTokens,
+ const string& lang ){
+ int count = 0;
+ const int size = tokens.size();
+ short quotelevel = 0;
+ size_t begin = 0;
+ size_t end = 0;
+ for ( int i = 0; i < size; ++i ) {
+ if (tokens[i].role & NEWPARAGRAPH) {
+ quotelevel = 0;
+ }
+ else if (tokens[i].role & ENDQUOTE) {
+ --quotelevel;
+ }
+ if ( (tokens[i].role & BEGINOFSENTENCE)
+ && (quotelevel == 0)) {
+ begin = i;
+ }
+ //FBK: QUOTELEVEL GOES UP BEFORE begin IS UPDATED... RESULTS IN DUPLICATE OUTPUT
+ if (tokens[i].role & BEGINQUOTE) {
+ ++quotelevel;
+ }
+ if ((tokens[i].role & ENDOFSENTENCE) && (quotelevel == 0)) {
+ end = i+1;
+ tokens[begin].role |= BEGINOFSENTENCE; //sanity check
+ if (tokDebug >= 1){
+ LOG << "[tokenize] extracted sentence " << count << ", begin="<<begin << ",end="<< end << endl;
+ }
+ for ( size_t i=begin; i < end; ++i ){
+ outputTokens.push_back( tokens[i] );
+ }
+ if ( ++count == numS ){
+ if (tokDebug >= 1){
+ LOG << "[tokenize] erase " << end << " tokens from " << tokens.size() << endl;
+ }
+ tokens.erase( tokens.begin(),tokens.begin()+end );
+ if ( !passthru ){
+ if ( !settings[lang]->quotes.emptyStack() ) {
+ settings[lang]->quotes.flushStack( end );
+ }
+ }
+ //After flushing, the first token still in buffer (if any) is always a BEGINOFSENTENCE:
+ if (!tokens.empty()) {
+ tokens[0].role |= BEGINOFSENTENCE;
+ }
+ return;
+ }
+ }
+ }
+ if ( count < numS ){
+ throw uRangeError( "Not enough sentences exists in the buffer: ("
+ + toString( count ) + " found. " + toString( numS)
+ + " wanted)" );
+ }
+ }
+
vector<Token> TokenizerClass::tokenizeStream( istream& IN,
const string& lang ) {
vector<Token> outputTokens;
@@ -289,7 +374,7 @@ namespace Tokenizer {
}
language = lan;
}
- tokenizeLine( input_line, language );
+ tokenizeLine( input_line, language, "" );
}
numS = countSentences(); //count full sentences in token buffer
}
@@ -297,15 +382,7 @@ namespace Tokenizer {
if ( tokDebug > 0 ){
LOG << "[tokenize] " << numS << " sentence(s) in buffer, processing..." << endl;
}
- for (int i = 0; i < numS; i++) {
- vector<Token> v = getSentence( i );
- outputTokens.insert( outputTokens.end(), v.begin(), v.end() );
- }
- // clear processed sentences from buffer
- if ( tokDebug > 0 ){
- LOG << "[tokenize] flushing " << numS << " sentence(s) from buffer..." << endl;
- }
- flushSentences(numS, lang );
+ extractSentencesAndFlush( numS, outputTokens, lang );
return outputTokens;
}
else {
@@ -355,7 +432,7 @@ namespace Tokenizer {
if ( passthru )
passthruLine( line, bos );
else
- tokenizeLine( line );
+ tokenizeLine( line, lang );
numS = countSentences(); //count full sentences in token buffer
}
if ( numS > 0 ) {
@@ -385,7 +462,7 @@ namespace Tokenizer {
folia::Document *TokenizerClass::tokenize( istream& IN ) {
inputEncoding = checkBOM( IN );
folia::Document *doc = new folia::Document( "id='" + docid + "'" );
- if ( default_language != "none" ){
+ if ( /*doDetectLang &&*/ default_language != "none" ){
if ( tokDebug > 0 ){
LOG << "[tokenize](stream): SET document language=" << default_language << endl;
}
@@ -396,16 +473,23 @@ namespace Tokenizer {
int parCount = 0;
vector<Token> buffer;
do {
- vector<Token> v = tokenizeStream( IN );
- for ( auto const& token : v ) {
- if ( token.role & NEWPARAGRAPH) {
- //process the buffer
- parCount = outputTokensXML( root, buffer, parCount );
- buffer.clear();
- }
- buffer.push_back( token );
+ if ( tokDebug > 0 ){
+ LOG << "[tokenize] looping on stream" << endl;
+ }
+ vector<Token> v = tokenizeStream( IN );
+ for ( auto const& token : v ) {
+ if ( token.role & NEWPARAGRAPH) {
+ //process the buffer
+ parCount = outputTokensXML( root, buffer, parCount );
+ buffer.clear();
}
- } while ( IN );
+ buffer.push_back( token );
+ }
+ }
+ while ( IN );
+ if ( tokDebug > 0 ){
+ LOG << "[tokenize] end of stream reached" << endl;
+ }
if (!buffer.empty()){
outputTokensXML( root, buffer, parCount);
}
@@ -427,8 +511,8 @@ namespace Tokenizer {
else {
IN = new ifstream( ifile );
if ( !IN || !IN->good() ){
- cerr << "Error: problems opening inputfile " << ifile << endl;
- cerr << "Courageously refusing to start..." << endl;
+ cerr << "ucto: problems opening inputfile " << ifile << endl;
+ cerr << "ucto: Courageously refusing to start..." << endl;
throw runtime_error( "unable to find or read file: '" + ifile + "'" );
}
}
@@ -437,6 +521,11 @@ namespace Tokenizer {
else {
folia::Document doc;
doc.readFromFile(ifile);
+ if ( xmlin && inputclass == outputclass ){
+ LOG << "ucto: --filter=NO is automatically set. inputclass equals outputclass!"
+ << endl;
+ setFiltering(false);
+ }
this->tokenize(doc);
*OUT << doc << endl;
}
@@ -490,12 +579,18 @@ namespace Tokenizer {
int i = 0;
inputEncoding = checkBOM( IN );
do {
+ if ( tokDebug > 0 ){
+ LOG << "[tokenize] looping on stream" << endl;
+ }
vector<Token> v = tokenizeStream( IN );
if ( !v.empty() ) {
outputTokens( OUT, v , (i>0) );
}
++i;
} while ( IN );
+ if ( tokDebug > 0 ){
+ LOG << "[tokenize] end_of_stream" << endl;
+ }
OUT << endl;
}
}
@@ -504,16 +599,29 @@ namespace Tokenizer {
if ( tokDebug >= 2 ){
LOG << "tokenize doc " << doc << endl;
}
- string lan = doc.doc()->language();
- if ( lan.empty() && default_language != "none" ){
- if ( tokDebug > 1 ){
- LOG << "[tokenize](FoLiA) SET document language=" << default_language << endl;
- }
- doc.set_metadata( "language", default_language );
+ if ( xmlin && inputclass == outputclass ){
+ LOG << "ucto: --filter=NO is automatically set. inputclass equals outputclass!"
+ << endl;
+ setFiltering(false);
}
- else {
- if ( tokDebug >= 2 ){
- LOG << "[tokenize](FoLiA) Document has language " << lan << endl;
+ if ( true /*doDetectLang*/ ){
+ string lan = doc.doc()->language();
+ if ( lan.empty() && default_language != "none" ){
+ if ( tokDebug > 1 ){
+ LOG << "[tokenize](FoLiA) SET document language=" << default_language << endl;
+ }
+ if ( doc.metadatatype() == "native" ){
+ doc.set_metadata( "language", default_language );
+ }
+ else {
+ LOG << "[WARNING] cannot set the language on FoLiA documents of type "
+ << doc.metadatatype() << endl;
+ }
+ }
+ else {
+ if ( tokDebug >= 2 ){
+ LOG << "[tokenize](FoLiA) Document has language " << lan << endl;
+ }
}
}
for ( size_t i = 0; i < doc.doc()->size(); i++) {
@@ -527,25 +635,55 @@ namespace Tokenizer {
void appendText( folia::FoliaElement *root,
const string& outputclass ){
- // cerr << endl << "appendText:" << root->id() << endl;
+ // set the textcontent of root to that of it's children
if ( root->hastext( outputclass ) ){
+ // there is already text, bail out.
return;
}
UnicodeString utxt = root->text( outputclass, false, false );
- // cerr << "untok: '" << utxt << "'" << endl;
- // UnicodeString txt = root->text( outputclass, true );
- // cerr << " tok: '" << txt << "'" << endl;
+ // so get Untokenized text from the children, and set it
root->settext( folia::UnicodeToUTF8(utxt), outputclass );
}
+ void removeText( folia::FoliaElement *root,
+ const string& outputclass ){
+ // remove the textcontent in outputclass of root
+ root->cleartextcontent( outputclass );
+ }
+
+ const string get_language( folia::FoliaElement *f ) {
+ // get the language of this element, if any, don't look up.
+ // we search in ALL possible sets!
+ string st = "";
+ std::set<folia::ElementType> exclude;
+ vector<folia::LangAnnotation*> v
+ = f->select<folia::LangAnnotation>( st, exclude, false );
+ string result;
+ if ( v.size() > 0 ){
+ result = v[0]->cls();
+ }
+ return result;
+ }
+
+ void set_language( folia::FoliaElement* e, const string& lan ){
+ // set or reset the language: append a LangAnnotation child of class 'lan'
+ folia::KWargs args;
+ args["class"] = lan;
+ args["set"] = ISO_SET;
+ folia::LangAnnotation *node = new folia::LangAnnotation( e->doc() );
+ node->setAttributes( args );
+ e->replace( node );
+ }
- void TokenizerClass::tokenizeElement(folia::FoliaElement * element) {
+ void TokenizerClass::tokenizeElement( folia::FoliaElement * element) {
if ( element->isinstance(folia::Word_t)
|| element->isinstance(folia::TextContent_t))
// shortcut
return;
if ( tokDebug >= 2 ){
- LOG << "[tokenizeElement] Processing FoLiA element " << element->id() << endl;
+ LOG << "[tokenizeElement] Processing FoLiA element " << element->xmltag()
+ << "(" << element->id() << ")" << endl;
+ LOG << "[tokenizeElement] inputclass=" << inputclass << " outputclass=" << outputclass << endl;
}
if ( element->hastext( inputclass ) ) {
// We have an element which contains text. That's nice
@@ -597,15 +735,36 @@ namespace Tokenizer {
}
}
// now let's check our language
- string lan = element->language(); // remember thus recurses upward
- // to get a language from the node, it's parents OR the doc
- if ( lan.empty() || default_language == "none" ){
- lan = "default";
+ string lan;
+ if ( doDetectLang ){
+ lan = get_language( element ); // is there a local element language?
+ if ( lan.empty() ){
+ // no, so try to detect it!
+ UnicodeString temp = element->text( inputclass );
+ temp.toLower();
+ lan = tc->get_language( folia::UnicodeToUTF8(temp) );
+ if ( lan.empty() ){
+ // too bad
+ lan = "default";
+ }
+ else {
+ if ( tokDebug >= 2 ){
+ LOG << "[tokenizeElement] textcat found a supported language: " << lan << endl;
+ }
+ }
+ }
+ }
+ else {
+ lan = element->language(); // remember this recurses upward
+ // to get a language from the node, its parents OR the doc
+ if ( lan.empty() || default_language == "none" ){
+ lan = "default";
+ }
}
auto const it = settings.find(lan);
if ( it != settings.end() ){
if ( tokDebug >= 2 ){
- LOG << "[tokenizeElement] Found a supported language! " << lan << endl;
+ LOG << "[tokenizeElement] Found a supported language: " << lan << endl;
}
}
else if ( !default_language.empty() ){
@@ -630,12 +789,7 @@ namespace Tokenizer {
if ( tokDebug >= 2 ){
LOG << "[tokenizeElement] set language to " << lan << endl;
}
- folia::KWargs args;
- args["class"] = lan;
- args["set"] = ISO_SET;
- folia::LangAnnotation *node = new folia::LangAnnotation( element->doc() );
- node->setAttributes( args );
- element->append( node );
+ set_language( element, lan );
}
tokenizeSentenceElement( element, lan );
return;
@@ -647,9 +801,27 @@ namespace Tokenizer {
for ( size_t i = 0; i < element->size(); i++) {
tokenizeElement( element->index(i));
}
+ if ( text_redundancy == "full" ){
+ if ( tokDebug > 0 ) {
+ LOG << "[tokenizeElement] Creating text on " << element->id() << endl;
+ }
+ appendText( element, outputclass );
+ }
+ else if ( text_redundancy == "none" ){
+ if ( tokDebug > 0 ) {
+ LOG << "[tokenizeElement] Removing text from: " << element->id() << endl;
+ }
+ removeText( element, outputclass );
+ }
return;
}
+ int split_nl( const UnicodeString& line,
+ vector<UnicodeString>& parts ){
+ static UnicodeRegexMatcher nl_split( "\\n", "newline_splitter" );
+ return nl_split.split( line, parts );
+ }
+
void TokenizerClass::tokenizeSentenceElement( folia::FoliaElement *element,
const string& lang ){
folia::Document *doc = element->doc();
@@ -662,7 +834,7 @@ namespace Tokenizer {
"annotator='ucto', annotatortype='auto', datetime='now()'" );
}
if ( tokDebug > 0 ){
- cerr << "tokenize sentence element: " << element->id() << endl;
+ LOG << "[tokenizeSentenceElement] " << element->id() << endl;
}
UnicodeString line = element->stricttext( inputclass );
if ( line.isEmpty() ){
@@ -679,17 +851,32 @@ namespace Tokenizer {
passthruLine( line, bos );
}
else {
- tokenizeLine( line, lang );
+ // FoLiA may encode newlines. These should be converted to <br/> nodes
+ // but Linebreak and newline handling is very dangerous and complicated
+ // so for now it is disabled!
+ vector<UnicodeString> parts;
+ parts.push_back( line ); // just one part
+ //split_nl( line, parts ); // disabled multipart
+ for ( auto const& l : parts ){
+ if ( tokDebug >= 1 ){
+ LOG << "[tokenizeSentenceElement] tokenize part: " << l << endl;
+ }
+ tokenizeLine( l, lang, element->id() );
+ if ( &l != &parts.back() ){
+ // append '<br/>'
+ Token T( "type_linebreak", "\n", LINEBREAK, "" );
+ if ( tokDebug >= 1 ){
+ LOG << "[tokenizeSentenceElement] added LINEBREAK token " << endl;
+ }
+ tokens.push_back( T );
+ }
+ }
}
//ignore EOL data, we have by definition only one sentence:
int numS = countSentences(true); //force buffer to empty
vector<Token> outputTokens;
- for (int i = 0; i < numS; i++) {
- vector<Token> v = getSentence( i );
- outputTokens.insert( outputTokens.end(), v.begin(), v.end() );
- }
+ extractSentencesAndFlush( numS, outputTokens, lang );
outputTokensXML( element, outputTokens, 0 );
- flushSentences( numS, lang );
}
void TokenizerClass::outputTokensDoc_init( folia::Document& doc ) const {
@@ -707,25 +894,6 @@ namespace Tokenizer {
doc.append( text );
}
- void TokenizerClass::outputTokensDoc( folia::Document& doc,
- const vector<Token>& tv ) const {
- folia::FoliaElement *root = doc.doc()->index(0);
- string lan = doc.doc()->language();
- if ( lan.empty() ){
- if ( tokDebug >= 1 ){
- LOG << "[outputTokensDoc] SET docuemnt language="
- << default_language << endl;
- }
- doc.set_metadata( "language", default_language );
- }
- else {
- if ( tokDebug >= 2 ){
- LOG << "[outputTokensDoc] Document has language " << lan << endl;
- }
- }
- outputTokensXML(root, tv );
- }
-
int TokenizerClass::outputTokensXML( folia::FoliaElement *root,
const vector<Token>& tv,
int parCount ) const {
@@ -741,11 +909,12 @@ namespace Tokenizer {
if ( root->isinstance( folia::Sentence_t ) ){
root_is_sentence = true;
}
- else if ( root->isinstance( folia::Paragraph_t )
+ else if ( root->isinstance( folia::Paragraph_t ) //TODO: can't we do this smarter?
|| root->isinstance( folia::Head_t )
|| root->isinstance( folia::Note_t )
|| root->isinstance( folia::ListItem_t )
|| root->isinstance( folia::Part_t )
+ || root->isinstance( folia::Utterance_t )
|| root->isinstance( folia::Caption_t )
|| root->isinstance( folia::Event_t ) ){
root_is_structure_element = true;
@@ -753,16 +922,27 @@ namespace Tokenizer {
bool in_paragraph = false;
for ( const auto& token : tv ) {
- if ( ( !root_is_structure_element && !root_is_sentence )
+ if ( ( !root_is_structure_element && !root_is_sentence ) //TODO: instead of !root_is_structurel check if is_structure and accepts paragraphs?
&&
( (token.role & NEWPARAGRAPH) || !in_paragraph ) ) {
- if ( in_paragraph ){
- appendText( root, outputclass );
- root = root->parent();
- }
if ( tokDebug > 0 ) {
LOG << "[outputTokensXML] Creating paragraph" << endl;
}
+ if ( in_paragraph ){
+ if ( text_redundancy == "full" ){
+ if ( tokDebug > 0 ) {
+ LOG << "[outputTokensXML] Creating text on root: " << root->id() << endl;
+ }
+ appendText( root, outputclass );
+ }
+ else if ( text_redundancy == "none" ){
+ if ( tokDebug > 0 ) {
+ LOG << "[outputTokensXML] Removing text from root: " << root->id() << endl;
+ }
+ removeText( root, outputclass );
+ }
+ root = root->parent();
+ }
folia::KWargs args;
args["id"] = root->doc()->id() + ".p." + toString(++parCount);
folia::FoliaElement *p = new folia::Paragraph( args, root->doc() );
@@ -782,12 +962,27 @@ namespace Tokenizer {
LOG << "[outputTokensXML] back to " << root->classname() << endl;
}
}
- if (( token.role & BEGINOFSENTENCE) && (!root_is_sentence)) {
+ if ( ( token.role & LINEBREAK) ){
+ if (tokDebug > 0) {
+ LOG << "[outputTokensXML] LINEBREAK!" << endl;
+ }
+ folia::FoliaElement *lb = new folia::Linebreak();
+ root->append( lb );
+ if (tokDebug > 0){
+ LOG << "[outputTokensXML] back to " << root->classname() << endl;
+ }
+ }
+ if ( ( token.role & BEGINOFSENTENCE)
+ && !root_is_sentence
+ && !root->isinstance( folia::Utterance_t ) ) {
folia::KWargs args;
- if ( root->id().empty() )
- args["generate_id"] = root->parent()->id();
- else
- args["generate_id"] = root->id();
+ string id = root->id();
+ if ( id.empty() ){
+ id = root->parent()->id();
+ }
+ if ( !id.empty() ){
+ args["generate_id"] = id;
+ }
if ( tokDebug > 0 ) {
LOG << "[outputTokensXML] Creating sentence in '"
<< args["generate_id"] << "'" << endl;
@@ -807,62 +1002,86 @@ namespace Tokenizer {
}
s->doc()->declare( folia::AnnotationType::LANG,
ISO_SET, "annotator='ucto'" );
- folia::KWargs args;
- args["class"] = tok_lan;
- args["set"] = ISO_SET;
- folia::LangAnnotation *node = new folia::LangAnnotation( s->doc() );
- node->setAttributes( args );
- s->append( node );
+ set_language( s, tok_lan );
}
root = s;
lastS = root;
}
- if (tokDebug > 0) {
- LOG << "[outputTokensXML] Creating word element for " << token.us << endl;
- }
- folia::KWargs args;
- args["generate_id"] = lastS->id();
- args["class"] = folia::UnicodeToUTF8( token.type );
- if ( passthru ){
- args["set"] = "passthru";
- }
- else {
- auto it = settings.find(token.lc);
- if ( it == settings.end() ){
- it = settings.find("default");
+ if ( !(token.role & LINEBREAK) ){
+ if (tokDebug > 0) {
+ LOG << "[outputTokensXML] Creating word element for " << token.us << endl;
+ }
+ folia::KWargs args;
+ string id = lastS->id();
+ if ( id.empty() ){
+ id = lastS->parent()->id();
+ }
+ if ( !id.empty() ){
+ args["generate_id"] = id;
+ }
+ args["class"] = folia::UnicodeToUTF8( token.type );
+ if ( passthru ){
+ args["set"] = "passthru";
+ }
+ else {
+ auto it = settings.find(token.lc);
+ if ( it == settings.end() ){
+ it = settings.find("default");
+ }
+ args["set"] = it->second->set_file;
+ }
+ if ( token.role & NOSPACE) {
+ args["space"]= "no";
+ }
+ if ( outputclass != inputclass ){
+ args["textclass"] = outputclass;
+ }
+ folia::FoliaElement *w = new folia::Word( args, root->doc() );
+ root->append( w );
+ UnicodeString out = token.us;
+ if (lowercase) {
+ out.toLower();
+ }
+ else if (uppercase) {
+ out.toUpper();
+ }
+ w->settext( folia::UnicodeToUTF8( out ), outputclass );
+ if ( tokDebug > 1 ) {
+ LOG << "created " << w << " text= " << token.us << "(" << outputclass << ")" << endl;
}
- args["set"] = it->second->set_file;
- }
- if ( token.role & NOSPACE) {
- args["space"]= "no";
- }
- folia::FoliaElement *w = new folia::Word( args, root->doc() );
- UnicodeString out = token.us;
- if (lowercase) {
- out.toLower();
- }
- else if (uppercase) {
- out.toUpper();
}
- w->settext( folia::UnicodeToUTF8( out ), outputclass );
- // LOG << "created " << w << " text= " << token.us << endl;
- root->append( w );
if ( token.role & BEGINQUOTE) {
if (tokDebug > 0) {
LOG << "[outputTokensXML] Creating quote element" << endl;
}
- folia::FoliaElement *q = new folia::Quote( folia::getArgs( "generate_id='" + root->id() + "'"),
- root->doc() );
+ folia::KWargs args;
+ string id = root->id();
+ if ( id.empty() ){
+ id = root->parent()->id();
+ }
+ if ( !id.empty() ){
+ args["generate_id"] = id;
+ }
+ folia::FoliaElement *q = new folia::Quote( args, root->doc() );
// LOG << "created " << q << endl;
root->append( q );
root = q;
quotelevel++;
}
- if ( ( token.role & ENDOFSENTENCE) && (!root_is_sentence) ) {
+ if ( ( token.role & ENDOFSENTENCE ) && (!root_is_sentence) && (!root->isinstance(folia::Utterance_t))) {
if (tokDebug > 0) {
LOG << "[outputTokensXML] End of sentence" << endl;
}
- appendText( root, outputclass );
+ if ( text_redundancy == "full" ){
+ appendText( root, outputclass );
+ }
+ else if ( text_redundancy == "none" ){
+ removeText( root, outputclass );
+ }
+ if ( token.role & LINEBREAK ){
+ folia::FoliaElement *lb = new folia::Linebreak();
+ root->append( lb );
+ }
root = root->parent();
lastS = root;
if (tokDebug > 0){
@@ -872,7 +1091,21 @@ namespace Tokenizer {
in_paragraph = true;
}
if ( tv.size() > 0 ){
- appendText( root, outputclass );
+ if ( text_redundancy == "full" ){
+ if ( tokDebug > 0 ) {
+ LOG << "[outputTokensXML] Creating text on root: " << root->id() << endl;
+ }
+ appendText( root, outputclass );
+ }
+ else if ( text_redundancy == "none" ){
+ if ( tokDebug > 0 ) {
+ LOG << "[outputTokensXML] Removing text from root: " << root->id() << endl;
+ }
+ removeText( root, outputclass );
+ }
+ }
+ if ( tokDebug > 0 ) {
+ LOG << "[outputTokensXML] Done. parCount= " << parCount << endl;
}
return parCount;
}
@@ -949,7 +1182,7 @@ namespace Tokenizer {
}
}
- int TokenizerClass::countSentences(bool forceentirebuffer) {
+ int TokenizerClass::countSentences( bool forceentirebuffer ) {
//Return the number of *completed* sentences in the token buffer
//Performs extra sanity checks at the same time! Making sure
@@ -1053,14 +1286,21 @@ namespace Tokenizer {
short quotelevel = 0;
size_t begin = 0;
size_t end = 0;
- for ( int i = 0; i < size; i++) {
- if (tokens[i].role & NEWPARAGRAPH) quotelevel = 0;
- if (tokens[i].role & ENDQUOTE) quotelevel--;
- if ((tokens[i].role & BEGINOFSENTENCE) && (quotelevel == 0)) {
+ for ( int i = 0; i < size; ++i ) {
+ if (tokens[i].role & NEWPARAGRAPH) {
+ quotelevel = 0;
+ }
+ else if (tokens[i].role & ENDQUOTE) {
+ --quotelevel;
+ }
+ if ( (tokens[i].role & BEGINOFSENTENCE)
+ && (quotelevel == 0)) {
begin = i;
}
//FBK: QUOTELEVEL GOES UP BEFORE begin IS UPDATED... RESULTS IN DUPLICATE OUTPUT
- if (tokens[i].role & BEGINQUOTE) quotelevel++;
+ if (tokens[i].role & BEGINQUOTE) {
+ ++quotelevel;
+ }
if ((tokens[i].role & ENDOFSENTENCE) && (quotelevel == 0)) {
if (count == index) {
@@ -1074,7 +1314,7 @@ namespace Tokenizer {
}
return outToks;
}
- count++;
+ ++count;
}
}
throw uRangeError( "No sentence exists with the specified index: "
@@ -1654,7 +1894,13 @@ namespace Tokenizer {
int TokenizerClass::tokenizeLine( const string& s,
const string& lang ){
UnicodeString uinputstring = convert( s, inputEncoding );
- return tokenizeLine( uinputstring, lang );
+ return tokenizeLine( uinputstring, lang, "" );
+ }
+
+ // UnicodeString wrapper
+ int TokenizerClass::tokenizeLine( const UnicodeString& u,
+ const string& lang ){
+ return tokenizeLine( u, lang, "" );
}
bool u_isemo( UChar32 c ){
@@ -1769,7 +2015,8 @@ namespace Tokenizer {
}
int TokenizerClass::tokenizeLine( const UnicodeString& originput,
- const string& _lang ){
+ const string& _lang,
+ const string& id ){
string lang = _lang;
if ( lang.empty() ){
lang = "default";
@@ -1791,7 +2038,14 @@ namespace Tokenizer {
input = settings[lang]->filter.filter( input );
}
if ( input.isBogus() ){ //only tokenize valid input
- *theErrLog << "ERROR: Invalid UTF-8 in line!:" << input << endl;
+ if ( id.empty() ){
+ LOG << "ERROR: Invalid UTF-8 in line:" << linenum << endl
+ << " '" << input << "'" << endl;
+ }
+ else {
+ LOG << "ERROR: Invalid UTF-8 in element:" << id << endl
+ << " '" << input << "'" << endl;
+ }
return 0;
}
int32_t len = input.countChar32();
@@ -1811,16 +2065,18 @@ namespace Tokenizer {
UnicodeString word;
StringCharacterIterator sit(input);
long int i = 0;
+ long int tok_size = 0;
while ( sit.hasNext() ){
UChar32 c = sit.current32();
if ( tokDebug > 8 ){
UnicodeString s = c;
int8_t charT = u_charType( c );
LOG << "examine character: " << s << " type= "
- << toString( charT ) << endl;
+ << toString( charT ) << endl;
}
if (reset) { //reset values for new word
reset = false;
+ tok_size = 0;
if (!u_isspace(c))
word = c;
else
@@ -1912,6 +2168,22 @@ namespace Tokenizer {
}
sit.next32();
++i;
+ ++tok_size;
+ if ( tok_size > 2500 ){
+ if ( id.empty() ){
+ LOG << "Ridiculously long word/token (over 2500 characters) detected "
+ << "in line: " << linenum << ". Skipped ..." << endl;
+ LOG << "The line starts with " << UnicodeString( word, 0, 75 )
+ << "..." << endl;
+ }
+ else {
+ LOG << "Ridiculously long word/token (over 2500 characters) detected "
+ << "in element: " << id << ". Skipped ..." << endl;
+ LOG << "The text starts with " << UnicodeString( word, 0, 75 )
+ << "..." << endl;
+ }
+ return 0;
+ }
}
int numNewTokens = tokens.size() - begintokencount;
if ( numNewTokens > 0 ){
@@ -2113,7 +2385,7 @@ namespace Tokenizer {
break;
}
}
- if ( ! a_rule_matched ){
+ if ( !a_rule_matched ){
// no rule matched
if ( tokDebug >=4 ){
LOG << "\tthere's no match at all" << endl;
@@ -2174,7 +2446,7 @@ namespace Tokenizer {
}
}
if ( settings.empty() ){
- cerr << "No useful settingsfile(s) could be found." << endl;
+ cerr << "ucto: No useful settingsfile(s) could be found." << endl;
return false;
}
return true;
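For reference, a minimal sketch (not part of the patch) of the new
setTextRedundancy() member on TokenizerClass introduced above: it returns the
previous level and throws std::runtime_error for anything other than "none",
"minimal" or "full". The include path is an assumption.

    // sketch only: switch a tokenizer to full text redundancy
    #include <string>
    #include "ucto/tokenize.h"

    int main() {
      Tokenizer::TokenizerClass tokenizer;
      // the default level is "minimal"; the old value is returned
      std::string previous = tokenizer.setTextRedundancy( "full" );
      return 0;
    }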
diff --git a/src/ucto.cxx b/src/ucto.cxx
index 3d2ad72..05fb844 100644
--- a/src/ucto.cxx
+++ b/src/ucto.cxx
@@ -45,38 +45,53 @@ using namespace std;
using namespace Tokenizer;
void usage(){
+ set<string> languages = Setting::installed_languages();
cerr << "Usage: " << endl;
cerr << "\tucto [[options]] [input-file] [[output-file]]" << endl
<< "Options:" << endl
- << "\t-c <configfile> - Explicitly specify a configuration file" << endl
- << "\t-d <value> - set debug level" << endl
- << "\t-e <string> - set input encoding (default UTF8)" << endl
- << "\t-N <string> - set output normalization (default NFC)" << endl
- << "\t-f - Disable filtering of special characters" << endl
- << "\t-h or --help - this message" << endl
- << "\t-L <language> - Automatically selects a configuration file by language code. (default 'generic')" << endl
- << "\t-l - Convert to all lowercase" << endl
- << "\t-u - Convert to all uppercase" << endl
- << "\t-n - One sentence per line (output)" << endl
- << "\t-m - One sentence per line (input)" << endl
- << "\t-v - Verbose mode" << endl
- << "\t-s <string> - End-of-Sentence marker (default: <utt>)" << endl
- << "\t--passthru - Don't tokenize, but perform input decoding and simple token role detection" << endl
+ << "\t-c <configfile> - Explicitly specify a configuration file" << endl
+ << "\t-d <value> - set debug level" << endl
+ << "\t-e <string> - set input encoding (default UTF8)" << endl
+ << "\t-N <string> - set output normalization (default NFC)" << endl
+ << "\t--filter=[YES|NO] - Disable filtering of special characters" << endl
+ << "\t-f - OBSOLETE. use --filter=NO" << endl
+ << "\t-h or --help - this message" << endl
+ << "\t-L <language> - Automatically selects a configuration file by language code." << endl
+ << "\t - Available Languages:" << endl
+ << "\t ";
+ for( const auto l : languages ){
+ cerr << l << ",";
+ }
+ cerr << endl;
+ cerr << "\t-l - Convert to all lowercase" << endl
+ << "\t-u - Convert to all uppercase" << endl
+ << "\t-n - One sentence per line (output)" << endl
+ << "\t-m - One sentence per line (input)" << endl
+ << "\t-v - Verbose mode" << endl
+ << "\t-s <string> - End-of-Sentence marker (default: <utt>)" << endl
+ << "\t--passthru - Don't tokenize, but perform input decoding and simple token role detection" << endl
<< "\t--normalize=<class1>,class2>,... " << endl
- << "\t - For class1, class2, etc. output the class tokens instead of the tokens itself." << endl
- << "\t--filterpunct - remove all punctuation from the output" << endl
- << "\t--detectlanguages=<lang1,lang2,..langn> - try to detect languages. Default = 'lang1'" << endl
- << "\t-P - Disable paragraph detection" << endl
- << "\t-S - Disable sentence detection!" << endl
- << "\t-Q - Enable quote detection (experimental)" << endl
- << "\t-V or --version - Show version information" << endl
- << "\t-x <DocID> - Output FoLiA XML, use the specified Document ID (obsolete)" << endl
- << "\t-F - Input file is in FoLiA XML. All untokenised sentences will be tokenised." << endl
- << "\t-X - Output FoLiA XML, use the Document ID specified with --id=" << endl
- << "\t--id <DocID> - use the specified Document ID to label the FoLia doc." << endl
- << "\t--textclass <class> - use the specified class to search text in the FoLia doc. (deprecated. use --inputclass)" << endl
- << "\t--inputclass <class> - use the specified class to search text in the FoLia doc." << endl
- << "\t--outputclass <class> - use the specified class to output text in the FoLia doc. (default is 'current'. changing this is dangerous!)" << endl
+ << "\t - For class1, class2, etc. output the class tokens instead of the tokens itself." << endl
+ << "\t-T or --textredundancy=[full|minimal|none] - set text redundancy level for text nodes in FoLiA output: " << endl
+ << "\t 'full' - add text to all levels: <p> <s> <w> etc." << endl
+ << "\t 'minimal' - don't introduce text on higher levels, but retain what is already there." << endl
+ << "\t 'none' - only introduce text on <w>, AND remove all text from higher levels" << endl
+ << "\t--filterpunct - remove all punctuation from the output" << endl
+ << "\t--uselanguages=<lang1,lang2,..langn> - only tokenize strings in these languages. Default = 'lang1'" << endl
+ << "\t--detectlanguages=<lang1,lang2,..langn> - try to assignlanguages before using. Default = 'lang1'" << endl
+ << "\t-P - Disable paragraph detection" << endl
+ << "\t-S - Disable sentence detection!" << endl
+ << "\t-Q - Enable quote detection (experimental)" << endl
+ << "\t-V or --version - Show version information" << endl
+ << "\t-x <DocID> - Output FoLiA XML, use the specified Document ID (obsolete)" << endl
+ << "\t-F - Input file is in FoLiA XML. All untokenised sentences will be tokenised." << endl
+ << "\t -F is automatically set when inputfile has extension '.xml'" << endl
+ << "\t-X - Output FoLiA XML, use the Document ID specified with --id=" << endl
+ << "\t--id <DocID> - use the specified Document ID to label the FoLia doc." << endl
+ << " -X is automatically set when inputfile has extension '.xml'" << endl
+ << "\t--inputclass <class> - use the specified class to search text in the FoLia doc.(default is 'current')" << endl
+ << "\t--outputclass <class> - use the specified class to output text in the FoLia doc. (default is 'current')" << endl
+ << "\t--textclass <class> - use the specified class for both input and output of text in the FoLia doc. (default is 'current'). Implies --filter=NO." << endl
<< "\t (-x and -F disable usage of most other options: -nPQVsS)" << endl;
}
@@ -88,18 +103,20 @@ int main( int argc, char *argv[] ){
bool sentenceperlineinput = false;
bool paragraphdetection = true;
bool quotedetection = false;
+ bool do_language_detect = false;
bool dofiltering = true;
bool dopunctfilter = false;
bool splitsentences = true;
bool xmlin = false;
bool xmlout = false;
bool verbose = false;
+ string redundancy = "minimal";
string eosmarker = "<utt>";
string docid = "untitleddoc";
- string inputclass = "current";
- string outputclass = "current";
string normalization = "NFC";
string inputEncoding = "UTF-8";
+ string inputclass = "current";
+ string outputclass = "current";
vector<string> language_list;
string cfile;
string ifile;
@@ -109,8 +126,8 @@ int main( int argc, char *argv[] ){
string norm_set_string;
try {
- TiCC::CL_Options Opts( "d:e:fhlPQunmN:vVSL:c:s:x:FX",
- "filterpunct,passthru,textclass:,inputclass:,outputclass:,normalize:,id:,version,help,detectlanguages:");
+ TiCC::CL_Options Opts( "d:e:fhlPQunmN:vVSL:c:s:x:FXT:",
+ "filter:,filterpunct,passthru,textclass:,inputclass:,outputclass:,normalize:,id:,version,help,detectlanguages:,uselanguages:,textredundancy:");
Opts.init(argc, argv );
if ( Opts.extract( 'h' )
|| Opts.extract( "help" ) ){
@@ -120,13 +137,13 @@ int main( int argc, char *argv[] ){
if ( Opts.extract( 'V' ) ||
Opts.extract( "version" ) ){
cout << "Ucto - Unicode Tokenizer - version " << Version() << endl
- << "(c) ILK 2009 - 2014, Induction of Linguistic Knowledge Research Group, Tilburg University" << endl
+ << "(c) CLST 2015 - 2017, Centre for Language and Speech Technology, Radboud University Nijmegen" << endl
+ << "(c) ILK 2009 - 2015, Induction of Linguistic Knowledge Research Group, Tilburg University" << endl
<< "Licensed under the GNU General Public License v3" << endl;
cout << "based on [" << folia::VersionName() << "]" << endl;
return EXIT_SUCCESS;
}
Opts.extract('e', inputEncoding );
- dofiltering = !Opts.extract( 'f' );
dopunctfilter = Opts.extract( "filterpunct" );
paragraphdetection = !Opts.extract( 'P' );
splitsentences = !Opts.extract( 'S' );
@@ -137,6 +154,13 @@ int main( int argc, char *argv[] ){
tolowercase = Opts.extract( 'l' );
sentenceperlineoutput = Opts.extract( 'n' );
sentenceperlineinput = Opts.extract( 'm' );
+ Opts.extract( 'T', redundancy );
+ Opts.extract( "textredundancy", redundancy );
+ if ( redundancy != "full"
+ && redundancy != "minimal"
+ && redundancy != "none" ){
+ throw TiCC::OptionError( "unknown textredundancy level: " + redundancy );
+ }
Opts.extract( 'N', normalization );
verbose = Opts.extract( 'v' );
if ( Opts.extract( 'x', docid ) ){
@@ -153,9 +177,38 @@ int main( int argc, char *argv[] ){
Opts.extract( "id", docid );
}
passThru = Opts.extract( "passthru" );
- Opts.extract( "textclass", inputclass );
+ string textclass;
+ Opts.extract( "textclass", textclass );
Opts.extract( "inputclass", inputclass );
Opts.extract( "outputclass", outputclass );
+ if ( !textclass.empty() ){
+ if ( inputclass != "current" ){
+ throw TiCC::OptionError( "--textclass conflicts with --inputclass" );
+ }
+ if ( outputclass != "current" ){
+ throw TiCC::OptionError( "--textclass conflicts with --outputclass");
+ }
+ inputclass = textclass;
+ outputclass = textclass;
+ }
+ if ( Opts.extract( 'f' ) ){
+ cerr << "ucto: The -f option is used. Please consider using --filter=NO" << endl;
+ dofiltering = false;
+ }
+ string value;
+ if ( Opts.extract( "filter", value ) ){
+ bool result;
+ if ( !TiCC::stringTo( value, result ) ){
+ throw TiCC::OptionError( "illegal value for '--filter' option. (boolean expected)" );
+ }
+ dofiltering = result;
+ }
+ if ( dofiltering && xmlin && outputclass == inputclass ){
+ // we cannot mangle the original inputclass, so disable filtering
+ cerr << "ucto: --filter=NO is automatically set. inputclass equals outputclass!"
+ << endl;
+ dofiltering = false;
+ }
if ( xmlin && outputclass.empty() ){
if ( dopunctfilter ){
throw TiCC::OptionError( "--outputclass required for --filterpunct on FoLiA input ");
@@ -167,7 +220,6 @@ int main( int argc, char *argv[] ){
throw TiCC::OptionError( "--outputclass required for -l on FoLiA input ");
}
}
- string value;
if ( Opts.extract('d', value ) ){
if ( !TiCC::stringTo(value,debug) ){
throw TiCC::OptionError( "invalid value for -d: " + value );
@@ -175,30 +227,44 @@ int main( int argc, char *argv[] ){
}
if ( Opts.is_present('L') ) {
if ( Opts.is_present('c') ){
- cerr << "Error: -L and -c options conflict. Use only one of them." << endl;
- return EXIT_FAILURE;
+ throw TiCC::OptionError( "-L and -c options conflict. Use only one of these." );
}
else if ( Opts.is_present( "detectlanguages" ) ){
- cerr << "Error: -L and --detectlanguages options conflict. Use only one of them." << endl;
- return EXIT_FAILURE;
+ throw TiCC::OptionError( "-L and --detectlanguages options conflict. Use only one of these." );
+ }
+ else if ( Opts.is_present( "uselanguages" ) ){
+ throw TiCC::OptionError( "-L and --uselanguages options conflict. Use only one of these." );
}
}
- else if ( Opts.is_present( 'c' )
- && Opts.is_present( "detectlanguages" ) ){
- cerr << "Error: -c and --detectlanguages options conflict. Use only one of them." << endl;
- return EXIT_FAILURE;
+ else if ( Opts.is_present( 'c' ) ){
+ if ( Opts.is_present( "detectlanguages" ) ){
+ throw TiCC::OptionError( "-c and --detectlanguages options conflict. Use only one of these" );
+ }
+ else if ( Opts.is_present( "uselanguages" ) ){
+ throw TiCC::OptionError( "-L and --uselanguages options conflict. Use only one of these." );
+ }
+ }
+ if ( Opts.is_present( "detectlanguages" ) &&
+ Opts.is_present( "uselanguages" ) ){
+ throw TiCC::OptionError( "--detectlanguages and --uselanguages options conflict. Use only one of these." );
}
-
Opts.extract( 'c', c_file );
+
string languages;
Opts.extract( "detectlanguages", languages );
- bool do_language_detect = !languages.empty();
- if ( do_language_detect ){
+ if ( languages.empty() ){
+ Opts.extract( "uselanguages", languages );
+ }
+ else {
+ do_language_detect = true;
+ }
+ if ( !languages.empty() ){
if ( TiCC::split_at( languages, language_list, "," ) < 1 ){
throw TiCC::OptionError( "invalid language list: " + languages );
}
}
else {
+ // so neither --detectlanguages nor --uselanguages was given
string language;
if ( Opts.extract('L', language ) ){
// support some backward compatibility to old ISO 639-1 codes
@@ -248,56 +314,113 @@ int main( int argc, char *argv[] ){
vector<string> files = Opts.getMassOpts();
if ( files.size() > 0 ){
ifile = files[0];
+ if ( TiCC::match_back( ifile, ".xml" ) ){
+ xmlin = true;
+ }
}
- if ( files.size() > 1 ){
+ if ( files.size() == 2 ){
ofile = files[1];
+ if ( TiCC::match_back( ofile, ".xml" ) ){
+ xmlout = true;
+ }
+ }
+ if ( files.size() > 2 ){
+ cerr << "found additional arguments on the commandline: " << files[2]
+ << "...." << endl;
}
+
}
catch( const TiCC::OptionError& e ){
cerr << "ucto: " << e.what() << endl;
usage();
return EXIT_FAILURE;
}
-
if ( !passThru ){
+ set<string> available_languages = Setting::installed_languages();
if ( !c_file.empty() ){
cfile = c_file;
}
else if ( language_list.empty() ){
- cfile = "tokconfig-generic";
+ cerr << "ucto: missing a language specification (-L or --detectlanguages or --uselanguages option)" << endl;
+ if ( available_languages.size() == 1
+ && *available_languages.begin() == "generic" ){
+ cerr << "ucto: The uctodata package seems not to be installed." << endl;
+ cerr << "ucto: You can use '-L generic' to run a simple default tokenizer."
+ << endl;
+ cerr << "ucto: Installing uctodata is highly recommended." << endl;
+ }
+ else {
+ cerr << "ucto: Available Languages: ";
+ for( const auto& l : available_languages ){
+ cerr << l << ",";
+ }
+ cerr << endl;
+ }
+ return EXIT_FAILURE;
+ }
+ else {
+ for ( const auto& l : language_list ){
+ if ( available_languages.find(l) == available_languages.end() ){
+ cerr << "ucto: unsupported language '" << l << "'" << endl;
+ if ( available_languages.size() == 1
+ && *available_languages.begin() == "generic" ){
+ cerr << "ucto: The uctodata package seems not to be installed." << endl;
+ cerr << "ucto: You can use '-L generic' to run a simple default tokenizer."
+ << endl;
+ cerr << "ucto: Installing uctodata is highly recommended." << endl;
+ }
+ else {
+ cerr << "ucto: Available Languages: ";
+ for( const auto& l : available_languages ){
+ cerr << l << ",";
+ }
+ cerr << endl;
+ }
+ return EXIT_FAILURE;
+ }
+ }
}
}
if ((!ifile.empty()) && (ifile == ofile)) {
- cerr << "Error: Output file equals input file! Courageously refusing to start..." << endl;
+ cerr << "ucto: Output file equals input file! Courageously refusing to start..." << endl;
return EXIT_FAILURE;
}
- if ( !passThru ){
- cerr << "configfile = " << cfile << endl;
- }
- cerr << "inputfile = " << ifile << endl;
- cerr << "outputfile = " << ofile << endl;
+ cerr << "ucto: inputfile = " << ifile << endl;
+ cerr << "ucto: outputfile = " << ofile << endl;
istream *IN = 0;
if (!xmlin) {
- if ( ifile.empty() )
+ if ( ifile.empty() ){
IN = &cin;
+ }
else {
IN = new ifstream( ifile );
if ( !IN || !IN->good() ){
- cerr << "Error: problems opening inputfile " << ifile << endl;
- cerr << "Courageously refusing to start..." << endl;
+ cerr << "ucto: problems opening inputfile " << ifile << endl;
+ cerr << "ucto: Courageously refusing to start..." << endl;
+ delete IN;
return EXIT_FAILURE;
}
}
}
ostream *OUT = 0;
- if ( ofile.empty() )
+ if ( ofile.empty() ){
OUT = &cout;
+ }
else {
OUT = new ofstream( ofile );
+ if ( !OUT || !OUT->good() ){
+ cerr << "ucto: problems opening outputfile " << ofile << endl;
+ cerr << "ucto: Courageously refusing to start..." << endl;
+ delete OUT;
+ if ( IN != &cin ){
+ delete IN;
+ }
+ return EXIT_FAILURE;
+ }
}
try {
@@ -309,15 +432,24 @@ int main( int argc, char *argv[] ){
}
else {
// initialize, except in passthru mode
- if ( !cfile.empty() ){
- if ( !tokenizer.init( cfile ) ){
- return EXIT_FAILURE;
+ if ( !cfile.empty()
+ && !tokenizer.init( cfile ) ){
+ if ( IN != &cin ){
+ delete IN;
}
+ if ( OUT != &cout ){
+ delete OUT;
+ }
+ return EXIT_FAILURE;
}
- else {
- if ( !tokenizer.init( language_list ) ){
- return EXIT_FAILURE;
+ else if ( !tokenizer.init( language_list ) ){
+ if ( IN != &cin ){
+ delete IN;
}
+ if ( OUT != &cout ){
+ delete OUT;
+ }
+ return EXIT_FAILURE;
}
}
@@ -334,11 +466,13 @@ int main( int argc, char *argv[] ){
tokenizer.setNormalization( normalization );
tokenizer.setInputEncoding( inputEncoding );
tokenizer.setFiltering(dofiltering);
+ tokenizer.setLangDetection(do_language_detect);
tokenizer.setPunctFilter(dopunctfilter);
tokenizer.setInputClass(inputclass);
tokenizer.setOutputClass(outputclass);
tokenizer.setXMLOutput(xmlout, docid);
tokenizer.setXMLInput(xmlin);
+ tokenizer.setTextRedundancy(redundancy);
if (xmlin) {
folia::Document doc;
@@ -354,7 +488,7 @@ int main( int argc, char *argv[] ){
}
}
catch ( exception &e ){
- cerr << e.what() << endl;
+ cerr << "ucto: " << e.what() << endl;
return EXIT_FAILURE;
}
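
The src/ucto.cxx changes above wire the new command-line options into tokenizer setters (setLangDetection, setTextRedundancy, setFiltering). A minimal sketch of driving the library directly with those setters; the Tokenizer::TokenizerClass name, the ucto/tokenize.h header path and the stream-to-stream tokenize() entry point are assumptions, only the setter names and init(language_list) are taken from the diff:

  #include <cstdlib>
  #include <iostream>
  #include <sstream>
  #include <string>
  #include <vector>
  #include "ucto/tokenize.h"   // assumed header for Tokenizer::TokenizerClass

  int main(){
    Tokenizer::TokenizerClass tokenizer;
    std::vector<std::string> language_list = { "nld", "eng" };
    if ( !tokenizer.init( language_list ) ){   // as with --detectlanguages=nld,eng
      return EXIT_FAILURE;
    }
    tokenizer.setLangDetection( true );        // set when --detectlanguages is used
    tokenizer.setFiltering( true );            // --filter=YES (the default)
    tokenizer.setTextRedundancy( "minimal" );  // -T / --textredundancy level
    std::istringstream in( "Dit is een zin. This is a sentence." );
    tokenizer.tokenize( in, std::cout );       // assumed stream-to-stream entry point
    return EXIT_SUCCESS;
  }
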
diff --git a/src/unicode.cxx b/src/unicode.cxx
index 72a25a9..e0d5d81 100644
--- a/src/unicode.cxx
+++ b/src/unicode.cxx
@@ -172,10 +172,10 @@ namespace Tokenizer {
return true;
}
- class uConfigError: public std::invalid_argument {
+ class uRegexError: public std::invalid_argument {
public:
- uConfigError( const string& s ): invalid_argument( "ucto: config file:" + s ){};
- uConfigError( const UnicodeString& us ): invalid_argument( "ucto: config file:" + folia::UnicodeToUTF8(us) ){};
+ explicit uRegexError( const string& s ): invalid_argument( "Invalid regular expression: " + s ){};
+ explicit uRegexError( const UnicodeString& us ): invalid_argument( "Invalid regular expression: " + folia::UnicodeToUTF8(us) ){};
};
@@ -196,21 +196,20 @@ namespace Tokenizer {
string spat = folia::UnicodeToUTF8(pat);
failString = folia::UnicodeToUTF8(_name);
if ( errorInfo.offset >0 ){
- failString += " Invalid regular expression at position " + TiCC::toString( errorInfo.offset ) + "\n";
+ failString += " at position " + TiCC::toString( errorInfo.offset ) + "\n";
UnicodeString pat1 = UnicodeString( pat, 0, errorInfo.offset -1 );
failString += folia::UnicodeToUTF8(pat1) + " <== HERE\n";
}
else {
- failString += " Invalid regular expression '" + spat + "' ";
+ failString += "'" + spat + "' ";
}
- throw uConfigError(failString);
+ throw uRegexError(failString);
}
else {
matcher = pattern->matcher( u_stat );
if (U_FAILURE(u_stat)){
- failString = "unable to create PatterMatcher with pattern '" +
- folia::UnicodeToUTF8(pat) + "'";
- throw uConfigError(failString);
+ failString = "'" + folia::UnicodeToUTF8(pat) + "'";
+ throw uRegexError(failString);
}
}
}
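
Since uRegexError still derives from std::invalid_argument, callers that catch std::invalid_argument (or std::exception) keep working; only the message prefix changes from "ucto: config file:" to "Invalid regular expression: ". A small self-contained sketch of that behaviour, with a hypothetical stand-in for the internal pattern-compiling code path:

  #include <iostream>
  #include <stdexcept>
  #include <string>

  // compile_pattern() is a hypothetical stand-in for the internal code path
  // that compiles a configuration pattern; it throws with the new message
  // prefix, just as uRegexError does.
  static void compile_pattern( const std::string& pat ){
    throw std::invalid_argument( "Invalid regular expression: '" + pat + "' " );
  }

  int main(){
    try {
      compile_pattern( "[unbalanced" );
    }
    catch ( const std::invalid_argument& e ){
      // generic handlers keep working because uRegexError is-a invalid_argument
      std::cerr << "ucto: " << e.what() << std::endl;
    }
    return 0;
  }
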
diff --git a/tests/Makefile.in b/tests/Makefile.in
index 2e11062..928e8c3 100644
--- a/tests/Makefile.in
+++ b/tests/Makefile.in
@@ -89,8 +89,7 @@ build_triplet = @build@
host_triplet = @host@
subdir = tests
ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
-am__aclocal_m4_deps = $(top_srcdir)/m4/ax_icu_check.m4 \
- $(top_srcdir)/m4/ax_lib_readline.m4 \
+am__aclocal_m4_deps = $(top_srcdir)/m4/ax_lib_readline.m4 \
$(top_srcdir)/m4/libtool.m4 $(top_srcdir)/m4/ltoptions.m4 \
$(top_srcdir)/m4/ltsugar.m4 $(top_srcdir)/m4/ltversion.m4 \
$(top_srcdir)/m4/lt~obsolete.m4 $(top_srcdir)/m4/pkg.m4 \
@@ -155,13 +154,7 @@ EXEEXT = @EXEEXT@
FGREP = @FGREP@
GREP = @GREP@
ICU_CFLAGS = @ICU_CFLAGS@
-ICU_CONFIG = @ICU_CONFIG@
-ICU_CPPSEARCHPATH = @ICU_CPPSEARCHPATH@
-ICU_CXXFLAGS = @ICU_CXXFLAGS@
-ICU_IOLIBS = @ICU_IOLIBS@
-ICU_LIBPATH = @ICU_LIBPATH@
ICU_LIBS = @ICU_LIBS@
-ICU_VERSION = @ICU_VERSION@
INSTALL = @INSTALL@
INSTALL_DATA = @INSTALL_DATA@
INSTALL_PROGRAM = @INSTALL_PROGRAM@
@@ -252,6 +245,7 @@ pdfdir = @pdfdir@
prefix = @prefix@
program_transform_name = @program_transform_name@
psdir = @psdir@
+runstatedir = @runstatedir@
sbindir = @sbindir@
sharedstatedir = @sharedstatedir@
srcdir = @srcdir@
diff --git a/ucto.pc.in b/ucto.pc.in
index fe99841..791e04e 100644
--- a/ucto.pc.in
+++ b/ucto.pc.in
@@ -6,7 +6,6 @@ includedir=@includedir@
Name: ucto
Version: @VERSION@
Description: Unicode Tokenizer
-Requires.private: ucto-icu >= 3.6 folia >= 0.3
Libs: -L${libdir} -lucto
Libs.private: @LIBS@
Cflags: -I${includedir}
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/ucto.git