[ucto] 01/03: New upstream version 0.9.5
Maarten van Gompel
proycon-guest at moszumanska.debian.org
Fri Jan 6 14:54:07 UTC 2017
This is an automated email from the git hooks/post-receive script.
proycon-guest pushed a commit to branch master
in repository ucto.
commit eea086d264dc877a18711ee896c35e9a0a73ae79
Author: proycon <proycon at anaproy.nl>
Date: Fri Jan 6 15:53:22 2017 +0100
New upstream version 0.9.5
---
ChangeLog | 29 +++++++++
NEWS | 7 +++
README | 108 +---------------------------------
config/tokconfig-generic | 10 +---
configure | 149 +++++++++++++++++++++++++++++++++++------------
configure.ac | 20 ++++++-
6 files changed, 167 insertions(+), 156 deletions(-)
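
Reviewer note: the configure.ac hunk in this commit replaces the hard
"uctodata >= 0.3" requirement with a nested PKG_CHECK_MODULES call that only
prints a notice. The shell sketch below restates that decision logic outside
of autoconf. It is an illustration, not the generated configure code: the
function name check_uctodata and the three result words are hypothetical,
and it assumes pkg-config (or whatever PKG_CONFIG points to) supports
--exists with a version constraint.

```shell
#!/bin/sh
# Sketch of the two-stage check introduced in configure.ac.
# check_uctodata is a hypothetical helper name, not part of ucto.
check_uctodata() {
    # Honour PKG_CONFIG if set, as the generated configure script does.
    pkg_config="${PKG_CONFIG:-pkg-config}"

    # Stage 1: is any version of uctodata known to pkg-config?
    if ! "$pkg_config" --exists uctodata 2>/dev/null; then
        echo "missing"      # configure prints "ucto datafiles are not installed!"
    # Stage 2: is the installed version at least 0.3?
    elif ! "$pkg_config" --exists "uctodata >= 0.3" 2>/dev/null; then
        echo "outdated"     # configure prints the "outdated" notice
    else
        echo "ok"           # no notice is printed
    fi
}
```

In all three cases configure continues; only the notice differs. Calling
check_uctodata on a build host gives a quick indication of which notice, if
any, to expect from ./configure.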
diff --git a/ChangeLog b/ChangeLog
index 5d5ddcf..ecc14ef 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,32 @@
+2017-01-06 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * NEWS: updated NEWS for the release
+
+2017-01-06 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: We do not longer require the uctodata package to be
+ installed. But issue a notice! If present we check for a recent and
+ decent version.
+
+2017-01-06 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * : commit 3bb3f7b6fba6a1d1ce566591cba65b606bbf738b Author: Ko van
+ der Sloot <K.vanderSloot at let.ru.nl> Date: Fri Jan 6 13:10:35 2017
+ +0100
+
+2017-01-06 Maarten van Gompel <proycon at anaproy.nl>
+
+ * config/tokconfig-generic: Updated tokconfig-generic with version
+ information
+
+2017-01-05 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * configure.ac: bumped version after release
+
+2017-01-05 Ko van der Sloot <K.vanderSloot at let.ru.nl>
+
+ * NEWS: updatede NEWS for upcoming release
+
2016-12-20 Ko van der Sloot <K.vanderSloot at let.ru.nl>
* src/tokenize.cxx, src/unicode.cxx: some refactoring, generally use
diff --git a/NEWS b/NEWS
index c9c9396..b95d3ab 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,10 @@
+0.9.5 2017-01-06
+[Ko van der Sloot]
+Bug fix release:
+ * updated tokconfig-generic, which is removed from the uctodata package
+ * configure no longer insists on the presence of uctodata, it merely warns
+ when missing
+
0.9.4 2017-01-05
[Ko van der Sloot]
Major update
diff --git a/README b/README
index 383582a..98cdbd6 100644
--- a/README
+++ b/README
@@ -1,107 +1 @@
-[](https://travis-ci.org/LanguageMachines/ucto) [](http://applejack.science.ru.nl/languagemachines/)
-
-================================
-Ucto - A rule-based tokeniser
-================================
-
- Centre for Language and Speech technology, Radboud University Nijmegen
- Induction of Linguistic Knowledge Research Group, Tilburg University
-
-Website: https://languagemachines.github.io/ucto/
-
-Ucto tokenizes text files: it separates words from punctuation, and splits
-sentences. This is one of the first tasks for almost any Natural Language
-Processing application. Ucto offers several other basic preprocessing steps
-such as changing case that you can all use to make your text suited for further
-processing such as indexing, part-of-speech tagging, or machine translation.
-
-Ucto comes with tokenisation rules for several languages (packaged separately)
-and can be easily extended to suit other languages. It has been incorporated
-for tokenizing Dutch text in Frog (https://languagemachines.github.io/frog),
-our Dutch morpho-syntactic processor.
-
-The software is intended to be used from the command-line by researchers in
-Natural Language Processing or related areas, as well as software developers.
-An [Ucto python binding](https://github.com/proycon/python-ucto) is also available
-separately.
-
-Features:
-
-- Comes with tokenization rules for English, Dutch, French, Italian, Turkish,
- Spanish, Portuguese and Swedish; easily extendible to other languages. Rules
- consists of regular expressions and lists. They are
- packaged separately as [uctodata](https://github.com/LanguageMachines/uctodata).
-- Recognizes dates, times, units, currencies, abbreviations.
-- Recognizes paired quote spans, sentences, and paragraphs.
-- Produces UTF8 encoding and NFC output normalization, optionally accepting
- other input encodings as well.
-- Ligature normalization (can undo for isntance fi,fl as single codepoints).
-- Optional conversion to all lowercase or uppercase.
-- Supports [FoLiA XML](https://proycon.github.io/folia)
-
-Ucto was written by Maarten van Gompel and Ko van der Sloot. Work on Ucto was
-funded by NWO, the Netherlands Organisation for Scientific Research, under the
-Implicit Linguistics project, the CLARIN-NL program, and the CLARIAH project.
-
-This software is available under the GNU Public License v3 (see the file
-COPYING).
-
-------------------------------------------------------------
-Installation
-------------------------------------------------------------
-
-To install ucto, first consult whether your distribution's package manager has an up-to-date package for it.
-If not, for easy installation of ucto and all dependencies, it is included as part of our software
-distribution [LaMachine](https://proycon.github.io/LaMachine).
-
-To compile and install manually from source, provided you have all the
-dependencies installed:
-
- $ bash bootstrap.sh
- $ ./configure
- $ make
- $ sudo make install
-
-You will need current versions of the following dependencies of our software:
-
-* [ticcutils](https://github.com/LanguageMachine/ticcutils) - A shared utility library
-* [libfolia](https://github.com/LanguageMachines/libfolia) - A library for the FoLiA format.
-* [uctodata](https://github.com/LanguageMachines/uctodata) - Data files for ucto, packaged separately
-
-As well as the following 3rd party dependencies:
-
-* ``icu`` - A C++ library for Unicode and Globalization support. On Debian/Ubuntu systems, install the package libicu-dev.
-* ``libxml2`` - An XML library. On Debian/Ubuntu systems install the package libxml2-dev.
-* A sane build environment with a C++ compiler (e.g. gcc or clang), autotools, libtool, pkg-config
-
-------------------------------------------------------------
-Usage
-------------------------------------------------------------
-
-Tokenize an english text file to standard output, tokens will be
-space-seperated, sentences delimiter by ``<utt>``:
-
- $ ucto -L en yourfile.txt
-
-The -L flag specifies the language (as an iso-639-1 code), provided a configuration file exists for
-that language. To output to file instead of standard output, just add another
-positional argument with the desired output filename.
-
-If you want each sentence on a separate line (i.e. newline delimited rather than delimited by
-``<utt>``), then pass the ``-n`` flag. If each sentence is already on one line
-in the input and you want to leave it at that, pass the ``-m`` flag.
-
-Tokenize plaintext to [FoLiA XML](https://proycon.github.io/folia) using the ``-X`` flag, you can specify an ID
-for the FoLiA document using the ``--id=`` flag.
-
- $ ucto -L en -X --id=hamlet hamlet.txt hamlet.folia.xml
-
-Note that in the FoLiA XML output, ucto encodes the class of the token (date, url, smiley, etc...) based
-on the rule that matched.
-
-For further documentation consult the [ucto
-manual](https://github.com/LanguageMachines/ucto/blob/master/docs/ucto_manual.pdf)
-for further documentation.
-
-
-
+Please consulr README.md
diff --git a/config/tokconfig-generic b/config/tokconfig-generic
index 2a2bd58..bda63f0 100644
--- a/config/tokconfig-generic
+++ b/config/tokconfig-generic
@@ -1,3 +1,4 @@
+version=0.2
[RULE-ORDER]
URL URL-WWW URL-DOMAIN
E-MAIL WORD-PARPREFIX WORD-PARSUFFIX WORD-COMPOUND
@@ -6,15 +7,6 @@ NUMBER-YEAR TIME FRACNUMBER NUMBER CURRENCY WORD PUNCTUATION UNKNOWN
[META-RULES]
-SPLITTER=%
-NUMBER-ORDINAL = \p{N}+-?(?: %ORDINALS% )(?:\Z|\P{Lu}|\P{Ll})$
-ABBREVIATION-KNOWN = (?:\p{P}*)?(?:\A|[^\p{L}\.])((?:%ABBREVIATIONS%)\.)(?:\Z|\P{L})
-WORD-TOKEN =(%TOKENS%)(?:\p{P}*)?$
-#WORD-WITHPREFIX = (?:\A|[^\p{Lu}\.]|[^\p{Ll}\.])(?: %ATTACHEDPREFIXES% )\p{L}+
-#WORD-WITHSUFFIX = ((?:\p{L}|\p{N}|-)+(?: %ATTACHEDSUFFIXES% ))(?:\Z)
-#WORD-INFIX-COMPOUND = ((?:\p{L}|\p{N}|-)+(?: %ATTACHEDSUFFIXES% )-(?:\p{L}+))$
-PREFIX = (?:\A|[^\p{Lu}\.]|[^\p{Ll}\.])(%PREFIXES% )(\p{L}+)
-SUFFIX = ((?:\p{L})+)( %SUFFIXES% )(?:\Z|\P{L})
[RULES]
%include url
diff --git a/configure b/configure
index ef382a1..df143d4 100755
--- a/configure
+++ b/configure
@@ -1,6 +1,6 @@
#! /bin/sh
# Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.69 for ucto 0.9.4.
+# Generated by GNU Autoconf 2.69 for ucto 0.9.5.
#
# Report bugs to <lamasoftware at science.ru.nl>.
#
@@ -590,8 +590,8 @@ MAKEFLAGS=
# Identity of this package.
PACKAGE_NAME='ucto'
PACKAGE_TARNAME='ucto'
-PACKAGE_VERSION='0.9.4'
-PACKAGE_STRING='ucto 0.9.4'
+PACKAGE_VERSION='0.9.5'
+PACKAGE_STRING='ucto 0.9.5'
PACKAGE_BUGREPORT='lamasoftware at science.ru.nl'
PACKAGE_URL=''
@@ -1369,7 +1369,7 @@ if test "$ac_init_help" = "long"; then
# Omit some internal or obsolete options to make the list less imposing.
# This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF
-\`configure' configures ucto 0.9.4 to adapt to many kinds of systems.
+\`configure' configures ucto 0.9.5 to adapt to many kinds of systems.
Usage: $0 [OPTION]... [VAR=VALUE]...
@@ -1440,7 +1440,7 @@ fi
if test -n "$ac_init_help"; then
case $ac_init_help in
- short | recursive ) echo "Configuration of ucto 0.9.4:";;
+ short | recursive ) echo "Configuration of ucto 0.9.5:";;
esac
cat <<\_ACEOF
@@ -1578,7 +1578,7 @@ fi
test -n "$ac_init_help" && exit $ac_status
if $ac_init_version; then
cat <<\_ACEOF
-ucto configure 0.9.4
+ucto configure 0.9.5
generated by GNU Autoconf 2.69
Copyright (C) 2012 Free Software Foundation, Inc.
@@ -2198,7 +2198,7 @@ cat >config.log <<_ACEOF
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.
-It was created by ucto $as_me 0.9.4, which was
+It was created by ucto $as_me 0.9.5, which was
generated by GNU Autoconf 2.69. Invocation command line was
$ $0 $@
@@ -3061,7 +3061,7 @@ fi
# Define the identity of the package.
PACKAGE='ucto'
- VERSION='0.9.4'
+ VERSION='0.9.5'
cat >>confdefs.h <<_ACEOF
@@ -17012,12 +17012,12 @@ if test -n "$uctodata_CFLAGS"; then
pkg_cv_uctodata_CFLAGS="$uctodata_CFLAGS"
elif test -n "$PKG_CONFIG"; then
if test -n "$PKG_CONFIG" && \
- { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"uctodata >= 0.3 \""; } >&5
- ($PKG_CONFIG --exists --print-errors "uctodata >= 0.3 ") 2>&5
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"uctodata\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "uctodata") 2>&5
ac_status=$?
$as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
test $ac_status = 0; }; then
- pkg_cv_uctodata_CFLAGS=`$PKG_CONFIG --cflags "uctodata >= 0.3 " 2>/dev/null`
+ pkg_cv_uctodata_CFLAGS=`$PKG_CONFIG --cflags "uctodata" 2>/dev/null`
test "x$?" != "x0" && pkg_failed=yes
else
pkg_failed=yes
@@ -17029,12 +17029,12 @@ if test -n "$uctodata_LIBS"; then
pkg_cv_uctodata_LIBS="$uctodata_LIBS"
elif test -n "$PKG_CONFIG"; then
if test -n "$PKG_CONFIG" && \
- { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"uctodata >= 0.3 \""; } >&5
- ($PKG_CONFIG --exists --print-errors "uctodata >= 0.3 ") 2>&5
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"uctodata\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "uctodata") 2>&5
ac_status=$?
$as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
test $ac_status = 0; }; then
- pkg_cv_uctodata_LIBS=`$PKG_CONFIG --libs "uctodata >= 0.3 " 2>/dev/null`
+ pkg_cv_uctodata_LIBS=`$PKG_CONFIG --libs "uctodata" 2>/dev/null`
test "x$?" != "x0" && pkg_failed=yes
else
pkg_failed=yes
@@ -17055,38 +17055,111 @@ else
_pkg_short_errors_supported=no
fi
if test $_pkg_short_errors_supported = yes; then
- uctodata_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "uctodata >= 0.3 " 2>&1`
+ uctodata_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "uctodata" 2>&1`
else
- uctodata_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "uctodata >= 0.3 " 2>&1`
+ uctodata_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "uctodata" 2>&1`
fi
# Put the nasty error message in config.log where it belongs
echo "$uctodata_PKG_ERRORS" >&5
- as_fn_error $? "Package requirements (uctodata >= 0.3 ) were not met:
+ { $as_echo "$as_me:${as_lineno-$LINENO}: ATTENTION:
+ ucto datafiles are not installed!
+ ucto will work with only a minimal default configuration.
+ You should consider installing the uctodata package! " >&5
+$as_echo "$as_me: ATTENTION:
+ ucto datafiles are not installed!
+ ucto will work with only a minimal default configuration.
+ You should consider installing the uctodata package! " >&6;}
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { $as_echo "$as_me:${as_lineno-$LINENO}: ATTENTION:
+ ucto datafiles are not installed!
+ ucto will work with only a minimal default configuration.
+ You should consider installing the uctodata package! " >&5
+$as_echo "$as_me: ATTENTION:
+ ucto datafiles are not installed!
+ ucto will work with only a minimal default configuration.
+ You should consider installing the uctodata package! " >&6;}
+else
+ uctodata_CFLAGS=$pkg_cv_uctodata_CFLAGS
+ uctodata_LIBS=$pkg_cv_uctodata_LIBS
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
-$uctodata_PKG_ERRORS
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for uctodata" >&5
+$as_echo_n "checking for uctodata... " >&6; }
-Consider adjusting the PKG_CONFIG_PATH environment variable if you
-installed software in a non-standard prefix.
+if test -n "$uctodata_CFLAGS"; then
+ pkg_cv_uctodata_CFLAGS="$uctodata_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"uctodata >= 0.3\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "uctodata >= 0.3") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_uctodata_CFLAGS=`$PKG_CONFIG --cflags "uctodata >= 0.3" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
+if test -n "$uctodata_LIBS"; then
+ pkg_cv_uctodata_LIBS="$uctodata_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+ if test -n "$PKG_CONFIG" && \
+ { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"uctodata >= 0.3\""; } >&5
+ ($PKG_CONFIG --exists --print-errors "uctodata >= 0.3") 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; then
+ pkg_cv_uctodata_LIBS=`$PKG_CONFIG --libs "uctodata >= 0.3" 2>/dev/null`
+ test "x$?" != "x0" && pkg_failed=yes
+else
+ pkg_failed=yes
+fi
+ else
+ pkg_failed=untried
+fi
-Alternatively, you may set the environment variables uctodata_CFLAGS
-and uctodata_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details." "$LINENO" 5
-elif test $pkg_failed = untried; then
- { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+
+
+if test $pkg_failed = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
$as_echo "no" >&6; }
- { { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
-$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
-as_fn_error $? "The pkg-config script could not be found or is too old. Make sure it
-is in your PATH or set the PKG_CONFIG environment variable to the full
-path to pkg-config.
-Alternatively, you may set the environment variables uctodata_CFLAGS
-and uctodata_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details.
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+ _pkg_short_errors_supported=yes
+else
+ _pkg_short_errors_supported=no
+fi
+ if test $_pkg_short_errors_supported = yes; then
+ uctodata_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "uctodata >= 0.3" 2>&1`
+ else
+ uctodata_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "uctodata >= 0.3" 2>&1`
+ fi
+ # Put the nasty error message in config.log where it belongs
+ echo "$uctodata_PKG_ERRORS" >&5
-To get pkg-config, see <http://pkg-config.freedesktop.org/>.
-See \`config.log' for more details" "$LINENO" 5; }
+ { $as_echo "$as_me:${as_lineno-$LINENO}: ATTENTION:
+ Your ucto datafiles are are outdated
+ You should consider installing a newer version of the uctodata package!" >&5
+$as_echo "$as_me: ATTENTION:
+ Your ucto datafiles are are outdated
+ You should consider installing a newer version of the uctodata package!" >&6;}
+elif test $pkg_failed = untried; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+ { $as_echo "$as_me:${as_lineno-$LINENO}: ATTENTION:
+ Your ucto datafiles are are outdated
+ You should consider installing a newer version of the uctodata package!" >&5
+$as_echo "$as_me: ATTENTION:
+ Your ucto datafiles are are outdated
+ You should consider installing a newer version of the uctodata package!" >&6;}
else
uctodata_CFLAGS=$pkg_cv_uctodata_CFLAGS
uctodata_LIBS=$pkg_cv_uctodata_LIBS
@@ -17095,6 +17168,8 @@ $as_echo "yes" >&6; }
fi
+
+fi
# Checks for library functions.
ac_config_files="$ac_config_files Makefile ucto.pc ucto-icu.pc m4/Makefile config/Makefile docs/Makefile src/Makefile tests/Makefile include/Makefile include/ucto/Makefile"
@@ -17633,7 +17708,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
# report actual input values of CONFIG_FILES etc. instead of their
# values after options handling.
ac_log="
-This file was extended by ucto $as_me 0.9.4, which was
+This file was extended by ucto $as_me 0.9.5, which was
generated by GNU Autoconf 2.69. Invocation command line was
CONFIG_FILES = $CONFIG_FILES
@@ -17699,7 +17774,7 @@ _ACEOF
cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
ac_cs_version="\\
-ucto config.status 0.9.4
+ucto config.status 0.9.5
configured by $0, generated by GNU Autoconf 2.69,
with options \\"\$ac_cs_config\\"
diff --git a/configure.ac b/configure.ac
index c4cd9a6..cd03408 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ(2.59)
-AC_INIT([ucto], [0.9.4], [lamasoftware at science.ru.nl])
+AC_INIT([ucto], [0.9.5], [lamasoftware at science.ru.nl])
AM_INIT_AUTOMAKE([foreign])
AC_CONFIG_SRCDIR([configure.ac])
AC_CONFIG_MACRO_DIR([m4])
@@ -114,8 +114,22 @@ PKG_CHECK_MODULES([ticcutils], [ticcutils >= 0.6] )
CXXFLAGS="$CXXFLAGS $ticcutils_CFLAGS"
LIBS="$LIBS $ticcutils_LIBS"
-PKG_CHECK_MODULES([uctodata], [uctodata >= 0.3] )
-
+PKG_CHECK_MODULES(
+ [uctodata],
+ [uctodata],
+ [PKG_CHECK_MODULES(
+ [uctodata],
+ [uctodata >= 0.3],
+ [],
+ [AC_MSG_NOTICE([ATTENTION:
+ Your ucto datafiles are are outdated
+ You should consider installing a newer version of the uctodata package!])])
+
+ ],
+ [AC_MSG_NOTICE([ATTENTION:
+ ucto datafiles are not installed!
+ ucto will work with only a minimal default configuration.
+ You should consider installing the uctodata package!] )] )
# Checks for library functions.
AC_OUTPUT([
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/ucto.git