[SCM] Packaging for numexpr branch, upstream, updated. b73b60b9a4464e295f837f531c92d212825192fc

Antonio Valentino antonio.valentino at tiscali.it
Sun Mar 11 17:36:41 UTC 2012


The following commit has been merged in the upstream branch:
commit b73b60b9a4464e295f837f531c92d212825192fc
Author: Antonio Valentino <antonio.valentino at tiscali.it>
Date:   Sat Mar 10 19:14:03 2012 +0100

    Imported Upstream version 2.0.1

diff --git a/ANNOUNCE.txt b/ANNOUNCE.txt
index 1e6cf69..5d4563a 100644
--- a/ANNOUNCE.txt
+++ b/ANNOUNCE.txt
@@ -1,18 +1,21 @@
 ==========================
- Announcing Numexpr 1.4.2
+ Announcing Numexpr 2.0.1
 ==========================
 
 Numexpr is a fast numerical expression evaluator for NumPy.  With it,
 expressions that operate on arrays (like "3*a+4*b") are accelerated
 and use less memory than doing the same calculation in Python.
 
+It has multi-threaded capabilities, as well as support for Intel's
+VML library, which allows for squeezing the last drop of performance
+out of your multi-core processors.
+
 What's new
 ==========
 
-This is a maintenance release.  The most annying issues have been
-fixed (including the reduction mal-function introduced in 1.4 series).
-Also, several performance enhancements (specially for VML and small
-array operation) are included too.
+In this release, better docstrings for `evaluate` and the reduction
+methods (`sum`, `prod`) are in place.  Also, compatibility with
+Python 2.5 has been restored (2.4 is no longer supported).
 
 In case you want to know more in detail what has changed in this
 version, see:
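
As a point of reference, the kind of usage the announcement describes
looks like this (a minimal sketch, assuming NumPy and this numexpr
release are installed; the operand arrays are illustrative):

    import numpy as np
    import numexpr as ne

    a = np.arange(1e6)
    b = np.arange(1e6)

    # one pass over the data, multi-threaded, less temporary memory
    result = ne.evaluate("3*a + 4*b")
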
diff --git a/AUTHORS.txt b/AUTHORS.txt
new file mode 100644
index 0000000..5ff9ab3
--- /dev/null
+++ b/AUTHORS.txt
@@ -0,0 +1,18 @@
+Numexpr was initially written by David Cooke, and extended to more
+types by Tim Hochberg.
+
+Francesc Alted contributed support for booleans and single-precision
+floating point types, efficient strided and unaligned array operations
+and multi-threading code.
+
+Ivan Vilata contributed support for strings.
+
+Gregor Thalhammer implemented the support for Intel VML (Vector Math
+Library).
+
+Mark Wiebe added support for the new iterator in NumPy, which allows
+for better performance in more scenarios (like broadcasting,
+Fortran-ordered arrays or non-native byte orders).
+
+Gaëtan de Menten contributed important bug fixes and speed
+enhancements.
diff --git a/INSTALL.txt b/INSTALL.txt
index f12866d..5881d90 100644
--- a/INSTALL.txt
+++ b/INSTALL.txt
@@ -7,7 +7,9 @@ VML support.
 Building
 ========
 
-`Numexpr` requires Python 2.4 or greater, and NumPy 1.0 or greater.
+This version of `Numexpr` requires Python 2.5 or greater,
+and NumPy 1.6 or greater.
+
 It's built in the standard Python way:
 
 $ python setup.py build
@@ -21,10 +23,10 @@ $ python -c "import numexpr; numexpr.test()"
 Enabling Intel's VML support
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Starting from release 1.2 on, numexpr includes support for Intel's VML
-library.  This allows for better performance on Intel architectures,
-mainly when evaluating transcendental functions (trigonometrical,
-exponential...).  It also enables numexpr using several CPU cores.
+numexpr includes support for Intel's VML library.  This allows for
+better performance on Intel architectures, mainly when evaluating
+transcendental functions (trigonometric, exponential...).  It also
+enables numexpr to use several CPU cores.
 
 If you have Intel's MKL (the library that embeds VML), just copy the
 `site.cfg.example` that comes in the distribution to `site.cfg` and
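
After building, a quick way to check whether VML support was actually
compiled in is sketched below (`print_versions()` and the `set_vml_*`
calls appear elsewhere in this release; the exact output varies per
build):

    import numexpr

    numexpr.print_versions()   # reports the detected VML version, if any

    # these only have an effect on a VML-enabled build
    numexpr.set_vml_accuracy_mode('low')
    numexpr.set_vml_num_threads(2)
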
diff --git a/LICENSE.txt b/LICENSE.txt
index c583780..de9a582 100644
--- a/LICENSE.txt
+++ b/LICENSE.txt
@@ -1,5 +1,6 @@
 Copyright (c) 2007,2008  David M. Cooke <david.m.cooke at gmail.com>
-Copyright (c) 2009  Francesc Alted <faltet at pytables.org>
+Copyright (c) 2009,2010  Francesc Alted <faltet at pytables.org>
+Copyright (c) 2011-      See AUTHORS.txt
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/PKG-INFO b/PKG-INFO
index f6a0d7b..23e96a2 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 1.0
 Name: numexpr
-Version: 1.4.2
+Version: 2.0.1
 Summary: Fast numerical expression evaluator for NumPy
 Home-page: http://code.google.com/p/numexpr/
 Author: David M. Cooke, Francesc Alted and others
diff --git a/README.txt b/README.txt
index da2c7ee..1267c4d 100644
--- a/README.txt
+++ b/README.txt
@@ -5,6 +5,14 @@ Numexpr is a fast numerical expression evaluator for NumPy.  With it,
 expressions that operate on arrays (like "3*a+4*b") are accelerated
 and use less memory than doing the same calculation in Python.
 
+In addition, its multi-threaded capabilities can make use of all your
+cores, which may accelerate computations, especially if they are not
+memory-bound (e.g. those using transcendental functions).
+
+Last but not least, numexpr can make use of Intel's VML (Vector Math
+Library, normally integrated in its Math Kernel Library, or MKL).
+This allows further acceleration of transcendental expressions.
+
 
 Examples of use
 ===============
@@ -101,49 +109,63 @@ Supported functions
 
 The next are the current supported set:
 
-    * where(bool, number1, number2): number
-        Number1 if the bool condition is true, number2 otherwise.
-    * {sin,cos,tan}(float|complex): float|complex
-        Trigonometric sine, cosine or tangent.
-    * {arcsin,arccos,arctan}(float|complex): float|complex
-        Trigonometric inverse sine, cosine or tangent.
-    * arctan2(float1, float2): float
-        Trigonometric inverse tangent of float1/float2.
-    * {sinh,cosh,tanh}(float|complex): float|complex
-        Hyperbolic sine, cosine or tangent.
-    * {arcsinh,arccosh,arctanh}(float|complex): float|complex
-        Hyperbolic inverse sine, cosine or tangent.
-    * {log,log10,log1p}(float|complex): float|complex
-        Natural, base-10 and log(1+x) logarithms.
-    * {exp,expm1}(float|complex): float|complex
-        Exponential and exponential minus one.
-    * sqrt(float|complex): float|complex
-        Square root.
-    * abs(float|complex): float|complex
-        Absolute value.
-    * {real,imag}(complex): float
-        Real or imaginary part of complex.
-    * complex(float, float): complex
-        Complex from real and imaginary parts.
+  * where(bool, number1, number2): number
+      Number1 if the bool condition is true, number2 otherwise.
+  * {sin,cos,tan}(float|complex): float|complex
+      Trigonometric sine, cosine or tangent.
+  * {arcsin,arccos,arctan}(float|complex): float|complex
+      Trigonometric inverse sine, cosine or tangent.
+  * arctan2(float1, float2): float
+      Trigonometric inverse tangent of float1/float2.
+  * {sinh,cosh,tanh}(float|complex): float|complex
+      Hyperbolic sine, cosine or tangent.
+  * {arcsinh,arccosh,arctanh}(float|complex): float|complex
+      Hyperbolic inverse sine, cosine or tangent.
+  * {log,log10,log1p}(float|complex): float|complex
+      Natural, base-10 and log(1+x) logarithms.
+  * {exp,expm1}(float|complex): float|complex
+      Exponential and exponential minus one.
+  * sqrt(float|complex): float|complex
+      Square root.
+  * abs(float|complex): float|complex
+      Absolute value.
+  * {real,imag}(complex): float
+      Real or imaginary part of complex.
+  * complex(float, float): complex
+      Complex from real and imaginary parts.
 
 .. Notes:
 
-   + `abs()` for complex inputs returns a ``complex`` output too.
-   This is a departure from NumPy where a ``float`` is returned
-   instead.  However, Numexpr is not flexible enough yet so as to
-   allow this to happen.  Meanwhile, if you want to mimic NumPy
-   behaviour, you may want to select the real part via the ``real``
-   function (e.g. "real(abs(cplx))") or via the ``real`` selector
-   (e.g. "abs(cplx).real").
+  + `abs()` for complex inputs returns a ``complex`` output too.  This
+  is a departure from NumPy, where a ``float`` is returned instead.
+  However, Numexpr is not yet flexible enough to allow this.
+  Meanwhile, if you want to mimic NumPy behaviour, you may want to
+  select the real part via the ``real`` function
+  (e.g. "real(abs(cplx))") or via the ``real`` selector
+  (e.g. "abs(cplx).real").
 
 More functions can be added if you need them.
 
 
+Supported reduction operations
+==============================
+
+The following is the currently supported set:
+
+  * sum(number, axis=None): Sum of array elements over a given axis.
+    Negative axes are not supported.
+
+  * prod(number, axis=None): Product of array elements over a given
+    axis.  Negative axes are not supported.
+
+
 General routines
 ================
 
-  * evaluate(expression, local_dict=None, global_dict=None, **kwargs):
-  Evaluate a simple array expression element-wise.  See examples above.
+  * evaluate(expression, local_dict=None, global_dict=None,
+             out=None, order='K', casting='safe', **kwargs):
+    Evaluate a simple array expression element-wise.  See docstrings
+    for more info on parameters.  Also, see examples above.
 
   * test():  Run all the tests in the test suite.
 
@@ -246,12 +268,7 @@ http://code.google.com/p/numexpr/wiki/Overview
 Authors
 =======
 
-Numexpr was initially written by David Cooke, and extended to more
-types by Tim Hochberg.  Francesc Alted contributed support for
-booleans and simple-precision floating point types, efficient strided
-and unaligned array operations and multi-threading code.  Ivan Vilata
-contributed support for strings.  Gregor Thalhammer implemented the
-support for Intel VML (Vector Math Library).
+See AUTHORS.txt
 
 
 License
diff --git a/RELEASE_NOTES.txt b/RELEASE_NOTES.txt
index 7ae1294..88beda3 100644
--- a/RELEASE_NOTES.txt
+++ b/RELEASE_NOTES.txt
@@ -1,7 +1,48 @@
 ======================================
- Release notes for Numexpr 1.4 series
+ Release notes for Numexpr 2.0 series
 ======================================
 
+Changes from 2.0 series to 2.0.1
+================================
+
+* Restored compatibility with Python 2.5 (2.4 is no longer
+  supported).
+
+* `numexpr.evaluate` is fully documented now, in particular the new
+  `out`, `order` and `casting` parameters.
+
+* Reduction operations are fully documented now.
+
+* Negative axes in reductions are not supported (they never have
+  been, actually), and a `ValueError` will be raised if they are used.
+
+
+Changes from 1.x series to 2.0
+==============================
+
+- Added support for the new iterator object in NumPy 1.6 and later.
+
+  This allows for better performance with operations that imply
+  broadcasting, Fortran-ordered arrays or non-native byte orders.
+  Performance for other scenarios is preserved (except for very small
+  arrays).
+
+- Division in numexpr is consistent now with Python/NumPy.  Fixes #22
+  and #58.
+
+- Constants like "2." or "2.0" are now evaluated as floats, not
+  integers.  Fixes #59.
+
+- The `evaluate()` function has received a new parameter `out` for
+  storing the result in already allocated arrays.  This is very
+  useful when dealing with large arrays where allocating new space
+  for the result is not acceptable.  Closes #56.
+
+- The maximum number of threads has been raised from 256 to 4096.
+  Machines with more cores will still be able to import numexpr, but
+  will be limited to 4096 threads (which is an absurdly high number
+  already).
+
+
 Changes from 1.4.1 to 1.4.2
 ===========================
 
diff --git a/bench/boolean_timing.py b/bench/boolean_timing.py
index cd0e81e..711287e 100644
--- a/bench/boolean_timing.py
+++ b/bench/boolean_timing.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 import sys
 import timeit
 import numpy
diff --git a/bench/issue-47.py b/bench/issue-47.py
new file mode 100644
index 0000000..31c68a6
--- /dev/null
+++ b/bench/issue-47.py
@@ -0,0 +1,7 @@
+import numpy
+import numexpr
+
+numexpr.set_num_threads(8)
+x0,x1,x2,x3,x4,x5 = [0,1,2,3,4,5]
+t = numpy.linspace(0,1,44100000).reshape(-1,1)
+numexpr.evaluate('(x0+x1*t+x2*t**2)* cos(x3+x4*t+x5**t)')
diff --git a/bench/latency.py b/bench/latency.py
deleted file mode 100644
index 9fe7802..0000000
--- a/bench/latency.py
+++ /dev/null
@@ -1,33 +0,0 @@
-# Benchmark for checking if numexpr leaks memory when evaluating
-# expressions that changes continously.  It also serves for computing
-# the latency of numexpr when working with small arrays.
-
-import sys
-from time import time
-import numpy as np
-import numexpr as ne
-
-N = 1000*10
-M = 1000
-
-print "Number of iterations %s.  Length of the array: %s " % (N, M)
-
-a = np.arange(M)
-
-print "Expressions that *cannot* be cached:"
-t1 = time()
-for i in xrange(N):
-    r = ne.evaluate("2 * a + (a + 1) ** 2 == %d" % i)
-    if i % 1000 == 0:
-        sys.stdout.write('.')
-
-print "\nEvaluated %s iterations in: %s seconds" % (N, round(time()-t1,3))
-
-print "Expressions that *can* be cached:"
-t1 = time()
-for i in xrange(N):
-    r = ne.evaluate("2 * a + (a + 1) ** 2 == i")
-    if i % 1000 == 0:
-        sys.stdout.write('.')
-
-print "\nEvaluated %s iterations in: %s seconds" % (N, round(time()-t1,3))
diff --git a/bench/multidim.py b/bench/multidim.py
index 5bc6ae6..c3ddffe 100644
--- a/bench/multidim.py
+++ b/bench/multidim.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 # Script to check that multidimensional arrays are speed-up properly too
 # Based on a script provided by Andrew Collette.
 
diff --git a/bench/poly.py b/bench/poly.py
index 56b9d88..68e295e 100644
--- a/bench/poly.py
+++ b/bench/poly.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 #######################################################################
 # This script compares the speed of the computation of a polynomial
 # for different (numpy and numexpr) in-memory paradigms.
diff --git a/bench/timing.py b/bench/timing.py
index 7c310fa..6c02614 100644
--- a/bench/timing.py
+++ b/bench/timing.py
@@ -1,7 +1,17 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 import timeit, numpy
 
 array_size = 1e6
-iterations = 10
+iterations = 2
 
 # Choose the type you want to benchmark
 #dtype = 'int8'
diff --git a/bench/unaligned-simple.py b/bench/unaligned-simple.py
index 9bed22d..9edfb59 100644
--- a/bench/unaligned-simple.py
+++ b/bench/unaligned-simple.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 """Very simple test that compares the speed of operating with
 aligned vs unaligned arrays.
 """
@@ -13,7 +23,7 @@ shape = (1000, 10000)   # multidimensional test
 ne.print_versions()
 
 Z_fast = np.zeros(shape, dtype=[('x',np.float64),('y',np.int64)])
-Z_slow = np.zeros(shape, dtype=[('x',np.float64),('y',np.bool)])
+Z_slow = np.zeros(shape, dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])
 
 x_fast = Z_fast['x']
 t = Timer("x_fast * x_fast", "from __main__ import x_fast")
diff --git a/bench/varying-expr.py b/bench/varying-expr.py
index a851f31..b9e1250 100644
--- a/bench/varying-expr.py
+++ b/bench/varying-expr.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 # Benchmark for checking if numexpr leaks memory when evaluating
 # expressions that changes continously.  It also serves for computing
 # the latency of numexpr when working with small arrays.
@@ -7,33 +17,37 @@ from time import time
 import numpy as np
 import numexpr as ne
 
-N = 1000*10
-M = 1000
+N = 100
+M = 10
+
+def timed_eval(eval_func, expr_func):
+    t1 = time()
+    for i in xrange(N):
+        r = eval_func(expr_func(i))
+        if i % 10 == 0:
+            sys.stdout.write('.')
+    print " done in %s seconds" % round(time() - t1, 3)
 
 print "Number of iterations %s.  Length of the array: %s " % (N, M)
 
 a = np.arange(M)
 
-print "Expressions that *cannot* be cached:"
-t1 = time()
-for i in xrange(N):
-    r = ne.evaluate("2 * a + (a + 1) ** 2 == %d" % i)
-    if i % 1000 == 0:
-        sys.stdout.write('.')
-print "\nEvaluated %s iterations in: %s seconds" % (N, round(time()-t1,3))
-
-print "Expressions that *can* be cached:"
-t1 = time()
-for i in xrange(N):
-    r = ne.evaluate("2 * a + (a + 1) ** 2 == i")
-    if i % 1000 == 0:
-        sys.stdout.write('.')
-print "\nEvaluated %s iterations in: %s seconds" % (N, round(time()-t1,3))
-
-print "Using python virtual machine with non-cacheable expressions:"
-t1 = time()
-for i in xrange(N):
-    r = eval("2 * a + (a + 1) ** 2 == %d" % i)
-    if i % 1000 == 0:
-        sys.stdout.write('.')
-print "\nEvaluated %s iterations in: %s seconds" % (N, round(time()-t1,3))
+# lots of duplicates to collapse
+#expr = '+'.join('(a + 1) * %d' % i for i in range(50))
+# no duplicate to collapse
+expr = '+'.join('(a + %d) * %d' % (i, i) for i in range(50))
+
+def non_cacheable(i):
+    return expr + '+ %d' % i
+
+def cacheable(i):
+    return expr + '+ i'
+
+print "* Numexpr with non-cacheable expressions: ",
+timed_eval(ne.evaluate, non_cacheable)
+print "* Numexpr with cacheable expressions: ",
+timed_eval(ne.evaluate, cacheable)
+print "* Numpy with non-cacheable expressions: ",
+timed_eval(eval, non_cacheable)
+print "* Numpy with cacheable expressions: ",
+timed_eval(eval, cacheable)
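
The cacheable/non-cacheable split this benchmark measures follows from
numexpr keying its compiled-expression cache on the expression string
itself; a minimal sketch of the two cases (assuming this release;
names are illustrative):

    import numpy as np
    import numexpr as ne

    a = np.arange(1000)

    # same string on every iteration: compiled once, then reused
    for i in xrange(100):
        r = ne.evaluate("2*a + i")        # 'i' is fetched at run time

    # a different string on every iteration: recompiled each time
    for i in xrange(100):
        r = ne.evaluate("2*a + %d" % i)
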
diff --git a/bench/vml_timing.py b/bench/vml_timing.py
index 1f8b3c3..7775d11 100644
--- a/bench/vml_timing.py
+++ b/bench/vml_timing.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 import sys
 import timeit
 import numpy
@@ -71,7 +81,8 @@ def compare_times(expr, nexpr):
 setupNP = """\
 from numpy import arange, linspace, arctan2, sqrt, sin, cos, exp, log
 from numpy import rec as records
-from numexpr import evaluate
+#from numexpr import evaluate
+from numexpr import %s
 
 # Initialize a recarray of 16 MB in size
 r=records.array(None, formats='a%s,i4,f4,f8', shape=%s)
@@ -85,11 +96,14 @@ f3[:] = linspace(0,1,len(i2))
 f4[:] = f3*1.23
 """
 
-setupNP_contiguous = setupNP % ((4, array_size,) + \
+eval_method = "evaluate"
+setupNP_contiguous = setupNP % ((eval_method, 4, array_size,) + \
                                (".copy()",)*4 + \
                                (array_size,))
-setupNP_strided = setupNP % (4, array_size, "", "", "", "", array_size)
-setupNP_unaligned = setupNP % (1, array_size, "", "", "", "", array_size)
+setupNP_strided = setupNP % (eval_method, 4, array_size,
+                             "", "", "", "", array_size)
+setupNP_unaligned = setupNP % (eval_method, 1, array_size,
+                               "", "", "", "", array_size)
 
 
 expressions = []
@@ -117,6 +131,8 @@ if __name__ == '__main__':
     import numexpr
     numexpr.print_versions()
 
+    numpy.seterr(all='ignore')
+
     numexpr.set_vml_accuracy_mode('low')
     numexpr.set_vml_num_threads(2)
 
@@ -132,6 +148,7 @@ if __name__ == '__main__':
     ntratios = numpy.array(numpy_nttime) / numpy.array(numexpr_nttime)
 
 
+    print "eval method: %s" % eval_method
     print "*************** Numexpr vs NumPy speed-ups *******************"
 #     print "numpy total:", sum(numpy_ttime)/iterations
 #     print "numpy strided total:", sum(numpy_sttime)/iterations
diff --git a/numexpr.egg-info/PKG-INFO b/numexpr.egg-info/PKG-INFO
index f6a0d7b..23e96a2 100644
--- a/numexpr.egg-info/PKG-INFO
+++ b/numexpr.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 1.0
 Name: numexpr
-Version: 1.4.2
+Version: 2.0.1
 Summary: Fast numerical expression evaluator for NumPy
 Home-page: http://code.google.com/p/numexpr/
 Author: David M. Cooke, Francesc Alted and others
diff --git a/numexpr.egg-info/SOURCES.txt b/numexpr.egg-info/SOURCES.txt
index ea3af86..278a7ba 100644
--- a/numexpr.egg-info/SOURCES.txt
+++ b/numexpr.egg-info/SOURCES.txt
@@ -1,4 +1,5 @@
 ANNOUNCE.txt
+AUTHORS.txt
 INSTALL.txt
 LICENSE.txt
 MANIFEST.in
@@ -10,7 +11,7 @@ site.cfg.example
 ./numexpr/tests/test_numexpr.py
 bench/boolean_timing.py
 bench/issue-36.py
-bench/latency.py
+bench/issue-47.py
 bench/multidim.py
 bench/poly.py
 bench/timing.py
diff --git a/numexpr/__init__.py b/numexpr/__init__.py
index 69b5d10..24c0b41 100644
--- a/numexpr/__init__.py
+++ b/numexpr/__init__.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 """
 Numexpr is a fast numerical expression evaluator for NumPy.  With it,
 expressions that operate on arrays (like "3*a+4*b") are accelerated
@@ -34,6 +44,9 @@ from numexpr.utils import (
 
 # Initialize the number of threads to be used
 ncores = detect_number_of_cores()
+# Check that we don't surpass the MAX_THREADS in interpreter.c
+if ncores > 4096:
+    ncores = 4096
 set_num_threads(ncores)
 # The default for VML is 1 thread (see #39)
 set_vml_num_threads(1)
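
The capping above only bounds the automatically detected core count;
the thread pools can still be sized explicitly, as in this sketch
(both functions are exported by this release):

    import numexpr as ne

    ne.set_num_threads(4)        # threads for the expression virtual machine
    ne.set_vml_num_threads(1)    # keep VML itself single-threaded (see #39)
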
diff --git a/numexpr/complex_functions.inc b/numexpr/complex_functions.inc
index b3971a7..77e2be2 100644
--- a/numexpr/complex_functions.inc
+++ b/numexpr/complex_functions.inc
@@ -1,3 +1,11 @@
+/*********************************************************************
+  Numexpr - Fast numerical array expression evaluator for NumPy.
+
+      License: MIT
+      Author:  See AUTHORS.txt
+
+  See LICENSE.txt for details about copyright and rights to use.
+**********************************************************************/
 
 
 /* constants */
diff --git a/numexpr/cpuinfo.py b/numexpr/cpuinfo.py
index c353382..9fb3fe9 100644
--- a/numexpr/cpuinfo.py
+++ b/numexpr/cpuinfo.py
@@ -1,4 +1,15 @@
 #!/usr/bin/env python
+
+###################################################################
+#  cpuinfo - Get information about CPU
+#
+#      License: BSD
+#      Author:  Pearu Peterson <pearu at cens.ioc.ee>
+#
+#  See LICENSES/cpuinfo.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 """
 cpuinfo
 
@@ -25,7 +36,7 @@ def getoutput(cmd, successful_status=(0,), stacklevel=1):
         status, output = commands.getstatusoutput(cmd)
     except EnvironmentError, e:
         warnings.warn(str(e), UserWarning, stacklevel=stacklevel)
-        return False, output
+        return False, ''
     if os.WIFEXITED(status) and os.WEXITSTATUS(status) in successful_status:
         return True, output
     return False, output
diff --git a/numexpr/expressions.py b/numexpr/expressions.py
index bee19e8..abd1a9a 100644
--- a/numexpr/expressions.py
+++ b/numexpr/expressions.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 __all__ = ['E']
 
 import operator
@@ -60,7 +70,7 @@ def ophelper(f):
             if isConstant(x):
                 args[i] = x = ConstantNode(x)
             if not isinstance(x, ExpressionNode):
-                raise TypeError("unsupported object type: %s" % (type(x),))
+                raise TypeError("unsupported object type: %s" % type(x))
         return f(*args)
     func.__name__ = f.__name__
     func.__doc__ = f.__doc__
@@ -92,6 +102,7 @@ def commonKind(nodes):
 
 max_int32 = 2147483647
 min_int32 = -max_int32 - 1
+
 def bestConstantType(x):
     if isinstance(x, str):  # ``numpy.string_`` is a subclass of ``str``
         return str
@@ -103,7 +114,7 @@ def bestConstantType(x):
     # ``double`` objects are kept as is to allow the user to force
     # promotion of results by using double constants, e.g. by operating
     # a float (32-bit) array with a double (64-bit) constant.
-    if isinstance(x, (double)):
+    if isinstance(x, double):
         return double
     # Numeric conversion to boolean values is not tried because
     # ``bool(1) == True`` (same for 0 and False), so 0 and 1 would be
@@ -111,23 +122,24 @@ def bestConstantType(x):
     # supported.
     if isinstance(x, (bool, numpy.bool_)):
         return bool
-    # ``long`` is not explicitly needed since ``int`` automatically
-    # returns longs when needed (since Python 2.3).
+    if isinstance(x, (int, numpy.integer)):
+        # Constants needing more than 32 bits are always
+        # considered ``long``, *regardless of the platform*, so we
+        # can clearly tell 32- and 64-bit constants apart.
+        if not (min_int32 <= x <= max_int32):
+            return long
+        return int
     # The duality of float and double in Python avoids that we have to list
     # ``double`` too.
-    for converter in int, float, complex:
+    for converter in float, complex:
         try:
             y = converter(x)
         except StandardError, err:
             continue
-        if x == y:
-            # Constants needing more than 32 bits are always
-            # considered ``long``, *regardless of the platform*, so we
-            # can clearly tell 32- and 64-bit constants apart.
-            if converter is int and not (min_int32 <= x <= max_int32):
-                return long
+        if y == x:
             return converter
 
+
 def getKind(x):
     converter = bestConstantType(x)
     return type_to_kind[converter]
@@ -154,6 +166,9 @@ def func(func, minkind=None, maxkind=None):
         kind = commonKind(args)
         if kind in ('int', 'long'):
             # Exception for following NumPy casting rules
+            #FIXME: this is not always desirable.  The following
+            # functions return ints (for int inputs) in numpy but
+            # not in numexpr: copy, abs, fmod, ones_like
             kind = 'double'
         else:
             # Apply regular casting rules
@@ -167,6 +182,7 @@ def func(func, minkind=None, maxkind=None):
 @ophelper
 def where_func(a, b, c):
     if isinstance(a, ConstantNode):
+        #FIXME: This prevents where(True, a, b)
         raise ValueError("too many dimensions")
     if allConstantNodes([a,b,c]):
         return ConstantNode(numpy.where(a, b, c))
@@ -179,12 +195,12 @@ def encode_axis(axis):
         axis = interpreter.allaxes
     else:
         if axis < 0:
-            axis = interpreter.maxdims - axis
+            raise ValueError("negative axes are not supported")
         if axis > 254:
             raise ValueError("cannot encode axis")
     return RawNode(axis)
 
-def sum_func(a, axis=-1):
+def sum_func(a, axis=None):
     axis = encode_axis(axis)
     if isinstance(a, ConstantNode):
         return a
@@ -192,7 +208,7 @@ def sum_func(a, axis=-1):
         a = ConstantNode(a)
     return FuncNode('sum', [a, axis], kind=a.astKind)
 
-def prod_func(a, axis=-1):
+def prod_func(a, axis=None):
     axis = encode_axis(axis)
     if isinstance(a, (bool, int, long, float, double, complex)):
         a = ConstantNode(a)
@@ -210,8 +226,24 @@ def div_op(a, b):
     return OpNode('div', [a,b])
 
 @ophelper
+def truediv_op(a, b):
+    if get_optimization() in ('moderate', 'aggressive'):
+        if (isinstance(b, ConstantNode) and
+            (a.astKind == b.astKind) and
+            a.astKind in ('float', 'double', 'complex')):
+            return OpNode('mul', [a, ConstantNode(1./b.value)])
+    kind = commonKind([a, b])
+    if kind in ('bool', 'int', 'long'):
+        kind = 'double'
+    return OpNode('div', [a, b], kind=kind)
+
+@ophelper
+def rtruediv_op(a, b):
+    return truediv_op(b, a)
+
+@ophelper
 def pow_op(a, b):
-    if allConstantNodes([a,b]):
+    if allConstantNodes([a, b]):
         return ConstantNode(a**b)
     if isinstance(b, ConstantNode):
         x = b.value
@@ -237,7 +269,8 @@ def pow_op(a, b):
                     p = OpNode('mul', [p,p])
                 if ishalfpower:
                     kind = commonKind([a])
-                    if kind in ('int', 'long'): kind = 'double'
+                    if kind in ('int', 'long'):
+                        kind = 'double'
                     r = multiply(r, OpNode('sqrt', [a], kind))
                 if r is None:
                     r = OpNode('ones_like', [a])
@@ -248,7 +281,7 @@ def pow_op(a, b):
             if x == -1:
                 return OpNode('div', [ConstantNode(1),a])
             if x == 0:
-                return FuncNode('ones_like', [a])
+                return OpNode('ones_like', [a])
             if x == 0.5:
                 kind = a.astKind
                 if kind in ('int', 'long'): kind = 'double'
@@ -345,12 +378,22 @@ class ExpressionNode(object):
     def __pos__(self):
         return self
 
+    # The next check is commented out. See #24 for more info.
+
+    # def __nonzero__(self):
+    #     raise TypeError("You can't use Python's standard boolean operators in "
+    #                     "NumExpr expressions. You should use their bitwise "
+    #                     "counterparts instead: '&' instead of 'and', "
+    #                     "'|' instead of 'or', and '~' instead of 'not'.")
+
     __add__ = __radd__ = binop('add')
     __sub__ = binop('sub')
     __rsub__ = binop('sub', reversed=True)
     __mul__ = __rmul__ = binop('mul')
-    __div__ =  div_op
+    __div__ = div_op
     __rdiv__ = binop('div', reversed=True)
+    __truediv__ = truediv_op
+    __rtruediv__ = rtruediv_op
     __pow__ = pow_op
     __rpow__ = binop('pow', reversed=True)
     __mod__ = binop('mod')
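
The `pow_op` handler shown above rewrites small constant exponents
into cheaper operations; a sketch of the visible effect under the
default optimization level (per the code: `a**2` becomes a multiply,
`a**0.5` a sqrt, `a**-1` a division):

    import numpy as np
    import numexpr as ne

    a = np.linspace(1, 2, 5)

    ne.evaluate("a**2")     # computed as a*a, not pow()
    ne.evaluate("a**0.5")   # computed as sqrt(a)
    ne.evaluate("a**-1")    # computed as 1/a
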
diff --git a/numexpr/functions.inc b/numexpr/functions.inc
index 3466770..4db313d 100644
--- a/numexpr/functions.inc
+++ b/numexpr/functions.inc
@@ -1,4 +1,12 @@
 // -*- c-mode -*-
+/*********************************************************************
+  Numexpr - Fast numerical array expression evaluator for NumPy.
+
+      License: MIT
+      Author:  See AUTHORS.txt
+
+  See LICENSE.txt for details about copyright and rights to use.
+**********************************************************************/
 
 /* These #if blocks make it easier to query this file, without having
    to define every row function before #including it. */
diff --git a/numexpr/interp_body.c b/numexpr/interp_body.c
index 050e039..11be655 100644
--- a/numexpr/interp_body.c
+++ b/numexpr/interp_body.c
@@ -1,7 +1,25 @@
+/*********************************************************************
+  Numexpr - Fast numerical array expression evaluator for NumPy.
+
+      License: MIT
+      Author:  See AUTHORS.txt
+
+  See LICENSE.txt for details about copyright and rights to use.
+**********************************************************************/
+
+
 {
 #define VEC_LOOP(expr) for(j = 0; j < BLOCK_SIZE; j++) {       \
         expr;                                   \
     }
+
+#define VEC_ARG0(expr)                          \
+    BOUNDS_CHECK(store_in);                     \
+    {                                           \
+        char *dest = mem[store_in];             \
+        VEC_LOOP(expr);                         \
+    } break
+
 #define VEC_ARG1(expr)                          \
     BOUNDS_CHECK(store_in);                     \
     BOUNDS_CHECK(arg1);                         \
@@ -9,7 +27,7 @@
         char *dest = mem[store_in];             \
         char *x1 = mem[arg1];                   \
         intp ss1 = params.memsizes[arg1];       \
-        intp sb1 = params.memsteps[arg1];       \
+        intp sb1 = memsteps[arg1];              \
         /* nowarns is defined and used so as to \
         avoid compiler warnings about unused    \
         variables */                            \
@@ -17,6 +35,7 @@
         nowarns += 1;                           \
         VEC_LOOP(expr);                         \
     } break
+
 #define VEC_ARG2(expr)                          \
     BOUNDS_CHECK(store_in);                     \
     BOUNDS_CHECK(arg1);                         \
@@ -25,14 +44,14 @@
         char *dest = mem[store_in];             \
         char *x1 = mem[arg1];                   \
         intp ss1 = params.memsizes[arg1];       \
-        intp sb1 = params.memsteps[arg1];       \
+        intp sb1 = memsteps[arg1];              \
         /* nowarns is defined and used so as to \
         avoid compiler warnings about unused    \
         variables */                            \
         intp nowarns = ss1+sb1+*x1;             \
         char *x2 = mem[arg2];                   \
         intp ss2 = params.memsizes[arg2];       \
-        intp sb2 = params.memsteps[arg2];       \
+        intp sb2 = memsteps[arg2];              \
         nowarns += ss2+sb2+*x2;                 \
         VEC_LOOP(expr);                         \
     } break
@@ -46,17 +65,17 @@
         char *dest = mem[store_in];             \
         char *x1 = mem[arg1];                   \
         intp ss1 = params.memsizes[arg1];       \
-        intp sb1 = params.memsteps[arg1];       \
+        intp sb1 = memsteps[arg1];              \
         /* nowarns is defined and used so as to \
         avoid compiler warnings about unused    \
         variables */                            \
         intp nowarns = ss1+sb1+*x1;             \
         char *x2 = mem[arg2];                   \
         intp ss2 = params.memsizes[arg2];       \
-        intp sb2 = params.memsteps[arg2];       \
+        intp sb2 = memsteps[arg2];              \
         char *x3 = mem[arg3];                   \
         intp ss3 = params.memsizes[arg3];       \
-        intp sb3 = params.memsteps[arg3];       \
+        intp sb3 = memsteps[arg3];              \
         nowarns += ss2+sb2+*x2;                 \
         nowarns += ss3+sb3+*x3;                 \
         VEC_LOOP(expr);                         \
@@ -95,37 +114,19 @@
         expr;                                   \
     } break
 
-
-    unsigned int pc, j, k, r;
+    unsigned int pc, j;
 
     /* set up pointers to next block of inputs and outputs */
-    mem[0] = params.output + index * params.memsteps[0];
-    for (r = 0; r < params.n_inputs; r++) {
-        struct index_data id = params.index_data[r+1];
-        if (id.count) {
-            mem[1+r] = params.inputs[r];
-            for (j = 0; j < BLOCK_SIZE; j++) {
-                unsigned int flatindex = 0;
-                for (k = 0; k < id.count; k ++)
-                    flatindex += id.strides[k] * id.index[k];
-                memcpy(mem[1+r] + (j*id.size), id.buffer + flatindex, id.size);
-                k = id.count - 1;
-                id.index[k] += 1;
-                if (id.index[k] >= id.shape[k])
-                    while (id.index[k] >= id.shape[k]) {
-                        id.index[k] -= id.shape[k];
-                        if (k < 1) break;
-                        id.index[k-1] += 1;
-                        k -= 1;
-                    }
-            }
-        } else {
-            mem[1+r] = params.inputs[r] + index * params.memsteps[1+r];
-        }
-    }
+#ifdef SINGLE_ITEM_CONST_LOOP
+    mem[0] = params.output;
+#else
+    /* use the iterator's inner loop data */
+    memcpy(mem, iter_dataptr, (1+params.n_inputs)*sizeof(char*));
+    memcpy(memsteps, iter_strides, (1+params.n_inputs)*sizeof(intp));
+#endif
 
     /* WARNING: From now on, only do references to mem[arg[123]]
-       & params.memsteps[arg[123]] inside the VEC_ARG[123] macros,
+       & memsteps[arg[123]] inside the VEC_ARG[123] macros,
        or you will risk accessing invalid addresses.  */
 
     for (pc = 0; pc < params.prog_len; pc += 4) {
@@ -134,14 +135,22 @@
         unsigned int arg1 = params.program[pc+2];
         unsigned int arg2 = params.program[pc+3];
         #define      arg3   params.program[pc+5]
-        #define store_index params.index_data[store_in]
-        #define reduce_ptr  (dest + flat_index(&store_index, j))
-        #define i_reduce    *(int *)reduce_ptr
-        #define l_reduce    *(long long *)reduce_ptr
-        #define f_reduce    *(float *)reduce_ptr
-        #define d_reduce    *(double *)reduce_ptr
-        #define cr_reduce   *(double *)ptr
-        #define ci_reduce   *((double *)ptr+1)
+        /* Iterator reduce macros */
+#ifdef REDUCTION_INNER_LOOP /* Reduce is the inner loop */
+        #define i_reduce    *(int *)dest
+        #define l_reduce    *(long long *)dest
+        #define f_reduce    *(float *)dest
+        #define d_reduce    *(double *)dest
+        #define cr_reduce   *(double *)dest
+        #define ci_reduce   *((double *)dest+1)
+#else /* Reduce is the outer loop */
+        #define i_reduce    i_dest
+        #define l_reduce    l_dest
+        #define f_reduce    f_dest
+        #define d_reduce    d_dest
+        #define cr_reduce   cr_dest
+        #define ci_reduce   ci_dest
+#endif
         #define b_dest ((char *)dest)[j]
         #define i_dest ((int *)dest)[j]
         #define l_dest ((long long *)dest)[j]
@@ -149,7 +158,7 @@
         #define d_dest ((double *)dest)[j]
         #define cr_dest ((double *)dest)[2*j]
         #define ci_dest ((double *)dest)[2*j+1]
-        #define s_dest ((char *)dest + j*params.memsteps[store_in])
+        #define s_dest ((char *)dest + j*memsteps[store_in])
         #define b1    ((char   *)(x1+j*sb1))[0]
         #define i1    ((int    *)(x1+j*sb1))[0]
         #define l1    ((long long *)(x1+j*sb1))[0]
@@ -177,7 +186,6 @@
         /* Some temporaries */
         double da, db;
         cdouble ca, cb;
-        char *ptr;
 
         switch (op) {
 
@@ -194,6 +202,7 @@
         case OP_COPY_DD: VEC_ARG1(memcpy(&d_dest, s1, sizeof(double)));
         case OP_COPY_CC: VEC_ARG1(memcpy(&cr_dest, s1, sizeof(double)*2));
 
+        /* Bool */
         case OP_INVERT_BB: VEC_ARG1(b_dest = !b1);
         case OP_AND_BBB: VEC_ARG2(b_dest = (b1 && b2));
         case OP_OR_BBB: VEC_ARG2(b_dest = (b1 || b2));
@@ -202,6 +211,7 @@
         case OP_NE_BBB: VEC_ARG2(b_dest = (b1 != b2));
         case OP_WHERE_BBBB: VEC_ARG3(b_dest = b1 ? b2 : b3);
 
+        /* Comparisons */
         case OP_GT_BII: VEC_ARG2(b_dest = (i1 > i2));
         case OP_GE_BII: VEC_ARG2(b_dest = (i1 >= i2));
         case OP_EQ_BII: VEC_ARG2(b_dest = (i1 == i2));
@@ -227,8 +237,9 @@
         case OP_EQ_BSS: VEC_ARG2(b_dest = (stringcmp(s1, s2, ss1, ss2) == 0));
         case OP_NE_BSS: VEC_ARG2(b_dest = (stringcmp(s1, s2, ss1, ss2) != 0));
 
+        /* Int */
         case OP_CAST_IB: VEC_ARG1(i_dest = (int)(b1));
-        case OP_ONES_LIKE_I: VEC_ARG1(i_dest = 1);
+        case OP_ONES_LIKE_II: VEC_ARG0(i_dest = 1);
         case OP_NEG_II: VEC_ARG1(i_dest = -i1);
 
         case OP_ADD_III: VEC_ARG2(i_dest = i1 + i2);
@@ -240,8 +251,9 @@
 
         case OP_WHERE_IBII: VEC_ARG3(i_dest = b1 ? i2 : i3);
 
+        /* Long */
         case OP_CAST_LI: VEC_ARG1(l_dest = (long long)(i1));
-        case OP_ONES_LIKE_L: VEC_ARG1(l_dest = 1);
+        case OP_ONES_LIKE_LL: VEC_ARG0(l_dest = 1);
         case OP_NEG_LL: VEC_ARG1(l_dest = -l1);
 
         case OP_ADD_LLL: VEC_ARG2(l_dest = l1 + l2);
@@ -253,10 +265,10 @@
 
         case OP_WHERE_LBLL: VEC_ARG3(l_dest = b1 ? l2 : l3);
 
-          /* Float */
+        /* Float */
         case OP_CAST_FI: VEC_ARG1(f_dest = (float)(i1));
         case OP_CAST_FL: VEC_ARG1(f_dest = (float)(l1));
-        case OP_ONES_LIKE_F: VEC_ARG1(f_dest = 1.0);
+        case OP_ONES_LIKE_FF: VEC_ARG0(f_dest = 1.0);
         case OP_NEG_FF: VEC_ARG1(f_dest = -f1);
 
         case OP_ADD_FFF: VEC_ARG2(f_dest = f1 + f2);
@@ -303,11 +315,11 @@
             VEC_ARG2(f_dest = functions_fff[arg3](f1, f2));
 #endif
 
-          /* Double */
+        /* Double */
         case OP_CAST_DI: VEC_ARG1(d_dest = (double)(i1));
         case OP_CAST_DL: VEC_ARG1(d_dest = (double)(l1));
         case OP_CAST_DF: VEC_ARG1(d_dest = (double)(f1));
-        case OP_ONES_LIKE_D: VEC_ARG1(d_dest = 1.0);
+        case OP_ONES_LIKE_DD: VEC_ARG0(d_dest = 1.0);
         case OP_NEG_DD: VEC_ARG1(d_dest = -d1);
 
         case OP_ADD_DDD: VEC_ARG2(d_dest = d1 + d2);
@@ -354,7 +366,7 @@
             VEC_ARG2(d_dest = functions_ddd[arg3](d1, d2));
 #endif
 
-            /* Complex */
+        /* Complex */
         case OP_CAST_CI: VEC_ARG1(cr_dest = (double)(i1);
                                   ci_dest = 0);
         case OP_CAST_CL: VEC_ARG1(cr_dest = (double)(l1);
@@ -363,8 +375,8 @@
                                   ci_dest = 0);
         case OP_CAST_CD: VEC_ARG1(cr_dest = d1;
                                   ci_dest = 0);
-        case OP_ONES_LIKE_C: VEC_ARG1(cr_dest = 1;
-                                  ci_dest = 0);
+        case OP_ONES_LIKE_CC: VEC_ARG0(cr_dest = 1;
+                                       ci_dest = 0);
         case OP_NEG_CC: VEC_ARG1(cr_dest = -c1r;
                                  ci_dest = -c1i);
 
@@ -415,20 +427,19 @@
         case OP_COMPLEX_CDD: VEC_ARG2(cr_dest = d1;
                                       ci_dest = d2);
 
+        /* Reductions */
         case OP_SUM_IIN: VEC_ARG1(i_reduce += i1);
         case OP_SUM_LLN: VEC_ARG1(l_reduce += l1);
         case OP_SUM_FFN: VEC_ARG1(f_reduce += f1);
         case OP_SUM_DDN: VEC_ARG1(d_reduce += d1);
-        case OP_SUM_CCN: VEC_ARG1(ptr = reduce_ptr;
-                                  cr_reduce += c1r;
+        case OP_SUM_CCN: VEC_ARG1(cr_reduce += c1r;
                                   ci_reduce += c1i);
 
         case OP_PROD_IIN: VEC_ARG1(i_reduce *= i1);
         case OP_PROD_LLN: VEC_ARG1(l_reduce *= l1);
         case OP_PROD_FFN: VEC_ARG1(f_reduce *= f1);
         case OP_PROD_DDN: VEC_ARG1(d_reduce *= d1);
-        case OP_PROD_CCN: VEC_ARG1(ptr = reduce_ptr;
-                                   da = cr_reduce*c1r - ci_reduce*c1i;
+        case OP_PROD_CCN: VEC_ARG1(da = cr_reduce*c1r - ci_reduce*c1i;
                                    ci_reduce = cr_reduce*c1i + ci_reduce*c1r;
                                    cr_reduce = da);
 
diff --git a/numexpr/interpreter.c b/numexpr/interpreter.c
index 82a0214..b58d2c7 100644
--- a/numexpr/interpreter.c
+++ b/numexpr/interpreter.c
@@ -1,7 +1,19 @@
+/*********************************************************************
+  Numexpr - Fast numerical array expression evaluator for NumPy.
+
+      License: MIT
+      Author:  See AUTHORS.txt
+
+  See LICENSE.txt for details about copyright and rights to use.
+**********************************************************************/
+
+
 #include "Python.h"
 #include "structmember.h"
 #include "numpy/noprefix.h"
 #include "numpy/arrayscalars.h"
+#include "numpy/ndarrayobject.h"
+#include "numpy/npy_cpu.h"
 #include "math.h"
 #include "string.h"
 #include "assert.h"
@@ -27,12 +39,17 @@
 #endif
 
 #ifdef _WIN32
-#define inline __inline
-#include "missing_posix_functions.inc"
-#include "msvc_function_stubs.inc"
+  #define inline __inline
+  #ifndef __MINGW32__
+    #include "missing_posix_functions.inc"
+  #endif
+  #include "msvc_function_stubs.inc"
 #endif
 
-#define L1_SIZE 32*1024         /* The average L1 cache size */
+/* x86 platform works with unaligned reads and writes */
+#if (defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64))
+#  define USE_UNALIGNED_ACCESS 1
+#endif
 
 #ifdef USE_VML
 /* The values below have been tuned for a nowadays Core2 processor */
@@ -51,9 +68,10 @@
 #define BLOCK_SIZE2 16
 #endif
 
-
-/* The maximum number of threads (for some static arrays) */
-#define MAX_THREADS 256
+/* The maximum number of threads (for some static arrays).
+ * Choose this large enough for most monsters out there.
+ * Keep this in sync with the number in __init__.py. */
+#define MAX_THREADS 4096
 
 /* Global variables for threads */
 int nthreads = 1;                /* number of desired threads in pool */
@@ -413,11 +431,15 @@ get_return_sig(PyObject* program) {
     int sig;
     char last_opcode;
     int end = PyString_Size(program);
+    char *program_str = PyString_AS_STRING(program);
+
     do {
         end -= 4;
         if (end < 0) return 'X';
+        last_opcode = program_str[end];
     }
-    while ((last_opcode = PyString_AS_STRING(program)[end]) == OP_NOOP);
+    while (last_opcode == OP_NOOP);
+
     sig = op_signature(last_opcode, 0);
     if (sig <= 0) {
         return 'X';
@@ -906,34 +928,18 @@ struct thread_data {
     struct vm_params params;
     int ret_code;
     int *pc_error;
+    char **errmsg;
+    /* One memsteps array per thread */
+    intp *memsteps[MAX_THREADS];
+    /* One iterator per thread */
+    NpyIter *iter[MAX_THREADS];
+    /* When doing nested iteration for a reduction */
+    NpyIter *reduce_iter[MAX_THREADS];
+    /* Flag indicating reduction is the outer loop instead of the inner */
+    int reduction_outer_loop;
 } th_params;
 
 
-static inline unsigned int
-flat_index(struct index_data *id, unsigned int j) {
-    int i, k = id->count - 1;
-    unsigned int findex = id->findex;
-    if (k < 0) return 0;
-    if (findex == -1) {
-        findex = 0;
-        for (i = 0; i < id->count; i++)
-            findex += id->strides[i] * id->index[i];
-    }
-    id->index[k] += 1;
-    if (id->index[k] >= id->shape[k]) {
-        while (id->index[k] >= id->shape[k]) {
-            id->index[k] -= id->shape[k];
-            if (k < 1) break;
-            id->index[--k] += 1;
-        }
-        id->findex = -1;
-    } else {
-        id->findex = findex + id->strides[k];
-    }
-    return findex;
-}
-
-
 #define DO_BOUNDS_CHECK 1
 
 #if DO_BOUNDS_CHECK
@@ -991,52 +997,153 @@ free_temps_space(struct vm_params params, char **mem)
     }
 }
 
-/* Serial version of VM engine */
-static inline int
-vm_engine_serial(intp start, intp vlen, intp block_size,
-                 struct vm_params params, int *pc_error)
+/* Serial/parallel task iterator version of the VM engine */
+static int
+vm_engine_iter_task(NpyIter *iter, intp *memsteps, struct vm_params params, int *pc_error, char **errmsg)
 {
-    intp index;
     char **mem = params.mem;
-    get_temps_space(params, mem, block_size);
-    for (index = start; index < vlen; index += block_size) {
-#define BLOCK_SIZE block_size
+    NpyIter_IterNextFunc *iternext;
+    intp block_size, *size_ptr;
+    char **iter_dataptr;
+    intp *iter_strides;
+
+    iternext = NpyIter_GetIterNext(iter, errmsg);
+    if (iternext == NULL) {
+        return -1;
+    }
+
+    size_ptr = NpyIter_GetInnerLoopSizePtr(iter);
+    iter_dataptr = NpyIter_GetDataPtrArray(iter);
+    iter_strides = NpyIter_GetInnerStrideArray(iter);
+
+    /*
+     * First do all the blocks with a compile-time fixed size.
+     * This makes a big difference (30-50% on some tests).
+     */
+    block_size = *size_ptr;
+    while (block_size == BLOCK_SIZE1) {
+#define REDUCTION_INNER_LOOP
+#define BLOCK_SIZE BLOCK_SIZE1
 #include "interp_body.c"
 #undef BLOCK_SIZE
+#undef REDUCTION_INNER_LOOP
+        iternext(iter);
+        block_size = *size_ptr;
     }
-    free_temps_space(params, mem);
+
+    /* Then finish off the rest */
+    if (block_size > 0) do {
+#define REDUCTION_INNER_LOOP
+#define BLOCK_SIZE block_size
+#include "interp_body.c"
+#undef BLOCK_SIZE
+#undef REDUCTION_INNER_LOOP
+    } while (iternext(iter));
+
     return 0;
 }
 
-/* Serial version of VM engine (specific for BLOCK_SIZE1) */
-static inline int
-vm_engine_serial1(intp start, intp vlen,
-                  struct vm_params params, int *pc_error)
+static int
+vm_engine_iter_outer_reduce_task(NpyIter *iter, intp *memsteps, struct vm_params params, int *pc_error, char **errmsg)
 {
-    intp index;
     char **mem = params.mem;
-    get_temps_space(params, mem, BLOCK_SIZE1);
-    for (index = start; index < vlen; index += BLOCK_SIZE1) {
+    NpyIter_IterNextFunc *iternext;
+    intp block_size, *size_ptr;
+    char **iter_dataptr;
+    intp *iter_strides;
+
+    iternext = NpyIter_GetIterNext(iter, errmsg);
+    if (iternext == NULL) {
+        return -1;
+    }
+
+    size_ptr = NpyIter_GetInnerLoopSizePtr(iter);
+    iter_dataptr = NpyIter_GetDataPtrArray(iter);
+    iter_strides = NpyIter_GetInnerStrideArray(iter);
+
+    /*
+     * First do all the blocks with a compile-time fixed size.
+     * This makes a big difference (30-50% on some tests).
+     */
+    block_size = *size_ptr;
+    while (block_size == BLOCK_SIZE1) {
 #define BLOCK_SIZE BLOCK_SIZE1
 #include "interp_body.c"
 #undef BLOCK_SIZE
+        iternext(iter);
+        block_size = *size_ptr;
     }
-    free_temps_space(params, mem);
+
+    /* Then finish off the rest */
+    if (block_size > 0) do {
+#define BLOCK_SIZE block_size
+#include "interp_body.c"
+#undef BLOCK_SIZE
+    } while (iternext(iter));
+
     return 0;
 }
 
-/* Parallel version of VM engine */
-static inline int
-vm_engine_parallel(intp start, intp vlen, intp block_size,
-                   struct vm_params params, int *pc_error)
+/* Parallel iterator version of VM engine */
+static int
+vm_engine_iter_parallel(NpyIter *iter, struct vm_params params, int *pc_error,
+                        char **errmsg)
 {
+    unsigned int i;
+    npy_intp numblocks, taskfactor;
+
+    if (errmsg == NULL) {
+        return -1;
+    }
+
     /* Populate parameters for worker threads */
-    th_params.start = start;
-    th_params.vlen = vlen;
-    th_params.block_size = block_size;
+    NpyIter_GetIterIndexRange(iter, &th_params.start, &th_params.vlen);
+    /*
+     * Try to make it so each thread gets 16 tasks.  This is a compromise
+     * between 1 task per thread and one block per task.
+     */
+    taskfactor = 16*BLOCK_SIZE1*nthreads;
+    numblocks = (th_params.vlen - th_params.start + taskfactor - 1) /
+                            taskfactor;
+    th_params.block_size = numblocks * BLOCK_SIZE1;
+
     th_params.params = params;
     th_params.ret_code = 0;
     th_params.pc_error = pc_error;
+    th_params.errmsg = errmsg;
+    th_params.iter[0] = iter;
+    /* Make one copy for each additional thread */
+    for (i = 1; i < nthreads; ++i) {
+        th_params.iter[i] = NpyIter_Copy(iter);
+        if (th_params.iter[i] == NULL) {
+            --i;
+            for (; i > 0; --i) {
+                NpyIter_Deallocate(th_params.iter[i]);
+            }
+            return -1;
+        }
+    }
+    th_params.memsteps[0] = params.memsteps;
+    /* Make one copy of memsteps for each additional thread */
+    for (i = 1; i < nthreads; ++i) {
+        th_params.memsteps[i] = PyMem_New(intp,
+                    1 + params.n_inputs + params.n_constants + params.n_temps);
+        if (th_params.memsteps[i] == NULL) {
+            --i;
+            for (; i > 0; --i) {
+                PyMem_Del(th_params.memsteps[i]);
+            }
+            for (i = 0; i < nthreads; ++i) {
+                NpyIter_Deallocate(th_params.iter[i]);
+            }
+            return -1;
+        }
+        memcpy(th_params.memsteps[i], th_params.memsteps[0],
+                sizeof(intp) *
+                (1 + params.n_inputs + params.n_constants + params.n_temps));
+    }
+
+    Py_BEGIN_ALLOW_THREADS;
 
     /* Synchronization point for all threads (wait for initialization) */
     pthread_mutex_lock(&count_threads_mutex);
@@ -1060,40 +1167,26 @@ vm_engine_parallel(intp start, intp vlen, intp block_size,
     }
     pthread_mutex_unlock(&count_threads_mutex);
 
-    return th_params.ret_code;
-}
+    Py_END_ALLOW_THREADS;
 
-/* VM engine for each thread (specific for BLOCK_SIZE1) */
-static inline int
-vm_engine_thread1(char **mem, intp index,
-                  struct vm_params params, int *pc_error)
-{
-#define BLOCK_SIZE BLOCK_SIZE1
-#include "interp_body.c"
-#undef BLOCK_SIZE
-    return 0;
-}
+    /* Deallocate all the iterator and memsteps copies */
+    for (i = 1; i < nthreads; ++i) {
+        NpyIter_Deallocate(th_params.iter[i]);
+        PyMem_Del(th_params.memsteps[i]);
+    }
 
-/* VM engine for each threadi (general) */
-static inline int
-vm_engine_thread(char **mem, intp index, intp block_size,
-                  struct vm_params params, int *pc_error)
-{
-#define BLOCK_SIZE block_size
-#include "interp_body.c"
-#undef BLOCK_SIZE
-    return 0;
+    return th_params.ret_code;
 }
 
 /* Do the worker job for a certain thread */
-void *th_worker(void *tids)
+void *th_worker(void *tidptr)
 {
-    /* int tid = *(int *)tids; */
-    intp index;                 /* private copy of gindex */
+    int tid = *(int *)tidptr;
     /* Parameters for threads */
     intp start;
     intp vlen;
     intp block_size;
+    NpyIter *iter;
     struct vm_params params;
     int *pc_error;
     int ret;
@@ -1102,6 +1195,9 @@ void *th_worker(void *tids)
     int n_temps;
     size_t memsize;
     char **mem;
+    intp *memsteps;
+    intp istart, iend;
+    char **errmsg;
 
     while (1) {
 
@@ -1138,47 +1234,72 @@ void *th_worker(void *tids)
         /* XXX malloc seems thread safe for POSIX, but for Win? */
         mem = malloc(memsize);
         memcpy(mem, params.mem, memsize);
-        /* Get temporary space for each thread */
-        ret = get_temps_space(params, mem, block_size);
-        if (ret < 0) {
-            pthread_mutex_lock(&count_mutex);
-            giveup = 1;
-            /* Propagate error to main thread */
-            th_params.ret_code = ret;
-            pthread_mutex_unlock(&count_mutex);
-        }
+
+        errmsg = th_params.errmsg;
+
+        params.mem = mem;
 
         /* Loop over blocks */
         pthread_mutex_lock(&count_mutex);
         if (!init_sentinels_done) {
             /* Set sentinels and other global variables */
             gindex = start;
-            index = gindex;
-            init_sentinels_done = 1;    /* sentinels have been initialised */
+            istart = gindex;
+            iend = istart + block_size;
+            if (iend > vlen) {
+                iend = vlen;
+            }
+            init_sentinels_done = 1;  /* sentinels have been initialised */
             giveup = 0;            /* no giveup initially */
         } else {
             gindex += block_size;
-            index = gindex;
+            istart = gindex;
+            iend = istart + block_size;
+            if (iend > vlen) {
+                iend = vlen;
+            }
+        }
+        /* Grab one of the iterators */
+        iter = th_params.iter[tid];
+        if (iter == NULL) {
+            th_params.ret_code = -1;
+            giveup = 1;
+        }
+        memsteps = th_params.memsteps[tid];
+        /* Get temporary space for each thread */
+        ret = get_temps_space(params, mem, BLOCK_SIZE1);
+        if (ret < 0) {
+            /* Propagate error to main thread */
+            th_params.ret_code = ret;
+            giveup = 1;
         }
         pthread_mutex_unlock(&count_mutex);
-        while (index < vlen && !giveup) {
-            if (block_size == BLOCK_SIZE1) {
-                ret = vm_engine_thread1(mem, index, params, pc_error);
-            }
-            else {
-                ret = vm_engine_thread(mem, index, block_size,
-                                       params, pc_error);
+
+        while (istart < vlen && !giveup) {
+            /* Reset the iterator to the range for this task */
+            ret = NpyIter_ResetToIterIndexRange(iter, istart, iend,
+                                                errmsg);
+            /* Execute the task */
+            if (ret >= 0) {
+                ret = vm_engine_iter_task(iter, memsteps, params, pc_error, errmsg);
             }
+
             if (ret < 0) {
                 pthread_mutex_lock(&count_mutex);
                 giveup = 1;
                 /* Propagate error to main thread */
                 th_params.ret_code = ret;
                 pthread_mutex_unlock(&count_mutex);
+                break;
             }
+
             pthread_mutex_lock(&count_mutex);
             gindex += block_size;
-            index = gindex;
+            istart = gindex;
+            iend = istart + block_size;
+            if (iend > vlen) {
+                iend = vlen;
+            }
             pthread_mutex_unlock(&count_mutex);
         }
 
@@ -1203,87 +1324,155 @@ void *th_worker(void *tids)
     return(0);
 }
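
The scheme above has each worker claim successive [istart, iend) ranges
from the shared gindex cursor under count_mutex, then reset its private
iterator copy to just that range.  A minimal Python sketch of the same
block-claiming pattern (all names here are illustrative; none of this is
numexpr code):

    import threading

    lock = threading.Lock()
    cursor = [0]            # shared cursor, like the C global gindex
    VLEN = 1000             # total length to compute
    BLOCK = 128             # like BLOCK_SIZE1

    def claim_block():
        # Atomically claim the next [istart, iend) range; None when done.
        with lock:
            istart = cursor[0]
            cursor[0] += BLOCK
        if istart >= VLEN:
            return None
        return istart, min(istart + BLOCK, VLEN)

    def worker(done):
        while True:
            rng = claim_block()
            if rng is None:
                break
            done.append(rng)    # stand-in for resetting the iterator
                                # to this range and running the VM task

    done = []
    threads = [threading.Thread(target=worker, args=(done,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()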
 
-/* Compute expresion in [start:vlen], if possible with threads */
-static inline int
-vm_engine_block(intp start, intp vlen, intp block_size,
-                struct vm_params params, int *pc_error)
+static int
+run_interpreter(NumExprObject *self, NpyIter *iter, NpyIter *reduce_iter,
+                     int reduction_outer_loop, int *pc_error)
 {
     int r;
+    intp plen;
+    struct vm_params params;
+    char *errmsg = NULL;
 
-    /* From now on, we can release the GIL */
-    Py_BEGIN_ALLOW_THREADS;
-    /* Run the serial version when nthreads is 1 or when the total
-       length to compute is small */
-    if ((nthreads == 1) || (vlen <= L1_SIZE) || force_serial) {
-        if (block_size == BLOCK_SIZE1) {
-            r = vm_engine_serial1(start, vlen, params, pc_error);
+    *pc_error = -1;
+    if (PyString_AsStringAndSize(self->program, (char **)&(params.program),
+                                 &plen) < 0) {
+        return -1;
+    }
+
+    params.prog_len = plen;
+    params.output = NULL;
+    params.inputs = NULL;
+    params.index_data = NULL;
+    params.n_inputs = self->n_inputs;
+    params.n_constants = self->n_constants;
+    params.n_temps = self->n_temps;
+    params.mem = self->mem;
+    params.memsteps = self->memsteps;
+    params.memsizes = self->memsizes;
+    params.r_end = PyString_Size(self->fullsig);
+
+    if ((nthreads == 1) || force_serial) {
+        /* Can do it as one "task" */
+        if (reduce_iter == NULL) {
+            /* Reset the iterator to allocate its buffers */
+            if(NpyIter_Reset(iter, NULL) != NPY_SUCCEED) {
+                return -1;
+            }
+            get_temps_space(params, params.mem, BLOCK_SIZE1);
+            Py_BEGIN_ALLOW_THREADS;
+            r = vm_engine_iter_task(iter, params.memsteps,
+                                        params, pc_error, &errmsg);
+            Py_END_ALLOW_THREADS;
+            free_temps_space(params, params.mem);
         }
         else {
-            r = vm_engine_serial(start, vlen, block_size, params, pc_error);
+            if (reduction_outer_loop) {
+                char **dataptr;
+                NpyIter_IterNextFunc *iternext;
+
+                dataptr = NpyIter_GetDataPtrArray(reduce_iter);
+                iternext = NpyIter_GetIterNext(reduce_iter, NULL);
+                if (iternext == NULL) {
+                    return -1;
+                }
+
+                get_temps_space(params, params.mem, BLOCK_SIZE1);
+                Py_BEGIN_ALLOW_THREADS;
+                do {
+                    r = NpyIter_ResetBasePointers(iter, dataptr, &errmsg);
+                    if (r >= 0) {
+                        r = vm_engine_iter_outer_reduce_task(iter,
+                                                params.memsteps, params,
+                                                pc_error, &errmsg);
+                    }
+                    if (r < 0) {
+                        break;
+                    }
+                } while (iternext(reduce_iter));
+                Py_END_ALLOW_THREADS;
+                free_temps_space(params, params.mem);
+            }
+            else {
+                char **dataptr;
+                NpyIter_IterNextFunc *iternext;
+
+                dataptr = NpyIter_GetDataPtrArray(iter);
+                iternext = NpyIter_GetIterNext(iter, NULL);
+                if (iternext == NULL) {
+                    return -1;
+                }
+
+                get_temps_space(params, params.mem, BLOCK_SIZE1);
+                Py_BEGIN_ALLOW_THREADS;
+                do {
+                    r = NpyIter_ResetBasePointers(reduce_iter, dataptr,
+                                                                    &errmsg);
+                    if (r >= 0) {
+                        r = vm_engine_iter_task(reduce_iter, params.memsteps,
+                                                params, pc_error, &errmsg);
+                    }
+                    if (r < 0) {
+                        break;
+                    }
+                } while (iternext(iter));
+                Py_END_ALLOW_THREADS;
+                free_temps_space(params, params.mem);
+            }
         }
     }
     else {
-        r = vm_engine_parallel(start, vlen, block_size, params, pc_error);
+        if (reduce_iter == NULL) {
+            r = vm_engine_iter_parallel(iter, params, pc_error, &errmsg);
+        }
+        else {
+            errmsg = "Parallel engine doesn't support reduction yet";
+            r = -1;
+        }
+    }
+
+    if (r < 0 && errmsg != NULL) {
+        PyErr_SetString(PyExc_RuntimeError, errmsg);
     }
-    /* Get the GIL again */
-    Py_END_ALLOW_THREADS;
-    return r;
-}
 
-static inline int
-vm_engine_rest(intp start, intp blen,
-               struct vm_params params, int *pc_error)
-{
-    intp index = start;
-    intp block_size = blen - start;
-    char **mem = params.mem;
-    get_temps_space(params, mem, block_size);
-#define BLOCK_SIZE block_size
-#include "interp_body.c"
-#undef BLOCK_SIZE
-    free_temps_space(params, mem);
     return 0;
 }
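
Note that the parallel branch still refuses reductions, so any expression
ending in a reduction goes through one of the serial branches above.  From
Python that covers calls such as the following (assuming numexpr is built
and importable):

    import numpy as np
    import numexpr as ne

    a = np.arange(12.0).reshape(3, 4)
    ne.evaluate("sum(a, axis=0)")   # -> array([ 12.,  15.,  18.,  21.])
    ne.evaluate("prod(a + 1)")      # full reduction down to a 0-d array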
 
 static int
-run_interpreter(NumExprObject *self, intp len, char *output, char **inputs,
-                struct index_data *index_data, int *pc_error)
+run_interpreter_const(NumExprObject *self, char *output, int *pc_error)
 {
-    int r;
-    intp plen;
-    intp blen1, blen2;
     struct vm_params params;
+    intp plen;
+    char **mem;
+    intp *memsteps;
 
     *pc_error = -1;
     if (PyString_AsStringAndSize(self->program, (char **)&(params.program),
                                  &plen) < 0) {
         return -1;
     }
+    if (self->n_inputs != 0) {
+        return -1;
+    }
     params.prog_len = plen;
     params.output = output;
-    params.inputs = inputs;
-    params.index_data = index_data;
+    params.inputs = NULL;
+    params.index_data = NULL;
     params.n_inputs = self->n_inputs;
     params.n_constants = self->n_constants;
     params.n_temps = self->n_temps;
     params.mem = self->mem;
-    params.memsteps = self->memsteps;
+    memsteps = self->memsteps;
     params.memsizes = self->memsizes;
     params.r_end = PyString_Size(self->fullsig);
 
-    blen1 = len - len % BLOCK_SIZE1;
-    r = vm_engine_block(0, blen1, BLOCK_SIZE1, params, pc_error);
-    if (r < 0) return r;
-    if (len != blen1) {
-        blen2 = len - len % BLOCK_SIZE2;
-        /* A block is generally too small for threading to be an advantage */
-        r = vm_engine_serial(blen1, blen2, BLOCK_SIZE2, params, pc_error);
-        if (r < 0) return r;
-        if (len != blen2) {
-            r = vm_engine_rest(blen2, len, params, pc_error);
-            if (r < 0) return r;
-        }
-    }
+    mem = params.mem;
+    get_temps_space(params, mem, 1);
+#define SINGLE_ITEM_CONST_LOOP
+#define BLOCK_SIZE 1
+#include "interp_body.c"
+#undef BLOCK_SIZE
+#undef SINGLE_ITEM_CONST_LOOP
+    free_temps_space(params, mem);
 
     return 0;
 }
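
run_interpreter_const covers the degenerate case with no inputs at all:
the interpreter body is instantiated with BLOCK_SIZE 1 and
SINGLE_ITEM_CONST_LOOP, and writes exactly one output element.  A
pure-constant expression is what reaches it from Python, e.g.:

    import numexpr as ne

    ne.evaluate("2*3 + 1")   # 0-d array holding 7; no input operands
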
@@ -1419,18 +1608,28 @@ void numexpr_free_resources(void)
     }
 }
 
-/* keyword arguments are ignored! */
 static PyObject *
 NumExpr_run(NumExprObject *self, PyObject *args, PyObject *kwds)
 {
-    PyObject *output = NULL, *a_inputs = NULL;
-    struct index_data *inddata = NULL;
-    unsigned int n_inputs, n_dimensions = 0;
-    intp shape[MAX_DIMS];
-    int i, j, r, pc_error;
-    intp size;
-    char **inputs = NULL;
-    intp strides[MAX_DIMS]; /* clean up XXX */
+    PyArrayObject *operands[NPY_MAXARGS];
+    PyArray_Descr *dtypes[NPY_MAXARGS], **dtypes_tmp;
+    PyObject *tmp, *ret;
+    npy_uint32 op_flags[NPY_MAXARGS];
+    NPY_CASTING casting = NPY_SAFE_CASTING;
+    NPY_ORDER order = NPY_KEEPORDER;
+    unsigned int i, n_inputs;
+    int r, pc_error = 0;
+    int reduction_axis = -1, reduction_size = 1;
+    int ex_uses_vml = 0, is_reduction = 0, reduction_outer_loop = 0;
+
+    /* To specify axes when doing a reduction */
+    int op_axes_values[NPY_MAXARGS][NPY_MAXDIMS],
+         op_axes_reduction_values[NPY_MAXARGS];
+    int *op_axes_ptrs[NPY_MAXDIMS];
+    int oa_ndim = 0;
+    int **op_axes = NULL;
+
+    NpyIter *iter = NULL, *reduce_iter = NULL;
 
     /* Check whether we need to restart threads */
     if (!init_threads_done || pid != getpid()) {
@@ -1440,292 +1639,420 @@ NumExpr_run(NumExprObject *self, PyObject *args, PyObject *kwds)
     /* Don't force serial mode by default */
     force_serial = 0;
 
+    /* Check whether there's a reduction as the final step */
+    is_reduction = last_opcode(self->program) > OP_REDUCTION;
+
     n_inputs = PyTuple_Size(args);
     if (PyString_Size(self->signature) != n_inputs) {
         return PyErr_Format(PyExc_ValueError,
                             "number of inputs doesn't match program");
     }
-    if (kwds && PyObject_Length(kwds) > 0) {
+    else if (n_inputs+1 > NPY_MAXARGS) {
         return PyErr_Format(PyExc_ValueError,
-                            "keyword arguments are not accepted");
+                            "too many inputs");
     }
 
-    /* This is overkill - we shouldn't need to allocate all of this space,
-       but this makes it easier figure out */
-    a_inputs = PyTuple_New(3*n_inputs);
-    if (!a_inputs) goto cleanup_and_exit;
-
-    inputs = PyMem_New(char *, n_inputs);
-    if (!inputs) goto cleanup_and_exit;
+    memset(operands, 0, sizeof(operands));
+    memset(dtypes, 0, sizeof(dtypes));
 
-    inddata = PyMem_New(struct index_data, n_inputs+1);
-    if (!inddata) goto cleanup_and_exit;
-    for (i = 0; i < n_inputs+1; i++)
-        inddata[i].count = 0;
-
-    /* First, make sure everything is some sort of array so that we can work
-       with their shapes. Count dimensions concurrently. */
+    if (kwds) {
+        tmp = PyDict_GetItemString(kwds, "casting"); /* borrowed ref */
+        if (tmp != NULL && !PyArray_CastingConverter(tmp, &casting)) {
+            return NULL;
+        }
+        tmp = PyDict_GetItemString(kwds, "order"); /* borrowed ref */
+        if (tmp != NULL && !PyArray_OrderConverter(tmp, &order)) {
+            return NULL;
+        }
+        tmp = PyDict_GetItemString(kwds, "ex_uses_vml"); /* borrowed ref */
+        if (tmp == NULL) {
+            return PyErr_Format(PyExc_ValueError,
+                                "ex_uses_vml parameter is required");
+        }
+        if (tmp == Py_True) {
+            ex_uses_vml = 1;
+        }
+        /* borrowed ref */
+        operands[0] = (PyArrayObject *)PyDict_GetItemString(kwds, "out");
+        if (operands[0] != NULL) {
+            if ((PyObject *)operands[0] == Py_None) {
+                operands[0] = NULL;
+            }
+            else if (!PyArray_Check(operands[0])) {
+                return PyErr_Format(PyExc_ValueError,
+                                    "out keyword parameter is not an array");
+            }
+            else {
+                Py_INCREF(operands[0]);
+            }
+        }
+    }
 
     for (i = 0; i < n_inputs; i++) {
         PyObject *o = PyTuple_GET_ITEM(args, i); /* borrowed ref */
         PyObject *a;
         char c = PyString_AS_STRING(self->signature)[i];
         int typecode = typecode_from_char(c);
-        if (typecode == -1) goto cleanup_and_exit;
-        /* Convert it just in case of a non-swapped array */
-        if (!PyArray_Check(o) || PyArray_TYPE(o) != PyArray_STRING) {
+        /* Convert it if it's not an array */
+        if (!PyArray_Check(o)) {
+            if (typecode == -1) goto fail;
             a = PyArray_FROM_OTF(o, typecode, NOTSWAPPED);
-        } else {
-            Py_INCREF(PyArray_DESCR(o));  /* typecode is not enough */
-            a = PyArray_FromAny(o, PyArray_DESCR(o), 0, 0, NOTSWAPPED, NULL);
         }
-        if (!a) goto cleanup_and_exit;
-        PyTuple_SET_ITEM(a_inputs, i, a);  /* steals reference */
-        if (PyArray_NDIM(a) > n_dimensions)
-            n_dimensions = PyArray_NDIM(a);
-    }
+        else {
+            Py_INCREF(o);
+            a = o;
+        }
+        operands[i+1] = (PyArrayObject *)a;
+        dtypes[i+1] = PyArray_DescrFromType(typecode);
 
-    /* Broadcast all of the inputs to determine the output shape (this will
-       require some modifications if we later allow a final reduction
-       operation). If an array has too few dimensions it's shape is padded
-       with ones fromthe left. All array dimensions must match, or be one. */
+        if (operands[i+1] == NULL || dtypes[i+1] == NULL) {
+            goto fail;
+        }
+        op_flags[i+1] = NPY_ITER_READONLY|
+#ifdef USE_VML
+                        (ex_uses_vml ? (NPY_ITER_CONTIG|NPY_ITER_ALIGNED) : 0)|
+#endif
+#ifndef USE_UNALIGNED_ACCESS
+                        NPY_ITER_ALIGNED|
+#endif
+                        NPY_ITER_NBO
+                        ;
+    }
 
-    for (i = 0; i < n_dimensions; i++)
-        shape[i] = 1;
-    for (i = 0; i < n_inputs; i++) {
-        PyObject *a = PyTuple_GET_ITEM(a_inputs, i);
-        unsigned int ndims = PyArray_NDIM(a);
-        int delta = n_dimensions - ndims;
-        for (j = 0; j < ndims; j++) {
-            unsigned int n = PyArray_DIM(a, j);
-            if (n == 1 || n == shape[delta+j]) continue;
-            if (shape[delta+j] == 1)
-                shape[delta+j] = n;
+    if (is_reduction) {
+        /* A reduction cannot result in a string,
+           so we don't need to worry about item sizes here. */
+        char retsig = get_return_sig(self->program);
+        reduction_axis = get_reduction_axis(self->program);
+
+        /* Need to set up op_axes for the non-reduction part */
+        if (reduction_axis != 255) {
+            /* Get the number of broadcast dimensions */
+            for (i = 0; i < n_inputs; ++i) {
+                intp ndim = PyArray_NDIM(operands[i+1]);
+                if (ndim > oa_ndim) {
+                    oa_ndim = ndim;
+                }
+            }
+            if (reduction_axis < 0 || reduction_axis >= oa_ndim) {
+                PyErr_Format(PyExc_ValueError,
+                        "reduction axis is out of bounds");
+                goto fail;
+            }
+            /* Fill in the op_axes */
+            op_axes_ptrs[0] = NULL;
+            op_axes_reduction_values[0] = -1;
+            for (i = 0; i < n_inputs; ++i) {
+                intp j = 0, idim, ndim = PyArray_NDIM(operands[i+1]);
+                for (idim = 0; idim < oa_ndim-ndim; ++idim) {
+                    if (idim != reduction_axis) {
+                        op_axes_values[i+1][j++] = -1;
+                    }
+                    else {
+                        op_axes_reduction_values[i+1] = -1;
+                    }
+                }
+                for (idim = oa_ndim-ndim; idim < oa_ndim; ++idim) {
+                    if (idim != reduction_axis) {
+                        op_axes_values[i+1][j++] = idim-(oa_ndim-ndim);
+                    }
+                    else {
+                        intp size = PyArray_DIM(operands[i+1],
+                                                    idim-(oa_ndim-ndim));
+                        if (size > reduction_size) {
+                            reduction_size = size;
+                        }
+                        op_axes_reduction_values[i+1] = idim-(oa_ndim-ndim);
+                    }
+                }
+                op_axes_ptrs[i+1] = op_axes_values[i+1];
+            }
+            /* op_axes has one dimension fewer than the broadcast result */
+            --oa_ndim;
+            if (oa_ndim > 0) {
+                op_axes = op_axes_ptrs;
+            }
             else {
-                PyErr_SetString(PyExc_ValueError,
-                                "cannot broadcast inputs to common shape");
-                goto cleanup_and_exit;
+                reduction_size = 1;
+            }
+        }
+        /* A full reduction can be done without nested iteration */
+        if (oa_ndim == 0) {
+            if (operands[0] == NULL) {
+                intp dim = 1;
+                operands[0] = (PyArrayObject *)PyArray_SimpleNew(0, &dim,
+                                            typecode_from_char(retsig));
+                if (!operands[0])
+                    goto fail;
+            } else if (PyArray_SIZE(operands[0]) != 1) {
+                PyErr_Format(PyExc_ValueError,
+                        "out argument must have size 1 for a full reduction");
+                goto fail;
             }
         }
-    }
-    size = PyArray_MultiplyList(shape, n_dimensions);
 
-    /* Broadcast indices of all of the arrays. We could improve efficiency
-       by keeping track of what needs to be broadcast above */
+        dtypes[0] = PyArray_DescrFromType(typecode_from_char(retsig));
 
-    for (i = 0; i < n_inputs; i++) {
-        PyObject *a = PyTuple_GET_ITEM(a_inputs, i);
-        PyObject *b;
-        intp strides[MAX_DIMS];
-        int delta = n_dimensions - PyArray_NDIM(a);
-        if (PyArray_NDIM(a)) {
-            for (j = 0; j < n_dimensions; j++)
-                strides[j] = (j < delta || PyArray_DIM(a, j-delta) == 1) ?
-                                0 : PyArray_STRIDE(a, j-delta);
-            Py_INCREF(PyArray_DESCR(a));
-            b = PyArray_NewFromDescr(a->ob_type,
-                                       PyArray_DESCR(a),
-                                       n_dimensions, shape,
-                                       strides, PyArray_DATA(a), 0, a);
-            if (!b) goto cleanup_and_exit;
-        } else { /* Leave scalars alone */
-            b = a;
-            Py_INCREF(b);
+        op_flags[0] = NPY_ITER_READWRITE|
+                      NPY_ITER_ALLOCATE|
+                      /* Copy, because it can't buffer the reduction */
+                      NPY_ITER_UPDATEIFCOPY|
+                      NPY_ITER_NBO|
+#ifndef USE_UNALIGNED_ACCESS
+                      NPY_ITER_ALIGNED|
+#endif
+                      (oa_ndim == 0 ? 0 : NPY_ITER_NO_BROADCAST);
+    }
+    else {
+        char retsig = get_return_sig(self->program);
+        if (retsig != 's') {
+            dtypes[0] = PyArray_DescrFromType(typecode_from_char(retsig));
+        } else {
+            /* Since the *only* supported operation returning a string
+             * is a copy, the size of returned strings
+             * can be directly gotten from the first (and only)
+             * input/constant/temporary. */
+            if (n_inputs > 0) {  /* input, like in 'a' where a -> 'foo' */
+                dtypes[0] = PyArray_DESCR(operands[1]);
+                Py_INCREF(dtypes[0]);
+            } else {  /* constant, like in '"foo"' */
+                dtypes[0] = PyArray_DescrNewFromType(PyArray_STRING);
+                dtypes[0]->elsize = self->memsizes[1];
+            }  /* no string temporaries, so no third case  */
+        }
+        if (dtypes[0] == NULL) {
+            goto fail;
         }
-        /* Store b so that it stays alive till we're done */
-        PyTuple_SET_ITEM(a_inputs, i+n_inputs, b);
+        op_flags[0] = NPY_ITER_WRITEONLY|
+                      NPY_ITER_ALLOCATE|
+                      NPY_ITER_CONTIG|
+                      NPY_ITER_NBO|
+#ifndef USE_UNALIGNED_ACCESS
+                      NPY_ITER_ALIGNED|
+#endif
+                      NPY_ITER_NO_BROADCAST;
     }
 
+    /* Check for empty arrays in expression */
+    if (n_inputs > 0) {
+        char retsig = get_return_sig(self->program);
 
-    for (i = 0; i < n_inputs; i++) {
-        PyObject *a = PyTuple_GET_ITEM(a_inputs, i+n_inputs);
-        PyObject *b;
-        char c = PyString_AS_STRING(self->signature)[i];
-        int typecode = typecode_from_char(c);
-        if (PyArray_NDIM(a) == 0) {
-            /* Broadcast scalars */
-            intp dims[1] = {BLOCK_SIZE1};
-            Py_INCREF(PyArray_DESCR(a));
-            b = PyArray_SimpleNewFromDescr(1, dims, PyArray_DESCR(a));
-            if (!b) goto cleanup_and_exit;
-            self->memsteps[i+1] = 0;
-            self->memsizes[i+1] = PyArray_ITEMSIZE(a);
-            PyTuple_SET_ITEM(a_inputs, i+2*n_inputs, b);  /* steals reference */
-            inputs[i] = PyArray_DATA(b);
-            if (typecode == PyArray_BOOL) {
-                char value = ((char*)PyArray_DATA(a))[0];
-                for (j = 0; j < BLOCK_SIZE1; j++)
-                    ((char*)PyArray_DATA(b))[j] = value;
-            } else if (typecode == PyArray_INT) {
-                int value = ((int*)PyArray_DATA(a))[0];
-                for (j = 0; j < BLOCK_SIZE1; j++)
-                    ((int*)PyArray_DATA(b))[j] = value;
-            } else if (typecode == PyArray_LONGLONG) {
-                long long value = ((long long*)PyArray_DATA(a))[0];
-                for (j = 0; j < BLOCK_SIZE1; j++)
-                    ((long long*)PyArray_DATA(b))[j] = value;
-            } else if (typecode == PyArray_FLOAT) {
-                float value = ((float*)PyArray_DATA(a))[0];
-                for (j = 0; j < BLOCK_SIZE1; j++)
-                    ((float*)PyArray_DATA(b))[j] = value;
-            } else if (typecode == PyArray_DOUBLE) {
-                double value = ((double*)PyArray_DATA(a))[0];
-                for (j = 0; j < BLOCK_SIZE1; j++)
-                    ((double*)PyArray_DATA(b))[j] = value;
-            } else if (typecode == PyArray_CDOUBLE) {
-                double rvalue = ((double*)PyArray_DATA(a))[0];
-                double ivalue = ((double*)PyArray_DATA(a))[1];
-                for (j = 0; j < 2*BLOCK_SIZE1; j+=2) {
-                    ((double*)PyArray_DATA(b))[j] = rvalue;
-                    ((double*)PyArray_DATA(b))[j+1] = ivalue;
-                }
-            } else if (typecode == PyArray_STRING) {
-                int itemsize = PyArray_ITEMSIZE(a);
-                char *value = (char*)(PyArray_DATA(a));
-                for (j = 0; j < itemsize*BLOCK_SIZE1; j+=itemsize)
-                    memcpy((char*)(PyArray_DATA(b)) + j, value, itemsize);
-            } else {
-                PyErr_SetString(PyExc_RuntimeError, "illegal typecode value");
-                goto cleanup_and_exit;
-            }
-        } else {
-            /* Check that discontiguous strides appear only on the last
-               dimension. If not, the arrays should be copied.
-               Furthermore, such arrays can appear when doing
-               broadcasting above, so this check really needs to be
-               here, and not in Python space. */
-            intp inner_size;
-            for (j = PyArray_NDIM(a)-2; j >= 0; j--) {
-                inner_size = PyArray_STRIDE(a, j+1) * PyArray_DIM(a, j+1);
-                if (PyArray_STRIDE(a, j) != inner_size) {
-                    intp dims[1] = {BLOCK_SIZE1};
-                    inddata[i+1].count = PyArray_NDIM(a);
-                    inddata[i+1].findex = -1;
-                    inddata[i+1].size = PyArray_ITEMSIZE(a);
-                    inddata[i+1].shape = PyArray_DIMS(a);
-                    inddata[i+1].strides = PyArray_STRIDES(a);
-                    inddata[i+1].buffer = PyArray_BYTES(a);
-                    inddata[i+1].index = PyMem_New(int, inddata[i+1].count);
-                    for (j = 0; j < inddata[i+1].count; j++)
-                        inddata[i+1].index[j] = 0;
-                    Py_INCREF(PyArray_DESCR(a));
-                    a = PyArray_SimpleNewFromDescr(1, dims, PyArray_DESCR(a));
-                    /* steals reference below */
-                    PyTuple_SET_ITEM(a_inputs, i+2*n_inputs, a);
-                    /* Broadcasting code only seems to work well for
-                       serial code (don't know exactly why) */
-                    force_serial = 1;
-                    break;
-                }
+        /* Check length for all inputs */
+        int zeroi, zerolen = 0;
+        for (i=0; i < n_inputs; i++) {
+            if (PyArray_SIZE(operands[i+1]) == 0) {
+                zerolen = 1;
+                zeroi = i+1;
+                break;
             }
+        }
 
-            self->memsteps[i+1] = PyArray_STRIDE(a, PyArray_NDIM(a)-1);
-            self->memsizes[i+1] = PyArray_ITEMSIZE(a);
-            inputs[i] = PyArray_DATA(a);
+        if (zerolen != 0) {
+            /* Allocate the output */
+            intp ndim = PyArray_NDIM(operands[zeroi]);
+            intp *dims = PyArray_DIMS(operands[zeroi]);
+            operands[0] = (PyArrayObject *)PyArray_SimpleNew(ndim, dims,
+                                              typecode_from_char(retsig));
+            if (operands[0] == NULL) {
+                goto fail;
+            }
 
+            ret = (PyObject *)operands[0];
+            Py_INCREF(ret);
+            goto cleanup_and_exit;
         }
     }
 
-    if (last_opcode(self->program) > OP_REDUCTION) {
-        /* A reduction can not result in a string,
-           so we don't need to worry about item sizes here. */
+
+    /* A case with a single constant output */
+    if (n_inputs == 0) {
         char retsig = get_return_sig(self->program);
-        int axis = get_reduction_axis(self->program);
-        /* Reduction ops only works with 1 thread */
-        force_serial = 1;
-        self->memsteps[0] = 0; /*size_from_char(retsig);*/
-        if (axis == 255) {
-            intp dims[1];
-            for (i = 0; i < n_dimensions; i++)
-                strides[i] = 0;
-            output = PyArray_SimpleNew(0, dims, typecode_from_char(retsig));
-            if (!output) goto cleanup_and_exit;
-        } else {
-            intp dims[MAX_DIMS];
-            if (axis < 0)
-                axis = n_dimensions + axis;
-            if (axis < 0 || axis >= n_dimensions) {
-                PyErr_SetString(PyExc_ValueError, "axis out of range");
-                goto cleanup_and_exit;
+
+        /* Allocate the output */
+        if (operands[0] == NULL) {
+            intp dim = 1;
+            operands[0] = (PyArrayObject *)PyArray_SimpleNew(0, &dim,
+                                        typecode_from_char(retsig));
+            if (operands[0] == NULL) {
+                goto fail;
             }
-            for (i = j = 0; i < n_dimensions; i++) {
-                if (i != axis) {
-                    dims[j] = shape[i];
-                    j += 1;
-                }
+        }
+        else {
+            PyArrayObject *a;
+            if (PyArray_SIZE(operands[0]) != 1) {
+                PyErr_SetString(PyExc_ValueError,
+                        "output for a constant expression must have size 1");
+                goto fail;
             }
-            output = PyArray_SimpleNew(n_dimensions-1, dims,
-                                       typecode_from_char(retsig));
-            if (!output) goto cleanup_and_exit;
-            for (i = j = 0; i < n_dimensions; i++) {
-                if (i != axis) {
-                    strides[i] = PyArray_STRIDES(output)[j];
-                    j += 1;
-                } else {
-                    strides[i] = 0;
-                }
+            else if (!PyArray_ISWRITEABLE(operands[0])) {
+                PyErr_SetString(PyExc_ValueError,
+                        "output is not writeable");
+                goto fail;
+            }
+            Py_INCREF(dtypes[0]);
+            a = (PyArrayObject *)PyArray_FromArray(operands[0], dtypes[0],
+                                        NPY_ALIGNED|NPY_UPDATEIFCOPY);
+            if (a == NULL) {
+                goto fail;
             }
+            Py_DECREF(operands[0]);
+            operands[0] = a;
+        }
+
+        r = run_interpreter_const(self, PyArray_DATA(operands[0]), &pc_error);
+
+        ret = (PyObject *)operands[0];
+        Py_INCREF(ret);
+        goto cleanup_and_exit;
+    }
 
 
+    /* Allocate the iterator or nested iterators */
+    if (reduction_size == 1) {
+        /* When there's no reduction, reduction_size is 1 as well */
+        iter = NpyIter_AdvancedNew(n_inputs+1, operands,
+                            NPY_ITER_BUFFERED|
+                            NPY_ITER_REDUCE_OK|
+                            NPY_ITER_RANGED|
+                            NPY_ITER_DELAY_BUFALLOC|
+                            NPY_ITER_EXTERNAL_LOOP,
+                            order, casting,
+                            op_flags, dtypes,
+                            0, NULL, NULL,
+                            BLOCK_SIZE1);
+        if (iter == NULL) {
+            goto fail;
+        }
+    } else {
+        npy_uint32 op_flags_outer[NPY_MAXDIMS];
+        /* The outer loop is unbuffered */
+        op_flags_outer[0] = NPY_ITER_READWRITE|
+                            NPY_ITER_ALLOCATE|
+                            NPY_ITER_NO_BROADCAST;
+        for (i = 0; i < n_inputs; ++i) {
+            op_flags_outer[i+1] = NPY_ITER_READONLY;
         }
-        /* TODO optimize strides -- in this and other inddata cases, strides and
-           shape can be tweaked to minimize the amount of looping */
-        inddata[0].count = n_dimensions;
-        inddata[0].findex = -1;
-        inddata[0].size = PyArray_ITEMSIZE(output);
-        inddata[0].shape = shape;
-        inddata[0].strides = strides;
-        inddata[0].buffer = PyArray_BYTES(output);
-        inddata[0].index = PyMem_New(int, n_dimensions);
-        for (j = 0; j < inddata[0].count; j++)
-            inddata[0].index[j] = 0;
+        /* Arbitrary threshold for choosing which loop goes innermost; benchmark? */
+        if (reduction_size < 64) {
+            reduction_outer_loop = 1;
+            iter = NpyIter_AdvancedNew(n_inputs+1, operands,
+                                NPY_ITER_BUFFERED|
+                                NPY_ITER_RANGED|
+                                NPY_ITER_DELAY_BUFALLOC|
+                                NPY_ITER_EXTERNAL_LOOP,
+                                order, casting,
+                                op_flags, dtypes,
+                                oa_ndim, op_axes, NULL,
+                                BLOCK_SIZE1);
+            if (iter == NULL) {
+                goto fail;
+            }
+
+            /* If the output was allocated, get it for the second iterator */
+            if (operands[0] == NULL) {
+                operands[0] = NpyIter_GetOperandArray(iter)[0];
+                Py_INCREF(operands[0]);
+            }
+
+            op_axes[0] = &op_axes_reduction_values[0];
+            for (i = 0; i < n_inputs; ++i) {
+                op_axes[i+1] = &op_axes_reduction_values[i+1];
+            }
+            op_flags_outer[0] &= ~NPY_ITER_NO_BROADCAST;
+            reduce_iter = NpyIter_AdvancedNew(n_inputs+1, operands,
+                                NPY_ITER_REDUCE_OK,
+                                order, casting,
+                                op_flags_outer, NULL,
+                                1, op_axes, NULL,
+                                0);
+            if (reduce_iter == NULL) {
+                goto fail;
+            }
+        }
+        else {
+            PyArray_Descr *dtypes_outer[NPY_MAXDIMS];
+
+            /* If the output is being allocated, need to specify its dtype */
+            dtypes_outer[0] = dtypes[0];
+            for (i = 0; i < n_inputs; ++i) {
+                dtypes_outer[i+1] = NULL;
+            }
+            iter = NpyIter_AdvancedNew(n_inputs+1, operands,
+                                NPY_ITER_RANGED,
+                                order, casting,
+                                op_flags_outer, dtypes_outer,
+                                oa_ndim, op_axes, NULL,
+                                0);
+            if (iter == NULL) {
+                goto fail;
+            }
 
+            /* If the output was allocated, get it for the second iterator */
+            if (operands[0] == NULL) {
+                operands[0] = NpyIter_GetOperandArray(iter)[0];
+                Py_INCREF(operands[0]);
+            }
+
+            op_axes[0] = &op_axes_reduction_values[0];
+            for (i = 0; i < n_inputs; ++i) {
+                op_axes[i+1] = &op_axes_reduction_values[i+1];
+            }
+            op_flags[0] &= ~NPY_ITER_NO_BROADCAST;
+            reduce_iter = NpyIter_AdvancedNew(n_inputs+1, operands,
+                                NPY_ITER_BUFFERED|
+                                NPY_ITER_REDUCE_OK|
+                                NPY_ITER_DELAY_BUFALLOC|
+                                NPY_ITER_EXTERNAL_LOOP,
+                                order, casting,
+                                op_flags, dtypes,
+                                1, op_axes, NULL,
+                                BLOCK_SIZE1);
+            if (reduce_iter == NULL) {
+                goto fail;
+            }
+        }
+    }
+
+    /* Initialize the output to the reduction unit */
+    if (is_reduction) {
+        PyArrayObject *a = NpyIter_GetOperandArray(iter)[0];
         if (last_opcode(self->program) >= OP_SUM &&
             last_opcode(self->program) < OP_PROD) {
                 PyObject *zero = PyInt_FromLong(0);
-                PyArray_FillWithScalar((PyArrayObject *)output, zero);
+                PyArray_FillWithScalar(a, zero);
                 Py_DECREF(zero);
         } else {
                 PyObject *one = PyInt_FromLong(1);
-                PyArray_FillWithScalar((PyArrayObject *)output, one);
+                PyArray_FillWithScalar(a, one);
                 Py_DECREF(one);
         }
     }
-    else {
-        char retsig = get_return_sig(self->program);
-        if (retsig != 's') {
-            self->memsteps[0] = self->memsizes[0] = size_from_char(retsig);
-            output = PyArray_SimpleNew(
-                n_dimensions, shape, typecode_from_char(retsig));
-        } else {
-            /* Since the *only* supported operation returning a string
-             * is a copy, the size of returned strings
-             * can be directly gotten from the first (and only)
-             * input/constant/temporary. */
-            PyArray_Descr *descr;
-            if (n_inputs > 0) {  /* input, like in 'a' where a -> 'foo' */
-                descr = PyArray_DESCR(PyTuple_GET_ITEM(a_inputs, 1));
-            Py_INCREF(descr);
-            } else {  /* constant, like in '"foo"' */
-                descr = PyArray_DescrFromType(PyArray_STRING);
-                descr->elsize = self->memsizes[1];
-            }  /* no string temporaries, so no third case  */
-            self->memsteps[0] = self->memsizes[0] = self->memsizes[1];
-            output = PyArray_SimpleNewFromDescr(n_dimensions, shape, descr);
-        }
-        if (!output) goto cleanup_and_exit;
+
+    /* Get the sizes of all the operands */
+    dtypes_tmp = NpyIter_GetDescrArray(iter);
+    for (i = 0; i < n_inputs+1; ++i) {
+        self->memsizes[i] = dtypes_tmp[i]->elsize;
     }
 
+    /* For small calculations, just use 1 thread */
+    if (NpyIter_GetIterSize(iter) < 2*BLOCK_SIZE1) {
+        force_serial = 1;
+    }
 
-    r = run_interpreter(self, size, PyArray_DATA(output), inputs, inddata,
-                        &pc_error);
+    /* Reductions do not support parallel execution yet */
+    if (is_reduction) {
+        force_serial = 1;
+    }
+
+    r = run_interpreter(self, iter, reduce_iter,
+                             reduction_outer_loop, &pc_error);
 
     if (r < 0) {
-        Py_XDECREF(output);
-        output = NULL;
         if (r == -1) {
-            PyErr_SetString(PyExc_RuntimeError,
-                            "an error occurred while running the program");
+            if (!PyErr_Occurred()) {
+                PyErr_SetString(PyExc_RuntimeError,
+                                "an error occurred while running the program");
+            }
         } else if (r == -2) {
             PyErr_Format(PyExc_RuntimeError,
                          "bad argument at pc=%d", pc_error);
@@ -1736,19 +2063,37 @@ NumExpr_run(NumExprObject *self, PyObject *args, PyObject *kwds)
             PyErr_SetString(PyExc_RuntimeError,
                             "unknown error occurred while running the program");
         }
+        goto fail;
+    }
+
+    /* Get the output from the iterator */
+    ret = (PyObject *)NpyIter_GetOperandArray(iter)[0];
+    Py_INCREF(ret);
+
+    NpyIter_Deallocate(iter);
+    if (reduce_iter != NULL) {
+        NpyIter_Deallocate(reduce_iter);
     }
 cleanup_and_exit:
-    Py_XDECREF(a_inputs);
-    PyMem_Del(inputs);
-    if (inddata) {
-        for (i = 0; i < n_inputs+1; i++) {
-            if (inddata[i].count) {
-                PyMem_Del(inddata[i].index);
-            }
-        }
+    for (i = 0; i < n_inputs+1; i++) {
+        Py_XDECREF(operands[i]);
+        Py_XDECREF(dtypes[i]);
     }
-    PyMem_Del(inddata);
-    return output;
+
+    return ret;
+fail:
+    for (i = 0; i < n_inputs+1; i++) {
+        Py_XDECREF(operands[i]);
+        Py_XDECREF(dtypes[i]);
+    }
+    if (iter != NULL) {
+        NpyIter_Deallocate(iter);
+    }
+    if (reduce_iter != NULL) {
+        NpyIter_Deallocate(reduce_iter);
+    }
+
+    return NULL;
 }
 
 static PyMethodDef NumExpr_methods[] = {
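
The NpyIter setup used throughout NumExpr_run has a Python-level cousin in
numpy.nditer, which gives a rough feel for the buffered, external-loop
iteration above (an analogy only: flags such as NPY_ITER_RANGED and
NPY_ITER_DELAY_BUFALLOC have no direct nditer spelling):

    import numpy as np

    a = np.arange(6.0).reshape(2, 3)
    b = np.ones((2, 3))
    out = np.empty((2, 3))

    # Buffered iteration that hands out whole 1-d inner loops at a time,
    # roughly what vm_engine_iter_task sees as contiguous blocks.
    it = np.nditer([a, b, out],
                   flags=['buffered', 'external_loop'],
                   op_flags=[['readonly'], ['readonly'], ['writeonly']],
                   order='K', casting='safe', buffersize=4096)
    for x, y, z in it:
        z[...] = 2.0 * x + 3.0 * y   # where the C VM runs its program
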
diff --git a/numexpr/missing_posix_functions.inc b/numexpr/missing_posix_functions.inc
index eee6dae..a3ef26a 100644
--- a/numexpr/missing_posix_functions.inc
+++ b/numexpr/missing_posix_functions.inc
@@ -1,3 +1,12 @@
+/*********************************************************************
+  Numexpr - Fast numerical array expression evaluator for NumPy.
+
+      License: MIT
+      Author:  See AUTHORS.txt
+
+  See LICENSE.txt for details about copyright and rights to use.
+**********************************************************************/
+
 /* These functions are not included in some non-POSIX compilers,
   like MSVC 7.1  */
 
diff --git a/numexpr/msvc_function_stubs.inc b/numexpr/msvc_function_stubs.inc
index e9b2c8c..dc40367 100644
--- a/numexpr/msvc_function_stubs.inc
+++ b/numexpr/msvc_function_stubs.inc
@@ -1,121 +1,130 @@
-/* Declare stub functions for MSVC.  It turns out that single precision
-   definitions in <math.h> are actually #define'd and are not usable
-   as function pointers :-/ */
-
-#if _MSC_VER < 1400  // 1310 == MSVC 7.1
-/* Apparently, single precision functions are not included in MSVC 7.1 */
-
-#define sqrtf(x)    ((float)sqrt((double)(x)))
-#define sinf(x)    ((float)sin((double)(x)))
-#define cosf(x)    ((float)cos((double)(x)))
-#define tanf(x)    ((float)tan((double)(x)))
-#define asinf(x)    ((float)asin((double)(x)))
-#define acosf(x)    ((float)acos((double)(x)))
-#define atanf(x)    ((float)atan((double)(x)))
-#define sinhf(x)    ((float)sinh((double)(x)))
-#define coshf(x)    ((float)cosh((double)(x)))
-#define tanhf(x)    ((float)tanh((double)(x)))
-#define asinhf(x)    ((float)asinh((double)(x)))
-#define acoshf(x)    ((float)acosh((double)(x)))
-#define atanhf(x)    ((float)atanh((double)(x)))
-#define logf(x)    ((float)log((double)(x)))
-#define log1pf(x)    ((float)log1p((double)(x)))
-#define log10f(x)    ((float)log10((double)(x)))
-#define expf(x)    ((float)exp((double)(x)))
-#define expm1f(x)    ((float)expm1((double)(x)))
-#define fabsf(x)    ((float)fabs((double)(x)))
-#define fmodf(x, y)    ((float)fmod((double)(x), (double)(y)))
-#define atan2f(x, y)    ((float)atan2((double)(x), (double)(y)))
-
-/* The next are directly called from interp_body.c */
-#define powf(x, y)    ((float)pow((double)(x), (double)(y)))
-#define floorf(x)    ((float)floor((double)(x)))
-
-#endif  // _MSC_VER < 1400
-
-
-/* Now the actual stubs */
-
-inline float sqrtf2(float x) {
-    return sqrtf(x);
-}
-
-inline float sinf2(float x) {
-    return sinf(x);
-}
-
-inline float cosf2(float x) {
-    return cosf(x);
-}
-
-inline float tanf2(float x) {
-    return tanf(x);
-}
-
-inline float asinf2(float x) {
-    return asinf(x);
-}
-
-inline float acosf2(float x) {
-    return acosf(x);
-}
-
-inline float atanf2(float x) {
-    return atanf(x);
-}
-
-inline float sinhf2(float x) {
-    return sinhf(x);
-}
-
-inline float coshf2(float x) {
-    return coshf(x);
-}
-
-inline float tanhf2(float x) {
-    return tanhf(x);
-}
-
-inline float asinhf2(float x) {
-    return asinhf(x);
-}
-
-inline float acoshf2(float x) {
-    return acoshf(x);
-}
-
-inline float atanhf2(float x) {
-    return atanhf(x);
-}
-
-inline float logf2(float x) {
-    return logf(x);
-}
-
-inline float log1pf2(float x) {
-    return log1pf(x);
-}
-
-inline float log10f2(float x) {
-    return log10f(x);
-}
-
-inline float expf2(float x) {
-    return expf(x);
-}
-
-inline float expm1f2(float x) {
-    return expm1f(x);
-}
-
-inline float fabsf2(float x) {
-    return fabsf(x);
-}
-
-inline float fmodf2(float x, float y) {
-    return fmodf(x, y);
-}
-
-inline float atan2f2(float x, float y) {
-    return atan2f(x, y);
-}
+/*********************************************************************
+  Numexpr - Fast numerical array expression evaluator for NumPy.
+
+      License: MIT
+      Author:  See AUTHORS.txt
+
+  See LICENSE.txt for details about copyright and rights to use.
+**********************************************************************/
+
+/* Declare stub functions for MSVC.  It turns out that single precision
+   definitions in <math.h> are actually #define'd and are not usable
+   as function pointers :-/ */
+
+#if _MSC_VER < 1400  // 1310 == MSVC 7.1
+/* Apparently, single precision functions are not included in MSVC 7.1 */
+
+#define sqrtf(x)    ((float)sqrt((double)(x)))
+#define sinf(x)    ((float)sin((double)(x)))
+#define cosf(x)    ((float)cos((double)(x)))
+#define tanf(x)    ((float)tan((double)(x)))
+#define asinf(x)    ((float)asin((double)(x)))
+#define acosf(x)    ((float)acos((double)(x)))
+#define atanf(x)    ((float)atan((double)(x)))
+#define sinhf(x)    ((float)sinh((double)(x)))
+#define coshf(x)    ((float)cosh((double)(x)))
+#define tanhf(x)    ((float)tanh((double)(x)))
+#define asinhf(x)    ((float)asinh((double)(x)))
+#define acoshf(x)    ((float)acosh((double)(x)))
+#define atanhf(x)    ((float)atanh((double)(x)))
+#define logf(x)    ((float)log((double)(x)))
+#define log1pf(x)    ((float)log1p((double)(x)))
+#define log10f(x)    ((float)log10((double)(x)))
+#define expf(x)    ((float)exp((double)(x)))
+#define expm1f(x)    ((float)expm1((double)(x)))
+#define fabsf(x)    ((float)fabs((double)(x)))
+#define fmodf(x, y)    ((float)fmod((double)(x), (double)(y)))
+#define atan2f(x, y)    ((float)atan2((double)(x), (double)(y)))
+
+/* The next are directly called from interp_body.c */
+#define powf(x, y)    ((float)pow((double)(x), (double)(y)))
+#define floorf(x)    ((float)floor((double)(x)))
+
+#endif  // _MSC_VER < 1400
+
+
+/* Now the actual stubs */
+
+inline float sqrtf2(float x) {
+    return sqrtf(x);
+}
+
+inline float sinf2(float x) {
+    return sinf(x);
+}
+
+inline float cosf2(float x) {
+    return cosf(x);
+}
+
+inline float tanf2(float x) {
+    return tanf(x);
+}
+
+inline float asinf2(float x) {
+    return asinf(x);
+}
+
+inline float acosf2(float x) {
+    return acosf(x);
+}
+
+inline float atanf2(float x) {
+    return atanf(x);
+}
+
+inline float sinhf2(float x) {
+    return sinhf(x);
+}
+
+inline float coshf2(float x) {
+    return coshf(x);
+}
+
+inline float tanhf2(float x) {
+    return tanhf(x);
+}
+
+inline float asinhf2(float x) {
+    return asinhf(x);
+}
+
+inline float acoshf2(float x) {
+    return acoshf(x);
+}
+
+inline float atanhf2(float x) {
+    return atanhf(x);
+}
+
+inline float logf2(float x) {
+    return logf(x);
+}
+
+inline float log1pf2(float x) {
+    return log1pf(x);
+}
+
+inline float log10f2(float x) {
+    return log10f(x);
+}
+
+inline float expf2(float x) {
+    return expf(x);
+}
+
+inline float expm1f2(float x) {
+    return expm1f(x);
+}
+
+inline float fabsf2(float x) {
+    return fabsf(x);
+}
+
+inline float fmodf2(float x, float y) {
+    return fmodf(x, y);
+}
+
+inline float atan2f2(float x, float y) {
+    return atan2f(x, y);
+}
diff --git a/numexpr/necompiler.py b/numexpr/necompiler.py
index b077fcc..899bcf6 100644
--- a/numexpr/necompiler.py
+++ b/numexpr/necompiler.py
@@ -1,3 +1,14 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
+import __future__
 import sys
 import numpy
 
@@ -17,6 +28,7 @@ type_to_kind = expressions.type_to_kind
 kind_to_type = expressions.kind_to_type
 default_type = kind_to_type[expressions.default_kind]
 
+
 class ASTNode(object):
     """Abstract Syntax Tree node.
 
@@ -54,9 +66,7 @@ class ASTNode(object):
     def __hash__(self):
         if self.astType == 'alias':
             self = self.value
-        # Fast hash (see issue #43)
-        return ( hash(self.astType) ^ hash(self.astKind) ^
-                 hash(self.value) ^ hash(self.children) )
+        return hash((self.astType, self.astKind, self.value, self.children))
 
     def __str__(self):
         return 'AST(%s, %s, %s, %s, %s)' % (self.astType, self.astKind,
@@ -88,9 +98,8 @@ def expressionToAST(ex):
     This is necessary as ExpressionNode overrides many methods to act
     like a number.
     """
-    this_ast = ASTNode(ex.astType, ex.astKind, ex.value,
-                       [expressionToAST(c) for c in ex.children])
-    return this_ast
+    return ASTNode(ex.astType, ex.astKind, ex.value,
+                   [expressionToAST(c) for c in ex.children])
 
 
 def sigPerms(s):
@@ -150,10 +159,8 @@ def typeCompileAst(ast):
     else:
         value = ast.value
         children = ast.children
-    new_ast = ASTNode(ast.astType, ast.astKind, value,
-                           [typeCompileAst(c) for c in children])
-    return new_ast
-
+    return ASTNode(ast.astType, ast.astKind, value,
+                   [typeCompileAst(c) for c in children])
 
 class Register(object):
     """Abstraction for a register in the VM.
@@ -202,7 +209,11 @@ def stringToExpression(s, types, context):
     try:
         expressions._context.set_new_context(context)
         # first compile to a code object to determine the names
-        c = compile(s, '<expr>', 'eval')
+        if context.get('truediv', False):
+            flags = __future__.division.compiler_flag
+        else:
+            flags = 0
+        c = compile(s, '<expr>', 'eval', flags)
         # make VariableNode's for the names
         names = {}
         for name in c.co_names:
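
Passing __future__.division.compiler_flag to compile() is standard CPython
behaviour: it makes '/' parse as true division inside the expression
regardless of what the calling module does.  For instance, under the
Python 2 series this release targets:

    import __future__

    classic = compile("3/2", "<expr>", "eval")
    true_div = compile("3/2", "<expr>", "eval",
                       __future__.division.compiler_flag)
    eval(classic)    # 1   (classic floor division)
    eval(true_div)   # 1.5 (true division)
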
@@ -241,7 +252,10 @@ def getInputOrder(ast, input_order=None):
 
     if input_order:
         if variable_names != set(input_order):
-            raise ValueError("input names don't match those found in expression")
+            raise ValueError(
+                "input names (%s) don't match those found in expression (%s)"
+                % (input_order, variable_names))
+
         ordered_names = input_order
     else:
         ordered_names = list(variable_names)
@@ -310,28 +324,34 @@ def collapseDuplicateSubtrees(ast):
     for a in aliases:
         while a.value.astType == 'alias':
             a.value = a.value.value
-        a.reg = a.value.reg
+    return aliases
 
 def optimizeTemporariesAllocation(ast):
     """Attempt to minimize the number of temporaries needed, by
     reusing old ones.
     """
-    nodes = list(x for x in ast.postorderWalk() if x.reg.temporary)
+    nodes = [n for n in ast.postorderWalk() if n.reg.temporary]
     users_of = dict((n.reg, set()) for n in nodes)
+
+    node_regs = dict((n, set(c.reg for c in n.children if c.reg.temporary))
+                     for n in nodes)
     if nodes and nodes[-1] is not ast:
-        for c in ast.children:
-            if c.reg.temporary:
-                users_of[c.reg].add(ast)
-    for n in reversed(nodes):
+        nodes_to_check = nodes + [ast]
+    else:
+        nodes_to_check = nodes
+    for n in nodes_to_check:
         for c in n.children:
             if c.reg.temporary:
                 users_of[c.reg].add(n)
-    unused = {'bool' : set(), 'int' : set(), 'long': set(), 'float' : set(),
-              'double': set(), 'complex' : set(), 'str': set()}
+
+    unused = {'bool': set(), 'int': set(), 'long': set(), 'float': set(),
+              'double': set(), 'complex': set(), 'str': set()}
     for n in nodes:
-        for reg, users in users_of.iteritems():
-            if n in users:
-                users.remove(n)
+        for c in n.children:
+            reg = c.reg
+            if reg.temporary:
+                users = users_of[reg]
+                users.discard(n)
                 if not users:
                     unused[reg.node.astKind].add(reg)
         if unused[n.astKind]:
@@ -361,7 +381,7 @@ def setRegisterNumbersForTemporaries(ast, start):
             node.reg.n = node.value
             continue
         reg = node.reg
-        if reg.n < 0:
+        if reg.n is None:
             reg.n = start + seen
             seen += 1
             signature += reg.node.typecode()
@@ -378,13 +398,8 @@ def convertASTtoThreeAddrForm(ast):
     I suppose this should be called three register form, but three
     address form is found in compiler theory.
     """
-    program = []
-    for node in ast.allOf('op'):
-        children = node.children
-        instr = (node.value, node.reg) \
-                + tuple([c.reg for c in children])
-        program.append(instr)
-    return program
+    return [(node.value, node.reg) + tuple([c.reg for c in node.children])
+            for node in ast.allOf('op')]
 
 def compileThreeAddrForm(program):
     """Given a three address form of the program, compile it a string that
@@ -394,7 +409,7 @@ def compileThreeAddrForm(program):
         if reg is None:
             return '\xff'
         elif reg.n < 0:
-            raise ValueError("negative value for register number %s" % (reg.n,))
+            raise ValueError("negative value for register number %s" % reg.n)
         else:
             return chr(reg.n)
 
@@ -405,10 +420,10 @@ def compileThreeAddrForm(program):
         ca2 = nToChr(a2)
         return cop + cs + ca1 + ca2
 
-    def toString(*args):
+    def toString(args):
         while len(args) < 4:
             args += (None,)
-        opcode, store, a1, a2 = args[0:4]
+        opcode, store, a1, a2 = args[:4]
         s = quadrupleToString(opcode, store, a1, a2)
         l = [s]
         args = args[4:]
@@ -418,31 +433,38 @@ def compileThreeAddrForm(program):
             args = args[3:]
         return ''.join(l)
 
-    prog_str = ''.join([toString(*t) for t in program])
+    prog_str = ''.join([toString(t) for t in program])
     return prog_str
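
The three-address form consumed here is just a list of (opcode,
destination, operands...) tuples over Register objects, and
compileThreeAddrForm packs each register number into a single byte via
chr().  Schematically, an expression like "2*a + b" lowers to something
of this shape (opcode names and register labels are illustrative):

    # (opcode, dest, arg1, arg2) -- Register objects in the real program,
    # shown here as plain labels
    program = [
        ('mul_ddd', 't0', 'a', 'c2'),   # t0 = a * 2
        ('add_ddd', 'r0', 't0', 'b'),   # r0 = t0 + b
    ]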
 
 context_info = [
     ('optimization', ('none', 'moderate', 'aggressive'), 'aggressive'),
+    ('truediv', (False, True, 'auto'), 'auto')
                ]
-def getContext(map):
+
+def getContext(kwargs, frame_depth=1):
+    d = kwargs.copy()
     context = {}
     for name, allowed, default in context_info:
-        value = map.pop(name, default)
+        value = d.pop(name, default)
         if value in allowed:
             context[name] = value
         else:
             raise ValueError("'%s' must be one of %s" % (name, allowed))
-    if map:
-        raise ValueError("Unknown keyword argument '%s'" % map.popitem()[0])
-    return context
 
+    if d:
+        raise ValueError("Unknown keyword argument '%s'" % d.popitem()[0])
+    if context['truediv'] == 'auto':
+        caller_globals = sys._getframe(frame_depth + 1).f_globals
+        context['truediv'] = \
+            caller_globals.get('division', None) == __future__.division
 
-def precompile(ex, signature=(), copy_args=(), **kwargs):
+    return context
+
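
With truediv='auto', the caller's module globals are inspected for the
__future__.division feature object, which is what "from __future__ import
division" binds to the name 'division'.  The test in isolation looks like
this (a standalone sketch of the same check):

    import sys
    import __future__

    def caller_uses_true_division():
        # The direct caller's module globals; getContext walks further up
        # via frame_depth because it is itself called through wrappers.
        caller_globals = sys._getframe(1).f_globals
        return caller_globals.get('division', None) == __future__.division
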
+def precompile(ex, signature=(), context={}):
     """Compile the expression to an intermediate form.
     """
     types = dict(signature)
     input_order = [name for (name, type_) in signature]
-    context = getContext(kwargs)
 
     if isinstance(ex, str):
         ex = stringToExpression(ex, types, context)
@@ -452,29 +474,20 @@ def precompile(ex, signature=(), copy_args=(), **kwargs):
 
     ast = expressionToAST(ex)
 
-    # Add a copy for strided or unaligned unidimensional arrays
-    for a in ast.postorderWalk():
-        if a.astType == "variable" and a.value in copy_args:
-            newVar = ASTNode(*a.key())
-            a.astType, a.value, a.children = ('op', 'copy', (newVar,))
-
-    if ex.astType not in ('op'):
+    if ex.astType != 'op':
         ast = ASTNode('op', value='copy', astKind=ex.astKind, children=(ast,))
 
     ast = typeCompileAst(ast)
 
-    reg_num = [-1]
-    def registerMaker(node, temporary=False):
-        reg = Register(node, temporary=temporary)
-        reg.n = reg_num[0]
-        reg_num[0] -= 1
-        return reg
+    aliases = collapseDuplicateSubtrees(ast)
 
     assignLeafRegisters(ast.allOf('raw'), Immediate)
-    assignLeafRegisters(ast.allOf('variable', 'constant'), registerMaker)
-    assignBranchRegisters(ast.allOf('op'), registerMaker)
+    assignLeafRegisters(ast.allOf('variable', 'constant'), Register)
+    assignBranchRegisters(ast.allOf('op'), Register)
 
-    collapseDuplicateSubtrees(ast)
+    # assign registers for aliases
+    for a in aliases:
+        a.reg = a.value.reg
 
     input_order = getInputOrder(ast, input_order)
     constants_order, constants = getConstants(ast)
@@ -511,8 +524,21 @@ def NumExpr(ex, signature=(), copy_args=(), **kwargs):
 
     Returns a `NumExpr` object containing the compiled function.
     """
+    # NumExpr can be called either directly by the end-user, in which case
+    # kwargs need to be sanitized by getContext, or by evaluate, in which
+    # case kwargs are already sanitized.  In the latter case frame_depth is
+    # wrong (it should be 2), but it doesn't matter since it will not be
+    # used (truediv='auto' has already been translated to True or False).
+
+    # NOTE: `copy_args` is not necessary from 2.0 on.  It remains here
+    # basically because PyTables relied on it for certain operations.
+    # I have filed a ticket for PyTables asking for its removal:
+    # https://github.com/PyTables/PyTables/issues/117
+
+    context = getContext(kwargs, frame_depth=1)
     threeAddrProgram, inputsig, tempsig, constants, input_names = \
-                      precompile(ex, signature, copy_args, **kwargs)
+                      precompile(ex, signature, context)
     program = compileThreeAddrForm(threeAddrProgram)
     return interpreter.NumExpr(inputsig, tempsig, program, constants,
                                input_names)
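
The returned object is callable with the inputs in signature order
(alphabetical when no explicit signature is given), which pays off when
the same compiled expression is evaluated many times.  A small usage
sketch:

    import numpy as np
    import numexpr as ne

    func = ne.NumExpr("2*a + 3*b")   # compile once
    a = np.arange(5.0)
    b = np.ones(5)
    func(a, b)                       # -> array([ 3.,  5.,  7.,  9.,  11.])
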
@@ -608,20 +634,56 @@ def getExprNames(text, context):
 _names_cache = CacheDict(256)
 _numexpr_cache = CacheDict(256)
 
-def evaluate(ex, local_dict=None, global_dict=None, **kwargs):
-    """Evaluate a simple array expression element-wise.
+def evaluate(ex, local_dict=None, global_dict=None,
+             out=None, order='K', casting='safe', **kwargs):
+    """Evaluate a simple array expression element-wise, using the new iterator.
 
     ex is a string forming an expression, like "2*a+3*b". The values for "a"
     and "b" will by default be taken from the calling function's frame
     (through use of sys._getframe()). Alternatively, they can be specified
     using the 'local_dict' or 'global_dict' arguments.
+
+    Parameters
+    ----------
+
+    local_dict : dictionary, optional
+        A dictionary that replaces the local operands in the current frame.
+
+    global_dict : dictionary, optional
+        A dictionary that replaces the global operands in the current frame.
+
+    out : NumPy array, optional
+        An existing array where the outcome is going to be stored.  Care is
+        required so that this array has the same shape and type as the
+        actual outcome of the computation.  Useful for avoiding unnecessary
+        new array allocations.
+
+    order : {'C', 'F', 'A', 'K'}, optional
+        Controls the iteration order for operands. 'C' means C order, 'F'
+        means Fortran order, 'A' means 'F' order if all the arrays are
+        Fortran contiguous, 'C' order otherwise, and 'K' means as close to
+        the order the array elements appear in memory as possible.  For
+        efficient computations, typically 'K'eep order (the default) is
+        desired.
+
+    casting : {'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional
+        Controls what kind of data casting may occur when making a copy or
+        buffering.  Setting this to 'unsafe' is not recommended, as it can
+        adversely affect accumulations.
+
+          * 'no' means the data types should not be cast at all.
+          * 'equiv' means only byte-order changes are allowed.
+          * 'safe' means only casts which can preserve values are allowed.
+          * 'same_kind' means only safe casts or casts within a kind,
+            like float64 to float32, are allowed.
+          * 'unsafe' means any data conversions may be done.
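+
+    Examples
+    --------
+    A small illustrative session (values chosen arbitrarily):
+
+    >>> import numpy as np
+    >>> a = np.arange(5.0)
+    >>> b = np.ones(5)
+    >>> evaluate("2*a + 3*b")
+    array([  3.,   5.,   7.,   9.,  11.])
+    >>> r = np.empty(5)
+    >>> evaluate("a - b", out=r)
+    array([-1.,  0.,  1.,  2.,  3.])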
     """
     if not isinstance(ex, str):
         raise ValueError("must specify expression as a string")
     # Get the names for this expression
-    expr_key = (ex, tuple(sorted(kwargs.items())))
+    context = getContext(kwargs, frame_depth=1)
+    expr_key = (ex, tuple(sorted(context.items())))
     if expr_key not in _names_cache:
-        context = getContext(kwargs)
         _names_cache[expr_key] = getExprNames(ex, context)
     names, ex_uses_vml = _names_cache[expr_key]
     # Get the arguments based on the names.
@@ -630,61 +692,25 @@ def evaluate(ex, local_dict=None, global_dict=None, **kwargs):
         local_dict = call_frame.f_locals
     if global_dict is None:
         global_dict = call_frame.f_globals
-    arguments = []
-    copy_args = []
 
+    arguments = []
     for name in names:
         try:
             a = local_dict[name]
         except KeyError:
             a = global_dict[name]
-        b = numpy.asarray(a)
-        # Byteswapped arrays are dealt with in the extension
-        # All the opcodes can deal with strided arrays directly as
-        # long as they are undimensional (strides in other
-        # dimensions are dealt within the extension), so we don't
-        # need a copy for the strided case.
-
-        if not b.flags.aligned:
-            # Only take actions if CPU is different from AMD and Intel
-            # as they can deal with unaligned arrays very efficiently.
-            # If using VML, do the copy as the VML functions works
-            # much faster with aligned arrays.
-            if not is_cpu_amd_intel or use_vml:
-                # For the unaligned case, we have two cases:
-                if b.ndim == 1:
-                    # For unidimensional arrays we can use the copy
-                    # opcode because it can deal with unaligned arrays
-                    # as long as they are unidimensionals with a
-                    # possible stride (very common case for
-                    # recarrays).  This can be up to 2x faster than
-                    # doing a copy using NumPy.
-                    copy_args.append(name)
-                else:
-                    # For multimensional unaligned arrays do a plain
-                    # copy.  We could refine more this and do a plain
-                    # copy only in the case that strides doesn't exist
-                    # in dimensions other than the last one (whose
-                    # case is supported by the copy opcode).
-                    b = b.copy()
-        elif use_vml and ex_uses_vml: #only make a copy of strided arrays if
-                                      #vml is in use
-            if not b.flags.contiguous:
-                if b.ndim == 1:
-                    copy_args.append(name)
-                else:
-                    b = b.copy()
-
-        arguments.append(b)
+        arguments.append(numpy.asarray(a))
 
     # Create a signature
     signature = [(name, getType(arg)) for (name, arg) in zip(names, arguments)]
-    # Look up numexpr if possible. copy_args *must* be added to the key,
-    # just in case a non-copy expression is already in cache.
-    numexpr_key = expr_key + (tuple(signature),) + tuple(copy_args)
+
+    # Look up numexpr if possible.
+    numexpr_key = expr_key + (tuple(signature),)
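+    # (the dtype signature is part of the key, so the same expression
+    # compiled for different input types gets its own cache entry)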
     try:
         compiled_ex = _numexpr_cache[numexpr_key]
     except KeyError:
         compiled_ex = _numexpr_cache[numexpr_key] = \
-                      NumExpr(ex, signature, copy_args, **kwargs)
-    return compiled_ex(*arguments)
+                      NumExpr(ex, signature, **context)
+    kwargs = {'out': out, 'order': order, 'casting': casting,
+              'ex_uses_vml': ex_uses_vml}
+    return compiled_ex(*arguments, **kwargs)
diff --git a/numexpr/opcodes.inc b/numexpr/opcodes.inc
index 846633e..7534a57 100644
--- a/numexpr/opcodes.inc
+++ b/numexpr/opcodes.inc
@@ -1,3 +1,12 @@
+/*********************************************************************
+  Numexpr - Fast numerical array expression evaluator for NumPy.
+
+      License: MIT
+      Author:  See AUTHORS.txt
+
+  See LICENSE.txt for details about copyright and rights to use.
+**********************************************************************/
+
 /*
 OPCODE(n, enum_name, exported, return_type, arg1_type, arg2_type, arg3_type)
 
@@ -45,7 +54,7 @@ OPCODE(26, OP_NE_BSS, "ne_bss", Tb, Ts, Ts, T0)
 
 OPCODE(27, OP_CAST_IB, "cast_ib", Ti, Tb, T0, T0)
 OPCODE(28, OP_COPY_II, "copy_ii", Ti, Ti, T0, T0)
-OPCODE(29, OP_ONES_LIKE_I, "ones_like_i", Ti, T0, T0, T0)
+OPCODE(29, OP_ONES_LIKE_II, "ones_like_ii", Ti, T0, T0, T0)
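+/* Note: the suffix letters encode value kinds (b=bool, i=int, l=long,
+   f=float, d=double, c=complex, s=str); the rename brings ones_like in
+   line with the two-letter pattern of e.g. "copy_ii" above. */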
 OPCODE(30, OP_NEG_II, "neg_ii", Ti, Ti, T0, T0)
 OPCODE(31, OP_ADD_III, "add_iii", Ti, Ti, Ti, T0)
 OPCODE(32, OP_SUB_III, "sub_iii", Ti, Ti, Ti, T0)
@@ -57,7 +66,7 @@ OPCODE(37, OP_WHERE_IBII, "where_ibii", Ti, Tb, Ti, Ti)
 
 OPCODE(38, OP_CAST_LI, "cast_li", Tl, Ti, T0, T0)
 OPCODE(39, OP_COPY_LL, "copy_ll", Tl, Tl, T0, T0)
-OPCODE(40, OP_ONES_LIKE_L, "ones_like_l", Tl, T0, T0, T0)
+OPCODE(40, OP_ONES_LIKE_LL, "ones_like_ll", Tl, T0, T0, T0)
 OPCODE(41, OP_NEG_LL, "neg_ll", Tl, Tl, T0, T0)
 OPCODE(42, OP_ADD_LLL, "add_lll", Tl, Tl, Tl, T0)
 OPCODE(43, OP_SUB_LLL, "sub_lll", Tl, Tl, Tl, T0)
@@ -70,7 +79,7 @@ OPCODE(48, OP_WHERE_LBLL, "where_lbll", Tl, Tb, Tl, Tl)
 OPCODE(49, OP_CAST_FI, "cast_fi", Tf, Ti, T0, T0)
 OPCODE(50, OP_CAST_FL, "cast_fl", Tf, Tl, T0, T0)
 OPCODE(51, OP_COPY_FF, "copy_ff", Tf, Tf, T0, T0)
-OPCODE(52, OP_ONES_LIKE_F, "ones_like_f", Tf, T0, T0, T0)
+OPCODE(52, OP_ONES_LIKE_FF, "ones_like_ff", Tf, T0, T0, T0)
 OPCODE(53, OP_NEG_FF, "neg_ff", Tf, Tf, T0, T0)
 OPCODE(54, OP_ADD_FFF, "add_fff", Tf, Tf, Tf, T0)
 OPCODE(55, OP_SUB_FFF, "sub_fff", Tf, Tf, Tf, T0)
@@ -87,7 +96,7 @@ OPCODE(64, OP_CAST_DI, "cast_di", Td, Ti, T0, T0)
 OPCODE(65, OP_CAST_DL, "cast_dl", Td, Tl, T0, T0)
 OPCODE(66, OP_CAST_DF, "cast_df", Td, Tf, T0, T0)
 OPCODE(67, OP_COPY_DD, "copy_dd", Td, Td, T0, T0)
-OPCODE(68, OP_ONES_LIKE_D, "ones_like_d", Td, T0, T0, T0)
+OPCODE(68, OP_ONES_LIKE_DD, "ones_like_dd", Td, T0, T0, T0)
 OPCODE(69, OP_NEG_DD, "neg_dd", Td, Td, T0, T0)
 OPCODE(70, OP_ADD_DDD, "add_ddd", Td, Td, Td, T0)
 OPCODE(71, OP_SUB_DDD, "sub_ddd", Td, Td, Td, T0)
@@ -107,7 +116,7 @@ OPCODE(82, OP_CAST_CI, "cast_ci", Tc, Ti, T0, T0)
 OPCODE(83, OP_CAST_CL, "cast_cl", Tc, Tl, T0, T0)
 OPCODE(84, OP_CAST_CF, "cast_cf", Tc, Tf, T0, T0)
 OPCODE(85, OP_CAST_CD, "cast_cd", Tc, Td, T0, T0)
-OPCODE(86, OP_ONES_LIKE_C, "ones_like_c", Tc, T0, T0, T0)
+OPCODE(86, OP_ONES_LIKE_CC, "ones_like_cc", Tc, T0, T0, T0)
 OPCODE(87, OP_COPY_CC, "copy_cc", Tc, Tc, T0, T0)
 OPCODE(88, OP_NEG_CC, "neg_cc", Tc, Tc, T0, T0)
 OPCODE(89, OP_ADD_CCC, "add_ccc", Tc, Tc, Tc, T0)
diff --git a/numexpr/tests/__init__.py b/numexpr/tests/__init__.py
index 0f2391e..3fff411 100644
--- a/numexpr/tests/__init__.py
+++ b/numexpr/tests/__init__.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 from numexpr.tests.test_numexpr import test, print_versions
 
 if __name__ == '__main__':
diff --git a/numexpr/tests/test_numexpr.py b/numexpr/tests/test_numexpr.py
index 52a8129..4f65463 100644
--- a/numexpr/tests/test_numexpr.py
+++ b/numexpr/tests/test_numexpr.py
@@ -1,4 +1,15 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 import new, sys, os
+import warnings
 
 import numpy
 from numpy import (
@@ -9,7 +20,7 @@ from numpy import (
     sinh, cosh, tanh, arcsinh, arccosh, arctanh,
     log, log1p, log10, exp, expm1)
 from numpy.testing import *
-from numpy import shape, allclose, ravel, isnan, isinf
+from numpy import shape, allclose, array_equal, ravel, isnan, isinf
 
 import numexpr
 from numexpr import E, NumExpr, evaluate, disassemble, use_vml
@@ -19,8 +30,9 @@ TestCase = unittest.TestCase
 
 double = numpy.double
 
+
 # Recommended minimum versions
-minimum_numpy_version = "1.2"
+minimum_numpy_version = "1.6"
 
 class test_numexpr(TestCase):
 
@@ -73,29 +85,42 @@ class test_numexpr(TestCase):
                       ('prod_ddn', 'r0', 't3', 2)])
         # Check that full reductions work.
         x = zeros(1e5)+.01   # checks issue #41
-        assert_equal(evaluate("sum(x+2,axis=0)"), sum(x+2,axis=0))
-        assert_equal(evaluate("prod(x,axis=0)"), prod(x,axis=0))
+        assert_allclose(evaluate("sum(x+2,axis=None)"), sum(x+2,axis=None))
+        assert_allclose(evaluate("sum(x+2,axis=0)"), sum(x+2,axis=0))
+        assert_allclose(evaluate("prod(x,axis=0)"), prod(x,axis=0))
+
+        x = arange(10.0)
+        assert_allclose(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
+        assert_allclose(evaluate("prod(x**2+2,axis=0)"), prod(x**2+2,axis=0))
+
+        x = arange(100.0)
+        assert_allclose(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
+        assert_allclose(evaluate("prod(x-1,axis=0)"), prod(x-1,axis=0))
+        x = linspace(0.1,1.0,2000)
+        assert_allclose(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
+        assert_allclose(evaluate("prod(x-1,axis=0)"), prod(x-1,axis=0))
+
         # Check that reductions along an axis work
         y = arange(9.0).reshape(3,3)
-        assert_equal(evaluate("sum(y**2, axis=1)"), sum(y**2, axis=1))
-        assert_equal(evaluate("sum(y**2, axis=0)"), sum(y**2, axis=0))
-        assert_equal(evaluate("sum(y**2, axis=None)"), sum(y**2, axis=None))
-        assert_equal(evaluate("prod(y**2, axis=1)"), prod(y**2, axis=1))
-        assert_equal(evaluate("prod(y**2, axis=0)"), prod(y**2, axis=0))
-        assert_equal(evaluate("prod(y**2, axis=None)"), prod(y**2, axis=None))
+        assert_allclose(evaluate("sum(y**2, axis=1)"), sum(y**2, axis=1))
+        assert_allclose(evaluate("sum(y**2, axis=0)"), sum(y**2, axis=0))
+        assert_allclose(evaluate("sum(y**2, axis=None)"), sum(y**2, axis=None))
+        assert_allclose(evaluate("prod(y**2, axis=1)"), prod(y**2, axis=1))
+        assert_allclose(evaluate("prod(y**2, axis=0)"), prod(y**2, axis=0))
+        assert_allclose(evaluate("prod(y**2, axis=None)"), prod(y**2, axis=None))
         # Check integers
         x = arange(10.)
         x = x.astype(int)
-        assert_equal(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
-        assert_equal(evaluate("prod(x**2+2,axis=0)"), prod(x**2+2,axis=0))
+        assert_allclose(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
+        assert_allclose(evaluate("prod(x**2+2,axis=0)"), prod(x**2+2,axis=0))
         # Check longs
         x = x.astype(long)
-        assert_equal(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
-        assert_equal(evaluate("prod(x**2+2,axis=0)"), prod(x**2+2,axis=0))
+        assert_allclose(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
+        assert_allclose(evaluate("prod(x**2+2,axis=0)"), prod(x**2+2,axis=0))
         # Check complex
-        x = x + 5j
-        assert_equal(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
-        assert_equal(evaluate("prod(x**2+2,axis=0)"), prod(x**2+2,axis=0))
+        x = x + .1j
+        assert_allclose(evaluate("sum(x**2+2,axis=0)"), sum(x**2+2,axis=0))
+        assert_allclose(evaluate("prod(x-1,axis=0)"), prod(x-1,axis=0))
 
     def test_axis(self):
         y = arange(9.0).reshape(3,3)
@@ -111,12 +136,16 @@ class test_numexpr(TestCase):
             pass
         else:
             raise ValueError("should raise exception!")
-
-
-
+        try:
+            # Negative axes are not supported
+            evaluate("sum(y, axis=-1)")
+        except ValueError:
+            pass
+        else:
+            raise ValueError("should raise exception!")
 
     def test_r0_reuse(self):
-        assert_equal(disassemble(NumExpr("x**2+2", [('x', double)])),
+        assert_equal(disassemble(NumExpr("x * x + 2", [('x', double)])),
                     [('mul_ddd', 'r0', 'r1[x]', 'r1[x]'),
                      ('add_ddd', 'r0', 'r0', 'c2[2.0]')])
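+        # (each tuple reads: opcode, destination register, source operands;
+        # 'r1[x]' is the register bound to input x, 'c2[2.0]' a constant)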
 
@@ -156,6 +185,25 @@ class test_evaluate(TestCase):
         x2[1] = 1
         assert_array_equal(x2, y)
 
+    # Test for issue #22
+    def test_true_div(self):
+        x = arange(10, dtype='i4')
+        assert_array_equal(evaluate("x/2"), x / 2)
+        assert_array_equal(evaluate("x/2", truediv=False), x / 2)
+        assert_array_equal(evaluate("x/2", truediv='auto'), x / 2)
+        assert_array_equal(evaluate("x/2", truediv=True), x / 2.0)
+
+    # PyTables uses __nonzero__ among ExpressionNode objects internally,
+    # so this test is disabled for the moment.  See #24.
+    def _test_boolean_operator(self):
+        x = arange(10, dtype='i4')
+        try:
+            evaluate("(x > 1) and (x < 9)")
+        except TypeError:
+            pass
+        else:
+            raise ValueError("should raise exception!")
+
     def test_rational_expr(self):
         a = arange(1e6)
         b = arange(1e6) * 0.1
@@ -201,7 +249,7 @@ class test_evaluate(TestCase):
     def test_all_scalar(self):
         a = 3.
         b = 4.
-        assert_equal(evaluate("a+b"), a+b)
+        assert_allclose(evaluate("a+b"), a+b)
         expr = NumExpr("2*a+3*b",[('a', double),('b', double)])
         assert_equal(expr(a,b), 2*a+3*b)
 
@@ -255,7 +303,7 @@ tests = [
           '1+1',
           '1',
           'cos(a2)',
-          '(a+1)**0'])]
+          ])]
 
 optests = []
 for op in list('+-*/%') + ['**']:
@@ -282,31 +330,35 @@ for func in ['copy', 'ones_like', 'sqrt',
              'sinh', 'cosh', 'tanh', 'arcsinh', 'arccosh', 'arctanh',
              'log', 'log1p', 'log10', 'exp', 'expm1', 'abs']:
     func1tests.append("a + %s(b+c)" % func)
-tests.append(('1-ARG FUNCS', func1tests))
+tests.append(('1_ARG_FUNCS', func1tests))
 
 func2tests = []
 for func in ['arctan2', 'fmod']:
     func2tests.append("a + %s(b+c, d+1)" % func)
     func2tests.append("a + %s(b+c, 1)" % func)
     func2tests.append("a + %s(1, d+1)" % func)
-tests.append(('2-ARG FUNCS', func2tests))
+tests.append(('2_ARG_FUNCS', func2tests))
 
 powtests = []
-for n in (-2.5, -1.5, -1.3, -.5, 0, 0.5, 1, 0.5, 1, 2.3, 2.5):
+# n = -1, 0.5, 2, 4 already handled in section "OPERATIONS"
+for n in (-7, -2.5, -1.5, -1.3, -.5, 0, 0.0, 1, 2.3, 2.5, 3):
     powtests.append("(a+1)**%s" % n)
-tests.append(('POW TESTS', powtests))
+tests.append(('POW_TESTS', powtests))
 
 def equal(a, b, exact):
+    if array_equal(a, b):
+        return True
+
     if hasattr(a, 'dtype') and a.dtype in ['f4','f8']:
         nnans = isnan(a).sum()
-        if isnan(a).sum() > 0:
+        if nnans > 0:
             # For results containing NaNs, just check that the number
             # of NaNs is the same in both arrays.  This check could be
             # made more exhaustive, but checking element by element in
             # python space is very expensive in general.
             return nnans == isnan(b).sum()
         ninfs = isinf(a).sum()
-        if isinf(a).sum() > 0:
+        if ninfs > 0:
             # Ditto for Inf's
             return ninfs == isinf(b).sum()
     if exact:
@@ -324,22 +376,32 @@ class Skip(Exception): pass
 def test_expressions():
     test_no = [0]
     def make_test_method(a, a2, b, c, d, e, x, expr,
-                         test_scalar, dtype, optimization, exact):
+                         test_scalar, dtype, optimization, exact, section):
         this_locals = locals()
         def method():
-            npval = eval(expr, globals(), this_locals)
+            # We don't want to be distracted by RuntimeWarnings like
+            # "overflow" or "divide by zero".  Feel free to widen the
+            # scope of this filter, if needed.
+            if dtype.__name__ == "float32" or 'arctanh' in expr:
+                warnings.simplefilter("ignore")
+                npval = eval(expr, globals(), this_locals)
+                warnings.simplefilter("always")
+            else:
+                npval = eval(expr, globals(), this_locals)
             try:
                 neval = evaluate(expr, local_dict=this_locals,
                                  optimization=optimization)
                 assert equal(npval, neval, exact), """%r
 (test_scalar=%r, dtype=%r, optimization=%r, exact=%r,
- npval=%r (%r)\n neval=%r (%r))""" % (expr, test_scalar, dtype.__name__,
+ npval=%r (%r - %r)\n neval=%r (%r - %r))""" % (expr, test_scalar, dtype.__name__,
                                      optimization, exact,
-                                     npval, type(npval), neval, type(neval))
+                                     npval, type(npval), shape(npval),
+                                     neval, type(neval), shape(neval))
             except AssertionError:
                 raise
             except NotImplementedError:
-                print('%r not implemented for %s' % (expr,dtype.__name__))
+                print('%r not implemented for %s (scalar=%d, opt=%s)'
+                      % (expr, dtype.__name__, test_scalar, optimization))
             except:
                 print('numexpr error for expression %r' % (expr,))
                 raise
@@ -347,11 +409,15 @@ def test_expressions():
                               'dtype=%r, optimization=%r, exact=%r)') \
                     % (expr, test_scalar, dtype.__name__, optimization, exact)
         test_no[0] += 1
-        method.__name__ = 'test_%04d' % (test_no[0],)
+        method.__name__ = 'test_scalar%d_%s_%s_%s_%04d' % (test_scalar,
+                                                           dtype.__name__,
+                                                           optimization,
+                                                           section,
+                                                           test_no[0])
         return method
     x = None
-    for test_scalar in [0,1,2]:
-        for dtype in [int, long, numpy.float32, double, complex]:
+    for test_scalar in (0, 1, 2):
+        for dtype in (int, long, numpy.float32, double, complex):
             array_size = 100
             a = arange(2*array_size, dtype=dtype)[::2]
             a2 = zeros([array_size, array_size], dtype=dtype)
@@ -365,9 +431,9 @@ def test_expressions():
                     x += 1j
                     x *= 1+1j
             if test_scalar == 1:
-                a = a[array_size/2]
+                a = a[array_size / 2]
             if test_scalar == 2:
-                b = b[array_size/2]
+                b = b[array_size / 2]
             for optimization, exact in [
                 ('none', False), ('moderate', False), ('aggressive', False)]:
                 for section_name, section_tests in tests:
@@ -375,16 +441,18 @@ def test_expressions():
                         if dtype == complex and (
                                '<' in expr or '>' in expr or '%' in expr
                                or "arctan2" in expr or "fmod" in expr):
-                             # skip complex comparisons or functions not
-                             # defined in complex domain.
+                            # skip complex comparisons or functions not
+                            # defined in complex domain.
                             continue
                         if (dtype in (int, long) and test_scalar and
                             expr == '(a+1) ** -1'):
                             continue
+
                         m = make_test_method(a, a2, b, c, d, e, x,
                                              expr, test_scalar, dtype,
-                                             optimization, exact)
-                        yield m,
+                                             optimization, exact,
+                                             section_name)
+                        yield m
 
 class test_int64(TestCase):
     def test_neg(self):
@@ -561,6 +629,28 @@ class test_irregular_stride(TestCase):
         assert_array_equal(f0[i0], arange(5, dtype=int32))
         assert_array_equal(f1[i1], arange(5, dtype=float64))
 
+# Cases for testing arrays with dimensions that can be zero.
+class test_zerodim(TestCase):
+
+    def test_zerodim1d(self):
+        a0 = array([], dtype=int32)
+        a1 = array([], dtype=float64)
+
+        r0 = evaluate('a0 + a1')
+        r1 = evaluate('a0 * a1')
+
+        assert_array_equal(r0, a1)
+        assert_array_equal(r1, a1)
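+        # (int32 + float64 promotes to float64, which is why the results
+        # are compared against the float64 array a1)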
+
+    def test_zerodim3d(self):
+        a0 = array([], dtype=int32).reshape(0,2,4)
+        a1 = array([], dtype=float64).reshape(0,2,4)
+
+        r0 = evaluate('a0 + a1')
+        r1 = evaluate('a0 * a1')
+
+        assert_array_equal(r0, a1)
+        assert_array_equal(r1, a1)
 
 # Case test for threads
 class test_threading(TestCase):
@@ -585,7 +675,10 @@ def _worker(qout = None):
 # Case test for subprocesses (via multiprocessing module)
 class test_subprocess(TestCase):
     def test_multiprocess(self):
-        import multiprocessing as mp
+        try:
+            import multiprocessing as mp
+        except ImportError:
+            return
         # Check for two threads at least
         numexpr.set_num_threads(2)
         #print "**** Running from main process:"
@@ -640,12 +733,16 @@ def suite():
 
     class TestExpressions(TestCase):
         pass
-    for m, in test_expressions():
+
+    def add_method(func):
         def method(self):
-            return m()
-        setattr(TestExpressions, m.__name__,
+            return func()
+        setattr(TestExpressions, func.__name__,
                 new.instancemethod(method, None, TestExpressions))
 
+    for func in test_expressions():
+        add_method(func)
+
     for n in range(niter):
         theSuite.addTest(unittest.makeSuite(test_numexpr1))
         theSuite.addTest(unittest.makeSuite(test_numexpr2))
@@ -656,6 +753,7 @@ def suite():
         theSuite.addTest(unittest.makeSuite(test_strings))
         theSuite.addTest(
             unittest.makeSuite(test_irregular_stride) )
+        theSuite.addTest(unittest.makeSuite(test_zerodim))
         theSuite.addTest(unittest.makeSuite(test_subprocess))
         # I need to put this test after test_subprocess because
         # if not, the test suite locks immediately before test_subproces.
@@ -668,3 +766,5 @@ def suite():
 if __name__ == '__main__':
     print_versions()
     unittest.main(defaultTest = 'suite')
+#    suite = suite()
+#    unittest.TextTestRunner(verbosity=2).run(suite)
diff --git a/numexpr/utils.py b/numexpr/utils.py
index e32f3d4..0cfa7e5 100644
--- a/numexpr/utils.py
+++ b/numexpr/utils.py
@@ -1,3 +1,13 @@
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 import os
 
 from numexpr.interpreter import _set_num_threads
diff --git a/numexpr/version.py b/numexpr/version.py
index cf03429..8e0f9ac 100644
--- a/numexpr/version.py
+++ b/numexpr/version.py
@@ -1,4 +1,14 @@
-version='1.4.2'
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
+version='2.0.1'
 release=True
 
 if not release:
diff --git a/setup.py b/setup.py
index ece0531..26c9150 100644
--- a/setup.py
+++ b/setup.py
@@ -1,10 +1,32 @@
 #!/usr/bin/env python
+###################################################################
+#  Numexpr - Fast numerical array expression evaluator for NumPy.
+#
+#      License: MIT
+#      Author:  See AUTHORS.txt
+#
+#  See LICENSE.txt and LICENSES/*.txt for details about copyright and
+#  rights to use.
+####################################################################
+
 import shutil
-import os
+import os, sys
 import os.path as op
 from distutils.command.clean import clean
+import numpy
+from distutils.version import LooseVersion
 from numpy.distutils.command.build_ext import build_ext as numpy_build_ext
 
+
+minimum_numpy_version = "1.6"
+
+if sys.version_info < (2, 5):
+    raise RuntimeError("must use python 2.5 or greater")
+
+if LooseVersion(numpy.__version__) < LooseVersion(minimum_numpy_version):
+    print "*Error*: NumPy version is lower than needed: %s < %s" % \
+          (numpy.__version__, minimum_numpy_version)
+    sys.exit(1)
+
 try:
     import setuptools
 except ImportError:
@@ -51,6 +73,7 @@ def configuration():
         'depends': ['numexpr/interp_body.c',
                     'numexpr/complex_functions.inc',
                     'numexpr/msvc_function_stubs.inc'],
+        'libraries': ['m'],
         'extra_compile_args': ['-funroll-all-loops',],
         }
     dict_append(extension_config_data, **mkl_config_data)
@@ -120,23 +143,13 @@ def setup_package():
 class build_ext(numpy_build_ext):
     def build_extension(self, ext):
         # at this point we know what the C compiler is.
-        c = self.compiler
-        old_compile_options = None
-        # For MS Visual C, we use /O1 instead of the default /Ox,
-        # as /Ox takes a long time (~5 mins) to compile.
-        # The speed of the code isn't noticeably different.
-        if c.compiler_type == 'msvc':
-            if not c.initialized:
-                c.initialize()
-            old_compile_options = c.compile_options[:]
-            if '/Ox' in c.compile_options:
-                c.compile_options.remove('/Ox')
-            c.compile_options.append('/O1')
+        if self.compiler.compiler_type == 'msvc':
             ext.extra_compile_args = []
+            # also remove extra linker arguments that msvc doesn't understand
+            ext.extra_link_args = []
+            # ... and the gcc math library added above
+            ext.libraries.remove('m')
         numpy_build_ext.build_extension(self, ext)
-        if old_compile_options is not None:
-            self.compiler.compile_options = old_compile_options
 
 if __name__ == '__main__':
     setup_package()
-

-- 
Packaging for numexpr


