[vspline] 06/72: Finished coding for the 1D specialization of the filter. I feel I've now found a reasonable path for treating 1D arrays, where large arrays will benefit very much and resources still aren't stretched too far. I also made some minor changes in the filter code, now passing the gain to the filter instead of the previous lambda_exponent; the new method is more flexible.

Kay F. Jahnke kfj-guest at moszumanska.debian.org
Sun Jul 2 09:02:37 UTC 2017


This is an automated email from the git hooks/post-receive script.

kfj-guest pushed a commit to branch master
in repository vspline.

commit 131eeac15f05cad8473280cf2099557051ce9255
Author: Kay F. Jahnke <kfjahnke at gmail.com>
Date:   Thu Oct 27 21:23:11 2016 +0200

    Finished coding for the 1D specialization of the filter
    I feel I've now found a reasonable path for treating 1D arrays,
    where large arrays will benefit very much and resources still aren't
    stretched too far.
    I also made some minor changes in the filter code, now passing the
    gain to the filter instead of the previous lambda_exponent; the new
    method is more flexible.
---
 README.rst                  |  12 +-
 bspline.h                   |  21 +-
 common.h                    |   5 +-
 doxy.h                      |   6 +-
 example/impulse_response.cc |  73 ++++++
 filter.h                    | 539 ++++++++++++++++++++++++++++++++++++--------
 poles.cc                    |   5 +
 prefilter.h                 |  31 +--
 8 files changed, 567 insertions(+), 125 deletions(-)

diff --git a/README.rst b/README.rst
index 4e4b6df..df91f62 100644
--- a/README.rst
+++ b/README.rst
@@ -70,7 +70,7 @@ On the evaluation side I provide
 
 The code at the very core of my B-spline coefficient generation code evolved from the code by Philippe Thévenaz which he published here_, with some of the boundary condition treatment code derived from formulae which Philippe Thévenaz communicated to me. Next I needed code to handle multidimensional arrays in a generic fashion in C++. I chose to use Vigra_. Since my work has a focus on signal (and, more specifically image) processing, it's an excellent tool, as it provides a large body o [...]
 
-I did all my programming on a Kubuntu_ system, running on an intel(R) Core (TM) i5-4570 CPU, and used GNU gcc_ to compile the code. While I am confident that the code will run on other CPUs, I have not tested it with different compilers or operating systems (yet).
+I did all my programming on a Kubuntu_ 16.04 system, running on an intel(R) Core (TM) i5-4570 CPU, and used GNU gcc_ 5.4.0 to compile the code in C++11 dialect. While I am confident that the code runs on other CPUs, I have not tested it with different compilers or operating systems (yet).
 
 .. _here: http://bigwww.epfl.ch/thevenaz/interpolation/
 .. _Vigra: http://ukoethe.github.io/vigra/
@@ -88,7 +88,15 @@ I did all my programming on a Kubuntu_ system, running on an intel(R) Core (TM)
 Documentation
 -------------
 
-There is reasonably comprehensive documentation for vspline, generated with doxygen, in the doc folder. TODO: link here to static HTML
+There is reasonably comprehensive documentation for vspline, which can be generated with doxygen
+by running
+
+doxygen vspline.doxy
+
+in vspline's base folder. A reasonably up-to-date version of the documentation will be
+provided online when I get around to it.
+
+TODO: construction site! link here to static HTML once it's online
 
 -----
 Speed
diff --git a/bspline.h b/bspline.h
index ce5c4e8..301c0c1 100644
--- a/bspline.h
+++ b/bspline.h
@@ -173,6 +173,7 @@ struct bspline
                             view_type& ,
                             bcv_type ,
                             int ,
+                            double ,
                             int ) ;
 
   /// elementary type of value_type, float or double
@@ -190,6 +191,7 @@ public:
   view_type core ;          ///< view to the core part of the coefficient array
   int spline_degree ;       ///< degree of the spline (3 == cubic spline)
   bcv_type bcv ;            ///< boundary condition, see common.h
+  double tolerance ;        ///< acceptable error
   bool braced ;             ///< whether coefficient array is 'braced' or not
   int horizon ;             ///< additional frame size for explicit scheme
   shape_type left_brace ;   ///< width(s) of the left handside bracing
@@ -300,12 +302,14 @@ public:
             bcv_type _bcv = bcv_type ( MIRROR ) ,   ///< boundary conditions and common default
             prefilter_strategy _strategy = BRACED , ///< default strategy is the 'implicit' scheme
             int _horizon = -1 ,                     ///< width of frame for explicit scheme
-            view_type _space = view_type()          ///< coefficient storage to 'adopt'
+            view_type _space = view_type() ,        ///< coefficient storage to 'adopt'
+            double _tolerance = .0000001            ///< acceptable error (relative to unit pulse)
           )
   : core_shape ( _core_shape ) ,
     spline_degree ( _spline_degree ) ,
     bcv ( _bcv ) ,
-    strategy ( _strategy )
+    strategy ( _strategy ) ,
+    tolerance ( _tolerance )
   {
     // heuristic horizon for reasonable precision - we assume that no one in their right
     // minds would want a negative horizon ;)
@@ -412,10 +416,7 @@ public:
 
     bracer<view_type> br ;
 
-    // KFJ 2016-10-22 changed this code from using zero-padding as explicit_bcv
-    // to using an assumed constant continuation, which produces less of a discontinuity.
-    // for the explicit schemes, we use bc code CONSTANT
-    bcv_type explicit_bcv ( CONSTANT ) ;
+    bcv_type explicit_bcv ( IGNORE ) ;
 
     switch ( strategy )
     {
@@ -425,6 +426,7 @@ public:
                 core ,
                 bcv ,
                 spline_degree ,
+                tolerance ,
                 nthreads
               ) ;
         break ;
@@ -435,13 +437,14 @@ public:
                 core ,
                 bcv ,
                 spline_degree ,
+                tolerance ,
                 nthreads
               ) ;
         for ( int d = 0 ; d < dimension ; d++ )
           br ( coeffs , bcv[d] , spline_degree , d ) ;
         break ;
       case EXPLICIT:
-        // apply bracing with BC codes passed in, then solve with BC code CONSTANT
+        // apply bracing with BC codes passed in, then solve with BC code IGNORE
         // this automatically fills the brace, as well, since it's part of the frame.
         // TODO: the values in the frame will not come out precisely the same as they
         // would by filling the brace after the coefficients have been calculated.
@@ -455,12 +458,13 @@ public:
                 container ,
                 explicit_bcv ,
                 spline_degree ,
+                tolerance ,
                 nthreads
               ) ;
         break ;
       case MANUAL:
         // like EXPLICIT, but don't apply a frame, assume a frame was applied
-        // by external code. process whole container with CONSTANT BC. For cases
+        // by external code. process whole container with IGNORE BC. For cases
         // where the frame can't be constructed by applying any of the stock bracing
         // modes. Note that if any data were passed into this routine, in this case
         // they will be silently ignored (makes no sense overwriting the core after
@@ -469,6 +473,7 @@ public:
                 container ,
                 explicit_bcv ,
                 spline_degree ,
+                tolerance ,
                 nthreads
               ) ;
         break ;
diff --git a/common.h b/common.h
index b2557b6..4e1b578 100644
--- a/common.h
+++ b/common.h
@@ -101,14 +101,16 @@ typedef enum { PERIODIC,   ///< periodic boundary conditions / periodic mapping
                NATURAL,    ///< natural boundary conditions / uses constant mapping
                MIRROR ,    ///< mirror on the bounds, both bc and mapping
                REFLECT ,   ///< reflect, so  that f(-1) == f(0), both bc and mapping
+               SAFE_BEYOND , ///< assume the input extends beyond [0...M]
                CONSTANT ,  ///< used for framing, with explicit prefilter scheme, and for mapping
                ZEROPAD ,   ///< used for boundary condition, bracing
                IGNORE ,    ///< used for boundary condition, bracing
                IDENTITY ,  ///< used as solver argument, mostly internal use
+//                GUESS ,
                SPHERICAL , ///< intended use for spherical panoramas, needs work
                REJECT ,    ///< mapping mode, throws out_of_bounds for out-of-bounds coordinates
                LIMIT ,     ///< mapping mode, maps out-of-bounds coordinates to left/right boundary
-               RAW         ///< mapping mode, processes coordinates unchecked, may crash the program
+               RAW   ,     ///< mapping mode, processes coordinates unchecked, may crash the program
 } bc_code;
 
 /// This enumeration is used by the convenience class 'bspline' to determine the prefiltering
@@ -128,6 +130,7 @@ std::vector < std::string > bc_name =
   "NATURAL",
   "MIRROR" ,
   "REFLECT" ,
+  "SAFE_BEYOND" ,
   "CONSTANT" ,
   "ZEROPAD" ,
   "IGNORE" ,
diff --git a/doxy.h b/doxy.h
index 6c19403..3735170 100644
--- a/doxy.h
+++ b/doxy.h
@@ -97,9 +97,9 @@ On the evaluation side I provide
  
  g++ -pthread -O3 --std=c++11 your_code.cc -lvigraimpex
  
- All access to Vc in the code is inside #define USE_VC .... #endif statements, so not defining USE_VC will effectively prevent it's use.
+ All access to Vc in the code is inside #ifdef USE_VC .... #endif statements, so not defining USE_VC will effectively prevent its use.
  
- For simplicity's sake, even if the code isn't compiled to use Vc, the higher level code will still accept the common use_vc flag in the call signatures, but it's value wont have an effect. When the code is compiled to use Vc, the unvectorized code is still built and available by calling the relevant routines with use_vc set to false. The documentation is built to contain text for vectorized operation, if this is unwanted, change the doxy file.
+ For simplicity's sake, even if the code isn't compiled to use Vc, the higher level code may still accept the common use_vc flag in the call signatures, but its value won't have an effect. When the code is compiled to use Vc, the unvectorized code is still built and available by calling the relevant routines with use_vc set to false. The documentation is built to contain text for vectorized operation; if this is unwanted, change the doxy file.
  
  \section license_sec License
  
@@ -237,7 +237,7 @@ Using double precision arithmetics, vectorization doesn't help so much, and pref
  
  Another speedup method is data-parallel processing. This is often thought to be the domain of GPUs, but modern CPUs also offer it in the form of vector units. I chose implementing data-parallel processing in the CPU, as it offers tight integration with unvectorized CPU code. It's almost familiar terrain, and the way from writing conventional CPU code to vector unit code is not too far, when using tools like Vc, which abstract the hardware away. Using horizontal vectorization does requir [...]
 
- Using both techniques together makes vspline fast. The target I was roughly aiming at was to achieve frame rates of ca. 50 fps in float RGB and full HD, producing the images via remap from a precalculated warp array. On my system, I have almost reached that goal - my remap times are around 25 msec (for a cubic spline), and with memory access etc. I come up to frame rates near half of what I was aiming at. The idea is to exploit the CPU fully, leaving the GPU, if present, free to do some [...]
+ Using both techniques together makes vspline fast. The target I was roughly aiming at was to achieve frame rates of ca. 50 fps in float RGB and full HD, producing the images via remap from a precalculated warp array. On my system, I have almost reached that goal - my remap times are around 25 msec (for a cubic spline), and with memory access etc. I come up to frame rates over half of what I was aiming at. The idea is to exploit the CPU fully, leaving the GPU, if present, free to do some [...]
  
  On the other hand, even without using vectorization, the code is certainly fast enough for casual use and may suffice for some production scenarios. This way, vigra becomes the only dependency, and the same binary will work on a wide range of hardware.
  
diff --git a/example/impulse_response.cc b/example/impulse_response.cc
new file mode 100644
index 0000000..d444ae1
--- /dev/null
+++ b/example/impulse_response.cc
@@ -0,0 +1,73 @@
+/************************************************************************/
+/*                                                                      */
+/*    vspline - a set of generic tools for creation and evaluation      */
+/*              of uniform b-splines                                    */
+/*                                                                      */
+/*            Copyright 2015, 2016 by Kay F. Jahnke                     */
+/*                                                                      */
+/*    Permission is hereby granted, free of charge, to any person       */
+/*    obtaining a copy of this software and associated documentation    */
+/*    files (the "Software"), to deal in the Software without           */
+/*    restriction, including without limitation the rights to use,      */
+/*    copy, modify, merge, publish, distribute, sublicense, and/or      */
+/*    sell copies of the Software, and to permit persons to whom the    */
+/*    Software is furnished to do so, subject to the following          */
+/*    conditions:                                                       */
+/*                                                                      */
+/*    The above copyright notice and this permission notice shall be    */
+/*    included in all copies or substantial portions of the             */
+/*    Software.                                                         */
+/*                                                                      */
+/*    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND    */
+/*    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES   */
+/*    OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND          */
+/*    NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT       */
+/*    HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,      */
+/*    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING      */
+/*    FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR     */
+/*    OTHER DEALINGS IN THE SOFTWARE.                                   */
+/*                                                                      */
+/************************************************************************/
+
+/// impulse_response.cc
+///
+/// filter a unit pulse with a b-spline prefilter of a given degree
+/// and display the central section of the result
+///
+/// compile with:
+/// g++ -std=c++11 -o impulse_response -O3 -pthread -DUSE_VC=1 impulse_response.cc -lVc
+///
+/// to get the central section with values beyond +/- 0.00042 of a degree 7 b-spline:
+///
+/// impulse_response 7 .00042
+
+#include <iomanip>
+#include <vspline/prefilter.h>
+#include <random>
+
+int main ( int argc , char * argv[] )
+{
+  int degree = std::atoi ( argv[1] ) ;
+  double cutoff = std::atof ( argv[2] ) ;
+  
+  vigra::MultiArray < 1 , double > array1 ( 10001 ) ;
+  vigra::MultiArrayView < 1 , double > v1 ( array1 ) ;
+  vigra::TinyVector<vspline::bc_code, 1> bcv ( vspline::MIRROR ) ;
+  
+  v1 = 0.0 ;
+  v1[5000] = 1.0 ;
+  vspline::solve_vc ( v1 , v1 , bcv , degree , 0.00001 ) ;
+  
+  std::cout << "double ir_" << degree << "[] = {" << std::endl ;
+//   std::cout << std::fixed << std::showpos << std::showpoint << std::setprecision(16);
+  std::cout << std::showpos ;
+  int k ;
+  for ( k = 0 ; k < 10000 ; k++ )
+  {
+    if ( fabs ( v1[k] ) > cutoff )
+      std::cout << v1[k] << " ," << std::endl ;
+  }
+  if ( fabs ( v1[k] ) > cutoff )
+    std::cout << v1[k] ;
+  std::cout << " } ;" << std::endl ;
+}
diff --git a/filter.h b/filter.h
index 453f622..23b33e4 100644
--- a/filter.h
+++ b/filter.h
@@ -60,6 +60,23 @@
 
 namespace vspline {
 
+/// overall_gain is a helper routine:
+/// Simply executing the filtering code by itself will attenuate the signal.
+/// Here we calculate the gain which, applied to the signal, will cancel this effect.
+/// While this code was initially part of the filter's constructor, I took it out
+/// to gain some flexibility by passing in the gain as a parameter.
+
+double overall_gain ( const int & nbpoles , const double * const pole )
+{
+  double lambda = 1.0 ;
+
+  for ( int k = 0 ; k < nbpoles ; k++ )
+
+    lambda = lambda * ( 1.0 - pole[k] ) * ( 1.0 - 1.0 / pole[k] ) ;
+  
+  return lambda ;
+}
+  
 /// for each pole passed in, this filter will perform a forward-backward
 /// first order IIR filter, initially on the data passed in via in_iter, subsequently
 /// on the result of the application of the previous pole, using these recursions:
@@ -115,7 +132,7 @@ class filter
 
   const double* pole ;               ///< poles of the IIR filter
   std::vector<int> horizon ;         ///< corresponding horizon values
-  real_type lambda ;                 ///< (potentiated) overall gain.  
+  const real_type lambda ;           ///< (potentiated) overall gain.  
   const int npoles ;                 ///< Number of filter poles
   const int M ;                      ///< length of the data
 
@@ -136,7 +153,7 @@ class filter
   p_iacc  _p_iacc ;  ///< pointer to calculation of initial anticausal coefficient
   
 public:
-  
+
  /// solve() takes two iterators, one to the input data and one to the output space.
  /// The containers must have the same size. It's safe to use solve() in-place.
 
@@ -390,6 +407,7 @@ value_type icc_periodic ( IT c , int k )
  return Sum ;
 }
 
+/// TODO doublecheck this routine!
 value_type iacc_periodic ( out_iter c , int k )
 {
   real_type z = pole[k] ;
@@ -421,26 +439,56 @@ value_type iacc_periodic ( out_iter c , int k )
   return Sum ;
 }
 
-// icc_constant and iacc_constant calculate the initial coefficients by assuming
-// that all values of the input outside the defined range are equal to the value
-// at the nearest defined location. This produces a very straightforward calculation
-// which doesn't need any 'looking into' the data and can be executed quickly. But
-// even though it's simple, it's a step up from zero-padding (which is what using
-// c[0] and c[m-1] as the initial coefficients is equivalent to) because it introduces
-// less of a discontinuity and therefore makes the recursion approach it's 'real' value
-// more quickly, since there is less of an initial disturbance to 'ierate away'.
+/// icc_safe_beyond and iacc_safe_beyond are for situations where c[0] ... c[M-1]
+/// merely define a section of a larger array of data, so that it's safe to access
+/// horizon[k] values beyond either end of c[0] ... c[M-1].
+/// currently unused, because it doesn't work with buffering
 
 template < class IT >
-value_type icc_constant ( IT c , int k )
+value_type icc_safe_beyond ( IT c , int k )
 {
-  return c[0] * ( 1.0 / ( 1.0 - pole[k] ) ) ;
+  real_type p = pole[k] ;
+  int z = horizon[k] ;
+  value_type icc = c [ - z ] ;
+  
+  for ( int i = - z + 1 ; i <= 0 ; i++ )
+  {
+    icc = p * icc + c[i] ;
+  }
+  
+  std::cout << "icc: " << icc << std::endl ;
+  return icc ;
 }
 
-value_type iacc_constant ( out_iter c , int k )
+value_type iacc_safe_beyond ( out_iter c , int k )
 {
-  return c [ M - 1 ] * ( pole[k] / ( pole[k] - 1.0 ) ) ;
+  real_type p = pole[k] ;
+  int z = horizon[k] ;
+  value_type iacc = c [ M - 1 + z ] ;
+
+  for ( int i = z - 1 ; i >= 0 ; i-- )
+    iacc = p * ( iacc - c [ M - 1 + i ] ) ; 
+  std::cout << "iacc: " << iacc << std::endl ;
+  return iacc ;
 }
 
+// guess the initial coefficient. This tries to minimize the effect
+// of starting out with a hard discontinuity as it occurs with zero-padding.
+// TODO needs some thought, currently unused.
+
+// template < class IT >
+// value_type icc_guess ( IT c , int k )
+// {
+//   return c[0] * ( 1.0 / ( 1.0 - pole[k] ) ) ;
+// }
+// 
+// value_type iacc_guess ( out_iter c , int k )
+// {
+//   for ( int i = 0 ; i < M ; i++ )
+//     std::cout << "c[" << i << "] = " << c[i] << std::endl ;
+//   return c[M-1] * ( pole[k] / ( pole[k] * pole[k] - 1.0 ) ) ;
+// }
+
 template < class IT >
 value_type icc_identity ( IT c , int k )
 {
@@ -458,12 +506,10 @@ value_type iacc_identity ( out_iter c , int k )
 /// in exchange for maximum efficiency.
 /// The code itself is adapted from P. Thevenaz' code.
 ///
-/// the first solve routine is to be used for the first dimension.
-/// here lambda, the overall gain, is applied to the elements of the input as they
-/// are processed, saving a separate loop to preapply the gain. Subsequent poles
-/// and further dimensions then use the next routine. The gain which is applied here
-/// may be a power of the 'orthodox' gain, to avoid having to reapply the 'orthodox'
-/// lambda with every dimension which is processed. See the constructor.
+/// This variant uses a 'carry' element, 'X', to carry the result of the recursion
+/// from one iteration to the next instead of using the direct implementation of the
+/// recursion formula, which would read the previous value of the recursion from memory
+/// by accessing x[n-1], or, x[n+1], respectively.
 
 int solve_gain_inlined ( in_iter c , out_iter x )
 {
@@ -472,6 +518,7 @@ int solve_gain_inlined ( in_iter c , out_iter x )
   // use a buffer of one value_type for the recursion (see below)
 
   value_type X ;
+  real_type p = pole[0] ;
   
   // process first pole, applying overall gain in the process
   // of consuming the input. This gain may be a power of the 'orthodox'
@@ -499,7 +546,7 @@ int solve_gain_inlined ( in_iter c , out_iter x )
   
   for (int n = 1; n < M; n++)
   {
-    x[n] = X = lambda * c[n] + real_type ( pole[0] ) * X ;
+    x[n] = X = lambda * c[n] + p * X ;
   }
   
   // now the input is used up and won't be looked at any more; all subsequent
@@ -512,7 +559,7 @@ int solve_gain_inlined ( in_iter c , out_iter x )
   /* anticausal recursion */
   for (int n = M - 2; 0 <= n; n--)
   {
-    x[n] = X = real_type ( pole[0] ) * ( X - x[n]);
+    x[n] = X = real_type ( p ) * ( X - x[n]);
   }
   
   // for the remaining poles, if any, don't apply the gain
@@ -520,13 +567,14 @@ int solve_gain_inlined ( in_iter c , out_iter x )
   
   for (int k = 1; k < npoles; k++)
   {
+    p = pole[k] ;
     /* causal initialization */
     x[0] = X = (this->*_p_icc2)(x, k);
     
     /* causal recursion */
     for (int n = 1; n < M; n++)
     {
-      x[n] = X = x[n] + real_type ( pole[k] ) * X ;
+      x[n] = X = x[n] + p * X ;
     }
     
     /* anticausal initialization */
@@ -535,7 +583,7 @@ int solve_gain_inlined ( in_iter c , out_iter x )
     /* anticausal recursion */
     for (int n = M - 2; 0 <= n; n--)
     {
-      x[n] = X = real_type ( pole[k] ) * ( X - x[n] );
+      x[n] = X = p * ( X - x[n] );
     }
   }
 }
@@ -548,6 +596,7 @@ int solve_no_gain ( in_iter c , out_iter x )
   assert ( M > 1 ) ;
 
   value_type X ;
+  real_type p = pole[0] ;
   
   // process first pole, consuming the input
   
@@ -557,7 +606,7 @@ int solve_no_gain ( in_iter c , out_iter x )
   /* causal recursion */
   for ( int n = 1; n < M; n++)
   {
-    x[n] = X = c[n] + real_type ( pole[0] ) * X ;
+    x[n] = X = c[n] + p * X ;
   }
   
   /* anticausal initialization */
@@ -566,7 +615,7 @@ int solve_no_gain ( in_iter c , out_iter x )
   /* anticausal recursion */
   for ( int n = M - 2; 0 <= n; n--)
   {
-    x[n] = X = real_type ( pole[0] ) * ( X - x[n]);
+    x[n] = X = p * ( X - x[n]);
   }
   
   // for the remaining poles, if any, work on the result
@@ -574,13 +623,14 @@ int solve_no_gain ( in_iter c , out_iter x )
   
   for ( int k = 1 ; k < npoles; k++)
   {
+    p = pole[k] ;
     /* causal initialization */
     x[0] = X = (this->*_p_icc2)(x, k);
     
     /* causal recursion */
     for (int n = 1; n < M; n++)
     {
-      x[n] = X = x[n] + real_type ( pole[k] ) * X ;
+      x[n] = X = x[n] + p * X ;
     }
     
     /* anticausal initialization */
@@ -589,7 +639,7 @@ int solve_no_gain ( in_iter c , out_iter x )
     /* anticausal recursion */
     for (int n = M - 2; 0 <= n; n--)
     {
-      x[n] = X = real_type ( pole[k] ) * ( X - x[n] );
+      x[n] = X = p * ( X - x[n] );
     }
   }
 }
@@ -623,14 +673,15 @@ int solve_identity ( in_iter c , out_iter x )
 public:
   
 filter ( int _M ,               ///< number of input/output elements (DataLength)
-         int apply_gain ,       ///< power of lambda to apply while processing the first pole
+         double gain ,          ///< gain to apply to the signal to cancel attenuation
          bc_code bc ,           ///< boundary conditions for this filter
          int _npoles ,          ///< number of poles
          const double * _pole , ///< pointer to _npoles doubles holding the filter poles
          double tolerance )     ///< acceptable loss of precision, absolute value
 : M ( _M ) ,
   npoles ( _npoles ) ,
-  pole ( _pole )
+  pole ( _pole ) ,
+  lambda ( gain )
 {
   if ( npoles < 1 )
   {
@@ -653,25 +704,16 @@ filter ( int _M ,               ///< number of input/output elements (DataLength
       horizon.push_back ( M ) ;
   }
 
-  // compute the overall gain of the filter. Simply executing the filtering
-  // code by itself will attenuate the signal. Here we calculate the amount of
-  // this attenuation so that we can cancel this effect.
-
-  lambda = 1.0 ; // if apply_gain is 0, lambda won't be applied at all
-
-  if ( apply_gain ) // if apply_gain is set, it will be used as an *exponent* on lambda
-                    // as to apply a power of lambda when processing the first dimension
+  if ( gain == 1.0 )
   {
-    for (int k = 0; k < npoles; k++)
-      lambda = lambda * (1.0 - pole[k]) * (1.0 - 1.0 / pole[k]);
-    
-    lambda = pow ( lambda , apply_gain ) ;
-
-    _p_solve = & filter_type::solve_gain_inlined ; // multiply input with pow(lambda,apply_gain)
+    // gain == 1.0 has no effect, we can use this solve variant, applying no gain:
+    _p_solve = & filter_type::solve_no_gain ;
   }
   else
   {
-    _p_solve = & filter_type::solve_no_gain ;      // no gain will be applied
+    // if gain isn't 1.0, we use the solve variant which applies it
+    // to the signal as it goes along.
+    _p_solve = & filter_type::solve_gain_inlined ;
   }
 
   // while the forward/backward IIR filter in the solve_... routines is the same for all
@@ -679,13 +721,7 @@ filter ( int _M ,               ///< number of input/output elements (DataLength
   // depends on the boundary conditions and is handled by a call through a method pointer
   // in the solve_... routines. Here we fix these method pointers:
   
-  if ( bc == CONSTANT )
-  {
-    _p_icc1 = & filter_type::icc_constant<in_iter> ;
-    _p_icc2 = & filter_type::icc_constant<out_iter> ;
-    _p_iacc = & filter_type::iacc_constant ;
-  }
-  else if ( bc == MIRROR )
+  if ( bc == MIRROR )
   {     
     _p_icc1 = & filter_type::icc_mirror<in_iter> ;
     _p_icc2 = & filter_type::icc_mirror<out_iter> ;
@@ -709,6 +745,12 @@ filter ( int _M ,               ///< number of input/output elements (DataLength
     _p_icc2 = & filter_type::icc_reflect<out_iter> ;
     _p_iacc = & filter_type::iacc_reflect ;
   }
+  else if ( bc == SAFE_BEYOND )
+  {
+    _p_icc1 = & filter_type::icc_safe_beyond<in_iter> ;
+    _p_icc2 = & filter_type::icc_safe_beyond<out_iter> ;
+    _p_iacc = & filter_type::iacc_safe_beyond ;
+  }
   else if ( bc == ZEROPAD || bc == IGNORE )
   {
     _p_icc1 = & filter_type::icc_identity<in_iter> ;
@@ -719,6 +761,12 @@ filter ( int _M ,               ///< number of input/output elements (DataLength
   {
     _p_solve = & filter_type::solve_identity ;
   }
+//   else if ( bc == GUESS ) // currently unused
+//   {
+//     _p_icc1 = & filter_type::icc_guess<in_iter> ;
+//     _p_icc2 = & filter_type::icc_guess<out_iter> ;
+//     _p_iacc = & filter_type::iacc_guess ;
+//   }
   else
     throw not_supported ( "boundary condition not supported by vspline::filter" ) ;
 }
@@ -785,6 +833,7 @@ template < class source_type ,
 int nonaggregating_filter ( source_type &source ,
                             target_type &target ,
                             int axis ,
+                            double gain ,
                             bc_code bc ,
                             int nbpoles ,
                             const double * pole ,
@@ -793,31 +842,20 @@ int nonaggregating_filter ( source_type &source ,
 {
   const int dim = source_type::actual_dimension ;
   typedef typename source_type::value_type value_type ;
-  typedef typename ExpandElementResult < value_type > :: type ele_type ;
+  typedef typename vigra::ExpandElementResult < value_type > :: type ele_type ;
 
   int count = source.shape ( axis ) ; 
 
   /// we use a buffer of count value_types
 
-  MultiArray < 1 , value_type > buffer ( count ) ;
-
-  int lambda_exponent = 1 ;
-
-// deactivating the code below may produce slightly more precise results
-
-  if ( pow ( nbpoles , dim ) < 32 ) // heuristic. for high degrees, below optimization reduces precision too much
-  {
-    lambda_exponent = 0 ;
-    if ( axis == 0 )
-      lambda_exponent = dim ;
-  }
+  vigra::MultiArray < 1 , value_type > buffer ( count ) ;
 
   // avoiding being specific about the iterator's type allows us to slot in
   // any old iterator we can get by calling begin() on buffer 
   
   typedef decltype ( buffer.begin() ) iter_type ;
   typedef filter < iter_type , iter_type , value_type > filter_type ;
-  filter_type solver ( source.shape(axis) , lambda_exponent , bc , nbpoles , pole , tolerance ) ;
+  filter_type solver ( source.shape(axis) , gain , bc , nbpoles , pole , tolerance ) ;
 
   // offset into the source slice which will be used for gather/scatter
 
@@ -1065,8 +1103,8 @@ template < class source_type ,
 int ele_aggregating_filter ( source_type &source ,
                              target_type &target ,
                              int axis ,
+                             double gain ,
                              bc_code bc ,
-                             int lambda_exponent ,
                              int nbpoles ,
                              const double * pole ,
                              double tolerance
@@ -1100,7 +1138,7 @@ int ele_aggregating_filter ( source_type &source ,
   
   // Vc::Memory < simdized_type > buffer ( count * vsize ) ; // doesn't work for me
 
-  MultiArray < 1 , simdized_type , Vc::Allocator<simdized_type> > buffer ( count ) ;
+  vigra::MultiArray < 1 , simdized_type , Vc::Allocator<simdized_type> > buffer ( count ) ;
 
   // avoiding being specific about the iterator's type allows us to slot in
   // any old iterator we can get by calling begin() on buffer 
@@ -1134,7 +1172,7 @@ int ele_aggregating_filter ( source_type &source ,
   // don't define an element-expansion via vigra::ExpandElementResult for simdized_type.
 
   typedef filter < viter_type , viter_type , ele_type > filter_type ;
-  filter_type solver ( source.shape(axis) , lambda_exponent , bc , nbpoles , pole , tolerance ) ;
+  filter_type solver ( source.shape(axis) , gain , bc , nbpoles , pole , tolerance ) ;
 
   if ( source.stride() == target.stride() )
   {
@@ -1209,7 +1247,6 @@ int ele_aggregating_filter ( source_type &source ,
     auto target_slice = target.bindAt ( axis , 0 ) ;
     int target_stride = target.stride ( axis ) ;
     auto target_sliter = target_slice.begin() ;
-//     auto target_sliter_end = target_slice.end() ;
     index_type target_indexes ;
 
     while ( source_sliter < source_sliter_end )
@@ -1237,6 +1274,7 @@ template < class source_type ,
 int aggregating_filter ( source_type &source ,
                          target_type &target ,
                          int axis ,
+                         double gain ,
                          bc_code bc ,
                          int nbpoles ,
                          const double * pole ,
@@ -1247,20 +1285,9 @@ int aggregating_filter ( source_type &source ,
   typedef typename source_type::value_type value_type ;
   static_assert ( std::is_same < value_type , typename target_type::value_type > :: value ,
     "aggregating_filter: both arrays must have the same value_type" ) ;
-  typedef typename ExpandElementResult < value_type > :: type ele_type ;
-
-  int lambda_exponent = 1 ;
-  
-// deactivating the code below may produce slightly more precise results
+  typedef typename vigra::ExpandElementResult < value_type > :: type ele_type ;
 
-  if ( pow ( nbpoles , dim ) < 32 ) // heuristic. for high degrees, below optimization reduces precision too much
-  {
-    lambda_exponent = 0 ;
-    if ( axis == 0 )
-      lambda_exponent = dim ;
-  }
-  
-  if ( ExpandElementResult < value_type > :: size != 1 )
+  if ( vigra::ExpandElementResult < value_type > :: size != 1 )
   {
     // value_type is an aggregate type, but we want to operate on elementary types
     // so we element-expand the array and call ele_aggregating_filter, which works on
@@ -1268,7 +1295,7 @@ int aggregating_filter ( source_type &source ,
     auto expanded_source = source.expandElements ( 0 ) ;
     auto expanded_target = target.expandElements ( 0 ) ;
     return ele_aggregating_filter ( expanded_source , expanded_target ,
-                                    axis + 1 , bc , lambda_exponent ,
+                                    axis + 1 , gain , bc ,
                                     nbpoles , pole , tolerance ) ;
   }
   else
@@ -1278,12 +1305,12 @@ int aggregating_filter ( source_type &source ,
     // one might expect a direct call to ele_aggregating_filter to do here, but it
     // won't compile, since in ele_aggregating_filter we try to typedef a vector of
     // the contained (aggregate) data type. Hence the acrobatics...
-    MultiArrayView < source.actual_dimension , ele_type >
+    vigra::MultiArrayView < source.actual_dimension , ele_type >
       es ( source.shape() , source.stride() , (ele_type*)(source.data()) ) ;
-    MultiArrayView < target.actual_dimension , ele_type >
+    vigra::MultiArrayView < target.actual_dimension , ele_type >
       et ( target.shape() , target.stride() , (ele_type*)(target.data()) ) ;
     return ele_aggregating_filter ( es , et ,
-                                    axis , bc , lambda_exponent ,
+                                    axis , gain , bc ,
                                     nbpoles , pole , tolerance ) ;
   }
 }
@@ -1293,13 +1320,23 @@ int aggregating_filter ( source_type &source ,
 /// Now that we have the routines which perform the buffering and filtering for a chunk of data,
 /// we add code for multithreading. This is done by using utility code from common.h, namely
 /// the routine 'divide_and_conquer_2', which splits two arrays in the same way and applies
-/// a functor to each chunk in a separate thread:
+/// a functor to each chunk in a separate thread.
+/// filter_1d, the routine processing nD arrays along a specific axis, might as
+/// well be a function. But functions can't be partially specialized (at least not
+/// with my compiler), so I use a functor instead which, as a class, can be. We want
+/// a partial specialization for 1D arrays, where our usual schemes of multithreading
+/// and vectorization don't intrinsically work and we have to employ a different method, see there.
 
 template < typename input_array_type ,      ///< type of array with knot point data
-           typename output_array_type >     ///< type of array for coefficients (may be the same)
-void filter_1d ( input_array_type &input ,   ///< source data. the routine can also operate in-place
+           typename output_array_type ,     ///< type of array for coefficients (may be the same)
+           int dim = input_array_type::actual_dimension >
+class filter_1d
+{
+public:
+  void operator() ( input_array_type &input ,   ///< source data. the routine can also operate in-place
                    output_array_type &output ,  ///< where input == output.
                    int axis ,
+                   double gain ,
                    bc_code bc ,                 ///< boundary treatment for this solver
                    int nbpoles ,
                    const double * pole ,
@@ -1307,9 +1344,8 @@ void filter_1d ( input_array_type &input ,   ///< source data. the routine can a
                    bool use_vc = true ,
                    int nslices = ncores )  ///< number of threads to use
 {
-  const int dim = input_array_type::actual_dimension ;
   typedef typename input_array_type::value_type value_type ;
-  typedef typename MultiArray < 1 , value_type > ::iterator iter_type ;
+  typedef typename vigra::MultiArray < 1 , value_type > ::iterator iter_type ;
 
   using namespace std::placeholders ;
 
@@ -1340,6 +1376,7 @@ void filter_1d ( input_array_type &input ,   ///< source data. the routine can a
                 _1 ,          // placeholders to accept data chunks
                 _2 ,          // from divide_and_conquer
                 axis ,        // axis to process
+                gain ,
                 bc ,
                 nbpoles ,
                 pole ,
@@ -1355,6 +1392,280 @@ void filter_1d ( input_array_type &input ,   ///< source data. the routine can a
            axis ,          // forbid splitting along processing axis
            nslices ) ;     // use nslices threads
 }
+} ;
+
+/// now here's the specialization for 1D arrays.
+
+template < typename input_array_type ,      ///< type of array with knot point data
+           typename output_array_type >     ///< type of array for coefficients (may be the same)
+class filter_1d < input_array_type ,
+                  output_array_type ,
+                  1 >                 // specialize for 1D
+{
+public:
+  void operator() ( input_array_type &input ,   ///< source data. can operate in-place
+                   output_array_type &output ,  ///< where input == output.
+                   int axis ,
+                   double gain ,
+                   bc_code bc ,                 ///< boundary treatment for this solver
+                   int nbpoles ,
+                   const double * pole ,
+                   double tolerance ,
+                   bool use_vc = true ,
+                   int nslices = ncores )  ///< number of threads to use
+{
+  typedef typename input_array_type::value_type value_type ;
+  typedef decltype ( input.begin() ) input_iter_type ;
+  typedef decltype ( output.begin() ) output_iter_type ;
+  typedef vspline::filter < input_iter_type , output_iter_type , double > filter_type ;
+  typedef typename vigra::ExpandElementResult < value_type > :: type ele_type ;
+
+  const int bands = vigra::ExpandElementResult < value_type > :: size ;
+  int runup ;
+
+  // if we can multithread, start out with as many lanes as the desired number of threads
+
+  int lanes = nslices ;
+  
+#ifdef USE_VC
+ 
+  // TODO: this duplicates the definition in ele_aggregating_filter
+  typedef Vc::SimdArray < ele_type ,
+                          2 * Vc::Vector<ele_type>::Size > simdized_type ; ///< vectorized type
+
+  // if we can use vector code, the number of lanes is multiplied by the
+  // number of elements a simdized type inside the vector code can handle
+
+  if ( use_vc )
+  {
+    lanes *= simdized_type::Size ;
+  }
+
+#endif
+
+  std::cout << "have " << lanes << " lanes" << std::endl ;
+  
+  // we give the filter some space to run up to precision
+  // TODO we may want to use the proper horizon values here
+  // TODO and make sure 64 is enough.
+
+  std::cout << "tolerance: " << tolerance << std::endl ;
+  
+  if ( tolerance <= 0.0 )
+  {
+    // we can't use the fake_2d method if the tolerance is 0.0
+    lanes = 1 ;
+  }
+  else
+  {
+    // there are minimum requirements for using the fake 2D filter. First find
+    // the horizon at the given tolerance
+    
+    int horizon = ceil ( log ( tolerance ) / log ( fabs ( pole[0] ) ) ) ;
+    
+    // this is just as much as we want for the filter to run up to precision
+    // starting with BC code 'IGNORE' at the margins
+    
+    runup = horizon ;
+    
+    // the absolute minimum to successfully run the fake 2D filter is this:
+    
+    int min_length = 4 * runup * lanes + 2 * runup ;
+    
+    // input is too short to bother with fake 2D, just single-lane it
+    
+    if ( input.shape(0) < min_length )
+    {
+      std::cout << "input is too short for fake 2D processing with the given tolerance" << std::endl ;
+      lanes = 1 ;
+    }
+    else
+    {
+      // input's larger than the absolute minimum, maybe we can even increase
+      // the number of lanes some more? we'd like to do this if the input is
+      // very large, since we use buffering and don't want the buffers to become
+      // overly large. But the smaller the run along the split x axis, the more
+      // incorrect margin values we have to mend, so we need a compromise.
+      // assume a 'good' length for input: some length where further splitting
+      // would not be wanted anymore. TODO: do some testing, find a good value
+      
+      int good_length = 64 * runup * lanes + 2 * runup ;
+      
+      int split = 1 ;
+      
+      // suppose we split input.shape(0) in ( 2 * split ) parts, is it still larger
+      // than this 'good' length? If not, leave split factor as it is.
+      
+      while ( input.shape(0) / ( 2 * split ) >= good_length )
+      {  
+        // if yes, double split factor, try again
+        split *= 2 ;
+      }
+      
+      lanes *= split ; // increase number of lanes by additional split
+    }
+    
+  }
+  
+  // if there's only one lane we just use this simple code:
+
+  if ( lanes == 1 )
+  {
+    std::cout << "using simple 1D filter" << std::endl ;
+    // this is a simple single-threaded implementation
+    filter_type solver ( input.shape(0) ,
+                         gain ,
+                         bc ,
+                         nbpoles ,
+                         pole ,
+                         0.0 ) ;
+    solver.solve ( input.begin() , output.begin() ) ;
+    return ; // return prematurely, saving us an else clause
+  }
+  
+  // the input qualifies for fake 2D processing.
+
+  std::cout << "input qualifies for fake 2D, using " << lanes << " lanes" << std::endl ;
+  
+  // we want as many chunks as we have lanes. There may be some data left
+  // beyond the chunks (tail_size values of value_type)
+  
+  int core_size = input.shape(0) ;
+  int chunk_size = core_size / lanes ;
+  core_size = lanes * chunk_size ;
+  int tail_size = input.shape(0) - core_size ;
+  
+  std::cout << "tail size: " << tail_size << std::endl ;
+  
+  // just double-check
+
+  assert ( core_size + tail_size == input.shape(0) ) ;
+  
+  // now here's the strategy: we treat the data as if they were 2D. This will
+  // introduce errors along the 'vertical' margins, since there the 2D treatment
+  // will start with some boundary condition along the x axis instead of looking
+  // at the neighbouring line where the actual continuation is.
+  
+  // create buffers for head and tail
+  
+  vigra::MultiArray < 1 , value_type > head ( 2 * runup ) ;
+  vigra::MultiArray < 1 , value_type > tail ( tail_size + 2 * runup ) ;
+  
+  // filter the beginning and end of the signal into these buffers. Note how
+  // we call this filter with the boundary condition passed in
+
+  filter_type head_solver ( head.size() ,
+                            gain ,
+                            bc ,
+                            nbpoles ,
+                            pole ,
+                            0.0 ) ;
+                            
+  head_solver.solve ( input.begin() , head.begin() ) ;
+  
+  filter_type tail_solver ( tail.size() ,
+                            gain ,
+                            bc ,
+                            nbpoles ,
+                            pole ,
+                            0.0 ) ;
+
+  tail_solver.solve ( input.end() - tail.shape(0) , tail.begin() ) ;
+  
+  // head now has runup correct values at the beginning, succeeded by runup invalid
+  // values, and tail has tail_size + runup correct values at the end, preceded by
+  // runup values which aren't usable, which were needed to run the filter
+  // up to precision. We'll use these correct data later to make up for the fact that
+  // the margin treatment omits the beginning and end of the data.
+
+  // now we create a fake 2D view to the margin of the data. Note how we let the
+  // view begin 2 * runup before the end of the first line, capturing the 'wraparound'
+  // right in the middle of the view
+  
+  typedef vigra::MultiArrayView < 2 , value_type > fake_2d_type ;
+  
+  fake_2d_type
+    fake_2d_margin ( vigra::Shape2 ( 4 * runup , lanes - 1 ) ,
+                     vigra::Shape2 ( input.stride(0) , input.stride(0) * chunk_size ) ,
+                     input.data() + chunk_size - 2 * runup ) ;
+ 
+  // again we create a buffer and filter into the buffer
+
+  vigra::MultiArray < 2 , value_type > margin_buffer ( fake_2d_margin.shape() ) ;
+  
+  filter_1d < fake_2d_type , fake_2d_type , 2 > () ( fake_2d_margin ,
+                                                     margin_buffer ,
+                                                     0 ,
+                                                     gain ,
+                                                     IGNORE ,
+                                                     nbpoles ,
+                                                     pole ,
+                                                     tolerance ,
+                                                     0 ,
+                                                     1 ) ;
+ 
+  // now we have filtered data for the margins in margin_buffer, of which the central half
+  // is usable, the remainder being runup data which we'll ignore. Here's a view to the
+  // central half:
+  
+  vigra::MultiArrayView < 2 , value_type > margin
+  = margin_buffer.subarray ( vigra::Shape2 ( runup , 0 ) ,
+                             vigra::Shape2 ( 3 * runup , lanes - 1 ) ) ;
+  
+  // we already create a view to the target array's margin which we intend to overwrite,
+  // but the data will only be copied in from margin after the treatment of the core.
+
+  vigra::MultiArrayView < 2 , value_type >
+    margin_target ( vigra::Shape2 ( 2 * runup , lanes - 1 ) ,
+                    vigra::Shape2 ( output.stride(0) , output.stride(0) * chunk_size ) ,
+                    output.data() + chunk_size - runup ) ;
+                    
+  // next we fake a 2D array from input and filter it to output, this may be an
+  // in-place operation, since we've extracted all margin information earlier and
+  // deposited what we need in buffers
+  
+  fake_2d_type
+    fake_2d_source ( vigra::Shape2 ( chunk_size , lanes ) ,
+                     vigra::Shape2 ( input.stride(0) , input.stride(0) * chunk_size ) ,
+                     input.data() ) ;
+
+  fake_2d_type
+    fake_2d_target ( vigra::Shape2 ( chunk_size , lanes ) ,
+                     vigra::Shape2 ( output.stride(0) , output.stride(0) * chunk_size ) ,
+                     output.data() ) ;
+  
+  // now we filter the fake 2D source to the fake 2D target
+
+  filter_1d < fake_2d_type , fake_2d_type , 2 > () ( fake_2d_source ,
+                                                     fake_2d_target ,
+                                                     0 ,
+                                                     gain ,
+                                                     IGNORE ,
+                                                     nbpoles ,
+                                                     pole ,
+                                                     tolerance ,
+                                                     use_vc ,
+                                                     nslices ) ;
+
+  // we now have filtered data in target, but the stripes along the magin
+  // in x-direction (1 runup wide) are wrong, because we applied IGNORE BC.
+  // this is why we have the data in 'margin', and we now copy them to the
+  // relevant section of 'target'
+               
+  margin_target = margin ;
+  
+  // finally we have to fix the first and last few values, which weren't touched
+  // by the margin operation (due to margin's offset and length)
+  
+  typedef vigra::Shape1 dt ;
+  
+  output.subarray ( dt(0) , dt(runup) )
+    = head.subarray ( dt(0) , dt(runup) ) ;
+
+  output.subarray ( dt(output.size() - tail_size - runup ) , dt(output.size()) )
+    = tail.subarray ( dt(tail.size() - tail_size - runup ) , dt(tail.size()) ) ;
+}
+} ;
 
 /// This routine calls the 1D filtering routine for all axes in turn. This is the
 /// highest-level routine in filter.h, and the only routine used by other code in
@@ -1377,7 +1688,7 @@ template < typename input_array_type ,      // type of array with knot point dat
            typename output_array_type >     // type of array for coefficients (may be the same)
 void filter_nd ( input_array_type & input ,
                  output_array_type & output ,
-                 TinyVector<bc_code,input_array_type::actual_dimension> bc ,
+                 vigra::TinyVector<bc_code,input_array_type::actual_dimension> bc ,
                  int nbpoles ,
                  const double * pole ,
                  double tolerance ,
@@ -1413,20 +1724,54 @@ void filter_nd ( input_array_type & input ,
     throw shape_mismatch ( "input and output array must have the same shape" ) ;
   }
 
+  // normally the gain is the same for all dimensions.
+
+  double gain_d0 = overall_gain ( nbpoles , pole ) ;
+  double gain_dn = gain_d0 ;
+
+  // deactivating the code below may produce slightly more precise results.
+  // This bit of code results in application of the cumulated gain for all
+  // dimensions while processing axis 0, and no gain application for subsequent
+  // axes. This is a heuristic: for high degrees, the optimization below would
+  // reduce precision too much. TODO: its effect seems negligible anyway.
+  
+  if ( dim > 1 && pow ( nbpoles , dim ) < 32 )
+  {
+    gain_d0 = pow ( gain_d0 , dim ) ;
+    gain_dn = 1.0 ;
+  }
+
   // even if degree <= 1, we'll only arrive here if input != output.
   // So we still have to copy the input data to the output (solve_identity)
   
-  filter_1d<input_array_type,output_array_type>
-             ( input , output , 0 , bc[0] , nbpoles , pole , tolerance , use_vc , nslices ) ;
+  filter_1d<input_array_type,output_array_type,dim>() ( input ,
+                                                        output ,
+                                                        0 ,
+                                                        gain_d0 ,
+                                                        bc[0] ,
+                                                        nbpoles ,
+                                                        pole ,
+                                                        tolerance ,
+                                                        use_vc ,
+                                                        nslices ) ;
 
   // but if degree <= 1 we're done already, since copying the data again
   // in dimensions 1... is futile
 
   if ( nbpoles > 0 )
   {
+    // so for the remaining dimensions we also call the filter.
     for ( int d = 1 ; d < dim ; d++ )
-      filter_1d<output_array_type,output_array_type>
-                ( output , output , d , bc[d] , nbpoles , pole , tolerance , use_vc , nslices ) ;
+      filter_1d<output_array_type,output_array_type,dim>() ( output ,
+                                                             output ,
+                                                             d ,
+                                                             gain_dn ,
+                                                             bc[d] ,
+                                                             nbpoles ,
+                                                             pole ,
+                                                             tolerance ,
+                                                             use_vc ,
+                                                             nslices ) ;
   }
 }
 
diff --git a/poles.cc b/poles.cc
index f91715b..4079876 100644
--- a/poles.cc
+++ b/poles.cc
@@ -39,6 +39,8 @@
     gsl and BLAS, which provide only double precision.
 */
 
+#ifndef VSPLINE_POLES_CC
+
 long double K0[] = {
  1L ,   // basis(0)
  } ; 
@@ -658,3 +660,6 @@ const long double* precomputed_basis_function_values[] = {
   K23, 
   K24, 
 } ;
+
+#define VSPLINE_POLES_CC
+#endif
\ No newline at end of file
diff --git a/prefilter.h b/prefilter.h
index 1f8dff3..ccdddd9 100644
--- a/prefilter.h
+++ b/prefilter.h
@@ -121,6 +121,7 @@
 
 #include "common.h"
 #include "filter.h"
+#include "basis.h"
 
 namespace vspline {
 
@@ -159,16 +160,17 @@ void solve_vigra ( input_array_type & input ,
                    output_array_type & output ,
                    TinyVector<bc_code,input_array_type::actual_dimension> bcv ,
                    int degree ,
+                   double tolerance ,
                    int nslices = ncores )
 {
   filter_nd ( input ,
-             output ,
-             bcv ,
-             degree / 2 ,
-             precomputed_poles [ degree ] ,
-             0.0000001 ,
-             false ,
-             nslices ) ;
+              output ,
+              bcv ,
+              degree / 2 ,
+              precomputed_poles [ degree ] ,
+              tolerance ,
+              false ,
+              nslices ) ;
 }
 
 #ifdef USE_VC
@@ -179,16 +181,17 @@ void solve_vc ( input_array_type & input ,
                    output_array_type & output ,
                    TinyVector<bc_code,input_array_type::actual_dimension> bcv ,
                    int degree ,
+                   double tolerance ,
                    int nslices = ncores )
 {
   filter_nd ( input ,
-             output ,
-             bcv ,
-             degree / 2 ,
-             precomputed_poles [ degree ] ,
-             0.0000001 ,
-             true ,
-             nslices ) ;
+              output ,
+              bcv ,
+              degree / 2 ,
+              precomputed_poles [ degree ] ,
+              tolerance ,
+              true ,
+              nslices ) ;
 }
 
 #endif

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/vspline.git



More information about the debian-science-commits mailing list