[vspline] 63/72: changed gather/scatter with extrusion code and its use

Kay F. Jahnke kfj-guest at moszumanska.debian.org
Sun Jul 2 09:02:43 UTC 2017


This is an automated email from the git hooks/post-receive script.

kfj-guest pushed a commit to branch master
in repository vspline.

commit 669e7680d2158da0262e1f3fd0e4e5fa67551706
Author: Kay F. Jahnke <kfjahnke at gmail.com>
Date:   Mon May 15 09:30:54 2017 +0200

    changed gather/scatter with extrusion code and its use
    Given a simdized type ST, I noticed recently that, apart from using
     ST::IndexType (which uses int) for gather/scatter operations, I can
     use any type offering operator[] and holding the required number of
     indexes. Initially I filed this under nice-to-have, but since the code
     I'm working on produces indexes initially as ptrdiff_t
     (which is long on my system), I experimentally coded passing
     vigra::TinyVector<ptrdiff_t,N> as index type, vigra::TinyVector
     being a fixed size container type. To my surprise, the resulting
     code was faster than the code using ST::IndexType, so it looks like
     a win-win situation: I don't need to safeguard against index overflow
     (as unlikely as that may be) and I get faster code.
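
A minimal standalone sketch of the kind of gather/scatter call this refers to - assuming Vc 1.x and vigra are installed; the array contents, index values and variable names are made up for illustration:

#include <Vc/Vc>
#include <vigra/tinyvector.hxx>
#include <cstddef>
#include <cassert>

int main()
{
  enum { vsize = Vc::Vector<float>::Size } ;

  // toy data for illustration
  float data [ 1000 ] ;
  for ( int i = 0 ; i < 1000 ; i++ )
    data[i] = float ( i ) ;

  // any container offering operator[] will do as gather/scatter index type;
  // here we use vigra::TinyVector of ptrdiff_t instead of the int-based IndexType
  vigra::TinyVector < std::ptrdiff_t , vsize > indexes ;
  for ( int e = 0 ; e < vsize ; e++ )
    indexes[e] = e * 3 + 1 ;                      // pick every third element

  Vc::SimdArray < float , vsize > v ( data , indexes ) ; // gathering constructor
  assert ( v[0] == data[1] ) ;

  v += 1.0f ;
  v.scatter ( data , indexes ) ;                  // scatter back to the same places
}

The gathering constructor and the scatter member are the same calls the patch below uses; only the index container differs from Vc's default.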
---
 common.h |  10 ++
 filter.h | 387 ++++++++++++++++++++++++++++++++++++++++++---------------------
 2 files changed, 269 insertions(+), 128 deletions(-)

diff --git a/common.h b/common.h
index f086123..6d65172 100644
--- a/common.h
+++ b/common.h
@@ -158,6 +158,16 @@ struct out_of_bounds
 {
 } ;
 
+/// exception which is thrown when assigning an rvalue which is larger than
+/// what the lvalue can hold
+
+struct numeric_overflow
+: std::invalid_argument
+{
+  numeric_overflow  ( const char * msg )
+  : std::invalid_argument ( msg ) { }  ;
+} ;
+
 /// This enumeration is used for codes connected to boundary conditions. There are
 /// two aspects to boundary conditions: During prefiltering, if the implicit scheme is used,
 /// the initial causal and anticausal coefficients have to be calculated in a way specific to
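
The new exception type is a plain std::invalid_argument subclass. A self-contained sketch of how a caller might use it - the narrow_to_int helper is hypothetical and not part of the patch:

#include <stdexcept>
#include <iostream>
#include <limits>
#include <cstddef>

// copy of the new exception type from common.h, so the sketch is self-contained
struct numeric_overflow
: std::invalid_argument
{
  numeric_overflow ( const char * msg )
  : std::invalid_argument ( msg ) { }
} ;

// hypothetical helper: refuse to narrow a ptrdiff_t offset to int
int narrow_to_int ( std::ptrdiff_t value )
{
  if (    value > std::numeric_limits<int>::max()
       || value < std::numeric_limits<int>::min() )
    throw numeric_overflow ( "ptrdiff_t value does not fit into int" ) ;
  return int ( value ) ;
}

int main()
{
  try
  {
    // a value far too large for int (assumes 64-bit ptrdiff_t)
    narrow_to_int ( std::ptrdiff_t ( 1 ) << 40 ) ;
  }
  catch ( const numeric_overflow & e )
  {
    std::cerr << e.what() << std::endl ;
  }
}
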
diff --git a/filter.h b/filter.h
index 4dbd05d..bd467e0 100644
--- a/filter.h
+++ b/filter.h
@@ -58,18 +58,10 @@
     
     While the initial purpose for the code in this file was, of course, b-spline prefiltering,
     the generalized version I present here can be used for arbitrary filters. There is probably
-    one filter which is most useful in the context of vspline: passing a single positive pole
-    in the range of ] 0 , 1 [ smoothes the signal very efficiently.
+    one other filter which is most useful in the context of vspline: passing a single positive
+    pole in the range of ] 0 , 1 [ smoothes the signal very efficiently.
 */
 
-// TODO: with buffering used throughout, we might make the code more general:
-// input could be of any type convertible to a type the 1D filter can handle,
-// and would be cast to the type the buffer contains when the buffer is filled.
-// after completion of the filter, the data might be submitted to yet another cast
-// on copying them out to the target. With such a scheme, operation on, for example,
-// integral data, would become possible without need for converting the data first to
-// an nD array of real values.
-
 // include common.h for the border condition codes
 
 #include <vector>
@@ -121,7 +113,7 @@ double overall_gain ( const int & nbpoles , const double * const pole )
 /// type of the iterator's value_type. When the value_types are vigra aggregates
 /// (TinyVectors etc.) vigra's ExpandElementResult mechanism will provide, but at times
 /// we may wish to be explicit here, e.g. when iterating over simdized types.
-///   
+
 template < typename in_iter ,   // iterator over the knot point values
            typename out_iter ,  // iterator over the coefficient array
            typename real_type > // type for single real value for calculations
@@ -777,12 +769,16 @@ filter ( int _M ,               ///< number of input/output elements (DataLength
 /// starting at source and depositing compactly at target. scatter performs the reverse
 /// operation. source_type and target_type can be different; on assignment source_type is
 /// simply cast to target_type.
+/// index_type is passed in as a template argument, allowing for wider types than int,
+/// so these routines can also operate on very large areas of memory.
 
-template < typename source_type , typename target_type = source_type >
+template < typename source_type ,
+           typename target_type = source_type ,
+           typename index_type = int >
 void gather ( const source_type* source ,
               target_type* target ,
-              const int & stride ,
-              int count
+              const index_type & stride ,
+              index_type count
             )
 {
   while ( count-- )
@@ -793,11 +789,13 @@ void gather ( const source_type* source ,
   }
 }
 
-template < typename source_type , typename target_type = source_type >
+template < typename source_type ,
+           typename target_type = source_type ,
+           typename index_type = int >
 void scatter ( const source_type* source ,
                target_type* target ,
-               const int & stride ,
-               int count
+               const index_type & stride ,
+               index_type count
              )
 {
   while ( count-- )
@@ -815,6 +813,8 @@ void scatter ( const source_type* source ,
 /// especially with higher-order filters. On my system, I found I broke even, even with only
 /// one pole, so there is no special treatment here for low-order filtering (TODO confirm)
 /// note the use of range_type<T>, which is from multithread.h
+/// we derive the index type for the call to the monadic gather/scatter routines
+/// automatically, so here it comes out as vigra's difference_type_1
 
 template < class source_view_type ,
            class target_view_type ,
@@ -841,7 +841,7 @@ void nonaggregating_filter ( range_type < typename source_view_type::difference_
   const auto source = p_original_source->subarray ( range[0] , range[1] ) ;
   auto target = p_original_target->subarray ( range[0] , range[1] ) ;
   
-  int count = source.shape ( axis ) ; 
+  auto count = source.shape ( axis ) ; 
 
   /// we use a buffer of count value_types
 
@@ -854,13 +854,9 @@ void nonaggregating_filter ( range_type < typename source_view_type::difference_
   typedef filter < iter_type , iter_type , math_type > filter_type ;
   filter_type solver ( count , gain , bc , nbpoles , pole , tolerance ) ;
 
-  // offset into the source slice which will be used for gather/scatter
-
-  int source_index ;
-  
   // next slice is this far away:
 
-  int source_stride = source.stride ( axis ) ;
+  auto source_stride = source.stride ( axis ) ;
 
   auto source_base_adress = source.data() ;
   auto buffer_base_adress = buffer.data() ;
@@ -896,7 +892,7 @@ void nonaggregating_filter ( range_type < typename source_view_type::difference_
       // copy from the array to the buffer with a monadic gather, casting to
       // math_type in the process
       
-      source_index = &(*source_sliter) - source_base_adress ;
+      auto source_index = &(*source_sliter) - source_base_adress ;
       
       gather < source_type , math_type > ( source_base_adress + source_index ,
                                            buffer_base_adress ,
@@ -931,14 +927,13 @@ void nonaggregating_filter ( range_type < typename source_view_type::difference_
     auto source_sliter_end = source_slice.end() ;
 
     auto target_slice = target.bindAt ( axis , 0 ) ;
-    int target_stride = target.stride ( axis ) ;
+    auto target_stride = target.stride ( axis ) ;
     auto target_sliter = target_slice.begin() ;
-    int target_index ;
 
     while ( source_sliter < source_sliter_end )
     {
-      source_index = &(*source_sliter) - source_base_adress ;
-      target_index = &(*target_sliter) - target_base_adress ;
+      auto source_index = &(*source_sliter) - source_base_adress ;
+      auto target_index = &(*target_sliter) - target_base_adress ;
       
       gather < source_type , math_type > ( source_base_adress + source_index ,
                                            buffer_base_adress ,
@@ -971,8 +966,6 @@ void nonaggregating_filter ( range_type < typename source_view_type::difference_
 /// and deposits in target, which is compact.
 /// The scatter routine scatters from source, which points to compact memory,
 /// and deposits in target, which points to strided memory.
-/// The routine is templated by the vector type used for the single gather/scatter
-/// operations, it can be a Vc::Vector or a Vc::SimdArray.
 /// Initially I coded using load/store operations to access the 'non-compact'
 /// memory as well, if the indexes were contiguous, but surprisingly, this was
 /// slower. I like the concise expression with this code - instead of having
@@ -980,28 +973,41 @@ void nonaggregating_filter ( range_type < typename source_view_type::difference_
 /// the modus operandi is determined by the indices and mask passed, which is
 /// relatively cheap as it occurs only once, while the inner loop can just
 /// rip away.
-
-template < typename source_type , typename target_type = source_type >
+/// by default, the type used for gather/scatter indices (gs_indexes_type)
+/// will be what Vc deems appropriate. This comes out as a SIMD type composed
+/// of int, and ought to result in the fastest code on the machine level.
+/// But since the only *requirement* on gather/scatter indices is that they
+/// offer a subscript operator (and hold enough indices), other types can be
+/// used as gs_indexes_type as well. Below I make the distinction and pass in
+/// a TinyVector of ptrdiff_t if int isn't sufficiently large to hold the
+/// intended indices. On my system, this is actually faster.
+
+template < typename source_type ,     // (singular) source type
+           typename target_type ,     // (simdized) target type
+           typename index_type ,      // (singular) index type for stride, count
+           typename gs_indexes_type > // type for gather/scatter indices
 void
-gather ( const typename source_type::value_type * source ,
-         typename target_type::value_type * target ,
-         const typename source_type::IndexType & indexes ,
-         const typename source_type::Mask & mask ,
-         const int & stride ,
-         int count
+gather ( const source_type * source ,
+         target_type * target ,
+         const gs_indexes_type & indexes ,
+         const typename target_type::Mask & mask ,
+         const index_type & stride ,
+         index_type count
        )
 {
-  register source_type x ;
+  // fix the type into which to gather source data
+  enum { vsize = target_type::Size } ;
+  typedef typename Vc::SimdArray < source_type , vsize > simdized_source_type ;
+
   // if the mask is all-true, load the data with an unmasked gather operation
   if ( mask.isFull() )
   {
     while ( count-- )
     {
-      x.gather ( source , indexes ) ;
-      auto tx = target_type ( x ) ;
-      tx.store ( target , Vc::Aligned ) ;
+      simdized_source_type x ( source , indexes ) ;
+      * target = target_type ( x ) ;
       source += stride ;
-      target += target_type::Size ;
+      ++ target ;
     }
   }
   else
@@ -1009,35 +1015,39 @@ gather ( const typename source_type::value_type * source ,
     // if there is a partially filled mask, perform a masked gather operation
     while ( count-- )
     {
-      x.gather ( source , indexes , mask ) ;
+      simdized_source_type x ( source , indexes , mask ) ;
+      * target = target_type ( x ) ;
       source += stride ;
-      auto tx = target_type ( x ) ;
-      tx.store ( target , Vc::Aligned ) ;
-      target += target_type::Size ;
+      ++ target ;
     }
   }
 }
 
-template < typename source_type , typename target_type = source_type >
+template < typename source_type ,     // (simdized) source type
+           typename target_type ,     // (singular) target type
+           typename index_type ,      // (singular) index type for stride, count
+           typename gs_indexes_type > // type for gather/scatter indices
 void
-scatter ( const typename source_type::value_type * source ,
-          typename target_type::value_type * target ,
-          const typename target_type::IndexType & indexes ,
-          const typename target_type::Mask & mask ,
-          const int & stride ,
-          int count
+scatter ( const source_type * source ,
+          target_type * target ,
+          const gs_indexes_type & indexes ,
+          const typename source_type::Mask & mask ,
+          const index_type & stride ,
+          index_type count
         )
 {
-  register source_type x ;
-  // if the mask is all-true, store the data with an unmasked scatter operation
+  // fix the type from which to scatter target data
+  enum { vsize = source_type::Size } ;
+  typedef typename Vc::SimdArray < target_type , vsize > simdized_target_type ;
+
+  // if the mask is full, deposit with an unmasked scatter
   if ( mask.isFull() )
   {
     while ( count-- )
     {
-      x.load ( source , Vc::Aligned ) ;
-      auto tx = target_type ( x ) ;
-      tx.scatter ( target , indexes ) ;
-      source += source_type::Size ;
+      simdized_target_type x ( *source ) ;
+      x.scatter ( target , indexes ) ;
+      ++ source ;
       target += stride ;
     }
   }
@@ -1046,10 +1056,9 @@ scatter ( const typename source_type::value_type * source ,
     // if there is a partially filled mask, perform a masked scatter operation
     while ( count-- )
     {
-      x.load ( source , Vc::Aligned ) ;
-      source += source_type::Size ;
-      auto tx = target_type ( x ) ;
-      tx.scatter ( target , indexes , mask ) ;
+      simdized_target_type x ( *source ) ;
+      x.scatter ( target , indexes , mask ) ;
+      ++ source ;
       target += stride ;
     }
   }
@@ -1065,8 +1074,20 @@ scatter ( const typename source_type::value_type * source ,
 /// comparison. Depending on data type, array size and spline degree, sometimes the
 /// nonvectorized code is faster. But as both grow, buffering soon comes out on top.
 /// ele_aggregating_filter is a subroutine processing arrays of elementary value_type.
-/// It's used by aggregating_filter, either directly if the arrays' value_type is already
-/// an elementary type, or after element-expanding the array(s).
+/// It's used by aggregating_filter, after element-expanding the array(s).
+/// With this vectorized routine and the gather/scatter index type Vc uses by default,
+/// numeric overflow could occur: that index type is composed of int, while it is
+/// assigned a ptrdiff_t, which it may not be able to represent. The overflow can
+/// happen when a gather/scatter spans too large a memory area. The gather/scatter
+/// indices are set up so that the first index is always 0 (by using the address of
+/// the first element accessed, not the array's base address); this makes the
+/// overflow less likely, but it can still happen. In this case the code falls back
+/// to using a vigra::TinyVector < ptrdiff_t > as gather/scatter index type, which
+/// may cause Vc to use less performant code for the gather/scatter operations but
+/// is safe.
+/// TODO: On my system, surprisingly, the variant passing TinyVector<ptrdiff_t> to
+/// the gather/scatter routine performs slightly faster than the presumably optimal
+/// variant.
 
 template < typename source_view_type ,
            typename target_view_type ,
@@ -1085,7 +1106,8 @@ void ele_aggregating_filter ( source_view_type &source ,
   // which are used as simdized type in evaluation
 
   typedef typename Vc::Vector < math_type > simdized_math_type ;
-  const int vsize ( simdized_math_type::Size ) ;              // number of math_type in a simdized_type
+       // number of math_type in a simdized_math_type
+  const int vsize ( simdized_math_type::Size ) ;
   
   typedef typename source_view_type::value_type source_type ;
   typedef typename Vc::SimdArray < source_type , vsize > simdized_source_type ;
@@ -1093,18 +1115,24 @@ void ele_aggregating_filter ( source_view_type &source ,
   typedef typename target_view_type::value_type target_type ;
   typedef typename Vc::SimdArray < target_type , vsize > simdized_target_type ;
   
-  typedef typename simdized_math_type::IndexType index_type ; // indexes for gather/scatter
-  typedef typename simdized_math_type::MaskType mask_type ;   // and mask for masked operation
+  // indexes for gather/scatter. first the 'optimal' type, which Vc produces as
+  // the IndexType for simdized_math_type. Next a wider type composed of std::ptrdiff_t,
+  // to be used initially when calculating the indices, and optionally later for the
+  // actual gather/scatter operations if gs_indexes_type isn't wide enough.
+
+  typedef typename simdized_math_type::IndexType gs_indexes_type ;
+  typedef vigra::TinyVector < std::ptrdiff_t , vsize > comb_type ;
+  
+  // mask type for masked operation
+  typedef typename simdized_math_type::MaskType mask_type ;
   
-  int count = source.shape ( axis ) ;                    // number of vectors we'll process
+  auto count = source.shape ( axis ) ; // number of vectors we'll process
 
   // I initially tried to use Vc::Memory, but the semantics of the iterator obtained
   // by buffer.begin() didn't work for me.
   // anyway, a MultiArray with the proper allocator works just fine, and the dereferencing
   // of the iterator needed in the solver works without further ado. 
   
-  // Vc::Memory < simdized_type > buffer ( count * vsize ) ; // doesn't work for me
-
   vigra::MultiArray < 1 , simdized_math_type , Vc::Allocator<simdized_math_type> >
     buffer ( count ) ;
 
@@ -1115,7 +1143,7 @@ void ele_aggregating_filter ( source_view_type &source ,
 
   // set of offsets into the source slice which will be used for gather/scatter
 
-  index_type source_indexes ;
+  comb_type source_indexes ;
   
   // while we don't hit the last odd few 1D subarrays the mask is all-true
 
@@ -1123,7 +1151,7 @@ void ele_aggregating_filter ( source_view_type &source ,
   
   // next slice is this far away:
 
-  int source_stride = source.stride ( axis ) ;
+  auto source_stride = source.stride ( axis ) ;
 
   // we want to use the extended gather/scatter (with 'extrusion'), so we need the
   // source and target pointers. Casting buffer's data pointer to math_type is safe,
@@ -1131,9 +1159,12 @@ void ele_aggregating_filter ( source_view_type &source ,
   // in disguise.
 
   auto source_base_adress = source.data() ;
-  auto buffer_base_adress = (math_type*) (buffer.data()) ;
+  auto buffer_base_adress = buffer.data() ;
   auto target_base_adress = target.data() ;
 
+  gs_indexes_type source_gs_indexes ;
+  gs_indexes_type target_gs_indexes ;      
+
   // we create a solver object capable of handling the iterator producing the successive
   // simdized_types from the buffer. While the unvectorized code can omit passing the third
   // template argument (the elementary type used inside the solver) we pass it here, as we
@@ -1166,16 +1197,23 @@ void ele_aggregating_filter ( source_view_type &source ,
 
     auto source_sliter = permuted_slice.begin() ;
     auto source_sliter_end = permuted_slice.end() ;
-
+    
     while ( source_sliter < source_sliter_end )
     {
-      // try loading vsize successive offsets into an index_type
+      // try loading vsize successive offsets into a comb_type
       
       int e ;
       
+      // we rebase the addresses so that the first entry in source_indexes
+      // will come out as 0.
+  
+      auto first_source_adress = &(*source_sliter) ;
+      auto offset = first_source_adress - source_base_adress ;
+      auto first_target_adress = target_base_adress + offset ;
+      
       for ( e = 0 ; e < vsize && source_sliter < source_sliter_end ; ++e , ++source_sliter )
         
-        source_indexes[e] = &(*source_sliter) - source_base_adress ;
+        source_indexes[e] = &(*source_sliter) - first_source_adress ;
       
       if ( e < vsize )
         
@@ -1183,39 +1221,91 @@ void ele_aggregating_filter ( source_view_type &source ,
         // mask was all-true before, so now we limit it to the first e fields:
         
         mask = ( simdized_math_type::IndexesFromZero() < e ) ;
+
+// The next bit of code, which is commented out, tried to use an 'optimal' type
+// for gather/scatter indices by picking simdized_math_type::IndexType. But it
+// turned out that the resulting code was slower on my system and requires testing
+// for overflow. So I leave the code in, but I'm not using it:
       
-      // perform extended gather with extrusion parameters to transport the unfiltered data
-      // to the buffer
-      
-      gather < simdized_source_type , simdized_math_type >
-        ( source_base_adress ,
-          buffer_base_adress ,
-          source_indexes ,
-          mask ,
-          source_stride ,
-          count ) ;
-                              
-      // finally (puh): apply the prefilter, using the solver in-place, iterating over
-      // the vectors in buffer with maximum efficiency.
-                              
-      solver.solve ( buffer.begin() ) ;
-      
-      // and perform extended scatter with extrusion parameters to write the filtered data
-      // to the destination
-
-      scatter < simdized_math_type , simdized_target_type >
-        ( buffer_base_adress ,
-          target_base_adress ,
-          source_indexes ,
-          mask ,
-          source_stride ,
-          count ) ;
+//       // next we assign the indices (which are ptrdiff_t) to the intended type
+//       // for gather/scatter indices - which is what Vc deems appropriate. This should
+//       // be the optimal choice in terms of performance. Yet we can't be certain that
+//       // the ptrdiff_t values actually fit into this type, which is usually composed of
+//       // int only. So we test if the assigned value compares equal to the assignee.
+//       // If the test fails for any of the indices, we switch to code using a
+//       // vigra::TinyVector < ptrdiff_t > for the indices, which is permissible, since
+//       // TinyVector offers operator[], but may be less efficient. On my system I could
+//       // not detect a run-time penalty, but I leave the test and code differentiation
+//       // in nevertheless.
+//       
+//       bool fits = true ;
+//       for ( e = 0 ; e < vsize ; e++ )
+//       {
+//         source_gs_indexes[e] = source_indexes[e] ;
+//         if ( source_gs_indexes[e] != source_indexes[e] )
+//           fits = false ;
+//       }
+//       
+//       if ( fits )
+//       {
+//         // perform extended gather with extrusion parameters to transport the unfiltered data
+//         // to the buffer, passing in source_gs_indexes for best performance.
+//         
+//         gather
+//           ( first_source_adress ,
+//             buffer_base_adress ,
+//             source_gs_indexes ,
+//             mask ,
+//             source_stride ,
+//             count ) ;
+//                                 
+//         // finally (puh): apply the prefilter, using the solver in-place, iterating over
+//         // the vectors in buffer with maximum efficiency.
+//                                 
+//         solver.solve ( buffer.begin() ) ;
+//         
+//         // and perform extended scatter with extrusion parameters to write the filtered data
+//         // to the destination
+// 
+//         scatter
+//           ( buffer_base_adress ,
+//             first_target_adress ,
+//             source_gs_indexes ,
+//             mask ,
+//             source_stride ,
+//             count ) ;
+//       }
+//       else
+      {
+        // Since the indices did not fit into the optimal type for gather/scatter
+        // indices, we pass in a wider type, which may reduce performance, but is
+        // necessary under the circumstances. But this should rarely happen:
+        // it would mean a gather/scatter spanning several GB.
+        
+        gather
+          ( first_source_adress ,
+            buffer_base_adress ,
+            source_indexes ,
+            mask ,
+            source_stride ,
+            count ) ;
+                                
+        solver.solve ( buffer.begin() ) ;
+        
+        scatter
+          ( buffer_base_adress ,
+            first_target_adress ,
+            source_indexes ,
+            mask ,
+            source_stride ,
+            count ) ;
+      }
     }
   }
   else
   {
-    // pretty much the same as the previouse operation, with the distinction that
-    // copying the filtered data from the buffer to the target now needs it's own set of
+    // pretty much the same as the if(...) case, with the distinction that copying
+    // the filtered data from the buffer to the target now needs its own set of
     // indexes etc., since all these may be different.
 
     // TODO we might permute source_slice's strides to ascending and apply the same
@@ -1226,35 +1316,77 @@ void ele_aggregating_filter ( source_view_type &source ,
     auto source_sliter_end = source_slice.end() ;
 
     auto target_slice = target.bindAt ( axis , 0 ) ;
-    int target_stride = target.stride ( axis ) ;
+    auto target_stride = target.stride ( axis ) ;
     auto target_sliter = target_slice.begin() ;
-    index_type target_indexes ;
+    comb_type target_indexes ;
 
     while ( source_sliter < source_sliter_end )
     {
       int e ;
-      for ( e = 0 ; e < vsize && source_sliter < source_sliter_end ; ++e , ++source_sliter , ++target_sliter )
+      auto first_source_adress = &(*source_sliter) ;
+      auto first_target_adress = &(*target_sliter) ;
+      
+      for ( e = 0 ;
+           e < vsize && source_sliter < source_sliter_end ;
+           ++e , ++source_sliter , ++target_sliter )
       {
-        source_indexes[e] = &(*source_sliter) - source_base_adress ;
-        target_indexes[e] = &(*target_sliter) - target_base_adress ;
+        source_indexes[e] = &(*source_sliter) - first_source_adress ;
+        target_indexes[e] = &(*target_sliter) - first_target_adress ;
       }
       if ( e < vsize )
         mask = ( simdized_math_type::IndexesFromZero() < e ) ;
-      gather < simdized_source_type , simdized_math_type >
-        ( source_base_adress ,
-          buffer_base_adress ,
-          source_indexes ,
-          mask ,
-          source_stride ,
-          count ) ;
-      solver.solve ( buffer.begin() ) ;
-      scatter < simdized_math_type , simdized_target_type >
-        ( buffer_base_adress ,
-          target_base_adress ,
-          target_indexes ,
-          mask ,
-          target_stride ,
-          count ) ;
+      
+// The next bit of code, which is commented out, tried to use an 'optimal' type
+// for gather/scatter indices by picking simdized_math_type::IndexType. But it
+// turned out that the resulting code was slower on my system and requires testing
+// for overflow. So I leave the code in, but I'm not using it:
+      
+//       bool fits = true ;
+//       for ( e = 0 ; e < vsize ; e++ )
+//       {
+//         source_gs_indexes[e] = source_indexes[e] ;
+//         target_gs_indexes[e] = target_indexes[e] ;
+//         if (    source_gs_indexes[e] != source_indexes[e]
+//              || target_gs_indexes[e] != target_indexes[e] )
+//           fits = false ;
+//       }
+// 
+//       if ( fits )
+//       {
+//         gather
+//           ( first_source_adress ,
+//             buffer_base_adress ,
+//             source_gs_indexes ,
+//             mask ,
+//             source_stride ,
+//             count ) ;
+//         solver.solve ( buffer.begin() ) ;
+//         scatter
+//           ( buffer_base_adress ,
+//             first_target_adress ,
+//             target_gs_indexes ,
+//             mask ,
+//             target_stride ,
+//             count ) ;
+//       }
+//       else
+      {
+        gather
+          ( first_source_adress ,
+            buffer_base_adress ,
+            source_indexes ,
+            mask ,
+            source_stride ,
+            count ) ;
+        solver.solve ( buffer.begin() ) ;
+        scatter
+          ( buffer_base_adress ,
+            first_target_adress ,
+            target_indexes ,
+            mask ,
+            target_stride ,
+            count ) ;
+      }
     }
   }
 }
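
To put the 'several GB' remark in the comments above into numbers: an int-based gather/scatter index can address at most numeric_limits<int>::max() elements relative to the first element accessed, so the span a single gather/scatter can cover depends on the element size. A quick back-of-the-envelope check, not part of the patch:

#include <iostream>
#include <limits>
#include <cstddef>

int main()
{
  // largest element offset an int-based gather/scatter index can hold
  const double max_offset = std::numeric_limits<int>::max() ;

  // the corresponding memory span depends on the element size
  std::cout << "float:  " << max_offset * sizeof ( float )  / ( 1 << 30 ) << " GiB\n" ;
  std::cout << "double: " << max_offset * sizeof ( double ) / ( 1 << 30 ) << " GiB\n" ;
}

For float data this comes out at roughly 8 GiB, for double roughly 16 GiB - beyond that, the wide TinyVector<ptrdiff_t> indices are the only safe choice.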
@@ -1369,8 +1501,7 @@ public:
   
   // obtain a partitioning of the data array into subranges. We do this 'manually' here
   // because we must instruct shape_splitter not to chop up the current processing axis
-  // (by passing the 'forbid' parameter, which is taken as -1 (no effect) in the default
-  // partitioning of nD arrays).
+  // (by passing axis as the 'forbid' parameter)
 
   auto partitioning = shape_splitter<dim>::part ( input.shape() , njobs , axis ) ;
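
Finally, the 'monadic' (non-vectorized) gather/scatter near the top of filter.h reduces to a strided copy with a cast. Below is a standalone sketch of that pattern, showing how the new index_type template argument admits ptrdiff_t for very large arrays; the image/column names and sizes are made up for illustration:

#include <vector>
#include <cstddef>
#include <cassert>

// same pattern as the monadic gather in filter.h: fetch count elements with a
// given stride, cast to the target type, and deposit them compactly
template < typename source_type ,
           typename target_type = source_type ,
           typename index_type = int >
void gather ( const source_type * source ,
              target_type * target ,
              const index_type & stride ,
              index_type count )
{
  while ( count-- )
  {
    * target = target_type ( * source ) ;
    ++ target ;
    source += stride ;
  }
}

int main()
{
  // a toy 100 x 100 'image' of float, stored row-major
  std::vector < float > image ( 100 * 100 , 1.0f ) ;
  std::vector < double > column ( 100 ) ;

  // gather one column (stride 100) into a compact buffer, widening to double;
  // with index_type = std::ptrdiff_t the same code also covers very large arrays
  gather < float , double , std::ptrdiff_t >
    ( image.data() , column.data() , 100 , 100 ) ;

  assert ( column[42] == 1.0 ) ;
}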
   

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/vspline.git


