[clfft] 64/74: bringing doc changes by Pradeep back in

Thu Jan 14 19:52:19 UTC 2016

This is an automated email from the git hooks/post-receive script.

ghisvail-guest pushed a commit to branch debian/sid
in repository clfft.

commit e51836ea313edf9a93bc7b416cefc9fa1b1ee701
Author: bragadeesh <bragadeesh.natarajan at amd>
Date:   Mon Jan 11 15:39:02 2016 -0800

    bringing doc changes by Pradeep back in
---
 src/library/mainpage.h | 347 +++++++++++++++----------------------------------
 1 file changed, 107 insertions(+), 240 deletions(-)

diff --git a/src/library/mainpage.h b/src/library/mainpage.h
index d74638e..bf55b90 100644
--- a/src/library/mainpage.h
+++ b/src/library/mainpage.h
@@ -16,9 +16,8 @@
 
 /*! @file mainpage.h
 
-This file contains all documentation, no code, in the form of comment text.  It's purpose is to provide
-chapter 1 of the documentation we produce with doxygen.  This included the title page, installation instructions
-and prose on the nature of FFT's and their use in our library.
+This file contains all documentation, no code, in the form of comment text.  It provides
+chapter 1 of the documentation that is produced with doxygen. Chapter 1 includes the title page, installation instructions, and prose on the nature of FFT and their use in our library.
 
 @mainpage OpenCL Fast Fourier Transforms (FFTs)
 
@@ -27,63 +26,44 @@ The clFFT library is an OpenCL library implementation of discrete Fast Fourier T
 @li works on CPU or GPU backends.
 @li supports in-place or out-of-place transforms.
 @li supports 1D, 2D, and 3D transforms with a batch size that can be greater than or equal to 1.
- at li supports planar (real and complex components are stored in separate arrays) and interleaved (real and complex
-components are stored as a pair in the same array) formats.
+ at li supports planar (real and complex components are stored in separate arrays) and interleaved (real and complex components are stored as a pair in the same array) formats.
 @li supports lengths that are any combination of powers of 2, 3, 5, and 7.
 @li supports single and double precision floating point formats.
 
 
 @section IntroFFT Introduction to clFFT
 
-The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the FFT
-definition to reduce the mathematical intensity required from O(\f$N^2\f$) to O(\f$ N \log N\f$) when the
-sequence length, *N*, is the product of small prime factors.  Currently, there is no standard API for FFT
-routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems:
-no two vendors employ the same interfaces for their FFT routines. clFFT provides a set of FFT routines that
-are optimized for AMD graphics processors, and that are also functional across CPU and other compute devices.
+The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the FFT definition to reduce the mathematical intensity required from O(\f$N^2\f$) to O(\f$ N \log N\f$) when the sequence length, *N*, is the product of small prime factors.  Currently, there is no standard API for FFT routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems: no two vendors employ the same interfaces for their FFT routines. cl [...]
 
 @subsection SupportRadix Supported radices
-clFFT supports transform sizes that are powers of 2, 3, 5, and 7. This means that the vector lengths that can be
-configured can be of any length that is a combination of powers of two, three, five, and seven; examples include \f$2^7, 2^1*3^1, 3^2*5^4, 2^2*3^3*5^5\f$,
-up to the limit that the device can support.
+clFFT supports transform sizes that are powers of 2, 3, 5, and 7. This means that the vector lengths can be  a combination of powers of two, three, five, and seven; examples include \f$2^7, 2^1*3^1, 3^2*5^4, 2^2*3^3*5^5\f$, up to the limit that the device can support.
 
 @subsection SizeLimit Transform size limits
-Currently, there is an upper bound on the transform size that the library can support for certain transforms. This
-limit is \f$2^{24}\f$ for real 1D single precision and \f$2^{22}\f$ for real 1D double precision.
+Currently, there is an upper bound on the transform size that the library can support for certain transforms. This limit is \f$2^{24}\f$ for real 1D single precision and \f$2^{22}\f$ for real 1D double precision.
 
 @subsection EnumDim Dimensionality
 clFFT currently supports FFTs of (up to) three dimensions, given by the enum @ref clfftDim. This enum
-is a required parameter of @ref clfftCreateDefaultPlan() to create an initial plan, where a plan is the collection of
-(almost) all the parameters needed to specify an FFT computation. For more information about clFFT plans, see the section \ref clFFTPlans.
+is a required parameter of @ref clfftCreateDefaultPlan() to create an initial plan, where a plan is the collection of (almost) all the parameters needed to specify an FFT computation. For more information about clFFT plans, see the section \ref clFFTPlans.
 Depending on the dimensionality that the client requests, clFFT uses the following formulations to compute the DFT:
 
 @li For a 1D complex DFT
 \f[
 {\tilde{x}}_j = {{1}\over{scale}}\sum_{k=0}^{n-1}x_k\exp\left({\pm i}{{2\pi jk}\over{n}}\right)\hbox{ for } j=0,1,\ldots,n-1
 \f]
-where, \f$x_k\f$ are the complex data to be transformed, \f$\tilde{x}_j\f$ are the transformed data, and the sign \f$\pm\f$
-determines the direction of the transform: \f$-\f$ for forward and \f$+\f$ for backward. Note that
-you must provide the scaling factor.  By default, the scale is set to 1 for forward transforms, and
-\f${{1}\over{N}}\f$ for backward transforms, where *N* is the size of transform.
+where, \f$x_k\f$ are the complex data to be transformed, \f$\tilde{x}_j\f$ are the transformed data, and the sign \f$\pm\f$ determines the direction of the transform: \f$-\f$ for forward and \f$+\f$ for backward. Note that you must provide the scaling factor.  By default, the scale is set to 1 for forward transforms, and \f${{1}\over{N}}\f$ for backward transforms, where *N* is the size of transform.
 
 @li For a 2D complex DFT
 \f[
 {\tilde{x}}_{jk} = {{1}\over{scale}}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1}x_{rq}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)
 \f]
-for \f$j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\f$, where, \f$x_{rq}\f$ are the complex data to be transformed,
-\f$\tilde{x}_{jk}\f$ are the transformed data, and the sign \f$\pm\f$ determines the direction of the
-transform.  By default, the scale is set to 1 for forwards transforms and \f${{1}\over{MN}}\f$ for backwards transforms,
-where *M* and *N* are the 2D size of the transform.
+for \f$j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\f$, where, \f$x_{rq}\f$ are the complex data to be transformed, \f$\tilde{x}_{jk}\f$ are the transformed data, and the sign \f$\pm\f$ determines the direction of the transform.  By default, the scale is set to 1 for forwards transforms and \f${{1}\over{MN}}\f$ for backwards transforms, where *M* and *N* are the 2D size of the transform.
 
 @li For a 3D complex DFT
 \f[
 \tilde{x}_{jkl} = {{1}\over{scale}}\sum_{s=0}^{p-1}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1}
 x_{rqs}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)\exp\left({\pm i}{{2\pi ls}\over{p}}\right)
 \f]
-for \f$j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\hbox{ and } l=0,1,\ldots,p-1\f$, where \f$x_{rqs}\f$ are the complex data
-to be transformed, \f$\tilde{x}_{jkl}\f$ are the transformed data, and the sign \f$\pm\f$ determines the direction of the
-transform. By default, the scale is set to 1 for forwards transforms and \f${{1}\over{MNP}}\f$ for backwards transforms,
-where *M*, *N*, and *P* are the 3D size of the transform.
+for \f$j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\hbox{ and } l=0,1,\ldots,p-1\f$, where \f$x_{rqs}\f$ are the complex data to be transformed, \f$\tilde{x}_{jkl}\f$ are the transformed data, and the sign \f$\pm\f$ determines the direction of the transform. By default, the scale is set to 1 for forward transforms and \f${{1}\over{MNP}}\f$ for backward transforms, where *M*, *N*, and *P* are the 3D size of the transform.
 
 @subsection InitLibrary Setup and Teardown of clFFT
 clFFT is initialized by the API @ref clfftSetup(), which must be called before any other API of
@@ -91,14 +71,11 @@ clFFT. This allows the library to create resources needed to manage the plans th
 destroy. This API also takes a structure @ref clfftInitSetupData() that is initialized by the
 client to control the behavior of the library.
 
-After you use the library, the @ref clfftTeardown() method must be called. This function instructs clFFT to release all resources allocated
-internally, and resets acquired references to any OpenCL objects.
+After you use the library, the @ref clfftTeardown() method must be called. This function instructs clFFT to release all resources allocated internally, and resets acquired references to any OpenCL objects.
 
 @subsection ThreadSafety Thread safety
 The clFFT API is designed to be thread-safe. It is safe to create plans from multiple threads and to
-destroy those plans in separate threads. Multiple threads can call @ref clfftEnqueueTransform() to place work
-in a command queue at the same time. clFFT does not provide a single-threaded version of the library.
-The overhead of the synchronization mechanisms inside a clFFT thread safe is expected to be minor.
+destroy those plans in separate threads. Multiple threads can call @ref clfftEnqueueTransform() to place work in a command queue at the same time. clFFT does not provide a single-threaded version of the library. The overhead of the synchronization mechanisms inside a clFFT thread-safe is expected to be minor.
 
 Currently, you must manage the multi-device operation. You can create OpenCL contexts that are
 associated with multiple devices, but clFFT only uses a single device from that context to transform
@@ -108,38 +85,24 @@ across multiple devices and contexts.
 
 @subsection MajorFormat Row major formats
 clFFT expects all multi-dimensional input passed to it to be in row-major format. This is compatible
-with C-based languages. However, clFFT is very flexible in the organization of input and output data, and it
-accepts input data by letting you specify a stride for each dimension. This feature can be used to process
-data in column major arrays and other non-contiguous data formats. See @ref clfftSetPlanInStride() and
+with C-based languages. However, clFFT is very flexible in the organization of the input and output data, and it accepts input data by letting you specify a stride for each dimension. This feature can be used to process data in column major arrays and other non-contiguous data formats. See @ref clfftSetPlanInStride() and 
 @ref clfftSetPlanOutStride().
 
 @subsection Object OpenCL object creation
-Your application must allocate and manage OpenCL objects, such as contexts,  *cl_mem* buffers and command queues.
-All the clFFT interfaces that interact with OpenCL objects take those objects as references through the API.
-Specifically, the plan creation function @ref clfftCreateDefaultPlan() takes an OpenCL context as a parameter
-reference, increments the reference count on that object, and keeps the object alive until the corresponding plan
-is destroyed by a call to @ref clfftDestroyPlan().
+Your application must allocate and manage OpenCL objects, such as contexts,  *cl_mem* buffers and command queues. All the clFFT interfaces that interact with OpenCL objects take those objects as references through the API.
+Specifically, the plan creation function @ref clfftCreateDefaultPlan() takes an OpenCL context as a parameter reference, increments the reference count on that object, and keeps the object alive until the corresponding plan is destroyed by the call @ref clfftDestroyPlan().
 
 @subsection FlushQueue Flushing of command queues
-The clFFT API operates asynchronously; with the exception of thread safety locking with multiple
+The clFFT API operates asynchronously; with the exception of thread safety locking with multiple 
 threads, all APIs return immediately. Specifically, the @ref clfftEnqueueTransform() API does not
-explicitly flush the command queues that are passed by reference to it. It pushes the transform work onto the
-command queues and returns the modified queues to the client. The client is free to issue its own blocking
-logic using OpenCL synchronization mechanisms or push further work onto the queue to continue processing.
+explicitly flush the command queues that are passed by reference to it. It pushes the transform work onto the command queues and returns the modified queues to the client. The client is free to issue its own blocking logic by using OpenCL synchronization mechanisms or push further work onto the queue to continue processing.
 
 @subsection EnvVariables Environment variables
-The clFFT library looks for the definition of 2 environment varibles. One is CLFFT_CACHE_PATH. If this
-variable is defined, then the library caches OpenCL binaries. This will enable a subsequent application run
-of the same type of transforms to avoid going through the expensive compile step. Instead, the stored
-binaries are loaded and executed. The CLFFT_CACHE_PATH must point to a folder location where the
-library can store binaries. The other environment variable is CLFFT_REQUEST_LIB_NOMEMALLOC. This
-variable when defined asks the library to do all computations in-place and avoid allocating extra
-device memory whenever possible. This feature is experimental and currently works only for certain types
-of transforms, and where the input can be decomposed into square matrices by the library.
-
+The clFFT library looks for definition of two environment variables: CLFFT_CACHE_PATH and CLFFT_REQUEST_LIB_NOMEMALLOC. 
+If the variable CLFFT_CACHE_PATH is defined, the library caches OpenCL binaries. This enables a subsequent run of the application with the same type of transforms to avoid the expensive compilation step. Instead, the stored binaries are loaded and executed. The CLFFT_CACHE_PATH must point to a folder location where the library can store binaries. 
+The other variable CLFFT_REQUEST_LIB_NOMEMALLOC when defined, requests the library to do all computations in-place and avoid allocating extra device memory whenever possible. This feature is experimental and currently works only for certain types of transforms  when the library decomposes the input into square matrices or rectangular matrices with dimensions in the ratio 1:2. Currently, it works for 1D complex transforms of size of powers of 2 and even powers of 3, 5, and 7. 
 
 @section clFFTPlans clFFT plans
-
 A plan is the collection of (almost) all the parameters needed to specify an FFT computation.
 A clFFT plan includes the following parameters:
 <ul>
@@ -208,91 +171,31 @@ real data. See the section \ref RealFFT.
 </ol>
 
 @subsubsection DistanceStridesandPitches Strides and Distances
-For one-dimensional data, if clStrides[0] = strideX = 1, successive elements in the first dimension are stored contiguously
-in memory. If strideX is an integral value greater than 1, gaps in memory exist between each element of
-the vectors.
-
-For multi-dimensional data, if clStrides[1] = strideY = LenX for 2 dimensional data and clStrides[2] = strideZ
-= LenX*LenY for 3 dimensional data, no gaps exist in memory between each element, and all vectors are
-stored tightly packed in memory. Here, LenX, LenY, and LenZ denote the transform lengths clLengths[0],
-clLengths[1], and clLengths[2], respectively, which are used to set up the plan.
-
-By specifying non-default strides, it is possible to process either
-row-major or column-major arrays. Data can be extracted from arrays of structures. Almost any regular
-data storage pattern can be accommodated.
-
-Distance is the amount of memory that exists between corresponding elements
-in an FFT primitive in a batch. Distance is measured in units of the FFT
-primitive; complex data measures in complex units, and real data measures in
-real units. Stride between tightly packed elements is 1 in either case. Typically,
-one can measure the distance between any two elements in a batch primitive,
-be it 1D, 2D, or 3D data. For tightly packed data, the distance between FFT
-primitives is the size of the FFT primitive, such that dist=LenX for 1D data,
-dist=LenX*LenY for 2D data, and dist=LenX*LenY*LenZ for 3D data. It is
-possible to set the distance of a plan to be less than the size of the FFT vector;
-most often 1 for this case. When computing a batch of 1D FFT vectors, if
-distance == 1, and strideX == length(vector), a transposed output is produced
-for a batch of 1D vectors. You must verify that the distance and
-strides are valid (not intersecting); if not valid, undefined results may occur.
-
-A simple example would be to perform a 1D length 4096 on each row of an array of 1024 rows x 4096 columns of
-values stored in a column-major array, such as a FORTRAN program might provide. (This would be equivalent
-to a C or C++ program that has an array of 4096 rows x 1024 columns stored in a row-major manner, on which
-you want to perform a 1-D length 4096 transform on each column.) In this case, specify the strides
-[1024, 1].
-
-A more complex example would be to compute a 2D FFT for each 64 x 64 subtile of the grid that has an input
-buffer with a raster grid of 1024 x 1024 monochrome pixel values. Specifying strides
-allows you to treat each horizontal band of 1024 x 64 pixels as an array of 16 64 x 64 matrixes,
-and process an entire band with a single call @ref clfftEnqueueTransform(). (Specifying strides is not
-quite flexible enough to transform the entire grid of this example with a single kernel execution.)
-It is possible to create a Plan to compute arrays of 64 x 64 2D FFTs, then specify three strides:
-[1, 1024, 64]. The first stride, 1, indicates that the rows of each matrix are stored consecutively;
-the second stride, 1024, gives the distance between rows, and the third stride, 64, defines the
-distance between two matrices. Then call @ref clfftEnqueueTransform() 16 times – once for each
-horizontal band of pixels.
+For one-dimensional data, if clStrides[0] = strideX = 1, successive elements in the first dimension are stored contiguously in memory. If strideX is an integral value greater than 1, gaps in memory exist between each element of the vectors.
+For multi-dimensional data, if clStrides[1] = strideY = LenX for 2 dimensional data and clStrides[2] = strideZ = LenX*LenY for 3 dimensional data, no gaps exist in memory between each element, and all vectors are stored tightly packed in memory. Here, LenX, LenY, and LenZ denote the transform lengths clLengths[0], clLengths[1], and clLengths[2], respectively, which are used to set up the plan.
 
- at subsection EnumPrecision Supported precisions in clFFT
-Both *CLFFT_SINGLE* and *CLFFT_DOUBLE* precisions are supported by the library
-for all supported radices. For both these enums the math functions of the host computer are used to
-produce the sine and cosine tables that are used by the OpenCL kernel.
-Both *CLFFT_SINGLE_FAST* and *CLFFT_DOUBLE_FAST* generate faster kernels with reduced accuracy,
-but are disabled in the current build.
+By specifying non-default strides, it is possible to process either row-major or column-major arrays. Data can be extracted from arrays of structures. Almost any regular data storage pattern can be accommodated.
 
+Distance is the amount of memory that exists between corresponding elements in an FFT primitive in a batch. Distance is measured in units of the FFT primitive; complex data measures in complex units, and real data measures in real units. Stride between tightly packed elements is 1 in either case. Typically, one can measure the distance between any two elements in a batch primitive, be it 1D, 2D, or 3D data. For tightly packed data, the distance between FFT primitives is the size of the F [...]
+A simple example would be to perform a 1D length 4096 on each row of an array of 1024 rows x 4096 columns of values stored in a column-major array, such as a FORTRAN program might provide. (This would be equivalent to a C or C++ program that has an array of 4096 rows x 1024 columns stored in a row-major manner, on which you want to perform a 1-D length 4096 transform on each column.) In this case, specify the strides as [1024, 1].
+
+A more complex example would be to compute a 2D FFT for each 64 x 64 subtile of the grid that has an input buffer with a raster grid of 1024 x 1024 monochrome pixel values. Specifying strides allows you to treat each horizontal band of 1024 x 64 pixels as an array of 16 64 x 64 matrixes, and process an entire band with a single call @ref clfftEnqueueTransform(). (Specifying strides is not quite flexible enough to transform the entire grid of this example with a single kernel execution.)  [...]
+
+ at subsection EnumPrecision Supported precisions in clFFT
+Both *CLFFT_SINGLE* and *CLFFT_DOUBLE* precisions are supported by the library for all supported radices. For both these enums the math functions of the host computer are used to produce the sine and cosine tables that are used by the OpenCL kernel.
+Both *CLFFT_SINGLE_FAST* and *CLFFT_DOUBLE_FAST* generate faster kernels with reduced accuracy, but are disabled in the current build.
 See @ref clfftPrecision, @ref clfftSetPlanPrecision(), and @ref clfftGetPlanPrecision().
 
 @subsection FftDirection clfftDirection
-The direction of the transform is not baked into the plan for complex transforms; the same plan can be used to specify both forward
-and backward transforms. To specify the direction, @ref clfftDirection is passed as a parameter into @ref clfftEnqueueTransform().
-In the case of real transforms, the plan's input and output layouts determine the direction.
+For complex transforms, the direction of the transform is not baked into the plan; the same plan can be used to specify both forward and backward transforms. To specify the direction, @ref clfftDirection is passed as a parameter into @ref clfftEnqueueTransform(). 
+For real transforms, the  input and output layouts of the plan determine the direction.
 
 @subsection EnumResultLocation In-place and out-of-place transforms
-The clFFT API supports both in-place and out-of-place transforms. With in-place
-transforms, only the input buffers are provided to the @ref clfftEnqueueTransform() API,
-and the resulting data is written in the same buffer, overwriting the input data.
-With out-of-place transforms, distinct output buffers are provided to the
- at ref clfftEnqueueTransform() API, and the input data is preserved.
-In-place transforms require the *cl_mem* objects the client
-creates have both read and write permissions. This is given in the nature of the
-in-place algorithm. Out-of-place transforms require that the destination buffers to
-have read and write permissions, but input buffers can still be created with
-read-only permissions. This is a clFFT requirement because internally the
-algorithms may go back and forth between the destination buffers and internally
-allocated temp buffers. For out-of-place transforms, clFFT never writes back
-to input buffers.
+The clFFT API supports both in-place and out-of-place transforms. With in-place transforms, only the input buffers are provided to the @ref clfftEnqueueTransform() API, and the resulting data is written in the same buffer, overwriting the input data. With out-of-place transforms, distinct output buffers are provided to the @ref clfftEnqueueTransform() API, and the input data is preserved. 
+In-place transforms require that the *cl_mem* objects created by the client application, have both read and write permissions. This is given in the nature of the in-place algorithm. Out-of-place transforms require that the destination buffers have read and write permissions, but input buffers can still be created with read-only permissions. This is a clFFT requirement because internally the algorithms may go back and forth between the destination buffers and internally allocated temp buf [...]
 
 @subsection clFFTEff Batches
-The efficiency of clFFT is improved by utilizing transforms in batches. Sending
-as much data as possible in a single transform call leverages the parallel
-compute capabilities of OpenCL devices (and GPU devices in particular), and
-minimizes the penalty of transfer overhead. It is best to think of an OpenCL device
-as a high-throughput, high-latency device. Using a networking analogy as an
-example, this approach is similar to having a massively high-bandwidth pipe with very high
-ping response times. If the client is ready to send data to the device for compute,
-it should be sent in as few API calls as possible and this can be done by batching.
-clFFT plans have a parameter @ref clfftSetPlanBatchSize() to describe the number of transforms being
-batched, and another parameter @ref clfftSetPlanDistance() to describe how those batches are
-laid out and spaced in memory. 1D, 2D, or 3D transforms can be batched.
+The efficiency of clFFT is improved by utilizing transforms in batches. Sending as much data as possible in a single transform call leverages the parallel compute capabilities of OpenCL devices (and GPU devices in particular), and minimizes the penalty of transfer overhead. It is best to think of an OpenCL device as a high-throughput, high-latency device. Using a networking analogy as an example, this approach is similar to having a massively high-bandwidth pipe with very high ping respo [...]
 
 @section Outline  Using clFFT in a client application
 
@@ -301,86 +204,74 @@ To perform FFT calculations using clFFT, the client program must perform the fol
 	<li> Initialize the library by calling @ref clfftSetup(). </li>
 	<li> For each distinct type of FFT needed:
 	<ol>
-    <li> Create an FFT Plan object. To create an FFT Plan object, do either of the following. In both cases,
-	the function returns an opaque handle to the plan object.
+   		 <li> Create an FFT Plan object. 
+           			To create an FFT Plan object, do either of the following. 
 		<ul>
-			<li>Call the factory function @ref clfftCreateDefaultPlan() and specify the value of the
-		   		most fundamental parameters, such as plHandle, context, dim, and clLengths, while other parametes assume
-		   		default values.  The OpenCL context must be provided when the plan is created; it cannot be changed. </li>
+			<li>Call the factory function @ref clfftCreateDefaultPlan() and specify the value 
+			       of the most fundamental parameters, such as plHandle, context, dim, and 
+ 			       clLengths, while other parameters assume default values.  The 
+			       OpenCL context must be provided when the plan is created; it cannot be 
+			       changed. </li> <br /> Or
 		 	<li>Call @ref clfftCopyPlan(). </li>
 		</ul>
-    <li> Complete the specification of all the Plan parameters by calling various parameter-setting functions that
-		     have the prefix *clfftSet*. </li>
-	  <li> Optionally, "bake" or finalize the plan by calling @ref clfftBakePlan() function. This signals the library the end
-		     of the specification phase, and causes it to generate and compile the exact OpenCL kernels that perform the
-		     specified FFT on the provided OpenCL device.
-
-		     At this point, all performance-enhancing optimizations are applied, possibly including executing benchmark kernels
-		     on the OpenCL device context to maximize runtime performance.
-
-		     Although the last step is optional, it is recommended to use it so that you can have control on when to do this work.
-			   Usually, this time consuming step is done when the application is initialized. If you do not call
-			   @ref clfftBakePlan(), this work is done during the first call of @ref clfftEnqueueTransform().
+            			Note: In both the cases, the function returns an opaque handle to the plan 
+			object.
+    <li> Complete the specification of all the Plan parameters by calling various parameter-setting 
+            functions that have the prefix *clfftSet*. </li>
+	  <li> Optionally, "bake" or finalize the plan by calling @ref clfftBakePlan() function. This signals 
+	          the library the end of the specification phase, and causes it to generate and compile the 
+                         exact OpenCL kernels that perform the specified FFT on the provided OpenCL device.
+	          At this point, all performance-enhancing optimizations are applied, possibly including 
+                        executing benchmark kernels on the OpenCL device context to maximize the runtime 
+                        performance. <br /> <br />
+Although the previous step is optional, it is recommended to use it so that you can
+have control on when to do this work. Usually, this time consuming step is done when the application is initialized. If you do not call @ref clfftBakePlan(), this work is done during the first call of @ref clfftEnqueueTransform().
 		</li>
 	</ol>
 
 	<li> Execute the OpenCL FFT kernels as many times as needed. </li>
 	<ol>
-		<li>  Call @ref clfftEnqueueTransform(). At this point, specify whether you want to execute a forward or reverse
-		      transform; also, provide the OpenCL *cl_mem* handles for the input buffer(s), output buffer(s)--unless you want
-		      the transformed data to overwrite the input buffers, and (optionally) scratch buffer.
-
-		      @ref clfftEnqueueTransform() performs one or more calls to the OpenCL function clEnqueueNDRangeKernel.
-		      Like clEnqueueNDRangeKernel, @ref clfftEnqueueTransform() is a non-blocking call. The commands to
-		      execute the FFT compute kernel(s) are added to the OpenCL context queue to be executed asynchronously.
-		      An OpenCL event handle is returned to the caller. If multiple NDRangeKernel operations are queued,
-		      the final event handle is returned.
+		<li>  Call @ref clfftEnqueueTransform(). At this point, specify whether you want to 
+		         execute a forward or reverse transform; also, provide the OpenCL *cl_mem* 
+		         handles for the input buffer(s), output buffer(s)--unless you want the transformed 
+		         data to overwrite the input buffers, and (optionally) scratch buffer.
+		         @ref clfftEnqueueTransform() performs one or more calls to the OpenCL function 
+		          clEnqueueNDRangeKernel. Like clEnqueueNDRangeKernel, @ref 
+                                       clfftEnqueueTransform() is a non-blocking call. The commands to execute the FFT 
+                                       compute kernel(s) are added to the OpenCL context queue to be executed 
+		          asynchronously.
+		         An OpenCL event handle is returned to the caller. If multiple NDRangeKernel 
+		         operations are queued, the final event handle is returned.
 		</li>
-		<li>  Add any application OpenCL tasks to the OpenCL context queue. For example, if the
-			  next step in the application process is to apply a filter to the transformed data, the application calls
-		      clEnqueueNDRangeKernel, and specifies its output buffer(s) as the input to the filter kernel,
-		      and provides the transform's event handle to ensure proper synchronization. </li>
-		<li>  If the application accessed the transformed data directly, it calls one of the OpenCL functions
-		      for synchronizing the host computer execution with the OpenCL device (for example: clFinish()). </li>
+		<li>  Add any application OpenCL tasks to the OpenCL context queue. For example, if the    		        next step in the application process is to apply a filter to the transformed data, the 
+		        application calls clEnqueueNDRangeKernel, and specifies its output buffer(s) as the  
+		        input to the filter kernel, and provides the transform's event handle to ensure 
+		        proper synchronization. </li>
+		<li>  If the application accessed the transformed data directly, it calls one of the OpenCL 
+		        functions for synchronizing the host computer execution with the OpenCL device 
+		        (for example: clFinish()). </li>
 	</ol>
 	<li> Terminate the library by calling @ref clfftTeardown().
 </ul>
 
 @section RealFFT  FFTs of real data
 
-When real data is subject to DFT transformation, the resulting complex output data
-follows a special property. About half of the output is redundant because they are
-complex conjugates of the other half. This is called the Hermitian redundancy.
-So, for space and performance considerations, it is only necessary to store the
-non-redundant part of the data. Most FFT libraries use this property to offer
-specific storage layouts for FFTs involving real data. clFFT provides three
-enumerated types to deal with real data FFTs:
-
+When real data is subject to DFT transformation, the resulting complex output data follows a special property. About half of the output is redundant because they are complex conjugates of the other half. This is called the Hermitian redundancy. So, for space and performance considerations, it is only necessary to store the non-redundant part of the data. Most FFT libraries use this property to offer specific storage layouts for FFTs involving real data. clFFT provides three enumerated ty [...]
 <ul>
 	<li> *CLFFT_REAL*
 	<li> *CLFFT_HERMITIAN_INTERLEAVED*
 	<li> *CLFFT_HERMITIAN_PLANAR*
 </ul>
 
-The *CLFFT_REAL* enum specifies that the data is purely real. This can be used to feed
-real input or get back real output. The *CLFFT_HERMITIAN_INTERLEAVED* and
-*CLFFT_HERMITIAN_PLANAR* enums are similar to the corresponding full complex enums
-in the way they store real and imaginary components, but store only about half of the
-complex output. Client applications can do just a forward transform and analyze the output
-or they can process the output and do a backward transform to get back real data. This is illustrated
-in the following figure.
+The *CLFFT_REAL* enum specifies that the data is purely real. This can be used to feed real input or get back real output. The *CLFFT_HERMITIAN_INTERLEAVED* and *CLFFT_HERMITIAN_PLANAR* enums are similar to the corresponding full complex enums in the way they store real and imaginary components, but store only about half of the complex output. Client applications can do just a forward transform and analyze the output or they can process the output and do a backward transform to get back  [...]
 
 @image html realfft_fwdinv.jpg "Forward and Backward Transform Processes"
 
-Let us consider a 1D real FFT of length N. The full output looks as shown in
-following figure.
+Let us consider a 1D real FFT of length N. The full output looks as shown in following figure.
 
 @image html realfft_1dlen.jpg "1D Real FFT of Length N"
 
-Here, C* denotes the complex conjugate. Since the values at indices greater
-than N/2 can be deduced from the first half of the array, clFFT stores data
-only up to the index N/2. This means that the output contains only 1 + N/2
-complex elements, where the division N/2 is rounded down. Examples for even
+Here, C* denotes the complex conjugate. Since the values at indices greater than N/2 can be deduced from the first half of the array, clFFT stores data only up to the index N/2. This means that the output contains only 1 + N/2 complex elements, where the division N/2 is rounded down. Examples for even
 and odd lengths are given below.
 
 Example for N = 8 is shown in following figure.
@@ -392,19 +283,8 @@ Example for N = 7 is shown in following figure.
 @image html realfft_ex_n7.jpg "Example for N = 7"
 
 
-For length 8, only (1 + 8/2) = 5 of the output complex numbers are stored, with
-the index ranging from 0 through 4. Similarly for length 7, only (1 + 7/2) = 4 of
-the output complex numbers are stored, with the index ranging from 0 through 3.
-
-For 2D and 3D FFTs, the FFT length along the least dimension is used to
-compute the (1 + N/2) value. This is because the FFT along the least dimension
-is what is computed first and is logically a real-to-hermitian transform. The FFTs
-along other dimensions are computed afterwards; they are simply 'complex-to-complex'
-transforms. For example, assuming clLengths[2] is used to set up a 2D
-real FFT, let N1 = clLengths[1], and N0 = clLengths[0]. The output FFT has
-N1*(1 + N0/2) complex elements. Similarly, for a 3D FFT with clLengths[3] and
-N2 = clLengths[2], N1 = clLengths[1], and N0 = clLengths[0], the output has
-N2*N1*(1 + N0/2) complex elements.
+For length 8, only (1 + 8/2) = 5 of the output complex numbers are stored, with the index ranging from 0 through 4. Similarly for length 7, only (1 + 7/2) = 4 of the output complex numbers are stored, with the index ranging from 0 through 3.
+For 2D and 3D FFTs, the FFT length along the least dimension is used to compute the (1 + N/2) value. This is because the FFT along the least dimension is computed first and is logically a real-to-hermitian transform. The FFTs along other dimensions are computed afterwards; they are simply 'complex-to-complex' transforms. For example, assuming clLengths[2] is used to set up a 2D real FFT, let N1 = clLengths[1], and N0 = clLengths[0]. The output FFT has N1*(1 + N0/2) complex elements. Simi [...]
 
 @subsection RealModes Supported Modes
 
@@ -426,15 +306,11 @@ In-place transforms:
 
 @subsection ExplicitStrides Setting strides
 
-The library currently <b> requires you to explicitly set input and output strides for real transforms.</b> See
-the following examples to understand what values to use for input and output strides under different scenarios. These
-examples show typical usages, but you can allocate the buffers and layout data according
-to your need.
+The library currently <b> requires you to explicitly set input and output strides for real transforms.</b> See the following examples to understand what values to use for input and output strides under different scenarios. These examples show typical usages, but you can allocate the buffers and layout data according to your need.
 
 @subsection RealExamples Examples
 
-The following pages provide figures and examples to explain in detail the real
-FFT features of this library.
+The following pages provide figures and examples to explain in detail the real FFT features of this library.
 
 @image html realfft_expl_01.jpg "1D FFT - Real to Hermitian"
 @image html realfft_expl_02.jpg "1D FFT - Real to Hermitian, Example 1"
@@ -447,22 +323,15 @@ FFT features of this library.
 
 @section Callbacks  clFFT Callbacks
 
-The callback feature of clFFT has the ability to invoke user provided OpenCL™ inline functions 
-to pre-process or post-process data, from within the FFT kernel. The inline OpenCL callback function 
-is passed as a string to the library. It is then incorporated into the generated FFT kernel. This 
-eliminates the need for an additional kernel launch to carry out the pre/post processing tasks, thus 
-improving overall performance.
-
-There are 2 types of callback; Pre-callback and Post-callback. Pre-callback invokes user callback function to
-perform custom  pre-processing of the input data, before FFT is executed. Post-callback invokes user callback function to
-perform custom post-processing of the output data, after FFT is executed.
+The callback feature of clFFT has the ability to invoke user provided OpenCL™ inline functions to pre-process or post-process data from within the FFT kernel. The inline OpenCL callback function is passed as a string to the library. It is then incorporated into the generated FFT kernel. This eliminates the need for an additional kernel launch to carry out the pre/post processing tasks, thus improving overall performance.
+There are 2 types of callback; Pre-callback and Post-callback. Pre-callback invokes user callback function to perform custom  pre-processing of the input data before FFT is executed. Post-callback invokes user callback function to perform custom post-processing of the output data after FFT is executed.
 
 @subsection CallbackWorkflow Callback Workflow
 
-The workflow of FFT execution using callback feature of clFFT is as follows
+The workflow of FFT execution using callback feature of clFFT is as follows:
 
 <ol>
-	<li> Create clFFT Plan and initialize standard clFFT parameters.
+	<li> Create the clFFT Plan and initialize the standard clFFT parameters.
 	<li> Use @ref clfftSetPlanCallback() API to register the callback function with library
 		@code
 		clfftStatus clfftSetPlanCallback(clfftPlanHandle plHandle,
@@ -473,31 +342,28 @@ The workflow of FFT execution using callback feature of clFFT is as follows
 											void *userdata,
 											int numUserdataBuffers)
 		@endcode
-		The library uses the arguments passed to this API, including callback function string, to stitch the callback code
-		into the generated FFT kernel. The arguments for clfftSetPlanCallback are
+		The library uses the arguments passed to this API, including callback function string, to stitch the callback code 	into the generated FFT kernel. The arguments for clfftSetPlanCallback are
 		<ul>
 			<li> clFFT plan handle
 			<li> Name of the callback function
-			<li> Callback function as character array. The character array can also include any custom datatype declaration used by callback function
+			<li> Callback function as character array; the character array can include custom datatype declaration if any custom type is used by callback function
 			<li> Size of local memory requested by callback, if any, in bytes
-			<li> Type of callback; ‘PRECALLBACK’ or ‘POSTCALLBACK’. This is an enumerator
+			<li> Type of callback: ‘PRECALLBACK’ or ‘POSTCALLBACK’; this is an enumerator
 			<li> Supplementary user data, if any, used by callback function
-			<li> Number of user data buffers. The library currently supports only 1 user data buffer per callback registration
+			<li> Number of user data buffers; the library currently supports only one user data buffer per callback registration
 		</ul>
-		Multiple callback registration calls to the same type of callback will result in overwriting the previously registered callback function
+		Multiple callback registration calls to the same type of callback result in overwriting the previously registered callback function
 	<li> Invoke Bake Plan step
-	<li> Library inserts the callback code into the main FFT kernel during bake plan and compiles it. If there are any
-	compilation errors caused by syntax or incompatible callback function prototype, the library reports failure.
+	Library inserts the callback code into the main FFT kernel during bake plan and compiles it. If there are any compilation errors caused by syntax or incompatible callback function prototype, the library reports failure.
 	<li> Enqueue clFFT transform
 </ol>
 
 The caller is responsible to provide a callback function that matches the function prototype based on the type of
-callback(pre/post), type of transform(real/complex) and whether LDS is used. The bake plan step does the function prototype checking.
+callback(pre/post), type of transform(real/complex) and whether LDS is used. The bake plan step checks the function prototype.
 
 @subsection CallbackFunctionPrototype Callback Function Prototypes
 
-clFFT expects the callback function to be of a specific prototype depending on the type of callback(pre/post), type of transform(real/complex)
-and whether LDS is used. These are as follows:
+clFFT expects the callback function to be of a specific prototype depending on the type of callback(pre/post), type of transform(real/complex) and whether LDS is used. These are as follows:
 
 @subsubsection PrecallbackProtyotype Pre-callback Prototypes
 
@@ -522,11 +388,11 @@ Parameters
 	filter data or any scalar value. The userdata can be of any custom data type/structure, in which case,
 	you have to declare the custom data type and include it along with the callback function string. </li>
 	<li> \c localmem : Pointer to local memory. This memory is allocated by library based on the size you specify
-	and is subject to local memory availability. </li>
+	and is subjected to local memory availability. </li>
 </ul>
 
-For Planar C2C, the return type of callback is a vector (float2/double2) whose elements contain the result for Real
-and Imaginary as computed in the callback
+For Planar C2C, the return type of callback is a vector (float2/double2) that contains the   Real
+and Imaginary elements as computed in the callback.
 
 @subsubsection PostcallbackProtyotype Post-callback Prototypes
 
@@ -550,6 +416,7 @@ Parameters
 	for passing any supplementary data to the callback function. For example, buffer having convolution
 	filter data or any scalar value. The userdata can be of any custom data type/structure, in which case,
 	you have to declare the custom data type and include it along with the callback function string. </li>
+	<li> \c fftoutput : The result computed by clFFT for the element corresponding to outoffset argument
 	<li> \c localmem : Pointer to local memory. This memory is allocated by library based on the size you specify
 	and is subject to local memory availability. </li>
 </ul>
@@ -564,11 +431,11 @@ const char* precallbackstr = "float2 pre_mulval(__global void* input, \n
                                   uint inoffset,                      \n
                                   __global void* userdata,            \n
                                   __local void* localmem)             \n
-				{                                                             \n
+				{                                                            \n
 				float scalar = *((__global float*)userdata + inoffset);      \n
 				float2 ret = *((__global float2*)input + inoffset) * scalar; \n
-				return ret;                                                   \n
-				}                                                             \n";
+				return ret;                                                  \n
+				}                                                            \n";
 
 const char* postcallbackstr = "void post_mulval(__global void* output, \n
                                   uint outoffset,                      \n
@@ -579,9 +446,9 @@ const char* postcallbackstr = "void post_mulval(__global void* output, \n
 				float scalar = *((__global float*)userdata + outoffset);      \n
 				*((__global float2*)output + outoffset) = fftoutput * scalar; \n
 				}                                                             \n";
-
+				
 //**************************************************************************
-//* Step 2 : Initialize arguments if any required by the callback.
+//* Step 2 : Initialize arguments, if any, required by the callback.
 //**************************************************************************
 int h_preuserdata[N] = {  };
 cl_mem preuserdata = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * N,  (void*)h_preuserdata, NULL);
@@ -609,11 +476,11 @@ status = clfftEnqueueTransform( plan_handle, dir, 1, &queue, 0, NULL, &outEvent,
 @subsection CallbackNotes Important Notes on Callback
 
 <ol>
-	<li> The caller is responsible to provide a callback function in string form that matches the function prototype based on the type of callback, type of transform(real/complex) and whether LDS is used
-	<li> clFFT considers the value returned by pre-callback function as the new value of the input at the index corresponding to the *inoffset* argument
-	<li> Callback function can request for local memory for its own use. If the requested amount of local memory is available on the device, clFFT passes a pointer to the local memory when it invokes the callback function
-	<li> clFFT may invoke FFT kernels several times depending on the input parameters. However the pre-callback function provided by caller is invoked only once for each point in the input. Similarly it calls the post-callback function, for each point in the output, only once.
-	<li> If clFFT is implementing a given FFT in multiple phases, it calls the pre-callback function only from the first phase kernel. Similarly it calls the post-callback function only from the last phase kernel
+	<li> The caller is responsible to provide a callback function in string form that matches the function prototype based on the type of callback, type of transform(real/complex), and whether LDS is used.
+	<li> clFFT considers the value returned by the pre-callback function as the new value of the input at the index corresponding to the *inoffset* argument.
+	<li> Callback function can request for local memory for its own use. If the requested amount of local memory is available on the device, clFFT passes a pointer to the local memory when it invokes the callback function.
+	<li> clFFT may invoke the FFT kernels several times depending on the input parameters. However, the pre-callback function provided by the caller is invoked only once for each point in the input. Similarly, it calls the post-callback function only once for each point in the output.
+	<li> If clFFT is implementing a given FFT in multiple phases, it calls the pre-callback function only from the first phase kernel. Similarly, it calls the post-callback function only from the last phase kernel.
 </ol>
 
  */

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/clfft.git