[SCM] x265/upstream: New upstream version 2.2
sramacher at users.alioth.debian.org
Thu Dec 29 13:15:50 UTC 2016
The following commit has been merged in the upstream branch:
commit bc8a685b9a7adae4d491d19cdb3667d2c25f0b6a
Author: Sebastian Ramacher <sramacher at debian.org>
Date: Thu Dec 29 13:59:53 2016 +0100
New upstream version 2.2
diff --git a/.hg_archival.txt b/.hg_archival.txt
index 156633c..e6f2f0f 100644
--- a/.hg_archival.txt
+++ b/.hg_archival.txt
@@ -1,6 +1,4 @@
repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: 3e8ce3b26319dbd53ab6369e4c4e986bf30f1315
+node: be14a7e9755e54f0fd34911c72bdfa66981220bc
branch: stable
-latesttag: 2.1
-latesttagdistance: 1
-changessincelatesttag: 1
+tag: 2.2
diff --git a/doc/reST/cli.rst b/doc/reST/cli.rst
index f14b8a8..7f93623 100644
--- a/doc/reST/cli.rst
+++ b/doc/reST/cli.rst
@@ -662,7 +662,7 @@ the prediction quad-tree.
and less frame parallelism as well. Because of this the faster
presets use a CU size of 32. Default: 64
-.. option:: --min-cu-size <64|32|16|8>
+.. option:: --min-cu-size <32|16|8>
Minimum CU size (width and height). By using 16 or 32 the encoder
will not analyze the cost of CUs below that minimum threshold,
@@ -869,6 +869,24 @@ as the residual quad-tree (RQT).
partitions, in which case a TU split is implied and thus the
residual quad-tree begins one layer below the CU quad-tree.
+.. option:: --limit-tu <0..4>
+
+ Enables early exit from TU depth recursion for inter-coded blocks.
+ Level 1 - decides whether to recurse to the next higher depth based on a cost
+ comparison of the full-size TU and the split TU.
+
+ Level 2 - limits recursion of the other split subTUs based on the
+ first split subTU's depth.
+
+ Level 3 - limits recursion of the current CU based on the average TU depth
+ of the co-located and the neighbouring CUs.
+
+ Level 4 - uses the TU depth of the neighbouring/co-located CUs to
+ limit the depth of the first subTU. The first subTU's depth is then taken
+ as the limiting depth for the other subTUs.
+
+ Default: 0
+
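The level-1 early-out described above can be sketched as follows (a minimal Python illustration with made-up costs; the actual decision in x265 lives inside the C++ TU recursion):

```python
def limit_tu_level1(full_tu_cost, split_tu_costs):
    """Level 1: recurse to the next TU depth only when the summed cost
    of the split subTUs beats the full-size TU's cost."""
    split_cost = sum(split_tu_costs)
    # Early-out: keep the full-size TU when splitting is not cheaper.
    return split_cost < full_tu_cost  # True -> recurse one depth deeper

# Splitting (90) undercuts the full-size TU (100), so recursion continues.
print(limit_tu_level1(100, [20, 20, 20, 30]))  # True
```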
.. option:: --nr-intra <integer>, --nr-inter <integer>
Noise reduction - an adaptive deadzone applied after DCT
@@ -949,13 +967,17 @@ Temporal / motion search options
encoder: a star-pattern search followed by an optional radix scan
followed by an optional star-search refinement. Full is an
exhaustive search; an order of magnitude slower than all other
- searches but not much better than umh or star.
+ searches but not much better than umh or star. SEA is similar to
+ full search but faster: a three-step motion search adopted from x264,
+ consisting of DC calculation, followed by ADS calculation, followed by
+ SAD of the surviving motion vector candidates.
0. dia
1. hex **(default)**
2. umh
3. star
- 4. full
+ 4. sea
+ 5. full
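The elimination idea behind SEA can be sketched like so (an illustrative Python sketch using a toy DC lower bound and a hypothetical `sad_fn` callback; x265's implementation operates on integral planes in C++/assembly):

```python
def sea_search(cur_dc, candidates, sad_fn):
    """Successive elimination sketch: the absolute DC (block-sum)
    difference lower-bounds the SAD, so a candidate whose bound already
    meets or exceeds the best SAD can be skipped without a full SAD."""
    best_mv, best_sad = None, float("inf")
    for mv, cand_dc in candidates:
        if abs(cur_dc - cand_dc) >= best_sad:  # ADS-style early skip
            continue
        sad = sad_fn(mv)                       # full SAD only for survivors
        if sad < best_sad:
            best_mv, best_sad = mv, sad
    return best_mv, best_sad

# Toy data: (1, 0) is eliminated because |12 - 50| = 38 already exceeds
# the best SAD of 5 found for (0, 0).
sads = {(0, 0): 5, (1, 0): 40}
print(sea_search(12, [((0, 0), 10), ((1, 0), 50)], sads.get))  # ((0, 0), 5)
```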
.. option:: --subme, -m <0..7>
@@ -1153,6 +1175,13 @@ Slice decision options
:option:`--scenecut` 0 or :option:`--no-scenecut` disables adaptive
I frame placement. Default 40
+.. option:: --scenecut-bias <0..100.0>
+
+ This value represents the percentage difference between the inter cost and
+ intra cost of a frame used in scenecut detection. For example, a value of 5 means
+ that a frame is detected as a scenecut if its inter cost is greater than or equal
+ to 95 percent of its intra cost. Values between 5 and 15 are recommended. Default 5.
+
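The threshold rule above amounts to the following check (a hedged Python sketch; the names are illustrative, not x265's internals):

```python
def is_scenecut(inter_cost, intra_cost, bias_pct=5.0):
    """Flag a scenecut when the inter cost fails to undercut the intra
    cost by at least bias_pct percent (default 5 -> 95% threshold)."""
    return inter_cost >= (1.0 - bias_pct / 100.0) * intra_cost

print(is_scenecut(96, 100))  # True: 96 >= 95% of the intra cost
print(is_scenecut(90, 100))  # False: frame is well predicted
```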
.. option:: --intra-refresh
Enables Periodic Intra Refresh (PIR) instead of keyframe insertion.
@@ -1304,7 +1333,7 @@ Quality, rate control and rate distortion options
slices using param->rc.ipFactor and param->rc.pbFactor unless QP 0
is specified, in which case QP 0 is used for all slice types. Note
that QP 0 does not cause lossless encoding, it only disables
- quantization. Default disabled (CRF)
+ quantization. Default disabled.
**Range of values:** an integer from 0 to 51
@@ -1824,7 +1853,7 @@ Bitstream options
enhancement layer. A decoder may choose to drop the enhancement layer
and only decode and display the base layer slices.
- If used with a fixed GOP (:option:`b-adapt` 0) and :option:`bframes`
+ If used with a fixed GOP (:option:`--b-adapt` 0) and :option:`--bframes`
3 then the two layers evenly split the frame rate, with a cadence of
PbBbP. You probably also want :option:`--no-scenecut` and a keyframe
interval that is a multiple of 4.
@@ -1833,15 +1862,29 @@ Bitstream options
Maximum of the picture order count. Default 8
-.. option:: --discard-sei
+.. option:: --[no-]vui-timing-info
- Discard SEI messages generated from the final bitstream. HDR-related SEI
- messages are always dumped, immaterial of this option. Default disabled.
-
-.. option:: --discard-vui
+ Emit VUI timing info in bitstream. Default enabled.
+
+.. option:: --[no-]vui-hrd-info
+
+ Emit VUI HRD info in bitstream. Default enabled when
+ :option:`--hrd` is enabled.
+
+.. option:: --[no-]opt-qp-pps
+
+ Optimize QP in PPS (instead of default value of 26) based on the QP values
+ observed in last GOP. Default enabled.
+
+.. option:: --[no-]opt-ref-list-length-pps
+
+ Optimize L0 and L1 ref list length in PPS (instead of default value of 0)
+ based on the lengths observed in the last GOP. Default enabled.
+
+.. option:: --[no-]multi-pass-opt-rps
+
+ Enable storing commonly used RPS in SPS in multi-pass mode. Default disabled.
- Discard optional VUI information (timing, HRD info) from the
- bitstream. Default disabled.
Debugging options
=================
diff --git a/doc/reST/index.rst b/doc/reST/index.rst
index 610f435..8cb1b00 100644
--- a/doc/reST/index.rst
+++ b/doc/reST/index.rst
@@ -9,3 +9,4 @@ x265 Documentation
threading
presets
lossless
+ releasenotes
diff --git a/doc/reST/releasenotes.rst b/doc/reST/releasenotes.rst
new file mode 100644
index 0000000..605cdba
--- /dev/null
+++ b/doc/reST/releasenotes.rst
@@ -0,0 +1,141 @@
+*************
+Release Notes
+*************
+
+Version 2.2
+===========
+
+Release date - 26th December, 2016.
+
+Encoder enhancements
+--------------------
+1. Enhancements to the TU selection algorithm, with early-outs for improved speed; use :option:`--limit-tu` to enable.
+2. New motion search method SEA (Successive Elimination Algorithm) is now supported as :option:`--me` 4.
+3. Bit-stream optimizations to improve fields in PPS and SPS for bit-rate savings through :option:`--[no-]opt-qp-pps`, :option:`--[no-]opt-ref-list-length-pps`, and :option:`--[no-]multi-pass-opt-rps`.
+4. Enabled using VBV constraints when encoding without WPP.
+5. All param options are now dumped in an SEI packet in the bitstream when info is enabled.
+6. x265 now supports POWERPC-based systems. Several key functions also have optimized ALTIVEC kernels.
+
+API changes
+-----------
+1. Options to disable SEI and optional-VUI messages from bitstream made more descriptive.
+2. New option :option:`--scenecut-bias` to control the bias used to mark scene-cuts via the CLI.
+3. Support mono and mono16 color spaces for y4m input.
+4. :option:`--min-cu-size` of 64 is no longer supported, for reasons of visual quality (it was crashing earlier anyway).
+5. API for CSV now expects version string for better integration of x265 into other applications.
+
+Bug fixes
+---------
+1. Several fixes to slice-based encoding.
+2. :option:`--log2-max-poc-lsb`'s range limited according to HEVC spec.
+3. Restrict MVs to within legal boundaries when encoding.
+
+Version 2.1
+===========
+
+Release date - 27th September, 2016
+
+Encoder enhancements
+--------------------
+1. Support for qg-size of 8
+2. Support for inserting non-IDR I-frames at scenecuts and when running with settings for fixed-GOP (min-keyint = max-keyint)
+3. Experimental support for slice-parallelism.
+
+API changes
+-----------
+1. Encode user-defined SEI messages passed in through the x265_picture object.
+2. Disable SEI and VUI messages from the bitstream
+3. Specify qpmin and qpmax
+4. Control number of bits to encode POC.
+
+Bug fixes
+---------
+1. QP fluctuation fix for first B-frame in mini-GOP for 2-pass encoding with tune-grain.
+2. Assembly fix for crashes in 32-bit from dct_sse4.
+3. Threadpool creation fix on the Windows platform.
+
+Version 2.0
+===========
+
+Release date - 13th July, 2016
+
+New Features
+------------
+
+1. uhd-bd: Enable Ultra-HD Bluray support
+2. rskip: Enables skipping recursion to analyze lower CU sizes using heuristics at different rd-levels. Provides good visual quality gains at the highest quality presets.
+3. rc-grain: Enables a new ratecontrol mode specifically for grainy content. Strictly prevents QP oscillations within and between frames to avoid grain fluctuations.
+4. tune grain: A fully refactored and improved option to encode film grain content including QP control as well as analysis options.
+5. asm: ARM assembly is now enabled by default, native or cross compiled builds supported on armv6 and later systems.
+
+API and Key Behaviour Changes
+-----------------------------
+
+1. x265_rc_stats added to x265_picture, containing all RC decision points for that frame
+2. PTL: high tier is now allowed by default, chosen only if necessary
+3. multi-pass: First pass now uses slow-firstpass by default, enabling better RC decisions in future passes
+4. pools: fix behaviour on multi-socketed Windows systems, provide more flexibility in determining thread and pool counts
+5. ABR: improve bits allocation in the first few frames, abr reset, vbv and cutree improved
+
+Misc
+----
+1. An SSIM calculation bug was corrected
+
+Version 1.9
+===========
+
+Release date - 29th January, 2016
+
+New Features
+------------
+
+1. Quant offsets: This feature allows block level quantization offsets to be specified for every frame. An API-only feature.
+2. --intra-refresh: Keyframes can be replaced by a moving column of intra blocks in non-keyframes.
+3. --limit-modes: Intelligently restricts mode analysis.
+4. --max-luma and --min-luma for luma clipping, optional for HDR use-cases
+5. Emergency denoising is now enabled by default in very low bitrate, VBV encodes
+
+API Changes
+-----------
+
+1. x265_frame_stats returns many additional fields: maxCLL, maxFALL, residual energy, scenecut and latency logging
+2. --qpfile now supports frametype 'K'
+3. x265 now allows CRF ratecontrol in pass N (N greater than or equal to 2)
+4. Chroma subsampling format YUV 4:0:0 is now fully supported and tested
+
+Presets and Performance
+-----------------------
+
+1. Recently added features lookahead-slices, limit-modes, limit-refs have been enabled by default for applicable presets.
+2. The default psy-rd strength has been increased to 2.0
+3. Multi-socket machines now use a single pool of threads that can work cross-socket.
+
+Version 1.8
+===========
+
+Release date - 10th August, 2015
+
+API Changes
+-----------
+1. Experimental support for Main12 is now enabled. Partial assembly support exists.
+2. Main12 and Intra/Still picture profiles are now supported. Still picture profile is detected based on x265_param::totalFrames.
+3. Three classes of encoding statistics are now available through the API.
+a) x265_stats - contains encoding statistics, available through x265_encoder_get_stats()
+b) x265_frame_stats and x265_cu_stats - contains frame encoding statistics, available through recon x265_picture
+4. --csv
+a) x265_encoder_log() is now deprecated
+b) x265_param::csvfn is also deprecated
+5. --log-level now controls only console logging, frame level console logging has been removed.
+6. Support added for new color transfer characteristic ARIB STD-B67
+
+New Features
+------------
+1. limit-refs: This feature limits the references analysed for individual CUs. Provides a nice tradeoff between efficiency and performance.
+2. aq-mode 3: A new aq-mode that provides additional biasing for low-light conditions.
+3. An improved scene cut detection logic that allows ratecontrol to manage visual quality at fade-ins and fade-outs better.
+
+Preset and Tune Options
+-----------------------
+
+1. tune grain: Increases psyRdoq strength to 10.0, and rdoq-level to 2.
+2. qg-size: Default value changed to 32.
diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index b85c15f..ebd03b0 100644
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -30,7 +30,7 @@ option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF)
mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
# X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 95)
+set(X265_BUILD 102)
configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
"${PROJECT_BINARY_DIR}/x265.def")
configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -60,6 +60,11 @@ elseif(POWERMATCH GREATER "-1")
message(STATUS "Detected POWER target processor")
set(POWER 1)
add_definitions(-DX265_ARCH_POWER=1)
+ if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8)
+ set(PPC64 1)
+ add_definitions(-DPPC64=1)
+ message(STATUS "Detected POWER PPC64 target processor")
+ endif()
elseif(ARMMATCH GREATER "-1")
if(CROSS_COMPILE_ARM)
message(STATUS "Cross compiling for ARM arch")
@@ -167,6 +172,19 @@ elseif(CLANG)
elseif(CMAKE_COMPILER_IS_GNUCXX)
set(GCC 1)
endif()
+
+if(CC STREQUAL "xlc")
+ message(STATUS "Use XLC compiler")
+ set(XLC 1)
+ set(GCC 0)
+ #set(CMAKE_C_COMPILER "/usr/bin/xlc")
+ #set(CMAKE_CXX_COMPILER "/usr/bin/xlc++")
+ add_definitions(-D__XLC__=1)
+ add_definitions(-O3 -qstrict -qhot -qaltivec)
+ add_definitions(-qinline=level=10 -qpath=IL:/data/video_files/latest.tpo/)
+endif()
+
+
if(GCC)
add_definitions(-Wall -Wextra -Wshadow)
add_definitions(-D__STDC_LIMIT_MACROS=1)
@@ -396,6 +414,22 @@ if(WIN32)
endif(WINXP_SUPPORT)
endif()
+if(POWER)
+ # IBM Power8
+ option(ENABLE_ALTIVEC "Enable ALTIVEC profiling instrumentation" ON)
+ if(ENABLE_ALTIVEC)
+ add_definitions(-DHAVE_ALTIVEC=1 -maltivec -mabi=altivec)
+ add_definitions(-flax-vector-conversions -fpermissive)
+ else()
+ add_definitions(-DHAVE_ALTIVEC=0)
+ endif()
+
+ option(CPU_POWER8 "Enable CPU POWER8 profiling instrumentation" ON)
+ if(CPU_POWER8)
+ add_definitions(-mcpu=power8 -DX265_ARCH_POWER8=1)
+ endif()
+endif()
+
include(version) # determine X265_VERSION and X265_LATEST_TAG
include_directories(. common encoder "${PROJECT_BINARY_DIR}")
diff --git a/source/common/CMakeLists.txt b/source/common/CMakeLists.txt
index 1cd1aac..102ef22 100644
--- a/source/common/CMakeLists.txt
+++ b/source/common/CMakeLists.txt
@@ -99,6 +99,19 @@ if(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM))
source_group(Assembly FILES ${ASM_PRIMITIVES})
endif(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM))
+if(POWER)
+ set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS -DX265_VERSION=${X265_VERSION})
+ if(ENABLE_ALTIVEC)
+ set(ALTIVEC_SRCS pixel_altivec.cpp dct_altivec.cpp ipfilter_altivec.cpp intrapred_altivec.cpp)
+ foreach(SRC ${ALTIVEC_SRCS})
+ set(ALTIVEC_PRIMITIVES ${ALTIVEC_PRIMITIVES} ppc/${SRC})
+ endforeach()
+ source_group(Intrinsics_altivec FILES ${ALTIVEC_PRIMITIVES})
+ set_source_files_properties(${ALTIVEC_PRIMITIVES} PROPERTIES COMPILE_FLAGS "-Wno-unused -Wno-unknown-pragmas -Wno-maybe-uninitialized")
+ endif()
+endif()
+
+
# set_target_properties can't do list expansion
string(REPLACE ";" " " VERSION_FLAGS "${VFLAGS}")
set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS ${VERSION_FLAGS})
@@ -116,7 +129,7 @@ if(WIN32)
endif(WIN32)
add_library(common OBJECT
- ${ASM_PRIMITIVES} ${VEC_PRIMITIVES} ${WINXP}
+ ${ASM_PRIMITIVES} ${VEC_PRIMITIVES} ${ALTIVEC_PRIMITIVES} ${WINXP}
primitives.cpp primitives.h
pixel.cpp dct.cpp ipfilter.cpp intrapred.cpp loopfilter.cpp
constants.cpp constants.h
diff --git a/source/common/bitstream.h b/source/common/bitstream.h
index 2714604..b82c8b9 100644
--- a/source/common/bitstream.h
+++ b/source/common/bitstream.h
@@ -71,6 +71,7 @@ public:
uint32_t getNumberOfWrittenBytes() const { return m_byteOccupancy; }
uint32_t getNumberOfWrittenBits() const { return m_byteOccupancy * 8 + m_partialByteBits; }
const uint8_t* getFIFO() const { return m_fifo; }
+ void copyBits(Bitstream* stream) { m_partialByteBits = stream->m_partialByteBits; m_byteOccupancy = stream->m_byteOccupancy; m_partialByte = stream->m_partialByte; }
void write(uint32_t val, uint32_t numBits);
void writeByte(uint32_t val);
diff --git a/source/common/common.h b/source/common/common.h
index 133a181..505f618 100644
--- a/source/common/common.h
+++ b/source/common/common.h
@@ -176,7 +176,7 @@ typedef int16_t coeff_t; // transform coefficient
#define X265_MIN(a, b) ((a) < (b) ? (a) : (b))
#define X265_MAX(a, b) ((a) > (b) ? (a) : (b))
-#define COPY1_IF_LT(x, y) if ((y) < (x)) (x) = (y);
+#define COPY1_IF_LT(x, y) {if ((y) < (x)) (x) = (y);}
#define COPY2_IF_LT(x, y, a, b) \
if ((y) < (x)) \
{ \
@@ -312,6 +312,7 @@ typedef int16_t coeff_t; // transform coefficient
#define MAX_NUM_REF_PICS 16 // max. number of pictures used for reference
#define MAX_NUM_REF 16 // max. number of entries in picture reference list
+#define MAX_NUM_SHORT_TERM_RPS 64 // max. number of short term reference picture set in SPS
#define REF_NOT_VALID -1
@@ -327,6 +328,8 @@ typedef int16_t coeff_t; // transform coefficient
#define PIXEL_MAX ((1 << X265_DEPTH) - 1)
+#define INTEGRAL_PLANE_NUM 12 // 12 integral planes for 32x32, 32x24, 32x8, 24x32, 16x16, 16x12, 16x4, 12x16, 8x32, 8x8, 4x16 and 4x4.
+
namespace X265_NS {
enum { SAO_NUM_OFFSET = 4 };
diff --git a/source/common/cpu.cpp b/source/common/cpu.cpp
index 0dafe48..5bd1e0f 100644
--- a/source/common/cpu.cpp
+++ b/source/common/cpu.cpp
@@ -99,6 +99,10 @@ const cpu_name_t cpu_names[] =
{ "ARMv6", X265_CPU_ARMV6 },
{ "NEON", X265_CPU_NEON },
{ "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
+
+#elif X265_ARCH_POWER8
+ { "Altivec", X265_CPU_ALTIVEC },
+
#endif // if X265_ARCH_X86
{ "", 0 },
};
@@ -363,7 +367,18 @@ uint32_t cpu_detect(void)
return flags;
}
-#else // if X265_ARCH_X86
+#elif X265_ARCH_POWER8
+
+uint32_t cpu_detect(void)
+{
+#if HAVE_ALTIVEC
+ return X265_CPU_ALTIVEC;
+#else
+ return 0;
+#endif
+}
+
+#else // if X265_ARCH_POWER8
uint32_t cpu_detect(void)
{
diff --git a/source/common/cudata.cpp b/source/common/cudata.cpp
index 4a90f76..3c652fc 100644
--- a/source/common/cudata.cpp
+++ b/source/common/cudata.cpp
@@ -296,6 +296,9 @@ void CUData::initCTU(const Frame& frame, uint32_t cuAddr, int qp, uint32_t first
/* initialize the remaining CU data in one memset */
memset(m_cuDepth, 0, (frame.m_param->internalCsp == X265_CSP_I400 ? BytesPerPartition - 11 : BytesPerPartition - 7) * m_numPartitions);
+ for (int8_t i = 0; i < NUM_TU_DEPTH; i++)
+ m_refTuDepth[i] = -1;
+
uint32_t widthInCU = m_slice->m_sps->numCuInWidth;
m_cuLeft = (m_cuAddr % widthInCU) ? m_encData->getPicCTU(m_cuAddr - 1) : NULL;
m_cuAbove = (m_cuAddr >= widthInCU) && !m_bFirstRowInSlice ? m_encData->getPicCTU(m_cuAddr - widthInCU) : NULL;
diff --git a/source/common/cudata.h b/source/common/cudata.h
index 126624e..d31e38a 100644
--- a/source/common/cudata.h
+++ b/source/common/cudata.h
@@ -28,6 +28,8 @@
#include "slice.h"
#include "mv.h"
+#define NUM_TU_DEPTH 21
+
namespace X265_NS {
// private namespace
@@ -204,6 +206,7 @@ public:
enum { BytesPerPartition = 21 }; // combined sizeof() of all per-part data
coeff_t* m_trCoeff[3]; // transformed coefficient buffer per plane
+ int8_t m_refTuDepth[NUM_TU_DEPTH]; // TU depth of CU at depths 0, 1 and 2
MV* m_mv[2]; // array of motion vectors per list
MV* m_mvd[2]; // array of coded motion vector deltas per list
@@ -355,9 +358,8 @@ struct CUDataMemPool
CHECKED_MALLOC(trCoeffMemBlock, coeff_t, (sizeL + sizeC * 2) * numInstances);
}
CHECKED_MALLOC(charMemBlock, uint8_t, numPartition * numInstances * CUData::BytesPerPartition);
- CHECKED_MALLOC(mvMemBlock, MV, numPartition * 4 * numInstances);
+ CHECKED_MALLOC_ZERO(mvMemBlock, MV, numPartition * 4 * numInstances);
return true;
-
fail:
return false;
}
diff --git a/source/common/framedata.cpp b/source/common/framedata.cpp
index 7a077f5..ed0370d 100644
--- a/source/common/framedata.cpp
+++ b/source/common/framedata.cpp
@@ -37,6 +37,9 @@ bool FrameData::create(const x265_param& param, const SPS& sps, int csp)
m_slice = new Slice;
m_picCTU = new CUData[sps.numCUsInFrame];
m_picCsp = csp;
+ m_spsrpsIdx = -1;
+ if (param.rc.bStatWrite)
+ m_spsrps = const_cast<RPS*>(sps.spsrps);
m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame);
for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++)
@@ -45,6 +48,12 @@ bool FrameData::create(const x265_param& param, const SPS& sps, int csp)
CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame);
CHECKED_MALLOC(m_rowStat, RCStatRow, sps.numCuInHeight);
reinit(sps);
+
+ for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+ {
+ m_meBuffer[i] = NULL;
+ m_meIntegral[i] = NULL;
+ }
return true;
fail:
@@ -67,4 +76,16 @@ void FrameData::destroy()
X265_FREE(m_cuStat);
X265_FREE(m_rowStat);
+
+ if (m_meBuffer)
+ {
+ for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+ {
+ if (m_meBuffer[i] != NULL)
+ {
+ X265_FREE(m_meBuffer[i]);
+ m_meBuffer[i] = NULL;
+ }
+ }
+ }
}
diff --git a/source/common/framedata.h b/source/common/framedata.h
index 3c0c524..66fd881 100644
--- a/source/common/framedata.h
+++ b/source/common/framedata.h
@@ -106,6 +106,9 @@ public:
CUDataMemPool m_cuMemPool;
CUData* m_picCTU;
+ RPS* m_spsrps;
+ int m_spsrpsIdx;
+
/* Rate control data used during encode and by references */
struct RCStatCU
{
@@ -123,10 +126,10 @@ public:
uint32_t encodedBits; /* sum of 'totalBits' of encoded CTUs */
uint32_t satdForVbv; /* sum of lowres (estimated) costs for entire row */
uint32_t intraSatdForVbv; /* sum of lowres (estimated) intra costs for entire row */
- uint32_t diagSatd;
- uint32_t diagIntraSatd;
- double diagQp;
- double diagQpScale;
+ uint32_t rowSatd;
+ uint32_t rowIntraSatd;
+ double rowQp;
+ double rowQpScale;
double sumQpRc;
double sumQpAq;
};
@@ -148,6 +151,9 @@ public:
double m_rateFactor; /* calculated based on the Frame QP */
int m_picCsp;
+ uint32_t* m_meIntegral[INTEGRAL_PLANE_NUM]; // 12 integral planes for 32x32, 32x24, 32x8, 24x32, 16x16, 16x12, 16x4, 12x16, 8x32, 8x8, 4x16 and 4x4.
+ uint32_t* m_meBuffer[INTEGRAL_PLANE_NUM];
+
FrameData();
bool create(const x265_param& param, const SPS& sps, int csp);
@@ -168,7 +174,6 @@ struct analysis_intra_data
/* Stores inter analysis data for a single frame */
struct analysis_inter_data
{
- MV* mv;
WeightParam* wt;
int32_t* ref;
uint8_t* depth;
diff --git a/source/common/param.cpp b/source/common/param.cpp
index d373f3a..3e96313 100644
--- a/source/common/param.cpp
+++ b/source/common/param.cpp
@@ -149,6 +149,7 @@ void x265_param_default(x265_param* param)
param->bBPyramid = 1;
param->scenecutThreshold = 40; /* Magic number pulled in from x264 */
param->lookaheadSlices = 8;
+ param->scenecutBias = 5.0;
/* Intra Coding Tools */
param->bEnableConstrainedIntra = 0;
@@ -176,6 +177,7 @@ void x265_param_default(x265_param* param)
param->maxNumReferences = 3;
param->bEnableTemporalMvp = 1;
param->bSourceReferenceEstimation = 0;
+ param->limitTU = 0;
/* Loop Filter */
param->bEnableLoopFilter = 1;
@@ -197,6 +199,7 @@ void x265_param_default(x265_param* param)
param->bCULossless = 0;
param->bEnableTemporalSubLayers = 0;
param->bEnableRdRefine = 0;
+ param->bMultiPassOptRPS = 0;
/* Rate control options */
param->rc.vbvMaxBitrate = 0;
@@ -229,8 +232,6 @@ void x265_param_default(x265_param* param)
param->rc.qpMin = 0;
param->rc.qpMax = QP_MAX_MAX;
- param->bDiscardOptionalVUI = 0;
-
/* Video Usability Information (VUI) */
param->vui.aspectRatioIdc = 0;
param->vui.sarWidth = 0;
@@ -256,8 +257,13 @@ void x265_param_default(x265_param* param)
param->minLuma = 0;
param->maxLuma = PIXEL_MAX;
param->log2MaxPocLsb = 8;
- param->bDiscardSEI = false;
param->maxSlices = 1;
+
+ param->bEmitVUITimingInfo = 1;
+ param->bEmitVUIHRDInfo = 1;
+ param->bOptQpPPS = 1;
+ param->bOptRefListLengthPPS = 1;
+
}
int x265_param_default_preset(x265_param* param, const char* preset, const char* tune)
@@ -901,21 +907,19 @@ int x265_param_parse(x265_param* p, const char* name, const char* value)
// solve "fatal error C1061: compiler limit : blocks nested too deeply"
if (bExtraParams)
{
- bExtraParams = false;
- if (0) ;
- OPT("slices") p->maxSlices = atoi(value);
- else
- bExtraParams = true;
- }
-
- if (bExtraParams)
- {
if (0) ;
OPT("qpmin") p->rc.qpMin = atoi(value);
OPT("analyze-src-pics") p->bSourceReferenceEstimation = atobool(value);
OPT("log2-max-poc-lsb") p->log2MaxPocLsb = atoi(value);
- OPT("discard-sei") p->bDiscardSEI = atobool(value);
- OPT("discard-vui") p->bDiscardOptionalVUI = atobool(value);
+ OPT("vui-timing-info") p->bEmitVUITimingInfo = atobool(value);
+ OPT("vui-hrd-info") p->bEmitVUIHRDInfo = atobool(value);
+ OPT("slices") p->maxSlices = atoi(value);
+ OPT("limit-tu") p->limitTU = atoi(value);
+ OPT("opt-qp-pps") p->bOptQpPPS = atobool(value);
+ OPT("opt-ref-list-length-pps") p->bOptRefListLengthPPS = atobool(value);
+ OPT("multi-pass-opt-rps") p->bMultiPassOptRPS = atobool(value);
+ OPT("scenecut-bias") p->scenecutBias = atof(value);
+
else
return X265_PARAM_BAD_NAME;
}
@@ -1078,8 +1082,8 @@ int x265_check_params(x265_param* param)
"Multiple-Slices mode must be enable Wavefront Parallel Processing (--wpp)");
CHECK(param->internalBitDepth != X265_DEPTH,
"internalBitDepth must match compiled bit depth");
- CHECK(param->minCUSize != 64 && param->minCUSize != 32 && param->minCUSize != 16 && param->minCUSize != 8,
- "minimim CU size must be 8, 16, 32, or 64");
+ CHECK(param->minCUSize != 32 && param->minCUSize != 16 && param->minCUSize != 8,
+ "minimum CU size must be 8, 16, or 32");
CHECK(param->minCUSize > param->maxCUSize,
"min CU size must be less than or equal to max CU size");
CHECK(param->rc.qp < -6 * (param->internalBitDepth - 8) || param->rc.qp > QP_MAX_SPEC,
@@ -1088,8 +1092,8 @@ int x265_check_params(x265_param* param)
"Frame rate numerator and denominator must be specified");
CHECK(param->interlaceMode < 0 || param->interlaceMode > 2,
"Interlace mode must be 0 (progressive) 1 (top-field first) or 2 (bottom field first)");
- CHECK(param->searchMethod<0 || param->searchMethod> X265_FULL_SEARCH,
- "Search method is not supported value (0:DIA 1:HEX 2:UMH 3:HM 5:FULL)");
+ CHECK(param->searchMethod < 0 || param->searchMethod > X265_FULL_SEARCH,
+ "Search method is not supported value (0:DIA 1:HEX 2:UMH 3:HM 4:SEA 5:FULL)");
CHECK(param->searchRange < 0,
"Search Range must be more than 0");
CHECK(param->searchRange >= 32768,
@@ -1122,6 +1126,7 @@ int x265_check_params(x265_param* param)
"QuadtreeTUMaxDepthInter must be less than or equal to the difference between log2(maxCUSize) and QuadtreeTULog2MinSize plus 1");
CHECK((param->maxTUSize != 32 && param->maxTUSize != 16 && param->maxTUSize != 8 && param->maxTUSize != 4),
"max TU size must be 4, 8, 16, or 32");
+ CHECK(param->limitTU > 4, "Invalid limit-tu option, limit-TU must be between 0 and 4");
CHECK(param->maxNumMergeCand < 1, "MaxNumMergeCand must be 1 or greater.");
CHECK(param->maxNumMergeCand > 5, "MaxNumMergeCand must be 5 or smaller.");
@@ -1217,6 +1222,8 @@ int x265_check_params(x265_param* param)
"Valid Logging level -1:none 0:error 1:warning 2:info 3:debug 4:full");
CHECK(param->scenecutThreshold < 0,
"scenecutThreshold must be greater than 0");
+ CHECK(param->scenecutBias < 0 || 100 < param->scenecutBias,
+ "scenecut-bias must be between 0 and 100");
CHECK(param->rdPenalty < 0 || param->rdPenalty > 2,
"Valid penalty for 32x32 intra TU in non-I slices. 0:disabled 1:RD-penalty 2:maximum");
CHECK(param->keyframeMax < -1,
@@ -1247,10 +1254,12 @@ int x265_check_params(x265_param* param)
"qpmax exceeds supported range (0 to 69)");
CHECK(param->rc.qpMin < QP_MIN || param->rc.qpMin > QP_MAX_MAX,
"qpmin exceeds supported range (0 to 69)");
- CHECK(param->log2MaxPocLsb < 4,
- "maximum of the picture order count can not be less than 4");
- CHECK(1 > param->maxSlices || param->maxSlices > ((param->sourceHeight + param->maxCUSize - 1) / param->maxCUSize),
- "The slices can not be more than number of rows");
+ CHECK(param->log2MaxPocLsb < 4 || param->log2MaxPocLsb > 16,
+ "Supported range for log2MaxPocLsb is 4 to 16");
+#if !X86_64
+ CHECK(param->searchMethod == X265_SEA && (param->sourceWidth > 840 || param->sourceHeight > 480),
+ "SEA motion search does not support resolutions greater than 480p in 32 bit build");
+#endif
return check_failed;
}
@@ -1338,9 +1347,8 @@ void x265_print_params(x265_param* param)
x265_log(param, X265_LOG_INFO, "ME / range / subpel / merge : %s / %d / %d / %d\n",
x265_motion_est_names[param->searchMethod], param->searchRange, param->subpelRefine, param->maxNumMergeCand);
-
if (param->keyframeMax != INT_MAX || param->scenecutThreshold)
- x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut : %d / %d / %d\n", param->keyframeMin, param->keyframeMax, param->scenecutThreshold);
+ x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut / bias: %d / %d / %d / %.2lf\n", param->keyframeMin, param->keyframeMax, param->scenecutThreshold, param->scenecutBias * 100);
else
x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut : disabled\n");
@@ -1395,6 +1403,7 @@ void x265_print_params(x265_param* param)
TOOLVAL(param->noiseReductionInter, "nr-inter=%d");
TOOLOPT(param->bEnableTSkipFast, "tskip-fast");
TOOLOPT(!param->bEnableTSkipFast && param->bEnableTransformSkip, "tskip");
+ TOOLVAL(param->limitTU, "limit-tu=%d");
TOOLOPT(param->bCULossless, "cu-lossless");
TOOLOPT(param->bEnableSignHiding, "signhide");
TOOLOPT(param->bEnableTemporalMvp, "tmvp");
@@ -1423,7 +1432,7 @@ void x265_print_params(x265_param* param)
fflush(stderr);
}
-char *x265_param2string(x265_param* p)
+char *x265_param2string(x265_param* p, int padx, int pady)
{
char *buf, *s;
@@ -1434,70 +1443,92 @@ char *x265_param2string(x265_param* p)
#define BOOL(param, cliopt) \
s += sprintf(s, " %s", (param) ? cliopt : "no-" cliopt);
- s += sprintf(s, "%dx%d", p->sourceWidth,p->sourceHeight);
- s += sprintf(s, " fps=%u/%u", p->fpsNum, p->fpsDenom);
- s += sprintf(s, " bitdepth=%d", p->internalBitDepth);
+ s += sprintf(s, "cpuid=%d", p->cpuid);
+ s += sprintf(s, " frame-threads=%d", p->frameNumThreads);
+ if (p->numaPools)
+ s += sprintf(s, " numa-pools=%s", p->numaPools);
BOOL(p->bEnableWavefront, "wpp");
+ BOOL(p->bDistributeModeAnalysis, "pmode");
+ BOOL(p->bDistributeMotionEstimation, "pme");
+ BOOL(p->bEnablePsnr, "psnr");
+ BOOL(p->bEnableSsim, "ssim");
+ s += sprintf(s, " log-level=%d", p->logLevel);
+ s += sprintf(s, " bitdepth=%d", p->internalBitDepth);
+ s += sprintf(s, " input-csp=%d", p->internalCsp);
+ s += sprintf(s, " fps=%u/%u", p->fpsNum, p->fpsDenom);
+ s += sprintf(s, " input-res=%dx%d", p->sourceWidth - padx, p->sourceHeight - pady);
+ s += sprintf(s, " interlace=%d", p->interlaceMode);
+ s += sprintf(s, " total-frames=%d", p->totalFrames);
+ s += sprintf(s, " level-idc=%d", p->levelIdc);
+ s += sprintf(s, " high-tier=%d", p->bHighTier);
+ s += sprintf(s, " uhd-bd=%d", p->uhdBluray);
+ s += sprintf(s, " ref=%d", p->maxNumReferences);
+ BOOL(p->bAllowNonConformance, "allow-non-conformance");
+ BOOL(p->bRepeatHeaders, "repeat-headers");
+ BOOL(p->bAnnexB, "annexb");
+ BOOL(p->bEnableAccessUnitDelimiters, "aud");
+ BOOL(p->bEmitHRDSEI, "hrd");
+ BOOL(p->bEmitInfoSEI, "info");
+ s += sprintf(s, " hash=%d", p->decodedPictureHashSEI);
+ BOOL(p->bEnableTemporalSubLayers, "temporal-layers");
+ BOOL(p->bOpenGOP, "open-gop");
+ s += sprintf(s, " min-keyint=%d", p->keyframeMin);
+ s += sprintf(s, " keyint=%d", p->keyframeMax);
+ s += sprintf(s, " bframes=%d", p->bframes);
+ s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive);
+ BOOL(p->bBPyramid, "b-pyramid");
+ s += sprintf(s, " bframe-bias=%d", p->bFrameBias);
+ s += sprintf(s, " rc-lookahead=%d", p->lookaheadDepth);
+ s += sprintf(s, " lookahead-slices=%d", p->lookaheadSlices);
+ s += sprintf(s, " scenecut=%d", p->scenecutThreshold);
+ BOOL(p->bIntraRefresh, "intra-refresh");
s += sprintf(s, " ctu=%d", p->maxCUSize);
s += sprintf(s, " min-cu-size=%d", p->minCUSize);
- s += sprintf(s, " max-tu-size=%d", p->maxTUSize);
- s += sprintf(s, " tu-intra-depth=%d", p->tuQTMaxIntraDepth);
- s += sprintf(s, " tu-inter-depth=%d", p->tuQTMaxInterDepth);
- s += sprintf(s, " me=%d", p->searchMethod);
- s += sprintf(s, " subme=%d", p->subpelRefine);
- s += sprintf(s, " merange=%d", p->searchRange);
BOOL(p->bEnableRectInter, "rect");
BOOL(p->bEnableAMP, "amp");
- s += sprintf(s, " max-merge=%d", p->maxNumMergeCand);
- BOOL(p->bEnableTemporalMvp, "temporal-mvp");
- BOOL(p->bEnableEarlySkip, "early-skip");
- BOOL(p->bEnableRecursionSkip, "rskip");
- s += sprintf(s, " rdpenalty=%d", p->rdPenalty);
+ s += sprintf(s, " max-tu-size=%d", p->maxTUSize);
+ s += sprintf(s, " tu-inter-depth=%d", p->tuQTMaxInterDepth);
+ s += sprintf(s, " tu-intra-depth=%d", p->tuQTMaxIntraDepth);
+ s += sprintf(s, " limit-tu=%d", p->limitTU);
+ s += sprintf(s, " rdoq-level=%d", p->rdoqLevel);
+ BOOL(p->bEnableSignHiding, "signhide");
BOOL(p->bEnableTransformSkip, "tskip");
- BOOL(p->bEnableTSkipFast, "tskip-fast");
- BOOL(p->bEnableStrongIntraSmoothing, "strong-intra-smoothing");
- BOOL(p->bLossless, "lossless");
- BOOL(p->bCULossless, "cu-lossless");
+ s += sprintf(s, " nr-intra=%d", p->noiseReductionIntra);
+ s += sprintf(s, " nr-inter=%d", p->noiseReductionInter);
BOOL(p->bEnableConstrainedIntra, "constrained-intra");
- BOOL(p->bEnableFastIntra, "fast-intra");
- BOOL(p->bOpenGOP, "open-gop");
- BOOL(p->bEnableTemporalSubLayers, "temporal-layers");
- s += sprintf(s, " interlace=%d", p->interlaceMode);
- s += sprintf(s, " keyint=%d", p->keyframeMax);
- s += sprintf(s, " min-keyint=%d", p->keyframeMin);
- s += sprintf(s, " scenecut=%d", p->scenecutThreshold);
- s += sprintf(s, " rc-lookahead=%d", p->lookaheadDepth);
- s += sprintf(s, " lookahead-slices=%d", p->lookaheadSlices);
- s += sprintf(s, " bframes=%d", p->bframes);
- s += sprintf(s, " bframe-bias=%d", p->bFrameBias);
- s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive);
- s += sprintf(s, " ref=%d", p->maxNumReferences);
+ BOOL(p->bEnableStrongIntraSmoothing, "strong-intra-smoothing");
+ s += sprintf(s, " max-merge=%d", p->maxNumMergeCand);
s += sprintf(s, " limit-refs=%d", p->limitReferences);
BOOL(p->limitModes, "limit-modes");
+ s += sprintf(s, " me=%d", p->searchMethod);
+ s += sprintf(s, " subme=%d", p->subpelRefine);
+ s += sprintf(s, " merange=%d", p->searchRange);
+ BOOL(p->bEnableTemporalMvp, "temporal-mvp");
BOOL(p->bEnableWeightedPred, "weightp");
BOOL(p->bEnableWeightedBiPred, "weightb");
- s += sprintf(s, " aq-mode=%d", p->rc.aqMode);
- s += sprintf(s, " qg-size=%d", p->rc.qgSize);
- s += sprintf(s, " aq-strength=%.2f", p->rc.aqStrength);
- s += sprintf(s, " cbqpoffs=%d", p->cbQpOffset);
- s += sprintf(s, " crqpoffs=%d", p->crQpOffset);
- s += sprintf(s, " rd=%d", p->rdLevel);
- s += sprintf(s, " psy-rd=%.2f", p->psyRd);
- s += sprintf(s, " rdoq-level=%d", p->rdoqLevel);
- s += sprintf(s, " psy-rdoq=%.2f", p->psyRdoq);
- s += sprintf(s, " log2-max-poc-lsb=%d", p->log2MaxPocLsb);
- BOOL(p->bEnableRdRefine, "rd-refine");
- BOOL(p->bEnableSignHiding, "signhide");
+ BOOL(p->bSourceReferenceEstimation, "analyze-src-pics");
BOOL(p->bEnableLoopFilter, "deblock");
if (p->bEnableLoopFilter)
s += sprintf(s, "=%d:%d", p->deblockingFilterTCOffset, p->deblockingFilterBetaOffset);
BOOL(p->bEnableSAO, "sao");
BOOL(p->bSaoNonDeblocked, "sao-non-deblock");
- BOOL(p->bBPyramid, "b-pyramid");
- BOOL(p->rc.cuTree, "cutree");
- BOOL(p->bIntraRefresh, "intra-refresh");
+ s += sprintf(s, " rd=%d", p->rdLevel);
+ BOOL(p->bEnableEarlySkip, "early-skip");
+ BOOL(p->bEnableRecursionSkip, "rskip");
+ BOOL(p->bEnableFastIntra, "fast-intra");
+ BOOL(p->bEnableTSkipFast, "tskip-fast");
+ BOOL(p->bCULossless, "cu-lossless");
+ BOOL(p->bIntraInBFrames, "b-intra");
+ s += sprintf(s, " rdpenalty=%d", p->rdPenalty);
+ s += sprintf(s, " psy-rd=%.2f", p->psyRd);
+ s += sprintf(s, " psy-rdoq=%.2f", p->psyRdoq);
+ BOOL(p->bEnableRdRefine, "rd-refine");
+ s += sprintf(s, " analysis-mode=%d", p->analysisMode);
+ BOOL(p->bLossless, "lossless");
+ s += sprintf(s, " cbqpoffs=%d", p->cbQpOffset);
+ s += sprintf(s, " crqpoffs=%d", p->crQpOffset);
s += sprintf(s, " rc=%s", p->rc.rateControlMode == X265_RC_ABR ? (
- p->rc.bStatRead ? "2 pass" : p->rc.bitrate == p->rc.vbvMaxBitrate ? "cbr" : "abr")
+ p->rc.bitrate == p->rc.vbvMaxBitrate ? "cbr" : "abr")
: p->rc.rateControlMode == X265_RC_CRF ? "crf" : "cqp");
if (p->rc.rateControlMode == X265_RC_ABR || p->rc.rateControlMode == X265_RC_CRF)
{
@@ -1505,17 +1536,20 @@ char *x265_param2string(x265_param* p)
s += sprintf(s, " crf=%.1f", p->rc.rfConstant);
else
s += sprintf(s, " bitrate=%d", p->rc.bitrate);
- s += sprintf(s, " qcomp=%.2f qpmin=%d qpmax=%d qpstep=%d",
- p->rc.qCompress, p->rc.qpMin, p->rc.qpMax, p->rc.qpStep);
+ s += sprintf(s, " qcomp=%.2f qpstep=%d", p->rc.qCompress, p->rc.qpStep);
+ s += sprintf(s, " stats-write=%d", p->rc.bStatWrite);
+ s += sprintf(s, " stats-read=%d", p->rc.bStatRead);
if (p->rc.bStatRead)
- s += sprintf( s, " cplxblur=%.1f qblur=%.1f",
- p->rc.complexityBlur, p->rc.qblur);
+ s += sprintf(s, " cplxblur=%.1f qblur=%.1f",
+ p->rc.complexityBlur, p->rc.qblur);
+ if (p->rc.bStatWrite && !p->rc.bStatRead)
+ BOOL(p->rc.bEnableSlowFirstPass, "slow-firstpass");
if (p->rc.vbvBufferSize)
{
- s += sprintf(s, " vbv-maxrate=%d vbv-bufsize=%d",
- p->rc.vbvMaxBitrate, p->rc.vbvBufferSize);
+ s += sprintf(s, " vbv-maxrate=%d vbv-bufsize=%d vbv-init=%.1f",
+ p->rc.vbvMaxBitrate, p->rc.vbvBufferSize, p->rc.vbvBufferInit);
if (p->rc.rateControlMode == X265_RC_CRF)
- s += sprintf(s, " crf-max=%.1f", p->rc.rfConstantMax);
+ s += sprintf(s, " crf-max=%.1f crf-min=%.1f", p->rc.rfConstantMax, p->rc.rfConstantMin);
}
}
else if (p->rc.rateControlMode == X265_RC_CQP)
@@ -1526,6 +1560,59 @@ char *x265_param2string(x265_param* p)
if (p->bframes)
s += sprintf(s, " pbratio=%.2f", p->rc.pbFactor);
}
+ s += sprintf(s, " aq-mode=%d", p->rc.aqMode);
+ s += sprintf(s, " aq-strength=%.2f", p->rc.aqStrength);
+ BOOL(p->rc.cuTree, "cutree");
+ s += sprintf(s, " zone-count=%d", p->rc.zoneCount);
+ if (p->rc.zoneCount)
+ {
+ for (int i = 0; i < p->rc.zoneCount; ++i)
+ {
+ s += sprintf(s, " zones: start-frame=%d end-frame=%d",
+ p->rc.zones[i].startFrame, p->rc.zones[i].endFrame);
+ if (p->rc.zones[i].bForceQp)
+ s += sprintf(s, " qp=%d", p->rc.zones[i].qp);
+ else
+ s += sprintf(s, " bitrate-factor=%f", p->rc.zones[i].bitrateFactor);
+ }
+ }
+ BOOL(p->rc.bStrictCbr, "strict-cbr");
+ s += sprintf(s, " qg-size=%d", p->rc.qgSize);
+ BOOL(p->rc.bEnableGrain, "rc-grain");
+ s += sprintf(s, " qpmax=%d qpmin=%d", p->rc.qpMax, p->rc.qpMin);
+ s += sprintf(s, " sar=%d", p->vui.aspectRatioIdc);
+ if (p->vui.aspectRatioIdc == X265_EXTENDED_SAR)
+ s += sprintf(s, " sar-width : sar-height=%d:%d", p->vui.sarWidth, p->vui.sarHeight);
+ s += sprintf(s, " overscan=%d", p->vui.bEnableOverscanInfoPresentFlag);
+ if (p->vui.bEnableOverscanInfoPresentFlag)
+ s += sprintf(s, " overscan-crop=%d", p->vui.bEnableOverscanAppropriateFlag);
+ s += sprintf(s, " videoformat=%d", p->vui.videoFormat);
+ s += sprintf(s, " range=%d", p->vui.bEnableVideoFullRangeFlag);
+ s += sprintf(s, " colorprim=%d", p->vui.colorPrimaries);
+ s += sprintf(s, " transfer=%d", p->vui.transferCharacteristics);
+ s += sprintf(s, " colormatrix=%d", p->vui.matrixCoeffs);
+ s += sprintf(s, " chromaloc=%d", p->vui.bEnableChromaLocInfoPresentFlag);
+ if (p->vui.bEnableChromaLocInfoPresentFlag)
+ s += sprintf(s, " chromaloc-top=%d chromaloc-bottom=%d",
+ p->vui.chromaSampleLocTypeTopField, p->vui.chromaSampleLocTypeBottomField);
+ s += sprintf(s, " display-window=%d", p->vui.bEnableDefaultDisplayWindowFlag);
+ if (p->vui.bEnableDefaultDisplayWindowFlag)
+ s += sprintf(s, " left=%d top=%d right=%d bottom=%d",
+ p->vui.defDispWinLeftOffset, p->vui.defDispWinTopOffset,
+ p->vui.defDispWinRightOffset, p->vui.defDispWinBottomOffset);
+ if (p->masteringDisplayColorVolume)
+ s += sprintf(s, " master-display=%s", p->masteringDisplayColorVolume);
+ s += sprintf(s, " max-cll=%hu,%hu", p->maxCLL, p->maxFALL);
+ s += sprintf(s, " min-luma=%hu", p->minLuma);
+ s += sprintf(s, " max-luma=%hu", p->maxLuma);
+ s += sprintf(s, " log2-max-poc-lsb=%d", p->log2MaxPocLsb);
+ BOOL(p->bEmitVUITimingInfo, "vui-timing-info");
+ BOOL(p->bEmitVUIHRDInfo, "vui-hrd-info");
+ s += sprintf(s, " slices=%d", p->maxSlices);
+ BOOL(p->bOptQpPPS, "opt-qp-pps");
+ BOOL(p->bOptRefListLengthPPS, "opt-ref-list-length-pps");
+ BOOL(p->bMultiPassOptRPS, "multi-pass-opt-rps");
+ s += sprintf(s, " scenecut-bias=%.2f", p->scenecutBias);
#undef BOOL
return buf;
}
diff --git a/source/common/param.h b/source/common/param.h
index 4ffc7a8..74c05e1 100644
--- a/source/common/param.h
+++ b/source/common/param.h
@@ -31,7 +31,7 @@ int x265_check_params(x265_param *param);
int x265_set_globals(x265_param *param);
void x265_print_params(x265_param *param);
void x265_param_apply_fastfirstpass(x265_param *p);
-char* x265_param2string(x265_param *param);
+char* x265_param2string(x265_param *param, int padx, int pady);
int x265_atoi(const char *str, bool& bError);
double x265_atof(const char *str, bool& bError);
int parseCpuName(const char *value, bool& bError);
diff --git a/source/common/pixel.cpp b/source/common/pixel.cpp
index 1d98bfd..af8df75 100644
--- a/source/common/pixel.cpp
+++ b/source/common/pixel.cpp
@@ -117,6 +117,52 @@ void sad_x4(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel
}
}
+template<int lx, int ly>
+int ads_x4(int encDC[4], uint32_t *sums, int delta, uint16_t *costMvX, int16_t *mvs, int width, int thresh)
+{
+ int nmv = 0;
+ for (int16_t i = 0; i < width; i++, sums++)
+ {
+ int ads = abs(encDC[0] - long(sums[0]))
+ + abs(encDC[1] - long(sums[lx >> 1]))
+ + abs(encDC[2] - long(sums[delta]))
+ + abs(encDC[3] - long(sums[delta + (lx >> 1)]))
+ + costMvX[i];
+ if (ads < thresh)
+ mvs[nmv++] = i;
+ }
+ return nmv;
+}
+
+template<int lx, int ly>
+int ads_x2(int encDC[2], uint32_t *sums, int delta, uint16_t *costMvX, int16_t *mvs, int width, int thresh)
+{
+ int nmv = 0;
+ for (int16_t i = 0; i < width; i++, sums++)
+ {
+ int ads = abs(encDC[0] - long(sums[0]))
+ + abs(encDC[1] - long(sums[delta]))
+ + costMvX[i];
+ if (ads < thresh)
+ mvs[nmv++] = i;
+ }
+ return nmv;
+}
+
+template<int lx, int ly>
+int ads_x1(int encDC[1], uint32_t *sums, int, uint16_t *costMvX, int16_t *mvs, int width, int thresh)
+{
+ int nmv = 0;
+ for (int16_t i = 0; i < width; i++, sums++)
+ {
+ int ads = abs(encDC[0] - long(sums[0]))
+ + costMvX[i];
+ if (ads < thresh)
+ mvs[nmv++] = i;
+ }
+ return nmv;
+}
+
template<int lx, int ly, class T1, class T2>
sse_t sse(const T1* pix1, intptr_t stride_pix1, const T2* pix2, intptr_t stride_pix2)
{
@@ -991,6 +1037,32 @@ void setupPixelPrimitives_c(EncoderPrimitives &p)
LUMA_PU(64, 16);
LUMA_PU(16, 64);
+ p.pu[LUMA_4x4].ads = ads_x1<4, 4>;
+ p.pu[LUMA_8x8].ads = ads_x1<8, 8>;
+ p.pu[LUMA_8x4].ads = ads_x2<8, 4>;
+ p.pu[LUMA_4x8].ads = ads_x2<4, 8>;
+ p.pu[LUMA_16x16].ads = ads_x4<16, 16>;
+ p.pu[LUMA_16x8].ads = ads_x2<16, 8>;
+ p.pu[LUMA_8x16].ads = ads_x2<8, 16>;
+ p.pu[LUMA_16x12].ads = ads_x1<16, 12>;
+ p.pu[LUMA_12x16].ads = ads_x1<12, 16>;
+ p.pu[LUMA_16x4].ads = ads_x1<16, 4>;
+ p.pu[LUMA_4x16].ads = ads_x1<4, 16>;
+ p.pu[LUMA_32x32].ads = ads_x4<32, 32>;
+ p.pu[LUMA_32x16].ads = ads_x2<32, 16>;
+ p.pu[LUMA_16x32].ads = ads_x2<16, 32>;
+ p.pu[LUMA_32x24].ads = ads_x4<32, 24>;
+ p.pu[LUMA_24x32].ads = ads_x4<24, 32>;
+ p.pu[LUMA_32x8].ads = ads_x4<32, 8>;
+ p.pu[LUMA_8x32].ads = ads_x4<8, 32>;
+ p.pu[LUMA_64x64].ads = ads_x4<64, 64>;
+ p.pu[LUMA_64x32].ads = ads_x2<64, 32>;
+ p.pu[LUMA_32x64].ads = ads_x2<32, 64>;
+ p.pu[LUMA_64x48].ads = ads_x4<64, 48>;
+ p.pu[LUMA_48x64].ads = ads_x4<48, 64>;
+ p.pu[LUMA_64x16].ads = ads_x4<64, 16>;
+ p.pu[LUMA_16x64].ads = ads_x4<16, 64>;
+
p.pu[LUMA_4x4].satd = satd_4x4;
p.pu[LUMA_8x8].satd = satd8<8, 8>;
p.pu[LUMA_8x4].satd = satd_8x4;
diff --git a/source/common/ppc/dct_altivec.cpp b/source/common/ppc/dct_altivec.cpp
new file mode 100644
index 0000000..7542a8e
--- /dev/null
+++ b/source/common/ppc/dct_altivec.cpp
@@ -0,0 +1,819 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Roger Moussalli <rmoussal at us.ibm.com>
+ * Min Chen <min.chen at multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "common.h"
+#include "primitives.h"
+#include "contexts.h" // costCoeffNxN_c
+#include "threading.h" // CLZ
+#include "ppccommon.h"
+
+using namespace X265_NS;
+
+static uint32_t quant_altivec(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)
+{
+
+ X265_CHECK(qBits >= 8, "qBits less than 8\n");
+
+ X265_CHECK((numCoeff % 16) == 0, "numCoeff must be multiple of 16\n");
+
+ int qBits8 = qBits - 8;
+ uint32_t numSig = 0;
+
+
+ int level[8] ;
+ int sign[8] ;
+ int tmplevel[8] ;
+
+ const vector signed short v_zeros = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ const vector signed short v_neg1 = {-1, -1, -1, -1, -1, -1, -1, -1} ;
+ const vector signed short v_pos1_ss = {1, 1, 1, 1, 1, 1, 1, 1} ;
+ const vector signed int v_pos1_sw = {1, 1, 1, 1} ;
+
+ const vector signed int v_clip_high = {32767, 32767, 32767, 32767} ;
+ const vector signed int v_clip_low = {-32768, -32768, -32768, -32768} ;
+
+
+ vector signed short v_level_ss ;
+ vector signed int v_level_0, v_level_1 ;
+ vector signed int v_tmplevel_0, v_tmplevel_1 ;
+ vector signed short v_sign_ss ;
+ vector signed int v_sign_0, v_sign_1 ;
+ vector signed int v_quantCoeff_0, v_quantCoeff_1 ;
+
+ vector signed int v_numSig = {0, 0, 0, 0} ;
+
+ vector signed int v_add ;
+ v_add[0] = add ;
+ v_add = vec_splat(v_add, 0) ;
+
+ vector unsigned int v_qBits ;
+ v_qBits[0] = qBits ;
+ v_qBits = vec_splat(v_qBits, 0) ;
+
+ vector unsigned int v_qBits8 ;
+ v_qBits8[0] = qBits8 ;
+ v_qBits8 = vec_splat(v_qBits8, 0) ;
+
+
+ for (int blockpos_outer = 0; blockpos_outer < numCoeff; blockpos_outer+=16)
+ {
+ int blockpos = blockpos_outer ;
+
+ // for(int ii=0; ii<8; ii++) { level[ii] = coef[blockpos+ii] ;}
+ v_level_ss = vec_xl(0, &coef[blockpos]) ;
+ v_level_0 = vec_unpackh(v_level_ss) ;
+ v_level_1 = vec_unpackl(v_level_ss) ;
+
+
+ // for(int ii=0; ii<8; ii++) { sign[ii] = (level[ii] < 0 ? -1 : 1) ;}
+ vector bool short v_level_cmplt0 ;
+ v_level_cmplt0 = vec_cmplt(v_level_ss, v_zeros) ;
+ v_sign_ss = vec_sel(v_pos1_ss, v_neg1, v_level_cmplt0) ;
+ v_sign_0 = vec_unpackh(v_sign_ss) ;
+ v_sign_1 = vec_unpackl(v_sign_ss) ;
+
+
+
+ // for(int ii=0; ii<8; ii++) { tmplevel[ii] = abs(level[ii]) * quantCoeff[blockpos+ii] ;}
+ v_level_0 = vec_abs(v_level_0) ;
+ v_level_1 = vec_abs(v_level_1) ;
+ v_quantCoeff_0 = vec_xl(0, &quantCoeff[blockpos]) ;
+ v_quantCoeff_1 = vec_xl(16, &quantCoeff[blockpos]) ;
+
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_tmplevel_0)
+ : "v" (v_level_0) , "v" (v_quantCoeff_0)
+ ) ;
+
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_tmplevel_1)
+ : "v" (v_level_1) , "v" (v_quantCoeff_1)
+ ) ;
+
+
+
+ // for(int ii=0; ii<8; ii++) { level[ii] = ((tmplevel[ii] + add) >> qBits) ;}
+ v_level_0 = vec_sra(vec_add(v_tmplevel_0, v_add), v_qBits) ;
+ v_level_1 = vec_sra(vec_add(v_tmplevel_1, v_add), v_qBits) ;
+
+ // for(int ii=0; ii<8; ii++) { deltaU[blockpos+ii] = ((tmplevel[ii] - (level[ii] << qBits)) >> qBits8) ;}
+ vector signed int v_temp_0_sw, v_temp_1_sw ;
+ v_temp_0_sw = vec_sl(v_level_0, v_qBits) ;
+ v_temp_1_sw = vec_sl(v_level_1, v_qBits) ;
+
+ v_temp_0_sw = vec_sub(v_tmplevel_0, v_temp_0_sw) ;
+ v_temp_1_sw = vec_sub(v_tmplevel_1, v_temp_1_sw) ;
+
+ v_temp_0_sw = vec_sra(v_temp_0_sw, v_qBits8) ;
+ v_temp_1_sw = vec_sra(v_temp_1_sw, v_qBits8) ;
+
+ vec_xst(v_temp_0_sw, 0, &deltaU[blockpos]) ;
+ vec_xst(v_temp_1_sw, 16, &deltaU[blockpos]) ;
+
+
+ // for(int ii=0; ii<8; ii++) { if(level[ii]) ++numSig ; }
+ vector bool int v_level_cmpeq0 ;
+ vector signed int v_level_inc ;
+ v_level_cmpeq0 = vec_cmpeq(v_level_0, (vector signed int)v_zeros) ;
+ v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ;
+ v_numSig = vec_add(v_numSig, v_level_inc) ;
+
+ v_level_cmpeq0 = vec_cmpeq(v_level_1, (vector signed int)v_zeros) ;
+ v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ;
+ v_numSig = vec_add(v_numSig, v_level_inc) ;
+
+
+ // for(int ii=0; ii<8; ii++) { level[ii] *= sign[ii]; }
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_level_0)
+ : "v" (v_level_0) , "v" (v_sign_0)
+ ) ;
+
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_level_1)
+ : "v" (v_level_1) , "v" (v_sign_1)
+ ) ;
+
+
+
+ // for(int ii=0; ii<8; ii++) {qCoef[blockpos+ii] = (int16_t)x265_clip3(-32768, 32767, level[ii]);}
+ vector bool int v_level_cmp_clip_high, v_level_cmp_clip_low ;
+
+ v_level_cmp_clip_high = vec_cmpgt(v_level_0, v_clip_high) ;
+ v_level_0 = vec_sel(v_level_0, v_clip_high, v_level_cmp_clip_high) ;
+ v_level_cmp_clip_low = vec_cmplt(v_level_0, v_clip_low) ;
+ v_level_0 = vec_sel(v_level_0, v_clip_low, v_level_cmp_clip_low) ;
+
+
+ v_level_cmp_clip_high = vec_cmpgt(v_level_1, v_clip_high) ;
+ v_level_1 = vec_sel(v_level_1, v_clip_high, v_level_cmp_clip_high) ;
+ v_level_cmp_clip_low = vec_cmplt(v_level_1, v_clip_low) ;
+ v_level_1 = vec_sel(v_level_1, v_clip_low, v_level_cmp_clip_low) ;
+
+ v_level_ss = vec_pack(v_level_0, v_level_1) ;
+
+ vec_xst(v_level_ss, 0, &qCoef[blockpos]) ;
+
+
+
+
+ // UNROLL ONCE MORE (safe because the loop runs a multiple of 16 iterations, though that is NOT obvious to the compiler)
+ blockpos += 8 ;
+
+ // for(int ii=0; ii<8; ii++) { level[ii] = coef[blockpos+ii] ;}
+ v_level_ss = vec_xl(0, &coef[blockpos]) ;
+ v_level_0 = vec_unpackh(v_level_ss) ;
+ v_level_1 = vec_unpackl(v_level_ss) ;
+
+
+ // for(int ii=0; ii<8; ii++) { sign[ii] = (level[ii] < 0 ? -1 : 1) ;}
+ v_level_cmplt0 = vec_cmplt(v_level_ss, v_zeros) ;
+ v_sign_ss = vec_sel(v_pos1_ss, v_neg1, v_level_cmplt0) ;
+ v_sign_0 = vec_unpackh(v_sign_ss) ;
+ v_sign_1 = vec_unpackl(v_sign_ss) ;
+
+
+
+ // for(int ii=0; ii<8; ii++) { tmplevel[ii] = abs(level[ii]) * quantCoeff[blockpos+ii] ;}
+ v_level_0 = vec_abs(v_level_0) ;
+ v_level_1 = vec_abs(v_level_1) ;
+ v_quantCoeff_0 = vec_xl(0, &quantCoeff[blockpos]) ;
+ v_quantCoeff_1 = vec_xl(16, &quantCoeff[blockpos]) ;
+
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_tmplevel_0)
+ : "v" (v_level_0) , "v" (v_quantCoeff_0)
+ ) ;
+
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_tmplevel_1)
+ : "v" (v_level_1) , "v" (v_quantCoeff_1)
+ ) ;
+
+
+
+ // for(int ii=0; ii<8; ii++) { level[ii] = ((tmplevel[ii] + add) >> qBits) ;}
+ v_level_0 = vec_sra(vec_add(v_tmplevel_0, v_add), v_qBits) ;
+ v_level_1 = vec_sra(vec_add(v_tmplevel_1, v_add), v_qBits) ;
+
+ // for(int ii=0; ii<8; ii++) { deltaU[blockpos+ii] = ((tmplevel[ii] - (level[ii] << qBits)) >> qBits8) ;}
+ v_temp_0_sw = vec_sl(v_level_0, v_qBits) ;
+ v_temp_1_sw = vec_sl(v_level_1, v_qBits) ;
+
+ v_temp_0_sw = vec_sub(v_tmplevel_0, v_temp_0_sw) ;
+ v_temp_1_sw = vec_sub(v_tmplevel_1, v_temp_1_sw) ;
+
+ v_temp_0_sw = vec_sra(v_temp_0_sw, v_qBits8) ;
+ v_temp_1_sw = vec_sra(v_temp_1_sw, v_qBits8) ;
+
+ vec_xst(v_temp_0_sw, 0, &deltaU[blockpos]) ;
+ vec_xst(v_temp_1_sw, 16, &deltaU[blockpos]) ;
+
+
+ // for(int ii=0; ii<8; ii++) { if(level[ii]) ++numSig ; }
+ v_level_cmpeq0 = vec_cmpeq(v_level_0, (vector signed int)v_zeros) ;
+ v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ;
+ v_numSig = vec_add(v_numSig, v_level_inc) ;
+
+ v_level_cmpeq0 = vec_cmpeq(v_level_1, (vector signed int)v_zeros) ;
+ v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ;
+ v_numSig = vec_add(v_numSig, v_level_inc) ;
+
+
+ // for(int ii=0; ii<8; ii++) { level[ii] *= sign[ii]; }
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_level_0)
+ : "v" (v_level_0) , "v" (v_sign_0)
+ ) ;
+
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_level_1)
+ : "v" (v_level_1) , "v" (v_sign_1)
+ ) ;
+
+
+
+ // for(int ii=0; ii<8; ii++) {qCoef[blockpos+ii] = (int16_t)x265_clip3(-32768, 32767, level[ii]);}
+ v_level_cmp_clip_high = vec_cmpgt(v_level_0, v_clip_high) ;
+ v_level_0 = vec_sel(v_level_0, v_clip_high, v_level_cmp_clip_high) ;
+ v_level_cmp_clip_low = vec_cmplt(v_level_0, v_clip_low) ;
+ v_level_0 = vec_sel(v_level_0, v_clip_low, v_level_cmp_clip_low) ;
+
+
+ v_level_cmp_clip_high = vec_cmpgt(v_level_1, v_clip_high) ;
+ v_level_1 = vec_sel(v_level_1, v_clip_high, v_level_cmp_clip_high) ;
+ v_level_cmp_clip_low = vec_cmplt(v_level_1, v_clip_low) ;
+ v_level_1 = vec_sel(v_level_1, v_clip_low, v_level_cmp_clip_low) ;
+
+ v_level_ss = vec_pack(v_level_0, v_level_1) ;
+
+ vec_xst(v_level_ss, 0, &qCoef[blockpos]) ;
+
+
+ }
+
+ v_numSig = vec_sums(v_numSig, (vector signed int)v_zeros) ;
+
+ // return numSig;
+ return v_numSig[3] ;
+} // end quant_altivec()
+
+
+inline void denoiseDct_unroll8_altivec(int16_t* dctCoef, uint32_t* resSum, const uint16_t* offset, int numCoeff, int index_offset)
+{
+ vector short v_level_ss, v_sign_ss ;
+ vector int v_level_h_sw, v_level_l_sw ;
+ vector int v_level_h_processed_sw, v_level_l_processed_sw ;
+ vector int v_sign_h_sw, v_sign_l_sw ;
+ vector unsigned int v_resSum_h_uw, v_resSum_l_uw ;
+ vector unsigned short v_offset_us ;
+ vector unsigned int v_offset_h_uw, v_offset_l_uw ;
+ const vector unsigned short v_shamt_us = {15,15,15,15,15,15,15,15} ;
+ const vector unsigned int v_unpack_mask = {0x0FFFF, 0x0FFFF, 0x0FFFF, 0x0FFFF} ;
+ vector bool int vec_less_than_zero_h_bw, vec_less_than_zero_l_bw ;
+ LOAD_ZERO;
+
+ // for(int jj=0; jj<8; jj++) v_level[jj]=dctCoef[ii*8 + jj] ;
+ v_level_ss = vec_xl(0, &dctCoef[index_offset]) ;
+ v_level_h_sw = vec_unpackh(v_level_ss) ;
+ v_level_l_sw = vec_unpackl(v_level_ss) ;
+
+ // for(int jj=0; jj<8; jj++) v_sign[jj] = v_level[jj] >> 31 ;
+ v_sign_ss = vec_sra(v_level_ss, v_shamt_us) ;
+ v_sign_h_sw = vec_unpackh(v_sign_ss) ;
+ v_sign_l_sw = vec_unpackl(v_sign_ss) ;
+
+
+
+ // for(int jj=0; jj<8; jj++) v_level[jj] = (v_level[jj] + v_sign[jj]) ^ v_sign[jj] ;
+ v_level_h_sw = vec_add(v_level_h_sw, v_sign_h_sw) ;
+ v_level_l_sw = vec_add(v_level_l_sw, v_sign_l_sw) ;
+
+ v_level_h_sw = vec_xor(v_level_h_sw, v_sign_h_sw) ;
+ v_level_l_sw = vec_xor(v_level_l_sw, v_sign_l_sw) ;
+
+
+
+ // for(int jj=0; jj<8; jj++) resSum[ii*8 + jj] += v_level[jj] ;
+ v_resSum_h_uw = vec_xl(0, &resSum[index_offset]) ;
+ v_resSum_l_uw = vec_xl(0, &resSum[index_offset + 4]) ;
+
+ v_resSum_h_uw = vec_add(v_resSum_h_uw, (vector unsigned int)v_level_h_sw) ;
+ v_resSum_l_uw = vec_add(v_resSum_l_uw, (vector unsigned int)v_level_l_sw) ;
+
+ vec_xst(v_resSum_h_uw, 0, &resSum[index_offset]) ;
+ vec_xst(v_resSum_l_uw, 0, &resSum[index_offset + 4]) ;
+
+
+ // for(int jj=0; jj<8; jj++) v_level[jj] -= offset[ii*8 + jj] ;
+ v_offset_us = vec_xl(0, &offset[index_offset]) ;
+ v_offset_h_uw = (vector unsigned int)vec_unpackh((vector signed short)v_offset_us) ;
+ v_offset_l_uw = (vector unsigned int)vec_unpackl((vector signed short)v_offset_us) ;
+ v_offset_h_uw = vec_and(v_offset_h_uw, v_unpack_mask) ;
+ v_offset_l_uw = vec_and(v_offset_l_uw, v_unpack_mask) ;
+ v_level_h_sw = vec_sub(v_level_h_sw, (vector signed int) v_offset_h_uw) ;
+ v_level_l_sw = vec_sub(v_level_l_sw, (vector signed int) v_offset_l_uw) ;
+
+
+ // for (int jj = 0; jj < 8; jj++) dctCoef[ii*8 + jj] = (int16_t)(v_level[jj] < 0 ? 0 : (v_level[jj] ^ v_sign[jj]) - v_sign[jj]);
+ // (level ^ sign) - sign
+ v_level_h_processed_sw = vec_xor(v_level_h_sw, v_sign_h_sw) ;
+ v_level_l_processed_sw = vec_xor(v_level_l_sw, v_sign_l_sw) ;
+ v_level_h_processed_sw = vec_sub(v_level_h_processed_sw, v_sign_h_sw) ;
+ v_level_l_processed_sw = vec_sub(v_level_l_processed_sw, v_sign_l_sw) ;
+
+ //vec_less_than_zero_h_bw = vec_cmplt(v_level_h_sw, (vector signed int){0, 0, 0, 0}) ;
+ //vec_less_than_zero_l_bw = vec_cmplt(v_level_l_sw, (vector signed int){0, 0, 0, 0}) ;
+ vec_less_than_zero_h_bw = vec_cmplt(v_level_h_sw, zero_s32v) ;
+ vec_less_than_zero_l_bw = vec_cmplt(v_level_l_sw, zero_s32v) ;
+
+ v_level_h_sw = vec_sel(v_level_h_processed_sw, (vector signed int){0, 0, 0, 0}, vec_less_than_zero_h_bw) ;
+ v_level_l_sw = vec_sel(v_level_l_processed_sw, (vector signed int){0, 0, 0, 0}, vec_less_than_zero_l_bw) ;
+
+ v_level_ss = vec_pack(v_level_h_sw, v_level_l_sw) ;
+
+ vec_xst(v_level_ss, 0, &dctCoef[index_offset]) ;
+}
+
+
+void denoiseDct_altivec(int16_t* dctCoef, uint32_t* resSum, const uint16_t* offset, int numCoeff)
+{
+ int ii_offset ;
+
+ // For each set of 256
+ for(int ii=0; ii<(numCoeff/256); ii++)
+ {
+ #pragma unroll
+ for(int jj=0; jj<32; jj++)
+ {
+ denoiseDct_unroll8_altivec(dctCoef, resSum, offset, numCoeff, ii*256 + jj*8) ;
+ }
+ }
+
+ ii_offset = ((numCoeff >> 8) << 8) ;
+
+ // For each set of 64
+ for(int ii=0; ii<((numCoeff%256) /64); ii++)
+ {
+ #pragma unroll
+ for(int jj=0; jj<8; jj++)
+ {
+ denoiseDct_unroll8_altivec(dctCoef, resSum, offset, numCoeff, ii_offset + ii*64 + jj*8) ;
+ }
+ }
+
+
+ ii_offset = ((numCoeff >> 6) << 6) ;
+
+ // For each set of 8
+ for(int ii=0; ii < ((numCoeff%64) /8); ii++)
+ {
+ denoiseDct_unroll8_altivec(dctCoef, resSum, offset, numCoeff, ii_offset + (ii*8)) ;
+ }
+
+
+ ii_offset = ((numCoeff >> 3) << 3) ;
+
+ for (int ii = 0; ii < (numCoeff % 8); ii++)
+ {
+ int level = dctCoef[ii + ii_offset];
+ int sign = level >> 31;
+ level = (level + sign) ^ sign;
+ resSum[ii+ii_offset] += level;
+ level -= offset[ii+ii_offset] ;
+ dctCoef[ii+ii_offset] = (int16_t)(level < 0 ? 0 : (level ^ sign) - sign);
+ }
+
+} // end denoiseDct_altivec()
+
+
+
+
+inline void transpose_matrix_8_altivec(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+{
+ vector signed short v_src_0 ;
+ vector signed short v_src_1 ;
+ vector signed short v_src_2 ;
+ vector signed short v_src_3 ;
+ vector signed short v_src_4 ;
+ vector signed short v_src_5 ;
+ vector signed short v_src_6 ;
+ vector signed short v_src_7 ;
+
+ vector signed short v_dst_32s_0 ;
+ vector signed short v_dst_32s_1 ;
+ vector signed short v_dst_32s_2 ;
+ vector signed short v_dst_32s_3 ;
+ vector signed short v_dst_32s_4 ;
+ vector signed short v_dst_32s_5 ;
+ vector signed short v_dst_32s_6 ;
+ vector signed short v_dst_32s_7 ;
+
+ vector signed short v_dst_64s_0 ;
+ vector signed short v_dst_64s_1 ;
+ vector signed short v_dst_64s_2 ;
+ vector signed short v_dst_64s_3 ;
+ vector signed short v_dst_64s_4 ;
+ vector signed short v_dst_64s_5 ;
+ vector signed short v_dst_64s_6 ;
+ vector signed short v_dst_64s_7 ;
+
+ vector signed short v_dst_128s_0 ;
+ vector signed short v_dst_128s_1 ;
+ vector signed short v_dst_128s_2 ;
+ vector signed short v_dst_128s_3 ;
+ vector signed short v_dst_128s_4 ;
+ vector signed short v_dst_128s_5 ;
+ vector signed short v_dst_128s_6 ;
+ vector signed short v_dst_128s_7 ;
+
+ v_src_0 = vec_xl(0, src) ;
+ v_src_1 = vec_xl( (srcStride*2) , src) ;
+ v_src_2 = vec_xl( (srcStride*2) * 2, src) ;
+ v_src_3 = vec_xl( (srcStride*2) * 3, src) ;
+ v_src_4 = vec_xl( (srcStride*2) * 4, src) ;
+ v_src_5 = vec_xl( (srcStride*2) * 5, src) ;
+ v_src_6 = vec_xl( (srcStride*2) * 6, src) ;
+ v_src_7 = vec_xl( (srcStride*2) * 7, src) ;
+
+ vector unsigned char v_permute_32s_high = {0x00, 0x01, 0x10, 0x11, 0x02, 0x03, 0x12, 0x13, 0x04, 0x05, 0x14, 0x15, 0x06, 0x07, 0x16, 0x17} ;
+ vector unsigned char v_permute_32s_low = {0x08, 0x09, 0x18, 0x19, 0x0A, 0x0B, 0x1A, 0x1B, 0x0C, 0x0D, 0x1C, 0x1D, 0x0E, 0x0F, 0x1E, 0x1F} ;
+ vector unsigned char v_permute_64s_high = {0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x04, 0x05, 0x06, 0x07, 0x14, 0x015, 0x16, 0x17} ;
+ vector unsigned char v_permute_64s_low = {0x08, 0x09, 0x0A, 0x0B, 0x18, 0x19, 0x1A, 0x1B, 0x0C, 0x0D, 0x0E, 0x0F, 0x1C, 0x1D, 0x1E, 0x1F} ;
+ vector unsigned char v_permute_128s_high = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x015, 0x16, 0x17} ;
+ vector unsigned char v_permute_128s_low = {0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F} ;
+
+ v_dst_32s_0 = vec_perm(v_src_0, v_src_1, v_permute_32s_high) ;
+ v_dst_32s_1 = vec_perm(v_src_2, v_src_3, v_permute_32s_high) ;
+ v_dst_32s_2 = vec_perm(v_src_4, v_src_5, v_permute_32s_high) ;
+ v_dst_32s_3 = vec_perm(v_src_6, v_src_7, v_permute_32s_high) ;
+ v_dst_32s_4 = vec_perm(v_src_0, v_src_1, v_permute_32s_low) ;
+ v_dst_32s_5 = vec_perm(v_src_2, v_src_3, v_permute_32s_low) ;
+ v_dst_32s_6 = vec_perm(v_src_4, v_src_5, v_permute_32s_low) ;
+ v_dst_32s_7 = vec_perm(v_src_6, v_src_7, v_permute_32s_low) ;
+
+ v_dst_64s_0 = vec_perm(v_dst_32s_0, v_dst_32s_1, v_permute_64s_high) ;
+ v_dst_64s_1 = vec_perm(v_dst_32s_2, v_dst_32s_3, v_permute_64s_high) ;
+ v_dst_64s_2 = vec_perm(v_dst_32s_0, v_dst_32s_1, v_permute_64s_low) ;
+ v_dst_64s_3 = vec_perm(v_dst_32s_2, v_dst_32s_3, v_permute_64s_low) ;
+ v_dst_64s_4 = vec_perm(v_dst_32s_4, v_dst_32s_5, v_permute_64s_high) ;
+ v_dst_64s_5 = vec_perm(v_dst_32s_6, v_dst_32s_7, v_permute_64s_high) ;
+ v_dst_64s_6 = vec_perm(v_dst_32s_4, v_dst_32s_5, v_permute_64s_low) ;
+ v_dst_64s_7 = vec_perm(v_dst_32s_6, v_dst_32s_7, v_permute_64s_low) ;
+
+ v_dst_128s_0 = vec_perm(v_dst_64s_0, v_dst_64s_1, v_permute_128s_high) ;
+ v_dst_128s_1 = vec_perm(v_dst_64s_0, v_dst_64s_1, v_permute_128s_low) ;
+ v_dst_128s_2 = vec_perm(v_dst_64s_2, v_dst_64s_3, v_permute_128s_high) ;
+ v_dst_128s_3 = vec_perm(v_dst_64s_2, v_dst_64s_3, v_permute_128s_low) ;
+ v_dst_128s_4 = vec_perm(v_dst_64s_4, v_dst_64s_5, v_permute_128s_high) ;
+ v_dst_128s_5 = vec_perm(v_dst_64s_4, v_dst_64s_5, v_permute_128s_low) ;
+ v_dst_128s_6 = vec_perm(v_dst_64s_6, v_dst_64s_7, v_permute_128s_high) ;
+ v_dst_128s_7 = vec_perm(v_dst_64s_6, v_dst_64s_7, v_permute_128s_low) ;
+
+
+ vec_xst(v_dst_128s_0, 0, dst) ;
+ vec_xst(v_dst_128s_1, (dstStride*2) , dst) ;
+ vec_xst(v_dst_128s_2, (dstStride*2) * 2, dst) ;
+ vec_xst(v_dst_128s_3, (dstStride*2) * 3, dst) ;
+ vec_xst(v_dst_128s_4, (dstStride*2) * 4, dst) ;
+ vec_xst(v_dst_128s_5, (dstStride*2) * 5, dst) ;
+ vec_xst(v_dst_128s_6, (dstStride*2) * 6, dst) ;
+ vec_xst(v_dst_128s_7, (dstStride*2) * 7, dst) ;
+
+} // end transpose_matrix_8_altivec()
+
+
+inline void transpose_matrix_16_altivec(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+{
+ transpose_matrix_8_altivec((int16_t *)src, srcStride, (int16_t *)dst, dstStride) ;
+ transpose_matrix_8_altivec((int16_t *)&src[8] , srcStride, (int16_t *)&dst[dstStride*8], dstStride) ;
+ transpose_matrix_8_altivec((int16_t *)&src[srcStride*8], srcStride, (int16_t *)&dst[8], dstStride) ;
+ transpose_matrix_8_altivec((int16_t *)&src[srcStride*8 + 8], srcStride, (int16_t *)&dst[dstStride*8 + 8], dstStride) ;
+} // end transpose_matrix_16_altivec()
+
+
+inline void transpose_matrix_32_altivec(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+{
+ transpose_matrix_16_altivec((int16_t *)src, srcStride, (int16_t *)dst, dstStride) ;
+ transpose_matrix_16_altivec((int16_t *)&src[16] , srcStride, (int16_t *)&dst[dstStride*16], dstStride) ;
+ transpose_matrix_16_altivec((int16_t *)&src[srcStride*16], srcStride, (int16_t *)&dst[16], dstStride) ;
+ transpose_matrix_16_altivec((int16_t *)&src[srcStride*16 + 16], srcStride, (int16_t *)&dst[dstStride*16 + 16], dstStride) ;
+} // end transpose_matrix_32_altivec()
+
+
+inline static void partialButterfly32_transposedSrc_altivec(const int16_t* __restrict__ src, int16_t* __restrict__ dst, int shift)
+{
+ const int line = 32 ;
+
+ int j;
+ int E[16][8], O[16][8];
+ int EE[8][8], EO[8][8];
+ int EEE[4][8], EEO[4][8];
+ int EEEE[2][8], EEEO[2][8];
+ int add = 1 << (shift - 1);
+
+ for (j = 0; j < line/8; j++)
+ {
+ /* E and O*/
+ for(int ii=0; ii<8; ii++) { E[0][ii] = src[(0*line) + ii] + src[((31 - 0)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[0][ii] = src[(0*line) + ii] - src[((31 - 0)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[1][ii] = src[(1*line) + ii] + src[((31 - 1)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[1][ii] = src[(1*line) + ii] - src[((31 - 1)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[2][ii] = src[(2*line) + ii] + src[((31 - 2)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[2][ii] = src[(2*line) + ii] - src[((31 - 2)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[3][ii] = src[(3*line) + ii] + src[((31 - 3)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[3][ii] = src[(3*line) + ii] - src[((31 - 3)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[4][ii] = src[(4*line) + ii] + src[((31 - 4)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[4][ii] = src[(4*line) + ii] - src[((31 - 4)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[5][ii] = src[(5*line) + ii] + src[((31 - 5)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[5][ii] = src[(5*line) + ii] - src[((31 - 5)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[6][ii] = src[(6*line) + ii] + src[((31 - 6)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[6][ii] = src[(6*line) + ii] - src[((31 - 6)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[7][ii] = src[(7*line) + ii] + src[((31 - 7)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[7][ii] = src[(7*line) + ii] - src[((31 - 7)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[8][ii] = src[(8*line) + ii] + src[((31 - 8)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[8][ii] = src[(8*line) + ii] - src[((31 - 8)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[9][ii] = src[(9*line) + ii] + src[((31 - 9)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[9][ii] = src[(9*line) + ii] - src[((31 - 9)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[10][ii] = src[(10*line) + ii] + src[((31 - 10)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[10][ii] = src[(10*line) + ii] - src[((31 - 10)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[11][ii] = src[(11*line) + ii] + src[((31 - 11)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[11][ii] = src[(11*line) + ii] - src[((31 - 11)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[12][ii] = src[(12*line) + ii] + src[((31 - 12)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[12][ii] = src[(12*line) + ii] - src[((31 - 12)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[13][ii] = src[(13*line) + ii] + src[((31 - 13)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[13][ii] = src[(13*line) + ii] - src[((31 - 13)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[14][ii] = src[(14*line) + ii] + src[((31 - 14)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[14][ii] = src[(14*line) + ii] - src[((31 - 14)*line) + ii] ; }
+
+ for(int ii=0; ii<8; ii++) { E[15][ii] = src[(15*line) + ii] + src[((31 - 15)*line) + ii] ; }
+ for(int ii=0; ii<8; ii++) { O[15][ii] = src[(15*line) + ii] - src[((31 - 15)*line) + ii] ; }
+
+
+ /* EE and EO */
+ for(int ii=0; ii<8; ii++) {EE[0][ii] = E[0][ii] + E[15 - 0][ii];}
+ for(int ii=0; ii<8; ii++) {EO[0][ii] = E[0][ii] - E[15 - 0][ii];}
+
+ for(int ii=0; ii<8; ii++) {EE[1][ii] = E[1][ii] + E[15 - 1][ii];}
+ for(int ii=0; ii<8; ii++) {EO[1][ii] = E[1][ii] - E[15 - 1][ii];}
+
+ for(int ii=0; ii<8; ii++) {EE[2][ii] = E[2][ii] + E[15 - 2][ii];}
+ for(int ii=0; ii<8; ii++) {EO[2][ii] = E[2][ii] - E[15 - 2][ii];}
+
+ for(int ii=0; ii<8; ii++) {EE[3][ii] = E[3][ii] + E[15 - 3][ii];}
+ for(int ii=0; ii<8; ii++) {EO[3][ii] = E[3][ii] - E[15 - 3][ii];}
+
+ for(int ii=0; ii<8; ii++) {EE[4][ii] = E[4][ii] + E[15 - 4][ii];}
+ for(int ii=0; ii<8; ii++) {EO[4][ii] = E[4][ii] - E[15 - 4][ii];}
+
+ for(int ii=0; ii<8; ii++) {EE[5][ii] = E[5][ii] + E[15 - 5][ii];}
+ for(int ii=0; ii<8; ii++) {EO[5][ii] = E[5][ii] - E[15 - 5][ii];}
+
+ for(int ii=0; ii<8; ii++) {EE[6][ii] = E[6][ii] + E[15 - 6][ii];}
+ for(int ii=0; ii<8; ii++) {EO[6][ii] = E[6][ii] - E[15 - 6][ii];}
+
+ for(int ii=0; ii<8; ii++) {EE[7][ii] = E[7][ii] + E[15 - 7][ii];}
+ for(int ii=0; ii<8; ii++) {EO[7][ii] = E[7][ii] - E[15 - 7][ii];}
+
+
+ /* EEE and EEO */
+ for(int ii=0; ii<8; ii++) {EEE[0][ii] = EE[0][ii] + EE[7 - 0][ii];}
+ for(int ii=0; ii<8; ii++) {EEO[0][ii] = EE[0][ii] - EE[7 - 0][ii];}
+
+ for(int ii=0; ii<8; ii++) {EEE[1][ii] = EE[1][ii] + EE[7 - 1][ii];}
+ for(int ii=0; ii<8; ii++) {EEO[1][ii] = EE[1][ii] - EE[7 - 1][ii];}
+
+ for(int ii=0; ii<8; ii++) {EEE[2][ii] = EE[2][ii] + EE[7 - 2][ii];}
+ for(int ii=0; ii<8; ii++) {EEO[2][ii] = EE[2][ii] - EE[7 - 2][ii];}
+
+ for(int ii=0; ii<8; ii++) {EEE[3][ii] = EE[3][ii] + EE[7 - 3][ii];}
+ for(int ii=0; ii<8; ii++) {EEO[3][ii] = EE[3][ii] - EE[7 - 3][ii];}
+
+
+
+ /* EEEE and EEEO */
+ for(int ii=0; ii<8; ii++) {EEEE[0][ii] = EEE[0][ii] + EEE[3][ii];}
+ for(int ii=0; ii<8; ii++) {EEEO[0][ii] = EEE[0][ii] - EEE[3][ii];}
+
+ for(int ii=0; ii<8; ii++) {EEEE[1][ii] = EEE[1][ii] + EEE[2][ii];}
+ for(int ii=0; ii<8; ii++) {EEEO[1][ii] = EEE[1][ii] - EEE[2][ii];}
+
+
+ /* writing to dst */
+ for(int ii=0; ii<8; ii++) {dst[0 + ii] = (int16_t)((g_t32[0][0] * EEEE[0][ii] + g_t32[0][1] * EEEE[1][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(16 * line) + ii] = (int16_t)((g_t32[16][0] * EEEE[0][ii] + g_t32[16][1] * EEEE[1][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(8 * line ) + ii] = (int16_t)((g_t32[8][0] * EEEO[0][ii] + g_t32[8][1] * EEEO[1][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(24 * line) + ii] = (int16_t)((g_t32[24][0] * EEEO[0][ii] + g_t32[24][1] * EEEO[1][ii] + add) >> shift);}
+
+ for(int ii=0; ii<8; ii++) {dst[(4 * line) + ii] = (int16_t)((g_t32[4][0] * EEO[0][ii] + g_t32[4][1] * EEO[1][ii] + g_t32[4][2] * EEO[2][ii] + g_t32[4][3] * EEO[3][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(12 * line) + ii] = (int16_t)((g_t32[12][0] * EEO[0][ii] + g_t32[12][1] * EEO[1][ii] + g_t32[12][2] * EEO[2][ii] + g_t32[12][3] * EEO[3][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(20 * line) + ii] = (int16_t)((g_t32[20][0] * EEO[0][ii] + g_t32[20][1] * EEO[1][ii] + g_t32[20][2] * EEO[2][ii] + g_t32[20][3] * EEO[3][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(28 * line) + ii] = (int16_t)((g_t32[28][0] * EEO[0][ii] + g_t32[28][1] * EEO[1][ii] + g_t32[28][2] * EEO[2][ii] + g_t32[28][3] * EEO[3][ii] + add) >> shift);}
+
+ for(int ii=0; ii<8; ii++) {dst[(2 * line) + ii] = (int16_t)((g_t32[2][0] * EO[0][ii] + g_t32[2][1] * EO[1][ii] + g_t32[2][2] * EO[2][ii] + g_t32[2][3] * EO[3][ii] + g_t32[2][4] * EO[4][ii] + g_t32[2][5] * EO[5][ii] + g_t32[2][6] * EO[6][ii] + g_t32[2][7] * EO[7][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(6 * line) + ii] = (int16_t)((g_t32[6][0] * EO[0][ii] + g_t32[6][1] * EO[1][ii] + g_t32[6][2] * EO[2][ii] + g_t32[6][3] * EO[3][ii] + g_t32[6][4] * EO[4][ii] + g_t32[6][5] * EO[5][ii] + g_t32[6][6] * EO[6][ii] + g_t32[6][7] * EO[7][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(10 * line) + ii] = (int16_t)((g_t32[10][0] * EO[0][ii] + g_t32[10][1] * EO[1][ii] + g_t32[10][2] * EO[2][ii] + g_t32[10][3] * EO[3][ii] + g_t32[10][4] * EO[4][ii] + g_t32[10][5] * EO[5][ii] + g_t32[10][6] * EO[6][ii] + g_t32[10][7] * EO[7][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(14 * line) + ii] = (int16_t)((g_t32[14][0] * EO[0][ii] + g_t32[14][1] * EO[1][ii] + g_t32[14][2] * EO[2][ii] + g_t32[14][3] * EO[3][ii] + g_t32[14][4] * EO[4][ii] + g_t32[14][5] * EO[5][ii] + g_t32[14][6] * EO[6][ii] + g_t32[14][7] * EO[7][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(18 * line) + ii] = (int16_t)((g_t32[18][0] * EO[0][ii] + g_t32[18][1] * EO[1][ii] + g_t32[18][2] * EO[2][ii] + g_t32[18][3] * EO[3][ii] + g_t32[18][4] * EO[4][ii] + g_t32[18][5] * EO[5][ii] + g_t32[18][6] * EO[6][ii] + g_t32[18][7] * EO[7][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(22 * line) + ii] = (int16_t)((g_t32[22][0] * EO[0][ii] + g_t32[22][1] * EO[1][ii] + g_t32[22][2] * EO[2][ii] + g_t32[22][3] * EO[3][ii] + g_t32[22][4] * EO[4][ii] + g_t32[22][5] * EO[5][ii] + g_t32[22][6] * EO[6][ii] + g_t32[22][7] * EO[7][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(26 * line) + ii] = (int16_t)((g_t32[26][0] * EO[0][ii] + g_t32[26][1] * EO[1][ii] + g_t32[26][2] * EO[2][ii] + g_t32[26][3] * EO[3][ii] + g_t32[26][4] * EO[4][ii] + g_t32[26][5] * EO[5][ii] + g_t32[26][6] * EO[6][ii] + g_t32[26][7] * EO[7][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) {dst[(30 * line) + ii] = (int16_t)((g_t32[30][0] * EO[0][ii] + g_t32[30][1] * EO[1][ii] + g_t32[30][2] * EO[2][ii] + g_t32[30][3] * EO[3][ii] + g_t32[30][4] * EO[4][ii] + g_t32[30][5] * EO[5][ii] + g_t32[30][6] * EO[6][ii] + g_t32[30][7] * EO[7][ii] + add) >> shift);}
+
+
+ for(int ii=0; ii<8; ii++) { dst[(1 * line) + ii] = (int16_t)((g_t32[1][0] * O[0][ii] + g_t32[1][1] * O[1][ii] + g_t32[1][2] * O[2][ii] + g_t32[1][3] * O[3][ii] + g_t32[1][4] * O[4][ii] + g_t32[1][5] * O[5][ii] + g_t32[1][6] * O[6][ii] + g_t32[1][7] * O[7][ii] + g_t32[1][8] * O[8][ii] + g_t32[1][9] * O[9][ii] + g_t32[1][10] * O[10][ii] + g_t32[1][11] * O[11][ii] + g_t32[1][12] * O[12][ii] + g_t32[1][13] * O[13][ii] + g_t32[1][14] * O[14][ii] + g_t32[1][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(3 * line) + ii] = (int16_t)((g_t32[3][0] * O[0][ii] + g_t32[3][1] * O[1][ii] + g_t32[3][2] * O[2][ii] + g_t32[3][3] * O[3][ii] + g_t32[3][4] * O[4][ii] + g_t32[3][5] * O[5][ii] + g_t32[3][6] * O[6][ii] + g_t32[3][7] * O[7][ii] + g_t32[3][8] * O[8][ii] + g_t32[3][9] * O[9][ii] + g_t32[3][10] * O[10][ii] + g_t32[3][11] * O[11][ii] + g_t32[3][12] * O[12][ii] + g_t32[3][13] * O[13][ii] + g_t32[3][14] * O[14][ii] + g_t32[3][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(5 * line) + ii] = (int16_t)((g_t32[5][0] * O[0][ii] + g_t32[5][1] * O[1][ii] + g_t32[5][2] * O[2][ii] + g_t32[5][3] * O[3][ii] + g_t32[5][4] * O[4][ii] + g_t32[5][5] * O[5][ii] + g_t32[5][6] * O[6][ii] + g_t32[5][7] * O[7][ii] + g_t32[5][8] * O[8][ii] + g_t32[5][9] * O[9][ii] + g_t32[5][10] * O[10][ii] + g_t32[5][11] * O[11][ii] + g_t32[5][12] * O[12][ii] + g_t32[5][13] * O[13][ii] + g_t32[5][14] * O[14][ii] + g_t32[5][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(7 * line) + ii] = (int16_t)((g_t32[7][0] * O[0][ii] + g_t32[7][1] * O[1][ii] + g_t32[7][2] * O[2][ii] + g_t32[7][3] * O[3][ii] + g_t32[7][4] * O[4][ii] + g_t32[7][5] * O[5][ii] + g_t32[7][6] * O[6][ii] + g_t32[7][7] * O[7][ii] + g_t32[7][8] * O[8][ii] + g_t32[7][9] * O[9][ii] + g_t32[7][10] * O[10][ii] + g_t32[7][11] * O[11][ii] + g_t32[7][12] * O[12][ii] + g_t32[7][13] * O[13][ii] + g_t32[7][14] * O[14][ii] + g_t32[7][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(9 * line) + ii] = (int16_t)((g_t32[9][0] * O[0][ii] + g_t32[9][1] * O[1][ii] + g_t32[9][2] * O[2][ii] + g_t32[9][3] * O[3][ii] + g_t32[9][4] * O[4][ii] + g_t32[9][5] * O[5][ii] + g_t32[9][6] * O[6][ii] + g_t32[9][7] * O[7][ii] + g_t32[9][8] * O[8][ii] + g_t32[9][9] * O[9][ii] + g_t32[9][10] * O[10][ii] + g_t32[9][11] * O[11][ii] + g_t32[9][12] * O[12][ii] + g_t32[9][13] * O[13][ii] + g_t32[9][14] * O[14][ii] + g_t32[9][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(11 * line) + ii] = (int16_t)((g_t32[11][0] * O[0][ii] + g_t32[11][1] * O[1][ii] + g_t32[11][2] * O[2][ii] + g_t32[11][3] * O[3][ii] + g_t32[11][4] * O[4][ii] + g_t32[11][5] * O[5][ii] + g_t32[11][6] * O[6][ii] + g_t32[11][7] * O[7][ii] + g_t32[11][8] * O[8][ii] + g_t32[11][9] * O[9][ii] + g_t32[11][10] * O[10][ii] + g_t32[11][11] * O[11][ii] + g_t32[11][12] * O[12][ii] + g_t32[11][13] * O[13][ii] + g_t32[11][14] * O[14][ii] + g_t32[11][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(13 * line) + ii] = (int16_t)((g_t32[13][0] * O[0][ii] + g_t32[13][1] * O[1][ii] + g_t32[13][2] * O[2][ii] + g_t32[13][3] * O[3][ii] + g_t32[13][4] * O[4][ii] + g_t32[13][5] * O[5][ii] + g_t32[13][6] * O[6][ii] + g_t32[13][7] * O[7][ii] + g_t32[13][8] * O[8][ii] + g_t32[13][9] * O[9][ii] + g_t32[13][10] * O[10][ii] + g_t32[13][11] * O[11][ii] + g_t32[13][12] * O[12][ii] + g_t32[13][13] * O[13][ii] + g_t32[13][14] * O[14][ii] + g_t32[13][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(15 * line) + ii] = (int16_t)((g_t32[15][0] * O[0][ii] + g_t32[15][1] * O[1][ii] + g_t32[15][2] * O[2][ii] + g_t32[15][3] * O[3][ii] + g_t32[15][4] * O[4][ii] + g_t32[15][5] * O[5][ii] + g_t32[15][6] * O[6][ii] + g_t32[15][7] * O[7][ii] + g_t32[15][8] * O[8][ii] + g_t32[15][9] * O[9][ii] + g_t32[15][10] * O[10][ii] + g_t32[15][11] * O[11][ii] + g_t32[15][12] * O[12][ii] + g_t32[15][13] * O[13][ii] + g_t32[15][14] * O[14][ii] + g_t32[15][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(17 * line) + ii] = (int16_t)((g_t32[17][0] * O[0][ii] + g_t32[17][1] * O[1][ii] + g_t32[17][2] * O[2][ii] + g_t32[17][3] * O[3][ii] + g_t32[17][4] * O[4][ii] + g_t32[17][5] * O[5][ii] + g_t32[17][6] * O[6][ii] + g_t32[17][7] * O[7][ii] + g_t32[17][8] * O[8][ii] + g_t32[17][9] * O[9][ii] + g_t32[17][10] * O[10][ii] + g_t32[17][11] * O[11][ii] + g_t32[17][12] * O[12][ii] + g_t32[17][13] * O[13][ii] + g_t32[17][14] * O[14][ii] + g_t32[17][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(19 * line) + ii] = (int16_t)((g_t32[19][0] * O[0][ii] + g_t32[19][1] * O[1][ii] + g_t32[19][2] * O[2][ii] + g_t32[19][3] * O[3][ii] + g_t32[19][4] * O[4][ii] + g_t32[19][5] * O[5][ii] + g_t32[19][6] * O[6][ii] + g_t32[19][7] * O[7][ii] + g_t32[19][8] * O[8][ii] + g_t32[19][9] * O[9][ii] + g_t32[19][10] * O[10][ii] + g_t32[19][11] * O[11][ii] + g_t32[19][12] * O[12][ii] + g_t32[19][13] * O[13][ii] + g_t32[19][14] * O[14][ii] + g_t32[19][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(21 * line) + ii] = (int16_t)((g_t32[21][0] * O[0][ii] + g_t32[21][1] * O[1][ii] + g_t32[21][2] * O[2][ii] + g_t32[21][3] * O[3][ii] + g_t32[21][4] * O[4][ii] + g_t32[21][5] * O[5][ii] + g_t32[21][6] * O[6][ii] + g_t32[21][7] * O[7][ii] + g_t32[21][8] * O[8][ii] + g_t32[21][9] * O[9][ii] + g_t32[21][10] * O[10][ii] + g_t32[21][11] * O[11][ii] + g_t32[21][12] * O[12][ii] + g_t32[21][13] * O[13][ii] + g_t32[21][14] * O[14][ii] + g_t32[21][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(23 * line) + ii] = (int16_t)((g_t32[23][0] * O[0][ii] + g_t32[23][1] * O[1][ii] + g_t32[23][2] * O[2][ii] + g_t32[23][3] * O[3][ii] + g_t32[23][4] * O[4][ii] + g_t32[23][5] * O[5][ii] + g_t32[23][6] * O[6][ii] + g_t32[23][7] * O[7][ii] + g_t32[23][8] * O[8][ii] + g_t32[23][9] * O[9][ii] + g_t32[23][10] * O[10][ii] + g_t32[23][11] * O[11][ii] + g_t32[23][12] * O[12][ii] + g_t32[23][13] * O[13][ii] + g_t32[23][14] * O[14][ii] + g_t32[23][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(25 * line) + ii] = (int16_t)((g_t32[25][0] * O[0][ii] + g_t32[25][1] * O[1][ii] + g_t32[25][2] * O[2][ii] + g_t32[25][3] * O[3][ii] + g_t32[25][4] * O[4][ii] + g_t32[25][5] * O[5][ii] + g_t32[25][6] * O[6][ii] + g_t32[25][7] * O[7][ii] + g_t32[25][8] * O[8][ii] + g_t32[25][9] * O[9][ii] + g_t32[25][10] * O[10][ii] + g_t32[25][11] * O[11][ii] + g_t32[25][12] * O[12][ii] + g_t32[25][13] * O[13][ii] + g_t32[25][14] * O[14][ii] + g_t32[25][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(27 * line) + ii] = (int16_t)((g_t32[27][0] * O[0][ii] + g_t32[27][1] * O[1][ii] + g_t32[27][2] * O[2][ii] + g_t32[27][3] * O[3][ii] + g_t32[27][4] * O[4][ii] + g_t32[27][5] * O[5][ii] + g_t32[27][6] * O[6][ii] + g_t32[27][7] * O[7][ii] + g_t32[27][8] * O[8][ii] + g_t32[27][9] * O[9][ii] + g_t32[27][10] * O[10][ii] + g_t32[27][11] * O[11][ii] + g_t32[27][12] * O[12][ii] + g_t32[27][13] * O[13][ii] + g_t32[27][14] * O[14][ii] + g_t32[27][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(29 * line) + ii] = (int16_t)((g_t32[29][0] * O[0][ii] + g_t32[29][1] * O[1][ii] + g_t32[29][2] * O[2][ii] + g_t32[29][3] * O[3][ii] + g_t32[29][4] * O[4][ii] + g_t32[29][5] * O[5][ii] + g_t32[29][6] * O[6][ii] + g_t32[29][7] * O[7][ii] + g_t32[29][8] * O[8][ii] + g_t32[29][9] * O[9][ii] + g_t32[29][10] * O[10][ii] + g_t32[29][11] * O[11][ii] + g_t32[29][12] * O[12][ii] + g_t32[29][13] * O[13][ii] + g_t32[29][14] * O[14][ii] + g_t32[29][15] * O[15][ii] + add) >> shift);}
+ for(int ii=0; ii<8; ii++) { dst[(31 * line) + ii] = (int16_t)((g_t32[31][0] * O[0][ii] + g_t32[31][1] * O[1][ii] + g_t32[31][2] * O[2][ii] + g_t32[31][3] * O[3][ii] + g_t32[31][4] * O[4][ii] + g_t32[31][5] * O[5][ii] + g_t32[31][6] * O[6][ii] + g_t32[31][7] * O[7][ii] + g_t32[31][8] * O[8][ii] + g_t32[31][9] * O[9][ii] + g_t32[31][10] * O[10][ii] + g_t32[31][11] * O[11][ii] + g_t32[31][12] * O[12][ii] + g_t32[31][13] * O[13][ii] + g_t32[31][14] * O[14][ii] + g_t32[31][15] * O[15][ii] + add) >> shift);}
+
+ src += 8 ;
+ dst += 8 ;
+ }
+} // end partialButterfly32_transposedSrc_altivec()
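The partial butterfly folds each line into sums E and differences O (then EE/EO, EEE/EEO, and so on), so even-indexed coefficient rows only ever touch E and odd rows only touch O, halving the multiply count at each level. A minimal scalar sketch of one folding level on a 4-point transform — the matrix below merely illustrates the required symmetry (even rows symmetric, odd rows antisymmetric) and is not x265's g_t32 table:

```cpp
#include <cassert>

// Illustrative 4x4 transform matrix with HEVC-style even/odd row symmetry:
// T[k][j] == T[k][3-j] for even k, T[k][j] == -T[k][3-j] for odd k.
static const int T[4][4] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

// Direct matrix multiply: 16 multiplies per input vector.
static void forward_direct(const int s[4], int d[4])
{
    for (int k = 0; k < 4; k++) {
        d[k] = 0;
        for (int j = 0; j < 4; j++)
            d[k] += T[k][j] * s[j];
    }
}

// Butterfly: fold into E[j] = s[j] + s[3-j] and O[j] = s[j] - s[3-j];
// even output rows need only E, odd rows only O (8 multiplies).
static void forward_butterfly(const int s[4], int d[4])
{
    int E[2] = { s[0] + s[3], s[1] + s[2] };
    int O[2] = { s[0] - s[3], s[1] - s[2] };
    d[0] = T[0][0] * E[0] + T[0][1] * E[1];
    d[2] = T[2][0] * E[0] + T[2][1] * E[1];
    d[1] = T[1][0] * O[0] + T[1][1] * O[1];
    d[3] = T[3][0] * O[0] + T[3][1] * O[1];
}
```

The 32-point version above applies the same fold four times (E/O, EE/EO, EEE/EEO, EEEE/EEEO), eight columns at a time.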
+
+
+inline static void partialButterfly16_transposedSrc_altivec(const int16_t* __restrict__ src, int16_t* __restrict__ dst, int shift)
+{
+ const int line = 16 ;
+
+ int j;
+ int add = 1 << (shift - 1);
+
+ int E[8][8], O[8][8] ;
+ int EE[4][8], EO[4][8] ;
+ int EEE[2][8], EEO[2][8] ;
+
+
+ for (j = 0; j < line/8; j++)
+ {
+ /* E and O */
+ for(int ii=0; ii<8; ii++) { E[0][ii] = src[(0*line) + ii] + src[ ((15 - 0) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[0][ii] = src[(0*line) + ii] - src[ ((15 - 0) * line) + ii] ;}
+
+ for(int ii=0; ii<8; ii++) { E[1][ii] = src[(1*line) + ii] + src[ ((15 - 1) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[1][ii] = src[(1*line) + ii] - src[ ((15 - 1) * line) + ii] ;}
+
+ for(int ii=0; ii<8; ii++) { E[2][ii] = src[(2*line) + ii] + src[ ((15 - 2) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[2][ii] = src[(2*line) + ii] - src[ ((15 - 2) * line) + ii] ;}
+
+ for(int ii=0; ii<8; ii++) { E[3][ii] = src[(3*line) + ii] + src[ ((15 - 3) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[3][ii] = src[(3*line) + ii] - src[ ((15 - 3) * line) + ii] ;}
+
+ for(int ii=0; ii<8; ii++) { E[4][ii] = src[(4*line) + ii] + src[ ((15 - 4) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[4][ii] = src[(4*line) + ii] - src[ ((15 - 4) * line) + ii] ;}
+
+ for(int ii=0; ii<8; ii++) { E[5][ii] = src[(5*line) + ii] + src[ ((15 - 5) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[5][ii] = src[(5*line) + ii] - src[ ((15 - 5) * line) + ii] ;}
+
+ for(int ii=0; ii<8; ii++) { E[6][ii] = src[(6*line) + ii] + src[ ((15 - 6) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[6][ii] = src[(6*line) + ii] - src[ ((15 - 6) * line) + ii] ;}
+
+ for(int ii=0; ii<8; ii++) { E[7][ii] = src[(7*line) + ii] + src[ ((15 - 7) * line) + ii] ;}
+ for(int ii=0; ii<8; ii++) { O[7][ii] = src[(7*line) + ii] - src[ ((15 - 7) * line) + ii] ;}
+
+
+ /* EE and EO */
+ for(int ii=0; ii<8; ii++) { EE[0][ii] = E[0][ii] + E[7-0][ii] ;}
+ for(int ii=0; ii<8; ii++) { EO[0][ii] = E[0][ii] - E[7-0][ii] ;}
+
+ for(int ii=0; ii<8; ii++) { EE[1][ii] = E[1][ii] + E[7-1][ii] ;}
+ for(int ii=0; ii<8; ii++) { EO[1][ii] = E[1][ii] - E[7-1][ii] ;}
+
+ for(int ii=0; ii<8; ii++) { EE[2][ii] = E[2][ii] + E[7-2][ii] ;}
+ for(int ii=0; ii<8; ii++) { EO[2][ii] = E[2][ii] - E[7-2][ii] ;}
+
+ for(int ii=0; ii<8; ii++) { EE[3][ii] = E[3][ii] + E[7-3][ii] ;}
+ for(int ii=0; ii<8; ii++) { EO[3][ii] = E[3][ii] - E[7-3][ii] ;}
+
+
+ /* EEE and EEO */
+ for(int ii=0; ii<8; ii++) { EEE[0][ii] = EE[0][ii] + EE[3][ii] ;}
+ for(int ii=0; ii<8; ii++) { EEO[0][ii] = EE[0][ii] - EE[3][ii] ;}
+
+ for(int ii=0; ii<8; ii++) { EEE[1][ii] = EE[1][ii] + EE[2][ii] ;}
+ for(int ii=0; ii<8; ii++) { EEO[1][ii] = EE[1][ii] - EE[2][ii] ;}
+
+
+ /* Writing to dst */
+ for(int ii=0; ii<8; ii++) { dst[ 0 + ii] = (int16_t)((g_t16[0][0] * EEE[0][ii] + g_t16[0][1] * EEE[1][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(8 * line) + ii] = (int16_t)((g_t16[8][0] * EEE[0][ii] + g_t16[8][1] * EEE[1][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(4 * line) + ii] = (int16_t)((g_t16[4][0] * EEO[0][ii] + g_t16[4][1] * EEO[1][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(12 * line) + ii] = (int16_t)((g_t16[12][0] * EEO[0][ii] + g_t16[12][1] * EEO[1][ii] + add) >> shift) ; }
+
+ for(int ii=0; ii<8; ii++) { dst[(2 * line) + ii] = (int16_t)((g_t16[2][0] * EO[0][ii] + g_t16[2][1] * EO[1][ii] + g_t16[2][2] * EO[2][ii] + g_t16[2][3] * EO[3][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(6 * line) + ii] = (int16_t)((g_t16[6][0] * EO[0][ii] + g_t16[6][1] * EO[1][ii] + g_t16[6][2] * EO[2][ii] + g_t16[6][3] * EO[3][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(10 * line) + ii] = (int16_t)((g_t16[10][0] * EO[0][ii] + g_t16[10][1] * EO[1][ii] + g_t16[10][2] * EO[2][ii] + g_t16[10][3] * EO[3][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(14 * line) + ii] = (int16_t)((g_t16[14][0] * EO[0][ii] + g_t16[14][1] * EO[1][ii] + g_t16[14][2] * EO[2][ii] + g_t16[14][3] * EO[3][ii] + add) >> shift) ;}
+
+ for(int ii=0; ii<8; ii++) { dst[(1 * line) + ii] = (int16_t)((g_t16[1][0] * O[0][ii] + g_t16[1][1] * O[1][ii] + g_t16[1][2] * O[2][ii] + g_t16[1][3] * O[3][ii] + g_t16[1][4] * O[4][ii] + g_t16[1][5] * O[5][ii] + g_t16[1][6] * O[6][ii] + g_t16[1][7] * O[7][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(3 * line) + ii] = (int16_t)((g_t16[3][0] * O[0][ii] + g_t16[3][1] * O[1][ii] + g_t16[3][2] * O[2][ii] + g_t16[3][3] * O[3][ii] + g_t16[3][4] * O[4][ii] + g_t16[3][5] * O[5][ii] + g_t16[3][6] * O[6][ii] + g_t16[3][7] * O[7][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(5 * line) + ii] = (int16_t)((g_t16[5][0] * O[0][ii] + g_t16[5][1] * O[1][ii] + g_t16[5][2] * O[2][ii] + g_t16[5][3] * O[3][ii] + g_t16[5][4] * O[4][ii] + g_t16[5][5] * O[5][ii] + g_t16[5][6] * O[6][ii] + g_t16[5][7] * O[7][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(7 * line) + ii] = (int16_t)((g_t16[7][0] * O[0][ii] + g_t16[7][1] * O[1][ii] + g_t16[7][2] * O[2][ii] + g_t16[7][3] * O[3][ii] + g_t16[7][4] * O[4][ii] + g_t16[7][5] * O[5][ii] + g_t16[7][6] * O[6][ii] + g_t16[7][7] * O[7][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(9 * line) + ii] = (int16_t)((g_t16[9][0] * O[0][ii] + g_t16[9][1] * O[1][ii] + g_t16[9][2] * O[2][ii] + g_t16[9][3] * O[3][ii] + g_t16[9][4] * O[4][ii] + g_t16[9][5] * O[5][ii] + g_t16[9][6] * O[6][ii] + g_t16[9][7] * O[7][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(11 * line) + ii] = (int16_t)((g_t16[11][0] * O[0][ii] + g_t16[11][1] * O[1][ii] + g_t16[11][2] * O[2][ii] + g_t16[11][3] * O[3][ii] + g_t16[11][4] * O[4][ii] + g_t16[11][5] * O[5][ii] + g_t16[11][6] * O[6][ii] + g_t16[11][7] * O[7][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(13 * line) + ii] = (int16_t)((g_t16[13][0] * O[0][ii] + g_t16[13][1] * O[1][ii] + g_t16[13][2] * O[2][ii] + g_t16[13][3] * O[3][ii] + g_t16[13][4] * O[4][ii] + g_t16[13][5] * O[5][ii] + g_t16[13][6] * O[6][ii] + g_t16[13][7] * O[7][ii] + add) >> shift) ;}
+ for(int ii=0; ii<8; ii++) { dst[(15 * line) + ii] = (int16_t)((g_t16[15][0] * O[0][ii] + g_t16[15][1] * O[1][ii] + g_t16[15][2] * O[2][ii] + g_t16[15][3] * O[3][ii] + g_t16[15][4] * O[4][ii] + g_t16[15][5] * O[5][ii] + g_t16[15][6] * O[6][ii] + g_t16[15][7] * O[7][ii] + add) >> shift) ;}
+
+
+ src += 8;
+ dst += 8 ;
+
+ }
+} // end partialButterfly16_transposedSrc_altivec()
+
+
+static void dct16_altivec(const int16_t* src, int16_t* dst, intptr_t srcStride)
+{
+ const int shift_1st = 3 + X265_DEPTH - 8;
+ const int shift_2nd = 10;
+
+ ALIGN_VAR_32(int16_t, coef[16 * 16]);
+ ALIGN_VAR_32(int16_t, block_transposed[16 * 16]);
+ ALIGN_VAR_32(int16_t, coef_transposed[16 * 16]);
+
+ transpose_matrix_16_altivec((int16_t *)src, srcStride, (int16_t *)block_transposed, 16) ;
+ partialButterfly16_transposedSrc_altivec(block_transposed, coef, shift_1st) ;
+
+ transpose_matrix_16_altivec((int16_t *)coef, 16, (int16_t *)coef_transposed, 16) ;
+ partialButterfly16_transposedSrc_altivec(coef_transposed, dst, shift_2nd);
+} // end dct16_altivec()
+
+
+
+
+static void dct32_altivec(const int16_t* src, int16_t* dst, intptr_t srcStride)
+{
+ const int shift_1st = 4 + X265_DEPTH - 8;
+ const int shift_2nd = 11;
+
+ ALIGN_VAR_32(int16_t, coef[32 * 32]);
+ ALIGN_VAR_32(int16_t, block_transposed[32 * 32]);
+ ALIGN_VAR_32(int16_t, coef_transposed[32 * 32]);
+
+ transpose_matrix_32_altivec((int16_t *)src, srcStride, (int16_t *)block_transposed, 32) ;
+ partialButterfly32_transposedSrc_altivec(block_transposed, coef, shift_1st) ;
+
+ transpose_matrix_32_altivec((int16_t *)coef, 32, (int16_t *)coef_transposed, 32) ;
+ partialButterfly32_transposedSrc_altivec(coef_transposed, dst, shift_2nd);
+} // end dct32_altivec()
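dct16_altivec and dct32_altivec compute the separable 2D transform as two 1D passes: transpose the block so each pass reads its transform dimension contiguously, run the 1D butterfly, transpose the intermediate, and run it again. The same transpose-sandwich structure on a toy 2x2 transform (illustrative sketch; M here is a Hadamard-like stand-in, not an HEVC table, and no rounding shift is applied):

```cpp
#include <cassert>

static const int M[2][2] = { { 1, 1 }, { 1, -1 } };  // toy 1D transform

// One 1D pass over rows: dst[r][k] = sum_j M[k][j] * src[r][j].
static void transform_rows(const int src[2][2], int dst[2][2])
{
    for (int r = 0; r < 2; r++)
        for (int k = 0; k < 2; k++)
            dst[r][k] = M[k][0] * src[r][0] + M[k][1] * src[r][1];
}

static void transpose2(const int src[2][2], int dst[2][2])
{
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++)
            dst[c][r] = src[r][c];
}

// D = M * X * M^T via two row passes with transposes in between,
// so each pass only ever walks memory row-wise.
static void transform2d(const int x[2][2], int d[2][2])
{
    int t1[2][2], t2[2][2], t3[2][2];
    transform_rows(x, t1);   // X * M^T
    transpose2(t1, t2);      // M * X^T
    transform_rows(t2, t3);  // M * X^T * M^T
    transpose2(t3, d);       // M * X * M^T
}
```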
+
+
+namespace X265_NS {
+// x265 private namespace
+
+void setupDCTPrimitives_altivec(EncoderPrimitives& p)
+{
+ p.quant = quant_altivec ;
+
+ p.cu[BLOCK_16x16].dct = dct16_altivec ;
+ p.cu[BLOCK_32x32].dct = dct32_altivec ;
+
+ p.denoiseDct = denoiseDct_altivec ;
+}
+}
diff --git a/source/common/ppc/intrapred_altivec.cpp b/source/common/ppc/intrapred_altivec.cpp
new file mode 100644
index 0000000..6bd3005
--- /dev/null
+++ b/source/common/ppc/intrapred_altivec.cpp
@@ -0,0 +1,30809 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Roger Moussalli <rmoussal at us.ibm.com>
+ * Min Chen <min.chen at multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include <iostream>
+#include <vector>
+#include <assert.h>
+#include <math.h>
+#include <cmath>
+#include <linux/types.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/time.h>
+#include <string.h>
+
+#include "common.h"
+#include "primitives.h"
+#include "x265.h"
+#include "ppccommon.h"
+
+namespace X265_NS {
+
+/* INTRA Prediction - altivec implementation */
+template<int width, int dirMode>
+void intra_pred(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter){};
+
+template<>
+void intra_pred<4, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ if(dstStride == 4) {
+ const vec_u8_t srcV = vec_xl(10, srcPix0); /* offset = width*2 + 2 = (width << 1) + 2 */
+ const vec_u8_t mask = {0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03,0x04, 0x02, 0x03,0x04,0x05, 0x03,0x04,0x05, 0x06};
+ vec_u8_t vout = vec_perm(srcV, srcV, mask);
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_u8_t v0 = vec_xl(10, srcPix0);
+ vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst);
+ vec_u8_t v1 = vec_xl(11, srcPix0);
+ vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride));
+ vec_u8_t v2 = vec_xl(12, srcPix0);
+ vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2));
+ vec_u8_t v3 = vec_xl(13, srcPix0);
+ vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3));
+ }
+ else{
+ const vec_u8_t srcV = vec_xl(10, srcPix0); /* offset = width*2 + 2 = (width << 1) + 2 */
+ const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srcV, vec_xl(0, dst), mask_0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srcV, vec_xl(dstStride, dst), mask_1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srcV, vec_xl(dstStride*2, dst), mask_2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srcV, vec_xl(dstStride*3, dst), mask_3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ if(dstStride == 8) {
+ const vec_u8_t srcV1 = vec_xl(18, srcPix0); /* offset = width*2 + 2 = (width << 1) + 2 */
+ const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03,0x04, 0x05, 0x06, 0x07, 0x08};
+ const vec_u8_t mask_1 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ const vec_u8_t mask_2 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ const vec_u8_t mask_3 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e};
+ vec_u8_t v0 = vec_perm(srcV1, srcV1, mask_0);
+ vec_u8_t v1 = vec_perm(srcV1, srcV1, mask_1);
+ vec_u8_t v2 = vec_perm(srcV1, srcV1, mask_2);
+ vec_u8_t v3 = vec_perm(srcV1, srcV1, mask_3);
+ vec_xst(v0, 0, dst);
+ vec_xst(v1, 16, dst);
+ vec_xst(v2, 32, dst);
+ vec_xst(v3, 48, dst);
+ }
+ else{
+ const vec_u8_t srcV1 = vec_xl(18, srcPix0); /* offset = width*2 + 2 = (width << 1) + 2 */
+ const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_4 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_5 = {0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_6 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_7 = {0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srcV1, vec_xl(0, dst), mask_0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srcV1, vec_xl(dstStride, dst), mask_1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srcV1, vec_xl(dstStride*2, dst), mask_2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srcV1, vec_xl(dstStride*3, dst), mask_3);
+ vec_xst(v3, dstStride*3, dst);
+ vec_u8_t v4 = vec_perm(srcV1, vec_xl(dstStride*4, dst), mask_4);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(srcV1, vec_xl(dstStride*5, dst), mask_5);
+ vec_xst(v5, dstStride*5, dst);
+ vec_u8_t v6 = vec_perm(srcV1, vec_xl(dstStride*6, dst), mask_6);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(srcV1, vec_xl(dstStride*7, dst), mask_7);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ int i;
+ for(i=0; i<16; i++){
+        vec_xst(vec_xl(34 + i, srcPix0), i * dstStride, dst); /* first offset = width*2 + 2 */
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x <16; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
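The mode-2 specializations above are pure copies: judging from the `vec_xl` offsets, the diagonal source row begins at `srcPix0 + 2*width + 2`, and each output row is the previous one shifted left by one sample. A minimal scalar sketch of that behaviour (the template name and the `2*width + 2` layout assumption are ours, not part of the x265 API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar sketch of the mode-2 copy, assuming (from the vec_xl offsets above)
// that the diagonal source row starts at srcPix0 + 2*width + 2.
// Each output row is the previous row shifted left by one sample.
template<int width>
static void intra_pred_mode2_ref(uint8_t* dst, intptr_t dstStride, const uint8_t* srcPix0)
{
    const uint8_t* src = srcPix0 + 2 * width + 2; // first offset = width*2 + 2
    for (int y = 0; y < width; y++)
        memcpy(dst + y * dstStride, src + y, width); // row y = source shifted by y
}
```

With a ramp reference (`srcPix0[i] == i`), row 0 of a 16x16 block starts at value 34 (= 2*16 + 2) and each row starts one higher, which matches the `vec_xl(34+i, ...)` loads.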
+
+template<>
+void intra_pred<32, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ int i;
+    int off;
+ for(i=0; i<32; i++){
+ off = i*dstStride;
+        vec_xst(vec_xl(66 + i, srcPix0), off, dst);      /* first offset = width*2 + 2 */
+        vec_xst(vec_xl(82 + i, srcPix0), off + 16, dst); /* second 16 bytes of the row */
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x <32; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+#define one_line(s0, s1, vf32, vf, vout) {\
+    vmle0 = vec_mule(s0, vf32);\
+    vmlo0 = vec_mulo(s0, vf32);\
+    vmle1 = vec_mule(s1, vf);\
+    vmlo1 = vec_mulo(s1, vf);\
+    vsume = vec_add(vec_add(vmle0, vmle1), u16_16);\
+    ve = vec_sra(vsume, u16_5);\
+    vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);\
+    vo = vec_sra(vsumo, u16_5);\
+    vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));\
+}
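Per pixel, `one_line` evaluates the two-tap angular interpolation `((32 - frac) * ref[offset + x] + frac * ref[offset + x + 1] + 16) >> 5`, split into even/odd byte lanes. A scalar sketch for the 4x4 mode-33 case, deriving the offset/fraction tables quoted in the comments below (the `angle = 26` constant and function name are illustrative assumptions, not x265 API):

```cpp
#include <cassert>
#include <cstdint>

// Scalar sketch of the interpolation that one_line vectorizes, for mode 33
// (intra angle 26). offset/frac reproduce the commented tables:
// offset = {0,1,2,3,...}, fraction = {26,20,14,8,...}.
static void intra_pred_mode33_4x4_ref(uint8_t* dst, intptr_t dstStride, const uint8_t* ref)
{
    const int angle = 26;
    for (int y = 0; y < 4; y++)
    {
        int offset = ((y + 1) * angle) >> 5; // 0, 1, 2, 3
        int frac   = ((y + 1) * angle) & 31; // 26, 20, 14, 8
        for (int x = 0; x < 4; x++)
            dst[y * dstStride + x] = (uint8_t)
                (((32 - frac) * ref[offset + x] + frac * ref[offset + x + 1] + 16) >> 5);
    }
}
```

The vector code precomputes `vfrac4` (= frac) and `vfrac4_32` (= 32 - frac) per row and gathers `ref[offset + x]` / `ref[offset + x + 1]` with `vec_perm` masks instead of indexing.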
+
+template<>
+void intra_pred<4, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x1, 0x2, 0x3, 0x4, 0x2, 0x3, 0x4, 0x5, 0x3, 0x4, 0x5, 0x6};
+ vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x2, 0x3, 0x4, 0x5, 0x3, 0x4, 0x5, 0x6, 0x4, 0x5, 0x6, 0x7};
+
+ vec_u8_t vfrac4 = (vec_u8_t){26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8};
+ vec_u8_t vfrac4_32 = (vec_u8_t){6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24};
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7};
+ vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8};
+ vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9};
+ vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa};
+ vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb};
+ vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc};
+ vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd};
+ vec_u8_t mask7={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe};
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */
+
+ vec_u8_t vfrac8 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 26, 20, 14, 8, 2, 28, 22, 16};
+ vec_u8_t vfrac8_32 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 6, 12, 18, 24, 30, 4, 10, 16};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32);
+ vmlo0 = vec_mulo(srv2, vfrac8_32);
+ vmle1 = vec_mule(srv3, vfrac8);
+ vmlo1 = vec_mulo(srv3, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32);
+ vmlo0 = vec_mulo(srv4, vfrac8_32);
+ vmle1 = vec_mule(srv5, vfrac8);
+ vmlo1 = vec_mulo(srv5, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32);
+ vmlo0 = vec_mulo(srv6, vfrac8_32);
+ vmle1 = vec_mule(srv7, vfrac8);
+ vmlo1 = vec_mulo(srv7, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd};
+vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe};
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf};
+vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10};
+vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11};
+vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12};
+vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13};
+vec_u8_t mask7={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14};
+vec_u8_t mask8={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+vec_u8_t mask9={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+vec_u8_t mask10={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+vec_u8_t mask11={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+vec_u8_t mask13={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+vec_u8_t mask14={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+vec_u8_t mask15={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+
+    vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+    vec_u8_t sv1 = vec_xl(49, srcPix0);
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+ vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);
+
+vec_u8_t vfrac16 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+vec_u8_t vfrac16_32 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srva, vfrac16_32, vfrac16, vout_9);
+ one_line(srva, srvb, vfrac16_32, vfrac16, vout_10);
+ one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11);
+ one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12);
+ one_line(srvd, srve, vfrac16_32, vfrac16, vout_13);
+ one_line(srve, srvf, vfrac16_32, vfrac16, vout_14);
+ one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+
+vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, };
+vec_u8_t mask16_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, };
+vec_u8_t mask16_1={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask16_2={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask16_3={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, };
+vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask16_4={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, };
+vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask16_5={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask16_6={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask7={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask16_7={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask8={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask16_8={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask9={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask16_9={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask10={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask16_10={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask11={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask16_11={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask16_12={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask13={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask16_13={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask14={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask16_14={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask15={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask16_15={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+
+
+vec_u8_t maskadd1_31={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, };
+vec_u8_t maskadd1_16_31={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+    vec_u8_t sv1 = vec_xl(81, srcPix0);
+    vec_u8_t sv2 = vec_xl(97, srcPix0);
+    vec_u8_t sv3 = vec_xl(113, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0);
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srv10 = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srv11 = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srv13 = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srv14 = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srv15 = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0);
+    vec_u8_t srv16_1 = vec_perm(sv0, sv1, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(sv1, sv2, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(sv1, sv2, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(sv1, sv2, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(sv1, sv2, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(sv1, sv2, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(sv1, sv2, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(sv1, sv2, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */
+ vec_u8_t srv17 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv18 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv19 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv21 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv22 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv23 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv24 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv25 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srv26 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srv27 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srv28 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srv29 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srv31 = vec_perm(sv1, sv2, mask15);
+ vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31);
+
+
+ vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */
+ vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(sv2, sv3, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(sv2, sv3, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(sv2, sv3, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(sv2, sv3, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(sv2, sv3, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(sv2, sv3, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(sv2, sv3, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15);
+ vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31);
+
+
+vec_u8_t vfrac32_0 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+vec_u8_t vfrac32_1 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+vec_u8_t vfrac32_32_0 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 32};
+vec_u8_t vfrac32_32_1 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
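The `one_line()` helper invoked throughout these kernels packs one row of the two-tap angular filter. As a reference, here is a scalar model of what it computes, inferred from the `dst[y * dstStride + x]` formula comments in this file; the name and shape are illustrative, not x265's actual macro:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the one_line() macro (illustrative, inferred from the
// formula comments in this file): per byte lane,
//   out[i] = (a[i] * wa[i] + b[i] * wb[i] + 16) >> 5
// where wa[i] = 32 - frac and wb[i] = frac for that row.
static void one_line_scalar(const uint8_t* a, const uint8_t* b,
                            const uint8_t* wa, const uint8_t* wb,
                            int n, uint8_t* out)
{
    for (int i = 0; i < n; ++i)
        out[i] = (uint8_t)(((uint16_t)a[i] * wa[i] +
                            (uint16_t)b[i] * wb[i] + 16) >> 5);
}
```

With wa + wb = 32 each output is a rounded blend of two adjacent reference pixels, which is what the vec_mule/vec_mulo pairs compute sixteen lanes at a time.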
+
+template<>
+void intra_pred<4, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5};
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5, 0x4, 0x5, 0x5, 0x6};
+
+vec_u8_t vfrac4 = (vec_u8_t){21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20};
+vec_u8_t vfrac4_32 = (vec_u8_t){11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12};
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
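For strides that are neither 4 nor a multiple of 16, the branch above merges each 4-byte row into the existing destination bytes with vec_perm before a full 16-byte store. A scalar sketch of that read-modify-write follows; the helper name is hypothetical, not x265 code:

```cpp
#include <cassert>
#include <cstdint>

// Scalar sketch of the general-stride store path above: only 4 of the 16
// stored bytes belong to the row, so the existing destination bytes are
// loaded, the 4 new pixels are blended in, and all 16 bytes are written
// back so memory past the row is preserved. Hypothetical helper name.
static void blend_store4(uint8_t* dst, const uint8_t* src4)
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i) tmp[i] = dst[i];  // load existing bytes (vec_xl)
    for (int i = 0; i < 4; ++i)  tmp[i] = src4[i]; // blend in the new row (vec_perm)
    for (int i = 0; i < 16; ++i) dst[i] = tmp[i];  // store back (vec_xst)
}
```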
+
+template<>
+void intra_pred<8, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, };
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, };
+vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, };
+vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, };
+vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, };
+vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, };
+vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, };
+vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, };
+//vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 5 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 5, 6 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 6, 7 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 7, 8 */
+
+ //mode 4, mode32
+ //int offset_4[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+ //int fraction_4[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0};
+
+vec_u8_t vfrac8 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 21, 10, 31, 20, 9, 30, 19, 8, };
+vec_u8_t vfrac8_32 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 11, 22, 1, 12, 23, 2, 13, 24, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32);
+ vmlo0 = vec_mulo(srv2, vfrac8_32);
+ vmle1 = vec_mule(srv3, vfrac8);
+ vmlo1 = vec_mulo(srv3, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32);
+ vmlo0 = vec_mulo(srv4, vfrac8_32);
+ vmle1 = vec_mule(srv5, vfrac8);
+ vmlo1 = vec_mulo(srv5, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32);
+ vmlo0 = vec_mulo(srv6, vfrac8_32);
+ vmle1 = vec_mule(srv7, vfrac8);
+ vmlo1 = vec_mulo(srv7, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
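The offset_4[]/fraction_4[] tables commented above match the standard HEVC angular projection for an intra angle of 21 (mode 4): offset[y] = ((y + 1) * angle) >> 5 and fraction[y] = ((y + 1) * angle) & 31. A small sketch that regenerates them; the helper name is illustrative, not part of x265:

```cpp
#include <cassert>

// Regenerates the per-row offset/fraction tables from an HEVC intra angle,
// assuming the standard projection used by angular prediction:
//   offset[y]   = ((y + 1) * angle) >> 5
//   fraction[y] = ((y + 1) * angle) & 31
// Helper name is illustrative only.
static void angular_tables(int angle, int height, int* off, int* frac)
{
    for (int y = 0; y < height; ++y)
    {
        off[y]  = ((y + 1) * angle) >> 5;
        frac[y] = ((y + 1) * angle) & 31;
    }
}
```

Running this with angle 21 reproduces the commented tables, and angle 26 reproduces the mode-33 tables commented further below.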
+
+template<>
+void intra_pred<16, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, };
+vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, };
+vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, };
+vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, };
+vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, };
+vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, };
+vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, };
+vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, };
+vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, };
+vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, };
+vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, };
+vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, };
+vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, };
+vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, };
+vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x18, 0x19, };
+
+ vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+ vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);
+
+vec_u8_t vfrac16 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srva, vfrac16_32, vfrac16, vout_9);
+ one_line(srva, srvb, vfrac16_32, vfrac16, vout_10);
+ one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11);
+ one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12);
+ one_line(srvd, srve, vfrac16_32, vfrac16, vout_13);
+ one_line(srve, srvf, vfrac16_32, vfrac16, vout_14);
+ one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
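Throughout these kernels the byte multiplies are split into vec_mule (even lanes) and vec_mulo (odd lanes), since an 8-bit multiply widens to 16 bits; vec_mergeh/vec_mergel then re-interleave the two result vectors so that vec_pack narrows them back in source order. A plain-array model of that lane bookkeeping, not the intrinsics themselves:

```cpp
#include <cassert>
#include <cstdint>

// Plain-array model of the vec_mergeh/vec_mergel step above: given the
// even-lane results (from vec_mule) and odd-lane results (from vec_mulo),
// interleaving them restores the original lane order before narrowing.
static void interleave(const uint16_t* ve, const uint16_t* vo,
                       int pairs, uint16_t* out)
{
    for (int i = 0; i < pairs; ++i)
    {
        out[2 * i]     = ve[i]; // lane 2*i came from the even-lane vector
        out[2 * i + 1] = vo[i]; // lane 2*i+1 came from the odd-lane vector
    }
}
```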
+
+template<>
+void intra_pred<32, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-3;
+ dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-3;
+ dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-3;
+ dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+
+ ....
+ y=16; off16 = offset[16]; x=0-3;
+ dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 17] + 16) >> 5);
+ dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 17] + 16) >> 5);
+ dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 17] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 17] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 17] + 16) >> 5);
+
+ ....
+ y=31; off31 = offset[31]; x=0-3;
+ dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+ dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+ dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t mask16_0={0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, };
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, };
+vec_u8_t mask16_1={0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, };
+vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, };
+vec_u8_t mask16_2={0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, };
+vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, };
+vec_u8_t mask16_3={0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, };
+vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, };
+vec_u8_t mask16_4={0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, };
+vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, };
+vec_u8_t mask16_5={0x0, 0x0, 0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, };
+vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, };
+vec_u8_t mask16_6={0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, };
+vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, };
+vec_u8_t mask16_7={0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, };
+vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, };
+vec_u8_t mask16_8={0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, };
+vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, };
+vec_u8_t mask16_9={0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, };
+vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, };
+vec_u8_t mask16_10={0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, };
+vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, };
+vec_u8_t mask16_11={0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, };
+vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, };
+vec_u8_t mask16_12={0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, };
+vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, };
+vec_u8_t mask16_13={0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, };
+vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, };
+vec_u8_t mask16_14={0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, };
+vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x18, 0x19, };
+vec_u8_t mask16_15={0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, };
+/*vec_u8_t mask16={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t mask16_16={0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, };
+vec_u8_t mask17={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, };
+vec_u8_t mask16_17={0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, };
+vec_u8_t mask18={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, };
+vec_u8_t mask16_18={0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, };
+vec_u8_t mask19={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, };
+vec_u8_t mask16_19={0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, };
+vec_u8_t mask20={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, };
+vec_u8_t mask16_20={0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, };
+vec_u8_t mask21={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, };
+vec_u8_t mask16_21={0x0, 0x0, 0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, };
+vec_u8_t mask22={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, };
+vec_u8_t mask16_22={0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, };
+vec_u8_t mask23={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, };
+vec_u8_t mask16_23={0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, };
+vec_u8_t mask24={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, };
+vec_u8_t mask16_24={0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, };
+vec_u8_t mask25={0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, };
+vec_u8_t mask16_25={0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, };
+vec_u8_t mask26={0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, };
+vec_u8_t mask16_26={0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, };
+vec_u8_t mask27={0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, };
+vec_u8_t mask16_27={0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, };
+vec_u8_t mask28={0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, };
+vec_u8_t mask16_28={0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, };
+vec_u8_t mask29={0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, };
+vec_u8_t mask16_29={0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, };
+vec_u8_t mask30={0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, };
+vec_u8_t mask16_30={0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, };
+vec_u8_t mask31={0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x18, 0x19, };
+vec_u8_t mask16_31={0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, };*/
+vec_u8_t maskadd1_31={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t maskadd1_16_31={0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+
+/*
+ printf("source:\n");
+ for(int i=0; i<32; i++){
+ printf("%d ", srcPix0[i+65]);
+ }
+ printf("\n");
+ for(int i=0; i<32; i++){
+ printf("%d ", srcPix0[i+97]);
+ }
+ printf("\n\n");
+*/
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0);
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srv10 = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srv11 = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srv13 = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srv14 = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srv15 = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(sv0, sv1, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(sv1, sv2, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(sv1, sv2, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(sv1, sv2, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(sv1, sv2, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(sv1, sv2, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */
+ vec_u8_t srv17 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv18 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv19 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv21 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv22 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv23 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv24 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv25 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srv26 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srv27 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srv28 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srv29 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srv31 = vec_perm(sv1, sv2, mask15);
+ vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31);
+
+
+ vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */
+ vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(sv2, sv3, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(sv2, sv3, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(sv2, sv3, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(sv2, sv3, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(sv2, sv3, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15);
+ vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31);
+
+
+vec_u8_t vfrac32_0 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){27, 6, 17, 28, 7, 18, 29, 8, 19, 30, 9, 20, 31, 10, 21, 32, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
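All of the AltiVec routines in this patch vectorize the same scalar angular-prediction filter; as a readable cross-check, here is a scalar sketch of that filter. The helper name `intraPredAngScalar` and the explicit `angle` parameter are illustrative, not x265's internal API; the per-row offset/fraction derivation is inferred from the `vfrac*` constant tables above (e.g. angle 21 for this mode's `{21, 10, 31, 20, ...}` table).

```cpp
// Scalar model of the angular intra filter that the vector code implements.
// dst[y * dstStride + x] =
//     ((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5
// with off = ((y + 1) * angle) >> 5 and f = ((y + 1) * angle) & 31.
#include <cassert>
#include <cstdint>

typedef uint8_t pixel;

static void intraPredAngScalar(pixel* dst, long dstStride,
                               const pixel* ref, int size, int angle)
{
    for (int y = 0; y < size; y++)
    {
        int pos = (y + 1) * angle;
        int off = pos >> 5;   // integer part: reference offset for this row
        int f   = pos & 31;   // fractional part: interpolation weight
        for (int x = 0; x < size; x++)
            dst[y * dstStride + x] = (pixel)(
                ((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5);
    }
}
```

With a ramp reference `ref[i] = i`, each output reduces to `off + x + ((f + 16) >> 5)`, which makes the rounding behaviour easy to verify by hand.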
+
+template<>
+void intra_pred<4, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5, };
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5, 0x4, 0x5, 0x5, 0x6, };
+
+vec_u8_t vfrac4 = (vec_u8_t){17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, };
+vec_u8_t vfrac4_32 = (vec_u8_t){15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, };
+
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), dstStride, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), dstStride*2, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), dstStride*3, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
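The `vfrac4`/`vfrac4_32` constants above can be reproduced from the mode-5 intra angle, 17: `vfrac4` repeats the four per-row fractions `{17, 2, 19, 4}` and `vfrac4_32` their complements to 32. A minimal sketch of that derivation follows; the helper names are hypothetical, not x265 functions, and the angle value is read back from the tables rather than from x265 source.

```cpp
// Per-row fraction and reference offset for an HEVC angular intra mode:
// each row y advances the projected reference position by `angle` in
// 1/32-pel units, so the integer part is the offset and the low 5 bits
// are the interpolation fraction baked into the vfrac* tables.
#include <cassert>

static int fracForRow(int y, int angle)   { return ((y + 1) * angle) & 31; }
static int offsetForRow(int y, int angle) { return ((y + 1) * angle) >> 5; }
```

For mode 5 this yields fractions 17, 2, 19, 4 and offsets 0, 1, 1, 2 for rows 0 through 3, matching `vfrac4` and the `mask0`/`mask1` byte-shuffle patterns above.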
+
+template<>
+void intra_pred<8, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, };
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, };
+vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, };
+vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, };
+vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, };
+vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, };
+vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, };
+//vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 5 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 5, 6 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 6, 7 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 7, 8 */
+
+vec_u8_t vfrac8 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 17, 2, 19, 4, 21, 6, 23, 8, };
+vec_u8_t vfrac8_32 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 15, 30, 13, 28, 11, 26, 9, 24, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32);
+ vmlo0 = vec_mulo(srv2, vfrac8_32);
+ vmle1 = vec_mule(srv3, vfrac8);
+ vmlo1 = vec_mulo(srv3, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32);
+ vmlo0 = vec_mulo(srv4, vfrac8_32);
+ vmle1 = vec_mule(srv5, vfrac8);
+ vmlo1 = vec_mulo(srv5, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32);
+ vmlo0 = vec_mulo(srv6, vfrac8_32);
+ vmle1 = vec_mule(srv7, vfrac8);
+ vmlo1 = vec_mulo(srv7, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, };
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, };
+vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, };
+vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, };
+vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, };
+vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, };
+vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, };
+vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, };
+vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, };
+vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, };
+vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, };
+vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, };
+vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, };
+vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, };
+vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x16, 0x17, };
+//vec_u8_t mask16={0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x16, 0x17, 0x17, 0x18, };
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+33 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0);
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+ vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);
+
+vec_u8_t vfrac16 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srva, vfrac16_32, vfrac16, vout_9);
+ one_line(srva, srvb, vfrac16_32, vfrac16, vout_10);
+ one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11);
+ one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12);
+ one_line(srvd, srve, vfrac16_32, vfrac16, vout_13);
+ one_line(srve, srvf, vfrac16_32, vfrac16, vout_14);
+ one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ Scalar reference for this function (mode 5, intra angle 17, 32x32):
+
+ for (int y = 0; y < 32; y++)
+ {
+ int off = offset[y]; // offset[y] = ((y + 1) * 17) >> 5
+ for (int x = 0; x < 32; x++)
+ dst[y * dstStride + x] = (pixel)((f32[y] * ref[off + x] + f[y] * ref[off + x + 1] + 16) >> 5);
+ }
+
+ f[y] = ((y + 1) * 17) & 31, interleaved into vfrac32_0/vfrac32_1, and
+ f32[y] = 32 - f[y], interleaved into vfrac32_32_0/vfrac32_32_1.
+ */
+vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, };
+vec_u8_t mask16_0={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x11, };
+vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, };
+vec_u8_t mask16_1={0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x12, };
+vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t mask16_2={0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x13, };
+vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, };
+vec_u8_t mask16_3={0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x14, };
+vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, };
+vec_u8_t mask16_4={0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x15, };
+vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, };
+vec_u8_t mask16_5={0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x16, };
+vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, };
+vec_u8_t mask16_6={0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x17, };
+vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, };
+vec_u8_t mask16_7={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x8, };
+vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, };
+vec_u8_t mask16_8={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x9, };
+vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, };
+vec_u8_t mask16_9={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0xa, };
+vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, };
+vec_u8_t mask16_10={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xb, };
+vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, };
+vec_u8_t mask16_11={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xc, };
+vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, };
+vec_u8_t mask16_12={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xd, };
+vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, };
+vec_u8_t mask16_13={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xe, };
+vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, };
+vec_u8_t mask16_14={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xf, };
+vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x16, 0x17, };
+vec_u8_t mask16_15={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_31={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, };
+vec_u8_t maskadd1_16_31={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x11, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+65 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0);
+ vec_u8_t sv2 = vec_xl(97, srcPix0);
+ vec_u8_t sv3 = vec_xl(113, srcPix0);
+ //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0);
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srv10 = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srv11 = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srv13 = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srv14 = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srv15 = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(sv0, sv1, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(sv1, sv2, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(sv1, sv2, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(sv1, sv2, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */
+ vec_u8_t srv17 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv18 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv19 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv21 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv22 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv23 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv24 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv25 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srv26 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srv27 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srv28 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srv29 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srv31 = vec_perm(sv1, sv2, mask15);
+ vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31);
+
+
+ vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */
+ vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(sv2, sv3, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(sv2, sv3, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(sv2, sv3, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15);
+ vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31);
+
+
+vec_u8_t vfrac32_0 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){31, 14, 29, 12, 27, 10, 25, 8, 23, 6, 21, 4, 19, 2, 17, 32, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, };
+vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, };
+
+
+vec_u8_t vfrac4 = (vec_u8_t){13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, };
+
+vec_u8_t vfrac4_32 = (vec_u8_t){19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset[y] + x]; mode 6 angle = 13, offset[y] = ((y + 1) * 13) >> 5 = {0, 0, 1, 1} */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* select ref[offset[y] + x] for each row y */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, };
+vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, };
+vec_u8_t mask2={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, };
+vec_u8_t mask3={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, };
+vec_u8_t mask4={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, };
+vec_u8_t mask5={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, };
+vec_u8_t mask6={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, };
+vec_u8_t mask7={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, };
+//vec_u8_t mask8={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset[y] + x]; offset[y] = ((y + 1) * 13) >> 5 = {0, 0, 1, 1, 2, 2, 2, 3} */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 5 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 5, 6 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 6, 7 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 7, 8 */
+
+vec_u8_t vfrac8 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 13, 26, 7, 20, 1, 14, 27, 8, };
+vec_u8_t vfrac8_32 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 19, 6, 25, 12, 31, 18, 5, 24, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32);
+ vmlo0 = vec_mulo(srv2, vfrac8_32);
+ vmle1 = vec_mule(srv3, vfrac8);
+ vmlo1 = vec_mulo(srv3, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32);
+ vmlo0 = vec_mulo(srv4, vfrac8_32);
+ vmle1 = vec_mule(srv5, vfrac8);
+ vmlo1 = vec_mulo(srv5, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32);
+ vmlo0 = vec_mulo(srv6, vfrac8_32);
+ vmle1 = vec_mule(srv7, vfrac8);
+ vmlo1 = vec_mulo(srv7, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, };
+vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, };
+vec_u8_t mask2={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, };
+vec_u8_t mask3={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, };
+vec_u8_t mask4={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, };
+vec_u8_t mask5={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, };
+vec_u8_t mask6={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, };
+vec_u8_t mask7={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, };
+vec_u8_t mask8={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, };
+vec_u8_t mask9={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, };
+vec_u8_t mask10={0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, };
+vec_u8_t mask11={0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, };
+vec_u8_t mask12={0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, };
+vec_u8_t mask13={0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, };
+vec_u8_t mask14={0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, };
+vec_u8_t mask15={0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, };
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset[y] + x]; offset[y] = ((y + 1) * 13) >> 5 = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6} */
+ vec_u8_t sv1 = vec_xl(49, srcPix0);
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* select ref[offset[y] + x] for each row y */
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+ vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);
+
+vec_u8_t vfrac16 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srva, vfrac16_32, vfrac16, vout_9);
+ one_line(srva, srvb, vfrac16_32, vfrac16, vout_10);
+ one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11);
+ one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12);
+ one_line(srvd, srve, vfrac16_32, vfrac16, vout_13);
+ one_line(srve, srvf, vfrac16_32, vfrac16, vout_14);
+ one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0];
+ dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+
+ y=1; off1 = offset[1];
+ dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+
+ y=2; off2 = offset[2];
+ dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+
+ ....
+ y=16; off16 = offset[16];
+ dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 17] + 16) >> 5);
+ dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 17] + 16) >> 5);
+ dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 17] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 17] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 17] + 16) >> 5);
+
+ ....
+ y=31; off31 = offset[31];
+ dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+ dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+ dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, };
+vec_u8_t mask16_0={0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, };
+vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, };
+vec_u8_t mask16_1={0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, };
+vec_u8_t mask2={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, };
+vec_u8_t mask16_2={0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, };
+vec_u8_t mask3={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, };
+vec_u8_t mask16_3={0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, };
+vec_u8_t mask4={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, };
+vec_u8_t mask16_4={0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, };
+vec_u8_t mask5={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, };
+vec_u8_t mask16_5={0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, };
+vec_u8_t mask6={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, };
+vec_u8_t mask16_6={0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, };
+vec_u8_t mask7={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, };
+vec_u8_t mask16_7={0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, };
+vec_u8_t mask8={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, };
+vec_u8_t mask16_8={0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, };
+vec_u8_t mask9={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, };
+vec_u8_t mask16_9={0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, };
+vec_u8_t mask10={0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, };
+vec_u8_t mask16_10={0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, };
+vec_u8_t mask11={0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, };
+vec_u8_t mask16_11={0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, };
+vec_u8_t mask12={0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, };
+vec_u8_t mask16_12={0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, };
+vec_u8_t mask13={0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, };
+vec_u8_t mask16_13={0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, };
+vec_u8_t mask14={0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, };
+vec_u8_t mask16_14={0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, };
+vec_u8_t mask15={0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, };
+vec_u8_t mask16_15={0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, };
+vec_u8_t maskadd1_31={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, };
+vec_u8_t maskadd1_16_31={0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset[y] + x]; offset[y] = ((y + 1) * 13) >> 5, offset[31] = 13 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0);
+ vec_u8_t sv2 = vec_xl(97, srcPix0);
+ vec_u8_t sv3 = vec_xl(113, srcPix0);
+ //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+
+/*
+ printf("source:\n");
+ for(int i=0; i<32; i++){
+ printf("%d ", srcPix0[i+65]);
+ }
+ printf("\n");
+ for(int i=0; i<32; i++){
+ printf("%d ", srcPix0[i+97]);
+ }
+ printf("\n\n");
+*/
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0);
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srv10 = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srv11 = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srv13 = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srv14 = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srv15 = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(sv0, sv1,mask16_1);
+ vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(sv0, sv1, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(sv0, sv1, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(sv0, sv1, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */
+ vec_u8_t srv17 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv18 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv19 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv21 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv22 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv23 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv24 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv25 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srv26 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srv27 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srv28 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srv29 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srv31 = vec_perm(sv1, sv2, mask15);
+ vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31);
+
+
+ vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */
+ vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(sv1, sv2, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(sv1, sv2, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(sv1, sv2, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15);
+ vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){3, 22, 9, 28, 15, 2, 21, 8, 27, 14, 1, 20, 7, 26, 13, 32, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, };
+
+vec_u8_t vfrac4 = (vec_u8_t){9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, };
+
+vec_u8_t vfrac4_32 = (vec_u8_t){23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, };
+//vec_u8_t mask8={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 5 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 5, 6 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 6, 7 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 7, 8 */
+
+vec_u8_t vfrac8 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 9, 18, 27, 4, 13, 22, 31, 8, };
+vec_u8_t vfrac8_32 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 23, 14, 5, 28, 19, 10, 1, 24, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32);
+ vmlo0 = vec_mulo(srv2, vfrac8_32);
+ vmle1 = vec_mule(srv3, vfrac8);
+ vmlo1 = vec_mulo(srv3, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32);
+ vmlo0 = vec_mulo(srv4, vfrac8_32);
+ vmle1 = vec_mule(srv5, vfrac8);
+ vmlo1 = vec_mulo(srv5, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32);
+ vmlo0 = vec_mulo(srv6, vfrac8_32);
+ vmle1 = vec_mule(srv7, vfrac8);
+ vmlo1 = vec_mulo(srv7, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, };
+vec_u8_t mask8={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, };
+vec_u8_t mask9={0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, };
+vec_u8_t mask10={0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, };
+vec_u8_t mask11={0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, };
+vec_u8_t mask12={0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, };
+vec_u8_t mask13={0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, };
+vec_u8_t mask14={0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x12, 0x12, };
+vec_u8_t mask15={0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x12, 0x13, 0x13, };
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0);
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+ vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);
+
+vec_u8_t vfrac16 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srva, vfrac16_32, vfrac16, vout_9);
+ one_line(srva, srvb, vfrac16_32, vfrac16, vout_10);
+ one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11);
+ one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12);
+ one_line(srvd, srve, vfrac16_32, vfrac16, vout_13);
+ one_line(srve, srvf, vfrac16_32, vfrac16, vout_14);
+ one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-31;
+ dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-31;
+ dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-31;
+ dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+
+ ....
+ y=16; off16 = offset[16]; x=0-31;
+ dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 17] + 16) >> 5);
+ dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 17] + 16) >> 5);
+ dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 17] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 17] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 17] + 16) >> 5);
+
+ ....
+ y=31; off31 = offset[31]; x=0-31;
+ dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+ dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+ dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, };
+vec_u8_t mask16_0={0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, };
+vec_u8_t mask16_1={0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, };
+vec_u8_t mask16_2={0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, };
+vec_u8_t mask16_3={0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, };
+vec_u8_t mask16_4={0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, };
+vec_u8_t mask16_5={0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, };
+vec_u8_t mask16_6={0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, };
+vec_u8_t mask16_7={0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, };
+vec_u8_t mask8={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, };
+vec_u8_t mask16_8={0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, };
+vec_u8_t mask9={0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, };
+vec_u8_t mask16_9={0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, };
+vec_u8_t mask10={0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, };
+vec_u8_t mask16_10={0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, };
+vec_u8_t mask11={0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, };
+vec_u8_t mask16_11={0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, };
+vec_u8_t mask12={0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, };
+vec_u8_t mask16_12={0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, };
+vec_u8_t mask13={0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, };
+vec_u8_t mask16_13={0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, };
+vec_u8_t mask14={0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x12, 0x12, };
+vec_u8_t mask16_14={0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, };
+vec_u8_t mask15={0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x12, 0x13, 0x13, };
+vec_u8_t mask16_15={0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, };
+vec_u8_t maskadd1_31={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, };
+vec_u8_t maskadd1_16_31={0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0);
+ vec_u8_t sv2 = vec_xl(97, srcPix0);
+ vec_u8_t sv3 = vec_xl(113, srcPix0);
+ //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0);
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srv10 = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srv11 = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srv13 = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srv14 = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srv15 = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(sv0, sv1, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(sv0, sv1, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(sv0, sv1, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(sv0, sv1, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(sv0, sv1, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(sv0, sv1, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */
+ vec_u8_t srv17 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv18 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv19 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv21 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv22 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv23 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv24 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv25 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srv26 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srv27 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srv28 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srv29 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srv31 = vec_perm(sv1, sv2, mask15);
+ vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31);
+
+ vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */
+ vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(sv1, sv2, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(sv1, sv2, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(sv1, sv2, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(sv1, sv2, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(sv1, sv2, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15);
+ vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){7, 30, 21, 12, 3, 26, 17, 8, 31, 22, 13, 4, 27, 18, 9, 32, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, };
+
+vec_u8_t vfrac4 = (vec_u8_t){5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, };
+vec_u8_t vfrac4_32 = (vec_u8_t){27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
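The vectorized path above implements, per pixel, the two-tap filter spelled out in the inline comment. As a scalar reference (illustrative only, not part of the x265 sources; the helper name is made up), the arithmetic each lane performs is:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the per-pixel filter computed by the
// vec_mule/vec_mulo sequence above:
//   dst = ((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5
static inline uint8_t angular_interp(uint8_t a, uint8_t b, int frac)
{
    return (uint8_t)(((32 - frac) * a + frac * b + 16) >> 5);
}
```

With frac == 0 the filter degenerates to a copy of the left reference sample, and equal neighbours pass through unchanged, which is a quick sanity check on the rounding constant 16 and the shift by 5.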
+
+template<>
+void intra_pred<8, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, };
+//vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 5 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 5, 6 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 6, 7 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 7, 8 */
+
+vec_u8_t vfrac8 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 5, 10, 15, 20, 25, 30, 3, 8, };
+vec_u8_t vfrac8_32 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 27, 22, 17, 12, 7, 2, 29, 24, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32);
+ vmlo0 = vec_mulo(srv2, vfrac8_32);
+ vmle1 = vec_mule(srv3, vfrac8);
+ vmlo1 = vec_mulo(srv3, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32);
+ vmlo0 = vec_mulo(srv4, vfrac8_32);
+ vmle1 = vec_mule(srv5, vfrac8);
+ vmlo1 = vec_mulo(srv5, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32);
+ vmlo0 = vec_mulo(srv6, vfrac8_32);
+ vmle1 = vec_mule(srv7, vfrac8);
+ vmlo1 = vec_mulo(srv7, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, };
+vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, };
+vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, };
+vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, };
+vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, };
+vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, };
+vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, };
+vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, };
+vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, };
+
+ vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+ vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);
+
+vec_u8_t vfrac16 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srva, vfrac16_32, vfrac16, vout_9);
+ one_line(srva, srvb, vfrac16_32, vfrac16, vout_10);
+ one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11);
+ one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12);
+ one_line(srvd, srve, vfrac16_32, vfrac16, vout_13);
+ one_line(srve, srvf, vfrac16_32, vfrac16, vout_14);
+ one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
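The sixteen permute masks above encode, for row y, the per-column reference offset of this angular mode: byte i of mask y is y plus the offset for column i. Assuming this specialization corresponds to an intra prediction angle of 5/32 per column (an assumption, but one consistent with the fraction tables 5, 10, 15, ... = (5 * (x + 1)) & 31), the mask and fraction bytes can be regenerated as follows (hypothetical helper names):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical generators for the permute-mask and fraction bytes,
// assuming intraPredAngle = 5: the reference offset for column i is
// (angle * (i + 1)) >> 5, added to the row index y, and the blend
// fraction is (angle * (i + 1)) & 31.
uint8_t mask_byte(int y, int i, int angle = 5)
{
    return (uint8_t)(y + ((angle * (i + 1)) >> 5));
}

uint8_t frac_byte(int i, int angle = 5)
{
    return (uint8_t)((angle * (i + 1)) & 31);
}
```

For example, row 0 yields offsets 0 for columns 0-5, 1 for columns 6-11, and 2 for columns 12-15, matching mask0 = {0 x6, 1 x6, 2 x4}, and the last byte of mask15 comes out as 15 + 2 = 0x11 as written above.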
+
+template<>
+void intra_pred<32, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-3;
+ dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-3;
+ dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-3;
+ dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+
+ ....
+ y=16; off16 = offset[16]; x=0-3;
+ dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 16] + 16) >> 5);
+ dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 16] + 16) >> 5);
+ dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 16] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 16] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 16] + 16) >> 5);
+
+ ....
+ y=31; off31 = offset[31]; x=0-3;
+ dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 31] + 16) >> 5);
+ dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 31] + 16) >> 5);
+ dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 31] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 31] + 16) >> 5);
+ }
+ */
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask16_0={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask16_1={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask16_2={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask16_3={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask16_4={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask16_5={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, };
+vec_u8_t mask16_6={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, };
+vec_u8_t mask16_7={0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, };
+vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, };
+vec_u8_t mask16_8={0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, };
+vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, };
+vec_u8_t mask16_9={0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, };
+vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, };
+vec_u8_t mask16_10={0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, };
+vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, };
+vec_u8_t mask16_11={0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, };
+vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, };
+vec_u8_t mask16_12={0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, };
+vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, };
+vec_u8_t mask16_13={0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x12, };
+vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, };
+vec_u8_t mask16_14={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, };
+vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, };
+vec_u8_t mask16_15={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, };
+vec_u8_t maskadd1_31={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t maskadd1_16_31={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+
+/*
+ printf("source:\n");
+ for(int i=0; i<32; i++){
+ printf("%d ", srcPix0[i+65]);
+ }
+ printf("\n");
+ for(int i=0; i<32; i++){
+ printf("%d ", srcPix0[i+97]);
+ }
+ printf("\n\n");
+*/
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0);
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srv10 = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srv11 = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srv13 = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srv14 = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srv15 = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(sv0, sv1, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(sv0, sv1, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(sv0, sv1, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(sv0, sv1, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(sv0, sv1, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(sv0, sv1, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(sv0, sv1, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(sv0, sv1, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */
+ vec_u8_t srv17 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv18 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv19 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv21 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv22 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv23 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv24 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv25 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srv26 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srv27 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srv28 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srv29 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srv31 = vec_perm(sv1, sv2, mask15);
+ vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31);
+
+
+ vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */
+ vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(sv1, sv2, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(sv1, sv2, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(sv1, sv2, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(sv1, sv2, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(sv1, sv2, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(sv1, sv2, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(sv1, sv2, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15);
+ vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){11, 6, 1, 28, 23, 18, 13, 8, 3, 30, 25, 20, 15, 10, 5, 32, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, };
+
+vec_u8_t vfrac4 = (vec_u8_t){2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, };
+
+vec_u8_t vfrac4_32 = (vec_u8_t){30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, };
+
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
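+
+/* For reference, a scalar sketch of what these mode-9 routines compute,
+   following the index convention used by the comments in this file (the
+   names f, f32 and offset are illustrative, not the upstream C primitive):
+
+   for (int y = 0; y < width; y++)
+       for (int x = 0; x < width; x++)
+           dst[y * dstStride + x] = (pixel)((f32[y] * ref[offset[y] + x] +
+                                             f[y] * ref[offset[y] + x + 1] + 16) >> 5);
+*/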
+
+template<>
+void intra_pred<8, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, };
+//vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 5 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 5, 6 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 6, 7 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 7, 8 */
+
+vec_u8_t vfrac8 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 2, 4, 6, 8, 10, 12, 14, 16, };
+vec_u8_t vfrac8_32 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 30, 28, 26, 24, 22, 20, 18, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32);
+ vmlo0 = vec_mulo(srv2, vfrac8_32);
+ vmle1 = vec_mule(srv3, vfrac8);
+ vmlo1 = vec_mulo(srv3, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32);
+ vmlo0 = vec_mulo(srv4, vfrac8_32);
+ vmle1 = vec_mule(srv5, vfrac8);
+ vmlo1 = vec_mulo(srv5, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32);
+ vmlo0 = vec_mulo(srv6, vfrac8_32);
+ vmle1 = vec_mule(srv7, vfrac8);
+ vmlo1 = vec_mulo(srv7, vfrac8);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, };
+vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, };
+vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, };
+vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, };
+vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, };
+vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, };
+vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, };
+vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, };
+vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, };
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0);
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+ vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);
+
+vec_u8_t vfrac16 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 0, };
+vec_u8_t vfrac16_32 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 32, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srva, vfrac16_32, vfrac16, vout_9);
+ one_line(srva, srvb, vfrac16_32, vfrac16, vout_10);
+ one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11);
+ one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12);
+ one_line(srvd, srve, vfrac16_32, vfrac16, vout_13);
+ one_line(srve, srvf, vfrac16_32, vfrac16, vout_14);
+ one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0];
+ dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+
+ y=1; off1 = offset[1];
+ dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+
+ y=2; off2 = offset[2];
+ dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+
+ ....
+ y=16; off16 = offset[16];
+ dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 17] + 16) >> 5);
+ dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 17] + 16) >> 5);
+ dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 17] + 16) >> 5);
+ ...
+ dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 17] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 17] + 16) >> 5);
+
+ ....
+ y=31; off31 = offset[31];
+ dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+ dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+ dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+ ...
+ dst[31 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+
+vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, };
+vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, };
+vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, };
+vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, };
+vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, };
+vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, };
+vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, };
+vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, };
+vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, };
+vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, };
+vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, };
+vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, };
+vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, };
+vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, };
+vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, };
+vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0);
+ vec_u8_t sv2 = vec_xl(97, srcPix0);
+ vec_u8_t sv3 = vec_xl(113, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0);
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srv10 = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srv11 = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srv13 = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srv14 = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srv15 = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */
+ vec_u8_t srv17 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv18 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv19 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv21 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv22 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv23 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv24 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv25 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srv26 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srv27 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srv28 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srv29 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srv31 = vec_perm(sv1, sv2, mask15);
+ vec_u8_t srv32 = vec_perm(sv2, sv3, mask0);
+ vec_u8_t srv33 = vec_perm(sv2, sv3, mask1);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 0, };
+vec_u8_t vfrac32_1 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 32, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 32, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv1, srv2, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv2, srv3, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv3, srv4, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv4, srv5, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv5, srv6, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv6, srv7, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv7, srv8, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv8, srv9, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv9, srv10, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv10, srv11, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv11, srv12, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv12, srv13, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv13, srv14, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv14, srv15, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv15, srv16, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16, srv17, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv17, srv18, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv18, srv19, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv19, srv20, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv20, srv21, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv21, srv22, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv22, srv23, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv23, srv24, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv24, srv25, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv25, srv26, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv26, srv27, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv27, srv28, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv28, srv29, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv29, srv30, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv30, srv31, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv31, srv32, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv32, srv33, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+#ifdef WORDS_BIGENDIAN
+ vec_u8_t u8_to_s16_w4x4_mask1 = {0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t u8_to_s16_w4x4_mask9 = {0x00, 0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t u8_to_s16_w8x8_mask1 = {0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17, 0x00, 0x18};
+ vec_u8_t u8_to_s16_w8x8_maskh = {0x00, 0x10, 0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17};
+ vec_u8_t u8_to_s16_w8x8_maskl = {0x00, 0x18, 0x00, 0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x1d, 0x00, 0x1e, 0x00, 0x1f};
+ vec_u8_t u8_to_s16_b0_mask = {0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10};
+ vec_u8_t u8_to_s16_b1_mask = {0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11};
+ vec_u8_t u8_to_s16_b9_mask = {0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10};
+#else
+ vec_u8_t u8_to_s16_w4x4_mask1 = {0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t u8_to_s16_w4x4_mask9 = {0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t u8_to_s16_w8x8_mask1 = {0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17, 0x00, 0x18, 0x00};
+ vec_u8_t u8_to_s16_w8x8_maskh = {0x10, 0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17, 0x00};
+ vec_u8_t u8_to_s16_w8x8_maskl = {0x18, 0x00, 0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x1d, 0x00, 0x1e, 0x00, 0x1f, 0x00};
+ vec_u8_t u8_to_s16_b0_mask = {0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00};
+ vec_u8_t u8_to_s16_b1_mask = {0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00};
+ vec_u8_t u8_to_s16_b9_mask = {0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09};
+#endif
+vec_s16_t min_s16v = (vec_s16_t){255, 255, 255, 255, 255, 255, 255, 255};
+vec_u16_t one_u16v = (vec_u16_t)vec_splat_u16(1);
+
+template<>
+void intra_pred<4, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(9, srcPix0); /* offset = 2*width + 1 = 9 */
+ vec_u8_t v_filter_u8, v_mask0, v_mask;
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w4x4_mask1));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vec_u16_t)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ v_mask0 = (vec_u8_t){0x10, 0x11, 0x12, 0x13, 0x01, 0x01, 0x01, 0x01, 0x02, 0x02, 0x02, 0x02, 0x03, 0x03, 0x03, 0x03};
+ v_mask = (vec_u8_t){0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ }
+ else{
+ v_mask0 = (vec_u8_t){0x00, 0x00, 0x00, 0x00, 0x01, 0x01, 0x01, 0x01, 0x02, 0x02, 0x02, 0x02, 0x03, 0x03, 0x03, 0x03};
+ v_mask = (vec_u8_t){0x00, 0x00, 0x00, 0x00, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ v_filter_u8 = srv;
+ }
+
+
+ if(dstStride == 4) {
+ vec_u8_t v0 = vec_perm(srv, v_filter_u8, v_mask0);
+ vec_xst(v0, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_u8_t v0 = vec_perm(srv, v_filter_u8, v_mask0);
+ vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst);
+ vec_u8_t v1 = vec_sld(v0, v0, 12);
+ vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride));
+ vec_u8_t v2 = vec_sld(v0, v0, 8);
+ vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2));
+ vec_u8_t v3 = vec_sld(v0, v0, 4);
+ vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3));
+ }
+ else{
+ vec_u8_t v_mask1 = {0x01, 0x01, 0x01, 0x01, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x02, 0x02, 0x02, 0x02, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x03, 0x03, 0x03, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(v_filter_u8, vec_xl(0, dst), v_mask);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srv, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srv, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srv, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
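The horizontal specialisations above splat one left-neighbour sample across each row and, when bFilter is set, smooth the first row with half the top-minus-corner gradient, clamped to the 8-bit range. A minimal scalar sketch of that behaviour, with illustrative names (not x265 API) and assuming the reference layout these kernels use (srcPix0[0] = corner, srcPix0[1..] = top row, srcPix0[2*width+1..] = left column):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of the vectorised horizontal prediction above, assuming
 * 8-bit pixels. intraPredHorScalar is an illustrative name, not x265 API. */
static void intraPredHorScalar(uint8_t *dst, int dstStride,
                               const uint8_t *srcPix0, int width, int bFilter)
{
    const uint8_t *left = srcPix0 + 2 * width + 1;
    const uint8_t *top  = srcPix0 + 1;

    for (int y = 0; y < width; y++)       /* each row repeats left[y] */
        for (int x = 0; x < width; x++)
            dst[y * dstStride + x] = left[y];

    if (bFilter)                          /* gradient filter on the first row */
        for (int x = 0; x < width; x++)
        {
            int v = left[0] + ((top[x] - srcPix0[0]) >> 1);
            dst[x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
}
```

The clamp to [0, 255] corresponds to the vec_max/vec_min pair against zero_s16v and min_s16v in the SIMD code.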
+
+template<>
+void intra_pred<8, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(17, srcPix0); /* offset = width*2 + 1 = (width<<1) + 1 */
+
+ if(dstStride == 8) {
+ vec_u8_t v_mask0 = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01};
+ vec_u8_t v_mask1 = {0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03};
+ vec_u8_t v_mask2 = {0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05};
+ vec_u8_t v_mask3 = {0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07};
+ vec_u8_t v0 = vec_perm(srv, srv, v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srv, srv, v_mask1);
+ vec_xst(v1, 16, dst);
+ vec_u8_t v2 = vec_perm(srv, srv, v_mask2);
+ vec_xst(v2, 32, dst);
+ vec_u8_t v3 = vec_perm(srv, srv, v_mask3);
+ vec_xst(v3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask4 = {0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask5 = {0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask6 = {0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask7 = {0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srv, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srv, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srv, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srv, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ vec_u8_t v4 = vec_perm(srv, vec_xl(dstStride*4, dst), v_mask4);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(srv, vec_xl(dstStride*5, dst), v_mask5);
+ vec_xst(v5, dstStride*5, dst);
+ vec_u8_t v6 = vec_perm(srv, vec_xl(dstStride*6, dst), v_mask6);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(srv, vec_xl(dstStride*7, dst), v_mask7);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_mask1));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_xst( vec_perm(v_filter_u8, vec_xl(0, dst), v_mask0), 0, dst );
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(33, srcPix0); /* offset = width*2 + 1 = (width<<1) + 1 */
+
+ if(dstStride == 16) {
+ vec_xst(vec_splat(srv, 0), 0, dst);
+ vec_xst(vec_splat(srv, 1), 16, dst);
+ vec_xst(vec_splat(srv, 2), 32, dst);
+ vec_xst(vec_splat(srv, 3), 48, dst);
+ vec_xst(vec_splat(srv, 4), 64, dst);
+ vec_xst(vec_splat(srv, 5), 80, dst);
+ vec_xst(vec_splat(srv, 6), 96, dst);
+ vec_xst(vec_splat(srv, 7), 112, dst);
+ vec_xst(vec_splat(srv, 8), 128, dst);
+ vec_xst(vec_splat(srv, 9), 144, dst);
+ vec_xst(vec_splat(srv, 10), 160, dst);
+ vec_xst(vec_splat(srv, 11), 176, dst);
+ vec_xst(vec_splat(srv, 12), 192, dst);
+ vec_xst(vec_splat(srv, 13), 208, dst);
+ vec_xst(vec_splat(srv, 14), 224, dst);
+ vec_xst(vec_splat(srv, 15), 240, dst);
+ }
+ else{
+ vec_xst(vec_splat(srv, 0), 0, dst);
+ vec_xst(vec_splat(srv, 1), 1*dstStride, dst);
+ vec_xst(vec_splat(srv, 2), 2*dstStride, dst);
+ vec_xst(vec_splat(srv, 3), 3*dstStride, dst);
+ vec_xst(vec_splat(srv, 4), 4*dstStride, dst);
+ vec_xst(vec_splat(srv, 5), 5*dstStride, dst);
+ vec_xst(vec_splat(srv, 6), 6*dstStride, dst);
+ vec_xst(vec_splat(srv, 7), 7*dstStride, dst);
+ vec_xst(vec_splat(srv, 8), 8*dstStride, dst);
+ vec_xst(vec_splat(srv, 9), 9*dstStride, dst);
+ vec_xst(vec_splat(srv, 10), 10*dstStride, dst);
+ vec_xst(vec_splat(srv, 11), 11*dstStride, dst);
+ vec_xst(vec_splat(srv, 12), 12*dstStride, dst);
+ vec_xst(vec_splat(srv, 13), 13*dstStride, dst);
+ vec_xst(vec_splat(srv, 14), 14*dstStride, dst);
+ vec_xst(vec_splat(srv, 15), 15*dstStride, dst);
+ }
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_u8_t srcv1 = vec_xl(1, srcPix0);
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+ vec_xst( v_filter_u8, 0, dst );
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<32, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(65, srcPix0); /* offset = width*2 + 1 = (width<<1) + 1 */
+ vec_u8_t srv1 = vec_xl(81, srcPix0);
+ vec_u8_t vout;
+ int offset = 0;
+
+ #define v_pred32(vi, vo, i){\
+ vo = vec_splat(vi, i);\
+ vec_xst(vo, offset, dst);\
+ vec_xst(vo, 16+offset, dst);\
+ offset += dstStride;\
+ }
+
+ v_pred32(srv, vout, 0);
+ v_pred32(srv, vout, 1);
+ v_pred32(srv, vout, 2);
+ v_pred32(srv, vout, 3);
+ v_pred32(srv, vout, 4);
+ v_pred32(srv, vout, 5);
+ v_pred32(srv, vout, 6);
+ v_pred32(srv, vout, 7);
+ v_pred32(srv, vout, 8);
+ v_pred32(srv, vout, 9);
+ v_pred32(srv, vout, 10);
+ v_pred32(srv, vout, 11);
+ v_pred32(srv, vout, 12);
+ v_pred32(srv, vout, 13);
+ v_pred32(srv, vout, 14);
+ v_pred32(srv, vout, 15);
+
+ v_pred32(srv1, vout, 0);
+ v_pred32(srv1, vout, 1);
+ v_pred32(srv1, vout, 2);
+ v_pred32(srv1, vout, 3);
+ v_pred32(srv1, vout, 4);
+ v_pred32(srv1, vout, 5);
+ v_pred32(srv1, vout, 6);
+ v_pred32(srv1, vout, 7);
+ v_pred32(srv1, vout, 8);
+ v_pred32(srv1, vout, 9);
+ v_pred32(srv1, vout, 10);
+ v_pred32(srv1, vout, 11);
+ v_pred32(srv1, vout, 12);
+ v_pred32(srv1, vout, 13);
+ v_pred32(srv1, vout, 14);
+ v_pred32(srv1, vout, 15);
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_u8_t srcv1 = vec_xl(1, srcPix0);
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+ vec_xst( v_filter_u8, 0, dst );
+
+ vec_u8_t srcv2 = vec_xl(17, srcPix0);
+ vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh));
+ vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl));
+ vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v );
+ vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v );
+ vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16);
+ vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16);
+ vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum));
+ vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum));
+ vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16);
+ vec_xst( v2_filter_u8, 16, dst );
+
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, };
+ vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, };
+
+ vec_u8_t srv_left = vec_xl(0, srcPix0);
+ vec_u8_t srv_right = vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, };
+ vec_u8_t vfrac4_32 = (vec_u8_t){2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 0, (unsigned int*)(dst+dstStride));
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 0, (unsigned int*)(dst+dstStride*2));
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 0, (unsigned int*)(dst+dstStride*3));
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
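The per-pixel blend that the vec_mule/vec_mulo sequences above compute can be sketched in scalar form. Names are illustrative, not x265 API, and the per-row integer offset into the reference (selected above by the srvN permutes) is omitted for brevity:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of the fractional interpolation described by the comment
 * above: dst[y][x] = ((32 - f[y]) * ref[x] + f[y] * ref[x + 1] + 16) >> 5.
 * intraPredAngScalar is an illustrative name, not x265 API. */
static void intraPredAngScalar(uint8_t *dst, int dstStride,
                               const uint8_t *ref, const uint8_t *fraction,
                               int width)
{
    for (int y = 0; y < width; y++)
    {
        int f = fraction[y];              /* 0..31, from the vfrac tables */
        for (int x = 0; x < width; x++)
            dst[y * dstStride + x] =
                (uint8_t)(((32 - f) * ref[x] + f * ref[x + 1] + 16) >> 5);
    }
}
```

The vfrac4/vfrac4_32 constants above hold f[y] and 32 - f[y] replicated per row, so one vector multiply covers a whole block.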
+
+template<>
+void intra_pred<8, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, };
+ vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, };
+ vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, };
+ vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, };
+ vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, };
+ vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, };
+ vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, };
+ vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, };
+ //vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, };
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left = vec_xl(0, srcPix0);
+ vec_u8_t srv_right = vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+ vec_u8_t vfrac8 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 30, 28, 26, 24, 22, 20, 18, 16, };
+ vec_u8_t vfrac8_32 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 2, 4, 6, 8, 10, 12, 14, 16, };
+
+ one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0);
+ one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1);
+ one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2);
+ one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, };
+ vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, };
+ vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, };
+ vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, };
+ vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, };
+ vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, };
+ vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, };
+ vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, };
+ vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, };
+ vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, };
+ vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, };
+ vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, };
+ vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, };
+ vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, };
+ vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, };
+ vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, };
+ vec_u8_t maskadd1_15={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left = vec_xl(0, srcPix0); /* srcPix0[0] is the top-left corner sample */
+ vec_u8_t srv_right = vec_xl(33, srcPix0); /* left neighbours: offset = width*2 + 1 = 33 */
+ vec_u8_t refmask_16={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(48, srcPix0);
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = srv1;
+ vec_u8_t srv1_add1 = srv2;
+ vec_u8_t srv2_add1 = srv3;
+ vec_u8_t srv3_add1 = srv4;
+ vec_u8_t srv4_add1 = srv5;
+ vec_u8_t srv5_add1 = srv6;
+ vec_u8_t srv6_add1 = srv7;
+ vec_u8_t srv7_add1 = srv8;
+ vec_u8_t srv8_add1 = srv9;
+ vec_u8_t srv9_add1 = srv10;
+ vec_u8_t srv10_add1 = srv11;
+ vec_u8_t srv11_add1 = srv12;
+ vec_u8_t srv12_add1 = srv13;
+ vec_u8_t srv13_add1 = srv14;
+ vec_u8_t srv14_add1 = srv15;
+ vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15);
+
+ vec_u8_t vfrac16 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0, };
+ vec_u8_t vfrac16_32 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, };
+
+ one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
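The one_line sequences above rely on the AltiVec even/odd widening-multiply idiom: vec_mule and vec_mulo produce 16-bit products of the even and odd byte lanes, and vec_mergeh/vec_mergel re-interleave the results into source-lane order before vec_pack narrows back to bytes. A scalar model of that lane shuffle (illustrative, not x265 code):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vec_mule/vec_mulo + vec_mergeh/vec_mergel + vec_pack:
 * even and odd byte lanes are widened to 16-bit products separately, and
 * the merge steps re-interleave the halfwords into source-lane order
 * before the pack truncates each halfword back to a byte. */
static void muleMuloModel(const uint8_t a[16], const uint8_t b[16],
                          uint16_t ve[8], uint16_t vo[8], uint8_t out[16])
{
    for (int i = 0; i < 8; i++)
    {
        ve[i] = (uint16_t)(a[2 * i]     * b[2 * i]);     /* vec_mule */
        vo[i] = (uint16_t)(a[2 * i + 1] * b[2 * i + 1]); /* vec_mulo */
    }
    /* vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)) restores order: */
    for (int i = 0; i < 8; i++)
    {
        out[2 * i]     = (uint8_t)ve[i];
        out[2 * i + 1] = (uint8_t)vo[i];
    }
}
```

In the real kernels the rounding add and >> 5 happen on the halfword vectors before the merge/pack, so every packed byte already fits in 8 bits.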
+
+template<>
+void intra_pred<32, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, };
+ vec_u8_t mask1={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, };
+ vec_u8_t mask2={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, };
+ vec_u8_t mask3={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, };
+ vec_u8_t mask4={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, };
+ vec_u8_t mask5={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, };
+ vec_u8_t mask6={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, };
+ vec_u8_t mask7={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, };
+ vec_u8_t mask8={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, };
+ vec_u8_t mask9={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, };
+ vec_u8_t mask10={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, };
+ vec_u8_t mask11={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, };
+ vec_u8_t mask12={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, };
+ vec_u8_t mask13={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, };
+ vec_u8_t mask14={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, };
+ vec_u8_t mask15={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, };
+
+ vec_u8_t mask16_0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, };
+/*vec_u8_t mask16_1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, };
+vec_u8_t mask16_2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask16_3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask16_4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask16_5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask16_6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask16_7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask16_8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, };
+vec_u8_t mask16_9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, };
+vec_u8_t mask16_10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, };
+vec_u8_t mask16_11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, };
+vec_u8_t mask16_12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, };
+vec_u8_t mask16_13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, };
+vec_u8_t mask16_14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, };
+vec_u8_t mask16_15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, };
+*/
+ vec_u8_t maskadd1_31={0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, };
+ vec_u8_t maskadd1_16_31={0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+
+ vec_u8_t srv_left0 = vec_xl(0, srcPix0);
+ vec_u8_t srv_left1 = vec_xl(16, srcPix0);
+ vec_u8_t srv_right = vec_xl(65, srcPix0);
+
+ vec_u8_t refmask_32_0={0x10, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(79, srcPix0);
+ vec_u8_t s2 = vec_xl(95, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s0, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s0, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s0, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s0, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s0, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s0, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s0, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s0, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s0, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s0, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s0, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s0, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s0, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s0, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s0, mask14);
+ vec_u8_t srv15 = vec_perm(s1, s1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0);
+ vec_u8_t srv16_1 = srv0;
+ vec_u8_t srv16_2 = srv1;
+ vec_u8_t srv16_3 = srv2;
+ vec_u8_t srv16_4 = srv3;
+ vec_u8_t srv16_5 = srv4;
+ vec_u8_t srv16_6 = srv5;
+ vec_u8_t srv16_7 = srv6;
+ vec_u8_t srv16_8 = srv7;
+ vec_u8_t srv16_9 = srv8;
+ vec_u8_t srv16_10 = srv9;
+ vec_u8_t srv16_11 = srv10;
+ vec_u8_t srv16_12 = srv11;
+ vec_u8_t srv16_13 = srv12;
+ vec_u8_t srv16_14 = srv13;
+ vec_u8_t srv16_15 = srv14;
+
+/*
+ vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(s0, s0, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(s0, s0, mask16_11);
+ vec_u8_t srv16_12= vec_perm(s0, s0, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(s0, s0, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(s0, s0, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(s0, s0, mask16_15);
+*/
+ vec_u8_t srv16 = vec_perm(s1, s1, mask0);
+ vec_u8_t srv17 = vec_perm(s1, s1, mask1);
+ vec_u8_t srv18 = vec_perm(s1, s1, mask2);
+ vec_u8_t srv19 = vec_perm(s1, s1, mask3);
+ vec_u8_t srv20 = vec_perm(s1, s1, mask4);
+ vec_u8_t srv21 = vec_perm(s1, s1, mask5);
+ vec_u8_t srv22 = vec_perm(s1, s1, mask6);
+ vec_u8_t srv23 = vec_perm(s1, s1, mask7);
+ vec_u8_t srv24 = vec_perm(s1, s1, mask8);
+ vec_u8_t srv25 = vec_perm(s1, s1, mask9);
+ vec_u8_t srv26 = vec_perm(s1, s1, mask10);
+ vec_u8_t srv27 = vec_perm(s1, s1, mask11);
+ vec_u8_t srv28 = vec_perm(s1, s1, mask12);
+ vec_u8_t srv29 = vec_perm(s1, s1, mask13);
+ vec_u8_t srv30 = vec_perm(s1, s1, mask14);
+ vec_u8_t srv31 = vec_perm(s2, s2, mask15);
+
+/*
+ vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0);
+ vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(s1, s1, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(s1, s1, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(s1, s1, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(s1, s1, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(s1, s1, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(s1, s1, mask16_15);
+*/
+ vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0);
+ vec_u8_t srv16_17 = srv16;
+ vec_u8_t srv16_18 = srv17;
+ vec_u8_t srv16_19 = srv18;
+ vec_u8_t srv16_20 = srv19;
+ vec_u8_t srv16_21 = srv20;
+ vec_u8_t srv16_22 = srv21;
+ vec_u8_t srv16_23 = srv22;
+ vec_u8_t srv16_24 = srv23;
+ vec_u8_t srv16_25 = srv24;
+ vec_u8_t srv16_26 = srv25;
+ vec_u8_t srv16_27 = srv26;
+ vec_u8_t srv16_28 = srv27;
+ vec_u8_t srv16_29 = srv28;
+ vec_u8_t srv16_30 = srv29;
+ vec_u8_t srv16_31 = srv30;
+
+ vec_u8_t srv0add1 = srv1;
+ vec_u8_t srv1add1 = srv2;
+ vec_u8_t srv2add1 = srv3;
+ vec_u8_t srv3add1 = srv4;
+ vec_u8_t srv4add1 = srv5;
+ vec_u8_t srv5add1 = srv6;
+ vec_u8_t srv6add1 = srv7;
+ vec_u8_t srv7add1 = srv8;
+ vec_u8_t srv8add1 = srv9;
+ vec_u8_t srv9add1 = srv10;
+ vec_u8_t srv10add1 = srv11;
+ vec_u8_t srv11add1 = srv12;
+ vec_u8_t srv12add1 = srv13;
+ vec_u8_t srv13add1 = srv14;
+ vec_u8_t srv14add1 = srv15;
+ vec_u8_t srv15add1 = srv16;
+
+ vec_u8_t srv16add1_0 = srv16_1;
+ vec_u8_t srv16add1_1 = srv16_2;
+ vec_u8_t srv16add1_2 = srv16_3;
+ vec_u8_t srv16add1_3 = srv16_4;
+ vec_u8_t srv16add1_4 = srv16_5;
+ vec_u8_t srv16add1_5 = srv16_6;
+ vec_u8_t srv16add1_6 = srv16_7;
+ vec_u8_t srv16add1_7 = srv16_8;
+ vec_u8_t srv16add1_8 = srv16_9;
+ vec_u8_t srv16add1_9 = srv16_10;
+ vec_u8_t srv16add1_10 = srv16_11;
+ vec_u8_t srv16add1_11 = srv16_12;
+ vec_u8_t srv16add1_12 = srv16_13;
+ vec_u8_t srv16add1_13 = srv16_14;
+ vec_u8_t srv16add1_14 = srv16_15;
+ vec_u8_t srv16add1_15 = srv16_16;
+
+ vec_u8_t srv16add1 = srv17;
+ vec_u8_t srv17add1 = srv18;
+ vec_u8_t srv18add1 = srv19;
+ vec_u8_t srv19add1 = srv20;
+ vec_u8_t srv20add1 = srv21;
+ vec_u8_t srv21add1 = srv22;
+ vec_u8_t srv22add1 = srv23;
+ vec_u8_t srv23add1 = srv24;
+ vec_u8_t srv24add1 = srv25;
+ vec_u8_t srv25add1 = srv26;
+ vec_u8_t srv26add1 = srv27;
+ vec_u8_t srv27add1 = srv28;
+ vec_u8_t srv28add1 = srv29;
+ vec_u8_t srv29add1 = srv30;
+ vec_u8_t srv30add1 = srv31;
+ vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31);
+
+ vec_u8_t srv16add1_16 = srv16_17;
+ vec_u8_t srv16add1_17 = srv16_18;
+ vec_u8_t srv16add1_18 = srv16_19;
+ vec_u8_t srv16add1_19 = srv16_20;
+ vec_u8_t srv16add1_20 = srv16_21;
+ vec_u8_t srv16add1_21 = srv16_22;
+ vec_u8_t srv16add1_22 = srv16_23;
+ vec_u8_t srv16add1_23 = srv16_24;
+ vec_u8_t srv16add1_24 = srv16_25;
+ vec_u8_t srv16add1_25 = srv16_26;
+ vec_u8_t srv16add1_26 = srv16_27;
+ vec_u8_t srv16add1_27 = srv16_28;
+ vec_u8_t srv16add1_28 = srv16_29;
+ vec_u8_t srv16add1_29 = srv16_30;
+ vec_u8_t srv16add1_30 = srv16_31;
+ vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0, };
+vec_u8_t vfrac32_1 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, };
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, };
+ vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, };
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, };
+ vec_u8_t vfrac4_32 = (vec_u8_t){5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, };
+vec_u8_t mask1={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, };
+vec_u8_t mask2={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, };
+vec_u8_t mask3={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t mask4={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, };
+vec_u8_t mask5={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t mask6={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, };
+vec_u8_t mask7={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+vec_u8_t vfrac8 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 27, 22, 17, 12, 7, 2, 29, 24, };
+vec_u8_t vfrac8_32 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 5, 10, 15, 20, 25, 30, 3, 8, };
+
+ one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0);
+ one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1);
+ one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2);
+ one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, };
+vec_u8_t mask1={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, };
+vec_u8_t mask2={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask3={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask4={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask5={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask6={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask7={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask8={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, };
+vec_u8_t mask9={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, };
+vec_u8_t mask10={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, };
+vec_u8_t mask11={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, };
+vec_u8_t mask12={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, };
+vec_u8_t mask13={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, };
+vec_u8_t mask14={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, };
+vec_u8_t mask15={0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, };
+
+vec_u8_t maskadd1_15={0x12, 0x12, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(33, srcPix0);
+ vec_u8_t refmask_16={0xd, 0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(46, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s0, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s0, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s0, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s0, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s0, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s0, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s0, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s0, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s0, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s0, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s0, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s0, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s0, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s0, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = srv1;
+ vec_u8_t srv1_add1 = srv2;
+ vec_u8_t srv2_add1 = srv3;
+ vec_u8_t srv3_add1 = srv4;
+ vec_u8_t srv4_add1 = srv5;
+ vec_u8_t srv5_add1 = srv6;
+ vec_u8_t srv6_add1 = srv7;
+ vec_u8_t srv7_add1 = srv8;
+ vec_u8_t srv8_add1 = srv9;
+ vec_u8_t srv9_add1 = srv10;
+ vec_u8_t srv10_add1 = srv11;
+ vec_u8_t srv11_add1 = srv12;
+ vec_u8_t srv12_add1 = srv13;
+ vec_u8_t srv13_add1 = srv14;
+ vec_u8_t srv14_add1 = srv15;
+ vec_u8_t srv15_add1 = vec_perm(s1, s1, maskadd1_15);
+
+vec_u8_t vfrac16 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<32, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask1={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask2={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask3={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask4={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask5={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask6={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, };
+vec_u8_t mask7={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, };
+vec_u8_t mask8={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, };
+vec_u8_t mask9={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, };
+vec_u8_t mask10={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, };
+vec_u8_t mask11={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, };
+vec_u8_t mask12={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, };
+vec_u8_t mask13={0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, };
+vec_u8_t mask14={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, };
+vec_u8_t mask15={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, };
+
+vec_u8_t mask16_0={0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, };
+vec_u8_t mask16_1={0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, };
+vec_u8_t mask16_2={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask16_3={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask16_4={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask16_5={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask16_6={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask16_7={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask16_8={0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, };
+vec_u8_t mask16_9={0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, };
+vec_u8_t mask16_10={0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, };
+vec_u8_t mask16_11={0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, };
+vec_u8_t mask16_12={0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, };
+vec_u8_t mask16_13={0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, };
+vec_u8_t mask16_14={0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, };
+vec_u8_t mask16_15={0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, };
+
+vec_u8_t maskadd1_31={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t maskadd1_16_31={0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1a, 0x13, 0xd, 0x6, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(76, srcPix0);
+ vec_u8_t s2 = vec_xl(92, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s0, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s0, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s0, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s0, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s0, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s0, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s0, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s0, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s0, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s0, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s0, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s0, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s1, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s1, s1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(s0, s0, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(s0, s0, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(s0, s0, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(s0, s0, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(s1, s1, mask0);
+ vec_u8_t srv17 = vec_perm(s1, s1, mask1);
+ vec_u8_t srv18 = vec_perm(s1, s1, mask2);
+ vec_u8_t srv19 = vec_perm(s1, s1, mask3);
+ vec_u8_t srv20 = vec_perm(s1, s1, mask4);
+ vec_u8_t srv21 = vec_perm(s1, s1, mask5);
+ vec_u8_t srv22 = vec_perm(s1, s1, mask6);
+ vec_u8_t srv23 = vec_perm(s1, s1, mask7);
+ vec_u8_t srv24 = vec_perm(s1, s1, mask8);
+ vec_u8_t srv25 = vec_perm(s1, s1, mask9);
+ vec_u8_t srv26 = vec_perm(s1, s1, mask10);
+ vec_u8_t srv27 = vec_perm(s1, s1, mask11);
+ vec_u8_t srv28 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv29 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv30 = vec_perm(s2, s2, mask14);
+ vec_u8_t srv31 = vec_perm(s2, s2, mask15);
+
+ vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0);
+ vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(s1, s1, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(s1, s1, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(s1, s1, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(s1, s1, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15);
+
+ vec_u8_t srv0add1 = srv1;
+ vec_u8_t srv1add1 = srv2;
+ vec_u8_t srv2add1 = srv3;
+ vec_u8_t srv3add1 = srv4;
+ vec_u8_t srv4add1 = srv5;
+ vec_u8_t srv5add1 = srv6;
+ vec_u8_t srv6add1 = srv7;
+ vec_u8_t srv7add1 = srv8;
+ vec_u8_t srv8add1 = srv9;
+ vec_u8_t srv9add1 = srv10;
+ vec_u8_t srv10add1 = srv11;
+ vec_u8_t srv11add1 = srv12;
+ vec_u8_t srv12add1 = srv13;
+ vec_u8_t srv13add1 = srv14;
+ vec_u8_t srv14add1 = srv15;
+ vec_u8_t srv15add1 = srv16;
+
+ vec_u8_t srv16add1_0 = srv16_1;
+ vec_u8_t srv16add1_1 = srv16_2;
+ vec_u8_t srv16add1_2 = srv16_3;
+ vec_u8_t srv16add1_3 = srv16_4;
+ vec_u8_t srv16add1_4 = srv16_5;
+ vec_u8_t srv16add1_5 = srv16_6;
+ vec_u8_t srv16add1_6 = srv16_7;
+ vec_u8_t srv16add1_7 = srv16_8;
+ vec_u8_t srv16add1_8 = srv16_9;
+ vec_u8_t srv16add1_9 = srv16_10;
+ vec_u8_t srv16add1_10 = srv16_11;
+ vec_u8_t srv16add1_11 = srv16_12;
+ vec_u8_t srv16add1_12 = srv16_13;
+ vec_u8_t srv16add1_13 = srv16_14;
+ vec_u8_t srv16add1_14 = srv16_15;
+ vec_u8_t srv16add1_15 = srv16_16;
+
+ vec_u8_t srv16add1 = srv17;
+ vec_u8_t srv17add1 = srv18;
+ vec_u8_t srv18add1 = srv19;
+ vec_u8_t srv19add1 = srv20;
+ vec_u8_t srv20add1 = srv21;
+ vec_u8_t srv21add1 = srv22;
+ vec_u8_t srv22add1 = srv23;
+ vec_u8_t srv23add1 = srv24;
+ vec_u8_t srv24add1 = srv25;
+ vec_u8_t srv25add1 = srv26;
+ vec_u8_t srv26add1 = srv27;
+ vec_u8_t srv27add1 = srv28;
+ vec_u8_t srv28add1 = srv29;
+ vec_u8_t srv29add1 = srv30;
+ vec_u8_t srv30add1 = srv31;
+ vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31);
+
+ vec_u8_t srv16add1_16 = srv16_17;
+ vec_u8_t srv16add1_17 = srv16_18;
+ vec_u8_t srv16add1_18 = srv16_19;
+ vec_u8_t srv16add1_19 = srv16_20;
+ vec_u8_t srv16add1_20 = srv16_21;
+ vec_u8_t srv16add1_21 = srv16_22;
+ vec_u8_t srv16add1_22 = srv16_23;
+ vec_u8_t srv16add1_23 = srv16_24;
+ vec_u8_t srv16add1_24 = srv16_25;
+ vec_u8_t srv16add1_25 = srv16_26;
+ vec_u8_t srv16add1_26 = srv16_27;
+ vec_u8_t srv16add1_27 = srv16_28;
+ vec_u8_t srv16add1_28 = srv16_29;
+ vec_u8_t srv16add1_29 = srv16_30;
+ vec_u8_t srv16add1_30 = srv16_31;
+ vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){11, 6, 1, 28, 23, 18, 13, 8, 3, 30, 25, 20, 15, 10, 5, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 32, };
+
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x1, 0x1, 0x1, 0x0, 0x2, 0x2, 0x2, 0x1, 0x3, 0x3, 0x3, 0x2, 0x4, 0x4, 0x4, 0x3, };
+ vec_u8_t mask1={0x2, 0x2, 0x2, 0x1, 0x3, 0x3, 0x3, 0x2, 0x4, 0x4, 0x4, 0x3, 0x5, 0x5, 0x5, 0x4, };
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+vec_u8_t vfrac4 = (vec_u8_t){23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, };
+vec_u8_t vfrac4_32 = (vec_u8_t){9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
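The per-pixel arithmetic that the vectorized sequences above implement (via `vec_mule`/`vec_mulo`, add of the rounding constant 16, and a right shift by 5) can be checked against a scalar sketch. This is an illustration only, assuming 8-bit pixels; `blend_ref` is a hypothetical helper name, not an x265 symbol:

```cpp
#include <cstdint>

// Scalar form of the interpolation in the comment above:
//   dst[y * dstStride + x] =
//       ((32 - f[y]) * ref[off + x] + f[y] * ref[off + x + 1] + 16) >> 5
// frac is the per-row fraction in [0, 31]; (32 - frac) weights the first
// reference sample, frac weights the next one, and +16 rounds before the
// shift by 5 (i.e. division by 32).
static inline uint8_t blend_ref(uint8_t a, uint8_t b, int frac)
{
    return (uint8_t)(((32 - frac) * a + frac * b + 16) >> 5);
}
```

The SIMD code computes exactly this for 16 pixels at a time, splitting the 8-bit multiplies into even and odd 16-bit lanes and re-interleaving them with `vec_mergeh`/`vec_mergel` before packing back to bytes.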
+
+template<>
+void intra_pred<8, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, };
+ vec_u8_t mask1={0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, };
+ vec_u8_t mask2={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, };
+ vec_u8_t mask3={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, };
+ vec_u8_t mask4={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, };
+ vec_u8_t mask5={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, };
+ vec_u8_t mask6={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, };
+ vec_u8_t mask7={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, };
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+ vec_u8_t vfrac8 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 23, 14, 5, 28, 19, 10, 1, 24, };
+ vec_u8_t vfrac8_32 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 9, 18, 27, 4, 13, 22, 31, 8, };
+
+ one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0);
+ one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1);
+ one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2);
+ one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, };
+vec_u8_t mask1={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, };
+vec_u8_t mask2={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, };
+vec_u8_t mask3={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, };
+vec_u8_t mask4={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t mask5={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, };
+vec_u8_t mask6={0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t mask7={0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, };
+vec_u8_t mask8={0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, };
+vec_u8_t mask9={0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, };
+vec_u8_t mask10={0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, };
+vec_u8_t mask11={0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, };
+vec_u8_t mask12={0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, };
+vec_u8_t mask13={0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, };
+vec_u8_t mask14={0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, };
+vec_u8_t mask15={0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, };
+vec_u8_t maskadd1_15={0x14, 0x14, 0x14, 0x13, 0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(44, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = srv1;
+ vec_u8_t srv1_add1 = srv2;
+ vec_u8_t srv2_add1 = srv3;
+ vec_u8_t srv3_add1 = srv4;
+ vec_u8_t srv4_add1 = srv5;
+ vec_u8_t srv5_add1 = srv6;
+ vec_u8_t srv6_add1 = srv7;
+ vec_u8_t srv7_add1 = srv8;
+ vec_u8_t srv8_add1 = srv9;
+ vec_u8_t srv9_add1 = srv10;
+ vec_u8_t srv10_add1 = srv11;
+ vec_u8_t srv11_add1 = srv12;
+ vec_u8_t srv12_add1 = srv13;
+ vec_u8_t srv13_add1 = srv14;
+ vec_u8_t srv14_add1 = srv15;
+ vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15);
+
+vec_u8_t vfrac16 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t mask1={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, };
+vec_u8_t mask2={0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t mask3={0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, };
+vec_u8_t mask4={0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, };
+vec_u8_t mask5={0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, };
+vec_u8_t mask6={0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, };
+vec_u8_t mask7={0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, };
+vec_u8_t mask8={0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, };
+vec_u8_t mask9={0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, };
+vec_u8_t mask10={0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, };
+vec_u8_t mask11={0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, };
+vec_u8_t mask12={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, };
+vec_u8_t mask13={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, };
+vec_u8_t mask14={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, };
+vec_u8_t mask15={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, };
+
+vec_u8_t mask16_0={0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, };
+vec_u8_t mask16_1={0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, };
+vec_u8_t mask16_2={0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, };
+vec_u8_t mask16_3={0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, };
+vec_u8_t mask16_4={0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, };
+vec_u8_t mask16_5={0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, };
+vec_u8_t mask16_6={0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, };
+vec_u8_t mask16_7={0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, };
+vec_u8_t mask16_8={0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, };
+vec_u8_t mask16_9={0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, };
+vec_u8_t mask16_10={0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, };
+vec_u8_t mask16_11={0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, };
+vec_u8_t mask16_12={0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, };
+vec_u8_t mask16_13={0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, };
+vec_u8_t mask16_14={0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, };
+vec_u8_t mask16_15={0x13, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, };
+
+vec_u8_t maskadd1_31={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t maskadd1_16_31={0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(72, srcPix0);
+ vec_u8_t s2 = vec_xl(88, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s0, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s0, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s0, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s0, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s0, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s0, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s0, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s0, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s1, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s1, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s1, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s1, s1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(s0, s0, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(s0, s0, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(s0, s1, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(s1, s1, mask0);
+ vec_u8_t srv17 = vec_perm(s1, s1, mask1);
+ vec_u8_t srv18 = vec_perm(s1, s1, mask2);
+ vec_u8_t srv19 = vec_perm(s1, s1, mask3);
+ vec_u8_t srv20 = vec_perm(s1, s1, mask4);
+ vec_u8_t srv21 = vec_perm(s1, s1, mask5);
+ vec_u8_t srv22 = vec_perm(s1, s1, mask6);
+ vec_u8_t srv23 = vec_perm(s1, s1, mask7);
+ vec_u8_t srv24 = vec_perm(s1, s2, mask8);
+ vec_u8_t srv25 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv26 = vec_perm(s1, s2, mask10);
+ vec_u8_t srv27 = vec_perm(s1, s2, mask11);
+ vec_u8_t srv28 = vec_perm(s2, s2, mask12);
+ vec_u8_t srv29 = vec_perm(s2, s2, mask13);
+ vec_u8_t srv30 = vec_perm(s2, s2, mask14);
+ vec_u8_t srv31 = vec_perm(s2, s2, mask15);
+
+ vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0);
+ vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(s1, s1, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(s1, s1, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15);
+
+ vec_u8_t srv0add1 = srv1;
+ vec_u8_t srv1add1 = srv2;
+ vec_u8_t srv2add1 = srv3;
+ vec_u8_t srv3add1 = srv4;
+ vec_u8_t srv4add1 = srv5;
+ vec_u8_t srv5add1 = srv6;
+ vec_u8_t srv6add1 = srv7;
+ vec_u8_t srv7add1 = srv8;
+ vec_u8_t srv8add1 = srv9;
+ vec_u8_t srv9add1 = srv10;
+ vec_u8_t srv10add1 = srv11;
+ vec_u8_t srv11add1 = srv12;
+ vec_u8_t srv12add1 = srv13;
+ vec_u8_t srv13add1 = srv14;
+ vec_u8_t srv14add1 = srv15;
+ vec_u8_t srv15add1 = srv16;
+
+ vec_u8_t srv16add1_0 = srv16_1;
+ vec_u8_t srv16add1_1 = srv16_2;
+ vec_u8_t srv16add1_2 = srv16_3;
+ vec_u8_t srv16add1_3 = srv16_4;
+ vec_u8_t srv16add1_4 = srv16_5;
+ vec_u8_t srv16add1_5 = srv16_6;
+ vec_u8_t srv16add1_6 = srv16_7;
+ vec_u8_t srv16add1_7 = srv16_8;
+ vec_u8_t srv16add1_8 = srv16_9;
+ vec_u8_t srv16add1_9 = srv16_10;
+ vec_u8_t srv16add1_10 = srv16_11;
+ vec_u8_t srv16add1_11 = srv16_12;
+ vec_u8_t srv16add1_12 = srv16_13;
+ vec_u8_t srv16add1_13 = srv16_14;
+ vec_u8_t srv16add1_14 = srv16_15;
+ vec_u8_t srv16add1_15 = srv16_16;
+
+ vec_u8_t srv16add1 = srv17;
+ vec_u8_t srv17add1 = srv18;
+ vec_u8_t srv18add1 = srv19;
+ vec_u8_t srv19add1 = srv20;
+ vec_u8_t srv20add1 = srv21;
+ vec_u8_t srv21add1 = srv22;
+ vec_u8_t srv22add1 = srv23;
+ vec_u8_t srv23add1 = srv24;
+ vec_u8_t srv24add1 = srv25;
+ vec_u8_t srv25add1 = srv26;
+ vec_u8_t srv26add1 = srv27;
+ vec_u8_t srv27add1 = srv28;
+ vec_u8_t srv28add1 = srv29;
+ vec_u8_t srv29add1 = srv30;
+ vec_u8_t srv30add1 = srv31;
+ vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31);
+
+ vec_u8_t srv16add1_16 = srv16_17;
+ vec_u8_t srv16add1_17 = srv16_18;
+ vec_u8_t srv16add1_18 = srv16_19;
+ vec_u8_t srv16add1_19 = srv16_20;
+ vec_u8_t srv16add1_20 = srv16_21;
+ vec_u8_t srv16add1_21 = srv16_22;
+ vec_u8_t srv16add1_22 = srv16_23;
+ vec_u8_t srv16add1_23 = srv16_24;
+ vec_u8_t srv16add1_24 = srv16_25;
+ vec_u8_t srv16add1_25 = srv16_26;
+ vec_u8_t srv16add1_26 = srv16_27;
+ vec_u8_t srv16add1_27 = srv16_28;
+ vec_u8_t srv16add1_28 = srv16_29;
+ vec_u8_t srv16add1_29 = srv16_30;
+ vec_u8_t srv16add1_30 = srv16_31;
+ vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){7, 30, 21, 12, 3, 26, 17, 8, 31, 22, 13, 4, 27, 18, 9, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 32, };
+
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
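The unrolled `vec_xst` calls above write each 32-pixel row of the block as two 16-byte vector halves. A minimal sketch of that addressing rule, under the assumption that `dst` points at row 0 of a buffer with `dstStride` bytes per row (the helper name `half_offset` is hypothetical, not part of the x265 source):

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of one 16-byte half of row y in a 32-pixel-wide block:
 * the left half lands at y*dstStride, the right half at y*dstStride + 16.
 * The second pass of one_line/vec_xst pairs applies the same rule to
 * rows 16..31. */
static size_t half_offset(size_t y, size_t dstStride, int second_half)
{
    return y * dstStride + (second_half ? 16u : 0u);
}
```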
+
+
+
+template<>
+void intra_pred<4, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x1, 0x1, 0x0, 0x0, 0x2, 0x2, 0x1, 0x1, 0x3, 0x3, 0x2, 0x2, 0x4, 0x4, 0x3, 0x3, };
+ vec_u8_t mask1={0x2, 0x2, 0x1, 0x1, 0x3, 0x3, 0x2, 0x2, 0x4, 0x4, 0x3, 0x3, 0x5, 0x5, 0x4, 0x4, };
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, };
+ vec_u8_t vfrac4_32 = (vec_u8_t){13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, };
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
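The commented formula inside `intra_pred<4, 14>` is the standard HEVC angular-prediction two-tap interpolation that all the vector code above implements: each output pixel blends two adjacent reference samples with a per-row weight. A minimal scalar sketch of that one operation (the function name `interp_two_tap` is hypothetical; `frac` stands in for the per-row fraction encoded in the `vfrac*` tables):

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t pixel;

/* dst = ((32 - frac) * a + frac * b + 16) >> 5, i.e. a rounded linear
 * blend of reference samples a = ref[off + x] and b = ref[off + x + 1]. */
static pixel interp_two_tap(pixel a, pixel b, int frac)
{
    return (pixel)(((32 - frac) * a + frac * b + 16) >> 5);
}
```

With `frac == 0` the result is exactly `a`, and with `frac == 32` exactly `b`, which is why the `vfrac`/`vfrac_32` constant pairs in the vector code always sum to 32.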
+
+template<>
+void intra_pred<8, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x1, 0x0, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, };
+vec_u8_t mask1={0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, };
+vec_u8_t mask2={0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, };
+vec_u8_t mask3={0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, };
+vec_u8_t mask4={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, };
+vec_u8_t mask5={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, };
+vec_u8_t mask6={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, };
+vec_u8_t mask7={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+ vec_u8_t vfrac8 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 19, 6, 25, 12, 31, 18, 5, 24, };
+ vec_u8_t vfrac8_32 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 13, 26, 7, 20, 1, 14, 27, 8, };
+
+ one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0);
+ one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1);
+ one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2);
+ one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, };
+vec_u8_t mask1={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, };
+vec_u8_t mask2={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, };
+vec_u8_t mask3={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, };
+vec_u8_t mask4={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t mask5={0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, };
+vec_u8_t mask6={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t mask7={0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, };
+vec_u8_t mask8={0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, };
+vec_u8_t mask9={0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, };
+vec_u8_t mask10={0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, };
+vec_u8_t mask11={0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, };
+vec_u8_t mask12={0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, };
+vec_u8_t mask13={0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, };
+vec_u8_t mask14={0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, };
+vec_u8_t mask15={0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, };
+vec_u8_t maskadd1_15={0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(42, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+    vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = srv1;
+ vec_u8_t srv1_add1 = srv2;
+ vec_u8_t srv2_add1 = srv3;
+ vec_u8_t srv3_add1 = srv4;
+ vec_u8_t srv4_add1 = srv5;
+ vec_u8_t srv5_add1 = srv6;
+ vec_u8_t srv6_add1 = srv7;
+ vec_u8_t srv7_add1 = srv8;
+ vec_u8_t srv8_add1 = srv9;
+ vec_u8_t srv9_add1 = srv10;
+ vec_u8_t srv10_add1 = srv11;
+ vec_u8_t srv11_add1 = srv12;
+    vec_u8_t srv12_add1 = srv13;
+ vec_u8_t srv13_add1 = srv14;
+ vec_u8_t srv14_add1 = srv15;
+ vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15);
+
+vec_u8_t vfrac16 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, };
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
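The `one_line` sequences above rely on `vec_mule`/`vec_mulo` producing halfword products of the even and odd byte lanes, which `vec_mergeh`/`vec_mergel` must re-interleave before `vec_pack` narrows back to bytes. A portable model of that lane shuffling, assuming big-endian AltiVec lane numbering (this sketch is not part of the x265 source, and it omits the rounding add and shift that the real code performs between the multiplies and the pack):

```c
#include <assert.h>
#include <stdint.h>

/* Model of vec_mule/vec_mulo -> vec_mergeh/vec_mergel -> vec_pack:
 * even and odd byte lanes are multiplied into separate halfword vectors,
 * then interleaved back into original byte order and truncated to bytes. */
static void mul_bytes_via_even_odd(const uint8_t a[16], const uint8_t b[16],
                                   uint8_t out[16])
{
    uint16_t e[8], o[8], m[16];
    for (int i = 0; i < 8; i++) {
        e[i] = (uint16_t)(a[2 * i] * b[2 * i]);         /* vec_mule */
        o[i] = (uint16_t)(a[2 * i + 1] * b[2 * i + 1]); /* vec_mulo */
    }
    for (int i = 0; i < 4; i++) {
        m[2 * i]         = e[i];         /* vec_mergeh(e, o) */
        m[2 * i + 1]     = o[i];
        m[8 + 2 * i]     = e[4 + i];     /* vec_mergel(e, o) */
        m[8 + 2 * i + 1] = o[4 + i];
    }
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)m[i];          /* vec_pack truncation */
}
```

The merge step is what guarantees byte `i` of the packed result corresponds to byte `i` of the inputs, matching the per-pixel layout the `vec_xst` stores expect.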
+
+template<>
+void intra_pred<32, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t mask1={0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, };
+vec_u8_t mask2={0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, };
+vec_u8_t mask3={0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, };
+vec_u8_t mask4={0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, };
+vec_u8_t mask5={0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, };
+vec_u8_t mask6={0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, };
+vec_u8_t mask7={0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, };
+vec_u8_t mask8={0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, };
+vec_u8_t mask9={0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, };
+vec_u8_t mask10={0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, };
+vec_u8_t mask11={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, };
+vec_u8_t mask12={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, };
+vec_u8_t mask13={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, };
+vec_u8_t mask14={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t mask15={0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, };
+
+vec_u8_t mask16_0={0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, 0x0, };
+vec_u8_t mask16_1={0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x1, };
+vec_u8_t mask16_2={0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, };
+vec_u8_t mask16_3={0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, };
+vec_u8_t mask16_4={0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, };
+vec_u8_t mask16_5={0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, };
+vec_u8_t mask16_6={0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, };
+vec_u8_t mask16_7={0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, };
+vec_u8_t mask16_8={0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, };
+vec_u8_t mask16_9={0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, };
+vec_u8_t mask16_10={0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, };
+vec_u8_t mask16_11={0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, };
+vec_u8_t mask16_12={0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, };
+vec_u8_t mask16_13={0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, };
+vec_u8_t mask16_14={0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, };
+vec_u8_t mask16_15={0x15, 0x14, 0x14, 0x13, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, };
+
+vec_u8_t maskadd1_31={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t maskadd1_16_31={0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, 0x0, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x10, 0x11, 0x12};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(68, srcPix0);
+ vec_u8_t s2 = vec_xl(84, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s0, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s0, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s0, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s0, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s1, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s1, s1, mask11);
+    vec_u8_t srv12 = vec_perm(s1, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s1, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s1, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s1, s1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11);
+    vec_u8_t srv16_12 = vec_perm(s0, s1, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(s1, s1, mask0);
+ vec_u8_t srv17 = vec_perm(s1, s1, mask1);
+ vec_u8_t srv18 = vec_perm(s1, s1, mask2);
+ vec_u8_t srv19 = vec_perm(s1, s1, mask3);
+ vec_u8_t srv20 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv21 = vec_perm(s1, s2, mask5);
+ vec_u8_t srv22 = vec_perm(s1, s2, mask6);
+ vec_u8_t srv23 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv24 = vec_perm(s1, s2, mask8);
+ vec_u8_t srv25 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv26 = vec_perm(s2, s2, mask10);
+ vec_u8_t srv27 = vec_perm(s2, s2, mask11);
+ vec_u8_t srv28 = vec_perm(s2, s2, mask12);
+ vec_u8_t srv29 = vec_perm(s2, s2, mask13);
+ vec_u8_t srv30 = vec_perm(s2, s2, mask14);
+ vec_u8_t srv31 = vec_perm(s2, s2, mask15);
+
+ vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0);
+ vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15);
+
+ vec_u8_t srv0add1 = srv1;
+ vec_u8_t srv1add1 = srv2;
+ vec_u8_t srv2add1 = srv3;
+ vec_u8_t srv3add1 = srv4;
+ vec_u8_t srv4add1 = srv5;
+ vec_u8_t srv5add1 = srv6;
+ vec_u8_t srv6add1 = srv7;
+ vec_u8_t srv7add1 = srv8;
+ vec_u8_t srv8add1 = srv9;
+ vec_u8_t srv9add1 = srv10;
+ vec_u8_t srv10add1 = srv11;
+ vec_u8_t srv11add1 = srv12;
+    vec_u8_t srv12add1 = srv13;
+ vec_u8_t srv13add1 = srv14;
+ vec_u8_t srv14add1 = srv15;
+ vec_u8_t srv15add1 = srv16;
+
+ vec_u8_t srv16add1_0 = srv16_1;
+ vec_u8_t srv16add1_1 = srv16_2;
+ vec_u8_t srv16add1_2 = srv16_3;
+ vec_u8_t srv16add1_3 = srv16_4;
+ vec_u8_t srv16add1_4 = srv16_5;
+ vec_u8_t srv16add1_5 = srv16_6;
+ vec_u8_t srv16add1_6 = srv16_7;
+ vec_u8_t srv16add1_7 = srv16_8;
+ vec_u8_t srv16add1_8 = srv16_9;
+ vec_u8_t srv16add1_9 = srv16_10;
+ vec_u8_t srv16add1_10 = srv16_11;
+ vec_u8_t srv16add1_11 = srv16_12;
+    vec_u8_t srv16add1_12 = srv16_13;
+ vec_u8_t srv16add1_13 = srv16_14;
+ vec_u8_t srv16add1_14 = srv16_15;
+ vec_u8_t srv16add1_15 = srv16_16;
+
+ vec_u8_t srv16add1 = srv17;
+ vec_u8_t srv17add1 = srv18;
+ vec_u8_t srv18add1 = srv19;
+ vec_u8_t srv19add1 = srv20;
+ vec_u8_t srv20add1 = srv21;
+ vec_u8_t srv21add1 = srv22;
+ vec_u8_t srv22add1 = srv23;
+ vec_u8_t srv23add1 = srv24;
+ vec_u8_t srv24add1 = srv25;
+ vec_u8_t srv25add1 = srv26;
+ vec_u8_t srv26add1 = srv27;
+ vec_u8_t srv27add1 = srv28;
+ vec_u8_t srv28add1 = srv29;
+ vec_u8_t srv29add1 = srv30;
+ vec_u8_t srv30add1 = srv31;
+ vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31);
+
+ vec_u8_t srv16add1_16 = srv16_17;
+ vec_u8_t srv16add1_17 = srv16_18;
+ vec_u8_t srv16add1_18 = srv16_19;
+ vec_u8_t srv16add1_19 = srv16_20;
+ vec_u8_t srv16add1_20 = srv16_21;
+ vec_u8_t srv16add1_21 = srv16_22;
+ vec_u8_t srv16add1_22 = srv16_23;
+ vec_u8_t srv16add1_23 = srv16_24;
+ vec_u8_t srv16add1_24 = srv16_25;
+ vec_u8_t srv16add1_25 = srv16_26;
+ vec_u8_t srv16add1_26 = srv16_27;
+ vec_u8_t srv16add1_27 = srv16_28;
+ vec_u8_t srv16add1_28 = srv16_29;
+ vec_u8_t srv16add1_29 = srv16_30;
+ vec_u8_t srv16add1_30 = srv16_31;
+ vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){3, 22, 9, 28, 15, 2, 21, 8, 27, 14, 1, 20, 7, 26, 13, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 32, };
+
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
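The `one_line` macro used throughout these kernels expands to the `vec_mule`/`vec_mulo`/`vec_pack` sequence that the smaller kernels spell out explicitly. As a reading aid, here is a scalar model of what one expansion computes per byte lane; this is an illustrative sketch with made-up names, not code from x265:

```cpp
#include <cassert>
#include <cstdint>

/* Scalar model (illustrative only) of one one_line(srv, srvadd1, w32, w, vout)
 * expansion: vec_mule/vec_mulo widen the u8*u8 products to u16 in even/odd
 * lanes, the two weighted terms are summed with the rounding constant 16,
 * shifted right by 5, and packed back to u8. */
static void one_line_scalar(const uint8_t* s, const uint8_t* s1,
                            const uint8_t* w32, const uint8_t* w,
                            uint8_t* out)
{
    for (int i = 0; i < 16; i++)
    {
        /* max value: 32*255 + 31*255 + 16, which still fits in 16 bits */
        uint16_t sum = uint16_t(w32[i] * s[i] + w[i] * s1[i] + 16);
        out[i] = uint8_t(sum >> 5);
    }
}
```

Lane by lane, the `vfrac*_32` and `vfrac*` tables always sum to 32, so each output byte is a rounded weighted average of the two source bytes.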
+
+template<>
+void intra_pred<4, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x2, 0x1, 0x1, 0x0, 0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, };
+ vec_u8_t mask1={0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, 0x6, 0x5, 0x5, 0x4, };
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, };
+ vec_u8_t vfrac4_32 = (vec_u8_t){17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
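The commented formula inside these kernels, `dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5)`, is a two-tap blend of adjacent reference samples. A minimal scalar sketch, with `ref` and `frac` as illustrative stand-ins for the per-mode tables:

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel;

/* Minimal scalar form of the commented prediction formula: each output
 * pixel blends two neighbouring reference samples, weighted by a fraction
 * frac in [0,32), with +16 for rounding before the >>5 normalization.
 * Sketch only; the real kernels vary the fraction and offset per row. */
static void predict_row(pixel* dst, const pixel* ref, int width, int frac)
{
    for (int x = 0; x < width; x++)
        dst[x] = pixel(((32 - frac) * ref[x] + frac * ref[x + 1] + 16) >> 5);
}
```

At `frac == 0` the row is a pure copy of `ref`; at `frac == 16` each output is the rounded midpoint of a neighbouring pair.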
+
+template<>
+void intra_pred<8, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, };
+vec_u8_t mask1={0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, };
+vec_u8_t mask2={0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, };
+vec_u8_t mask3={0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, };
+vec_u8_t mask4={0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, };
+vec_u8_t mask5={0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, };
+vec_u8_t mask6={0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, };
+vec_u8_t mask7={0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+ vec_u8_t vfrac8 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 15, 30, 13, 28, 11, 26, 9, 24, };
+ vec_u8_t vfrac8_32 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 17, 2, 19, 4, 21, 6, 23, 8, };
+
+ one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0);
+ one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1);
+ one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2);
+ one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
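When `dstStride` differs from the block width, the kernels above cannot store a full 16-byte vector per row without clobbering destination bytes beyond the row, so the `else` branch loads the existing 16 destination bytes, permutes the predicted bytes into the leading lanes, and stores the merged vector back. A scalar model of that read-modify-write store (names illustrative, not x265 code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

/* Scalar model of the vec_perm/vec_xst read-modify-write pattern: only
 * rowBytes predicted bytes belong to this row; the remaining bytes of the
 * 16-byte store must keep their existing destination contents. */
static void store_row_merged(uint8_t* dst, const uint8_t* pred, int rowBytes)
{
    uint8_t tmp[16];
    std::memcpy(tmp, dst, 16);        /* load the 16 bytes the store covers */
    std::memcpy(tmp, pred, rowBytes); /* replace only the row's bytes       */
    std::memcpy(dst, tmp, 16);        /* store the merged vector back       */
}
```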
+
+template<>
+void intra_pred<16, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, };
+vec_u8_t mask1={0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, };
+vec_u8_t mask2={0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, };
+vec_u8_t mask3={0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, };
+vec_u8_t mask4={0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, };
+vec_u8_t mask5={0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, };
+vec_u8_t mask6={0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, };
+vec_u8_t mask7={0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, };
+vec_u8_t mask8={0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, };
+vec_u8_t mask9={0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, };
+vec_u8_t mask10={0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, };
+vec_u8_t mask11={0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, };
+vec_u8_t mask12={0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, };
+vec_u8_t mask13={0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, };
+vec_u8_t mask14={0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, };
+vec_u8_t mask15={0x17, 0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, };
+vec_u8_t maskadd1_15={0x18, 0x17, 0x17, 0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(40, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = srv1;
+ vec_u8_t srv1_add1 = srv2;
+ vec_u8_t srv2_add1 = srv3;
+ vec_u8_t srv3_add1 = srv4;
+ vec_u8_t srv4_add1 = srv5;
+ vec_u8_t srv5_add1 = srv6;
+ vec_u8_t srv6_add1 = srv7;
+ vec_u8_t srv7_add1 = srv8;
+ vec_u8_t srv8_add1 = srv9;
+ vec_u8_t srv9_add1 = srv10;
+ vec_u8_t srv10_add1 = srv11;
+ vec_u8_t srv11_add1 = srv12;
+ vec_u8_t srv12_add1 = srv13;
+ vec_u8_t srv13_add1 = srv14;
+ vec_u8_t srv14_add1 = srv15;
+ vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15);
+
+vec_u8_t vfrac16 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, };
+vec_u8_t vfrac16_32 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, };
+vec_u8_t mask1={0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, };
+vec_u8_t mask2={0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, };
+vec_u8_t mask3={0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, };
+vec_u8_t mask4={0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, };
+vec_u8_t mask5={0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, };
+vec_u8_t mask6={0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, };
+vec_u8_t mask7={0x17, 0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, };
+vec_u8_t mask8={0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, };
+vec_u8_t mask9={0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, };
+vec_u8_t mask10={0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, };
+vec_u8_t mask11={0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, };
+vec_u8_t mask12={0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, };
+vec_u8_t mask13={0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, };
+vec_u8_t mask14={0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, };
+vec_u8_t mask15={0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, };
+
+vec_u8_t mask16_0={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, };
+vec_u8_t mask16_1={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, };
+vec_u8_t mask16_2={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, };
+vec_u8_t mask16_3={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, };
+vec_u8_t mask16_4={0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t mask16_5={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, };
+vec_u8_t mask16_6={0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t mask16_7={0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, };
+vec_u8_t mask16_8={0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, };
+vec_u8_t mask16_9={0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, };
+vec_u8_t mask16_10={0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, };
+vec_u8_t mask16_11={0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, };
+vec_u8_t mask16_12={0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, };
+vec_u8_t mask16_13={0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, };
+vec_u8_t mask16_14={0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, };
+vec_u8_t mask16_15={0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, };
+
+vec_u8_t maskadd1_31={0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, };
+vec_u8_t maskadd1_16_31={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2};
+ vec_u8_t refmask_32_1={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(80, srcPix0);
+ vec_u8_t s3 = vec_xl(96, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s1, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s1, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s1, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s1, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s1, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s1, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s1, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s1, s1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(s0, s1, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(s0, s1, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv17 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv18 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv19 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv20 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv21 = vec_perm(s1, s2, mask5);
+ vec_u8_t srv22 = vec_perm(s1, s2, mask6);
+ vec_u8_t srv23 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv24 = vec_perm(s2, s2, mask8);
+ vec_u8_t srv25 = vec_perm(s2, s2, mask9);
+ vec_u8_t srv26 = vec_perm(s2, s2, mask10);
+ vec_u8_t srv27 = vec_perm(s2, s2, mask11);
+ vec_u8_t srv28 = vec_perm(s2, s2, mask12);
+ vec_u8_t srv29 = vec_perm(s2, s2, mask13);
+ vec_u8_t srv30 = vec_perm(s2, s2, mask14);
+ vec_u8_t srv31 = vec_perm(s2, s2, mask15);
+
+ vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0);
+ vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(s1, s2, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15);
+
+ vec_u8_t srv0add1 = srv1;
+ vec_u8_t srv1add1 = srv2;
+ vec_u8_t srv2add1 = srv3;
+ vec_u8_t srv3add1 = srv4;
+ vec_u8_t srv4add1 = srv5;
+ vec_u8_t srv5add1 = srv6;
+ vec_u8_t srv6add1 = srv7;
+ vec_u8_t srv7add1 = srv8;
+ vec_u8_t srv8add1 = srv9;
+ vec_u8_t srv9add1 = srv10;
+ vec_u8_t srv10add1 = srv11;
+ vec_u8_t srv11add1 = srv12;
+ vec_u8_t srv12add1 = srv13;
+ vec_u8_t srv13add1 = srv14;
+ vec_u8_t srv14add1 = srv15;
+ vec_u8_t srv15add1 = srv16;
+
+ vec_u8_t srv16add1_0 = srv16_1;
+ vec_u8_t srv16add1_1 = srv16_2;
+ vec_u8_t srv16add1_2 = srv16_3;
+ vec_u8_t srv16add1_3 = srv16_4;
+ vec_u8_t srv16add1_4 = srv16_5;
+ vec_u8_t srv16add1_5 = srv16_6;
+ vec_u8_t srv16add1_6 = srv16_7;
+ vec_u8_t srv16add1_7 = srv16_8;
+ vec_u8_t srv16add1_8 = srv16_9;
+ vec_u8_t srv16add1_9 = srv16_10;
+ vec_u8_t srv16add1_10 = srv16_11;
+ vec_u8_t srv16add1_11 = srv16_12;
+ vec_u8_t srv16add1_12 = srv16_13;
+ vec_u8_t srv16add1_13 = srv16_14;
+ vec_u8_t srv16add1_14 = srv16_15;
+ vec_u8_t srv16add1_15 = srv16_16;
+
+ vec_u8_t srv16add1 = srv17;
+ vec_u8_t srv17add1 = srv18;
+ vec_u8_t srv18add1 = srv19;
+ vec_u8_t srv19add1 = srv20;
+ vec_u8_t srv20add1 = srv21;
+ vec_u8_t srv21add1 = srv22;
+ vec_u8_t srv22add1 = srv23;
+ vec_u8_t srv23add1 = srv24;
+ vec_u8_t srv24add1 = srv25;
+ vec_u8_t srv25add1 = srv26;
+ vec_u8_t srv26add1 = srv27;
+ vec_u8_t srv27add1 = srv28;
+ vec_u8_t srv28add1 = srv29;
+ vec_u8_t srv29add1 = srv30;
+ vec_u8_t srv30add1 = srv31;
+ vec_u8_t srv31add1 = vec_perm(s2, s3, maskadd1_31);
+
+ vec_u8_t srv16add1_16 = srv16_17;
+ vec_u8_t srv16add1_17 = srv16_18;
+ vec_u8_t srv16add1_18 = srv16_19;
+ vec_u8_t srv16add1_19 = srv16_20;
+ vec_u8_t srv16add1_20 = srv16_21;
+ vec_u8_t srv16add1_21 = srv16_22;
+ vec_u8_t srv16add1_22 = srv16_23;
+ vec_u8_t srv16add1_23 = srv16_24;
+ vec_u8_t srv16add1_24 = srv16_25;
+ vec_u8_t srv16add1_25 = srv16_26;
+ vec_u8_t srv16add1_26 = srv16_27;
+ vec_u8_t srv16add1_27 = srv16_28;
+ vec_u8_t srv16add1_28 = srv16_29;
+ vec_u8_t srv16add1_29 = srv16_30;
+ vec_u8_t srv16add1_30 = srv16_31;
+ vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){31, 14, 29, 12, 27, 10, 25, 8, 23, 6, 21, 4, 19, 2, 17, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 32, };
+
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x2, 0x1, 0x1, 0x0, 0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, };
+ vec_u8_t mask1={0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, 0x6, 0x5, 0x5, 0x4, };
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12, };
+ vec_u8_t vfrac4_32 = (vec_u8_t){21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5); */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, 0x0, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, };
+vec_u8_t mask1={0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, };
+vec_u8_t mask2={0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, };
+vec_u8_t mask3={0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, };
+vec_u8_t mask4={0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, };
+vec_u8_t mask5={0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, };
+vec_u8_t mask6={0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, };
+vec_u8_t mask7={0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, };
+//vec_u8_t mask8={0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+ vec_u8_t vfrac8 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 11, 22, 1, 12, 23, 2, 13, 24, };
+ vec_u8_t vfrac8_32 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 21, 10, 31, 20, 9, 30, 19, 8, };
+
+ one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0);
+ one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1);
+ one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2);
+ one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, };
+vec_u8_t mask1={0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, };
+vec_u8_t mask2={0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, };
+vec_u8_t mask3={0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, };
+vec_u8_t mask4={0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, };
+vec_u8_t mask5={0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, };
+vec_u8_t mask6={0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, };
+vec_u8_t mask7={0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, };
+vec_u8_t mask8={0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, };
+vec_u8_t mask9={0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, };
+vec_u8_t mask10={0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, };
+vec_u8_t mask11={0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, };
+vec_u8_t mask12={0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, };
+vec_u8_t mask13={0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, };
+vec_u8_t mask14={0x18, 0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, };
+vec_u8_t mask15={0x19, 0x18, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, };
+vec_u8_t maskadd1_15={0x1a, 0x19, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(38, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = srv1;
+ vec_u8_t srv1_add1 = srv2;
+ vec_u8_t srv2_add1 = srv3;
+ vec_u8_t srv3_add1 = srv4;
+ vec_u8_t srv4_add1 = srv5;
+ vec_u8_t srv5_add1 = srv6;
+ vec_u8_t srv6_add1 = srv7;
+ vec_u8_t srv7_add1 = srv8;
+ vec_u8_t srv8_add1 = srv9;
+ vec_u8_t srv9_add1 = srv10;
+ vec_u8_t srv10_add1 = srv11;
+ vec_u8_t srv11_add1 = srv12;
+ vec_u8_t srv12_add1 = srv13;
+ vec_u8_t srv13_add1 = srv14;
+ vec_u8_t srv14_add1 = srv15;
+ vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15);
+
+ vec_u8_t vfrac16 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, };
+ vec_u8_t vfrac16_32 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, };
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5); */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, };
+vec_u8_t mask1={0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, };
+vec_u8_t mask2={0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, };
+vec_u8_t mask3={0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, };
+vec_u8_t mask4={0x18, 0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, };
+vec_u8_t mask5={0x19, 0x18, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, };
+vec_u8_t mask6={0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, };
+vec_u8_t mask7={0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, };
+vec_u8_t mask8={0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, };
+vec_u8_t mask9={0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, };
+vec_u8_t mask10={0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, };
+vec_u8_t mask11={0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, };
+vec_u8_t mask12={0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, };
+vec_u8_t mask13={0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, };
+vec_u8_t mask14={0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, };
+vec_u8_t mask15={0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, };
+
+vec_u8_t mask16_0={0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, 0x0, };
+vec_u8_t mask16_1={0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, 0x1, };
+vec_u8_t mask16_2={0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x2, };
+vec_u8_t mask16_3={0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x3, };
+vec_u8_t mask16_4={0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x4, };
+vec_u8_t mask16_5={0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x5, };
+vec_u8_t mask16_6={0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x6, };
+vec_u8_t mask16_7={0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x7, };
+vec_u8_t mask16_8={0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x8, };
+vec_u8_t mask16_9={0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x9, };
+vec_u8_t mask16_10={0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0xa, };
+vec_u8_t mask16_11={0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xb, };
+vec_u8_t mask16_12={0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xc, };
+vec_u8_t mask16_13={0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xd, };
+vec_u8_t mask16_14={0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xe, };
+vec_u8_t mask16_15={0x18, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xf, };
+
+vec_u8_t maskadd1_31={0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, };
+vec_u8_t maskadd1_16_31={0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, 0x0, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8};
+ vec_u8_t refmask_32_1={0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(76, srcPix0);
+ vec_u8_t s3 = vec_xl(92, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s1, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s1, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s1, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s1, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s1, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s1, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv15 = vec_perm(s1, s2, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s0, s1, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(s0, s1, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(s0, s1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(s0, s1, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(s0, s1, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(s0, s1, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(s0, s1, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(s0, s1, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(s0, s1, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(s0, s1, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11);
+ vec_u8_t srv16_12 = vec_perm(s0, s1, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv17 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv18 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv19 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv20 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv21 = vec_perm(s1, s2, mask5);
+ vec_u8_t srv22 = vec_perm(s2, s2, mask6);
+ vec_u8_t srv23 = vec_perm(s2, s2, mask7);
+ vec_u8_t srv24 = vec_perm(s2, s2, mask8);
+ vec_u8_t srv25 = vec_perm(s2, s2, mask9);
+ vec_u8_t srv26 = vec_perm(s2, s2, mask10);
+ vec_u8_t srv27 = vec_perm(s2, s2, mask11);
+ vec_u8_t srv28 = vec_perm(s2, s3, mask12);
+ vec_u8_t srv29 = vec_perm(s2, s3, mask13);
+ vec_u8_t srv30 = vec_perm(s2, s3, mask14);
+ vec_u8_t srv31 = vec_perm(s2, s3, mask15);
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16_0);
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(s1, s2, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(s1, s2, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(s1, s2, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(s1, s2, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15);
+
+ vec_u8_t srv0add1 = srv1;
+ vec_u8_t srv1add1 = srv2;
+ vec_u8_t srv2add1 = srv3;
+ vec_u8_t srv3add1 = srv4;
+ vec_u8_t srv4add1 = srv5;
+ vec_u8_t srv5add1 = srv6;
+ vec_u8_t srv6add1 = srv7;
+ vec_u8_t srv7add1 = srv8;
+ vec_u8_t srv8add1 = srv9;
+ vec_u8_t srv9add1 = srv10;
+ vec_u8_t srv10add1 = srv11;
+ vec_u8_t srv11add1 = srv12;
+ vec_u8_t srv12add1 = srv13;
+ vec_u8_t srv13add1 = srv14;
+ vec_u8_t srv14add1 = srv15;
+ vec_u8_t srv15add1 = srv16;
+
+ vec_u8_t srv16add1_0 = srv16_1;
+ vec_u8_t srv16add1_1 = srv16_2;
+ vec_u8_t srv16add1_2 = srv16_3;
+ vec_u8_t srv16add1_3 = srv16_4;
+ vec_u8_t srv16add1_4 = srv16_5;
+ vec_u8_t srv16add1_5 = srv16_6;
+ vec_u8_t srv16add1_6 = srv16_7;
+ vec_u8_t srv16add1_7 = srv16_8;
+ vec_u8_t srv16add1_8 = srv16_9;
+ vec_u8_t srv16add1_9 = srv16_10;
+ vec_u8_t srv16add1_10 = srv16_11;
+ vec_u8_t srv16add1_11 = srv16_12;
+ vec_u8_t srv16add1_12 = srv16_13;
+ vec_u8_t srv16add1_13 = srv16_14;
+ vec_u8_t srv16add1_14 = srv16_15;
+ vec_u8_t srv16add1_15 = srv16_16;
+
+ vec_u8_t srv16add1 = srv17;
+ vec_u8_t srv17add1 = srv18;
+ vec_u8_t srv18add1 = srv19;
+ vec_u8_t srv19add1 = srv20;
+ vec_u8_t srv20add1 = srv21;
+ vec_u8_t srv21add1 = srv22;
+ vec_u8_t srv22add1 = srv23;
+ vec_u8_t srv23add1 = srv24;
+ vec_u8_t srv24add1 = srv25;
+ vec_u8_t srv25add1 = srv26;
+ vec_u8_t srv26add1 = srv27;
+ vec_u8_t srv27add1 = srv28;
+ vec_u8_t srv28add1 = srv29;
+ vec_u8_t srv29add1 = srv30;
+ vec_u8_t srv30add1 = srv31;
+ vec_u8_t srv31add1 = vec_perm(s2, s3, maskadd1_31);
+
+ vec_u8_t srv16add1_16 = srv16_17;
+ vec_u8_t srv16add1_17 = srv16_18;
+ vec_u8_t srv16add1_18 = srv16_19;
+ vec_u8_t srv16add1_19 = srv16_20;
+ vec_u8_t srv16add1_20 = srv16_21;
+ vec_u8_t srv16add1_21 = srv16_22;
+ vec_u8_t srv16add1_22 = srv16_23;
+ vec_u8_t srv16add1_23 = srv16_24;
+ vec_u8_t srv16add1_24 = srv16_25;
+ vec_u8_t srv16add1_25 = srv16_26;
+ vec_u8_t srv16add1_26 = srv16_27;
+ vec_u8_t srv16add1_27 = srv16_28;
+ vec_u8_t srv16add1_28 = srv16_29;
+ vec_u8_t srv16add1_29 = srv16_30;
+ vec_u8_t srv16add1_30 = srv16_31;
+ vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31);
+
+vec_u8_t vfrac32_0 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, };
+vec_u8_t vfrac32_1 = (vec_u8_t){27, 6, 17, 28, 7, 18, 29, 8, 19, 30, 9, 20, 31, 10, 21, 0, };
+vec_u8_t vfrac32_32_0 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, };
+vec_u8_t vfrac32_32_1 = (vec_u8_t){5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 32, };
+
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ //vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+ //vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+ vec_u8_t mask0={0x3, 0x2, 0x1, 0x0, 0x4, 0x3, 0x2, 0x1, 0x5, 0x4, 0x3, 0x2, 0x6, 0x5, 0x4, 0x3};
+ vec_u8_t mask1={0x4, 0x3, 0x2, 0x1, 0x5, 0x4, 0x3, 0x2, 0x6, 0x5, 0x4, 0x3, 0x7, 0x6, 0x5, 0x4};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ //vec_u8_t vfrac4 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24};
+ //vec_u8_t vfrac4_32 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8};
+ vec_u8_t vfrac4 = (vec_u8_t){6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24, };
+ vec_u8_t vfrac4_32 = (vec_u8_t){26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8, };
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5); */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+         printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x6, 0x5, 0x4, 0x3, 0x2, 0x2, 0x1, 0x0, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, };
+ vec_u8_t mask1={0x7, 0x6, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, };
+ vec_u8_t mask2={0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, };
+ vec_u8_t mask3={0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, };
+ vec_u8_t mask4={0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, };
+ vec_u8_t mask5={0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, };
+ vec_u8_t mask6={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, };
+ vec_u8_t mask7={0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, };
+ //vec_u8_t mask8={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00};
+
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+    vec_u8_t vfrac8 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 6, 12, 18, 24, 30, 4, 10, 16, };
+    vec_u8_t vfrac8_32 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 26, 20, 14, 8, 2, 28, 22, 16, };
+
+    one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0);
+    one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1);
+    one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2);
+    one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+         printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+    vec_u8_t mask0={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, };
+    vec_u8_t mask1={0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x2, 0x1, 0x1, };
+    vec_u8_t mask2={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x3, 0x2, 0x2, };
+    vec_u8_t mask3={0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, };
+    vec_u8_t mask4={0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, };
+    vec_u8_t mask5={0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, };
+    vec_u8_t mask6={0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, };
+    vec_u8_t mask7={0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, };
+    vec_u8_t mask8={0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, };
+    vec_u8_t mask9={0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, };
+    vec_u8_t mask10={0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, };
+    vec_u8_t mask11={0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, };
+    vec_u8_t mask12={0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, };
+    vec_u8_t mask13={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, };
+    vec_u8_t mask14={0x1a, 0x19, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, };
+    vec_u8_t mask15={0x1b, 0x1a, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, };
+    vec_u8_t maskadd1_15={0x1c, 0x1b, 0x1a, 0x19, 0x18, 0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(36, srcPix0);
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+    vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = srv1;
+ vec_u8_t srv1_add1 = srv2;
+ vec_u8_t srv2_add1 = srv3;
+ vec_u8_t srv3_add1 = srv4;
+ vec_u8_t srv4_add1 = srv5;
+ vec_u8_t srv5_add1 = srv6;
+ vec_u8_t srv6_add1 = srv7;
+ vec_u8_t srv7_add1 = srv8;
+ vec_u8_t srv8_add1 = srv9;
+ vec_u8_t srv9_add1 = srv10;
+ vec_u8_t srv10_add1 = srv11;
+ vec_u8_t srv11_add1 = srv12;
+    vec_u8_t srv12_add1 = srv13;
+ vec_u8_t srv13_add1 = srv14;
+ vec_u8_t srv14_add1 = srv15;
+ vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15);
+
+    vec_u8_t vfrac16 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, };
+    vec_u8_t vfrac16_32 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 32, };
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+         printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+    vec_u8_t mask0={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, };
+    vec_u8_t mask1={0x1a, 0x19, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, };
+    vec_u8_t mask2={0x1b, 0x1a, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, };
+    vec_u8_t mask3={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, };
+    vec_u8_t mask4={0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x2, 0x1, 0x1, };
+    vec_u8_t mask5={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x3, 0x2, 0x2, };
+    vec_u8_t mask6={0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, };
+    vec_u8_t mask7={0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, };
+    vec_u8_t mask8={0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, };
+    vec_u8_t mask9={0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, };
+    vec_u8_t mask10={0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, };
+    vec_u8_t mask11={0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, };
+    vec_u8_t mask12={0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, };
+    vec_u8_t mask13={0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, };
+    vec_u8_t mask14={0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, };
+    vec_u8_t mask15={0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, };
+
+    vec_u8_t mask16_0={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, };
+    vec_u8_t mask16_1={0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x2, 0x1, 0x1, };
+    vec_u8_t mask16_2={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x3, 0x2, 0x2, };
+    vec_u8_t mask16_3={0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, };
+    vec_u8_t mask16_4={0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, };
+    vec_u8_t mask16_5={0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, };
+    vec_u8_t mask16_6={0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, };
+    vec_u8_t mask16_7={0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, };
+    vec_u8_t mask16_8={0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, };
+    vec_u8_t mask16_9={0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, };
+    vec_u8_t mask16_10={0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, };
+    vec_u8_t mask16_11={0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, };
+    vec_u8_t mask16_12={0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, };
+    vec_u8_t mask16_13={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, };
+    vec_u8_t mask16_14={0x1a, 0x19, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, };
+    vec_u8_t mask16_15={0x1b, 0x1a, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, };
+
+    vec_u8_t maskadd1_31={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, };
+    vec_u8_t maskadd1_16_31={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1f, 0x1e, 0x1c, 0x1b, 0x1a, 0x19, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11, 0x10, 0xf, 0xe, 0xc };
+ vec_u8_t refmask_32_1={0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(71, srcPix0);
+ vec_u8_t s3 = vec_xl(87, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s1, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s1, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s1, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s1, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv8 = vec_perm(s1, s2, mask8);
+ vec_u8_t srv9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv10 = vec_perm(s1, s2, mask10);
+ vec_u8_t srv11 = vec_perm(s1, s2, mask11);
+    vec_u8_t srv12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv15 = vec_perm(s1, s2, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s0, s1, mask16_0);
+ vec_u8_t srv16_1 = vec_perm(s0, s1, mask16_1);
+ vec_u8_t srv16_2 = vec_perm(s0, s1, mask16_2);
+ vec_u8_t srv16_3 = vec_perm(s0, s1, mask16_3);
+ vec_u8_t srv16_4 = vec_perm(s0, s1, mask16_4);
+ vec_u8_t srv16_5 = vec_perm(s0, s1, mask16_5);
+ vec_u8_t srv16_6 = vec_perm(s0, s1, mask16_6);
+ vec_u8_t srv16_7 = vec_perm(s0, s1, mask16_7);
+ vec_u8_t srv16_8 = vec_perm(s0, s1, mask16_8);
+ vec_u8_t srv16_9 = vec_perm(s0, s1, mask16_9);
+ vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10);
+ vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11);
+    vec_u8_t srv16_12 = vec_perm(s0, s1, mask16_12);
+ vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13);
+ vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14);
+ vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15);
+
+ vec_u8_t srv16 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv17 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv18 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv19 = vec_perm(s2, s2, mask3);
+ vec_u8_t srv20 = vec_perm(s2, s2, mask4);
+ vec_u8_t srv21 = vec_perm(s2, s2, mask5);
+ vec_u8_t srv22 = vec_perm(s2, s2, mask6);
+ vec_u8_t srv23 = vec_perm(s2, s3, mask7);
+ vec_u8_t srv24 = vec_perm(s2, s3, mask8);
+ vec_u8_t srv25 = vec_perm(s2, s3, mask9);
+ vec_u8_t srv26 = vec_perm(s2, s3, mask10);
+ vec_u8_t srv27 = vec_perm(s2, s3, mask11);
+ vec_u8_t srv28 = vec_perm(s2, s3, mask12);
+ vec_u8_t srv29 = vec_perm(s2, s3, mask13);
+ vec_u8_t srv30 = vec_perm(s2, s3, mask14);
+ vec_u8_t srv31 = vec_perm(s2, s3, mask15);
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16_0);
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask16_1);
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask16_2);
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask16_3);
+ vec_u8_t srv16_20 = vec_perm(s1, s2, mask16_4);
+ vec_u8_t srv16_21 = vec_perm(s1, s2, mask16_5);
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask16_6);
+ vec_u8_t srv16_23 = vec_perm(s1, s2, mask16_7);
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask16_8);
+ vec_u8_t srv16_25 = vec_perm(s1, s2, mask16_9);
+ vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10);
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12);
+ vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13);
+ vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14);
+ vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15);
+
+ vec_u8_t srv0add1 = srv1;
+ vec_u8_t srv1add1 = srv2;
+ vec_u8_t srv2add1 = srv3;
+ vec_u8_t srv3add1 = srv4;
+ vec_u8_t srv4add1 = srv5;
+ vec_u8_t srv5add1 = srv6;
+ vec_u8_t srv6add1 = srv7;
+ vec_u8_t srv7add1 = srv8;
+ vec_u8_t srv8add1 = srv9;
+ vec_u8_t srv9add1 = srv10;
+ vec_u8_t srv10add1 = srv11;
+ vec_u8_t srv11add1 = srv12;
+    vec_u8_t srv12add1 = srv13;
+ vec_u8_t srv13add1 = srv14;
+ vec_u8_t srv14add1 = srv15;
+ vec_u8_t srv15add1 = srv16;
+
+ vec_u8_t srv16add1_0 = srv16_1;
+ vec_u8_t srv16add1_1 = srv16_2;
+ vec_u8_t srv16add1_2 = srv16_3;
+ vec_u8_t srv16add1_3 = srv16_4;
+ vec_u8_t srv16add1_4 = srv16_5;
+ vec_u8_t srv16add1_5 = srv16_6;
+ vec_u8_t srv16add1_6 = srv16_7;
+ vec_u8_t srv16add1_7 = srv16_8;
+ vec_u8_t srv16add1_8 = srv16_9;
+ vec_u8_t srv16add1_9 = srv16_10;
+ vec_u8_t srv16add1_10 = srv16_11;
+ vec_u8_t srv16add1_11 = srv16_12;
+    vec_u8_t srv16add1_12 = srv16_13;
+ vec_u8_t srv16add1_13 = srv16_14;
+ vec_u8_t srv16add1_14 = srv16_15;
+ vec_u8_t srv16add1_15 = srv16_16;
+
+ vec_u8_t srv16add1 = srv17;
+ vec_u8_t srv17add1 = srv18;
+ vec_u8_t srv18add1 = srv19;
+ vec_u8_t srv19add1 = srv20;
+ vec_u8_t srv20add1 = srv21;
+ vec_u8_t srv21add1 = srv22;
+ vec_u8_t srv22add1 = srv23;
+ vec_u8_t srv23add1 = srv24;
+ vec_u8_t srv24add1 = srv25;
+ vec_u8_t srv25add1 = srv26;
+ vec_u8_t srv26add1 = srv27;
+ vec_u8_t srv27add1 = srv28;
+ vec_u8_t srv28add1 = srv29;
+ vec_u8_t srv29add1 = srv30;
+ vec_u8_t srv30add1 = srv31;
+ vec_u8_t srv31add1 = vec_perm(s2, s3, maskadd1_31);
+
+ vec_u8_t srv16add1_16 = srv16_17;
+ vec_u8_t srv16add1_17 = srv16_18;
+ vec_u8_t srv16add1_18 = srv16_19;
+ vec_u8_t srv16add1_19 = srv16_20;
+ vec_u8_t srv16add1_20 = srv16_21;
+ vec_u8_t srv16add1_21 = srv16_22;
+ vec_u8_t srv16add1_22 = srv16_23;
+ vec_u8_t srv16add1_23 = srv16_24;
+ vec_u8_t srv16add1_24 = srv16_25;
+ vec_u8_t srv16add1_25 = srv16_26;
+ vec_u8_t srv16add1_26 = srv16_27;
+ vec_u8_t srv16add1_27 = srv16_28;
+ vec_u8_t srv16add1_28 = srv16_29;
+ vec_u8_t srv16add1_29 = srv16_30;
+ vec_u8_t srv16add1_30 = srv16_31;
+ vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31);
+
+    vec_u8_t vfrac32_0 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, };
+    vec_u8_t vfrac32_1 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, };
+    vec_u8_t vfrac32_32_0 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 32, };
+    vec_u8_t vfrac32_32_1 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 32, };
+
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1);
+
+ one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5);
+
+ one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7);
+
+ one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9);
+
+ one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11);
+
+ one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13);
+
+ one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15);
+
+ one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17);
+
+ one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19);
+
+ one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21);
+
+ one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23);
+
+ one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25);
+
+ one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27);
+
+ one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29);
+
+ one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+         printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ //vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+ //vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+ vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_4={0x3, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ //vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ //vec_u8_t vfrac4 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+ //vec_u8_t vfrac4_32 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ if(dstStride==4){
+ vec_xst(srv0, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)srv0, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(srv0, srv0, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(srv0, srv0, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(srv0, srv0, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srv0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srv0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srv0, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srv0, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+         printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, };
+//vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, };
+vec_u8_t mask2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+//vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, };
+vec_u8_t mask4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+//vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+//vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+    vec_u8_t srv_left=vec_xl(16, srcPix0);   /* left neighbour samples */
+    vec_u8_t srv_right=vec_xl(0, srcPix0);   /* top neighbour samples */
+ vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x3, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ //vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ //vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ //vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ //vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+ if(dstStride==8){
+ vec_xst(srv0, 0, dst);
+ vec_xst(srv2, 16, dst);
+ vec_xst(srv4, 32, dst);
+ vec_xst(srv6, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srv0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srv0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(srv2, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srv2, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(srv4, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(srv4, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(srv6, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(srv6, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask1={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask2={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask3={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask4={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask6={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask7={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask8={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+    vec_u8_t srv_left=vec_xl(32, srcPix0);   /* left neighbour samples */
+    vec_u8_t srv_right=vec_xl(0, srcPix0);   /* top neighbour samples */
+ vec_u8_t refmask_16={0xf, 0xe, 0xd, 0xc, 0xb, 0xa, 0x9, 0x8, 0x7, 0x6, 0x5, 0x4, 0x3, 0x2, 0x1, 0x10};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(1, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = s0;
+
+
+ vec_xst(srv0, 0, dst);
+ vec_xst(srv1, dstStride, dst);
+ vec_xst(srv2, dstStride*2, dst);
+ vec_xst(srv3, dstStride*3, dst);
+ vec_xst(srv4, dstStride*4, dst);
+ vec_xst(srv5, dstStride*5, dst);
+ vec_xst(srv6, dstStride*6, dst);
+ vec_xst(srv7, dstStride*7, dst);
+ vec_xst(srv8, dstStride*8, dst);
+ vec_xst(srv9, dstStride*9, dst);
+ vec_xst(srv10, dstStride*10, dst);
+ vec_xst(srv11, dstStride*11, dst);
+ vec_xst(srv12, dstStride*12, dst);
+ vec_xst(srv13, dstStride*13, dst);
+ vec_xst(srv14, dstStride*14, dst);
+ vec_xst(srv15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask1={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask2={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask3={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask4={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask6={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask7={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask8={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+
+ vec_u8_t refmask_32_0 = {0x1f, 0x1e, 0x1d, 0x1c, 0x1b, 0x1a, 0x19, 0x18, 0x17, 0x16, 0x15, 0x14, 0x13, 0x12, 0x11, 0x10};
+ vec_u8_t refmask_32_1 = {0xf, 0xe, 0xd, 0xc, 0xb, 0xa, 0x9, 0x8, 0x7, 0x6, 0x5, 0x4, 0x3, 0x2, 0x1, 0x10};
+
+ vec_u8_t srv_left0=vec_xl(64, srcPix0);
+ vec_u8_t srv_left1=vec_xl(80, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(1, srcPix0);
+ vec_u8_t s3 = vec_xl(17, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv1 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv2 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv4 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv5 = vec_perm(s1, s2, mask5);
+ vec_u8_t srv6 = vec_perm(s1, s2, mask6);
+ vec_u8_t srv7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv8 = vec_perm(s1, s2, mask8);
+ vec_u8_t srv9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv10 = vec_perm(s1, s2, mask10);
+ vec_u8_t srv11 = vec_perm(s1, s2, mask11);
+    vec_u8_t srv12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv15 = s1;
+
+ vec_u8_t srv16_0 = vec_perm(s2, s3, mask0);
+ vec_u8_t srv16_1 = vec_perm(s2, s3, mask1);
+ vec_u8_t srv16_2 = vec_perm(s2, s3, mask2);
+ vec_u8_t srv16_3 = vec_perm(s2, s3, mask3);
+ vec_u8_t srv16_4 = vec_perm(s2, s3, mask4);
+ vec_u8_t srv16_5 = vec_perm(s2, s3, mask5);
+ vec_u8_t srv16_6 = vec_perm(s2, s3, mask6);
+ vec_u8_t srv16_7 = vec_perm(s2, s3, mask7);
+ vec_u8_t srv16_8 = vec_perm(s2, s3, mask8);
+ vec_u8_t srv16_9 = vec_perm(s2, s3, mask9);
+ vec_u8_t srv16_10 = vec_perm(s2, s3, mask10);
+ vec_u8_t srv16_11 = vec_perm(s2, s3, mask11);
+    vec_u8_t srv16_12 = vec_perm(s2, s3, mask12);
+ vec_u8_t srv16_13 = vec_perm(s2, s3, mask13);
+ vec_u8_t srv16_14 = vec_perm(s2, s3, mask14);
+ vec_u8_t srv16_15 = s2;
+
+ vec_u8_t srv16 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv17 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv18 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv19 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv20 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv21 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv22 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv23 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv24 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv25 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv26 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv27 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv28 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv29 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv30 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv31 = s0;
+
+ vec_xst(srv0, 0, dst);
+ vec_xst(srv16_0, 16, dst);
+ vec_xst(srv1, dstStride, dst);
+ vec_xst(srv16_1, dstStride+16, dst);
+ vec_xst(srv2, dstStride*2, dst);
+ vec_xst(srv16_2, dstStride*2+16, dst);
+ vec_xst(srv3, dstStride*3, dst);
+ vec_xst(srv16_3, dstStride*3+16, dst);
+ vec_xst(srv4, dstStride*4, dst);
+ vec_xst(srv16_4, dstStride*4+16, dst);
+ vec_xst(srv5, dstStride*5, dst);
+ vec_xst(srv16_5, dstStride*5+16, dst);
+ vec_xst(srv6, dstStride*6, dst);
+ vec_xst(srv16_6, dstStride*6+16, dst);
+ vec_xst(srv7, dstStride*7, dst);
+ vec_xst(srv16_7, dstStride*7+16, dst);
+ vec_xst(srv8, dstStride*8, dst);
+ vec_xst(srv16_8, dstStride*8+16, dst);
+ vec_xst(srv9, dstStride*9, dst);
+ vec_xst(srv16_9, dstStride*9+16, dst);
+ vec_xst(srv10, dstStride*10, dst);
+ vec_xst(srv16_10, dstStride*10+16, dst);
+ vec_xst(srv11, dstStride*11, dst);
+ vec_xst(srv16_11, dstStride*11+16, dst);
+ vec_xst(srv12, dstStride*12, dst);
+ vec_xst(srv16_12, dstStride*12+16, dst);
+ vec_xst(srv13, dstStride*13, dst);
+ vec_xst(srv16_13, dstStride*13+16, dst);
+ vec_xst(srv14, dstStride*14, dst);
+ vec_xst(srv16_14, dstStride*14+16, dst);
+ vec_xst(srv15, dstStride*15, dst);
+ vec_xst(srv16_15, dstStride*15+16, dst);
+
+ vec_xst(srv16, dstStride*16, dst);
+ vec_xst(srv0, dstStride*16+16, dst);
+ vec_xst(srv17, dstStride*17, dst);
+ vec_xst(srv1, dstStride*17+16, dst);
+ vec_xst(srv18, dstStride*18, dst);
+ vec_xst(srv2, dstStride*18+16, dst);
+ vec_xst(srv19, dstStride*19, dst);
+ vec_xst(srv3, dstStride*19+16, dst);
+ vec_xst(srv20, dstStride*20, dst);
+ vec_xst(srv4, dstStride*20+16, dst);
+ vec_xst(srv21, dstStride*21, dst);
+ vec_xst(srv5, dstStride*21+16, dst);
+ vec_xst(srv22, dstStride*22, dst);
+ vec_xst(srv6, dstStride*22+16, dst);
+ vec_xst(srv23, dstStride*23, dst);
+ vec_xst(srv7, dstStride*23+16, dst);
+ vec_xst(srv24, dstStride*24, dst);
+ vec_xst(srv8, dstStride*24+16, dst);
+ vec_xst(srv25, dstStride*25, dst);
+ vec_xst(srv9, dstStride*25+16, dst);
+ vec_xst(srv26, dstStride*26, dst);
+ vec_xst(srv10, dstStride*26+16, dst);
+ vec_xst(srv27, dstStride*27, dst);
+ vec_xst(srv11, dstStride*27+16, dst);
+ vec_xst(srv28, dstStride*28, dst);
+ vec_xst(srv12, dstStride*28+16, dst);
+ vec_xst(srv29, dstStride*29, dst);
+ vec_xst(srv13, dstStride*29+16, dst);
+ vec_xst(srv30, dstStride*30, dst);
+ vec_xst(srv14, dstStride*30+16, dst);
+ vec_xst(srv31, dstStride*31, dst);
+ vec_xst(srv15, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+
+ //mode 19:
+ //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26};
+ //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0};
+    //mode=19 width=32: 25 samples projected from the left column, at left offsets
+    //(invAngleSum >> 8) = 1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 18, 20, 21, 22, 23, 25, 26, 27, 28, 30, 31
+
+    vec_u8_t srv_left=vec_xl(8, srcPix0);    /* left neighbour samples (projected) */
+    vec_u8_t srv_right=vec_xl(0, srcPix0);   /* top neighbour samples */
+ vec_u8_t refmask_4={0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+    vec_u8_t srv0 = vec_perm(srv, srv, mask0);  /* ref[offset[y] + x], rows 0-3 packed four per vector */
+    vec_u8_t srv1 = vec_perm(srv, srv, mask1);  /* ref[offset[y] + x + 1], rows 0-3 */
+
+vec_u8_t vfrac4 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24};
+vec_u8_t vfrac4_32 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[off + x] + f[y] * ref[off + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+    else if(dstStride%16 == 0){
+        /* one 4-byte row per store; step by the actual stride, not a hard-coded 16 */
+        vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+        vec_ste((vec_u32_t)vec_sld(vout, vout, 12), dstStride, (unsigned int*)dst);
+        vec_ste((vec_u32_t)vec_sld(vout, vout, 8), dstStride*2, (unsigned int*)dst);
+        vec_ste((vec_u32_t)vec_sld(vout, vout, 4), dstStride*3, (unsigned int*)dst);
+    }
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
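The hard-coded `offset[32]`/`fraction[32]` tables in the comments above follow from a running angle accumulator. A sketch of that derivation (hypothetical helper, not x265 API; assumes intraPredAngle = -26 for mode 19, per the HEVC angular intra tables):

```cpp
#include <cassert>

// Derive the per-row integer reference offset and 1/32-pel fraction for an
// angular intra mode: accumulate the angle, then split into offset (arithmetic
// shift, i.e. floor) and a 5-bit interpolation weight.
static void derive_tables(int angle, int height, int* off, int* frac)
{
    int angSum = 0;
    for (int y = 0; y < height; y++)
    {
        angSum += angle;        // angSum == (y + 1) * angle
        off[y]  = angSum >> 5;  // integer sample offset
        frac[y] = angSum & 31;  // interpolation weight in 1/32 units
    }
}
```

With angle -26 this reproduces offset[] = {-1, -2, -3, ...} and fraction[] = {6, 12, 18, ...}, which is why the mask and vfrac constants above step the way they do.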
+
+template<>
+void intra_pred<8, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, };
+vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, };
+vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+    vec_u8_t srv_left=vec_xl(16, srcPix0);   /* left neighbour samples */
+    vec_u8_t srv_right=vec_xl(0, srcPix0);   /* top neighbour samples */
+ vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+ /* fraction[0-7] */
+vec_u8_t vfrac8_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac8_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-7] */
+vec_u8_t vfrac8_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+    one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+    one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+    one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask1={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv_left=vec_xl(32, srcPix0);   /* left neighbour samples */
+    vec_u8_t srv_right=vec_xl(0, srcPix0);   /* top neighbour samples */
+ vec_u8_t refmask_16 ={0xf, 0xe, 0xc, 0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(4, srcPix0);
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+    vec_u8_t srv5 = srv4;   /* mask5 would equal mask4 */
+    vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+    vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+    vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+    vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+    vec_u8_t srv10 = srv9;  /* mask10 would equal mask9 */
+    vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+    vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+    vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+    vec_u8_t srv15 = srv14; /* mask15 would equal mask14 */
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0;
+ vec_u8_t srv2_add1 = srv1;
+ vec_u8_t srv3_add1 = srv2;
+ vec_u8_t srv4_add1 = srv3;
+ vec_u8_t srv5_add1 = srv3;
+ vec_u8_t srv6_add1 = srv4;
+ vec_u8_t srv7_add1 = srv6;
+ vec_u8_t srv8_add1 = srv7;
+ vec_u8_t srv9_add1 = srv8;
+ vec_u8_t srv10_add1 = srv8;
+ vec_u8_t srv11_add1 = srv9;
+    vec_u8_t srv12_add1 = srv11;
+ vec_u8_t srv13_add1 = srv12;
+ vec_u8_t srv14_add1 = srv13;
+ vec_u8_t srv15_add1 = srv13;
+
+
+ /* fraction[0-15] */
+vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32- fraction[0-15] */
+vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[off + x] + f[y] * ref[off + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
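Each `one_line()` expansion above (the `vec_mule`/`vec_mulo` widening multiplies, the `+ 16`, the `>> 5`, and the merge/pack) computes sixteen instances of the same two-tap filter. A scalar form of one row (hypothetical helper name, not x265 API):

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel;

// Scalar sketch of the per-row angular interpolation the vector code performs:
//   dst[x] = ((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5
static void angular_row(pixel* dst, const pixel* ref, int off, int frac, int width)
{
    for (int x = 0; x < width; x++)
        dst[x] = (pixel)(((32 - frac) * ref[off + x]
                          + frac * ref[off + x + 1] + 16) >> 5);
}
```

The vfrac16_N / vfrac16_32_N constant pairs above are exactly `frac` and `32 - frac` splatted across all sixteen lanes so the whole row is filtered in one pass.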
+
+template<>
+void intra_pred<32, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask2={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask11={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask12={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask13={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask14={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+//vec_u8_t mask15={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask16={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+
+vec_u8_t mask17={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask18={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask19={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask20={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask21={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask22={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask23={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask24={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask25={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask26={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask27={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask28={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t refmask_32_0 ={0x1f, 0x1e, 0x1c, 0x1b, 0x1a, 0x19, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11, 0x10, 0xf, 0xe, 0xc};
+ vec_u8_t refmask_32_1 = {0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+
+ vec_u8_t srv_left0=vec_xl(64, srcPix0);
+ vec_u8_t srv_left1=vec_xl(80, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(7, srcPix0);
+ vec_u8_t s3 = vec_xl(16+7, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv1 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv2 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv4 = vec_perm(s1, s2, mask4);
+    vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = vec_perm(s1, s2, mask6);
+ vec_u8_t srv7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv8 = vec_perm(s1, s2, mask8);
+ vec_u8_t srv9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = s1;
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv16_0 = vec_perm(s2, s3, mask0);
+ vec_u8_t srv16_1 = vec_perm(s2, s3, mask1);
+ vec_u8_t srv16_2 = vec_perm(s2, s3, mask2);
+ vec_u8_t srv16_3 = vec_perm(s2, s3, mask3);
+ vec_u8_t srv16_4 = vec_perm(s2, s3, mask4);
+    vec_u8_t srv16_5 = srv16_4;
+ vec_u8_t srv16_6 = vec_perm(s2, s3, mask6);
+ vec_u8_t srv16_7 = vec_perm(s2, s3, mask7);
+ vec_u8_t srv16_8 = vec_perm(s2, s3, mask8);
+ vec_u8_t srv16_9 = vec_perm(s2, s3, mask9);
+ vec_u8_t srv16_10 = srv16_9;
+ vec_u8_t srv16_11 = s2;
+    vec_u8_t srv16_12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv16_14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv16_15 = srv16_14;
+ //0,1,2,3,4,4,6,7,8,9,9(1,2),11(1),12(0,1),13,14,14,15,16,17,18,19,20,20,22,23,24,25,25,27,28,29,30(0),30,
+
+ vec_u8_t srv16 = vec_perm(s0, s1, mask16);
+ vec_u8_t srv17 = vec_perm(s0, s1, mask17);
+ vec_u8_t srv18 = vec_perm(s0, s1, mask18);
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = vec_perm(s0, s1, mask20);
+ vec_u8_t srv21 = srv20;
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = vec_perm(s0, s1, mask23);
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = vec_perm(s0, s1, mask25);
+ vec_u8_t srv26 = srv25;
+ vec_u8_t srv27 = vec_perm(s0, s1, mask27);
+ vec_u8_t srv28 = vec_perm(s0, s1, mask28);
+ vec_u8_t srv29 = vec_perm(s0, s1, mask29);
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16);
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask17);
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask18);
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = vec_perm(s1, s2, mask20);
+ vec_u8_t srv16_21 = srv16_20;
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = vec_perm(s1, s2, mask23);
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = vec_perm(s1, s2, mask25);
+ vec_u8_t srv16_26 = srv16_25;
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask27);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask28);
+ vec_u8_t srv16_29 = vec_perm(s1, s2, mask29);
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv1add1 = srv0;
+ vec_u8_t srv2add1 = srv1;
+ vec_u8_t srv3add1 = srv2;
+ vec_u8_t srv4add1 = srv3;
+ vec_u8_t srv5add1 = srv3;
+ vec_u8_t srv6add1 = srv4;
+ vec_u8_t srv7add1 = srv6;
+ vec_u8_t srv8add1 = srv7;
+ vec_u8_t srv9add1 = srv8;
+ vec_u8_t srv10add1 = srv8;
+ vec_u8_t srv11add1 = srv9;
+    vec_u8_t srv12add1 = srv11;
+ vec_u8_t srv13add1 = srv12;
+ vec_u8_t srv14add1 = srv13;
+ vec_u8_t srv15add1 = srv13;
+
+    //0(1,2),1,2,3,3,4,6,7,8,8,9,11(1),12(0,1),13,13,14,16,17,18,19,19,20,22,23,24,24,25,27,28,29,29,
+
+ vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16_0;
+ vec_u8_t srv16add1_2 = srv16_1;
+ vec_u8_t srv16add1_3 = srv16_2;
+ vec_u8_t srv16add1_4 = srv16_3;
+ vec_u8_t srv16add1_5 = srv16_3;
+ vec_u8_t srv16add1_6 = srv16_4;
+ vec_u8_t srv16add1_7 = srv16_6;
+ vec_u8_t srv16add1_8 = srv16_7;
+ vec_u8_t srv16add1_9 = srv16_8;
+ vec_u8_t srv16add1_10 = srv16_8;
+ vec_u8_t srv16add1_11 = srv16_9;
+    vec_u8_t srv16add1_12 = srv16_11;
+ vec_u8_t srv16add1_13 = srv16_12;
+ vec_u8_t srv16add1_14 = srv16_13;
+ vec_u8_t srv16add1_15 = srv16_13;
+
+ vec_u8_t srv16add1 = srv14;
+ vec_u8_t srv17add1 = srv16;
+ vec_u8_t srv18add1 = srv17;
+ vec_u8_t srv19add1 = srv18;
+ vec_u8_t srv20add1 = srv19;
+ vec_u8_t srv21add1 = srv19;
+ vec_u8_t srv22add1 = srv20;
+ vec_u8_t srv23add1 = srv22;
+ vec_u8_t srv24add1 = srv23;
+ vec_u8_t srv25add1 = srv24;
+ vec_u8_t srv26add1 = srv24;
+ vec_u8_t srv27add1 = srv25;
+ vec_u8_t srv28add1 = srv27;
+ vec_u8_t srv29add1 = srv28;
+ vec_u8_t srv30add1 = srv29;
+ vec_u8_t srv31add1 = srv29;
+
+ vec_u8_t srv16add1_16 = srv16_14;
+ vec_u8_t srv16add1_17 = srv16_16;
+ vec_u8_t srv16add1_18 = srv16_17;
+ vec_u8_t srv16add1_19 = srv16_18;
+ vec_u8_t srv16add1_20 = srv16_19;
+ vec_u8_t srv16add1_21 = srv16_19;
+ vec_u8_t srv16add1_22 = srv16_20;
+ vec_u8_t srv16add1_23 = srv16_22;
+ vec_u8_t srv16add1_24 = srv16_23;
+ vec_u8_t srv16add1_25 = srv16_24;
+ vec_u8_t srv16add1_26 = srv16_24;
+ vec_u8_t srv16add1_27 = srv16_25;
+ vec_u8_t srv16add1_28 = srv16_27;
+ vec_u8_t srv16add1_29 = srv16_28;
+ vec_u8_t srv16add1_30 = srv16_29;
+ vec_u8_t srv16add1_31 = srv16_29;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 20>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+
+    //mode 20 (intraPredAngle = -21), 4x4:
+    //int offset[4]   = {-1, -2, -2, -3};
+    //int fraction[4] = {11, 22, 1, 12};  /* encoded in vfrac4 below; vfrac4_32 holds 32 - fraction */
+
+    //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0};
+    vec_u8_t srv_left = vec_xl(8, srcPix0);   /* load the left-column reference pixels */
+    //vec_u8_t srv_left=vec_perm(srv_left, srv_left, mask_left);
+    vec_u8_t srv_right = vec_xl(0, srcPix0);  /* load the top-left and top-row reference pixels */
+ vec_u8_t refmask_4={0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+vec_u8_t vfrac4 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12};
+vec_u8_t vfrac4_32 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+    else if(dstStride == 16){ /* the fixed 16/32/48-byte offsets below are only valid for this stride */
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 20>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, };
+vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+    vec_u8_t srv_left = vec_xl(16, srcPix0);  /* load the left-column reference pixels */
+    vec_u8_t srv_right = vec_xl(0, srcPix0);  /* load the top-left and top-row reference pixels */
+ vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+vec_u8_t vfrac8_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac8_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac8_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24};
+
+vec_u8_t vfrac8_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 20>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask1={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask2={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask3={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t maskadd1_0={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+/*vec_u8_t maskadd1_1={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t maskadd1_2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t maskadd1_3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t maskadd1_4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_9={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv_left = vec_xl(32, srcPix0);  /* load the left-column reference pixels */
+    vec_u8_t srv_right = vec_xl(0, srcPix0);  /* load the top-left and top-row reference pixels */
+ vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(6, srcPix0);
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+    vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0;
+ vec_u8_t srv2_add1 = srv0;
+ vec_u8_t srv3_add1 = srv1;
+ vec_u8_t srv4_add1 = srv3;
+ vec_u8_t srv5_add1 = srv3;
+ vec_u8_t srv6_add1 = srv4;
+ vec_u8_t srv7_add1 = srv6;
+ vec_u8_t srv8_add1 = srv6;
+ vec_u8_t srv9_add1 = srv7;
+ vec_u8_t srv10_add1 = srv9;
+ vec_u8_t srv11_add1 = srv9;
+    vec_u8_t srv12_add1 = srv10;
+ vec_u8_t srv13_add1 = srv12;
+ vec_u8_t srv14_add1 = srv12;
+ vec_u8_t srv15_add1 = srv13;
+vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 20>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask7={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+//vec_u8_t mask8={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask9={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask10={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+//vec_u8_t mask11={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask13={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+//vec_u8_t mask14={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask15={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+
+vec_u8_t mask16={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask17={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask18={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask19={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask20={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask21={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask22={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask23={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask24={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask26={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t refmask_32_0 = {0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8, };
+ vec_u8_t refmask_32_1 = {0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+
+ vec_u8_t srv_left0=vec_xl(64, srcPix0);
+ vec_u8_t srv_left1=vec_xl(80, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(12, srcPix0);
+ vec_u8_t s3 = vec_xl(16+12, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv1 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv4 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = s1;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s2, s3, mask0);
+ vec_u8_t srv16_1 = vec_perm(s2, s3, mask1);
+ vec_u8_t srv16_2 = srv16_1;
+ vec_u8_t srv16_3 = vec_perm(s2, s3, mask3);
+ vec_u8_t srv16_4 = vec_perm(s2, s3, mask4);
+ vec_u8_t srv16_5 = srv16_4;
+ vec_u8_t srv16_6 = s2;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv16_10 = vec_perm(s1, s2, mask10);
+ vec_u8_t srv16_11 = srv16_10;
+    vec_u8_t srv16_12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv16_14 = srv16_13;
+ vec_u8_t srv16_15 = vec_perm(s1, s2, mask15);
+
+ //0(1,2),1,1,3,4,4,6(1),7(0,1),7,9,10,10,12,13,13,15,16,16,18,19,19,21,22,22,24,25,25,27,28,28,30,30
+
+ vec_u8_t srv16 = vec_perm(s0, s1, mask16);
+ vec_u8_t srv17 = srv16;
+ vec_u8_t srv18 = vec_perm(s0, s1, mask18);
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = srv19;
+ vec_u8_t srv21 = vec_perm(s0, s1, mask21);
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = srv22;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = vec_perm(s0, s1, mask25);
+ vec_u8_t srv26 = srv25;
+ vec_u8_t srv27 = vec_perm(s0, s1, mask27);
+ vec_u8_t srv28 = vec_perm(s0, s1, mask28);
+ vec_u8_t srv29 = srv28;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16);
+ vec_u8_t srv16_17 = srv16_16;
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask18);
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = srv16_19;
+ vec_u8_t srv16_21 = vec_perm(s1, s2, mask21);
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = srv16_22;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = vec_perm(s1, s2, mask25);
+ vec_u8_t srv16_26 = srv16_25;
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask27);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask28);
+ vec_u8_t srv16_29 = srv16_28;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv1add1 = srv0;
+ vec_u8_t srv2add1 = srv0;
+ vec_u8_t srv3add1 = srv1;
+ vec_u8_t srv4add1 = srv3;
+ vec_u8_t srv5add1 = srv3;
+ vec_u8_t srv6add1 = srv4;
+ vec_u8_t srv7add1 = s1;
+ vec_u8_t srv8add1 = s1;
+ vec_u8_t srv9add1 = srv7;
+ vec_u8_t srv10add1 = srv9;
+ vec_u8_t srv11add1 = srv9;
+    vec_u8_t srv12add1 = srv10;
+ vec_u8_t srv13add1 = srv12;
+ vec_u8_t srv14add1 = srv12;
+ vec_u8_t srv15add1 = srv13;
+
+ vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16_0;
+ vec_u8_t srv16add1_2 = srv16_0;
+ vec_u8_t srv16add1_3 = srv16_1;
+ vec_u8_t srv16add1_4 = srv16_3;
+ vec_u8_t srv16add1_5 = srv16_3;
+ vec_u8_t srv16add1_6 = srv16_4;
+ vec_u8_t srv16add1_7 = s2;
+ vec_u8_t srv16add1_8 = s2;
+ vec_u8_t srv16add1_9 = srv16_7;
+ vec_u8_t srv16add1_10 = srv16_9;
+ vec_u8_t srv16add1_11 = srv16_9;
+    vec_u8_t srv16add1_12 = srv16_10;
+ vec_u8_t srv16add1_13 = srv16_12;
+ vec_u8_t srv16add1_14 = srv16_12;
+ vec_u8_t srv16add1_15 = srv16_13;
+
+ //0,0,1,3,3,4,6(0),6,7,9,9,10,12,12,13,15,15,16,18,18,19,21,21,22,24,24,25,27,27,28,28
+
+ vec_u8_t srv16add1 = srv15;
+ vec_u8_t srv17add1 = srv15;
+ vec_u8_t srv18add1 = srv16;
+ vec_u8_t srv19add1 = srv18;
+ vec_u8_t srv20add1 = srv18;
+ vec_u8_t srv21add1 = srv19;
+ vec_u8_t srv22add1 = srv21;
+ vec_u8_t srv23add1 = srv21;
+ vec_u8_t srv24add1 = srv22;
+ vec_u8_t srv25add1 = srv24;
+ vec_u8_t srv26add1 = srv24;
+ vec_u8_t srv27add1 = srv25;
+ vec_u8_t srv28add1 = srv27;
+ vec_u8_t srv29add1 = srv27;
+ vec_u8_t srv30add1 = srv28;
+ vec_u8_t srv31add1 = srv28;
+
+ vec_u8_t srv16add1_16 = srv16_15;
+ vec_u8_t srv16add1_17 = srv16_15;
+ vec_u8_t srv16add1_18 = srv16_16;
+ vec_u8_t srv16add1_19 = srv16_18;
+ vec_u8_t srv16add1_20 = srv16_18;
+ vec_u8_t srv16add1_21 = srv16_19;
+ vec_u8_t srv16add1_22 = srv16_21;
+ vec_u8_t srv16add1_23 = srv16_21;
+ vec_u8_t srv16add1_24 = srv16_22;
+ vec_u8_t srv16add1_25 = srv16_24;
+ vec_u8_t srv16add1_26 = srv16_24;
+ vec_u8_t srv16add1_27 = srv16_25;
+ vec_u8_t srv16add1_28 = srv16_27;
+ vec_u8_t srv16add1_29 = srv16_27;
+ vec_u8_t srv16add1_30 = srv16_28;
+ vec_u8_t srv16add1_31 = srv16_28;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_20 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+            printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+ //mode 19:
+ //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26};
+ //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0};
+ //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31
+
+ //mode19 invAS[32]= {1, 2, 4, };
+ //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0};
+ vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ //vec_u8_t srv_left=vec_perm(srv_left, srv_left, mask_left); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_4={0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+vec_u8_t vfrac4 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28};
+vec_u8_t vfrac4_32 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+    if (dstStride == 4) {
+ vec_xst(vout, 0, dst);
+ }
+    else if (dstStride % 16 == 0) {
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+    else {
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+vec_u8_t vfrac8_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac8_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac8_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac8_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask2={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask4={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask7={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+/*vec_u8_t maskadd1_1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_5={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(8, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = srv5;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+    vec_u8_t srv12 = srv11;
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0;
+ vec_u8_t srv2_add1 = srv0;
+ vec_u8_t srv3_add1 = srv1;
+ vec_u8_t srv4_add1 = srv1;
+ vec_u8_t srv5_add1 = srv3;
+ vec_u8_t srv6_add1 = srv3;
+ vec_u8_t srv7_add1 = srv5;
+ vec_u8_t srv8_add1 = srv5;
+ vec_u8_t srv9_add1 = srv7;
+ vec_u8_t srv10_add1 = srv7;
+ vec_u8_t srv11_add1 = srv9;
+    vec_u8_t srv12_add1 = srv9;
+ vec_u8_t srv13_add1 = srv11;
+ vec_u8_t srv14_add1 = srv11;
+ vec_u8_t srv15_add1 = srv13;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+//vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask1={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+//vec_u8_t mask2={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask3={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+//vec_u8_t mask4={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask5={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+//vec_u8_t mask6={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask7={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+//vec_u8_t mask8={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask9={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+//vec_u8_t mask10={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask11={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+//vec_u8_t mask12={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask13={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask14={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask15={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+
+vec_u8_t mask16={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask17={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask18={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask19={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask20={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask21={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask22={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask23={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask24={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ //vec_u8_t srv_left0=vec_xl(64, srcPix0);
+ //vec_u8_t srv_left1=vec_xl(80, srcPix0);
+ //vec_u8_t srv_right=vec_xl(0, srcPix0);
+ //vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ //vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ //vec_u8_t s2 = vec_xl(12, srcPix0);
+ //vec_u8_t s3 = vec_xl(16+12, srcPix0);
+
+ vec_u8_t srv_left0=vec_xl(64, srcPix0);
+ vec_u8_t srv_left1=vec_xl(80, srcPix0);
+ vec_u8_t refmask_32 = {0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2};
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32);
+    vec_u8_t s1 = vec_xl(0, srcPix0);
+ vec_u8_t s2 = vec_xl(16, srcPix0);
+ vec_u8_t s3 = vec_xl(32, srcPix0);
+
+
+ vec_u8_t srv0 = s1;
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = srv5;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+    vec_u8_t srv12 = srv11;
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv16_0 = s2;
+ vec_u8_t srv16_1 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv16_2 = srv16_1;
+ vec_u8_t srv16_3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv16_4 = srv16_3;
+ vec_u8_t srv16_5 = vec_perm(s1, s2, mask5);
+ vec_u8_t srv16_6 = srv16_5;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv16_10 = srv16_9;
+ vec_u8_t srv16_11 = vec_perm(s1, s2, mask11);
+    vec_u8_t srv16_12 = srv16_11;
+ vec_u8_t srv16_13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv16_14 = srv16_13;
+ vec_u8_t srv16_15 = vec_perm(s1, s2, mask15);
+
+ //s1, 1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,s0,s0
+
+ vec_u8_t srv16 = vec_perm(s0, s1, mask16);
+ vec_u8_t srv17 = srv16;
+ vec_u8_t srv18 = vec_perm(s0, s1, mask18);
+ vec_u8_t srv19 = srv18;
+ vec_u8_t srv20 = vec_perm(s0, s1, mask20);
+ vec_u8_t srv21 = srv20;
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = srv22;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = srv24;
+ vec_u8_t srv26 = vec_perm(s0, s1, mask26);
+ vec_u8_t srv27 = srv26;
+ vec_u8_t srv28 = vec_perm(s0, s1, mask28);
+ vec_u8_t srv29 = srv28;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16);
+ vec_u8_t srv16_17 = srv16_16;
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask18);
+ vec_u8_t srv16_19 = srv16_18;
+ vec_u8_t srv16_20 = vec_perm(s1, s2, mask20);
+ vec_u8_t srv16_21 = srv16_20;
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = srv16_22;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = srv16_24;
+ vec_u8_t srv16_26 = vec_perm(s1, s2, mask26);
+ vec_u8_t srv16_27 = srv16_26;
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask28);
+ vec_u8_t srv16_29 = srv16_28;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv1add1 = s1;
+ vec_u8_t srv2add1 = s1;
+ vec_u8_t srv3add1 = srv1;
+ vec_u8_t srv4add1 = srv1;
+ vec_u8_t srv5add1 = srv3;
+ vec_u8_t srv6add1 = srv3;
+ vec_u8_t srv7add1 = srv6;
+ vec_u8_t srv8add1 = srv6;
+ vec_u8_t srv9add1 = srv7;
+ vec_u8_t srv10add1 = srv7;
+ vec_u8_t srv11add1 = srv9;
+    vec_u8_t srv12add1 = srv9;
+ vec_u8_t srv13add1 = srv11;
+ vec_u8_t srv14add1 = srv11;
+ vec_u8_t srv15add1 = srv14;
+
+ vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0);
+ vec_u8_t srv16add1_1 = s2;
+ vec_u8_t srv16add1_2 = s2;
+ vec_u8_t srv16add1_3 = srv16_1;
+ vec_u8_t srv16add1_4 = srv16_1;
+ vec_u8_t srv16add1_5 = srv16_3;
+ vec_u8_t srv16add1_6 = srv16_3;
+ vec_u8_t srv16add1_7 = srv16_6;
+ vec_u8_t srv16add1_8 = srv16_6;
+ vec_u8_t srv16add1_9 = srv16_7;
+ vec_u8_t srv16add1_10 = srv16_7;
+ vec_u8_t srv16add1_11 = srv16_9;
+    vec_u8_t srv16add1_12 = srv16_9;
+ vec_u8_t srv16add1_13 = srv16_11;
+ vec_u8_t srv16add1_14 = srv16_11;
+ vec_u8_t srv16add1_15 = srv16_14;
+
+ //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,
+
+ vec_u8_t srv16add1 = srv15;
+ vec_u8_t srv17add1 = srv15;
+ vec_u8_t srv18add1 = srv16;
+ vec_u8_t srv19add1 = srv16;
+ vec_u8_t srv20add1 = srv18;
+ vec_u8_t srv21add1 = srv18;
+ vec_u8_t srv22add1 = srv20;
+ vec_u8_t srv23add1 = srv20;
+ vec_u8_t srv24add1 = srv22;
+ vec_u8_t srv25add1 = srv22;
+ vec_u8_t srv26add1 = srv24;
+ vec_u8_t srv27add1 = srv24;
+ vec_u8_t srv28add1 = srv26;
+ vec_u8_t srv29add1 = srv26;
+ vec_u8_t srv30add1 = srv28;
+ vec_u8_t srv31add1 = srv28;
+
+ vec_u8_t srv16add1_16 = srv16_15;
+ vec_u8_t srv16add1_17 = srv16_15;
+ vec_u8_t srv16add1_18 = srv16_16;
+ vec_u8_t srv16add1_19 = srv16_16;
+ vec_u8_t srv16add1_20 = srv16_18;
+ vec_u8_t srv16add1_21 = srv16_18;
+ vec_u8_t srv16add1_22 = srv16_20;
+ vec_u8_t srv16add1_23 = srv16_20;
+ vec_u8_t srv16add1_24 = srv16_22;
+ vec_u8_t srv16add1_25 = srv16_22;
+ vec_u8_t srv16add1_26 = srv16_24;
+ vec_u8_t srv16add1_27 = srv16_24;
+ vec_u8_t srv16add1_28 = srv16_26;
+ vec_u8_t srv16add1_29 = srv16_26;
+ vec_u8_t srv16add1_30 = srv16_28;
+ vec_u8_t srv16add1_31 = srv16_28;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_21 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_30 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, };
+ vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, };
+
+ vec_u8_t srv_left=vec_xl(8, srcPix0); /* left-neighbour (column) reference samples */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* top-left and top-row reference samples, ref = srcPix0 + 1 */
+ vec_u8_t refmask_4={0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t vfrac4 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12};
+ vec_u8_t vfrac4_32 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+ vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+ vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+ vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+ vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+ vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+ vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+ vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(16, srcPix0); /* left-neighbour (column) reference samples */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* top-left and top-row reference samples, ref = srcPix0 + 1 */
+ vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+ vec_u8_t vfrac8_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac8_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac8_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+ one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+ one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+ one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+/*vec_u8_t maskadd1_1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_2={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_7={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(32, srcPix0); /* left-neighbour (column) reference samples */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* top-left and top-row reference samples, ref = srcPix0 + 1 */
+ vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(10, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = srv2;
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = srv4;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = srv9;
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0_add1;
+ vec_u8_t srv2_add1 = srv0;
+ vec_u8_t srv3_add1 = srv0;
+ vec_u8_t srv4_add1 = srv2;
+ vec_u8_t srv5_add1 = srv2;
+ vec_u8_t srv6_add1 = srv2;
+ vec_u8_t srv7_add1 = srv4;
+ vec_u8_t srv8_add1 = srv4;
+ vec_u8_t srv9_add1 = srv7;
+ vec_u8_t srv10_add1 = srv7;
+ vec_u8_t srv11_add1 = srv7;
+ vec_u8_t srv12_add1 = srv9;
+ vec_u8_t srv13_add1 = srv9;
+ vec_u8_t srv14_add1 = srv12;
+ vec_u8_t srv15_add1 = srv12;
+vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+//vec_u8_t mask1={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask2={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+//vec_u8_t mask3={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask4={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+//vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+//vec_u8_t mask6={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask7={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask8={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask9={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask10={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask11={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask12={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask13={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask14={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask15={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+
+//vec_u8_t mask16={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask17={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask18={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask19={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask20={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask21={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask22={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask23={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask24={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask25={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0 = vec_xl(64, srcPix0); /* left-neighbour (column) reference samples */
+ vec_u8_t srv_left1 = vec_xl(80, srcPix0);
+ vec_u8_t srv_right = vec_xl(0, srcPix0); /* top-left and top-row reference samples, ref = srcPix0 + 1 */
+ vec_u8_t refmask_32_0 ={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(4, srcPix0);
+ vec_u8_t s2 = vec_xl(20, srcPix0);
+ //vec_u8_t s3 = vec_xl(36, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = srv2;
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = srv4;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = srv9;
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv16_1 = srv16_0;
+ vec_u8_t srv16_2 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv16_3 = srv16_2;
+ vec_u8_t srv16_4 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv16_5 = srv16_4;
+ vec_u8_t srv16_6 = srv16_4;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv16_10 = srv16_9;
+ vec_u8_t srv16_11 = srv16_9;
+ vec_u8_t srv16_12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = srv16_12;
+ vec_u8_t srv16_14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv16_15 = srv16_14;
+
+ //0(0,1),0,2,2,4,4,4,7,7,9,9,9,12,12,14,14,14,17,17,19,19,19,22,22,24,24,24,27,27,s0,s0,s0
+
+ vec_u8_t srv16 = srv14;
+ vec_u8_t srv17 = vec_perm(s0, s1, mask17);
+ vec_u8_t srv18 = srv17;
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = srv19;
+ vec_u8_t srv21 = srv19;
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = srv22;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = srv24;
+ vec_u8_t srv26 = srv24;
+ vec_u8_t srv27 = vec_perm(s0, s1, mask27);
+ vec_u8_t srv28 = srv27;
+ vec_u8_t srv29 = s0;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = srv16_14;
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask17);
+ vec_u8_t srv16_18 = srv16_17;
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = srv16_19;
+ vec_u8_t srv16_21 = srv16_19;
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = srv16_22;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = srv16_24;
+ vec_u8_t srv16_26 = srv16_24;
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask27);
+ vec_u8_t srv16_28 = srv16_27;
+ vec_u8_t srv16_29 = s1;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1add1 = srv0add1;
+ vec_u8_t srv2add1 = srv0;
+ vec_u8_t srv3add1 = srv0;
+ vec_u8_t srv4add1 = srv2;
+ vec_u8_t srv5add1 = srv2;
+ vec_u8_t srv6add1 = srv2;
+ vec_u8_t srv7add1 = srv4;
+ vec_u8_t srv8add1 = srv4;
+ vec_u8_t srv9add1 = srv7;
+ vec_u8_t srv10add1 = srv7;
+ vec_u8_t srv11add1 = srv7;
+ vec_u8_t srv12add1 = srv9;
+ vec_u8_t srv13add1 = srv9;
+ vec_u8_t srv14add1 = srv12;
+ vec_u8_t srv15add1 = srv12;
+
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16add1_0;
+ vec_u8_t srv16add1_2 = srv16_0;
+ vec_u8_t srv16add1_3 = srv16_0;
+ vec_u8_t srv16add1_4 = srv16_2;
+ vec_u8_t srv16add1_5 = srv16_2;
+ vec_u8_t srv16add1_6 = srv16_2;
+ vec_u8_t srv16add1_7 = srv16_4;
+ vec_u8_t srv16add1_8 = srv16_4;
+ vec_u8_t srv16add1_9 = srv16_7;
+ vec_u8_t srv16add1_10 = srv16_7;
+ vec_u8_t srv16add1_11 = srv16_7;
+ vec_u8_t srv16add1_12 = srv16_9;
+ vec_u8_t srv16add1_13 = srv16_9;
+ vec_u8_t srv16add1_14 = srv16_12;
+ vec_u8_t srv16add1_15 = srv16_12;
+
+ //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,
+ //0,0,2,2,2,4,4,7,7,7,9,9,12,12,12,14,14,17,17,17,19,19,22,22,22,24,24,27,27,27,
+
+ vec_u8_t srv16add1 = srv12;
+ vec_u8_t srv17add1 = srv14;
+ vec_u8_t srv18add1 = srv14;
+ vec_u8_t srv19add1 = srv17;
+ vec_u8_t srv20add1 = srv17;
+ vec_u8_t srv21add1 = srv17;
+ vec_u8_t srv22add1 = srv19;
+ vec_u8_t srv23add1 = srv19;
+ vec_u8_t srv24add1 = srv22;
+ vec_u8_t srv25add1 = srv22;
+ vec_u8_t srv26add1 = srv22;
+ vec_u8_t srv27add1 = srv24;
+ vec_u8_t srv28add1 = srv24;
+ vec_u8_t srv29add1 = srv27;
+ vec_u8_t srv30add1 = srv27;
+ vec_u8_t srv31add1 = srv27;
+
+ vec_u8_t srv16add1_16 = srv16_12;
+ vec_u8_t srv16add1_17 = srv16_14;
+ vec_u8_t srv16add1_18 = srv16_14;
+ vec_u8_t srv16add1_19 = srv16_17;
+ vec_u8_t srv16add1_20 = srv16_17;
+ vec_u8_t srv16add1_21 = srv16_17;
+ vec_u8_t srv16add1_22 = srv16_19;
+ vec_u8_t srv16add1_23 = srv16_19;
+ vec_u8_t srv16add1_24 = srv16_22;
+ vec_u8_t srv16add1_25 = srv16_22;
+ vec_u8_t srv16add1_26 = srv16_22;
+ vec_u8_t srv16add1_27 = srv16_24;
+ vec_u8_t srv16add1_28 = srv16_24;
+ vec_u8_t srv16add1_29 = srv16_27;
+ vec_u8_t srv16add1_30 = srv16_27;
+ vec_u8_t srv16add1_31 = srv16_27;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
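
The `one_line()` macro vectorizes the per-pixel formula quoted in the comments above. A minimal scalar sketch (an illustrative helper, not part of x265) of that computation:

```cpp
#include <cstdint>

// Scalar model of the angular interpolation each vector lane performs:
//   dst = ((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5
static inline uint8_t intra_interp(uint8_t a, uint8_t b, int frac)
{
    // a = ref[off + x], b = ref[off + x + 1], frac in [0, 31]
    return (uint8_t)(((32 - frac) * a + frac * b + 16) >> 5);
}
```

With `frac == 0` the left sample is returned unchanged, which is why the all-zero `vfrac16_31` row pairs with an all-32 `vfrac16_32_31` row.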
+
+template<>
+void intra_pred<4, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+ //mode 23:
+ //int offset[32] = {-1, -1, -1, -2, -2, -2, -2, -3, -3, -3, -4, -4, -4, -4, -5, -5, -5, -6, -6, -6, -6, -7, -7, -7, -8, -8, -8, -8, -9, -9, -9, -9};
+ //int fraction[32] = {23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, 7, 30, 21, 12, 3, 26, 17, 8, 31, 22, 13, 4, 27, 18, 9, 0};
+
+ //mode 23 width=4: nbProjected=1, (invAngleSum >> 8)=4
+ //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0};
+ vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ //vec_u8_t srv_left=vec_perm(srv_left, srv_left, mask_left); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+vec_u8_t refmask_4={0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+vec_u8_t vfrac4 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28};
+vec_u8_t vfrac4_32 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
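
The `vfrac4`/`vfrac4_32` constants above follow from the HEVC angle for mode 23 (intraPredAngle = -9). A sketch of how each row's reference offset and 1/32 fraction are derived (the helper name is illustrative, not x265 API):

```cpp
#include <utility>

// For negative vertical modes, row y uses pos = (y + 1) * intraPredAngle;
// the arithmetic >> 5 gives the reference offset and the low five bits
// give the interpolation fraction.
static std::pair<int, int> angular_step(int y, int angle = -9 /* mode 23 */)
{
    int pos = (y + 1) * angle;
    return { pos >> 5, pos & 31 };
}
```

Rows 0..3 yield fractions 23, 14, 5, 28 — exactly the lanes of `vfrac4` — and 32 minus each value populates `vfrac4_32`.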
+
+template<>
+void intra_pred<8, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+vec_u8_t refmask_8={0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, };
+
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+vec_u8_t vfrac8_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac8_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac8_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac8_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
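
When `dstStride` is not the block width, the stores above load the existing destination, splice in the freshly computed bytes via `vec_perm` with `v_mask0`/`v_mask1`, and write the full 16 bytes back. A scalar model of that read-modify-write (a hypothetical helper for illustration):

```cpp
#include <cstdint>
#include <cstring>

// Keep the first n computed bytes, preserve the tail of the 16-byte
// destination vector -- the role played by the v_mask permutes.
static void blend_store(uint8_t* dst, const uint8_t* vout, int n)
{
    uint8_t tmp[16];
    std::memcpy(tmp, dst, 16);   // vec_xl of the current destination
    std::memcpy(tmp, vout, n);   // splice in n new pixels
    std::memcpy(dst, tmp, 16);   // vec_xst back
}
```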
+
+template<>
+void intra_pred<16, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+/*vec_u8_t maskadd1_1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(12, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = srv3;
+ vec_u8_t srv6 = srv3;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = srv7;
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+ vec_u8_t srv12 = srv10;
+ vec_u8_t srv13 = srv10;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0_add1;
+ vec_u8_t srv2_add1 = srv0_add1;
+ vec_u8_t srv3_add1 = srv0;
+ vec_u8_t srv4_add1 = srv0;
+ vec_u8_t srv5_add1 = srv0;
+ vec_u8_t srv6_add1 = srv0;
+ vec_u8_t srv7_add1 = srv3;
+ vec_u8_t srv8_add1 = srv3;
+ vec_u8_t srv9_add1 = srv3;
+ vec_u8_t srv10_add1 = srv7;
+ vec_u8_t srv11_add1 = srv7;
+ vec_u8_t srv12_add1 = srv7;
+ vec_u8_t srv13_add1 = srv7;
+ vec_u8_t srv14_add1 = srv10;
+ vec_u8_t srv15_add1 = srv10;
+vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
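The vector routine above is an unrolled form of the HEVC angular interpolation noted in its comment: each output pixel is a 32-weight blend of two adjacent reference pixels, rounded and shifted by 5. A scalar reference sketch for comparison (the function name, `off`, and `f` tables are illustrative, not x265 symbols):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the interpolation the vector code performs per line:
// dst[y*stride+x] = ((32 - f[y]) * ref[off[y]+x]
//                    + f[y] * ref[off[y]+x+1] + 16) >> 5
// where f[y] is the per-row fraction and off[y] the per-row offset.
static void intra_pred_angle_ref(uint8_t* dst, int stride,
                                 const uint8_t* ref, const int* off,
                                 const int* f, int size)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * stride + x] =
                (uint8_t)(((32 - f[y]) * ref[off[y] + x]
                           + f[y] * ref[off[y] + x + 1] + 16) >> 5);
}
```

The `vfrac16_*` vectors above are exactly the `f[y]` values splatted across all 16 lanes, and the `vfrac16_32_*` vectors hold the matching `32 - f[y]` weights.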
+
+template<>
+void intra_pred<32, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask14={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask17={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask21={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+/*vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask11={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask12={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask13={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask15={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+
+vec_u8_t mask16={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask18={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask19={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask20={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask22={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask23={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask25={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask26={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/
+
+vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0 = vec_xl(64, srcPix0);
+ vec_u8_t srv_left1 = vec_xl(80, srcPix0);
+ vec_u8_t srv_right = vec_xl(0, srcPix0);
+ vec_u8_t refmask_32_0 ={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(8, srcPix0);
+ vec_u8_t s2 = vec_xl(24, srcPix0);
+ //vec_u8_t s3 = vec_xl(40, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = srv3;
+ vec_u8_t srv6 = srv3;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = srv7;
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+ vec_u8_t srv12 = srv10;
+ vec_u8_t srv13 = srv10;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ //0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0
+
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv16_1 = srv16_0;
+ vec_u8_t srv16_2 = srv16_0;
+ vec_u8_t srv16_3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv16_4 = srv16_3;
+ vec_u8_t srv16_5 = srv16_3;
+ vec_u8_t srv16_6 = srv16_3;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = srv16_7;
+ vec_u8_t srv16_10 = vec_perm(s1, s2, mask10);
+ vec_u8_t srv16_11 = srv16_10;
+ vec_u8_t srv16_12 = srv16_10;
+ vec_u8_t srv16_13 = srv16_10;
+ vec_u8_t srv16_14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv16_15 = srv16_14;
+
+ vec_u8_t srv16 = srv14;
+ vec_u8_t srv17 = vec_perm(s0, s1, mask17);
+ vec_u8_t srv18 = srv17;
+ vec_u8_t srv19 = srv17;
+ vec_u8_t srv20 = srv17;
+ vec_u8_t srv21 = vec_perm(s0, s1, mask21);
+ vec_u8_t srv22 = srv21;
+ vec_u8_t srv23 = srv21;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = srv24;
+ vec_u8_t srv26 = srv24;
+ vec_u8_t srv27 = srv24;
+ vec_u8_t srv28 = s0;
+ vec_u8_t srv29 = s0;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = srv16_14;
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask17);
+ vec_u8_t srv16_18 = srv16_17;
+ vec_u8_t srv16_19 = srv16_17;
+ vec_u8_t srv16_20 = srv16_17;
+ vec_u8_t srv16_21 = vec_perm(s1, s2, mask21);
+ vec_u8_t srv16_22 = srv16_21;
+ vec_u8_t srv16_23 = srv16_21;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = srv16_24;
+ vec_u8_t srv16_26 = srv16_24;
+ vec_u8_t srv16_27 = srv16_24;
+ vec_u8_t srv16_28 = s1;
+ vec_u8_t srv16_29 = s1;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1add1 = srv0add1;
+ vec_u8_t srv2add1 = srv0add1;
+ vec_u8_t srv3add1 = srv0;
+ vec_u8_t srv4add1 = srv0;
+ vec_u8_t srv5add1 = srv0;
+ vec_u8_t srv6add1 = srv0;
+ vec_u8_t srv7add1 = srv3;
+ vec_u8_t srv8add1 = srv3;
+ vec_u8_t srv9add1 = srv3;
+ vec_u8_t srv10add1 = srv7;
+ vec_u8_t srv11add1 = srv7;
+ vec_u8_t srv12add1 = srv7;
+ vec_u8_t srv13add1 = srv7;
+ vec_u8_t srv14add1 = srv10;
+ vec_u8_t srv15add1 = srv10;
+ //0,0,0,0,3,3,3,7,7,7,7,10,10,10,14,14,14,14,17,17,17,21,21,21,21,24,24,24,24,
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16add1_0;
+ vec_u8_t srv16add1_2 = srv16add1_0;
+ vec_u8_t srv16add1_3 = srv16_0;
+ vec_u8_t srv16add1_4 = srv16_0;
+ vec_u8_t srv16add1_5 = srv16_0;
+ vec_u8_t srv16add1_6 = srv16_0;
+ vec_u8_t srv16add1_7 = srv16_3;
+ vec_u8_t srv16add1_8 = srv16_3;
+ vec_u8_t srv16add1_9 = srv16_3;
+ vec_u8_t srv16add1_10 = srv16_7;
+ vec_u8_t srv16add1_11 = srv16_7;
+ vec_u8_t srv16add1_12 = srv16_7;
+ vec_u8_t srv16add1_13 = srv16_7;
+ vec_u8_t srv16add1_14 = srv16_10;
+ vec_u8_t srv16add1_15 = srv16_10;
+
+ vec_u8_t srv16add1 = srv10;
+ vec_u8_t srv17add1 = srv14;
+ vec_u8_t srv18add1 = srv14;
+ vec_u8_t srv19add1 = srv14;
+ vec_u8_t srv20add1 = srv14;
+ vec_u8_t srv21add1 = srv17;
+ vec_u8_t srv22add1 = srv17;
+ vec_u8_t srv23add1 = srv17;
+ vec_u8_t srv24add1 = srv21;
+ vec_u8_t srv25add1 = srv21;
+ vec_u8_t srv26add1 = srv21;
+ vec_u8_t srv27add1 = srv21;
+ vec_u8_t srv28add1 = srv24;
+ vec_u8_t srv29add1 = srv24;
+ vec_u8_t srv30add1 = srv24;
+ vec_u8_t srv31add1 = srv24;
+
+ vec_u8_t srv16add1_16 = srv16_10;
+ vec_u8_t srv16add1_17 = srv16_14;
+ vec_u8_t srv16add1_18 = srv16_14;
+ vec_u8_t srv16add1_19 = srv16_14;
+ vec_u8_t srv16add1_20 = srv16_14;
+ vec_u8_t srv16add1_21 = srv16_17;
+ vec_u8_t srv16add1_22 = srv16_17;
+ vec_u8_t srv16add1_23 = srv16_17;
+ vec_u8_t srv16add1_24 = srv16_21;
+ vec_u8_t srv16add1_25 = srv16_21;
+ vec_u8_t srv16add1_26 = srv16_21;
+ vec_u8_t srv16add1_27 = srv16_21;
+ vec_u8_t srv16add1_28 = srv16_24;
+ vec_u8_t srv16add1_29 = srv16_24;
+ vec_u8_t srv16add1_30 = srv16_24;
+ vec_u8_t srv16add1_31 = srv16_24;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_29 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, };
+vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, };
+
+
+ //mode 19:
+ //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26};
+ //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0};
+ //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31
+
+ //mode19 invAS[32]= {1, 2, 4, };
+ //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0};
+ //vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ //vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ // vec_u8_t refmask_4={0x10, 0x11, 0x12, 0x13, 0x14, 0x00, };
+ //vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+ vec_u8_t srv = vec_xl(0, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+vec_u8_t vfrac4 = (vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12};
+vec_u8_t vfrac4_32 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
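The `vec_mule`/`vec_mulo` pair in the function above splits the 16 byte lanes into even and odd halves because an 8-bit-by-8-bit product needs 16 bits of headroom; `vec_mergeh`/`vec_mergel` then re-interleave the two halves before `vec_pack` narrows back to bytes. A scalar model of what one such line computes per lane (`blend_line` is an illustrative name, not an x265 symbol):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the widen-multiply / round / narrow pipeline:
// each u8 lane is widened to u16 for the two multiplies, 16 is added
// for rounding, the sum is shifted right by 5, and the result is
// narrowed back to a u8 lane.
static void blend_line(const uint8_t* a, const uint8_t* b,
                       uint8_t f32, uint8_t f, uint8_t* out)
{
    for (int i = 0; i < 16; i++)
    {
        uint16_t sum = (uint16_t)(f32 * a[i]) + (uint16_t)(f * b[i]) + 16;
        out[i] = (uint8_t)(sum >> 5);  // same per-lane result as one_line()
    }
}
```

The even/odd split is purely a register-width trick; after the final pack the lane order matches the straightforward loop above.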
+
+template<>
+void intra_pred<8, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask2={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_8={0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+vec_u8_t vfrac8_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac8_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac8_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+    if (dstStride == 8) {
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+    else {
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask12={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t maskadd1_0={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+/*vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask8={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask13={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_6={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv_left = vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+    vec_u8_t srv_right = vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xd, 0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(14, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = srv0;
+ vec_u8_t srv4 = srv0;
+ vec_u8_t srv5 = srv0;
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = srv6;
+ vec_u8_t srv8 = srv6;
+ vec_u8_t srv9 = srv6;
+ vec_u8_t srv10 = srv6;
+ vec_u8_t srv11 = srv6;
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = srv12;
+ vec_u8_t srv15 = srv12;
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0_add1;
+ vec_u8_t srv2_add1 = srv0_add1;
+ vec_u8_t srv3_add1 = srv0_add1;
+ vec_u8_t srv4_add1 = srv0_add1;
+ vec_u8_t srv5_add1 = srv0_add1;
+ vec_u8_t srv6_add1 = srv0;
+ vec_u8_t srv7_add1 = srv0;
+ vec_u8_t srv8_add1 = srv0;
+ vec_u8_t srv9_add1 = srv0;
+ vec_u8_t srv10_add1 = srv0;
+ vec_u8_t srv11_add1 = srv0;
+    vec_u8_t srv12_add1 = srv6;
+ vec_u8_t srv13_add1 = srv6;
+ vec_u8_t srv14_add1 = srv6;
+ vec_u8_t srv15_add1 = srv6;
+vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+/*vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };*/
+vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+/*vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask15={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+
+vec_u8_t mask16={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask17={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask18={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };*/
+vec_u8_t mask19={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+/*vec_u8_t mask20={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask21={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask22={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask23={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask25={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask26={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask27={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/
+
+vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0 = vec_xl(64, srcPix0);
+ vec_u8_t srv_left1 = vec_xl(80, srcPix0);
+    vec_u8_t srv_right = vec_xl(0, srcPix0);
+ vec_u8_t refmask_32_0 ={0x1a, 0x13, 0xd, 0x6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+    vec_u8_t s1 = vec_xl(12, srcPix0);
+ vec_u8_t s2 = vec_xl(28, srcPix0);
+ //vec_u8_t s3 = vec_xl(44, srcPix0);
+
+ //(0,6)(6,6)(12,7)(19,6)(25, s0)
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = srv0;
+ vec_u8_t srv4 = srv0;
+ vec_u8_t srv5 = srv0;
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = srv6;
+ vec_u8_t srv8 = srv6;
+ vec_u8_t srv9 = srv6;
+ vec_u8_t srv10 = srv6;
+ vec_u8_t srv11 = srv6;
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = srv12;
+ vec_u8_t srv15 = srv12;
+
+ //0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0
+
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv16_1 = srv16_0;
+ vec_u8_t srv16_2 = srv16_0;
+ vec_u8_t srv16_3 = srv16_0;
+ vec_u8_t srv16_4 = srv16_0;
+ vec_u8_t srv16_5 = srv16_0;
+ vec_u8_t srv16_6 = vec_perm(s1, s2, mask6);
+ vec_u8_t srv16_7 = srv16_6;
+ vec_u8_t srv16_8 = srv16_6;
+ vec_u8_t srv16_9 = srv16_6;
+ vec_u8_t srv16_10 = srv16_6;
+ vec_u8_t srv16_11 = srv16_6;
+    vec_u8_t srv16_12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = srv16_12;
+ vec_u8_t srv16_14 = srv16_12;
+ vec_u8_t srv16_15 = srv16_12;
+
+ vec_u8_t srv16 = srv12;
+ vec_u8_t srv17 = srv12;
+ vec_u8_t srv18 = srv12;
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = srv19;
+ vec_u8_t srv21 = srv19;
+ vec_u8_t srv22 = srv19;
+ vec_u8_t srv23 = srv19;
+ vec_u8_t srv24 = srv19;
+ vec_u8_t srv25 = s0;
+ vec_u8_t srv26 = s0;
+ vec_u8_t srv27 = s0;
+ vec_u8_t srv28 = s0;
+ vec_u8_t srv29 = s0;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = srv16_12;
+ vec_u8_t srv16_17 = srv16_12;
+ vec_u8_t srv16_18 = srv16_12;
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = srv16_19;
+ vec_u8_t srv16_21 = srv16_19;
+ vec_u8_t srv16_22 = srv16_19;
+ vec_u8_t srv16_23 = srv16_19;
+ vec_u8_t srv16_24 = srv16_19;
+ vec_u8_t srv16_25 = s1;
+ vec_u8_t srv16_26 = s1;
+ vec_u8_t srv16_27 = s1;
+ vec_u8_t srv16_28 = s1;
+ vec_u8_t srv16_29 = s1;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1add1 = srv0add1;
+ vec_u8_t srv2add1 = srv0add1;
+ vec_u8_t srv3add1 = srv0add1;
+ vec_u8_t srv4add1 = srv0add1;
+ vec_u8_t srv5add1 = srv0add1;
+ vec_u8_t srv6add1 = srv0;
+ vec_u8_t srv7add1 = srv0;
+ vec_u8_t srv8add1 = srv0;
+ vec_u8_t srv9add1 = srv0;
+ vec_u8_t srv10add1 = srv0;
+ vec_u8_t srv11add1 = srv0;
+    vec_u8_t srv12add1 = srv6;
+ vec_u8_t srv13add1 = srv6;
+ vec_u8_t srv14add1 = srv6;
+ vec_u8_t srv15add1 = srv6;
+
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16add1_0;
+ vec_u8_t srv16add1_2 = srv16add1_0;
+ vec_u8_t srv16add1_3 = srv16add1_0;
+ vec_u8_t srv16add1_4 = srv16add1_0;
+ vec_u8_t srv16add1_5 = srv16add1_0;
+ vec_u8_t srv16add1_6 = srv16_0;
+ vec_u8_t srv16add1_7 = srv16_0;
+ vec_u8_t srv16add1_8 = srv16_0;
+ vec_u8_t srv16add1_9 = srv16_0;
+ vec_u8_t srv16add1_10 = srv16_0;
+ vec_u8_t srv16add1_11 = srv16_0;
+    vec_u8_t srv16add1_12 = srv16_6;
+ vec_u8_t srv16add1_13 = srv16_6;
+ vec_u8_t srv16add1_14 = srv16_6;
+ vec_u8_t srv16add1_15 = srv16_6;
+
+ vec_u8_t srv16add1 = srv6;
+ vec_u8_t srv17add1 = srv6;
+ vec_u8_t srv18add1 = srv6;
+ vec_u8_t srv19add1 = srv12;
+ vec_u8_t srv20add1 = srv12;
+ vec_u8_t srv21add1 = srv12;
+ vec_u8_t srv22add1 = srv12;
+ vec_u8_t srv23add1 = srv12;
+ vec_u8_t srv24add1 = srv12;
+ vec_u8_t srv25add1 = srv19;
+ vec_u8_t srv26add1 = srv19;
+ vec_u8_t srv27add1 = srv19;
+ vec_u8_t srv28add1 = srv19;
+ vec_u8_t srv29add1 = srv19;
+ vec_u8_t srv30add1 = srv19;
+ vec_u8_t srv31add1 = srv19;
+
+ vec_u8_t srv16add1_16 = srv16_6;
+ vec_u8_t srv16add1_17 = srv16_6;
+ vec_u8_t srv16add1_18 = srv16_6;
+ vec_u8_t srv16add1_19 = srv16_12;
+ vec_u8_t srv16add1_20 = srv16_12;
+ vec_u8_t srv16add1_21 = srv16_12;
+ vec_u8_t srv16add1_22 = srv16_12;
+ vec_u8_t srv16add1_23 = srv16_12;
+ vec_u8_t srv16add1_24 = srv16_12;
+ vec_u8_t srv16add1_25 = srv16_19;
+ vec_u8_t srv16add1_26 = srv16_19;
+ vec_u8_t srv16add1_27 = srv16_19;
+ vec_u8_t srv16add1_28 = srv16_19;
+ vec_u8_t srv16add1_29 = srv16_19;
+ vec_u8_t srv16add1_30 = srv16_19;
+ vec_u8_t srv16add1_31 = srv16_19;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+    vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, };
+    vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, };
+
+
+    // mode 25 (intraPredAngle = -2): fraction[y] = (-2 * (y + 1)) & 31 = {30, 28, 26, 24},
+    // so every row blends srcPix0[x] with srcPix0[x + 1] using the vfrac4 weights below.
+
+ vec_u8_t srv=vec_xl(0, srcPix0);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+    vec_u8_t vfrac4 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24};
+    vec_u8_t vfrac4_32 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
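The comment inside the function above states the whole computation: `dst[y * dstStride + x] = (pixel)((f32[y] * ref[x] + f[y] * ref[x + 1] + 16) >> 5)`. As a sanity reference for the 4x4 mode-25 path, here is a minimal scalar sketch; `intra_pred_ref4` is a hypothetical helper (not part of x265), and it assumes the per-row fractions {30, 28, 26, 24} read off the `vfrac4` constant above:

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for the 4x4 mode-25 interpolation that the vec_mule/vec_mulo
// pairs above compute. Hypothetical helper for illustration only; frac[] is
// taken from the vfrac4 constant ({30, 28, 26, 24}, one value per row).
static void intra_pred_ref4(uint8_t* dst, int dstStride, const uint8_t* ref)
{
    static const int frac[4] = {30, 28, 26, 24};
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * dstStride + x] =
                (uint8_t)(((32 - frac[y]) * ref[x] + frac[y] * ref[x + 1] + 16) >> 5);
}
```

With a linear ramp `ref[i] = 32 * i`, row `y` comes out as `32 * x + frac[y]`, which makes the per-row weighting easy to eyeball against the vector output.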
+
+template<>
+void intra_pred<8, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+    vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+    vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+    vec_u8_t mask2={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+    vec_u8_t mask3={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+    vec_u8_t mask4={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+    vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+    vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+    vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv = vec_xl(0, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+    vec_u8_t vfrac8_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac8_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac8_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac8_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac8_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac8_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac8_32_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac8_32_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+    one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+    one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+    one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+
+ vec_u8_t srv0 = vec_xl(0, srcPix0);
+ vec_u8_t srv1 = vec_xl(1, srcPix0);
+
+    vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+    vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv0, srv1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+    vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+    vec_u8_t maskadd1_0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left = vec_xl(80, srcPix0);
+    vec_u8_t srv_right = vec_xl(0, srcPix0);
+ vec_u8_t refmask_32 ={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_32);
+    vec_u8_t s1 = vec_xl(15, srcPix0);
+ vec_u8_t s2 = vec_xl(31, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+
+    vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+    vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv0add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv0add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv0, srv0add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv0, srv0add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv0, srv0add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv0, srv0add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv0, srv0add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv0, srv0add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv0, srv0add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv0, srv0add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv0, srv0add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv0, srv0add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv0, srv0add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv0, srv0add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv0, srv0add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(s0, srv0, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(s1, srv16_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(s0, srv0, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(s1, srv16_0, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(s0, srv0, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(s1, srv16_0, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(s0, srv0, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(s1, srv16_0, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(s0, srv0, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(s1, srv16_0, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(s0, srv0, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(s1, srv16_0, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(s0, srv0, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(s1, srv16_0, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(s0, srv0, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(s1, srv16_0, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(s0, srv0, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(s1, srv16_0, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(s0, srv0, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(s1, srv16_0, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(s0, srv0, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(s1, srv16_0, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(s0, srv0, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(s1, srv16_0, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(s0, srv0, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(s1, srv16_0, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(s0, srv0, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(s1, srv16_0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(s0, srv0, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(s1, srv16_0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(s0, srv0, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(s1, srv16_0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv =vec_xl(0, srcPix0); /* offset = width2+1 = width<<1 + 1 */
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_sld(srv, srv, 15);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_w4x4_mask9));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ vec_u8_t v_mask = {0x10, 0x02, 0x03, 0x04, 0x11, 0x02, 0x03, 0x04, 0x12, 0x02, 0x03, 0x04, 0x13, 0x02, 0x03, 0x04};
+ vec_u8_t vout = vec_perm(srv, v_filter_u8, v_mask);
+ if(dstStride == 4) {
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_u8_t v1 = vec_sld(vout, vout, 12);
+ vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride));
+ vec_u8_t v2 = vec_sld(vout, vout, 8);
+ vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2));
+ vec_u8_t v3 = vec_sld(vout, vout, 4);
+ vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3));
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+ }
+ else{
+
+ if(dstStride == 4) {
+ vec_u8_t v_mask0 = {0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u8_t v0 = vec_perm(srv, srv, v_mask0);
+ vec_xst(v0, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_u8_t v_mask0 = {0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u8_t v0 = vec_perm(srv, srv, v_mask0);
+ vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst);
+ vec_u8_t v1 = vec_sld(v0, v0, 12);
+ vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride));
+ vec_u8_t v2 = vec_sld(v0, v0, 8);
+ vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2));
+ vec_u8_t v3 = vec_sld(v0, v0, 4);
+ vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3));
+ }
+ else{
+ vec_u8_t v_mask0 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srv, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srv, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srv, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srv, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<8, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(0, srcPix0); /* above row starts at srcPix0[1]; left column at srcPix0[width*2 + 1] = srcPix0[17] */
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(17, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ vec_u8_t v_mask0 = {0x00, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x01, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v_mask1 = {0x02, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x03, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v_mask2 = {0x04, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x05, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v_mask3 = {0x06, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x07, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v0 = vec_perm(v_filter_u8, srv, v_mask0);
+ vec_u8_t v1 = vec_perm(v_filter_u8, srv, v_mask1);
+ vec_u8_t v2 = vec_perm(v_filter_u8, srv, v_mask2);
+ vec_u8_t v3 = vec_perm(v_filter_u8, srv, v_mask3);
+ if(dstStride == 8) {
+ vec_xst(v0, 0, dst);
+ vec_xst(v1, 16, dst);
+ vec_xst(v2, 32, dst);
+ vec_xst(v3, 48, dst);
+ }
+ else{
+ vec_u8_t v_maskh = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_maskl = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_xst(vec_perm(v0, vec_xl(0, dst), v_maskh), 0, dst);
+ vec_xst(vec_perm(v0, vec_xl(dstStride, dst), v_maskl), dstStride, dst);
+ vec_xst(vec_perm(v1, vec_xl(dstStride*2, dst), v_maskh), dstStride*2, dst);
+ vec_xst(vec_perm(v1, vec_xl(dstStride*3, dst), v_maskl), dstStride*3, dst);
+ vec_xst(vec_perm(v2, vec_xl(dstStride*4, dst), v_maskh), dstStride*4, dst);
+ vec_xst(vec_perm(v2, vec_xl(dstStride*5, dst), v_maskl), dstStride*5, dst);
+ vec_xst(vec_perm(v3, vec_xl(dstStride*6, dst), v_maskh), dstStride*6, dst);
+ vec_xst(vec_perm(v3, vec_xl(dstStride*7, dst), v_maskl), dstStride*7, dst);
+ }
+ }
+ else{
+ if(dstStride == 8) {
+ vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t v0 = vec_perm(srv, srv, v_mask);
+ vec_xst(v0, 0, dst);
+ vec_xst(v0, 16, dst);
+ vec_xst(v0, 32, dst);
+ vec_xst(v0, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_xst(vec_perm(srv, vec_xl(0, dst), v_mask), 0, dst);
+ vec_xst(vec_perm(srv, vec_xl(dstStride, dst), v_mask), dstStride, dst);
+ vec_xst(vec_perm(srv, vec_xl(dstStride*2, dst), v_mask), dstStride*2, dst);
+ vec_xst(vec_perm(srv, vec_xl(dstStride*3, dst), v_mask), dstStride*3, dst);
+ vec_xst(vec_perm(srv, vec_xl(dstStride*4, dst), v_mask), dstStride*4, dst);
+ vec_xst(vec_perm(srv, vec_xl(dstStride*5, dst), v_mask), dstStride*5, dst);
+ vec_xst(vec_perm(srv, vec_xl(dstStride*6, dst), v_mask), dstStride*6, dst);
+ vec_xst(vec_perm(srv, vec_xl(dstStride*7, dst), v_mask), dstStride*7, dst);
+ }
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
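The `intra_pred<W, 26>` specializations above all implement the same operation: pure vertical prediction (copy the above reference row to every output row), plus an edge filter on column 0 when `bFilter` is set. A scalar sketch of that behavior, with a reference layout inferred from the SIMD loads (`srcPix0[0]` is the corner, the above row follows at index 1, and the left column sits at index `width*2 + 1`, matching the `vec_xl(17, srcPix0)` load for W=8); this is an illustrative model, not the x265 API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar reference for vertical intra prediction (mode 26).
// Assumed layout: srcPix0[0] = corner, srcPix0[1..] = above row,
// srcPix0[2*width+1..] = left column.
static void intra_pred_ver_ref(uint8_t* dst, int stride,
                               const uint8_t* srcPix0, int width, int bFilter)
{
    const uint8_t* above = srcPix0 + 1;             // top reference row
    const uint8_t* left  = srcPix0 + 2 * width + 1; // left reference column
    for (int y = 0; y < width; y++)
        for (int x = 0; x < width; x++)
            dst[y * stride + x] = above[x];         // copy the above row
    if (bFilter)                                    // edge filter on column 0
        for (int y = 0; y < width; y++)
        {
            int v = above[0] + ((left[y] - srcPix0[0]) >> 1);
            dst[y * stride] = (uint8_t)std::min(255, std::max(0, v));
        }
}
```

The `(left[y] - corner) >> 1` and clamp correspond to the `vec_sra`/`vec_max`/`vec_min` sequence in the filtered path above; the unfiltered path is just the row broadcast done with `vec_perm`/`vec_xst`.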
+
+template<>
+void intra_pred<16, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(0, srcPix0);
+ vec_u8_t srv1 = vec_xl(1, srcPix0);
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(33, srcPix0); /* left column at offset width*2 + 1 = 33 */
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask));
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+ vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+
+
+ if(dstStride == 16) {
+ vec_xst(vec_perm(v_filter_u8, srv1, mask0), 0, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask1), 16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask2), 32, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask3), 48, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask4), 64, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask5), 80, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask6), 96, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask7), 112, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask8), 128, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask9), 144, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask10), 160, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask11), 176, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask12), 192, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask13), 208, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask14), 224, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask15), 240, dst);
+ }
+ else{
+ vec_xst(vec_perm(v_filter_u8, srv1, mask0), 0, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask1), dstStride, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask2), dstStride*2, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask3), dstStride*3, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask4), dstStride*4, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask5), dstStride*5, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask6), dstStride*6, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask7), dstStride*7, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask8), dstStride*8, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask9), dstStride*9, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask10), dstStride*10, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask11), dstStride*11, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask12), dstStride*12, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask13), dstStride*13, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask14), dstStride*14, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask15), dstStride*15, dst);
+ }
+ }
+ else{
+ if(dstStride == 16) {
+ vec_xst(srv1, 0, dst);
+ vec_xst(srv1, 16, dst);
+ vec_xst(srv1, 32, dst);
+ vec_xst(srv1, 48, dst);
+ vec_xst(srv1, 64, dst);
+ vec_xst(srv1, 80, dst);
+ vec_xst(srv1, 96, dst);
+ vec_xst(srv1, 112, dst);
+ vec_xst(srv1, 128, dst);
+ vec_xst(srv1, 144, dst);
+ vec_xst(srv1, 160, dst);
+ vec_xst(srv1, 176, dst);
+ vec_xst(srv1, 192, dst);
+ vec_xst(srv1, 208, dst);
+ vec_xst(srv1, 224, dst);
+ vec_xst(srv1, 240, dst);
+ }
+ else{
+ vec_xst(srv1, 0, dst);
+ vec_xst(srv1, dstStride, dst);
+ vec_xst(srv1, dstStride*2, dst);
+ vec_xst(srv1, dstStride*3, dst);
+ vec_xst(srv1, dstStride*4, dst);
+ vec_xst(srv1, dstStride*5, dst);
+ vec_xst(srv1, dstStride*6, dst);
+ vec_xst(srv1, dstStride*7, dst);
+ vec_xst(srv1, dstStride*8, dst);
+ vec_xst(srv1, dstStride*9, dst);
+ vec_xst(srv1, dstStride*10, dst);
+ vec_xst(srv1, dstStride*11, dst);
+ vec_xst(srv1, dstStride*12, dst);
+ vec_xst(srv1, dstStride*13, dst);
+ vec_xst(srv1, dstStride*14, dst);
+ vec_xst(srv1, dstStride*15, dst);
+ }
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(1, srcPix0); /* above reference row, srcPix0[1..]; left column at srcPix0[width*2 + 1] = srcPix0[65] */
+ vec_u8_t srv1 = vec_xl(17, srcPix0);
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_u8_t srcv1 = vec_xl(65, srcPix0);
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+
+ vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst);
+ vec_xst(srv1, 16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask1), dstStride, dst);
+ vec_xst(srv1, dstStride+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask2), dstStride*2, dst);
+ vec_xst(srv1, dstStride*2+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask3), dstStride*3, dst);
+ vec_xst(srv1, dstStride*3+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask4), dstStride*4, dst);
+ vec_xst(srv1, dstStride*4+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask5), dstStride*5, dst);
+ vec_xst(srv1, dstStride*5+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask6), dstStride*6, dst);
+ vec_xst(srv1, dstStride*6+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask7), dstStride*7, dst);
+ vec_xst(srv1, dstStride*7+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask8), dstStride*8, dst);
+ vec_xst(srv1, dstStride*8+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask9), dstStride*9, dst);
+ vec_xst(srv1, dstStride*9+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask10), dstStride*10, dst);
+ vec_xst(srv1, dstStride*10+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask11), dstStride*11, dst);
+ vec_xst(srv1, dstStride*11+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask12), dstStride*12, dst);
+ vec_xst(srv1, dstStride*12+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask13), dstStride*13, dst);
+ vec_xst(srv1, dstStride*13+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask14), dstStride*14, dst);
+ vec_xst(srv1, dstStride*14+16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask15), dstStride*15, dst);
+ vec_xst(srv1, dstStride*15+16, dst);
+
+ vec_u8_t srcv2 = vec_xl(81, srcPix0);
+ vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh));
+ vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl));
+ vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v );
+ vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v );
+ vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16);
+ vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16);
+ vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum));
+ vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum));
+ vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16);
+
+ vec_xst(vec_perm(v2_filter_u8, srv, mask0), dstStride*16, dst);
+ vec_xst(srv1, dstStride*16+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask1), dstStride*17, dst);
+ vec_xst(srv1, dstStride*17+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask2), dstStride*18, dst);
+ vec_xst(srv1, dstStride*18+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask3), dstStride*19, dst);
+ vec_xst(srv1, dstStride*19+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask4), dstStride*20, dst);
+ vec_xst(srv1, dstStride*20+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask5), dstStride*21, dst);
+ vec_xst(srv1, dstStride*21+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask6), dstStride*22, dst);
+ vec_xst(srv1, dstStride*22+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask7), dstStride*23, dst);
+ vec_xst(srv1, dstStride*23+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask8), dstStride*24, dst);
+ vec_xst(srv1, dstStride*24+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask9), dstStride*25, dst);
+ vec_xst(srv1, dstStride*25+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask10), dstStride*26, dst);
+ vec_xst(srv1, dstStride*26+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask11), dstStride*27, dst);
+ vec_xst(srv1, dstStride*27+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask12), dstStride*28, dst);
+ vec_xst(srv1, dstStride*28+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask13), dstStride*29, dst);
+ vec_xst(srv1, dstStride*29+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask14), dstStride*30, dst);
+ vec_xst(srv1, dstStride*30+16, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask15), dstStride*31, dst);
+ vec_xst(srv1, dstStride*31+16, dst);
+
+ }
+ else{
+ int offset = 0;
+
+ for(int i=0; i<32; i++){
+ vec_xst(srv, offset, dst);
+ vec_xst(srv1, 16+offset, dst);
+ offset += dstStride;
+ }
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24}; /* 32 - fraction[0-3] */
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 0, (unsigned int*)(dst+dstStride));
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 0, (unsigned int*)(dst+dstStride*2));
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 0, (unsigned int*)(dst+dstStride*3));
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
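The mode-27 kernels evaluate the two-tap angular interpolation named in the comment, `dst[y*dstStride + x] = (pixel)((f32[y]*ref[x] + f[y]*ref[x+1] + 16) >> 5)`, where the per-row fractions (`vfrac4` = {2, 4, 6, 8}) come from the mode-27 projection angle of 2: `f[y] = ((y+1)*2) & 31`, with integer offset `((y+1)*2) >> 5`. A scalar sketch of that computation (illustrative only, not the x265 implementation):

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for angular intra prediction mode 27 (angle = 2).
// ref = srcPix0 + 1 is the above reference row, as in the SIMD loads.
static void intra_pred_ang27_ref(uint8_t* dst, int stride,
                                 const uint8_t* srcPix0, int width)
{
    const uint8_t* ref = srcPix0 + 1;
    for (int y = 0; y < width; y++)
    {
        int pos = (y + 1) * 2;  // projected position for angle 2
        int off = pos >> 5;     // integer part of the projection
        int f   = pos & 31;     // fractional part (the vfrac values)
        for (int x = 0; x < width; x++)
            dst[y * stride + x] = (uint8_t)(((32 - f) * ref[off + x]
                                           + f * ref[off + x + 1] + 16) >> 5);
    }
}
```

The vector code folds the `(32 - f)` and `f` weights into the `vfrac*_32_*` and `vfrac*_*` constants so each row is one pair of widening multiplies.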
+
+template<>
+void intra_pred<8, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac8_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv0, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_1);
+ vmle1 = vec_mule(srv1, vfrac8_1);
+ vmlo1 = vec_mulo(srv1, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv0, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_2);
+ vmle1 = vec_mule(srv1, vfrac8_2);
+ vmlo1 = vec_mulo(srv1, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv0, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_3);
+ vmle1 = vec_mule(srv1, vfrac8_3);
+ vmlo1 = vec_mulo(srv1, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(17, srcPix0); /* next 16 reference pixels */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+#if 0 /* disabled duplicate: one_line() is already defined earlier in this file */
+ #define one_line(s0, s1, vf32, vf, vout) {\
+ vmle0 = vec_mule(s0, vf32);\
+ vmlo0 = vec_mulo(s0, vf32);\
+ vmle1 = vec_mule(s1, vf);\
+ vmlo1 = vec_mulo(s1, vf);\
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);\
+ ve = vec_sra(vsume, u16_5);\
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);\
+ vo = vec_sra(vsumo, u16_5);\
+ vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));\
+ }
+#endif
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
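The vector code above implements, sixteen pixels per `one_line` invocation, the per-row 1/32-pel interpolation stated in the formula comment. A scalar sketch of that formula (names are illustrative, not part of x265; `angle` is 2 for mode 27 and 5 for mode 28, per HEVC's intraPredAngle table):

```cpp
#include <cstdint>
#include <vector>

// Scalar reference for the row formula the one_line macro vectorizes:
// dst[y*stride + x] = ((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5
// where off = ((y + 1) * angle) >> 5 and frac = ((y + 1) * angle) & 31.
std::vector<uint8_t> angular_ref(const uint8_t* ref, int size, int angle, int stride)
{
    std::vector<uint8_t> dst(size * stride);
    for (int y = 0; y < size; y++)
    {
        int pos  = (y + 1) * angle;
        int off  = pos >> 5;   // integer part: which reference sample the row starts at
        int frac = pos & 31;   // fractional part: interpolation weight
        for (int x = 0; x < size; x++)
            dst[y * stride + x] = (uint8_t)(((32 - frac) * ref[off + x]
                                  + frac * ref[off + x + 1] + 16) >> 5);
    }
    return dst;
}
```

The vfrac16_N / vfrac16_32_N constant pairs above are simply `frac` and `32 - frac` splatted across a vector, and the srvN operands are `ref` shifted by `off` and `off + 1`.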
+
+template<>
+void intra_pred<32, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[0..15], ref = srcPix0 + 1 */
+ vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16..31] */
+ vec_u8_t sv2 = vec_xl(33, srcPix0); /* ref[32..47] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* from y= 15, use srv1, srv2 */
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* y=31, use srv2, srv3 */
+
+ vec_u8_t srv4 = sv1;
+ vec_u8_t srv5 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv6 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv7 = vec_perm(sv1, sv2, mask3);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv4, srv5, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv4, srv5, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv5, srv6, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+
+ one_line(srv1, srv2, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv5, srv6, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv5, srv6, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv5, srv6, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv5, srv6, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv5, srv6, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv5, srv6, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv1, srv2, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv1, srv2, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv1, srv2, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv5, srv6, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 28
+ //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5};
+ //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12}; /* 32 - fraction[0-3] */
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), dstStride, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), dstStride*2, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), dstStride*3, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
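The commented `offset[]`/`fraction[]` tables for mode 28 follow directly from its projection angle of 5 (HEVC intraPredAngle): each row advances 5/32 of a sample along the reference. A short sketch reproducing the tables (function name is illustrative):

```cpp
// Rebuild the offset[] and fraction[] tables quoted in the mode-28 comments:
// offset[y] = integer reference step, fraction[y] = 1/32-pel weight, for angle 5.
void make_tables(int angle, int offset[32], int fraction[32])
{
    for (int y = 0; y < 32; y++)
    {
        int pos = (y + 1) * angle;
        offset[y]   = pos >> 5;  // which reference sample row y starts from
        fraction[y] = pos & 31;  // weight applied to ref[offset + x + 1]
    }
}
```

Running this with `angle = 5` yields exactly the two arrays in the comments, which is where the hard-coded vfrac8_N / vfrac16_N splat constants come from.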
+
+template<>
+void intra_pred<8, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 28
+ //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5};
+ //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac8_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac8_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv0, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_1);
+ vmle1 = vec_mule(srv1, vfrac8_1);
+ vmlo1 = vec_mulo(srv1, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv0, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_2);
+ vmle1 = vec_mule(srv1, vfrac8_2);
+ vmlo1 = vec_mulo(srv1, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv1, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_3);
+ vmle1 = vec_mule(srv2, vfrac8_3);
+ vmlo1 = vec_mulo(srv2, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
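The `else` branch above must write only the block-width prefix of each 16-byte vector, so it does a read-modify-write: `vec_xl` loads the existing row, `vec_perm` with the `v_mask` tables keeps the new prediction bytes and the old tail, and a full-width `vec_xst` stores the merge. A scalar model of that pattern (name is illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Read-modify-write store of an n-byte row through a 16-byte vector,
// mirroring the vec_xl / vec_perm / vec_xst sequence: bytes [0, n) come
// from the prediction, bytes [n, 16) are preserved from dst.
void store_row_rmw(uint8_t* dst, const uint8_t* row, int n)
{
    uint8_t merged[16];
    memcpy(merged, dst, 16);  // vec_xl(offset, dst)
    memcpy(merged, row, n);   // vec_perm: row[0..n-1], then dst[n..15]
    memcpy(dst, merged, 16);  // vec_xst(v, offset, dst)
}
```

This is why the masks' high entries (0x18..0x1f etc.) index the second `vec_perm` operand, i.e. the bytes just loaded from `dst`.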
+
+template<>
+void intra_pred<16, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[0..15], ref = srcPix0 + 1 */
+ vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16..31] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+
+ //mode 28
+ //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5};
+ //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0};
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
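The `one_line` rows above rely on AltiVec's split even/odd widening multiplies: `vec_mule`/`vec_mulo` produce two u16 vectors covering the even and odd byte lanes, each is rounded and shifted separately, and `vec_mergeh`/`vec_mergel` plus `vec_pack` interleave the halves back into source byte order. A scalar model of that lane bookkeeping (names are illustrative, not x265 API):

```cpp
#include <cstdint>

// Model of one_line: even lanes via "vec_mule", odd lanes via "vec_mulo",
// then re-interleave (mergeh/mergel + pack) so out[] is in original order.
void one_line_model(const uint8_t s0[16], const uint8_t s1[16],
                    uint8_t f32, uint8_t f, uint8_t out[16])
{
    uint16_t ve[8], vo[8];
    for (int i = 0; i < 8; i++)   // vec_mule path: byte lanes 0, 2, 4, ...
        ve[i] = (uint16_t)((f32 * s0[2 * i] + f * s1[2 * i] + 16) >> 5);
    for (int i = 0; i < 8; i++)   // vec_mulo path: byte lanes 1, 3, 5, ...
        vo[i] = (uint16_t)((f32 * s0[2 * i + 1] + f * s1[2 * i + 1] + 16) >> 5);
    for (int i = 0; i < 8; i++)   // mergeh/mergel + vec_pack: re-interleave
    {
        out[2 * i]     = (uint8_t)ve[i];
        out[2 * i + 1] = (uint8_t)vo[i];
    }
}
```

Keeping the products in u16 is what makes the `+ 16) >> 5` rounding exact: the largest sum, 32 * 255 + 16, still fits in 16 bits.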
+
+template<>
+void intra_pred<32, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[0..15], ref = srcPix0 + 1 */
+ vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16..31] */
+ vec_u8_t sv2 = vec_xl(33, srcPix0); /* ref[32..47] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* from y= 15, use srv1, srv2 */
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* y=31, use srv2, srv3 */
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask4); /* ref[4 + x] */
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask5); /* ref[5 + x] */
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask6); /* ref[6 + x], the x + 1 neighbour for offset-5 rows */
+
+ vec_u8_t srv4 = sv1;
+ vec_u8_t srv5 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv6 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv7 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask4); /* ref[20 + x] */
+ vec_u8_t srv11 = vec_perm(sv1, sv2, mask5); /* ref[21 + x] */
+ vec_u8_t srv13 = vec_perm(sv1, sv2, mask6); /* ref[22 + x] */
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_16 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv6, srv7, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv2, srv3, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv2, srv3, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv6, srv7, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv2, srv3, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv6, srv7, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv3, srv8, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv7, srv10, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv3, srv8, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv7, srv10, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv3, srv8, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv7, srv10, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv3, srv8, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv7, srv10, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv3, srv8, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv7, srv10, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv3, srv8, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv7, srv10, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv8, srv9, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv10, srv11, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv8, srv9, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv10, srv11, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv8, srv9, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv10, srv11, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv10, srv11, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv10, srv11, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv10, srv11, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv9, srv12, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv11, srv13, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
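For reference, every specialization in this patch evaluates the same scalar formula quoted in the comments: dst[y * dstStride + x] = ((32 - fraction[y]) * ref[offset[y] + x] + fraction[y] * ref[offset[y] + x + 1] + 16) >> 5, where offset[y] and fraction[y] come from the per-mode angular step. A minimal scalar sketch (not x265 code; the function name and the `angle` parameterization are illustrative assumptions) that the vector routines can be checked against:

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for the angular interpolation the vector code
// implements. 'angle' is the per-row step in 1/32-pel units; offset
// and fraction are its integer and fractional parts per row.
static void intra_pred_ang_ref(uint8_t* dst, int dstStride,
                               const uint8_t* ref, int size, int angle)
{
    for (int y = 0; y < size; y++)
    {
        int pos = (y + 1) * angle;
        int off = pos >> 5;   // whole reference samples to skip
        int f   = pos & 31;   // 1/32-pel interpolation weight
        for (int x = 0; x < size; x++)
            dst[y * dstStride + x] = (uint8_t)(
                ((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5);
    }
}
```

Running this over the same reference row and comparing against the vectorized output is a quick way to validate each mode/size specialization.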
+
+template<>
+void intra_pred<4, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 29:
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv = vec_xl(1, srcPix0); /* ref[0 ..], ref = srcPix0 + 1; offset[0-2] = 0, offset[3] = 1 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28}; /* 32 - fraction[0-3] */
+
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
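The offset[]/fraction[] tables repeated in the mode-29 comments are not arbitrary: they follow from a fixed per-row step of 9/32 pixel (the step value 9 is inferred here from the commented tables, not stated in this file). A small sketch that regenerates them:

```cpp
#include <cassert>

// Regenerate the mode-29 offset/fraction tables from the per-row step.
// angle = 9 is an assumption inferred from the commented values.
static void mode29_tables(int offset[32], int fraction[32])
{
    const int angle = 9;                      // 9/32 pixel per row
    for (int y = 0; y < 32; y++)
    {
        offset[y]   = ((y + 1) * angle) >> 5; // whole samples skipped
        fraction[y] = ((y + 1) * angle) & 31; // 1/32-pel weight
    }
}
```

The same recurrence explains why the 4x4 case only needs two permute masks: offset[y] stays 0 for the first three rows and reaches 1 only on the last.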
+
+template<>
+void intra_pred<8, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 29:
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask2={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask3={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask5={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv = vec_xl(1, srcPix0); /* ref[0 ..], ref = srcPix0 + 1; offset[0-7] = 0, 0, 0, 1, 1, 1, 1, 2 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 0, 1 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 1, 2 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 2 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 2, 3 */
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac8_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac8_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_1);
+ vmle1 = vec_mule(srv3, vfrac8_1);
+ vmlo1 = vec_mulo(srv3, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv1, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_2);
+ vmle1 = vec_mule(srv4, vfrac8_2);
+ vmlo1 = vec_mulo(srv4, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv3, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv3, vfrac8_32_3);
+ vmle1 = vec_mule(srv5, vfrac8_3);
+ vmlo1 = vec_mulo(srv5, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
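In the dstStride != 8 fallback above, a plain 16-byte vec_xst would clobber the 8 bytes past the end of each row, so every store first loads the destination, blends the row's pixels in via vec_perm (v_mask0/v_mask1 select 8 result bytes plus 8 existing destination bytes), and writes the merged vector back. A scalar sketch of that read-modify-write idiom (the helper name is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Emulate the masked partial store: update only the first 8 bytes at
// dst while leaving the following 8 bytes (which a full 16-byte vector
// store would also touch) unchanged.
static void blend_store8(uint8_t* dst, const uint8_t row[8])
{
    uint8_t tmp[16];
    std::memcpy(tmp, dst, 16); // load the existing 16 destination bytes
    std::memcpy(tmp, row, 8);  // overwrite only the 8 row pixels
    std::memcpy(dst, tmp, 16); // store the blended vector back
}
```

The vector version pays one extra load per row but never writes outside the block, which matters when dst is a sub-rectangle of a larger frame.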
+
+template<>
+void intra_pred<16, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 29:
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[0 .. 15], ref = srcPix0 + 1 */
+    vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16 .. 31] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+#if 0
+ #define one_line(s0, s1, vf32, vf, vout) {\
+ vmle0 = vec_mule(s0, vf32);\
+ vmlo0 = vec_mulo(s0, vf32);\
+ vmle1 = vec_mule(s1, vf);\
+ vmlo1 = vec_mulo(s1, vf);\
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);\
+ ve = vec_sra(vsume, u16_5);\
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);\
+ vo = vec_sra(vsumo, u16_5);\
+ vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));\
+ }
+#endif
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
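The one_line macro (shown disabled under #if 0 above) leans on the AltiVec even/odd widening multiplies: vec_mule/vec_mulo each produce 8 u16 products from the even and odd byte lanes, the rounding add and >> 5 happen in u16, and vec_mergeh/vec_mergel + vec_pack re-interleave the halves back into original lane order. A scalar emulation of that lane arithmetic (names are illustrative), useful for convincing yourself the split-and-reinterleave is lossless:

```cpp
#include <cassert>
#include <cstdint>

// Emulate one_line lane-by-lane: compute even and odd products
// separately in 16-bit (vec_mule/vec_mulo), round and shift, then
// re-interleave as vec_mergeh/vec_mergel + vec_pack would.
static void weighted_avg16_evenodd(const uint8_t a[16], const uint8_t b[16],
                                   uint8_t wa, uint8_t wb, uint8_t out[16])
{
    uint16_t ve[8], vo[8];
    for (int i = 0; i < 8; i++)
    {
        ve[i] = (uint16_t)((wa * a[2 * i]     + wb * b[2 * i]     + 16) >> 5);
        vo[i] = (uint16_t)((wa * a[2 * i + 1] + wb * b[2 * i + 1] + 16) >> 5);
    }
    for (int i = 0; i < 8; i++) // mergeh/mergel + pack restores lane order
    {
        out[2 * i]     = (uint8_t)ve[i];
        out[2 * i + 1] = (uint8_t)vo[i];
    }
}
```

Since wa + wb = 32 and inputs are 8-bit, every intermediate fits in u16, which is why the vector code never needs a 32-bit widening step.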
+
+template<>
+void intra_pred<32, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 29:
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[0 .. 15], ref = srcPix0 + 1 */
+    vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16 .. 31] */
+    vec_u8_t sv2 = vec_xl(33, srcPix0); /* ref[32 .. 47] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+
+ vec_u8_t srv00 = sv1;
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 };
+ vec_u8_t vfrac16_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_21 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv00, srv10, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv10, srv20, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv10, srv20, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv10, srv20, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv20, srv30, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv20, srv30, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv20, srv30, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv30, srv40, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv30, srv40, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv30, srv40, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv30, srv40, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv40, srv50, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv40, srv50, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv4, srv5, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv40, srv50, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv5, srv6, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv50, srv60, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv5, srv6, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv50, srv60, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv5, srv6, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv50, srv60, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv5, srv6, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv50, srv60, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv6, srv7, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv60, srv70, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv6, srv7, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv60, srv70, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv6, srv7, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv60, srv70, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv7, srv8, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv70, srv80, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv7, srv8, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv70, srv80, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv7, srv8, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv70, srv80, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv7, srv8, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv70, srv80, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv80, srv90, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv80, srv90, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv9, srva, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv90, srva0, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
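For reference, the per-pixel operation that all of these vectorized kernels implement is the HEVC angular-prediction blend quoted in the comments above. A minimal scalar sketch (illustrative names only, not the upstream x265 code):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar form of the angular blend: each output pixel mixes two adjacent
 * reference samples with a 5-bit fractional weight, rounds (+16), and
 * shifts right by 5. off[]/frac[] are the per-row tables from the comments. */
static void angular_pred_scalar(uint8_t *dst, intptr_t dstStride,
                                const uint8_t *ref, const int *off,
                                const int *frac, int size)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * dstStride + x] =
                (uint8_t)(((32 - frac[y]) * ref[off[y] + x] +
                           frac[y] * ref[off[y] + x + 1] + 16) >> 5);
}
```

The vector code computes exactly this, 16 pixels per `one_line` invocation, with the `vfrac16_*` / `vfrac16_32_*` constants holding `frac[y]` and `32 - frac[y]` splatted across all lanes.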
+
+template<>
+void intra_pred<4, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 30:
+ //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13};
+ //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12}; /* 32 - fraction[0-3] */
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
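The `vec_mule`/`vec_mulo` pairs above exist because multiplying two byte vectors on POWER widens the products to 16 bits: even- and odd-indexed bytes are multiplied into separate halfword vectors, then `vec_mergeh`/`vec_mergel` plus `vec_pack` re-interleave them after the round-and-shift. A scalar emulation of that lane split (a sketch, not the intrinsics themselves):

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the even/odd widening-multiply pattern for one 16-byte row:
 * even[] and odd[] mirror the vec_mule/vec_mulo results, and the final
 * loop mirrors the merge+pack re-interleave back into byte order. */
static void blend_row(uint8_t out[16], const uint8_t a[16],
                      const uint8_t b[16], int frac)
{
    uint16_t even[8], odd[8];
    for (int i = 0; i < 8; i++)
    {
        even[i] = (uint16_t)((32 - frac) * a[2 * i]     + frac * b[2 * i]);
        odd[i]  = (uint16_t)((32 - frac) * a[2 * i + 1] + frac * b[2 * i + 1]);
    }
    for (int i = 0; i < 8; i++)
    {
        out[2 * i]     = (uint8_t)((even[i] + 16) >> 5);
        out[2 * i + 1] = (uint8_t)((odd[i] + 16) >> 5);
    }
}
```

The net per-byte result is identical to processing the lanes in order; the split is purely a consequence of the 8-bit-to-16-bit widening.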
+
+template<>
+void intra_pred<8, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 30:
+ //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13};
+ //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask5={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 2 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 3 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 3, 4 */
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac8_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 14, 14, 14, 14, 14, 14 };
+ vec_u8_t vfrac8_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv1, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_1);
+ vmle1 = vec_mule(srv2, vfrac8_1);
+ vmlo1 = vec_mulo(srv2, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv2, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_2);
+ vmle1 = vec_mule(srv3, vfrac8_2);
+ vmlo1 = vec_mulo(srv3, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv4, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv4, vfrac8_32_3);
+ vmle1 = vec_mule(srv5, vfrac8_3);
+ vmlo1 = vec_mulo(srv5, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
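The `offset[]` and `fraction[]` tables quoted in each mode-30 comment block follow directly from the HEVC intra prediction angle for this mode (intraPredAngle = 13). A sketch of the derivation (function name is illustrative):

```c
#include <assert.h>

/* For projection angle `angle` (13 for mode 30), row y reads reference
 * sample offset[y] with fractional weight fraction[y]:
 *   offset[y]   = ((y + 1) * angle) >> 5   (integer sample offset)
 *   fraction[y] = ((y + 1) * angle) & 31   (5-bit fractional part)  */
static void mode_angle_tables(int angle, int n, int *offset, int *fraction)
{
    for (int y = 0; y < n; y++)
    {
        offset[y]   = ((y + 1) * angle) >> 5;
        fraction[y] = ((y + 1) * angle) & 31;
    }
}
```

This is why the hand-splatted `vfrac` constants cycle through 13, 26, 7, 20, ... and why the source vector advances (`srv0` to `srv1`, etc.) exactly where `offset[]` increments.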
+
+template<>
+void intra_pred<16, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 30:
+ //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13};
+ //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0};
+
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ //vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ //vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ //vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ //vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ //vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ //vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ //vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0);  /* ref[offset + x], ref = srcPix0 + 1 */
+ vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[offset + x + 16] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ //vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ //vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ //vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ //vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ //vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ //vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ //vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ //mode 30:
+ //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13};
+ //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0};
+
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[offset + x], ref = srcPix0 + 1; per-row offsets come from the offset[] table above */
+ vec_u8_t sv1 = vec_xl(17, srcPix0);
+ vec_u8_t sv2 = vec_xl(33, srcPix0);
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5); */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv20, srv30, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv20, srv30, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv30, srv40, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv30, srv40, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv40, srv50, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv40, srv50, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv40, srv50, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv50, srv60, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv50, srv60, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv60, srv70, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv60, srv70, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv60, srv70, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv7, srv8, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv70, srv80, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv7, srv8, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv70, srv80, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv8, srv9, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv80, srv90, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv8, srv9, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv80, srv90, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv8, srv9, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv80, srv90, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv9, srva, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv90, srva0, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv9, srva, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv90, srva0, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srva, srvb, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srva0, srvb0, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srva, srvb, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srva0, srvb0, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srva, srvb, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srva0, srvb0, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srvb, srvc, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srvb0, srvc0, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srvb, srvc, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srvb0, srvc0, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srvc, srvd, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srvc0, srvd0, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srvc, srvd, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srvc0, srvd0, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srvd, srve, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srvd0, srve0, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ }
+ */
+ //mode 31:
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+ //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(1, srcPix0); /* loads ref[0..15], ref = srcPix0 + 1; per-row offsets are baked into mask0/mask1 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* rows y=0..3 at offsets {0, 1, 1, 2} */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4};
+ vec_u8_t vfrac4_32 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5); */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 8] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 8] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 8] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[3]* ref[off3 + 7] + f[3] * ref[off3 + 8] + 16) >> 5);
+
+ ...
+
+ y=7; off7 = offset[7]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[7]* ref[off7 + 7] + f[7] * ref[off7 + 8] + 16) >> 5);
+ }
+ */
+ //mode 31:
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+ //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(1, srcPix0); /* ref[offset + x], ref = srcPix0 + 1; per-row offsets come from the offset[] table above */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* offsets 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* offsets 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* offsets 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* offsets 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* offsets 4, 5 */
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac8_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac8_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5); */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv1, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_1);
+ vmle1 = vec_mule(srv2, vfrac8_1);
+ vmlo1 = vec_mulo(srv2, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv2, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_2);
+ vmle1 = vec_mule(srv3, vfrac8_2);
+ vmlo1 = vec_mulo(srv3, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv3, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv3, vfrac8_32_3);
+ vmle1 = vec_mule(srv4, vfrac8_3);
+ vmlo1 = vec_mulo(srv4, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * dstStride + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5);
+
+ ...
+
+ y=15; off15 = offset[15]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5);
+ }
+ */
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[0..15], ref = srcPix0 + 1 */
+ vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16..31] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+
+ ...
+
+ y=15; off15 = offset[15]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[15]* ref[off15 + 31] + f[15] * ref[off15 + 32] + 16) >> 5);
+
+ ...
+
+ y=31; off31 = offset[31]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off31 + 1] + f[31] * ref[off31 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off31 + 2] + f[31] * ref[off31 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off31 + 3] + f[31] * ref[off31 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+ //mode 31:
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+ //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0};
+
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(1, srcPix0); /* ref[0..15], ref = srcPix0 + 1 */
+ vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16..31] */
+ vec_u8_t sv2 = vec_xl(33, srcPix0); /* ref[32..47] */
+ vec_u8_t sv3 = vec_xl(49, srcPix0); /* ref[48..63] */
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15);
+
+ vec_u8_t srv000 = sv2;
+ vec_u8_t srv100 = vec_perm(sv2, sv3, mask1);
+ vec_u8_t srv200 = vec_perm(sv2, sv3, mask2);
+
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv30, srv40, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv40, srv50, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv40, srv50, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv50, srv60, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv50, srv60, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv60, srv70, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv60, srv70, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv70, srv80, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv70, srv80, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv80, srv90, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srv9, srva, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv90, srva0, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv9, srva, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv90, srva0, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srva, srvb, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srva0, srvb0, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srva, srvb, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srva0, srvb0, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srvb, srvc, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srvb0, srvc0, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srvb, srvc, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srvb0, srvc0, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srvc, srvd, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srvc0, srvd0, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srvc, srvd, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srvc0, srvd0, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srvd, srve, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srvd0, srve0, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srvd, srve, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srvd0, srve0, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srve, srvf, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srve0, srvf0, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srve, srvf, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srve0, srvf0, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srvf, srv00, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srvf0, srv000, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srvf, srv00, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srvf0, srv000, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv00, srv10, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv000, srv100, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv10, srv20, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv100, srv200, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void intra_pred<4, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ }
+ */
+ //mode 32:
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+ //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(1, srcPix0); /* ref = srcPix0 + 1; the masks below bake in per-row offset[0-3] = {0, 1, 1, 2} */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20};
+ vec_u8_t vfrac4_32 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12};
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
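The vector routines in this patch all implement the same angular (mode 32) formula quoted in the block comments: each output pixel blends two neighbouring reference samples with a 5-bit fraction. A scalar sketch of that formula, using illustrative names (`intra_pred_mode32_scalar` and the `offset`/`fraction` tables quoted in the comments, not the actual x265 API):

```cpp
#include <cstdint>

// Per-row tables for mode 32, as quoted in the comments above.
static const int offset32[32]   = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10,
                                   11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
static const int fraction32[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16,
                                   5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0};

// Scalar reference: dst[y][x] = ((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5,
// where ref skips the top-left corner sample (ref = srcPix0 + 1).
void intra_pred_mode32_scalar(uint8_t* dst, int dstStride,
                              const uint8_t* srcPix0, int width)
{
    const uint8_t* ref = srcPix0 + 1;
    for (int y = 0; y < width; y++)
    {
        int off = offset32[y];
        int f   = fraction32[y];
        for (int x = 0; x < width; x++)
            dst[y * dstStride + x] =
                (uint8_t)(((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5);
    }
}
```

With a constant reference row the interpolation is the identity, which makes a quick sanity check easy; the vectorized versions below compute the same values sixteen lanes at a time.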
+
+template<>
+void intra_pred<8, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 8] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 8] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 8] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 7] = (pixel)((f32[3]* ref[off3 + 7] + f[3] * ref[off3 + 8] + 16) >> 5);
+
+ ...
+
+ y=7; off7 = offset[7]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 7] = (pixel)((f32[7]* ref[off7 + 7] + f[7] * ref[off7 + 8] + 16) >> 5);
+ }
+ */
+ //mode 32:
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+ //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask5={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv = vec_xl(1, srcPix0); /* ref[offset + x], ref = srcPix0 + 1 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */
+
+    vec_u8_t vfrac8_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac8_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac8_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac8_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8};
+
+    vec_u8_t vfrac8_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac8_32_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac8_32_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac8_32_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv1, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_1);
+ vmle1 = vec_mule(srv2, vfrac8_1);
+ vmlo1 = vec_mulo(srv2, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv3, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv3, vfrac8_32_2);
+ vmle1 = vec_mule(srv4, vfrac8_2);
+ vmlo1 = vec_mulo(srv4, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv5, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv5, vfrac8_32_3);
+ vmle1 = vec_mule(srv6, vfrac8_3);
+ vmlo1 = vec_mulo(srv6, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
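Each `vec_mule`/`vec_mulo` pair above produces widening 16-bit products of the even and odd byte lanes, and `vec_mergeh`/`vec_mergel` followed by `vec_pack` interleave the rounded, shifted results back into original lane order. A scalar model of that pattern, with an illustrative name (`mul_shift_pack`, not an x265 function), applied to a plain per-lane product:

```cpp
#include <cstdint>
#include <vector>

// Scalar model of the even/odd widening-multiply + merge + pack idiom:
// out[i] = (uint8_t)((a[i] * b[i] + 16) >> 5), computed the way the
// AltiVec code does it, via separate even and odd 16-bit lanes.
std::vector<uint8_t> mul_shift_pack(const std::vector<uint8_t>& a,
                                    const std::vector<uint8_t>& b)
{
    // vec_mule / vec_mulo: widening products of even and odd byte lanes,
    // then the +16 rounding add and >>5 shift in 16-bit precision.
    uint16_t ve[8], vo[8];
    for (int i = 0; i < 8; i++)
    {
        ve[i] = (uint16_t)((a[2 * i]     * b[2 * i]     + 16) >> 5);
        vo[i] = (uint16_t)((a[2 * i + 1] * b[2 * i + 1] + 16) >> 5);
    }
    // vec_mergeh/vec_mergel + vec_pack: interleave the even/odd results
    // back into original lane order and narrow to bytes.
    std::vector<uint8_t> out(16);
    for (int i = 0; i < 8; i++)
    {
        out[2 * i]     = (uint8_t)ve[i];
        out[2 * i + 1] = (uint8_t)vo[i];
    }
    return out;
}
```

The detour through even/odd lanes exists because AltiVec has no single-instruction 8-bit multiply with 16-bit intermediate precision; the widening even/odd multiplies keep the `+16 >> 5` rounding exact before narrowing back to bytes.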
+
+template<>
+void intra_pred<16, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5);
+
+ ...
+
+    y=15; off15 = offset[15]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5);
+ }
+ */
+ //mode 32:
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+ //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(1, srcPix0);  /* ref[x], ref = srcPix0 + 1 */
+    vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[x + 16] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+
+    vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+    vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+    vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+    vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+    vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+    vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+    vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+    vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+    vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+    vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+    vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+    vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+    vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+    vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+    vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+    vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+    vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+
+ ...
+
+    y=15; off15 = offset[15]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+    dst[y * dstStride + 31] = (pixel)((f32[15]* ref[off15 + 31] + f[15] * ref[off15 + 32] + 16) >> 5);
+
+ ...
+
+    y=31; off31 = offset[31]; x=0-31;
+    dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+    dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off31 + 1] + f[31] * ref[off31 + 2] + 16) >> 5);
+    dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off31 + 2] + f[31] * ref[off31 + 3] + 16) >> 5);
+    dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off31 + 3] + f[31] * ref[off31 + 4] + 16) >> 5);
+    ...
+    dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+ //mode 32:
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+ //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(1, srcPix0);  /* ref[x], ref = srcPix0 + 1 */
+    vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[x + 16] */
+    vec_u8_t sv2 = vec_xl(33, srcPix0); /* ref[x + 32] */
+    vec_u8_t sv3 = vec_xl(49, srcPix0); /* ref[x + 48] */
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15);
+
+ vec_u8_t srv000 = sv2;
+ vec_u8_t srv100 = vec_perm(sv2, sv3, mask1);
+ vec_u8_t srv200 = vec_perm(sv2, sv3, mask2);
+ vec_u8_t srv300 = vec_perm(sv2, sv3, mask3);
+ vec_u8_t srv400 = vec_perm(sv2, sv3, mask4);
+ vec_u8_t srv500 = vec_perm(sv2, sv3, mask5);
+ vec_u8_t srv600 = vec_perm(sv2, sv3, mask6);
+
+    vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+    vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+    vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+    vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+    vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+    vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+    vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+    vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+    vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+    vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+    vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+    vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+    vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+    vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+    vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+    vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+    vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+    vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+    vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+    vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+    vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+    vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+    vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+    vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+    vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+    vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_32_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+    vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_32_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+    vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_32_20 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+    vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_32_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+    vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_32_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+    vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_32_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+    vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_32_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+    vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_32_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+    vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv30, srv40, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv40, srv50, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv50, srv60, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv50, srv60, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv60, srv70, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv70, srv80, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv70, srv80, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv80, srv90, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv90, srva0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv90, srva0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srva0, srvb0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srvb, srvc, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srvb0, srvc0, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srvb, srvc, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srvb0, srvc0, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srvc, srvd, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srvc0, srvd0, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srvd, srve, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srvd0, srve0, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srvd, srve, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srvd0, srve0, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srve, srvf, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srve0, srvf0, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srvf, srv00, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srvf0, srv000, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srvf, srv00, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srvf0, srv000, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv00, srv10, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv000, srv100, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv10, srv20, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv100, srv200, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv10, srv20, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv100, srv200, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv20, srv30, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv200, srv300, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv30, srv40, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv300, srv400, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv30, srv40, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv300, srv400, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv40, srv50, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv400, srv500, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv50, srv60, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv500, srv600, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
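All of the vector paths in this file implement the per-pixel interpolation noted in the comments: `dst[y * dstStride + x] = ((32 - f[y]) * ref[off[y] + x] + f[y] * ref[off[y] + x + 1] + 16) >> 5`. A minimal scalar sketch of that formula, useful for checking the SIMD output (the helper name is illustrative, not from x265):

```cpp
#include <cstdint>

// Scalar reference for one predicted row: ((32 - frac) * a + frac * b + 16) >> 5.
// Illustrative sketch only; x265's actual C fallback lives in its own source file.
static inline void angular_interp_row(uint8_t* dst, const uint8_t* ref,
                                      int off, int frac, int width)
{
    for (int x = 0; x < width; x++)
        dst[x] = (uint8_t)(((32 - frac) * ref[off + x]
                            + frac * ref[off + x + 1] + 16) >> 5);
}
```

With `frac == 0` the row is a plain copy of `ref[off + x]`, which matches the `vfrac16_15 = {0, ...}` / `vfrac16_32_15 = {32, ...}` pair above.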
+
+template<>
+void intra_pred<4, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ }
+ */
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06, 0x04, 0x05, 0x06, 0x07};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8};
+ vec_u8_t vfrac4_32 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24};
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==4){
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst);
+ vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
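The commented `offset[]`/`fraction[]` tables repeated in these functions follow directly from the HEVC angular-prediction step: mode 33 has `intraPredAngle = 26`, and the per-row offset and fraction are the accumulated angle. A small sketch (helper names are illustrative):

```cpp
// HEVC mode 33 uses intraPredAngle = 26; each row y advances the reference
// position by 26/32 of a pixel, split into an integer offset and a fraction.
static inline int mode33_offset(int y)   { return ((y + 1) * 26) >> 5; }
static inline int mode33_fraction(int y) { return ((y + 1) * 26) & 31; }
```

This reproduces the tables above, e.g. `offset[4] = 4, fraction[4] = 2` and `offset[15] = 13, fraction[15] = 0`.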
+
+template<>
+void intra_pred<8, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 8] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 8] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 8] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[3]* ref[off3 + 7] + f[3] * ref[off3 + 8] + 16) >> 5);
+
+ ...
+
+ y=7; off7 = offset[7]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[7]* ref[off7 + 7] + f[7] * ref[off7 + 8] + 16) >> 5);
+ }
+ */
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d};
+ vec_u8_t mask7={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */
+
+vec_u8_t vfrac8_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac8_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac8_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac8_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_1);
+ vmle1 = vec_mule(srv3, vfrac8_1);
+ vmlo1 = vec_mulo(srv3, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv4, vfrac8_32_2);
+ vmle1 = vec_mule(srv5, vfrac8_2);
+ vmlo1 = vec_mulo(srv5, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv6, vfrac8_32_3);
+ vmle1 = vec_mule(srv7, vfrac8_3);
+ vmlo1 = vec_mulo(srv7, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ if(dstStride==8){
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1);
+ vec_xst(v1, dstStride, dst);
+
+ vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1);
+ vec_xst(v3, dstStride*3, dst);
+
+ vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1);
+ vec_xst(v5, dstStride*5, dst);
+
+ vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
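The `vec_mule`/`vec_mulo` pairs used throughout are widening multiplies over alternating byte lanes: even lanes and odd lanes are multiplied into separate u16 vectors, and `vec_mergeh`/`vec_mergel` plus `vec_pack` restore the original lane order after the add-and-shift. A scalar model of the split (names are illustrative):

```cpp
#include <cstdint>

// Scalar model of vec_mule/vec_mulo on u8 operands: even byte lanes and odd
// byte lanes widen-multiply into two separate arrays of eight u16 products.
static void widen_mul_even_odd(const uint8_t a[16], const uint8_t b[16],
                               uint16_t even[8], uint16_t odd[8])
{
    for (int i = 0; i < 8; i++) {
        even[i] = (uint16_t)a[2 * i]     * b[2 * i];     // vec_mule
        odd[i]  = (uint16_t)a[2 * i + 1] * b[2 * i + 1]; // vec_mulo
    }
}
```

Interleaving `even[i]`/`odd[i]` back pairwise is exactly what the `vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo))` sequence does after the rounding shift.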
+
+template<>
+void intra_pred<16, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5);
+
+ ...
+
+ y=15; off15 = offset[15]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5);
+ }
+ */
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x + 16], second 16 bytes of the reference row */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+
+vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv8, srv9, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, dstStride, dst);
+ vec_xst(vout_2, dstStride*2, dst);
+ vec_xst(vout_3, dstStride*3, dst);
+ vec_xst(vout_4, dstStride*4, dst);
+ vec_xst(vout_5, dstStride*5, dst);
+ vec_xst(vout_6, dstStride*6, dst);
+ vec_xst(vout_7, dstStride*7, dst);
+ vec_xst(vout_8, dstStride*8, dst);
+ vec_xst(vout_9, dstStride*9, dst);
+ vec_xst(vout_10, dstStride*10, dst);
+ vec_xst(vout_11, dstStride*11, dst);
+ vec_xst(vout_12, dstStride*12, dst);
+ vec_xst(vout_13, dstStride*13, dst);
+ vec_xst(vout_14, dstStride*14, dst);
+ vec_xst(vout_15, dstStride*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+        dst[y * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+        dst[y * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+        dst[y * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+
+ ...
+
+        y=15; off15 = offset[15]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+        dst[y * dstStride + 31] = (pixel)((f32[15]* ref[off15 + 31] + f[15] * ref[off15 + 32] + 16) >> 5);
+
+ ...
+
+        y=31; off31 = offset[31]; x=0-31;
+        dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+        dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off31 + 1] + f[31] * ref[off31 + 2] + 16) >> 5);
+        dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off31 + 2] + f[31] * ref[off31 + 3] + 16) >> 5);
+        dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off31 + 3] + f[31] * ref[off31 + 4] + 16) >> 5);
+        ...
+        dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(1, srcPix0);  /* ref[0..15], ref = srcPix0 + 1 */
+    vec_u8_t sv1 = vec_xl(17, srcPix0); /* ref[16..31] */
+    vec_u8_t sv2 = vec_xl(33, srcPix0); /* ref[32..47] */
+    vec_u8_t sv3 = vec_xl(49, srcPix0); /* ref[48..63] */
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15);
+
+ vec_u8_t srv000 = sv2;
+ vec_u8_t srv100 = vec_perm(sv2, sv3, mask1);
+ vec_u8_t srv200 = vec_perm(sv2, sv3, mask2);
+ vec_u8_t srv300 = vec_perm(sv2, sv3, mask3);
+ vec_u8_t srv400 = vec_perm(sv2, sv3, mask4);
+ vec_u8_t srv500 = vec_perm(sv2, sv3, mask5);
+ vec_u8_t srv600 = vec_perm(sv2, sv3, mask6);
+ vec_u8_t srv700 = vec_perm(sv2, sv3, mask7);
+ vec_u8_t srv800 = vec_perm(sv2, sv3, mask8);
+ vec_u8_t srv900 = vec_perm(sv2, sv3, mask9);
+ vec_u8_t srva00 = vec_perm(sv2, sv3, mask10);
+ vec_u8_t srvb00 = vec_perm(sv2, sv3, mask11);
+
+    vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+    vec_u8_t vfrac16_16 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_17 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_18 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_19 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_20 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_21 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_22 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_23 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_24 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_25 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_26 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_27 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_28 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_29 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_30 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+    vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+    vec_u8_t vfrac16_32_16 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_32_17 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_32_18 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_32_19 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_32_20 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_32_21 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_32_22 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_32_23 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_32_24 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_32_25 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_32_26 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_32_27 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_32_28 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_32_29 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_32_30 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv20, srv30, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv30, srv40, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv40, srv50, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv40, srv50, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv50, srv60, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv60, srv70, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv70, srv80, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv8, srv9, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv80, srv90, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv80, srv90, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv90, srva0, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srva0, srvb0, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srvb0, srvc0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srvc0, srvd0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srvd0, srve0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, dstStride, dst);
+ vec_xst(vout_3, dstStride+16, dst);
+ vec_xst(vout_4, dstStride*2, dst);
+ vec_xst(vout_5, dstStride*2+16, dst);
+ vec_xst(vout_6, dstStride*3, dst);
+ vec_xst(vout_7, dstStride*3+16, dst);
+ vec_xst(vout_8, dstStride*4, dst);
+ vec_xst(vout_9, dstStride*4+16, dst);
+ vec_xst(vout_10, dstStride*5, dst);
+ vec_xst(vout_11, dstStride*5+16, dst);
+ vec_xst(vout_12, dstStride*6, dst);
+ vec_xst(vout_13, dstStride*6+16, dst);
+ vec_xst(vout_14, dstStride*7, dst);
+ vec_xst(vout_15, dstStride*7+16, dst);
+ vec_xst(vout_16, dstStride*8, dst);
+ vec_xst(vout_17, dstStride*8+16, dst);
+ vec_xst(vout_18, dstStride*9, dst);
+ vec_xst(vout_19, dstStride*9+16, dst);
+ vec_xst(vout_20, dstStride*10, dst);
+ vec_xst(vout_21, dstStride*10+16, dst);
+ vec_xst(vout_22, dstStride*11, dst);
+ vec_xst(vout_23, dstStride*11+16, dst);
+ vec_xst(vout_24, dstStride*12, dst);
+ vec_xst(vout_25, dstStride*12+16, dst);
+ vec_xst(vout_26, dstStride*13, dst);
+ vec_xst(vout_27, dstStride*13+16, dst);
+ vec_xst(vout_28, dstStride*14, dst);
+ vec_xst(vout_29, dstStride*14+16, dst);
+ vec_xst(vout_30, dstStride*15, dst);
+ vec_xst(vout_31, dstStride*15+16, dst);
+
+ one_line(srvd, srve, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srvd0, srve0, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srve, srvf, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srve0, srvf0, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srvf, srv00, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srvf0, srv000, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv00, srv10, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv000, srv100, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv10, srv20, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv100, srv200, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv10, srv20, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv100, srv200, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv20, srv30, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv200, srv300, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv30, srv40, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv300, srv400, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv40, srv50, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv400, srv500, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv50, srv60, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv500, srv600, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv50, srv60, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv500, srv600, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv60, srv70, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv600, srv700, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv70, srv80, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv700, srv800, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv800, srv900, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv90, srva0, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv900, srva00, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srva0, srvb0, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srva00, srvb00, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, dstStride*16, dst);
+ vec_xst(vout_1, dstStride*16+16, dst);
+ vec_xst(vout_2, dstStride*17, dst);
+ vec_xst(vout_3, dstStride*17+16, dst);
+ vec_xst(vout_4, dstStride*18, dst);
+ vec_xst(vout_5, dstStride*18+16, dst);
+ vec_xst(vout_6, dstStride*19, dst);
+ vec_xst(vout_7, dstStride*19+16, dst);
+ vec_xst(vout_8, dstStride*20, dst);
+ vec_xst(vout_9, dstStride*20+16, dst);
+ vec_xst(vout_10, dstStride*21, dst);
+ vec_xst(vout_11, dstStride*21+16, dst);
+ vec_xst(vout_12, dstStride*22, dst);
+ vec_xst(vout_13, dstStride*22+16, dst);
+ vec_xst(vout_14, dstStride*23, dst);
+ vec_xst(vout_15, dstStride*23+16, dst);
+ vec_xst(vout_16, dstStride*24, dst);
+ vec_xst(vout_17, dstStride*24+16, dst);
+ vec_xst(vout_18, dstStride*25, dst);
+ vec_xst(vout_19, dstStride*25+16, dst);
+ vec_xst(vout_20, dstStride*26, dst);
+ vec_xst(vout_21, dstStride*26+16, dst);
+ vec_xst(vout_22, dstStride*27, dst);
+ vec_xst(vout_23, dstStride*27+16, dst);
+ vec_xst(vout_24, dstStride*28, dst);
+ vec_xst(vout_25, dstStride*28+16, dst);
+ vec_xst(vout_26, dstStride*29, dst);
+ vec_xst(vout_27, dstStride*29+16, dst);
+ vec_xst(vout_28, dstStride*30, dst);
+ vec_xst(vout_29, dstStride*30+16, dst);
+ vec_xst(vout_30, dstStride*31, dst);
+ vec_xst(vout_31, dstStride*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<4, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ if(dstStride == 4) {
+ const vec_u8_t srcV = vec_xl(2, srcPix0);
+        const vec_u8_t mask = {0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06};
+ vec_u8_t vout = vec_perm(srcV, srcV, mask);
+ vec_xst(vout, 0, dst);
+ }
+ else if(dstStride%16 == 0){
+ vec_u8_t v0 = vec_xl(2, srcPix0);
+ vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst);
+ vec_u8_t v1 = vec_xl(3, srcPix0);
+ vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride));
+ vec_u8_t v2 = vec_xl(4, srcPix0);
+ vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2));
+ vec_u8_t v3 = vec_xl(5, srcPix0);
+ vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3));
+ }
+ else{
+        const vec_u8_t srcV = vec_xl(2, srcPix0); /* dst[y][x] = srcPix0[x + y + 2] */
+ const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srcV, vec_xl(0, dst), mask_0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srcV, vec_xl(dstStride, dst), mask_1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srcV, vec_xl(dstStride*2, dst), mask_2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srcV, vec_xl(dstStride*3, dst), mask_3);
+ vec_xst(v3, dstStride*3, dst);
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<8, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ if(dstStride == 8) {
+        const vec_u8_t srcV1 = vec_xl(2, srcPix0); /* dst[y][x] = srcPix0[x + y + 2] */
+        const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ const vec_u8_t mask_1 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ const vec_u8_t mask_2 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ const vec_u8_t mask_3 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e};
+ vec_u8_t v0 = vec_perm(srcV1, srcV1, mask_0);
+ vec_u8_t v1 = vec_perm(srcV1, srcV1, mask_1);
+ vec_u8_t v2 = vec_perm(srcV1, srcV1, mask_2);
+ vec_u8_t v3 = vec_perm(srcV1, srcV1, mask_3);
+ vec_xst(v0, 0, dst);
+ vec_xst(v1, 16, dst);
+ vec_xst(v2, 32, dst);
+ vec_xst(v3, 48, dst);
+ }
+ else{
+        const vec_u8_t srcV1 = vec_xl(2, srcPix0); /* dst[y][x] = srcPix0[x + y + 2] */
+ const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_4 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_5 = {0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_6 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ const vec_u8_t mask_7 = {0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t v0 = vec_perm(srcV1, vec_xl(0, dst), mask_0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(srcV1, vec_xl(dstStride, dst), mask_1);
+ vec_xst(v1, dstStride, dst);
+ vec_u8_t v2 = vec_perm(srcV1, vec_xl(dstStride*2, dst), mask_2);
+ vec_xst(v2, dstStride*2, dst);
+ vec_u8_t v3 = vec_perm(srcV1, vec_xl(dstStride*3, dst), mask_3);
+ vec_xst(v3, dstStride*3, dst);
+ vec_u8_t v4 = vec_perm(srcV1, vec_xl(dstStride*4, dst), mask_4);
+ vec_xst(v4, dstStride*4, dst);
+ vec_u8_t v5 = vec_perm(srcV1, vec_xl(dstStride*5, dst), mask_5);
+ vec_xst(v5, dstStride*5, dst);
+ vec_u8_t v6 = vec_perm(srcV1, vec_xl(dstStride*6, dst), mask_6);
+ vec_xst(v6, dstStride*6, dst);
+ vec_u8_t v7 = vec_perm(srcV1, vec_xl(dstStride*7, dst), mask_7);
+ vec_xst(v7, dstStride*7, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<16, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+    int i;
+ for(i=0; i<16; i++){
+        vec_xst(vec_xl(2+i, srcPix0), i*dstStride, dst); /* dst[y][x] = srcPix0[x + y + 2] */
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x <16; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void intra_pred<32, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter)
+{
+ int i;
+    int off;
+ for(i=0; i<32; i++){
+ off = i*dstStride;
+        vec_xst(vec_xl(2+i, srcPix0), off, dst);     /* dst[y][x] = srcPix0[x + y + 2] */
+        vec_xst(vec_xl(18+i, srcPix0), off+16, dst); /* columns 16..31 */
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x <32; x++)
+ {
+ printf("%d ",dst[y * dstStride + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<int width>
+void intra_pred_ang_altivec(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int dirMode, int bFilter)
+{
+ const int size = width;
+ switch(dirMode){
+ case 2:
+ intra_pred<size, 2>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 3:
+ intra_pred<size, 3>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 4:
+ intra_pred<size, 4>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 5:
+ intra_pred<size, 5>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 6:
+ intra_pred<size, 6>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 7:
+ intra_pred<size, 7>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 8:
+ intra_pred<size, 8>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 9:
+ intra_pred<size, 9>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 10:
+ intra_pred<size, 10>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 11:
+ intra_pred<size, 11>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 12:
+ intra_pred<size, 12>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 13:
+ intra_pred<size, 13>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 14:
+ intra_pred<size, 14>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 15:
+ intra_pred<size, 15>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 16:
+ intra_pred<size, 16>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 17:
+ intra_pred<size, 17>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 18:
+ intra_pred<size, 18>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 19:
+ intra_pred<size, 19>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 20:
+ intra_pred<size, 20>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 21:
+ intra_pred<size, 21>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 22:
+ intra_pred<size, 22>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 23:
+ intra_pred<size, 23>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 24:
+ intra_pred<size, 24>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 25:
+ intra_pred<size, 25>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 26:
+ intra_pred<size, 26>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 27:
+ intra_pred<size, 27>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 28:
+ intra_pred<size, 28>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 29:
+ intra_pred<size, 29>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 30:
+ intra_pred<size, 30>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 31:
+ intra_pred<size, 31>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 32:
+ intra_pred<size, 32>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 33:
+ intra_pred<size, 33>(dst, dstStride, srcPix0, bFilter);
+ return;
+ case 34:
+ intra_pred<size, 34>(dst, dstStride, srcPix0, bFilter);
+ return;
+ default:
+        printf("Unsupported intra prediction mode\n");
+ exit(1);
+ }
+}
+
+template<int dstStride, int dirMode>
+void one_ang_pred_altivec(pixel* dst, const pixel *srcPix0, int bFilter) {}
+
+template<>
+void one_ang_pred_altivec<4, 2>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 2>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 2>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 2>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 2>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 2>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 2>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 2>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 18>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 18>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 18>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 18>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 18>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 18>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 18>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 18>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 19>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 19>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 19>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 19>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 19>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 19>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 19>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 19>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 20>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 20>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 20>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 20>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 20>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 20>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 20>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 20>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 21>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 21>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 21>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 21>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 21>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 21>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 21>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 21>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 22>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 22>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 22>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 22>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 22>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 22>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 22>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 22>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+
+template<>
+void one_ang_pred_altivec<4, 23>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 23>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 23>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 23>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 23>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 23>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 23>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 23>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 24>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 24>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 24>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 24>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 24>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 24>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 24>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 24>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+
+template<>
+void one_ang_pred_altivec<4, 25>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 25>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 25>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 25>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 25>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 25>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 25>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 25>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 27>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 27>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 27>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 27>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 27>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 27>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 27>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 27>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 28>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 28>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 28>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 28>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 28>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 28>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 28>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 28>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 29>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 29>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 29>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 29>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 29>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 29>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 29>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 29>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 30>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 30>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 30>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 30>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 30>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 30>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 30>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 30>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 31>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 31>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 31>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 31>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 31>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 31>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 31>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 31>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 32>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 32>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 32>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 32>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 32>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 32>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 32>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 32>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 33>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 33>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 33>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 33>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 33>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 33>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 33>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 33>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 34>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<4, 34>(dst, 4, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<8, 34>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<8, 34>(dst, 8, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<16, 34>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<16, 34>(dst, 16, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<32, 34>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ intra_pred<32, 34>(dst, 32, srcPix0, bFilter);
+ return;
+}
+
+template<>
+void one_ang_pred_altivec<4, 6>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12}; /* 32 - fraction[0-3] */
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 6>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask5={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 2 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 3 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 3, 4 */
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac8_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 14, 14, 14, 14, 14, 14 };
+ vec_u8_t vfrac8_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv1, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_1);
+ vmle1 = vec_mule(srv2, vfrac8_1);
+ vmlo1 = vec_mulo(srv2, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv2, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_2);
+ vmle1 = vec_mule(srv3, vfrac8_2);
+ vmlo1 = vec_mulo(srv3, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv4, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv4, vfrac8_32_3);
+ vmle1 = vec_mule(srv5, vfrac8_3);
+ vmlo1 = vec_mulo(srv5, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 6>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ /*vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};*/
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0);
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ //vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ //vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ //vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ //vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ //vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ //vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ //vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 6>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0);
+ vec_u8_t sv2 = vec_xl(97, srcPix0);
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv20, srv30, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv20, srv30, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv30, srv40, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv30, srv40, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv40, srv50, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv40, srv50, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv40, srv50, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv50, srv60, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv50, srv60, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv60, srv70, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv60, srv70, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv60, srv70, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv7, srv8, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv70, srv80, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv7, srv8, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv70, srv80, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv8, srv9, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv80, srv90, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv8, srv9, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv80, srv90, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv8, srv9, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv80, srv90, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv9, srva, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv90, srva0, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv9, srva, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv90, srva0, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srva, srvb, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srva0, srvb0, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srva, srvb, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srva0, srvb0, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srva, srvb, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srva0, srvb0, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srvb, srvc, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srvb0, srvc0, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srvb, srvc, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srvb0, srvc0, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srvc, srvd, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srvc0, srvd0, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srvc, srvd, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srvc0, srvd0, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srvd, srve, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srvd0, srve0, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 7>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 7 (angle 9; shares these offset/fraction tables with mode 29):
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref = srcPix0 + 9; offset[0-2] = 0, offset[3] = 1 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* rows y=0..2 use offset 0, row y=3 uses offset 1 (per offset[] above) */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28}; /* 32 - fraction[0-3] */
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 7>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 7 (angle 9; shares these offset/fraction tables with mode 29):
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask2={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask3={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask5={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref = srcPix0 + 17; offset[0-7] = {0, 0, 0, 1, 1, 1, 1, 2} */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 0, 1 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 1, 2 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 2 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 2, 3 */
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac8_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac8_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_1);
+ vmle1 = vec_mule(srv3, vfrac8_1);
+ vmlo1 = vec_mulo(srv3, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv1, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_2);
+ vmle1 = vec_mule(srv4, vfrac8_2);
+ vmlo1 = vec_mulo(srv4, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv3, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv3, vfrac8_32_3);
+ vmle1 = vec_mule(srv5, vfrac8_3);
+ vmlo1 = vec_mulo(srv5, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 7>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 7 (angle 9; shares these offset/fraction tables with mode 29):
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[0..15], ref = srcPix0 + 33 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0); /* ref[16..31] */
+ vec_u8_t srv0 = sv0; /* shifted views srv0..srv5 supply ref[offset + x] and ref[offset + x + 1] for offsets 0..4 */
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 7>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 7 (angle 9; shares these offset/fraction tables with mode 29):
+ //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9};
+ //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0};
+
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[0..15], ref = srcPix0 + 65 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0); /* ref[16..31] */
+ vec_u8_t sv2 = vec_xl(97, srcPix0); /* ref[32..47] */
+ vec_u8_t srv0 = sv0; /* shifted views srv0..srva supply ref[offset + x] and ref[offset + x + 1] for offsets 0..9 */
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+
+ vec_u8_t srv00 = sv1; /* same shifted views, for columns x = 16..31 */
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* fraction[16-31] */
+ vec_u8_t vfrac16_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 };
+ vec_u8_t vfrac16_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_21 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[16-31] */
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv00, srv10, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv10, srv20, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv10, srv20, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv10, srv20, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv20, srv30, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv20, srv30, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv20, srv30, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv30, srv40, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv30, srv40, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv30, srv40, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv30, srv40, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv40, srv50, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv40, srv50, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv4, srv5, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv40, srv50, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv5, srv6, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv50, srv60, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv5, srv6, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv50, srv60, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv5, srv6, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv50, srv60, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv5, srv6, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv50, srv60, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv6, srv7, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv60, srv70, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv6, srv7, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv60, srv70, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv6, srv7, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv60, srv70, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv7, srv8, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv70, srv80, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv7, srv8, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv70, srv80, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv7, srv8, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv70, srv80, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv7, srv8, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv70, srv80, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv80, srv90, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv80, srv90, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv9, srva, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv90, srva0, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 8>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 28
+ //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5};
+ //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12}; /* 32 - fraction[0-3] */
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
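The vector code above implements the standard HEVC angular interpolation spelled out in the `dst[y * dstStride + x]` comment. As a scalar reference sketch (the function name `angularInterp` is illustrative, not part of the x265 API):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the per-pixel filter the AltiVec path vectorizes:
// dst = ((32 - frac) * ref[offset + x] + frac * ref[offset + x + 1] + 16) >> 5
// 'frac' is the 1/32-sample interpolation weight for the current row.
static inline uint8_t angularInterp(const uint8_t* ref, int offset, int x, int frac)
{
    return (uint8_t)(((32 - frac) * ref[offset + x]
                      + frac * ref[offset + x + 1] + 16) >> 5);
}
```

With `frac == 0` the result is simply `ref[offset + x]` after rounding, which is why rows whose fraction is 0 pair a weight of 32 with a weight of 0 in the tables above.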
+
+template<>
+void one_ang_pred_altivec<8, 8>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 28
+ //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5};
+ //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x]; offset[0-5] = 0, offset[6-7] = 1 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac8_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac8_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv0, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_1);
+ vmle1 = vec_mule(srv1, vfrac8_1);
+ vmlo1 = vec_mulo(srv1, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv0, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_2);
+ vmle1 = vec_mule(srv1, vfrac8_2);
+ vmlo1 = vec_mulo(srv1, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv1, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_3);
+ vmle1 = vec_mule(srv2, vfrac8_3);
+ vmlo1 = vec_mulo(srv2, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
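The `vec_mule`/`vec_mulo` pairs above multiply the even and odd byte lanes into separate 16-bit vectors, and the `vec_mergeh`/`vec_mergel` + `vec_pack` sequence re-interleaves the shifted sums back into lane order. A scalar model of that pattern (the name `interpEvenOdd` is hypothetical, for illustration only):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the vec_mule/vec_mulo + vec_mergeh/vec_mergel + vec_pack
// pattern: filter even and odd lanes separately in 16-bit precision, then
// interleave the packed results back into the original byte order.
void interpEvenOdd(const uint8_t* src, const uint8_t* next, int frac,
                   uint8_t* out, int n)
{
    for (int i = 0; i < n / 2; i++)
    {
        // vec_mule / vec_mulo: even and odd lanes widened and multiplied
        uint16_t e = (uint16_t)(((32 - frac) * src[2 * i]
                                 + frac * next[2 * i] + 16) >> 5);
        uint16_t o = (uint16_t)(((32 - frac) * src[2 * i + 1]
                                 + frac * next[2 * i + 1] + 16) >> 5);
        // vec_mergeh/vec_mergel + vec_pack: restore lane order
        out[2 * i]     = (uint8_t)e;
        out[2 * i + 1] = (uint8_t)o;
    }
}
```

The even/odd split costs nothing extra because the re-interleave is folded into the merge/pack that would be needed anyway to narrow the 16-bit sums back to bytes.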
+
+template<>
+void one_ang_pred_altivec<16, 8>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x]: reference bytes 0..15 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0); /* reference bytes 16..31 */
+ vec_u8_t srv0 = sv0; /* rows y=0..5 use srv0/srv1; y=6..11 use srv1/srv2; y=12..15 use srv2/srv3 */
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+
+ //mode 28
+ //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5};
+ //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0};
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 8>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x]: reference bytes 0..15 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0); /* reference bytes 16..31 */
+ vec_u8_t sv2 = vec_xl(97, srcPix0); /* reference bytes 32..47 */
+ vec_u8_t srv0 = sv0; /* rows y=0..5 use the srv0/srv1 pair */
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* rows y=6..11 use srv1/srv2 */
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* rows y=12..18 use srv2/srv3 */
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask4); /* rows y=19..24 use srv3/srv8 */
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask5); /* rows y=25..30 use srv8/srv9 */
+ vec_u8_t srv12 = vec_perm(sv0, sv1, mask6); /* row y=31 uses srv9/srv12 */
+
+ vec_u8_t srv4 = sv1;
+ vec_u8_t srv5 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv6 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv7 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask4); /* right half, rows y=19..24 use srv7/srv10 */
+ vec_u8_t srv11 = vec_perm(sv1, sv2, mask5); /* right half, rows y=25..30 use srv10/srv11 */
+ vec_u8_t srv13 = vec_perm(sv1, sv2, mask6); /* right half, row y=31 uses srv11/srv13 */
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_16 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv6, srv7, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv2, srv3, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv2, srv3, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv6, srv7, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv2, srv3, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv6, srv7, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv3, srv8, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv7, srv10, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv3, srv8, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv7, srv10, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv3, srv8, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv7, srv10, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv3, srv8, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv7, srv10, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv3, srv8, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv7, srv10, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv3, srv8, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv7, srv10, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv8, srv9, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv10, srv11, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv8, srv9, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv10, srv11, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv8, srv9, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv10, srv11, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv10, srv11, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv10, srv11, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv10, srv11, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv9, srv12, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv11, srv13, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 9>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24}; /* 32 - fraction[0-3] */
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
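For reference, the arithmetic these specializations vectorize is the standard two-tap angular interpolation; a minimal scalar sketch (hypothetical helper, assuming 8-bit pixels and mode 9's angle step of 2) might look like:

```cpp
#include <cstdint>

// Scalar model of the per-row blend the AltiVec kernels perform: for each row
// the angular position advances by the mode's step (2 for mode 9); its high
// bits give the reference offset and its low 5 bits the blend fraction.
static void angular_pred_scalar(uint8_t* dst, const uint8_t* ref,
                                int width, int angleStep)
{
    for (int y = 0; y < width; y++)
    {
        int pos  = (y + 1) * angleStep; // accumulated angular position
        int off  = pos >> 5;            // integer reference offset
        int frac = pos & 31;            // sub-pel fraction, 0..31
        for (int x = 0; x < width; x++)
            dst[y * width + x] = (uint8_t)(((32 - frac) * ref[off + x]
                                          + frac * ref[off + x + 1] + 16) >> 5);
    }
}
```

For width 16 and angleStep 2 this reproduces the fraction tables above: rows 0..14 use fractions 2..30 at offset 0, and row 15 wraps to fraction 0 at offset 1, which is why the last `one_line` switches to the shifted `srv1`/`srv2` vectors.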
+
+template<>
+void one_ang_pred_altivec<8, 9>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ /* width2 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac8_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv0, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_1);
+ vmle1 = vec_mule(srv1, vfrac8_1);
+ vmlo1 = vec_mulo(srv1, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv0, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_2);
+ vmle1 = vec_mule(srv1, vfrac8_2);
+ vmlo1 = vec_mulo(srv1, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv0, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv0, vfrac8_32_3);
+ vmle1 = vec_mule(srv1, vfrac8_3);
+ vmlo1 = vec_mulo(srv1, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
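The `vec_mule`/`vec_mulo` followed by `vec_mergeh`/`vec_mergel` sequence used throughout these kernels is the usual AltiVec idiom for widening u8×u8 multiplies: products of even- and odd-indexed lanes land in separate u16 vectors, and interleaving the two halves restores element order before `vec_pack` narrows the result. A portable scalar model of the idiom (hypothetical helper, not the intrinsics themselves; the rounding shift between multiply and merge is omitted):

```cpp
#include <cstdint>

// Even/odd widening multiply followed by interleave, modeled on 16-lane
// vectors: the mule/mulo step splits the products, the mergeh/mergel step
// knits them back into original element order.
static void mul_even_odd_merge(const uint8_t a[16], const uint8_t b[16],
                               uint16_t out[16])
{
    uint16_t even[8], odd[8];
    for (int i = 0; i < 8; i++)
    {
        even[i] = (uint16_t)(a[2 * i] * b[2 * i]);         // like vec_mule
        odd[i]  = (uint16_t)(a[2 * i + 1] * b[2 * i + 1]); // like vec_mulo
    }
    for (int i = 0; i < 8; i++) // like vec_mergeh (i < 4) and vec_mergel (i >= 4)
    {
        out[2 * i]     = even[i];
        out[2 * i + 1] = odd[i];
    }
}
```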
+
+template<>
+void one_ang_pred_altivec<16, 9>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+ vec_xst(vout_4, 64, dst);
+ vec_xst(vout_5, 80, dst);
+ vec_xst(vout_6, 96, dst);
+ vec_xst(vout_7, 112, dst);
+ vec_xst(vout_8, 128, dst);
+ vec_xst(vout_9, 144, dst);
+ vec_xst(vout_10, 160, dst);
+ vec_xst(vout_11, 176, dst);
+ vec_xst(vout_12, 192, dst);
+ vec_xst(vout_13, 208, dst);
+ vec_xst(vout_14, 224, dst);
+ vec_xst(vout_15, 240, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 9>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv2 = vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* from y=15, use srv1, srv2 */
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* y=31, use srv2, srv3 */
+
+ vec_u8_t srv4 = sv1;
+ vec_u8_t srv5 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv6 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv7 = vec_perm(sv1, sv2, mask3);
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv4, srv5, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv4, srv5, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv5, srv6, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+
+ one_line(srv1, srv2, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv5, srv6, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv5, srv6, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv5, srv6, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv5, srv6, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv5, srv6, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv5, srv6, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv1, srv2, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv1, srv2, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv1, srv2, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv5, srv6, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 10>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srcV = vec_xl(9, srcPix0); /* offset = width2+1 = width<<1 + 1 */
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srcV, u8_to_s16_b0_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w4x4_mask1));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ vec_u8_t mask = {0x00, 0x11, 0x12, 0x13, 0x01, 0x11, 0x12, 0x13, 0x02, 0x11, 0x12, 0x13, 0x03, 0x11, 0x12, 0x13};
+ vec_u8_t v0 = vec_perm(v_filter_u8, srcV, mask);
+ vec_xst(v0, 0, dst);
+ }
+ else{
+ vec_u8_t v0 = (vec_u8_t)vec_splat((vec_u32_t)srcV, 0);
+ vec_xst(v0, 0, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 10>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srcV = vec_xl(17, srcPix0); /* offset = width2+1 = width<<1 + 1 */
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srcV, u8_to_s16_b0_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_mask1));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ vec_u8_t v_mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t v_mask1 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t v_mask2 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t v_mask3 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t v0 = vec_perm(v_filter_u8, srcV, v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_u8_t v1 = vec_perm(v_filter_u8, srcV, v_mask1);
+ vec_xst(v1, 16, dst);
+ vec_u8_t v2 = vec_perm(v_filter_u8, srcV, v_mask2);
+ vec_xst(v2, 32, dst);
+ vec_u8_t v3 = vec_perm(v_filter_u8, srcV, v_mask3);
+ vec_xst(v3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07};
+ vec_u8_t v0 = vec_perm(srcV, srcV, v_mask0);
+ vec_xst(v0, 0, dst);
+ vec_xst(v0, 16, dst);
+ vec_xst(v0, 32, dst);
+ vec_xst(v0, 48, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+
+}
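Mode 10 is plain horizontal prediction: each row is filled from the left reference column, and when `bFilter` is set the first row is gradient-filtered from the top row relative to the corner sample, clamped to the pixel range (the kernels above store a transposed layout, so the filtered samples land in the first byte of each vector row). A scalar sketch of the untransposed semantics, with a hypothetical helper name and 8-bit pixels:

```cpp
#include <cstdint>

// Untransposed semantics of mode-10 (horizontal) prediction: replicate the
// left column across each row; with filtering, adjust the first row by half
// the gradient of the top row relative to the corner sample, then clamp.
static void horizontal_pred_scalar(uint8_t* dst, const uint8_t* above,
                                   const uint8_t* left, uint8_t corner,
                                   int width, bool bFilter)
{
    for (int y = 0; y < width; y++)
        for (int x = 0; x < width; x++)
            dst[y * width + x] = left[y];

    if (bFilter)
        for (int x = 0; x < width; x++)
        {
            int v = left[0] + ((above[x] - corner) >> 1);
            dst[x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
}
```

The clamp corresponds to the `vec_max`/`vec_min` pair in the `bFilter` branch, and the `>> 1` to the `vec_sra` by `one_u16v`.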
+
+template<>
+void one_ang_pred_altivec<16, 10>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(33, srcPix0); /* offset = width2+1 = width<<1 + 1 */
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_u8_t srcv1 = vec_xl(1, srcPix0);
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+
+ vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+
+ vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask1), 16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask2), 32, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask3), 48, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask4), 64, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask5), 80, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask6), 96, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask7), 112, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask8), 128, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask9), 144, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask10), 160, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask11), 176, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask12), 192, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask13), 208, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask14), 224, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask15), 240, dst);
+ }
+ else{
+ vec_xst(srv, 0, dst);
+ vec_xst(srv, 16, dst);
+ vec_xst(srv, 32, dst);
+ vec_xst(srv, 48, dst);
+ vec_xst(srv, 64, dst);
+ vec_xst(srv, 80, dst);
+ vec_xst(srv, 96, dst);
+ vec_xst(srv, 112, dst);
+ vec_xst(srv, 128, dst);
+ vec_xst(srv, 144, dst);
+ vec_xst(srv, 160, dst);
+ vec_xst(srv, 176, dst);
+ vec_xst(srv, 192, dst);
+ vec_xst(srv, 208, dst);
+ vec_xst(srv, 224, dst);
+ vec_xst(srv, 240, dst);
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 10>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(65, srcPix0); /* offset = width2+1 = width<<1 + 1 */
+ vec_u8_t srv1 = vec_xl(81, srcPix0);
+ //vec_u8_t vout;
+ int offset = 0;
+
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_u8_t srcv1 = vec_xl(1, srcPix0);
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+
+ vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst);
+ vec_xst(srv1, 16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask1), 32, dst);
+ vec_xst(srv1, 48, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask2), 64, dst);
+ vec_xst(srv1, 80, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask3), 96, dst);
+ vec_xst(srv1, 112, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask4), 128, dst);
+ vec_xst(srv1, 144, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask5), 160, dst);
+ vec_xst(srv1, 176, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask6), 192, dst);
+ vec_xst(srv1, 208, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask7), 224, dst);
+ vec_xst(srv1, 240, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask8), 256, dst);
+ vec_xst(srv1, 272, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask9), 288, dst);
+ vec_xst(srv1, 304, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask10), 320, dst);
+ vec_xst(srv1, 336, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask11), 352, dst);
+ vec_xst(srv1, 368, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask12), 384, dst);
+ vec_xst(srv1, 400, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask13), 416, dst);
+ vec_xst(srv1, 432, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask14), 448, dst);
+ vec_xst(srv1, 464, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask15), 480, dst);
+ vec_xst(srv1, 496, dst);
+
+ vec_u8_t srcv2 = vec_xl(17, srcPix0);
+ vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh));
+ vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl));
+ vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v );
+ vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v );
+ vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16);
+ vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16);
+ vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum));
+ vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum));
+ vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask0), 512, dst);
+ vec_xst(srv1, 528, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask1), 544, dst);
+ vec_xst(srv1, 560, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask2), 576, dst);
+ vec_xst(srv1, 592, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask3), 608, dst);
+ vec_xst(srv1, 624, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask4), 640, dst);
+ vec_xst(srv1, 656, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask5), 672, dst);
+ vec_xst(srv1, 688, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask6), 704, dst);
+ vec_xst(srv1, 720, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask7), 736, dst);
+ vec_xst(srv1, 752, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask8), 768, dst);
+ vec_xst(srv1, 784, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask9), 800, dst);
+ vec_xst(srv1, 816, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask10), 832, dst);
+ vec_xst(srv1, 848, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask11), 864, dst);
+ vec_xst(srv1, 880, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask12), 896, dst);
+ vec_xst(srv1, 912, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask13), 928, dst);
+ vec_xst(srv1, 944, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask14), 960, dst);
+ vec_xst(srv1, 976, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask15), 992, dst);
+ vec_xst(srv1, 1008, dst);
+
+ }
+ else{
+ for(int i = 0; i<32; i++){
+ vec_xst(srv, offset, dst);
+ vec_xst(srv1, offset+16, dst);
+ offset += 32;
+ }
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 26>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(0, srcPix0); /* srcPix0[0] = top-left, above row starts at srcPix0[1] */
+ vec_u8_t v0;
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_sld(srv, srv, 15);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_w4x4_mask9));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ vec_u8_t v_mask = {0x10, 0x02, 0x03, 0x04, 0x11, 0x02, 0x03, 0x04, 0x12, 0x02, 0x03, 0x04, 0x13, 0x02, 0x03, 0x04};
+ v0 = vec_perm(srv, v_filter_u8, v_mask);
+ }
+ else{
+ vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04};
+ v0 = vec_perm(srv, srv, v_mask);
+ }
+ vec_xst(v0, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 26>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(0, srcPix0); /* srcPix0[0] = top-left, above row starts at srcPix0[1] */
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(17, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask));
+ vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh));
+ vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v );
+ vec_s16_t v_sum = vec_add(c1_s16v, v1_s16);
+ vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum));
+ vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v);
+ vec_u8_t v_mask0 = {0x00, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x01, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v_mask1 = {0x02, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x03, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v_mask2 = {0x04, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x05, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v_mask3 = {0x06, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x07, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t v0 = vec_perm(v_filter_u8, srv, v_mask0);
+ vec_u8_t v1 = vec_perm(v_filter_u8, srv, v_mask1);
+ vec_u8_t v2 = vec_perm(v_filter_u8, srv, v_mask2);
+ vec_u8_t v3 = vec_perm(v_filter_u8, srv, v_mask3);
+ vec_xst(v0, 0, dst);
+ vec_xst(v1, 16, dst);
+ vec_xst(v2, 32, dst);
+ vec_xst(v3, 48, dst);
+ }
+ else{
+ vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t v0 = vec_perm(srv, srv, v_mask);
+ vec_xst(v0, 0, dst);
+ vec_xst(v0, 16, dst);
+ vec_xst(v0, 32, dst);
+ vec_xst(v0, 48, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 26>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(0, srcPix0);
+ vec_u8_t srv1 = vec_xl(1, srcPix0);
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(33, srcPix0); /* offset = width*2 + 1 = (width << 1) + 1 */
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask));
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+ vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+
+
+ vec_xst(vec_perm(v_filter_u8, srv1, mask0), 0, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask1), 16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask2), 32, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask3), 48, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask4), 64, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask5), 80, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask6), 96, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask7), 112, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask8), 128, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask9), 144, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask10), 160, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask11), 176, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask12), 192, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask13), 208, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask14), 224, dst);
+ vec_xst(vec_perm(v_filter_u8, srv1, mask15), 240, dst);
+ }
+ else{
+ vec_xst(srv1, 0, dst);
+ vec_xst(srv1, 16, dst);
+ vec_xst(srv1, 32, dst);
+ vec_xst(srv1, 48, dst);
+ vec_xst(srv1, 64, dst);
+ vec_xst(srv1, 80, dst);
+ vec_xst(srv1, 96, dst);
+ vec_xst(srv1, 112, dst);
+ vec_xst(srv1, 128, dst);
+ vec_xst(srv1, 144, dst);
+ vec_xst(srv1, 160, dst);
+ vec_xst(srv1, 176, dst);
+ vec_xst(srv1, 192, dst);
+ vec_xst(srv1, 208, dst);
+ vec_xst(srv1, 224, dst);
+ vec_xst(srv1, 240, dst);
+ }
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 26>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t srv = vec_xl(1, srcPix0); /* above row starts at srcPix0 + 1 */
+ vec_u8_t srv1 = vec_xl(17, srcPix0);
+
+ if (bFilter){
+ LOAD_ZERO;
+ vec_u8_t tmp_v = vec_xl(0, srcPix0);
+ vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask));
+ vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask));
+ vec_u8_t srcv1 = vec_xl(65, srcPix0);
+ vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh));
+ vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl));
+ vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v );
+ vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v );
+
+ vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16);
+ vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16);
+ vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum));
+ vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum));
+ vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16);
+
+ vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+ vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst);
+ vec_xst(srv1, 16, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask1), 32, dst);
+ vec_xst(srv1, 48, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask2), 64, dst);
+ vec_xst(srv1, 80, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask3), 96, dst);
+ vec_xst(srv1, 112, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask4), 128, dst);
+ vec_xst(srv1, 144, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask5), 160, dst);
+ vec_xst(srv1, 176, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask6), 192, dst);
+ vec_xst(srv1, 208, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask7), 224, dst);
+ vec_xst(srv1, 240, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask8), 256, dst);
+ vec_xst(srv1, 272, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask9), 288, dst);
+ vec_xst(srv1, 304, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask10), 320, dst);
+ vec_xst(srv1, 336, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask11), 352, dst);
+ vec_xst(srv1, 368, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask12), 384, dst);
+ vec_xst(srv1, 400, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask13), 416, dst);
+ vec_xst(srv1, 432, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask14), 448, dst);
+ vec_xst(srv1, 464, dst);
+ vec_xst(vec_perm(v_filter_u8, srv, mask15), 480, dst);
+ vec_xst(srv1, 496, dst);
+
+ vec_u8_t srcv2 = vec_xl(81, srcPix0);
+ vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh));
+ vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl));
+ vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v );
+ vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v );
+ vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16);
+ vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16);
+ vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum));
+ vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum));
+ vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16);
+
+ vec_xst(vec_perm(v2_filter_u8, srv, mask0), 512, dst);
+ vec_xst(srv1, 528, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask1), 544, dst);
+ vec_xst(srv1, 560, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask2), 576, dst);
+ vec_xst(srv1, 592, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask3), 608, dst);
+ vec_xst(srv1, 624, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask4), 640, dst);
+ vec_xst(srv1, 656, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask5), 672, dst);
+ vec_xst(srv1, 688, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask6), 704, dst);
+ vec_xst(srv1, 720, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask7), 736, dst);
+ vec_xst(srv1, 752, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask8), 768, dst);
+ vec_xst(srv1, 784, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask9), 800, dst);
+ vec_xst(srv1, 816, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask10), 832, dst);
+ vec_xst(srv1, 848, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask11), 864, dst);
+ vec_xst(srv1, 880, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask12), 896, dst);
+ vec_xst(srv1, 912, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask13), 928, dst);
+ vec_xst(srv1, 944, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask14), 960, dst);
+ vec_xst(srv1, 976, dst);
+ vec_xst(vec_perm(v2_filter_u8, srv, mask15), 992, dst);
+ vec_xst(srv1, 1008, dst);
+
+ }
+ else{
+ int offset = 0;
+ for(int i=0; i<32; i++){
+ vec_xst(srv, offset, dst);
+ vec_xst(srv1, 16+offset, dst);
+ offset += 32;
+ }
+ }
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 3>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 3 (same offset/fraction tables as mode 33):
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06, 0x04, 0x05, 0x06, 0x07};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref = srcPix0 + 2*width + 1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8};
+ vec_u8_t vfrac4_32 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24};
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
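The interpolation spelled out in the comments above can be sanity-checked with a scalar model. This is an illustrative sketch, not x265 code; `angular_interp` is a made-up name, and the test values mirror the vfrac4 constants:

```cpp
#include <cstdint>

// dst[x] = ((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5,
// the rounding blend each vec_mule/vec_mulo pair computes per lane.
static inline uint8_t angular_interp(const uint8_t* ref, int off, int x, int frac)
{
    return (uint8_t)(((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5);
}
```

With frac = 0 the blend degenerates to a plain copy of ref[off + x], which is why rows with fraction 0 need no multiply at all.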
+
+template<>
+void one_ang_pred_altivec<8, 3>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //mode 3 (same offset/fraction tables as mode 33):
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d};
+ vec_u8_t mask7={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref = srcPix0 + 2*width + 1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */
+
+ vec_u8_t vfrac8_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac8_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv2, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_1);
+ vmle1 = vec_mule(srv3, vfrac8_1);
+ vmlo1 = vec_mulo(srv3, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv4, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv4, vfrac8_32_2);
+ vmle1 = vec_mule(srv5, vfrac8_2);
+ vmlo1 = vec_mulo(srv5, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv6, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv6, vfrac8_32_3);
+ vmle1 = vec_mule(srv7, vfrac8_3);
+ vmlo1 = vec_mulo(srv7, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
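The vec_mule/vec_mulo split used throughout these functions is easy to misread: the even and odd byte lanes are widened and multiplied separately, then vec_mergeh/vec_mergel re-interleave the two 16-bit results back into source order before vec_pack narrows to bytes. A hedged scalar model of that ordering (illustrative names, not x265 code, assuming an even lane count):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Models vec_mule/vec_mulo + vec_mergeh/vec_mergel + vec_pack: even lanes and
// odd lanes of (f32*a + f*b + 16) >> 5 are computed separately, then
// re-interleaved so the output matches a direct per-lane computation.
static std::vector<uint8_t> mule_mulo_model(const std::vector<uint8_t>& a,
                                            const std::vector<uint8_t>& f32,
                                            const std::vector<uint8_t>& b,
                                            const std::vector<uint8_t>& f)
{
    size_t n = a.size();
    std::vector<uint16_t> even, odd;
    for (size_t i = 0; i < n; i += 2)   // vec_mule: even byte lanes -> u16
        even.push_back((uint16_t)((f32[i] * a[i] + f[i] * b[i] + 16) >> 5));
    for (size_t i = 1; i < n; i += 2)   // vec_mulo: odd byte lanes -> u16
        odd.push_back((uint16_t)((f32[i] * a[i] + f[i] * b[i] + 16) >> 5));
    std::vector<uint8_t> out(n);        // vec_mergeh/vec_mergel + vec_pack
    for (size_t i = 0; i < n / 2; i++)
    {
        out[2 * i]     = (uint8_t)even[i];
        out[2 * i + 1] = (uint8_t)odd[i];
    }
    return out;
}
```

The merge step is what restores pixel order; packing `ve` and `vo` without the mergeh/mergel pair would leave all even pixels followed by all odd pixels.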
+
+template<>
+void one_ang_pred_altivec<16, 3>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5);
+
+ ...
+
+ y=15; off15 = offset[15]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5);
+ }
+ */
+ //mode 3 (same offset/fraction tables as mode 33):
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+
+vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv8, srv9, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
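All of the `one_line()` vector steps above evaluate the same two-tap filter that the block comments sketch. As a minimal scalar reference (an illustration, not part of the patch; `angularBlend` is a hypothetical name), assuming 8-bit pixels:

```cpp
#include <cstdint>

// Scalar sketch of the per-pixel blend each one_line() vector step computes:
//   dst = ((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5
// i.e. a two-tap linear interpolation between two neighbouring reference
// samples with 5 fractional bits and round-to-nearest (+16 before >> 5).
static inline uint8_t angularBlend(uint8_t left, uint8_t right, int frac)
{
    return (uint8_t)(((32 - frac) * left + frac * right + 16) >> 5);
}
```

frac = 0 reproduces the left sample exactly, and the two weights always sum to 32, which is why the kernels keep a `vfrac16_*` vector alongside its `vfrac16_32_*` complement.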
+
+template<>
+void one_ang_pred_altivec<32, 3>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+
+ ...
+
+ y=15; off15 = offset[15]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[15]* ref[off15 + 31] + f[15] * ref[off15 + 32] + 16) >> 5);
+
+ ...
+
+ y=31; off31 = offset[31]; x=0-31; off31 = 26;
+ dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off31 + 1] + f[31] * ref[off31 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off31 + 2] + f[31] * ref[off31 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off31 + 3] + f[31] * ref[off31 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+ //mode 33:
+ //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26};
+ //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0};
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv1 = vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv2 = vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+ vec_u8_t sv3 = vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15);
+
+ vec_u8_t srv000 = sv2;
+ vec_u8_t srv100 = vec_perm(sv2, sv3, mask1);
+ vec_u8_t srv200 = vec_perm(sv2, sv3, mask2);
+ vec_u8_t srv300 = vec_perm(sv2, sv3, mask3);
+ vec_u8_t srv400 = vec_perm(sv2, sv3, mask4);
+ vec_u8_t srv500 = vec_perm(sv2, sv3, mask5);
+ vec_u8_t srv600 = vec_perm(sv2, sv3, mask6);
+ vec_u8_t srv700 = vec_perm(sv2, sv3, mask7);
+ vec_u8_t srv800 = vec_perm(sv2, sv3, mask8);
+ vec_u8_t srv900 = vec_perm(sv2, sv3, mask9);
+ vec_u8_t srva00 = vec_perm(sv2, sv3, mask10);
+ vec_u8_t srvb00 = vec_perm(sv2, sv3, mask11);
+
+vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_16 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_17 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_18 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_19 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_20 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_21 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_22 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_23 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_24 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_25 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_26 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_27 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_28 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_29 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_30 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv20, srv30, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv30, srv40, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv40, srv50, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv40, srv50, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv50, srv60, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv60, srv70, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv70, srv80, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv8, srv9, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv80, srv90, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv80, srv90, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv90, srva0, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srva0, srvb0, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srvb0, srvc0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srvc0, srvd0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srvd0, srve0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srvd, srve, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srvd0, srve0, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srve, srvf, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srve0, srvf0, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srvf, srv00, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srvf0, srv000, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv00, srv10, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv000, srv100, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv10, srv20, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv100, srv200, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv10, srv20, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv100, srv200, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv20, srv30, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv200, srv300, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv30, srv40, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv300, srv400, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv40, srv50, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv400, srv500, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv50, srv60, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv500, srv600, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv50, srv60, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv500, srv600, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv60, srv70, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv600, srv700, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv70, srv80, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv700, srv800, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv800, srv900, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv90, srva0, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv900, srva00, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srva0, srvb0, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srva00, srvb00, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
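The hard-coded offset[]/fraction[] tables in the comments above are not arbitrary: they follow from stepping the reference position by the prediction angle (26 for this mode pair) with 5 fractional bits per row. A small sketch, assuming the HEVC-style projection pos = (y + 1) * angle (helper names are hypothetical):

```cpp
// Sketch (assumption): reconstruct the commented offset[]/fraction[] tables
// from the prediction angle. For angle 26, row y advances the reference
// position by 26/32 of a sample:
//   pos = (y + 1) * 26;  offset[y] = pos >> 5;  fraction[y] = pos & 31.
static inline int angOffset(int y)   { return ((y + 1) * 26) >> 5; }
static inline int angFraction(int y) { return ((y + 1) * 26) & 31; }
```

This reproduces, for example, offset[5] = 4 with fraction[5] = 28 (which is why two consecutive `one_line()` calls can reuse the same srv4/srv5 source vectors) and fraction[15] = 0, matching the all-zero `vfrac16_15` lane vector.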
+
+template<>
+void one_ang_pred_altivec<4, 4>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+vec_u8_t vfrac4 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20};
+vec_u8_t vfrac4_32 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12};
+
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
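For the mode-4 kernels the packed fraction constants come from an angle step of 21, and every `vfrac4` lane pairs with a 32-complement lane in `vfrac4_32`, so the two blend weights always sum to 32. A sketch (assumption: the same (y + 1) * angle projection with 5 fractional bits; helper names are hypothetical):

```cpp
// Sketch (assumption): mode-4 fractions derive from an angle step of 21,
//   fraction[y] = ((y + 1) * 21) & 31,
// and each "32 - fraction" lane in vfrac4_32 is the complement, so the pair
// of weights applied by vec_mule/vec_mulo always sums to 32.
static inline int m4Fraction(int y)   { return ((y + 1) * 21) & 31; }
static inline int m4Complement(int y) { return 32 - m4Fraction(y); }
```

This matches the per-row values packed four at a time into vfrac4 = {21, 10, 31, 20} and vfrac4_32 = {11, 22, 1, 12} above.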
+
+template<>
+void one_ang_pred_altivec<8, 4>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask5={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */
+
+vec_u8_t vfrac8_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac8_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac8_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8};
+
+vec_u8_t vfrac8_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv1, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_1);
+ vmle1 = vec_mule(srv2, vfrac8_1);
+ vmlo1 = vec_mulo(srv2, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv3, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv3, vfrac8_32_2);
+ vmle1 = vec_mule(srv4, vfrac8_2);
+ vmlo1 = vec_mulo(srv4, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+ //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv5, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv5, vfrac8_32_3);
+ vmle1 = vec_mule(srv6, vfrac8_3);
+ vmlo1 = vec_mulo(srv6, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
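The kernels above vectorize, two rows per 16-byte register, the scalar HEVC angular-interpolation formula quoted in the comments. As a cross-check, a minimal scalar reference can be written directly from that formula (the helper name `angular_pred_ref` is hypothetical, not an x265 symbol; `off` and `f` stand for the per-row offset and fraction tables of the mode):

```cpp
#include <cstdint>

// Scalar reference for the filtered angular prediction implemented above:
//   dst[y*stride + x] = ((32 - f[y]) * ref[off[y] + x]
//                        + f[y] * ref[off[y] + x + 1] + 16) >> 5
static void angular_pred_ref(uint8_t* dst, int stride,
                             const uint8_t* ref,
                             const int* off, const int* f,
                             int width)
{
    for (int y = 0; y < width; y++)
        for (int x = 0; x < width; x++)
            dst[y * stride + x] =
                (uint8_t)(((32 - f[y]) * ref[off[y] + x] +
                           f[y] * ref[off[y] + x + 1] + 16) >> 5);
}
```

The vector versions must produce byte-identical output to this loop for each block size and mode.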
+
+template<>
+void one_ang_pred_altivec<16, 4>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[0..15], ref = srcPix0 + 33 */
+    vec_u8_t sv1 = vec_xl(49, srcPix0); /* ref[16..31] */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+
+vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[offset[y] + x] + f[y] * ref[offset[y] + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
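Each `one_line` step relies on the fact that `vec_mule`/`vec_mulo` split the even and odd byte lanes into two widened halfword vectors, and that `vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo))` re-interleaves them back into the original byte order. A scalar model of that lane dance (the name `one_line_model` is illustrative, not an x265 symbol):

```cpp
#include <array>
#include <cstdint>

// Scalar model of vec_mule/vec_mulo + vec_mergeh/vec_mergel + vec_pack:
// even lanes are filtered into ve[], odd lanes into vo[], then the
// merge/pack pair restores out[2j] = ve[j], out[2j+1] = vo[j].
static std::array<uint8_t, 16> one_line_model(const std::array<uint8_t, 16>& a,
                                              const std::array<uint8_t, 16>& b,
                                              const std::array<uint8_t, 16>& f32,
                                              const std::array<uint8_t, 16>& f)
{
    uint16_t ve[8], vo[8];
    for (int j = 0; j < 8; j++)
    {
        // vec_mule feeds even byte lanes, vec_mulo feeds odd byte lanes
        ve[j] = (uint16_t)((f32[2 * j] * a[2 * j] +
                            f[2 * j] * b[2 * j] + 16) >> 5);
        vo[j] = (uint16_t)((f32[2 * j + 1] * a[2 * j + 1] +
                            f[2 * j + 1] * b[2 * j + 1] + 16) >> 5);
    }
    std::array<uint8_t, 16> out;
    // vec_pack(vec_mergeh(ve,vo), vec_mergel(ve,vo)): interleave and narrow
    for (int j = 0; j < 8; j++)
    {
        out[2 * j]     = (uint8_t)ve[j];
        out[2 * j + 1] = (uint8_t)vo[j];
    }
    return out;
}
```

With constant fractions every output lane reduces to `((32 - f) * a[i] + f * b[i] + 16) >> 5`, i.e. the interleave is transparent to the caller.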
+
+template<>
+void one_ang_pred_altivec<32, 4>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t sv0 = vec_xl(65, srcPix0);  /* ref[0..15], ref = srcPix0 + 65 */
+    vec_u8_t sv1 = vec_xl(81, srcPix0);  /* ref[16..31] */
+    vec_u8_t sv2 = vec_xl(97, srcPix0);  /* ref[32..47] */
+    vec_u8_t sv3 = vec_xl(113, srcPix0); /* ref[48..63] */
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15);
+
+ vec_u8_t srv000 = sv2;
+ vec_u8_t srv100 = vec_perm(sv2, sv3, mask1);
+ vec_u8_t srv200 = vec_perm(sv2, sv3, mask2);
+ vec_u8_t srv300 = vec_perm(sv2, sv3, mask3);
+ vec_u8_t srv400 = vec_perm(sv2, sv3, mask4);
+ vec_u8_t srv500 = vec_perm(sv2, sv3, mask5);
+ vec_u8_t srv600 = vec_perm(sv2, sv3, mask6);
+
+vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[offset[y] + x] + f[y] * ref[offset[y] + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv30, srv40, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv40, srv50, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv50, srv60, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv50, srv60, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv60, srv70, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv70, srv80, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv70, srv80, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv80, srv90, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv90, srva0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv90, srva0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srva0, srvb0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srvb, srvc, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srvb0, srvc0, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srvb, srvc, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srvb0, srvc0, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srvc, srvd, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srvc0, srvd0, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srvd, srve, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srvd0, srve0, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srvd, srve, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srvd0, srve0, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srve, srvf, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srve0, srvf0, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srvf, srv00, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srvf0, srv000, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srvf, srv00, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srvf0, srv000, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv00, srv10, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv000, srv100, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv10, srv20, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv100, srv200, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv10, srv20, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv100, srv200, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv20, srv30, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv200, srv300, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv30, srv40, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv300, srv400, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv30, srv40, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv300, srv400, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv40, srv50, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv400, srv500, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv50, srv60, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv500, srv600, vfrac16_32_31, vfrac16_31, vout_31);
+    //offset[16..31] = {11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21};
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
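The hard-coded `offset[]`/`fraction[]` tables quoted in the comments (and baked into the `vfrac16_*` constants) follow the usual HEVC angular-prediction rule: row y uses `((y + 1) * angle) >> 5` as the integer offset and `((y + 1) * angle) & 31` as the fraction. The angle value 21 below is inferred from matching the tables in this file; the per-mode angles themselves come from the HEVC specification:

```cpp
// Rebuild the per-row offset/fraction tables for one angular mode.
// offset[y]   = ((y + 1) * angle) >> 5   -- integer step into ref[]
// fraction[y] = ((y + 1) * angle) & 31   -- 1/32-pel interpolation weight
static void build_ang_tables(int angle, int height, int* offset, int* fraction)
{
    for (int y = 0; y < height; y++)
    {
        int pos = (y + 1) * angle;
        offset[y]   = pos >> 5;
        fraction[y] = pos & 31;
    }
}
```

For angle 21 this reproduces the commented tables: offsets 0, 1, 1, 2, 3, 3, ... and fractions 21, 10, 31, 20, 9, ... down to fraction 0 at y = 31.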
+
+template<>
+void one_ang_pred_altivec<4, 5>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-3;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ }
+ */
+ //mode 31:
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+ //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+    vec_u8_t srv = vec_xl(9, srcPix0); /* ref = srcPix0 + 9; the per-row offsets {0, 1, 1, 2} are baked into the masks */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4}; /* fraction[0-3] */
+ vec_u8_t vfrac4_32 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28}; /* 32 - fraction[0-3] */
+
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[offset[y] + x] + f[y] * ref[offset[y] + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
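In the 4x4 kernel all four rows share one 16-byte load, and the per-row offsets are realized purely by the `vec_perm` masks: each mask byte selects one source byte, so `mask0` gathers `ref[offset[y] + x]` for the four rows (offsets {0, 1, 1, 2} for this mode) into a single vector. A scalar model of the single-source permute (`perm16` is an illustrative name, not an x265 symbol):

```cpp
#include <cstdint>

// Scalar model of vec_perm(srv, srv, mask) with both inputs equal:
// each mask byte picks one byte of the 16-byte source (indices here
// are all < 16, so the wrap across the two vec_perm operands is moot).
static void perm16(const uint8_t* src, const uint8_t* mask, uint8_t* out)
{
    for (int i = 0; i < 16; i++)
        out[i] = src[mask[i] & 0x0f];
}
```

Applying `mask0 = {0,1,2,3, 1,2,3,4, 1,2,3,4, 2,3,4,5}` to the reference row therefore yields the four 4-pixel rows at offsets 0, 1, 1 and 2 in one shot.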
+
+template<>
+void one_ang_pred_altivec<8, 5>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 8] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 8] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 8] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[3]* ref[off3 + 7] + f[3] * ref[off3 + 8] + 16) >> 5);
+
+ ...
+
+ y=7; off7 = offset[7]; x=0-7;
+ dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 7] = (pixel)((f32[7]* ref[off7 + 7] + f[7] * ref[off7 + 8] + 16) >> 5);
+ }
+ */
+ //mode 5 (intra angle 17; shares its offset/fraction tables with mode 31):
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+ //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0};
+
+ vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv = vec_xl(17, srcPix0); /* ref[offset + x], ref = srcPix0 + 17; offset[0-7] = {0, 1, 1, 2, 2, 3, 3, 4} */
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* offsets 0, 1 */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* offsets 1, 2 */
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* offsets 2, 3 */
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* offsets 3, 4 */
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* offsets 4, 5 */
+
+ vec_u8_t vfrac8_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac8_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac8_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ /* y0, y1 */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+
+ /* y2, y3 */
+ vmle0 = vec_mule(srv1, vfrac8_32_1);
+ vmlo0 = vec_mulo(srv1, vfrac8_32_1);
+ vmle1 = vec_mule(srv2, vfrac8_1);
+ vmlo1 = vec_mulo(srv2, vfrac8_1);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y4, y5 */
+ vmle0 = vec_mule(srv2, vfrac8_32_2);
+ vmlo0 = vec_mulo(srv2, vfrac8_32_2);
+ vmle1 = vec_mule(srv3, vfrac8_2);
+ vmlo1 = vec_mulo(srv3, vfrac8_2);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ /* y6, y7 */
+ vmle0 = vec_mule(srv3, vfrac8_32_3);
+ vmlo0 = vec_mulo(srv3, vfrac8_32_3);
+ vmle1 = vec_mule(srv4, vfrac8_3);
+ vmlo1 = vec_mulo(srv4, vfrac8_3);
+ vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ ve = vec_sra(vsume, u16_5);
+ vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
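The per-pixel computation that the intrinsics above batch up can be collapsed into a small scalar reference. This is a hypothetical helper (not part of x265) that derives the per-row offset and fraction directly from the intra angle; with angle 17 it reproduces the mode-5 offset[]/fraction[] tables in the comments above:

```cpp
#include <cassert>
#include <cstdint>

// Scalar sketch of the interpolation the AltiVec code vectorizes:
//   dst[y*width + x] = ((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5
// For intra angle 17 (mode 5): off = ((y + 1) * 17) >> 5, f = ((y + 1) * 17) & 31,
// which matches the commented offset[]/fraction[] tables.
static void one_ang_pred_scalar(uint8_t* dst, const uint8_t* ref,
                                int width, int angle)
{
    for (int y = 0; y < width; y++)
    {
        int pos = (y + 1) * angle;
        int off = pos >> 5;  // integer step along the reference row
        int f = pos & 31;    // 1/32-pel fractional weight
        for (int x = 0; x < width; x++)
            dst[y * width + x] =
                (uint8_t)(((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5);
    }
}
```

Since the two weights always sum to 32, a constant reference row must predict that same constant, which is a quick sanity check for the table-driven vector paths.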
+
+template<>
+void one_ang_pred_altivec<16, 5>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5);
+
+ y=3; off3 = offset[3]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5);
+
+ ...
+
+ y=15; off15 = offset[15]; x=0-15;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5);
+ }
+ */
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(33, srcPix0); /* ref[offset + x], ref = srcPix0 + 33; offset[0-15] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8} */
+ vec_u8_t sv1 = vec_xl(49, srcPix0); /* next 16 bytes of the same reference row */
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+
+ vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
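The one_line macro used above leans on AltiVec's even/odd widening multiplies. As a rough scalar model (a hypothetical helper, not the actual macro): vec_mule/vec_mulo multiply the even/odd byte lanes into 16-bit products, the +16 bias and >>5 happen at 16-bit precision, and vec_mergeh/vec_mergel followed by vec_pack restore the original lane order:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the one_line pattern (hypothetical helper, not part of
// x265). a0 holds ref[off + x], a1 holds ref[off + x + 1]; f32 = 32 - f.
static void one_line_model(const uint8_t a0[16], const uint8_t a1[16],
                           uint8_t f32, uint8_t f, uint8_t out[16])
{
    uint16_t ve[8], vo[8];
    for (int i = 0; i < 8; i++)
    {
        // even lanes (vec_mule) and odd lanes (vec_mulo), widened to 16 bits
        ve[i] = (uint16_t)((f32 * a0[2*i]     + f * a1[2*i]     + 16) >> 5);
        vo[i] = (uint16_t)((f32 * a0[2*i + 1] + f * a1[2*i + 1] + 16) >> 5);
    }
    // vec_mergeh/vec_mergel + vec_pack: interleave even/odd results back
    for (int i = 0; i < 8; i++)
    {
        out[2*i]     = (uint8_t)ve[i];
        out[2*i + 1] = (uint8_t)vo[i];
    }
}
```

The 16-bit intermediates matter: with 8-bit pixels, f32 * a0[i] + f * a1[i] + 16 can reach 32 * 255 + 16, which overflows a byte but fits comfortably in 16 bits.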
+
+template<>
+void one_ang_pred_altivec<32, 5>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ /*
+ for (int y = 0; y < width; y++)
+ {
+ y=0; off0 = offset[0]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 32] + 16) >> 5);
+
+ y=1; off1 = offset[1]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 32] + 16) >> 5);
+
+ y=2; off2 = offset[2]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 32] + 16) >> 5);
+
+ ...
+
+ y=15; off15 = offset[15]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[15]* ref[off15 + 31] + f[15] * ref[off15 + 32] + 16) >> 5);
+
+ ...
+
+ y=31; off31 = offset[31]; x=0-31;
+ dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5);
+ dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off31 + 1] + f[31] * ref[off31 + 2] + 16) >> 5);
+ dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off31 + 2] + f[31] * ref[off31 + 3] + 16) >> 5);
+ dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off31 + 3] + f[31] * ref[off31 + 4] + 16) >> 5);
+ ...
+ dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 32] + 16) >> 5);
+ }
+ */
+ //mode 5 (intra angle 17; shares its offset/fraction tables with mode 31):
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+ //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0};
+
+ //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f};
+ vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10};
+ vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11};
+ vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12};
+ vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t sv0 = vec_xl(65, srcPix0); /* ref[offset + x], ref = srcPix0 + 65; offsets as in the table above */
+ vec_u8_t sv1 = vec_xl(81, srcPix0); /* next 16 bytes of the same reference row */
+ vec_u8_t sv2 = vec_xl(97, srcPix0); /* next 16 bytes */
+ vec_u8_t sv3 = vec_xl(113, srcPix0); /* next 16 bytes */
+
+ vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(sv0, sv1, mask1);
+ vec_u8_t srv2 = vec_perm(sv0, sv1, mask2);
+ vec_u8_t srv3 = vec_perm(sv0, sv1, mask3);
+ vec_u8_t srv4 = vec_perm(sv0, sv1, mask4);
+ vec_u8_t srv5 = vec_perm(sv0, sv1, mask5);
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6);
+ vec_u8_t srv7 = vec_perm(sv0, sv1, mask7);
+ vec_u8_t srv8 = vec_perm(sv0, sv1, mask8);
+ vec_u8_t srv9 = vec_perm(sv0, sv1, mask9);
+ vec_u8_t srva = vec_perm(sv0, sv1, mask10);
+ vec_u8_t srvb = vec_perm(sv0, sv1, mask11);
+ vec_u8_t srvc = vec_perm(sv0, sv1, mask12);
+ vec_u8_t srvd = vec_perm(sv0, sv1, mask13);
+ vec_u8_t srve = vec_perm(sv0, sv1, mask14);
+ vec_u8_t srvf = vec_perm(sv0, sv1, mask15);
+
+ vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv10 = vec_perm(sv1, sv2, mask1);
+ vec_u8_t srv20 = vec_perm(sv1, sv2, mask2);
+ vec_u8_t srv30 = vec_perm(sv1, sv2, mask3);
+ vec_u8_t srv40 = vec_perm(sv1, sv2, mask4);
+ vec_u8_t srv50 = vec_perm(sv1, sv2, mask5);
+ vec_u8_t srv60 = vec_perm(sv1, sv2, mask6);
+ vec_u8_t srv70 = vec_perm(sv1, sv2, mask7);
+ vec_u8_t srv80 = vec_perm(sv1, sv2, mask8);
+ vec_u8_t srv90 = vec_perm(sv1, sv2, mask9);
+ vec_u8_t srva0 = vec_perm(sv1, sv2, mask10);
+ vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11);
+ vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12);
+ vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13);
+ vec_u8_t srve0 = vec_perm(sv1, sv2, mask14);
+ vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15);
+
+ vec_u8_t srv000 = sv2;
+ vec_u8_t srv100 = vec_perm(sv2, sv3, mask1);
+ vec_u8_t srv200 = vec_perm(sv2, sv3, mask2);
+
+
+ vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+ vec_u8_t vfrac16_32_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+ vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+ vec_u8_t vfrac16_32_21 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+ vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+ vec_u8_t vfrac16_32_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+ vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+ vec_u8_t vfrac16_32_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_30 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+ vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+ //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17};
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv30, srv40, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv40, srv50, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv40, srv50, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv50, srv60, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv50, srv60, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv60, srv70, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv60, srv70, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv70, srv80, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv70, srv80, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv80, srv90, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv9, srva, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv90, srva0, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv9, srva, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv90, srva0, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srva, srvb, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srva0, srvb0, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srva, srvb, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srva0, srvb0, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srvb, srvc, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srvb0, srvc0, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srvb, srvc, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srvb0, srvc0, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srvc, srvd, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srvc0, srvd0, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srvc, srvd, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srvc0, srvd0, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srvd, srve, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srvd0, srve0, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srvd, srve, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srvd0, srve0, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srve, srvf, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srve0, srvf0, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srve, srvf, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srve0, srvf0, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srvf, srv00, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srvf0, srv000, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srvf, srv00, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srvf0, srv000, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv00, srv10, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv000, srv100, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv10, srv20, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv100, srv200, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void one_ang_pred_altivec<4, 17>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3};
+ vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24};
+ vec_u8_t vfrac4_32 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8};
+
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32);
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4);
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
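The `vec_mule`/`vec_mulo` pair above widens the 8-bit multiplies into 16-bit even and odd lanes, which are summed, rounded, shifted, and re-interleaved by `vec_pack(vec_mergeh(...), vec_mergel(...))`. Per output pixel this all reduces to a two-tap blend; a minimal scalar model of that step (`angular_blend` is a name introduced here for illustration, not an x265 function):

```cpp
#include <cstdint>

// Scalar model of the per-pixel arithmetic that the vec_mule/vec_mulo,
// vec_add, vec_sra sequence performs: blend two neighbouring reference
// pixels with a 5-bit fraction, out = ((32 - frac) * a + frac * b + 16) >> 5.
static inline uint8_t angular_blend(uint8_t a, uint8_t b, int frac)
{
    return (uint8_t)(((32 - frac) * a + frac * b + 16) >> 5);
}
```

The vector code simply evaluates sixteen of these blends at once, with `vfrac4` holding `frac` and `vfrac4_32` holding `32 - frac` per lane.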
+
+template<>
+void one_ang_pred_altivec<8, 17>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, };
+ vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, };
+ vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+ vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+ vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+ vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+ vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+ vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+ /* fraction[0-7] */
+ vec_u8_t vfrac8_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac8_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac8_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac8_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* 32 - fraction[0-7] */
+ vec_u8_t vfrac8_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac8_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac8_32_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac8_32_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+ one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+ one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+ one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 17>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+ vec_u8_t mask1={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+ vec_u8_t mask2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+ vec_u8_t mask3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+ vec_u8_t mask4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+ //vec_u8_t mask5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+ vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+ vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+ vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+ vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+ //vec_u8_t mask10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+ vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+ vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+ vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+ vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+ //vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+ vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(36, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = vec_perm(s0, s1, mask8);
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0;
+ vec_u8_t srv2_add1 = srv1;
+ vec_u8_t srv3_add1 = srv2;
+ vec_u8_t srv4_add1 = srv3;
+ vec_u8_t srv5_add1 = srv3;
+ vec_u8_t srv6_add1 = srv4;
+ vec_u8_t srv7_add1 = srv6;
+ vec_u8_t srv8_add1 = srv7;
+ vec_u8_t srv9_add1 = srv8;
+ vec_u8_t srv10_add1 = srv8;
+ vec_u8_t srv11_add1 = srv9;
+ vec_u8_t srv12_add1 = srv11;
+ vec_u8_t srv13_add1 = srv12;
+ vec_u8_t srv14_add1 = srv13;
+ vec_u8_t srv15_add1 = srv13;
+
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
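Each `one_line()` invocation above evaluates one row of the formula quoted in the comment, with the per-row reference offset baked into the `srvN`/`srvN_add1` permutes and the per-row fraction baked into the `vfrac16_*` constants. A minimal scalar sketch of that loop structure (the names `angular_pred_scalar`, `off`, and `frac` are illustrative, not x265's actual tables, which come from the mode's projected reference):

```cpp
#include <cstdint>

// Scalar sketch of the angular prediction the AltiVec code unrolls:
// each row y samples the (pre-projected) reference at a per-row offset
// and blends neighbouring pixels with a 5-bit fraction, i.e.
//   dst[y * w + x] = ((32 - frac[y]) * ref[off[y] + x]
//                     + frac[y] * ref[off[y] + x + 1] + 16) >> 5
void angular_pred_scalar(uint8_t* dst, const uint8_t* ref,
                         const int* off, const int* frac, int width)
{
    for (int y = 0; y < width; y++)
        for (int x = 0; x < width; x++)
            dst[y * width + x] =
                (uint8_t)(((32 - frac[y]) * ref[off[y] + x]
                           + frac[y] * ref[off[y] + x + 1] + 16) >> 5);
}
```

The vector version replaces the inner `x` loop with 16-lane arithmetic and the `off[y]` indexing with precomputed `vec_perm` masks, which is why rows whose offsets coincide (e.g. `srv5 = srv4`) can share a permute result.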
+
+template<>
+void one_ang_pred_altivec<32, 17>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask2={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask11={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask12={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask13={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask14={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+//vec_u8_t mask15={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask16={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+
+vec_u8_t mask17={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask18={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask19={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask20={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask21={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask22={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask23={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask24={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask25={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask26={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask27={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask28={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1f, 0x1e, 0x1c, 0x1b, 0x1a, 0x19, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11, 0x10, 0xf, 0xe, 0xc };
+ vec_u8_t refmask_32_1={0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(71, srcPix0);
+ vec_u8_t s3 = vec_xl(87, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv1 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv2 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv4 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = vec_perm(s1, s2, mask6);
+ vec_u8_t srv7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv8 = vec_perm(s1, s2, mask8);
+ vec_u8_t srv9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = s1;
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv16_0 = vec_perm(s2, s3, mask0);
+ vec_u8_t srv16_1 = vec_perm(s2, s3, mask1);
+ vec_u8_t srv16_2 = vec_perm(s2, s3, mask2);
+ vec_u8_t srv16_3 = vec_perm(s2, s3, mask3);
+ vec_u8_t srv16_4 = vec_perm(s2, s3, mask4);
+ vec_u8_t srv16_5 = srv16_4;
+ vec_u8_t srv16_6 = vec_perm(s2, s3, mask6);
+ vec_u8_t srv16_7 = vec_perm(s2, s3, mask7);
+ vec_u8_t srv16_8 = vec_perm(s2, s3, mask8);
+ vec_u8_t srv16_9 = vec_perm(s2, s3, mask9);
+ vec_u8_t srv16_10 = srv16_9;
+ vec_u8_t srv16_11 = s2;
+ vec_u8_t srv16_12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv16_14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv16_15 = srv16_14;
+ /* per-row source-vector indices; repeated indices mark rows that reuse the previous row's vec_perm result */
+
+ vec_u8_t srv16 = vec_perm(s0, s1, mask16);
+ vec_u8_t srv17 = vec_perm(s0, s1, mask17);
+ vec_u8_t srv18 = vec_perm(s0, s1, mask18);
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = vec_perm(s0, s1, mask20);
+ vec_u8_t srv21 = srv20;
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = vec_perm(s0, s1, mask23);
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = vec_perm(s0, s1, mask25);
+ vec_u8_t srv26 = srv25;
+ vec_u8_t srv27 = vec_perm(s0, s1, mask27);
+ vec_u8_t srv28 = vec_perm(s0, s1, mask28);
+ vec_u8_t srv29 = vec_perm(s0, s1, mask29);
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16);
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask17);
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask18);
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = vec_perm(s1, s2, mask20);
+ vec_u8_t srv16_21 = srv16_20;
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = vec_perm(s1, s2, mask23);
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = vec_perm(s1, s2, mask25);
+ vec_u8_t srv16_26 = srv16_25;
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask27);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask28);
+ vec_u8_t srv16_29 = vec_perm(s1, s2, mask29);
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv1add1 = srv0;
+ vec_u8_t srv2add1 = srv1;
+ vec_u8_t srv3add1 = srv2;
+ vec_u8_t srv4add1 = srv3;
+ vec_u8_t srv5add1 = srv3;
+ vec_u8_t srv6add1 = srv4;
+ vec_u8_t srv7add1 = srv6;
+ vec_u8_t srv8add1 = srv7;
+ vec_u8_t srv9add1 = srv8;
+ vec_u8_t srv10add1 = srv8;
+ vec_u8_t srv11add1 = srv9;
+ vec_u8_t srv12add1 = srv11;
+ vec_u8_t srv13add1 = srv12;
+ vec_u8_t srv14add1 = srv13;
+ vec_u8_t srv15add1 = srv13;
+
+ /* per-row "+1" source-vector indices; repeated indices reuse an already computed vector */
+
+ vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16_0;
+ vec_u8_t srv16add1_2 = srv16_1;
+ vec_u8_t srv16add1_3 = srv16_2;
+ vec_u8_t srv16add1_4 = srv16_3;
+ vec_u8_t srv16add1_5 = srv16_3;
+ vec_u8_t srv16add1_6 = srv16_4;
+ vec_u8_t srv16add1_7 = srv16_6;
+ vec_u8_t srv16add1_8 = srv16_7;
+ vec_u8_t srv16add1_9 = srv16_8;
+ vec_u8_t srv16add1_10 = srv16_8;
+ vec_u8_t srv16add1_11 = srv16_9;
+ vec_u8_t srv16add1_12 = srv16_11;
+ vec_u8_t srv16add1_13 = srv16_12;
+ vec_u8_t srv16add1_14 = srv16_13;
+ vec_u8_t srv16add1_15 = srv16_13;
+
+ vec_u8_t srv16add1 = srv14;
+ vec_u8_t srv17add1 = srv16;
+ vec_u8_t srv18add1 = srv17;
+ vec_u8_t srv19add1 = srv18;
+ vec_u8_t srv20add1 = srv19;
+ vec_u8_t srv21add1 = srv19;
+ vec_u8_t srv22add1 = srv20;
+ vec_u8_t srv23add1 = srv22;
+ vec_u8_t srv24add1 = srv23;
+ vec_u8_t srv25add1 = srv24;
+ vec_u8_t srv26add1 = srv24;
+ vec_u8_t srv27add1 = srv25;
+ vec_u8_t srv28add1 = srv27;
+ vec_u8_t srv29add1 = srv28;
+ vec_u8_t srv30add1 = srv29;
+ vec_u8_t srv31add1 = srv29;
+
+ vec_u8_t srv16add1_16 = srv16_14;
+ vec_u8_t srv16add1_17 = srv16_16;
+ vec_u8_t srv16add1_18 = srv16_17;
+ vec_u8_t srv16add1_19 = srv16_18;
+ vec_u8_t srv16add1_20 = srv16_19;
+ vec_u8_t srv16add1_21 = srv16_19;
+ vec_u8_t srv16add1_22 = srv16_20;
+ vec_u8_t srv16add1_23 = srv16_22;
+ vec_u8_t srv16add1_24 = srv16_23;
+ vec_u8_t srv16add1_25 = srv16_24;
+ vec_u8_t srv16add1_26 = srv16_24;
+ vec_u8_t srv16add1_27 = srv16_25;
+ vec_u8_t srv16add1_28 = srv16_27;
+ vec_u8_t srv16add1_29 = srv16_28;
+ vec_u8_t srv16add1_30 = srv16_29;
+ vec_u8_t srv16add1_31 = srv16_29;
+
+ /* fraction[0-15] */
+ vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ /* 32 - fraction[0-15] */
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 16>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+ vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+/*
+ vec_u8_t srv_left=vec_xl(8, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_4={0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* per-mode row offsets: y=0 uses offset[0], y=1 offset[1], y=2 offset[2], ... */
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12};
+ vec_u8_t vfrac4_32 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 16>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, };
+vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+/*
+ vec_u8_t srv_left=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+vec_u8_t vfrac8_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac8_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac8_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24};
+
+vec_u8_t vfrac8_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 16>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask1={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask2={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask3={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t maskadd1_0={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+/*vec_u8_t maskadd1_1={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t maskadd1_2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t maskadd1_3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t maskadd1_4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_9={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+*/
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left=vec_xl(32, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(6, srcPix0);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(38, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0;
+ vec_u8_t srv2_add1 = srv0;
+ vec_u8_t srv3_add1 = srv1;
+ vec_u8_t srv4_add1 = srv3;
+ vec_u8_t srv5_add1 = srv3;
+ vec_u8_t srv6_add1 = srv4;
+ vec_u8_t srv7_add1 = srv6;
+ vec_u8_t srv8_add1 = srv6;
+ vec_u8_t srv9_add1 = srv7;
+ vec_u8_t srv10_add1 = srv9;
+ vec_u8_t srv11_add1 = srv9;
+ vec_u8_t srv12_add1 = srv10;
+ vec_u8_t srv13_add1 = srv12;
+ vec_u8_t srv14_add1 = srv12;
+ vec_u8_t srv15_add1 = srv13;
+vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 16>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask7={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+//vec_u8_t mask8={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask9={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask10={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+//vec_u8_t mask11={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask13={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+//vec_u8_t mask14={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask15={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+
+vec_u8_t mask16={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask17={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask18={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask19={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask20={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask21={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask22={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask23={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask24={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask26={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t refmask_32_0 = {0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8, };
+ vec_u8_t refmask_32_1 = {0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t srv_left0=vec_xl(64, srcPix0);
+ vec_u8_t srv_left1=vec_xl(80, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(12, srcPix0);
+ vec_u8_t s3 = vec_xl(16+12, srcPix0);
+*/
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8};
+ vec_u8_t refmask_32_1={0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(76, srcPix0);
+ vec_u8_t s3 = vec_xl(92, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv1 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv4 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = s1;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv16_0 = vec_perm(s2, s3, mask0);
+ vec_u8_t srv16_1 = vec_perm(s2, s3, mask1);
+ vec_u8_t srv16_2 = srv16_1;
+ vec_u8_t srv16_3 = vec_perm(s2, s3, mask3);
+ vec_u8_t srv16_4 = vec_perm(s2, s3, mask4);
+ vec_u8_t srv16_5 = srv16_4;
+ vec_u8_t srv16_6 = s2;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv16_10 = vec_perm(s1, s2, mask10);
+ vec_u8_t srv16_11 = srv16_10;
+ vec_u8_t srv16_12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv16_14 = srv16_13;
+ vec_u8_t srv16_15 = vec_perm(s1, s2, mask15);
+
+ // row-to-source-vector map for all 32 rows (repeated entries mark rows sharing a vector):
+ // 0(1,2),1,1,3,4,4,6(1),7(0,1),7,9,10,10,12,13,13,15,16,16,18,19,19,21,22,22,24,25,25,27,28,28,30,30
+
+ vec_u8_t srv16 = vec_perm(s0, s1, mask16);
+ vec_u8_t srv17 = srv16;
+ vec_u8_t srv18 = vec_perm(s0, s1, mask18);
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = srv19;
+ vec_u8_t srv21 = vec_perm(s0, s1, mask21);
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = srv22;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = vec_perm(s0, s1, mask25);
+ vec_u8_t srv26 = srv25;
+ vec_u8_t srv27 = vec_perm(s0, s1, mask27);
+ vec_u8_t srv28 = vec_perm(s0, s1, mask28);
+ vec_u8_t srv29 = srv28;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16);
+ vec_u8_t srv16_17 = srv16_16;
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask18);
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = srv16_19;
+ vec_u8_t srv16_21 = vec_perm(s1, s2, mask21);
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = srv16_22;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = vec_perm(s1, s2, mask25);
+ vec_u8_t srv16_26 = srv16_25;
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask27);
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask28);
+ vec_u8_t srv16_29 = srv16_28;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv1add1 = srv0;
+ vec_u8_t srv2add1 = srv0;
+ vec_u8_t srv3add1 = srv1;
+ vec_u8_t srv4add1 = srv3;
+ vec_u8_t srv5add1 = srv3;
+ vec_u8_t srv6add1 = srv4;
+ vec_u8_t srv7add1 = s1;
+ vec_u8_t srv8add1 = s1;
+ vec_u8_t srv9add1 = srv7;
+ vec_u8_t srv10add1 = srv9;
+ vec_u8_t srv11add1 = srv9;
+ vec_u8_t srv12add1 = srv10;
+ vec_u8_t srv13add1 = srv12;
+ vec_u8_t srv14add1 = srv12;
+ vec_u8_t srv15add1 = srv13;
+
+ vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16_0;
+ vec_u8_t srv16add1_2 = srv16_0;
+ vec_u8_t srv16add1_3 = srv16_1;
+ vec_u8_t srv16add1_4 = srv16_3;
+ vec_u8_t srv16add1_5 = srv16_3;
+ vec_u8_t srv16add1_6 = srv16_4;
+ vec_u8_t srv16add1_7 = s2;
+ vec_u8_t srv16add1_8 = s2;
+ vec_u8_t srv16add1_9 = srv16_7;
+ vec_u8_t srv16add1_10 = srv16_9;
+ vec_u8_t srv16add1_11 = srv16_9;
+ vec_u8_t srv16add1_12 = srv16_10;
+ vec_u8_t srv16add1_13 = srv16_12;
+ vec_u8_t srv16add1_14 = srv16_12;
+ vec_u8_t srv16add1_15 = srv16_13;
+
+ // row-to-"+1" source-vector map (repeated entries mark rows sharing a vector):
+ // 0,0,1,3,3,4,6(0),6,7,9,9,10,12,12,13,15,15,16,18,18,19,21,21,22,24,24,25,27,27,28,28
+
+ vec_u8_t srv16add1 = srv15;
+ vec_u8_t srv17add1 = srv15;
+ vec_u8_t srv18add1 = srv16;
+ vec_u8_t srv19add1 = srv18;
+ vec_u8_t srv20add1 = srv18;
+ vec_u8_t srv21add1 = srv19;
+ vec_u8_t srv22add1 = srv21;
+ vec_u8_t srv23add1 = srv21;
+ vec_u8_t srv24add1 = srv22;
+ vec_u8_t srv25add1 = srv24;
+ vec_u8_t srv26add1 = srv24;
+ vec_u8_t srv27add1 = srv25;
+ vec_u8_t srv28add1 = srv27;
+ vec_u8_t srv29add1 = srv27;
+ vec_u8_t srv30add1 = srv28;
+ vec_u8_t srv31add1 = srv28;
+
+ vec_u8_t srv16add1_16 = srv16_15;
+ vec_u8_t srv16add1_17 = srv16_15;
+ vec_u8_t srv16add1_18 = srv16_16;
+ vec_u8_t srv16add1_19 = srv16_18;
+ vec_u8_t srv16add1_20 = srv16_18;
+ vec_u8_t srv16add1_21 = srv16_19;
+ vec_u8_t srv16add1_22 = srv16_21;
+ vec_u8_t srv16add1_23 = srv16_21;
+ vec_u8_t srv16add1_24 = srv16_22;
+ vec_u8_t srv16add1_25 = srv16_24;
+ vec_u8_t srv16add1_26 = srv16_24;
+ vec_u8_t srv16add1_27 = srv16_25;
+ vec_u8_t srv16add1_28 = srv16_27;
+ vec_u8_t srv16add1_29 = srv16_27;
+ vec_u8_t srv16add1_30 = srv16_28;
+ vec_u8_t srv16add1_31 = srv16_28;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_20 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 15>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+ vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+/*
+ vec_u8_t srv_left=vec_xl(8, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t vfrac4 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28};
+ vec_u8_t vfrac4_32 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 15>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+/*
+ vec_u8_t srv_left=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+vec_u8_t vfrac8_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac8_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac8_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac8_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 15>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask2={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask4={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask7={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+/*vec_u8_t maskadd1_1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t maskadd1_3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_5={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+/*
+ vec_u8_t srv_left=vec_xl(32, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(8, srcPix0);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(40, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = srv5;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+ vec_u8_t srv12= srv11;
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0;
+ vec_u8_t srv2_add1 = srv0;
+ vec_u8_t srv3_add1 = srv1;
+ vec_u8_t srv4_add1 = srv1;
+ vec_u8_t srv5_add1 = srv3;
+ vec_u8_t srv6_add1 = srv3;
+ vec_u8_t srv7_add1 = srv5;
+ vec_u8_t srv8_add1 = srv5;
+ vec_u8_t srv9_add1 = srv7;
+ vec_u8_t srv10_add1 = srv7;
+ vec_u8_t srv11_add1 = srv9;
+ vec_u8_t srv12_add1= srv9;
+ vec_u8_t srv13_add1 = srv11;
+ vec_u8_t srv14_add1 = srv11;
+ vec_u8_t srv15_add1 = srv13;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 15>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+//vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask1={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+//vec_u8_t mask2={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, };
+vec_u8_t mask3={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+//vec_u8_t mask4={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, };
+vec_u8_t mask5={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+//vec_u8_t mask6={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+vec_u8_t mask7={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+//vec_u8_t mask8={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask9={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+//vec_u8_t mask10={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask11={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+//vec_u8_t mask12={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask13={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask14={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask15={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+
+vec_u8_t mask16={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask17={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask18={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask19={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask20={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask21={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask22={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask23={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask24={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left0=vec_xl(64, srcPix0);
+ vec_u8_t srv_left1=vec_xl(80, srcPix0);
+ vec_u8_t refmask_32 = {0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2};
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32);
+ vec_u8_t s1 = vec_xl(0, srcPix0);
+ vec_u8_t s2 = vec_xl(16, srcPix0);
+ vec_u8_t s3 = vec_xl(32, srcPix0);
+ */
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2};
+ vec_u8_t refmask_32_1={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0);
+ vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1);
+ vec_u8_t s2 = vec_xl(80, srcPix0);
+ vec_u8_t s3 = vec_xl(96, srcPix0);
+
+ vec_u8_t srv0 = s1;
+ vec_u8_t srv1 = vec_perm(s0, s1, mask1);
+ vec_u8_t srv2 = srv1;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = vec_perm(s0, s1, mask5);
+ vec_u8_t srv6 = srv5;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = vec_perm(s0, s1, mask11);
+    vec_u8_t srv12 = srv11;
+ vec_u8_t srv13 = vec_perm(s0, s1, mask13);
+ vec_u8_t srv14 = srv13;
+ vec_u8_t srv15 = vec_perm(s0, s1, mask15);
+
+ vec_u8_t srv16_0 = s2;
+ vec_u8_t srv16_1 = vec_perm(s1, s2, mask1);
+ vec_u8_t srv16_2 = srv16_1;
+ vec_u8_t srv16_3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv16_4 = srv16_3;
+ vec_u8_t srv16_5 = vec_perm(s1, s2, mask5);
+ vec_u8_t srv16_6 = srv16_5;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv16_10 = srv16_9;
+ vec_u8_t srv16_11 = vec_perm(s1, s2, mask11);
+    vec_u8_t srv16_12 = srv16_11;
+ vec_u8_t srv16_13 = vec_perm(s1, s2, mask13);
+ vec_u8_t srv16_14 = srv16_13;
+ vec_u8_t srv16_15 = vec_perm(s1, s2, mask15);
+
+ //s1, 1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,s0,s0
+
+ vec_u8_t srv16 = vec_perm(s0, s1, mask16);
+ vec_u8_t srv17 = srv16;
+ vec_u8_t srv18 = vec_perm(s0, s1, mask18);
+ vec_u8_t srv19 = srv18;
+ vec_u8_t srv20 = vec_perm(s0, s1, mask20);
+ vec_u8_t srv21 = srv20;
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = srv22;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = srv24;
+ vec_u8_t srv26 = vec_perm(s0, s1, mask26);
+ vec_u8_t srv27 = srv26;
+ vec_u8_t srv28 = vec_perm(s0, s1, mask28);
+ vec_u8_t srv29 = srv28;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = vec_perm(s1, s2, mask16);
+ vec_u8_t srv16_17 = srv16_16;
+ vec_u8_t srv16_18 = vec_perm(s1, s2, mask18);
+ vec_u8_t srv16_19 = srv16_18;
+ vec_u8_t srv16_20 = vec_perm(s1, s2, mask20);
+ vec_u8_t srv16_21 = srv16_20;
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = srv16_22;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = srv16_24;
+ vec_u8_t srv16_26 = vec_perm(s1, s2, mask26);
+ vec_u8_t srv16_27 = srv16_26;
+ vec_u8_t srv16_28 = vec_perm(s1, s2, mask28);
+ vec_u8_t srv16_29 = srv16_28;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv1add1 = s1;
+ vec_u8_t srv2add1 = s1;
+ vec_u8_t srv3add1 = srv1;
+ vec_u8_t srv4add1 = srv1;
+ vec_u8_t srv5add1 = srv3;
+ vec_u8_t srv6add1 = srv3;
+ vec_u8_t srv7add1 = srv6;
+ vec_u8_t srv8add1 = srv6;
+ vec_u8_t srv9add1 = srv7;
+ vec_u8_t srv10add1 = srv7;
+ vec_u8_t srv11add1 = srv9;
+    vec_u8_t srv12add1 = srv9;
+ vec_u8_t srv13add1 = srv11;
+ vec_u8_t srv14add1 = srv11;
+ vec_u8_t srv15add1 = srv14;
+
+ vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0);
+ vec_u8_t srv16add1_1 = s2;
+ vec_u8_t srv16add1_2 = s2;
+ vec_u8_t srv16add1_3 = srv16_1;
+ vec_u8_t srv16add1_4 = srv16_1;
+ vec_u8_t srv16add1_5 = srv16_3;
+ vec_u8_t srv16add1_6 = srv16_3;
+ vec_u8_t srv16add1_7 = srv16_6;
+ vec_u8_t srv16add1_8 = srv16_6;
+ vec_u8_t srv16add1_9 = srv16_7;
+ vec_u8_t srv16add1_10 = srv16_7;
+ vec_u8_t srv16add1_11 = srv16_9;
+    vec_u8_t srv16add1_12 = srv16_9;
+ vec_u8_t srv16add1_13 = srv16_11;
+ vec_u8_t srv16add1_14 = srv16_11;
+ vec_u8_t srv16add1_15 = srv16_14;
+
+ //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,
+
+ vec_u8_t srv16add1 = srv15;
+ vec_u8_t srv17add1 = srv15;
+ vec_u8_t srv18add1 = srv16;
+ vec_u8_t srv19add1 = srv16;
+ vec_u8_t srv20add1 = srv18;
+ vec_u8_t srv21add1 = srv18;
+ vec_u8_t srv22add1 = srv20;
+ vec_u8_t srv23add1 = srv20;
+ vec_u8_t srv24add1 = srv22;
+ vec_u8_t srv25add1 = srv22;
+ vec_u8_t srv26add1 = srv24;
+ vec_u8_t srv27add1 = srv24;
+ vec_u8_t srv28add1 = srv26;
+ vec_u8_t srv29add1 = srv26;
+ vec_u8_t srv30add1 = srv28;
+ vec_u8_t srv31add1 = srv28;
+
+ vec_u8_t srv16add1_16 = srv16_15;
+ vec_u8_t srv16add1_17 = srv16_15;
+ vec_u8_t srv16add1_18 = srv16_16;
+ vec_u8_t srv16add1_19 = srv16_16;
+ vec_u8_t srv16add1_20 = srv16_18;
+ vec_u8_t srv16add1_21 = srv16_18;
+ vec_u8_t srv16add1_22 = srv16_20;
+ vec_u8_t srv16add1_23 = srv16_20;
+ vec_u8_t srv16add1_24 = srv16_22;
+ vec_u8_t srv16add1_25 = srv16_22;
+ vec_u8_t srv16add1_26 = srv16_24;
+ vec_u8_t srv16add1_27 = srv16_24;
+ vec_u8_t srv16add1_28 = srv16_26;
+ vec_u8_t srv16add1_29 = srv16_26;
+ vec_u8_t srv16add1_30 = srv16_28;
+ vec_u8_t srv16add1_31 = srv16_28;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_21 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_30 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void one_ang_pred_altivec<4, 14>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, };
+ vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, };
+
+/*
+ vec_u8_t srv_left=vec_xl(8, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_4={0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t vfrac4 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12};
+ vec_u8_t vfrac4_32 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 14>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, };
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+/*
+ vec_u8_t srv_left=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+vec_u8_t vfrac8_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac8_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac8_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 14>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+/*vec_u8_t maskadd1_1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t maskadd1_2={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t maskadd1_4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_7={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left=vec_xl(32, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(10, srcPix0);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(42, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = srv2;
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = srv4;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = srv9;
+    vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0_add1;
+ vec_u8_t srv2_add1 = srv0;
+ vec_u8_t srv3_add1 = srv0;
+ vec_u8_t srv4_add1 = srv2;
+ vec_u8_t srv5_add1 = srv2;
+ vec_u8_t srv6_add1 = srv2;
+ vec_u8_t srv7_add1 = srv4;
+ vec_u8_t srv8_add1 = srv4;
+ vec_u8_t srv9_add1 = srv7;
+ vec_u8_t srv10_add1 = srv7;
+ vec_u8_t srv11_add1 = srv7;
+    vec_u8_t srv12_add1 = srv9;
+ vec_u8_t srv13_add1 = srv9;
+ vec_u8_t srv14_add1 = srv12;
+ vec_u8_t srv15_add1 = srv12;
+vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 14>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+//vec_u8_t mask1={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, };
+vec_u8_t mask2={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+//vec_u8_t mask3={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, };
+vec_u8_t mask4={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+//vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+//vec_u8_t mask6={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };
+vec_u8_t mask7={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+//vec_u8_t mask8={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+vec_u8_t mask9={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask10={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask11={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask12={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask13={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask14={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask15={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+
+//vec_u8_t mask16={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask17={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask18={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask19={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask20={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask21={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask22={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask23={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask24={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask25={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left0 = vec_xl(64, srcPix0);
+ vec_u8_t srv_left1 = vec_xl(80, srcPix0);
+ vec_u8_t srv_right = vec_xl(0, srcPix0);
+ vec_u8_t refmask_32_0 ={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x10, 0x11, 0x12, 0x13};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(4, srcPix0);
+ vec_u8_t s2 = vec_xl(20, srcPix0);
+ */
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x10, 0x11, 0x12};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(68, srcPix0);
+ vec_u8_t s2 = vec_xl(84, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = vec_perm(s0, s1, mask2);
+ vec_u8_t srv3 = srv2;
+ vec_u8_t srv4 = vec_perm(s0, s1, mask4);
+ vec_u8_t srv5 = srv4;
+ vec_u8_t srv6 = srv4;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = vec_perm(s0, s1, mask9);
+ vec_u8_t srv10 = srv9;
+ vec_u8_t srv11 = srv9;
+ vec_u8_t srv12= vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv16_1 = srv16_0;
+ vec_u8_t srv16_2 = vec_perm(s1, s2, mask2);
+ vec_u8_t srv16_3 = srv16_2;
+ vec_u8_t srv16_4 = vec_perm(s1, s2, mask4);
+ vec_u8_t srv16_5 = srv16_4;
+ vec_u8_t srv16_6 = srv16_4;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = vec_perm(s1, s2, mask9);
+ vec_u8_t srv16_10 = srv16_9;
+ vec_u8_t srv16_11 = srv16_9;
+ vec_u8_t srv16_12= vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = srv16_12;
+ vec_u8_t srv16_14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv16_15 = srv16_14;
+
+ //0(0,1),0,2,2,4,4,4,7,7,9,9,9,12,12,14,14,14,17,17,19,19,19,22,22,24,24,24,27,27,s0,s0,s0
+
+ vec_u8_t srv16 = srv14;
+ vec_u8_t srv17 = vec_perm(s0, s1, mask17);
+ vec_u8_t srv18 = srv17;
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = srv19;
+ vec_u8_t srv21 = srv19;
+ vec_u8_t srv22 = vec_perm(s0, s1, mask22);
+ vec_u8_t srv23 = srv22;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = srv24;
+ vec_u8_t srv26 = srv24;
+ vec_u8_t srv27 = vec_perm(s0, s1, mask27);
+ vec_u8_t srv28 = srv27;
+ vec_u8_t srv29 = s0;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = srv16_14;
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask17);
+ vec_u8_t srv16_18 = srv16_17;
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = srv16_19;
+ vec_u8_t srv16_21 = srv16_19;
+ vec_u8_t srv16_22 = vec_perm(s1, s2, mask22);
+ vec_u8_t srv16_23 = srv16_22;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = srv16_24;
+ vec_u8_t srv16_26 = srv16_24;
+ vec_u8_t srv16_27 = vec_perm(s1, s2, mask27);
+ vec_u8_t srv16_28 = srv16_27;
+ vec_u8_t srv16_29 = s1;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1add1 = srv0add1;
+ vec_u8_t srv2add1 = srv0;
+ vec_u8_t srv3add1 = srv0;
+ vec_u8_t srv4add1 = srv2;
+ vec_u8_t srv5add1 = srv2;
+ vec_u8_t srv6add1 = srv2;
+ vec_u8_t srv7add1 = srv4;
+ vec_u8_t srv8add1 = srv4;
+ vec_u8_t srv9add1 = srv7;
+ vec_u8_t srv10add1 = srv7;
+ vec_u8_t srv11add1 = srv7;
+ vec_u8_t srv12add1= srv9;
+ vec_u8_t srv13add1 = srv9;
+ vec_u8_t srv14add1 = srv12;
+ vec_u8_t srv15add1 = srv12;
+
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16add1_0;
+ vec_u8_t srv16add1_2 = srv16_0;
+ vec_u8_t srv16add1_3 = srv16_0;
+ vec_u8_t srv16add1_4 = srv16_2;
+ vec_u8_t srv16add1_5 = srv16_2;
+ vec_u8_t srv16add1_6 = srv16_2;
+ vec_u8_t srv16add1_7 = srv16_4;
+ vec_u8_t srv16add1_8 = srv16_4;
+ vec_u8_t srv16add1_9 = srv16_7;
+ vec_u8_t srv16add1_10 = srv16_7;
+ vec_u8_t srv16add1_11 = srv16_7;
+ vec_u8_t srv16add1_12= srv16_9;
+ vec_u8_t srv16add1_13 = srv16_9;
+ vec_u8_t srv16add1_14 = srv16_12;
+ vec_u8_t srv16add1_15 = srv16_12;
+
+ //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,
+ //0,0,2,2,2,4,4,7,7,7,9,9,12,12,12,14,14,17,17,17,19,19,22,22,22,24,24,27,27,27,
+
+ vec_u8_t srv16add1 = srv12;
+ vec_u8_t srv17add1 = srv14;
+ vec_u8_t srv18add1 = srv14;
+ vec_u8_t srv19add1 = srv17;
+ vec_u8_t srv20add1 = srv17;
+ vec_u8_t srv21add1 = srv17;
+ vec_u8_t srv22add1 = srv19;
+ vec_u8_t srv23add1 = srv19;
+ vec_u8_t srv24add1 = srv22;
+ vec_u8_t srv25add1 = srv22;
+ vec_u8_t srv26add1 = srv22;
+ vec_u8_t srv27add1 = srv24;
+ vec_u8_t srv28add1 = srv24;
+ vec_u8_t srv29add1 = srv27;
+ vec_u8_t srv30add1 = srv27;
+ vec_u8_t srv31add1 = srv27;
+
+ vec_u8_t srv16add1_16 = srv16_12;
+ vec_u8_t srv16add1_17 = srv16_14;
+ vec_u8_t srv16add1_18 = srv16_14;
+ vec_u8_t srv16add1_19 = srv16_17;
+ vec_u8_t srv16add1_20 = srv16_17;
+ vec_u8_t srv16add1_21 = srv16_17;
+ vec_u8_t srv16add1_22 = srv16_19;
+ vec_u8_t srv16add1_23 = srv16_19;
+ vec_u8_t srv16add1_24 = srv16_22;
+ vec_u8_t srv16add1_25 = srv16_22;
+ vec_u8_t srv16add1_26 = srv16_22;
+ vec_u8_t srv16add1_27 = srv16_24;
+ vec_u8_t srv16add1_28 = srv16_24;
+ vec_u8_t srv16add1_29 = srv16_27;
+ vec_u8_t srv16add1_30 = srv16_27;
+ vec_u8_t srv16add1_31 = srv16_27;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<4, 13>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, };
+ vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, };
+
+/*
+ vec_u8_t srv_left=vec_xl(8, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t vfrac4 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28};
+ vec_u8_t vfrac4_32 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ",dst[y * 4 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<8, 13>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, };
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+/*
+ vec_u8_t srv_left=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_8={0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+
+vec_u8_t vfrac8_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac8_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac8_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac8_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ",dst[y * 8 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 13>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+
+vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+/*vec_u8_t maskadd1_1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t maskadd1_3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t maskadd1_7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left=vec_xl(32, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(12, srcPix0);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(44, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = srv3;
+ vec_u8_t srv6 = srv3;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = srv7;
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+    vec_u8_t srv12 = srv10;
+ vec_u8_t srv13 = srv10;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0_add1;
+ vec_u8_t srv2_add1 = srv0_add1;
+ vec_u8_t srv3_add1 = srv0;
+ vec_u8_t srv4_add1 = srv0;
+ vec_u8_t srv5_add1 = srv0;
+ vec_u8_t srv6_add1 = srv0;
+ vec_u8_t srv7_add1 = srv3;
+ vec_u8_t srv8_add1 = srv3;
+ vec_u8_t srv9_add1 = srv3;
+ vec_u8_t srv10_add1 = srv7;
+ vec_u8_t srv11_add1 = srv7;
+    vec_u8_t srv12_add1 = srv7;
+ vec_u8_t srv13_add1 = srv7;
+ vec_u8_t srv14_add1 = srv10;
+ vec_u8_t srv15_add1 = srv10;
+vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ",dst[y * 16 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 13>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+//vec_u8_t mask2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, };
+vec_u8_t mask3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+//vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, };
+vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+//vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, };
+vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask11={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask12={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+//vec_u8_t mask13={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+vec_u8_t mask14={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+//vec_u8_t mask15={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+
+//vec_u8_t mask16={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask17={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask18={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask19={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+//vec_u8_t mask20={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask21={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask22={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+//vec_u8_t mask23={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+/*vec_u8_t mask25={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask26={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/
+vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left0 = vec_xl(64, srcPix0);
+ vec_u8_t srv_left1 = vec_xl(80, srcPix0);
+    vec_u8_t srv_right = vec_xl(0, srcPix0);
+ vec_u8_t refmask_32_0 ={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+    vec_u8_t s1 = vec_xl(8, srcPix0);
+ vec_u8_t s2 = vec_xl(24, srcPix0);
+*/
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(72, srcPix0);
+ vec_u8_t s2 = vec_xl(88, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = vec_perm(s0, s1, mask3);
+ vec_u8_t srv4 = srv3;
+ vec_u8_t srv5 = srv3;
+ vec_u8_t srv6 = srv3;
+ vec_u8_t srv7 = vec_perm(s0, s1, mask7);
+ vec_u8_t srv8 = srv7;
+ vec_u8_t srv9 = srv7;
+ vec_u8_t srv10 = vec_perm(s0, s1, mask10);
+ vec_u8_t srv11 = srv10;
+    vec_u8_t srv12 = srv10;
+ vec_u8_t srv13 = srv10;
+ vec_u8_t srv14 = vec_perm(s0, s1, mask14);
+ vec_u8_t srv15 = srv14;
+
+ //0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0
+
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv16_1 = srv16_0;
+ vec_u8_t srv16_2 = srv16_0;
+ vec_u8_t srv16_3 = vec_perm(s1, s2, mask3);
+ vec_u8_t srv16_4 = srv16_3;
+ vec_u8_t srv16_5 = srv16_3;
+ vec_u8_t srv16_6 = srv16_3;
+ vec_u8_t srv16_7 = vec_perm(s1, s2, mask7);
+ vec_u8_t srv16_8 = srv16_7;
+ vec_u8_t srv16_9 = srv16_7;
+ vec_u8_t srv16_10 = vec_perm(s1, s2, mask10);
+ vec_u8_t srv16_11 = srv16_10;
+    vec_u8_t srv16_12 = srv16_10;
+ vec_u8_t srv16_13 = srv16_10;
+ vec_u8_t srv16_14 = vec_perm(s1, s2, mask14);
+ vec_u8_t srv16_15 = srv16_14;
+
+ vec_u8_t srv16 = srv14;
+ vec_u8_t srv17 = vec_perm(s0, s1, mask17);
+ vec_u8_t srv18 = srv17;
+ vec_u8_t srv19 = srv17;
+ vec_u8_t srv20 = srv17;
+ vec_u8_t srv21 = vec_perm(s0, s1, mask21);
+ vec_u8_t srv22 = srv21;
+ vec_u8_t srv23 = srv21;
+ vec_u8_t srv24 = vec_perm(s0, s1, mask24);
+ vec_u8_t srv25 = srv24;
+ vec_u8_t srv26 = srv24;
+ vec_u8_t srv27 = srv24;
+ vec_u8_t srv28 = s0;
+ vec_u8_t srv29 = s0;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = srv16_14;
+ vec_u8_t srv16_17 = vec_perm(s1, s2, mask17);
+ vec_u8_t srv16_18 = srv16_17;
+ vec_u8_t srv16_19 = srv16_17;
+ vec_u8_t srv16_20 = srv16_17;
+ vec_u8_t srv16_21 = vec_perm(s1, s2, mask21);
+ vec_u8_t srv16_22 = srv16_21;
+ vec_u8_t srv16_23 = srv16_21;
+ vec_u8_t srv16_24 = vec_perm(s1, s2, mask24);
+ vec_u8_t srv16_25 = srv16_24;
+ vec_u8_t srv16_26 = srv16_24;
+ vec_u8_t srv16_27 = srv16_24;
+ vec_u8_t srv16_28 = s1;
+ vec_u8_t srv16_29 = s1;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1add1 = srv0add1;
+ vec_u8_t srv2add1 = srv0add1;
+ vec_u8_t srv3add1 = srv0;
+ vec_u8_t srv4add1 = srv0;
+ vec_u8_t srv5add1 = srv0;
+ vec_u8_t srv6add1 = srv0;
+ vec_u8_t srv7add1 = srv3;
+ vec_u8_t srv8add1 = srv3;
+ vec_u8_t srv9add1 = srv3;
+ vec_u8_t srv10add1 = srv7;
+ vec_u8_t srv11add1 = srv7;
+    vec_u8_t srv12add1 = srv7;
+ vec_u8_t srv13add1 = srv7;
+ vec_u8_t srv14add1 = srv10;
+ vec_u8_t srv15add1 = srv10;
+ //0,0,0,0,3,3,3,7,7,7,7,10,10,10,14,14,14,14,17,17,17,21,21,21,21,24,24,24,24,
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16add1_0;
+ vec_u8_t srv16add1_2 = srv16add1_0;
+ vec_u8_t srv16add1_3 = srv16_0;
+ vec_u8_t srv16add1_4 = srv16_0;
+ vec_u8_t srv16add1_5 = srv16_0;
+ vec_u8_t srv16add1_6 = srv16_0;
+ vec_u8_t srv16add1_7 = srv16_3;
+ vec_u8_t srv16add1_8 = srv16_3;
+ vec_u8_t srv16add1_9 = srv16_3;
+ vec_u8_t srv16add1_10 = srv16_7;
+ vec_u8_t srv16add1_11 = srv16_7;
+    vec_u8_t srv16add1_12 = srv16_7;
+ vec_u8_t srv16add1_13 = srv16_7;
+ vec_u8_t srv16add1_14 = srv16_10;
+ vec_u8_t srv16add1_15 = srv16_10;
+
+ vec_u8_t srv16add1 = srv10;
+ vec_u8_t srv17add1 = srv14;
+ vec_u8_t srv18add1 = srv14;
+ vec_u8_t srv19add1 = srv14;
+ vec_u8_t srv20add1 = srv14;
+ vec_u8_t srv21add1 = srv17;
+ vec_u8_t srv22add1 = srv17;
+ vec_u8_t srv23add1 = srv17;
+ vec_u8_t srv24add1 = srv21;
+ vec_u8_t srv25add1 = srv21;
+ vec_u8_t srv26add1 = srv21;
+ vec_u8_t srv27add1 = srv21;
+ vec_u8_t srv28add1 = srv24;
+ vec_u8_t srv29add1 = srv24;
+ vec_u8_t srv30add1 = srv24;
+ vec_u8_t srv31add1 = srv24;
+
+ vec_u8_t srv16add1_16 = srv16_10;
+ vec_u8_t srv16add1_17 = srv16_14;
+ vec_u8_t srv16add1_18 = srv16_14;
+ vec_u8_t srv16add1_19 = srv16_14;
+ vec_u8_t srv16add1_20 = srv16_14;
+ vec_u8_t srv16add1_21 = srv16_17;
+ vec_u8_t srv16add1_22 = srv16_17;
+ vec_u8_t srv16add1_23 = srv16_17;
+ vec_u8_t srv16add1_24 = srv16_21;
+ vec_u8_t srv16add1_25 = srv16_21;
+ vec_u8_t srv16add1_26 = srv16_21;
+ vec_u8_t srv16add1_27 = srv16_21;
+ vec_u8_t srv16add1_28 = srv16_24;
+ vec_u8_t srv16add1_29 = srv16_24;
+ vec_u8_t srv16add1_30 = srv16_24;
+ vec_u8_t srv16add1_31 = srv16_24;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_29 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+ printf("%d ",dst[y * 32 + x] );
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void one_ang_pred_altivec<4, 12>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, };
+ vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, };
+
+ //vec_u8_t srv = vec_xl(0, srcPix0);
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+
+ vec_u8_t vfrac4 = (vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12};
+ vec_u8_t vfrac4_32 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20};
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32);
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4);
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+ printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
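The 4x4 kernel above implements the angular blend from its formula comment: vfrac4 holds the per-row fraction f[y] and vfrac4_32 holds 32 - f[y]. A minimal scalar model may make the vector constants easier to check (the function name is illustrative, and it assumes `ref` already holds the projected reference row that the `refmask_4` permute builds):

```cpp
#include <cstdint>

typedef uint8_t pixel;

// Scalar model of the 4x4 mode-12 blend: row y mixes ref[x] and ref[x + 1]
// with 5-bit weights (32 - f[y]) and f[y], adds the rounding constant 16,
// and shifts right by 5. frac[] holds the per-row fractions packed into vfrac4.
static void ang4_mode12_scalar(pixel* dst, const pixel* ref)
{
    static const int frac[4] = {27, 22, 17, 12};
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * 4 + x] =
                (pixel)(((32 - frac[y]) * ref[x] + frac[y] * ref[x + 1] + 16) >> 5);
}
```

With a constant reference row the output reproduces the input exactly, which is a quick sanity check on the weights summing to 32 and on the rounding.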
+
+template<>
+void one_ang_pred_altivec<8, 12>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask2={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, };
+vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+/*
+ vec_u8_t srv_left=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_8={0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+vec_u8_t vfrac8_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac8_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac8_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac8_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac8_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac8_32_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac8_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac8_32_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8};
+
+one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+ printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<16, 12>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+/*vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };*/
+vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+/*vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask8={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+vec_u8_t mask12={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+/*vec_u8_t mask13={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/
+
+vec_u8_t maskadd1_0={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+/*vec_u8_t maskadd1_1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t maskadd1_6={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t maskadd1_12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left=vec_xl(32, srcPix0);
+ vec_u8_t srv_right=vec_xl(0, srcPix0);
+ vec_u8_t refmask_16={0xd, 0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(14, srcPix0);
+*/
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(33, srcPix0);
+ vec_u8_t refmask_16={0xd, 0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(46, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = srv0;
+ vec_u8_t srv4 = srv0;
+ vec_u8_t srv5 = srv0;
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = srv6;
+ vec_u8_t srv8 = srv6;
+ vec_u8_t srv9 = srv6;
+ vec_u8_t srv10 = srv6;
+ vec_u8_t srv11 = srv6;
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = srv12;
+ vec_u8_t srv15 = srv12;
+
+ vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1_add1 = srv0_add1;
+ vec_u8_t srv2_add1 = srv0_add1;
+ vec_u8_t srv3_add1 = srv0_add1;
+ vec_u8_t srv4_add1 = srv0_add1;
+ vec_u8_t srv5_add1 = srv0_add1;
+ vec_u8_t srv6_add1 = srv0;
+ vec_u8_t srv7_add1 = srv0;
+ vec_u8_t srv8_add1 = srv0;
+ vec_u8_t srv9_add1 = srv0;
+ vec_u8_t srv10_add1 = srv0;
+ vec_u8_t srv11_add1 = srv0;
+ vec_u8_t srv12_add1 = srv6;
+ vec_u8_t srv13_add1 = srv6;
+ vec_u8_t srv14_add1 = srv6;
+ vec_u8_t srv15_add1 = srv6;
+vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+ printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 12>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+/*vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };
+vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };*/
+vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+/*vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };
+vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };*/
+vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+/*vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask15={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+
+vec_u8_t mask16={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask17={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+vec_u8_t mask18={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };*/
+vec_u8_t mask19={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+/*vec_u8_t mask20={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask21={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask22={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask23={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+vec_u8_t mask25={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask26={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask27={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };
+vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/
+
+vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+/*
+ vec_u8_t srv_left0 = vec_xl(64, srcPix0);
+ vec_u8_t srv_left1 = vec_xl(80, srcPix0);
+ vec_u8_t srv_right = vec_xl(0, srcPix0);
+ vec_u8_t refmask_32_0 ={0x1a, 0x13, 0xd, 0x6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
+ vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(12, srcPix0);
+ vec_u8_t s2 = vec_xl(28, srcPix0);
+*/
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x1a, 0x13, 0xd, 0x6, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(76, srcPix0);
+ vec_u8_t s2 = vec_xl(92, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv1 = srv0;
+ vec_u8_t srv2 = srv0;
+ vec_u8_t srv3 = srv0;
+ vec_u8_t srv4 = srv0;
+ vec_u8_t srv5 = srv0;
+ vec_u8_t srv6 = vec_perm(s0, s1, mask6);
+ vec_u8_t srv7 = srv6;
+ vec_u8_t srv8 = srv6;
+ vec_u8_t srv9 = srv6;
+ vec_u8_t srv10 = srv6;
+ vec_u8_t srv11 = srv6;
+ vec_u8_t srv12 = vec_perm(s0, s1, mask12);
+ vec_u8_t srv13 = srv12;
+ vec_u8_t srv14 = srv12;
+ vec_u8_t srv15 = srv12;
+
+ //0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0
+
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv16_1 = srv16_0;
+ vec_u8_t srv16_2 = srv16_0;
+ vec_u8_t srv16_3 = srv16_0;
+ vec_u8_t srv16_4 = srv16_0;
+ vec_u8_t srv16_5 = srv16_0;
+ vec_u8_t srv16_6 = vec_perm(s1, s2, mask6);
+ vec_u8_t srv16_7 = srv16_6;
+ vec_u8_t srv16_8 = srv16_6;
+ vec_u8_t srv16_9 = srv16_6;
+ vec_u8_t srv16_10 = srv16_6;
+ vec_u8_t srv16_11 = srv16_6;
+ vec_u8_t srv16_12 = vec_perm(s1, s2, mask12);
+ vec_u8_t srv16_13 = srv16_12;
+ vec_u8_t srv16_14 = srv16_12;
+ vec_u8_t srv16_15 = srv16_12;
+
+ vec_u8_t srv16 = srv12;
+ vec_u8_t srv17 = srv12;
+ vec_u8_t srv18 = srv12;
+ vec_u8_t srv19 = vec_perm(s0, s1, mask19);
+ vec_u8_t srv20 = srv19;
+ vec_u8_t srv21 = srv19;
+ vec_u8_t srv22 = srv19;
+ vec_u8_t srv23 = srv19;
+ vec_u8_t srv24 = srv19;
+ vec_u8_t srv25 = s0;
+ vec_u8_t srv26 = s0;
+ vec_u8_t srv27 = s0;
+ vec_u8_t srv28 = s0;
+ vec_u8_t srv29 = s0;
+ vec_u8_t srv30 = s0;
+ vec_u8_t srv31 = s0;
+
+ vec_u8_t srv16_16 = srv16_12;
+ vec_u8_t srv16_17 = srv16_12;
+ vec_u8_t srv16_18 = srv16_12;
+ vec_u8_t srv16_19 = vec_perm(s1, s2, mask19);
+ vec_u8_t srv16_20 = srv16_19;
+ vec_u8_t srv16_21 = srv16_19;
+ vec_u8_t srv16_22 = srv16_19;
+ vec_u8_t srv16_23 = srv16_19;
+ vec_u8_t srv16_24 = srv16_19;
+ vec_u8_t srv16_25 = s1;
+ vec_u8_t srv16_26 = s1;
+ vec_u8_t srv16_27 = s1;
+ vec_u8_t srv16_28 = s1;
+ vec_u8_t srv16_29 = s1;
+ vec_u8_t srv16_30 = s1;
+ vec_u8_t srv16_31 = s1;
+
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv1add1 = srv0add1;
+ vec_u8_t srv2add1 = srv0add1;
+ vec_u8_t srv3add1 = srv0add1;
+ vec_u8_t srv4add1 = srv0add1;
+ vec_u8_t srv5add1 = srv0add1;
+ vec_u8_t srv6add1 = srv0;
+ vec_u8_t srv7add1 = srv0;
+ vec_u8_t srv8add1 = srv0;
+ vec_u8_t srv9add1 = srv0;
+ vec_u8_t srv10add1 = srv0;
+ vec_u8_t srv11add1 = srv0;
+ vec_u8_t srv12add1 = srv6;
+ vec_u8_t srv13add1 = srv6;
+ vec_u8_t srv14add1 = srv6;
+ vec_u8_t srv15add1 = srv6;
+
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+ vec_u8_t srv16add1_1 = srv16add1_0;
+ vec_u8_t srv16add1_2 = srv16add1_0;
+ vec_u8_t srv16add1_3 = srv16add1_0;
+ vec_u8_t srv16add1_4 = srv16add1_0;
+ vec_u8_t srv16add1_5 = srv16add1_0;
+ vec_u8_t srv16add1_6 = srv16_0;
+ vec_u8_t srv16add1_7 = srv16_0;
+ vec_u8_t srv16add1_8 = srv16_0;
+ vec_u8_t srv16add1_9 = srv16_0;
+ vec_u8_t srv16add1_10 = srv16_0;
+ vec_u8_t srv16add1_11 = srv16_0;
+ vec_u8_t srv16add1_12 = srv16_6;
+ vec_u8_t srv16add1_13 = srv16_6;
+ vec_u8_t srv16add1_14 = srv16_6;
+ vec_u8_t srv16add1_15 = srv16_6;
+
+ vec_u8_t srv16add1 = srv6;
+ vec_u8_t srv17add1 = srv6;
+ vec_u8_t srv18add1 = srv6;
+ vec_u8_t srv19add1 = srv12;
+ vec_u8_t srv20add1 = srv12;
+ vec_u8_t srv21add1 = srv12;
+ vec_u8_t srv22add1 = srv12;
+ vec_u8_t srv23add1 = srv12;
+ vec_u8_t srv24add1 = srv12;
+ vec_u8_t srv25add1 = srv19;
+ vec_u8_t srv26add1 = srv19;
+ vec_u8_t srv27add1 = srv19;
+ vec_u8_t srv28add1 = srv19;
+ vec_u8_t srv29add1 = srv19;
+ vec_u8_t srv30add1 = srv19;
+ vec_u8_t srv31add1 = srv19;
+
+ vec_u8_t srv16add1_16 = srv16_6;
+ vec_u8_t srv16add1_17 = srv16_6;
+ vec_u8_t srv16add1_18 = srv16_6;
+ vec_u8_t srv16add1_19 = srv16_12;
+ vec_u8_t srv16add1_20 = srv16_12;
+ vec_u8_t srv16add1_21 = srv16_12;
+ vec_u8_t srv16add1_22 = srv16_12;
+ vec_u8_t srv16add1_23 = srv16_12;
+ vec_u8_t srv16add1_24 = srv16_12;
+ vec_u8_t srv16add1_25 = srv16_19;
+ vec_u8_t srv16add1_26 = srv16_19;
+ vec_u8_t srv16add1_27 = srv16_19;
+ vec_u8_t srv16add1_28 = srv16_19;
+ vec_u8_t srv16add1_29 = srv16_19;
+ vec_u8_t srv16add1_30 = srv16_19;
+ vec_u8_t srv16add1_31 = srv16_19;
+
+vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
+vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15};
+vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+vec_u8_t vfrac16_32_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25};
+vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3};
+vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13};
+vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23};
+vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
+vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11};
+vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+vec_u8_t vfrac16_32_16 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21};
+vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+vec_u8_t vfrac16_32_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31};
+vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+vec_u8_t vfrac16_32_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9};
+vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+vec_u8_t vfrac16_32_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19};
+vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+vec_u8_t vfrac16_32_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29};
+vec_u8_t vfrac16_32_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+vec_u8_t vfrac16_32_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
+vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+vec_u8_t vfrac16_32_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17};
+vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+vec_u8_t vfrac16_32_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27};
+vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+ /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0);
+ one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1);
+
+ one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2);
+ one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3);
+
+ one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4);
+ one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5);
+
+ one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6);
+ one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7);
+
+ one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8);
+ one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9);
+
+ one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10);
+ one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11);
+
+ one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12);
+ one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13);
+
+ one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14);
+ one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15);
+
+ one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16);
+ one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17);
+
+ one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18);
+ one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19);
+
+ one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20);
+ one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21);
+
+ one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22);
+ one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23);
+
+ one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24);
+ one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25);
+
+ one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26);
+ one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27);
+
+ one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28);
+ one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29);
+
+ one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30);
+ one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+            printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void one_ang_pred_altivec<4, 11>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, };
+ vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, };
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(9, srcPix0);
+ vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t vfrac4 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24};
+ vec_u8_t vfrac4_32 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8};
+
+ vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32);
+ vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32);
+ vec_u16_t vmle1 = vec_mule(srv1, vfrac4);
+ vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4);
+ vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
+ vec_u16_t ve = vec_sra(vsume, u16_5);
+ vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
+ vec_u16_t vo = vec_sra(vsumo, u16_5);
+ vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
+
+ vec_xst(vout, 0, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 4; y++)
+ {
+ for (int x = 0; x < 4; x++)
+ {
+            printf("%d ", dst[y * 4 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
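For reference, the scalar computation these AltiVec kernels vectorize is the two-tap weighted average noted in the formula comments: each output pixel blends `ref[x]` and `ref[x + 1]` with weights `(32 - frac)` and `frac`, plus rounding. A minimal standalone sketch (illustrative only, not x265's actual scalar path; the `frac` values 30/28/26/24 mirror the `vfrac4` constants above, and the row offset is taken as 0 as in this mode-11 4x4 path):

```cpp
#include <cstdint>

// Scalar model of the 4x4 interpolation: row y is a weighted average of
// ref[x] and ref[x + 1] with weights (32 - frac[y]) and frac[y], rounded
// and shifted exactly as in the vectorized one_line() computation.
static void ang_pred_scalar_4x4(uint8_t* dst, const uint8_t* ref, const int frac[4])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * 4 + x] =
                (uint8_t)(((32 - frac[y]) * ref[x] + frac[y] * ref[x + 1] + 16) >> 5);
}
```

The vectorized path computes the same products via `vec_mule`/`vec_mulo` (even/odd widening multiplies) and re-interleaves with `vec_mergeh`/`vec_mergel` before packing back to bytes.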
+
+template<>
+void one_ang_pred_altivec<8, 11>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+ vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+ vec_u8_t mask2={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+ vec_u8_t mask3={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+ vec_u8_t mask4={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+ vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+ vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, };
+ vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, };
+
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t vout_0, vout_1, vout_2, vout_3;
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0);
+ vec_u8_t srv_right=vec_xl(17, srcPix0);
+ vec_u8_t refmask_8={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, };
+ vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8);
+
+ vec_u8_t srv0 = vec_perm(srv, srv, mask0);
+ vec_u8_t srv1 = vec_perm(srv, srv, mask1);
+ vec_u8_t srv2 = vec_perm(srv, srv, mask2);
+ vec_u8_t srv3 = vec_perm(srv, srv, mask3);
+ vec_u8_t srv4 = vec_perm(srv, srv, mask4);
+ vec_u8_t srv5 = vec_perm(srv, srv, mask5);
+ vec_u8_t srv6 = vec_perm(srv, srv, mask6);
+ vec_u8_t srv7 = vec_perm(srv, srv, mask7);
+
+    vec_u8_t vfrac8_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac8_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac8_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac8_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac8_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac8_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac8_32_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac8_32_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16};
+
+    one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0);
+    one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1);
+    one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2);
+    one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 48, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 8; y++)
+ {
+ for (int x = 0; x < 8; x++)
+ {
+            printf("%d ", dst[y * 8 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+template<>
+void one_ang_pred_altivec<16, 11>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+vec_u8_t maskadd1_0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+
+ vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */
+ vec_u8_t refmask_16={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e};
+ vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16);
+ vec_u8_t s1 = vec_xl(48, srcPix0);
+
+ vec_u8_t srv0 = s0;
+ vec_u8_t srv1 = vec_perm(s0, s1, maskadd1_0);
+
+    vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+    vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+    vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+    vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+    vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+    vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+    vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+    vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+    vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+    vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+    vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+    vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+    vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+    vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+    vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+    vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+    vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1);
+ one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2);
+ one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3);
+ one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4);
+ one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5);
+ one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6);
+ one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7);
+ one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8);
+ one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9);
+ one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10);
+ one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11);
+ one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12);
+ one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13);
+ one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14);
+ one_line(srv0, srv1, vfrac16_32_15, vfrac16_15, vout_15);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 16*2, dst);
+ vec_xst(vout_3, 16*3, dst);
+ vec_xst(vout_4, 16*4, dst);
+ vec_xst(vout_5, 16*5, dst);
+ vec_xst(vout_6, 16*6, dst);
+ vec_xst(vout_7, 16*7, dst);
+ vec_xst(vout_8, 16*8, dst);
+ vec_xst(vout_9, 16*9, dst);
+ vec_xst(vout_10, 16*10, dst);
+ vec_xst(vout_11, 16*11, dst);
+ vec_xst(vout_12, 16*12, dst);
+ vec_xst(vout_13, 16*13, dst);
+ vec_xst(vout_14, 16*14, dst);
+ vec_xst(vout_15, 16*15, dst);
+
+#ifdef DEBUG
+ for (int y = 0; y < 16; y++)
+ {
+ for (int x = 0; x < 16; x++)
+ {
+            printf("%d ", dst[y * 16 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+template<>
+void one_ang_pred_altivec<32, 11>(pixel* dst, const pixel *srcPix0, int bFilter)
+{
+ vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };
+ vec_u8_t maskadd1_0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };
+
+ vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5};
+ vec_u8_t srv_left0=vec_xl(0, srcPix0);
+ vec_u8_t srv_left1=vec_xl(16, srcPix0);
+ vec_u8_t srv_right=vec_xl(65, srcPix0);
+ vec_u8_t refmask_32_0={0x10, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
+ vec_u8_t refmask_32_1={0x0, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};
+ vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 );
+ vec_u8_t s1 = vec_xl(79, srcPix0);
+ vec_u8_t s2 = vec_xl(95, srcPix0);
+
+ vec_u8_t srv0 = vec_perm(s0, s1, mask0);
+ vec_u8_t srv16_0 = vec_perm(s1, s2, mask0);
+ vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0);
+ vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0);
+
+ vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
+
+ vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
+ vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
+ vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6};
+ vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
+ vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
+ vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
+ vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14};
+ vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16};
+ vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18};
+ vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20};
+ vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22};
+ vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24};
+ vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26};
+ vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28};
+ vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30};
+ vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
+
+
+    /* dst[y * dstStride + x] = (pixel)((f32[y] * ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5) */
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo;
+ vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7;
+ vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15;
+ vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23;
+ vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31;
+
+ one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(srv0, srv0add1, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(srv0, srv0add1, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(srv0, srv0add1, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(srv0, srv0add1, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(srv0, srv0add1, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(srv0, srv0add1, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(srv0, srv0add1, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(srv0, srv0add1, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(srv0, srv0add1, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(srv0, srv0add1, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(srv0, srv0add1, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(srv0, srv0add1, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(srv0, srv0add1, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(srv0, srv0add1, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(srv0, srv0add1, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(srv16_0, srv16add1_0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 0, dst);
+ vec_xst(vout_1, 16, dst);
+ vec_xst(vout_2, 32, dst);
+ vec_xst(vout_3, 32+16, dst);
+ vec_xst(vout_4, 32*2, dst);
+ vec_xst(vout_5, 32*2+16, dst);
+ vec_xst(vout_6, 32*3, dst);
+ vec_xst(vout_7, 32*3+16, dst);
+ vec_xst(vout_8, 32*4, dst);
+ vec_xst(vout_9, 32*4+16, dst);
+ vec_xst(vout_10, 32*5, dst);
+ vec_xst(vout_11, 32*5+16, dst);
+ vec_xst(vout_12, 32*6, dst);
+ vec_xst(vout_13, 32*6+16, dst);
+ vec_xst(vout_14, 32*7, dst);
+ vec_xst(vout_15, 32*7+16, dst);
+ vec_xst(vout_16, 32*8, dst);
+ vec_xst(vout_17, 32*8+16, dst);
+ vec_xst(vout_18, 32*9, dst);
+ vec_xst(vout_19, 32*9+16, dst);
+ vec_xst(vout_20, 32*10, dst);
+ vec_xst(vout_21, 32*10+16, dst);
+ vec_xst(vout_22, 32*11, dst);
+ vec_xst(vout_23, 32*11+16, dst);
+ vec_xst(vout_24, 32*12, dst);
+ vec_xst(vout_25, 32*12+16, dst);
+ vec_xst(vout_26, 32*13, dst);
+ vec_xst(vout_27, 32*13+16, dst);
+ vec_xst(vout_28, 32*14, dst);
+ vec_xst(vout_29, 32*14+16, dst);
+ vec_xst(vout_30, 32*15, dst);
+ vec_xst(vout_31, 32*15+16, dst);
+
+ one_line(s0, srv0, vfrac16_32_0, vfrac16_0, vout_0);
+ one_line(s1, srv16_0, vfrac16_32_0, vfrac16_0, vout_1);
+
+ one_line(s0, srv0, vfrac16_32_1, vfrac16_1, vout_2);
+ one_line(s1, srv16_0, vfrac16_32_1, vfrac16_1, vout_3);
+
+ one_line(s0, srv0, vfrac16_32_2, vfrac16_2, vout_4);
+ one_line(s1, srv16_0, vfrac16_32_2, vfrac16_2, vout_5);
+
+ one_line(s0, srv0, vfrac16_32_3, vfrac16_3, vout_6);
+ one_line(s1, srv16_0, vfrac16_32_3, vfrac16_3, vout_7);
+
+ one_line(s0, srv0, vfrac16_32_4, vfrac16_4, vout_8);
+ one_line(s1, srv16_0, vfrac16_32_4, vfrac16_4, vout_9);
+
+ one_line(s0, srv0, vfrac16_32_5, vfrac16_5, vout_10);
+ one_line(s1, srv16_0, vfrac16_32_5, vfrac16_5, vout_11);
+
+ one_line(s0, srv0, vfrac16_32_6, vfrac16_6, vout_12);
+ one_line(s1, srv16_0, vfrac16_32_6, vfrac16_6, vout_13);
+
+ one_line(s0, srv0, vfrac16_32_7, vfrac16_7, vout_14);
+ one_line(s1, srv16_0, vfrac16_32_7, vfrac16_7, vout_15);
+
+ one_line(s0, srv0, vfrac16_32_8, vfrac16_8, vout_16);
+ one_line(s1, srv16_0, vfrac16_32_8, vfrac16_8, vout_17);
+
+ one_line(s0, srv0, vfrac16_32_9, vfrac16_9, vout_18);
+ one_line(s1, srv16_0, vfrac16_32_9, vfrac16_9, vout_19);
+
+ one_line(s0, srv0, vfrac16_32_10, vfrac16_10, vout_20);
+ one_line(s1, srv16_0, vfrac16_32_10, vfrac16_10, vout_21);
+
+ one_line(s0, srv0, vfrac16_32_11, vfrac16_11, vout_22);
+ one_line(s1, srv16_0, vfrac16_32_11, vfrac16_11, vout_23);
+
+ one_line(s0, srv0, vfrac16_32_12, vfrac16_12, vout_24);
+ one_line(s1, srv16_0, vfrac16_32_12, vfrac16_12, vout_25);
+
+ one_line(s0, srv0, vfrac16_32_13, vfrac16_13, vout_26);
+ one_line(s1, srv16_0, vfrac16_32_13, vfrac16_13, vout_27);
+
+ one_line(s0, srv0, vfrac16_32_14, vfrac16_14, vout_28);
+ one_line(s1, srv16_0, vfrac16_32_14, vfrac16_14, vout_29);
+
+ one_line(s0, srv0, vfrac16_32_15, vfrac16_15, vout_30);
+ one_line(s1, srv16_0, vfrac16_32_15, vfrac16_15, vout_31);
+
+ vec_xst(vout_0, 32*16, dst);
+ vec_xst(vout_1, 32*16+16, dst);
+ vec_xst(vout_2, 32*17, dst);
+ vec_xst(vout_3, 32*17+16, dst);
+ vec_xst(vout_4, 32*18, dst);
+ vec_xst(vout_5, 32*18+16, dst);
+ vec_xst(vout_6, 32*19, dst);
+ vec_xst(vout_7, 32*19+16, dst);
+ vec_xst(vout_8, 32*20, dst);
+ vec_xst(vout_9, 32*20+16, dst);
+ vec_xst(vout_10, 32*21, dst);
+ vec_xst(vout_11, 32*21+16, dst);
+ vec_xst(vout_12, 32*22, dst);
+ vec_xst(vout_13, 32*22+16, dst);
+ vec_xst(vout_14, 32*23, dst);
+ vec_xst(vout_15, 32*23+16, dst);
+ vec_xst(vout_16, 32*24, dst);
+ vec_xst(vout_17, 32*24+16, dst);
+ vec_xst(vout_18, 32*25, dst);
+ vec_xst(vout_19, 32*25+16, dst);
+ vec_xst(vout_20, 32*26, dst);
+ vec_xst(vout_21, 32*26+16, dst);
+ vec_xst(vout_22, 32*27, dst);
+ vec_xst(vout_23, 32*27+16, dst);
+ vec_xst(vout_24, 32*28, dst);
+ vec_xst(vout_25, 32*28+16, dst);
+ vec_xst(vout_26, 32*29, dst);
+ vec_xst(vout_27, 32*29+16, dst);
+ vec_xst(vout_28, 32*30, dst);
+ vec_xst(vout_29, 32*30+16, dst);
+ vec_xst(vout_30, 32*31, dst);
+ vec_xst(vout_31, 32*31+16, dst);
+
+
+#ifdef DEBUG
+ for (int y = 0; y < 32; y++)
+ {
+ for (int x = 0; x < 32; x++)
+ {
+            printf("%d ", dst[y * 32 + x]);
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+#endif
+}
+
+
+#define ONE_ANG(log2Size, mode, dest, refPix, filtPix, bLuma)\
+{\
+    const int width = 1 << log2Size;\
+    pixel *srcPix0 = (g_intraFilterFlags[mode] & width ? filtPix : refPix);\
+    pixel *dst = dest + ((mode - 2) << (log2Size * 2));\
+    /* NOTE: the two assignments below discard the selections above; the unfiltered */\
+    /* reference and the base destination pointer are always used in this build */\
+    srcPix0 = refPix;\
+    dst = dest;\
+ one_ang_pred_altivec<width, mode>(dst, srcPix0, bLuma);\
+}
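+
The offset arithmetic in ONE_ANG lays the 33 angular modes out contiguously: mode m's width-by-width block starts (m - 2) * width * width pixels into dest, computed as a shift since width * width == 1 << (log2Size * 2). A small sketch of just that index computation (illustrative only, not part of the x265 API; note the macro subsequently resets dst to dest, so the offset is computed but unused here):

```cpp
// Offset of the prediction block for angular mode `mode` (2..34) when blocks
// of (1 << log2Size) * (1 << log2Size) pixels are stored back to back,
// as computed by the ONE_ANG macro above.
static int ang_block_offset(int log2Size, int mode)
{
    return (mode - 2) << (log2Size * 2); // == (mode - 2) * width * width
}
```

For example, with log2Size = 2 (4x4 blocks), consecutive modes are 16 pixels apart in dest.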
+
+
+template<int log2Size>
+void all_angs_pred_altivec(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+{
+ ONE_ANG(log2Size, 2, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 3, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 4, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 5, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 6, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 7, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 8, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 9, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 10, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 11, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 12, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 13, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 14, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 15, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 16, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 17, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 18, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 19, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 20, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 21, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 22, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 23, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 24, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 25, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 26, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 27, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 28, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 29, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 30, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 31, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 32, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 33, dest, refPix, filtPix, bLuma);
+ ONE_ANG(log2Size, 34, dest, refPix, filtPix, bLuma);
+ return;
+}
+
+void setupIntraPrimitives_altivec(EncoderPrimitives &p)
+{
+ for (int i = 2; i < NUM_INTRA_MODE; i++)
+ {
+ p.cu[BLOCK_4x4].intra_pred[i] = intra_pred_ang_altivec<4>;
+ p.cu[BLOCK_8x8].intra_pred[i] = intra_pred_ang_altivec<8>;
+ p.cu[BLOCK_16x16].intra_pred[i] = intra_pred_ang_altivec<16>;
+ p.cu[BLOCK_32x32].intra_pred[i] = intra_pred_ang_altivec<32>;
+ }
+
+ p.cu[BLOCK_4x4].intra_pred_allangs = all_angs_pred_altivec<2>;
+ p.cu[BLOCK_8x8].intra_pred_allangs = all_angs_pred_altivec<3>;
+ p.cu[BLOCK_16x16].intra_pred_allangs = all_angs_pred_altivec<4>;
+ p.cu[BLOCK_32x32].intra_pred_allangs = all_angs_pred_altivec<5>;
+}
+
+}
+
diff --git a/source/common/ppc/ipfilter_altivec.cpp b/source/common/ppc/ipfilter_altivec.cpp
new file mode 100644
index 0000000..caf11b0
--- /dev/null
+++ b/source/common/ppc/ipfilter_altivec.cpp
@@ -0,0 +1,1522 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Roger Moussalli <rmoussal at us.ibm.com>
+ * Min Chen <min.chen at multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include <iostream>
+#include "common.h"
+#include "primitives.h"
+#include "ppccommon.h"
+
+using namespace X265_NS;
+
+// ORIGINAL : for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];}
+#define multiply_pixel_coeff(/*vector int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const pixel * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \
+{ \
+ vector unsigned char v_pixel ; \
+ vector signed short v_pixel_16_h, v_pixel_16_l ; \
+ const vector signed short v_mask_unsigned_8_to_16 = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; \
+\
+ /* load the pixels */ \
+ v_pixel = vec_xl(src_offset, src) ; \
+\
+ /* unpack the 8-bit pixels to 16-bit values (and undo the sign extension) */ \
+ v_pixel_16_h = vec_unpackh((vector signed char)v_pixel) ; \
+ v_pixel_16_l = vec_unpackl((vector signed char)v_pixel) ; \
+ v_pixel_16_h = vec_and(v_pixel_16_h, v_mask_unsigned_8_to_16) ; \
+ v_pixel_16_l = vec_and(v_pixel_16_l, v_mask_unsigned_8_to_16) ; \
+\
+ /* multiply the pixels by the coefficient */ \
+ v_sum_0 = vec_mule(v_pixel_16_h, v_coeff) ; \
+ v_sum_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \
+ v_sum_2 = vec_mule(v_pixel_16_l, v_coeff) ; \
+ v_sum_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \
+} // end multiply_pixel_coeff()
+
+
+// ORIGINAL : for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];}
+#define multiply_accumulate_pixel_coeff(/*vector int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const pixel * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \
+{ \
+ vector unsigned char v_pixel ; \
+ vector signed short v_pixel_16_h, v_pixel_16_l ; \
+ vector int v_product_int_0, v_product_int_1, v_product_int_2, v_product_int_3 ; \
+ const vector signed short v_mask_unsigned_8_to_16 = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; \
+\
+ /* ORIGINAL : for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];} */ \
+ /* load the pixels */ \
+ v_pixel = vec_xl(src_offset, src) ; \
+\
+ /* unpack the 8-bit pixels to 16-bit values (and undo the sign extension) */ \
+ v_pixel_16_h = vec_unpackh((vector signed char)v_pixel) ; \
+ v_pixel_16_l = vec_unpackl((vector signed char)v_pixel) ; \
+ v_pixel_16_h = vec_and(v_pixel_16_h, v_mask_unsigned_8_to_16) ; \
+ v_pixel_16_l = vec_and(v_pixel_16_l, v_mask_unsigned_8_to_16) ; \
+\
+ /* multiply the pixels by the coefficient */ \
+ v_product_int_0 = vec_mule(v_pixel_16_h, v_coeff) ; \
+ v_product_int_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \
+ v_product_int_2 = vec_mule(v_pixel_16_l, v_coeff) ; \
+ v_product_int_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \
+\
+ /* accumulate the results with the sum vectors */ \
+ v_sum_0 = vec_add(v_sum_0, v_product_int_0) ; \
+ v_sum_1 = vec_add(v_sum_1, v_product_int_1) ; \
+ v_sum_2 = vec_add(v_sum_2, v_product_int_2) ; \
+ v_sum_3 = vec_add(v_sum_3, v_product_int_3) ; \
+} // end multiply_accumulate_pixel_coeff()
+
+
+
+#if 0
+//ORIGINAL
+// Works with the following values:
+// N = 8
+// width >= 16 (multiple of 16)
+// any height
+template<int N, int width, int height>
+void interp_vert_pp_altivec(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
+{
+
+
+ const int16_t* c = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
+ const int shift = IF_FILTER_PREC;
+ const int offset = 1 << (shift - 1);
+ const uint16_t maxVal = (1 << X265_DEPTH) - 1;
+
+ src -= (N / 2 - 1) * srcStride;
+
+
+ // Vector to hold replicated shift amount
+ const vector unsigned int v_shift = {shift, shift, shift, shift} ;
+
+ // Vector to hold replicated offset
+ const vector int v_offset = {offset, offset, offset, offset} ;
+
+ // Vector to hold replicated maxVal
+ const vector signed short v_maxVal = {maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal} ;
+
+
+ // Vector to hold replicated coefficients (one coefficient replicated per vector)
+ vector signed short v_coeff_0, v_coeff_1, v_coeff_2, v_coeff_3, v_coeff_4, v_coeff_5, v_coeff_6, v_coeff_7 ;
+ vector signed short v_coefficients = vec_xl(0, c) ; // load all coefficients into one vector
+
+ // Replicate the coefficients into respective vectors
+ v_coeff_0 = vec_splat(v_coefficients, 0) ;
+ v_coeff_1 = vec_splat(v_coefficients, 1) ;
+ v_coeff_2 = vec_splat(v_coefficients, 2) ;
+ v_coeff_3 = vec_splat(v_coefficients, 3) ;
+ v_coeff_4 = vec_splat(v_coefficients, 4) ;
+ v_coeff_5 = vec_splat(v_coefficients, 5) ;
+ v_coeff_6 = vec_splat(v_coefficients, 6) ;
+ v_coeff_7 = vec_splat(v_coefficients, 7) ;
+
+
+
+ int row, ocol, col;
+ for (row = 0; row < height; row++)
+ {
+ for (ocol = 0; ocol < width; ocol+=16)
+ {
+
+
+ // int sum[16] ;
+ // int16_t val[16] ;
+
+ // --> for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 2 * srcStride] * c[2];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 3 * srcStride] * c[3];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 4 * srcStride] * c[4];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 5 * srcStride] * c[5];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 6 * srcStride] * c[6];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 7 * srcStride] * c[7];}
+
+
+ vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ;
+ vector signed short v_val_0, v_val_1 ;
+
+
+
+ multiply_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol, v_coeff_0) ;
+ multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 1 * srcStride, v_coeff_1) ;
+ multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 2 * srcStride, v_coeff_2) ;
+ multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 3 * srcStride, v_coeff_3) ;
+ multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 4 * srcStride, v_coeff_4) ;
+ multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 5 * srcStride, v_coeff_5) ;
+ multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 6 * srcStride, v_coeff_6) ;
+ multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 7 * srcStride, v_coeff_7) ;
+
+
+
+
+
+ // --> for(col=0; col<16; col++) {val[col] = (int16_t)((sum[col] + offset) >> shift);}
+ // Add offset
+ v_sum_0 = vec_add(v_sum_0, v_offset) ;
+ v_sum_1 = vec_add(v_sum_1, v_offset) ;
+ v_sum_2 = vec_add(v_sum_2, v_offset) ;
+ v_sum_3 = vec_add(v_sum_3, v_offset) ;
+ // Shift right by "shift"
+ v_sum_0 = vec_sra(v_sum_0, v_shift) ;
+ v_sum_1 = vec_sra(v_sum_1, v_shift) ;
+ v_sum_2 = vec_sra(v_sum_2, v_shift) ;
+ v_sum_3 = vec_sra(v_sum_3, v_shift) ;
+
+ // Pack into 16-bit numbers
+ v_val_0 = vec_pack(v_sum_0, v_sum_2) ;
+ v_val_1 = vec_pack(v_sum_1, v_sum_3) ;
+
+
+
+ // --> for(col=0; col<16; col++) {val[col] = (val[col] < 0) ? 0 : val[col];}
+ vector bool short v_comp_zero_0, v_comp_zero_1 ;
+ vector signed short zeros16 = {0,0,0,0,0,0,0,0} ;
+ // Compute less than 0
+ v_comp_zero_0 = vec_cmplt(v_val_0, zeros16) ;
+ v_comp_zero_1 = vec_cmplt(v_val_1, zeros16) ;
+ // Keep values that are greater than or equal to 0
+ v_val_0 = vec_andc(v_val_0, v_comp_zero_0) ;
+ v_val_1 = vec_andc(v_val_1, v_comp_zero_1) ;
+
+
+
+ // --> for(col=0; col<16; col++) {val[col] = (val[col] > maxVal) ? maxVal : val[col];}
+ vector bool short v_comp_max_0, v_comp_max_1 ;
+ // Compute greater than max
+ v_comp_max_0 = vec_cmpgt(v_val_0, v_maxVal) ;
+ v_comp_max_1 = vec_cmpgt(v_val_1, v_maxVal) ;
+ // Replace values greater than maxVal with maxVal
+ v_val_0 = vec_sel(v_val_0, v_maxVal, v_comp_max_0) ;
+ v_val_1 = vec_sel(v_val_1, v_maxVal, v_comp_max_1) ;
+
+
+
+ // --> for(col=0; col<16; col++) {dst[ocol+col] = (pixel)val[col];}
+ // Pack the values into 8-bit numbers, re-interleaving them to undo
+ // the even/odd lane split left behind by vec_mule/vec_mulo
+ vector unsigned char v_result ;
+ vector unsigned char v_perm_index = {0x00, 0x10, 0x02, 0x12, 0x04, 0x14, 0x06, 0x16, 0x08 ,0x18, 0x0A, 0x1A, 0x0C, 0x1C, 0x0E, 0x1E} ;
+ v_result = (vector unsigned char)vec_perm(v_val_0, v_val_1, v_perm_index) ;
+ // Store the results back to dst[]
+ vec_xst(v_result, ocol, (unsigned char *)dst) ;
+ }
+
+ src += srcStride;
+ dst += dstStride;
+ }
+} // end interp_vert_pp_altivec()
+#else
+// Works with the following values:
+// N = 8
+// width >= 16 (multiple of 16)
+// any height
+template<int N, int width, int height>
+void interp_vert_pp_altivec(const pixel* __restrict__ src, intptr_t srcStride, pixel* __restrict__ dst, intptr_t dstStride, int coeffIdx)
+{
+ const int16_t* __restrict__ c = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
+ int shift = IF_FILTER_PREC;
+ int offset = 1 << (shift - 1);
+ uint16_t maxVal = (1 << X265_DEPTH) - 1;
+
+ src -= (N / 2 - 1) * srcStride;
+
+ vector signed short vcoeff0 = vec_splats(c[0]);
+ vector signed short vcoeff1 = vec_splats(c[1]);
+ vector signed short vcoeff2 = vec_splats(c[2]);
+ vector signed short vcoeff3 = vec_splats(c[3]);
+ vector signed short vcoeff4 = vec_splats(c[4]);
+ vector signed short vcoeff5 = vec_splats(c[5]);
+ vector signed short vcoeff6 = vec_splats(c[6]);
+ vector signed short vcoeff7 = vec_splats(c[7]);
+ vector signed short voffset = vec_splats((short)offset);
+ vector signed short vshift = vec_splats((short)shift);
+ vector signed short vmaxVal = vec_splats((short)maxVal);
+ vector signed short vzero_s16 = vec_splats( (signed short)0u);
+ vector signed int vzero_s32 = vec_splats( (signed int)0u);
+ vector unsigned char vzero_u8 = vec_splats( (unsigned char)0u );
+ vector unsigned char vchar_to_short_maskH = {24, 0, 25, 0, 26, 0, 27, 0, 28, 0, 29, 0, 30, 0, 31, 0};
+ vector unsigned char vchar_to_short_maskL = {16, 0, 17, 0 ,18, 0, 19, 0, 20, 0, 21, 0, 22, 0, 23, 0};
+
+ vector signed short vsrcH, vsrcL, vsumH, vsumL;
+ vector unsigned char vsrc;
+
+ vector signed short vsrc2H, vsrc2L, vsum2H, vsum2L;
+ vector unsigned char vsrc2;
+
+ const pixel* __restrict__ src2 = src+srcStride;
+ pixel* __restrict__ dst2 = dst+dstStride;
+
+ int row, col;
+ for (row = 0; row < height; row+=2)
+ {
+ for (col = 0; col < width; col+=16)
+ {
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 0*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH = vsrcH * vcoeff0;
+ vsumL = vsrcL * vcoeff0;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 1*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff1;
+ vsumL += vsrcL * vcoeff1;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 2*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff2;
+ vsumL += vsrcL * vcoeff2;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 3*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff3;
+ vsumL += vsrcL * vcoeff3;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 4*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff4;
+ vsumL += vsrcL * vcoeff4;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 5*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff5;
+ vsumL += vsrcL * vcoeff5;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 6*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff6;
+ vsumL += vsrcL * vcoeff6;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 7*srcStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff7;
+ vsumL += vsrcL * vcoeff7;
+
+ vector short vvalH = (vsumH + voffset) >> vshift;
+ vvalH = vec_max( vvalH, vzero_s16 );
+ vvalH = vec_min( vvalH, vmaxVal );
+
+ vector short vvalL = (vsumL + voffset) >> vshift;
+ vvalL = vec_max( vvalL, vzero_s16 );
+ vvalL = vec_min( vvalL, vmaxVal );
+
+ vector signed char vdst = vec_pack( vvalL, vvalH );
+ vec_xst( vdst, 0, (signed char*)&dst[col] );
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 0*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H = vsrc2H * vcoeff0;
+ vsum2L = vsrc2L * vcoeff0;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 1*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff1;
+ vsum2L += vsrc2L * vcoeff1;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 2*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff2;
+ vsum2L += vsrc2L * vcoeff2;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 3*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff3;
+ vsum2L += vsrc2L * vcoeff3;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 4*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff4;
+ vsum2L += vsrc2L * vcoeff4;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 5*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff5;
+ vsum2L += vsrc2L * vcoeff5;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 6*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff6;
+ vsum2L += vsrc2L * vcoeff6;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 7*srcStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff7;
+ vsum2L += vsrc2L * vcoeff7;
+
+ vector short vval2H = (vsum2H + voffset) >> vshift;
+ vval2H = vec_max( vval2H, vzero_s16 );
+ vval2H = vec_min( vval2H, vmaxVal );
+
+ vector short vval2L = (vsum2L + voffset) >> vshift;
+ vval2L = vec_max( vval2L, vzero_s16 );
+ vval2L = vec_min( vval2L, vmaxVal );
+
+ vector signed char vdst2 = vec_pack( vval2L, vval2H );
+ vec_xst( vdst2, 0, (signed char*)&dst2[col] );
+ }
+
+ src += 2*srcStride;
+ dst += 2*dstStride;
+ src2 += 2*srcStride;
+ dst2 += 2*dstStride;
+ }
+}
+#endif
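Per output pixel, both versions of `interp_vert_pp_altivec` compute an 8-tap weighted sum down a column, then round, shift, and clamp to the valid pixel range. A scalar reference of that per-pixel computation (the shift and max value are passed in here for illustration; for the real luma filters the shift is `IF_FILTER_PREC`, i.e. 6, and the taps sum to 64):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference for one output pixel of the vertical 8-tap
// pixel-to-pixel filter:
//   val = clamp((sum(src[r * stride] * c[r]) + offset) >> shift, 0, maxVal)
uint8_t filter_vert_pp_ref(const uint8_t* src, ptrdiff_t srcStride,
                           const int16_t c[8], int shift, int maxVal)
{
    int offset = 1 << (shift - 1); // rounding term, as in the vector code
    int sum = 0;
    for (int r = 0; r < 8; ++r)
        sum += int(src[r * srcStride]) * c[r];
    int val = (sum + offset) >> shift;
    return (uint8_t)std::min(std::max(val, 0), maxVal);
}
```

With an impulse filter (a single tap of 64), the output reproduces the corresponding input pixel, which is a handy sanity check for the vector kernels.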
+
+
+// ORIGINAL : for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];}
+#define multiply_sp_pixel_coeff(/*vector int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const int16_t * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \
+{ \
+ vector signed short v_pixel_16_h, v_pixel_16_l ; \
+\
+ /* load the pixels */ \
+ v_pixel_16_h = vec_xl(src_offset, src) ; \
+ v_pixel_16_l = vec_xl(src_offset + 16, src) ; \
+\
+ /* multiply the pixels by the coefficient */ \
+ v_sum_0 = vec_mule(v_pixel_16_h, v_coeff) ; \
+ v_sum_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \
+ v_sum_2 = vec_mule(v_pixel_16_l, v_coeff) ; \
+ v_sum_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \
+\
+} // end multiply_sp_pixel_coeff()
+
+
+// ORIGINAL : for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];}
+#define multiply_accumulate_sp_pixel_coeff(/*vector int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const pixel * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \
+{ \
+ vector signed short v_pixel_16_h, v_pixel_16_l ; \
+ vector int v_product_int_0, v_product_int_1, v_product_int_2, v_product_int_3 ; \
+\
+ /* ORIGINAL : for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];} */ \
+\
+ /* load the pixels */ \
+ v_pixel_16_h = vec_xl(src_offset, src) ; \
+ v_pixel_16_l = vec_xl(src_offset + 16, src) ; \
+\
+ /* multiply the pixels by the coefficient */ \
+ v_product_int_0 = vec_mule(v_pixel_16_h, v_coeff) ; \
+ v_product_int_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \
+ v_product_int_2 = vec_mule(v_pixel_16_l, v_coeff) ; \
+ v_product_int_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \
+\
+ /* accumulate the results with the sum vectors */ \
+ v_sum_0 = vec_add(v_sum_0, v_product_int_0) ; \
+ v_sum_1 = vec_add(v_sum_1, v_product_int_1) ; \
+ v_sum_2 = vec_add(v_sum_2, v_product_int_2) ; \
+ v_sum_3 = vec_add(v_sum_3, v_product_int_3) ; \
+\
+} // end multiply_accumulate_sp_pixel_coeff()
+
+
+// Works with the following values:
+// N = 8
+// width >= 16 (multiple of 16)
+// any height
+template <int N, int width, int height>
+void filterVertical_sp_altivec(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
+{
+ int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
+ unsigned int shift = IF_FILTER_PREC + headRoom;
+ int offset = (1 << (shift - 1)) + (IF_INTERNAL_OFFS << IF_FILTER_PREC);
+ const uint16_t maxVal = (1 << X265_DEPTH) - 1;
+ const int16_t* coeff = (N == 8 ? g_lumaFilter[coeffIdx] : g_chromaFilter[coeffIdx]);
+
+ src -= (N / 2 - 1) * srcStride;
+
+
+ // Vector to hold replicated shift amount
+ const vector unsigned int v_shift = {shift, shift, shift, shift} ;
+
+ // Vector to hold replicated offset
+ const vector int v_offset = {offset, offset, offset, offset} ;
+
+ // Vector to hold replicated maxVal
+ const vector signed short v_maxVal = {maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal} ;
+
+
+ // Vector to hold replicated coefficients (one coefficient replicated per vector)
+ vector signed short v_coeff_0, v_coeff_1, v_coeff_2, v_coeff_3, v_coeff_4, v_coeff_5, v_coeff_6, v_coeff_7 ;
+ vector signed short v_coefficients = vec_xl(0, coeff) ; // load all coefficients into one vector
+
+ // Replicate the coefficients into respective vectors
+ v_coeff_0 = vec_splat(v_coefficients, 0) ;
+ v_coeff_1 = vec_splat(v_coefficients, 1) ;
+ v_coeff_2 = vec_splat(v_coefficients, 2) ;
+ v_coeff_3 = vec_splat(v_coefficients, 3) ;
+ v_coeff_4 = vec_splat(v_coefficients, 4) ;
+ v_coeff_5 = vec_splat(v_coefficients, 5) ;
+ v_coeff_6 = vec_splat(v_coefficients, 6) ;
+ v_coeff_7 = vec_splat(v_coefficients, 7) ;
+
+
+
+ int row, ocol, col;
+ for (row = 0; row < height; row++)
+ {
+ for (ocol = 0; ocol < width; ocol+= 16 )
+ {
+
+ // int sum[16] ;
+ // int16_t val[16] ;
+
+ // --> for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 2 * srcStride] * c[2];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 3 * srcStride] * c[3];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 4 * srcStride] * c[4];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 5 * srcStride] * c[5];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 6 * srcStride] * c[6];}
+ // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 7 * srcStride] * c[7];}
+
+
+ vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ;
+ vector signed short v_val_0, v_val_1 ;
+
+
+ // The offsets are scaled by 2 because they are byte offsets and each input sample (int16_t) is 2 bytes wide
+ multiply_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol * 2, v_coeff_0) ;
+ multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 1 * srcStride) * 2, v_coeff_1) ;
+ multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 2 * srcStride) * 2, v_coeff_2) ;
+ multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 3 * srcStride) * 2, v_coeff_3) ;
+ multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 4 * srcStride) * 2, v_coeff_4) ;
+ multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 5 * srcStride) * 2, v_coeff_5) ;
+ multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 6 * srcStride) * 2, v_coeff_6) ;
+ multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 7 * srcStride) * 2, v_coeff_7) ;
+
+
+
+
+
+ // --> for(col=0; col<16; col++) {val[col] = (int16_t)((sum[col] + offset) >> shift);}
+ // Add offset
+ v_sum_0 = vec_add(v_sum_0, v_offset) ;
+ v_sum_1 = vec_add(v_sum_1, v_offset) ;
+ v_sum_2 = vec_add(v_sum_2, v_offset) ;
+ v_sum_3 = vec_add(v_sum_3, v_offset) ;
+ // Shift right by "shift"
+ v_sum_0 = vec_sra(v_sum_0, v_shift) ;
+ v_sum_1 = vec_sra(v_sum_1, v_shift) ;
+ v_sum_2 = vec_sra(v_sum_2, v_shift) ;
+ v_sum_3 = vec_sra(v_sum_3, v_shift) ;
+
+ // Pack into 16-bit numbers
+ v_val_0 = vec_pack(v_sum_0, v_sum_2) ;
+ v_val_1 = vec_pack(v_sum_1, v_sum_3) ;
+
+
+
+ // --> for(col=0; col<16; col++) {val[col] = (val[col] < 0) ? 0 : val[col];}
+ vector bool short v_comp_zero_0, v_comp_zero_1 ;
+ vector signed short zeros16 = {0,0,0,0,0,0,0,0} ;
+ // Compute less than 0
+ v_comp_zero_0 = vec_cmplt(v_val_0, zeros16) ;
+ v_comp_zero_1 = vec_cmplt(v_val_1, zeros16) ;
+ // Keep values that are greater than or equal to 0
+ v_val_0 = vec_andc(v_val_0, v_comp_zero_0) ;
+ v_val_1 = vec_andc(v_val_1, v_comp_zero_1) ;
+
+
+
+ // --> for(col=0; col<16; col++) {val[col] = (val[col] > maxVal) ? maxVal : val[col];}
+ vector bool short v_comp_max_0, v_comp_max_1 ;
+ // Compute greater than max
+ v_comp_max_0 = vec_cmpgt(v_val_0, v_maxVal) ;
+ v_comp_max_1 = vec_cmpgt(v_val_1, v_maxVal) ;
+ // Replace values greater than maxVal with maxVal
+ v_val_0 = vec_sel(v_val_0, v_maxVal, v_comp_max_0) ;
+ v_val_1 = vec_sel(v_val_1, v_maxVal, v_comp_max_1) ;
+
+
+
+ // --> for(col=0; col<16; col++) {dst[ocol+col] = (pixel)val[col];}
+ // Pack the values into 8-bit numbers, re-interleaving them to undo
+ // the even/odd lane split left behind by vec_mule/vec_mulo
+ vector unsigned char v_result ;
+ vector unsigned char v_perm_index = {0x00, 0x10, 0x02, 0x12, 0x04, 0x14, 0x06, 0x16, 0x08 ,0x18, 0x0A, 0x1A, 0x0C, 0x1C, 0x0E, 0x1E} ;
+ v_result = (vector unsigned char)vec_perm(v_val_0, v_val_1, v_perm_index) ;
+ // Store the results back to dst[]
+ vec_xst(v_result, ocol, (unsigned char *)dst) ;
+ }
+
+ src += srcStride;
+ dst += dstStride;
+ }
+} // end filterVertical_sp_altivec()
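The sp (short-to-pixel) pass differs from the pp pass in its rounding constants: `shift = IF_FILTER_PREC + headRoom` and `offset = (1 << (shift - 1)) + (IF_INTERNAL_OFFS << IF_FILTER_PREC)`, which undoes the `-IF_INTERNAL_OFFS` bias added by the ps pass. A sketch of that round trip using impulse filters (the constants below assume 8-bit depth with x265's usual values, `IF_INTERNAL_PREC = 14` and `IF_FILTER_PREC = 6`; treat them as illustrative assumptions rather than a copy of the headers):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum { DEPTH = 8, INTERNAL_PREC = 14, FILTER_PREC = 6 };
enum { HEAD_ROOM = INTERNAL_PREC - DEPTH,          // 6 at 8-bit depth
       INTERNAL_OFFS = 1 << (INTERNAL_PREC - 1) }; // 8192

// Horizontal ps pass with an impulse filter: pixel -> biased 16-bit value
int16_t ps_impulse(uint8_t pix)
{
    int shift  = FILTER_PREC - HEAD_ROOM;          // 0 at 8-bit depth
    int offset = -(INTERNAL_OFFS << shift);        // subtracts the internal bias
    return (int16_t)((int(pix) * (1 << FILTER_PREC) + offset) >> shift);
}

// Vertical sp pass with an impulse filter: biased 16-bit value -> pixel
uint8_t sp_impulse(int16_t v)
{
    int shift  = FILTER_PREC + HEAD_ROOM;          // 12
    int offset = (1 << (shift - 1)) + (INTERNAL_OFFS << FILTER_PREC);
    int val = (int(v) * (1 << FILTER_PREC) + offset) >> shift;
    return (uint8_t)std::min(std::max(val, 0), 255);
}
```

Because the sp offset folds the internal bias back in before the clamp, an sp pass applied to a ps intermediate reproduces the input pixel exactly when both filters are impulses.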
+
+
+
+
+
+// Works with the following values:
+// N = 8
+// width >= 32 (multiple of 32)
+// any height
+template <int N, int width, int height>
+void interp_horiz_ps_altivec(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+{
+
+ const int16_t* coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
+ int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
+ unsigned int shift = IF_FILTER_PREC - headRoom;
+ int offset = -IF_INTERNAL_OFFS << shift;
+ int blkheight = height;
+
+ src -= N / 2 - 1;
+
+ if (isRowExt)
+ {
+ src -= (N / 2 - 1) * srcStride;
+ blkheight += N - 1;
+ }
+
+
+ vector signed short v_coeff ;
+ v_coeff = vec_xl(0, coeff) ;
+
+
+ vector unsigned char v_pixel_char_0, v_pixel_char_1, v_pixel_char_2 ;
+ vector signed short v_pixel_short_0, v_pixel_short_1, v_pixel_short_2, v_pixel_short_3, v_pixel_short_4 ;
+ const vector signed short v_mask_unisgned_char_to_short = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ;
+ const vector signed int v_zeros_int = {0, 0, 0, 0} ;
+ const vector signed short v_zeros_short = {0, 0, 0, 0, 0, 0, 0, 0} ;
+
+ vector signed int v_product_0_0, v_product_0_1 ;
+ vector signed int v_product_1_0, v_product_1_1 ;
+ vector signed int v_product_2_0, v_product_2_1 ;
+ vector signed int v_product_3_0, v_product_3_1 ;
+
+ vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ;
+
+ vector signed int v_sums_temp_col0, v_sums_temp_col1, v_sums_temp_col2, v_sums_temp_col3 ;
+ vector signed int v_sums_col0_0, v_sums_col0_1 ;
+ vector signed int v_sums_col1_0, v_sums_col1_1 ;
+ vector signed int v_sums_col2_0, v_sums_col2_1 ;
+ vector signed int v_sums_col3_0, v_sums_col3_1 ;
+
+
+ const vector signed int v_offset = {offset, offset, offset, offset};
+ const vector unsigned int v_shift = {shift, shift, shift, shift} ;
+
+
+ vector unsigned char v_sums_shamt = {0x20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ;
+
+
+
+ pixel *next_src ;
+ int16_t *next_dst ;
+
+ int row, col;
+ for (row = 0; row < blkheight; row++)
+ {
+ next_src = (pixel *)src + srcStride ;
+ next_dst = (int16_t *)dst + dstStride ;
+
+ for(int col_iter=0; col_iter<width; col_iter+=32)
+ {
+ // Load a full row of pixels (32 + 7)
+ v_pixel_char_0 = vec_xl(0, src) ;
+ v_pixel_char_1 = vec_xl(16, src) ;
+ v_pixel_char_2 = vec_xl(32, src) ;
+
+
+ v_sums_temp_col0 = v_zeros_int ;
+ v_sums_temp_col1 = v_zeros_int ;
+ v_sums_temp_col2 = v_zeros_int ;
+ v_sums_temp_col3 = v_zeros_int ;
+
+
+ // Expand the loaded pixels into shorts
+ v_pixel_short_0 = vec_unpackh((vector signed char)v_pixel_char_0) ;
+ v_pixel_short_1 = vec_unpackl((vector signed char)v_pixel_char_0) ;
+ v_pixel_short_2 = vec_unpackh((vector signed char)v_pixel_char_1) ;
+ v_pixel_short_3 = vec_unpackl((vector signed char)v_pixel_char_1) ;
+ v_pixel_short_4 = vec_unpackh((vector signed char)v_pixel_char_2) ;
+
+ v_pixel_short_0 = vec_and(v_pixel_short_0, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_1 = vec_and(v_pixel_short_1, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_2 = vec_and(v_pixel_short_2, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_3 = vec_and(v_pixel_short_3, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_4 = vec_and(v_pixel_short_4, v_mask_unisgned_char_to_short) ;
+
+
+
+ // Four column sets are processed below
+ // One column per set per iteration
+ for(col=0; col < 8; col++)
+ {
+
+ // Multiply the pixels by the coefficients
+ v_product_0_0 = vec_mule(v_pixel_short_0, v_coeff) ;
+ v_product_0_1 = vec_mulo(v_pixel_short_0, v_coeff) ;
+
+ v_product_1_0 = vec_mule(v_pixel_short_1, v_coeff) ;
+ v_product_1_1 = vec_mulo(v_pixel_short_1, v_coeff) ;
+
+ v_product_2_0 = vec_mule(v_pixel_short_2, v_coeff) ;
+ v_product_2_1 = vec_mulo(v_pixel_short_2, v_coeff) ;
+
+ v_product_3_0 = vec_mule(v_pixel_short_3, v_coeff) ;
+ v_product_3_1 = vec_mulo(v_pixel_short_3, v_coeff) ;
+
+
+ // Sum up the multiplication results
+ v_sum_0 = vec_add(v_product_0_0, v_product_0_1) ;
+ v_sum_0 = vec_sums(v_sum_0, v_zeros_int) ;
+
+ v_sum_1 = vec_add(v_product_1_0, v_product_1_1) ;
+ v_sum_1 = vec_sums(v_sum_1, v_zeros_int) ;
+
+ v_sum_2 = vec_add(v_product_2_0, v_product_2_1) ;
+ v_sum_2 = vec_sums(v_sum_2, v_zeros_int) ;
+
+ v_sum_3 = vec_add(v_product_3_0, v_product_3_1) ;
+ v_sum_3 = vec_sums(v_sum_3, v_zeros_int) ;
+
+
+ // Insert the sum results into respective vectors
+ v_sums_temp_col0 = vec_sro(v_sums_temp_col0, v_sums_shamt) ;
+ v_sums_temp_col0 = vec_or(v_sum_0, v_sums_temp_col0) ;
+
+ v_sums_temp_col1 = vec_sro(v_sums_temp_col1, v_sums_shamt) ;
+ v_sums_temp_col1 = vec_or(v_sum_1, v_sums_temp_col1) ;
+
+ v_sums_temp_col2 = vec_sro(v_sums_temp_col2, v_sums_shamt) ;
+ v_sums_temp_col2 = vec_or(v_sum_2, v_sums_temp_col2) ;
+
+ v_sums_temp_col3 = vec_sro(v_sums_temp_col3, v_sums_shamt) ;
+ v_sums_temp_col3 = vec_or(v_sum_3, v_sums_temp_col3) ;
+
+
+ if(col == 3)
+ {
+ v_sums_col0_0 = v_sums_temp_col0 ;
+ v_sums_col1_0 = v_sums_temp_col1 ;
+ v_sums_col2_0 = v_sums_temp_col2 ;
+ v_sums_col3_0 = v_sums_temp_col3 ;
+
+ v_sums_temp_col0 = v_zeros_int ;
+ v_sums_temp_col1 = v_zeros_int ;
+ v_sums_temp_col2 = v_zeros_int ;
+ v_sums_temp_col3 = v_zeros_int ;
+ }
+
+
+ // Shift the pixels by 1 (short pixel)
+ v_pixel_short_0 = vec_sld(v_pixel_short_1, v_pixel_short_0, 14) ;
+ v_pixel_short_1 = vec_sld(v_pixel_short_2, v_pixel_short_1, 14) ;
+ v_pixel_short_2 = vec_sld(v_pixel_short_3, v_pixel_short_2, 14) ;
+ v_pixel_short_3 = vec_sld(v_pixel_short_4, v_pixel_short_3, 14) ;
+ const vector unsigned char v_shift_right_two_bytes_shamt = {0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ;
+ v_pixel_short_4 = vec_sro(v_pixel_short_4, v_shift_right_two_bytes_shamt) ;
+ }
+
+ // Copy the sum results to the second vector (per column)
+ v_sums_col0_1 = v_sums_temp_col0 ;
+ v_sums_col1_1 = v_sums_temp_col1 ;
+ v_sums_col2_1 = v_sums_temp_col2 ;
+ v_sums_col3_1 = v_sums_temp_col3 ;
+
+
+
+ // Post processing and eventually 2 stores
+ // Original code:
+ // int16_t val = (int16_t)((sum + offset) >> shift);
+ // dst[col] = val;
+
+
+ v_sums_col0_0 = vec_sra(vec_add(v_sums_col0_0, v_offset), v_shift) ;
+ v_sums_col0_1 = vec_sra(vec_add(v_sums_col0_1, v_offset), v_shift) ;
+ v_sums_col1_0 = vec_sra(vec_add(v_sums_col1_0, v_offset), v_shift) ;
+ v_sums_col1_1 = vec_sra(vec_add(v_sums_col1_1, v_offset), v_shift) ;
+ v_sums_col2_0 = vec_sra(vec_add(v_sums_col2_0, v_offset), v_shift) ;
+ v_sums_col2_1 = vec_sra(vec_add(v_sums_col2_1, v_offset), v_shift) ;
+ v_sums_col3_0 = vec_sra(vec_add(v_sums_col3_0, v_offset), v_shift) ;
+ v_sums_col3_1 = vec_sra(vec_add(v_sums_col3_1, v_offset), v_shift) ;
+
+
+ vector signed short v_val_col0, v_val_col1, v_val_col2, v_val_col3 ;
+ v_val_col0 = vec_pack(v_sums_col0_0, v_sums_col0_1) ;
+ v_val_col1 = vec_pack(v_sums_col1_0, v_sums_col1_1) ;
+ v_val_col2 = vec_pack(v_sums_col2_0, v_sums_col2_1) ;
+ v_val_col3 = vec_pack(v_sums_col3_0, v_sums_col3_1) ;
+
+
+
+ // Store results
+ vec_xst(v_val_col0, 0, dst) ;
+ vec_xst(v_val_col1, 16, dst) ;
+ vec_xst(v_val_col2, 32, dst) ;
+ vec_xst(v_val_col3, 48, dst) ;
+
+ src += 32 ;
+ dst += 32 ;
+
+ } // end for col_iter
+
+ src = next_src ;
+ dst = next_dst ;
+ }
+} // interp_horiz_ps_altivec ()
+
+
+
+// Works with the following values:
+// N = 8
+// width >= 32 (multiple of 32)
+// any height
+template <int N, int width, int height>
+void interp_hv_pp_altivec(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY)
+{
+
+ short immedVals[(64 + 8) * (64 + 8)];
+
+ interp_horiz_ps_altivec<N, width, height>(src, srcStride, immedVals, width, idxX, 1);
+
+ // Original scalar call: filterVertical_sp_c<N>(immedVals + 3 * width, width, dst, dstStride, width, height, idxY);
+ filterVertical_sp_altivec<N,width,height>(immedVals + 3 * width, width, dst, dstStride, idxY);
+}
+
+//ORIGINAL
+#if 0
+// Works with the following values:
+// N = 8
+// width >= 32 (multiple of 32)
+// any height
+template <int N, int width, int height>
+void interp_horiz_pp_altivec(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
+{
+
+ const int16_t* coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
+ int headRoom = IF_FILTER_PREC;
+ int offset = (1 << (headRoom - 1));
+ uint16_t maxVal = (1 << X265_DEPTH) - 1;
+ int cStride = 1;
+
+ src -= (N / 2 - 1) * cStride;
+
+
+ vector signed short v_coeff ;
+ v_coeff = vec_xl(0, coeff) ;
+
+
+ vector unsigned char v_pixel_char_0, v_pixel_char_1, v_pixel_char_2 ;
+ vector signed short v_pixel_short_0, v_pixel_short_1, v_pixel_short_2, v_pixel_short_3, v_pixel_short_4 ;
+ const vector signed short v_mask_unisgned_char_to_short = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ;
+ const vector signed int v_zeros_int = {0, 0, 0, 0} ;
+ const vector signed short v_zeros_short = {0, 0, 0, 0, 0, 0, 0, 0} ;
+
+ vector signed int v_product_0_0, v_product_0_1 ;
+ vector signed int v_product_1_0, v_product_1_1 ;
+ vector signed int v_product_2_0, v_product_2_1 ;
+ vector signed int v_product_3_0, v_product_3_1 ;
+
+ vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ;
+
+ vector signed int v_sums_temp_col0, v_sums_temp_col1, v_sums_temp_col2, v_sums_temp_col3 ;
+ vector signed int v_sums_col0_0, v_sums_col0_1 ;
+ vector signed int v_sums_col1_0, v_sums_col1_1 ;
+ vector signed int v_sums_col2_0, v_sums_col2_1 ;
+ vector signed int v_sums_col3_0, v_sums_col3_1 ;
+
+
+ const vector signed int v_offset = {offset, offset, offset, offset};
+ const vector unsigned int v_headRoom = {headRoom, headRoom, headRoom, headRoom} ;
+
+
+ vector unsigned char v_sums_shamt = {0x20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ;
+
+
+ pixel *next_src ;
+ pixel *next_dst ;
+
+ int row, col;
+ for (row = 0; row < height; row++)
+ {
+ next_src = (pixel *)src + srcStride ;
+ next_dst = (pixel *)dst + dstStride ;
+
+ for(int col_iter=0; col_iter<width; col_iter+=32)
+ {
+
+ // Load a full row of pixels (32 + 7)
+ v_pixel_char_0 = vec_xl(0, src) ;
+ v_pixel_char_1 = vec_xl(16, src) ;
+ v_pixel_char_2 = vec_xl(32, src) ;
+
+
+ v_sums_temp_col0 = v_zeros_int ;
+ v_sums_temp_col1 = v_zeros_int ;
+ v_sums_temp_col2 = v_zeros_int ;
+ v_sums_temp_col3 = v_zeros_int ;
+
+
+ // Expand the loaded pixels into shorts
+ v_pixel_short_0 = vec_unpackh((vector signed char)v_pixel_char_0) ;
+ v_pixel_short_1 = vec_unpackl((vector signed char)v_pixel_char_0) ;
+ v_pixel_short_2 = vec_unpackh((vector signed char)v_pixel_char_1) ;
+ v_pixel_short_3 = vec_unpackl((vector signed char)v_pixel_char_1) ;
+ v_pixel_short_4 = vec_unpackh((vector signed char)v_pixel_char_2) ;
+
+ v_pixel_short_0 = vec_and(v_pixel_short_0, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_1 = vec_and(v_pixel_short_1, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_2 = vec_and(v_pixel_short_2, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_3 = vec_and(v_pixel_short_3, v_mask_unisgned_char_to_short) ;
+ v_pixel_short_4 = vec_and(v_pixel_short_4, v_mask_unisgned_char_to_short) ;
+
+
+
+ // Four column sets are processed below,
+ // one column per set per iteration
+ for(col=0; col < 8; col++)
+ {
+
+ // Multiply the pixels by the coefficients
+ v_product_0_0 = vec_mule(v_pixel_short_0, v_coeff) ;
+ v_product_0_1 = vec_mulo(v_pixel_short_0, v_coeff) ;
+
+ v_product_1_0 = vec_mule(v_pixel_short_1, v_coeff) ;
+ v_product_1_1 = vec_mulo(v_pixel_short_1, v_coeff) ;
+
+ v_product_2_0 = vec_mule(v_pixel_short_2, v_coeff) ;
+ v_product_2_1 = vec_mulo(v_pixel_short_2, v_coeff) ;
+
+ v_product_3_0 = vec_mule(v_pixel_short_3, v_coeff) ;
+ v_product_3_1 = vec_mulo(v_pixel_short_3, v_coeff) ;
+
+
+ // Sum up the multiplication results
+ v_sum_0 = vec_add(v_product_0_0, v_product_0_1) ;
+ v_sum_0 = vec_sums(v_sum_0, v_zeros_int) ;
+
+ v_sum_1 = vec_add(v_product_1_0, v_product_1_1) ;
+ v_sum_1 = vec_sums(v_sum_1, v_zeros_int) ;
+
+ v_sum_2 = vec_add(v_product_2_0, v_product_2_1) ;
+ v_sum_2 = vec_sums(v_sum_2, v_zeros_int) ;
+
+ v_sum_3 = vec_add(v_product_3_0, v_product_3_1) ;
+ v_sum_3 = vec_sums(v_sum_3, v_zeros_int) ;
+
+
+ // Insert the sum results into respective vectors
+ v_sums_temp_col0 = vec_sro(v_sums_temp_col0, v_sums_shamt) ;
+ v_sums_temp_col0 = vec_or(v_sum_0, v_sums_temp_col0) ;
+
+ v_sums_temp_col1 = vec_sro(v_sums_temp_col1, v_sums_shamt) ;
+ v_sums_temp_col1 = vec_or(v_sum_1, v_sums_temp_col1) ;
+
+ v_sums_temp_col2 = vec_sro(v_sums_temp_col2, v_sums_shamt) ;
+ v_sums_temp_col2 = vec_or(v_sum_2, v_sums_temp_col2) ;
+
+ v_sums_temp_col3 = vec_sro(v_sums_temp_col3, v_sums_shamt) ;
+ v_sums_temp_col3 = vec_or(v_sum_3, v_sums_temp_col3) ;
+
+
+ if(col == 3)
+ {
+ v_sums_col0_0 = v_sums_temp_col0 ;
+ v_sums_col1_0 = v_sums_temp_col1 ;
+ v_sums_col2_0 = v_sums_temp_col2 ;
+ v_sums_col3_0 = v_sums_temp_col3 ;
+
+ v_sums_temp_col0 = v_zeros_int ;
+ v_sums_temp_col1 = v_zeros_int ;
+ v_sums_temp_col2 = v_zeros_int ;
+ v_sums_temp_col3 = v_zeros_int ;
+ }
+
+
+ // Shift the pixels by 1 (short pixel)
+ v_pixel_short_0 = vec_sld(v_pixel_short_1, v_pixel_short_0, 14) ;
+ v_pixel_short_1 = vec_sld(v_pixel_short_2, v_pixel_short_1, 14) ;
+ v_pixel_short_2 = vec_sld(v_pixel_short_3, v_pixel_short_2, 14) ;
+ v_pixel_short_3 = vec_sld(v_pixel_short_4, v_pixel_short_3, 14) ;
+ const vector unsigned char v_shift_right_two_bytes_shamt = {0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ;
+ v_pixel_short_4 = vec_sro(v_pixel_short_4, v_shift_right_two_bytes_shamt) ;
+ }
+
+ // Copy the sums result to the second vector (per column)
+ v_sums_col0_1 = v_sums_temp_col0 ;
+ v_sums_col1_1 = v_sums_temp_col1 ;
+ v_sums_col2_1 = v_sums_temp_col2 ;
+ v_sums_col3_1 = v_sums_temp_col3 ;
+
+
+
+ // Post-processing, followed by the two stores
+ // Original code:
+ // int16_t val = (int16_t)((sum + offset) >> headRoom);
+ // if (val < 0) val = 0;
+ // if (val > maxVal) val = maxVal;
+ // dst[col] = (pixel)val;
+
+
+ v_sums_col0_0 = vec_sra(vec_add(v_sums_col0_0, v_offset), v_headRoom) ;
+ v_sums_col0_1 = vec_sra(vec_add(v_sums_col0_1, v_offset), v_headRoom) ;
+ v_sums_col1_0 = vec_sra(vec_add(v_sums_col1_0, v_offset), v_headRoom) ;
+ v_sums_col1_1 = vec_sra(vec_add(v_sums_col1_1, v_offset), v_headRoom) ;
+ v_sums_col2_0 = vec_sra(vec_add(v_sums_col2_0, v_offset), v_headRoom) ;
+ v_sums_col2_1 = vec_sra(vec_add(v_sums_col2_1, v_offset), v_headRoom) ;
+ v_sums_col3_0 = vec_sra(vec_add(v_sums_col3_0, v_offset), v_headRoom) ;
+ v_sums_col3_1 = vec_sra(vec_add(v_sums_col3_1, v_offset), v_headRoom) ;
+
+
+ vector signed short v_val_col0, v_val_col1, v_val_col2, v_val_col3 ;
+ v_val_col0 = vec_pack(v_sums_col0_0, v_sums_col0_1) ;
+ v_val_col1 = vec_pack(v_sums_col1_0, v_sums_col1_1) ;
+ v_val_col2 = vec_pack(v_sums_col2_0, v_sums_col2_1) ;
+ v_val_col3 = vec_pack(v_sums_col3_0, v_sums_col3_1) ;
+
+
+ // if (val < 0) val = 0;
+ vector bool short v_comp_zero_col0, v_comp_zero_col1, v_comp_zero_col2, v_comp_zero_col3 ;
+ // Compute less than 0
+ v_comp_zero_col0 = vec_cmplt(v_val_col0, v_zeros_short) ;
+ v_comp_zero_col1 = vec_cmplt(v_val_col1, v_zeros_short) ;
+ v_comp_zero_col2 = vec_cmplt(v_val_col2, v_zeros_short) ;
+ v_comp_zero_col3 = vec_cmplt(v_val_col3, v_zeros_short) ;
+ // Keep values that are greater or equal to 0
+ v_val_col0 = vec_andc(v_val_col0, v_comp_zero_col0) ;
+ v_val_col1 = vec_andc(v_val_col1, v_comp_zero_col1) ;
+ v_val_col2 = vec_andc(v_val_col2, v_comp_zero_col2) ;
+ v_val_col3 = vec_andc(v_val_col3, v_comp_zero_col3) ;
+
+
+ // if (val > maxVal) val = maxVal;
+ vector bool short v_comp_max_col0, v_comp_max_col1, v_comp_max_col2, v_comp_max_col3 ;
+ const vector signed short v_maxVal = {maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal} ;
+ // Compute greater than max
+ v_comp_max_col0 = vec_cmpgt(v_val_col0, v_maxVal) ;
+ v_comp_max_col1 = vec_cmpgt(v_val_col1, v_maxVal) ;
+ v_comp_max_col2 = vec_cmpgt(v_val_col2, v_maxVal) ;
+ v_comp_max_col3 = vec_cmpgt(v_val_col3, v_maxVal) ;
+ // Replace values greater than maxVal with maxVal
+ v_val_col0 = vec_sel(v_val_col0, v_maxVal, v_comp_max_col0) ;
+ v_val_col1 = vec_sel(v_val_col1, v_maxVal, v_comp_max_col1) ;
+ v_val_col2 = vec_sel(v_val_col2, v_maxVal, v_comp_max_col2) ;
+ v_val_col3 = vec_sel(v_val_col3, v_maxVal, v_comp_max_col3) ;
+
+ // (pixel)val
+ vector unsigned char v_final_result_0, v_final_result_1 ;
+ v_final_result_0 = vec_pack((vector unsigned short)v_val_col0, (vector unsigned short)v_val_col1) ;
+ v_final_result_1 = vec_pack((vector unsigned short)v_val_col2, (vector unsigned short)v_val_col3) ;
+
+
+
+ // Store results
+ vec_xst(v_final_result_0, 0, dst) ;
+ vec_xst(v_final_result_1, 16, dst) ;
+
+
+ src += 32 ;
+ dst += 32 ;
+
+ } // end for col_iter
+
+
+ src = next_src ;
+ dst = next_dst ;
+ }
+} // interp_horiz_pp_altivec()
+#else
+template<int N, int width, int height>
+void interp_horiz_pp_altivec(const pixel* __restrict__ src, intptr_t srcStride, pixel* __restrict__ dst, intptr_t dstStride, int coeffIdx)
+{
+ const int16_t* __restrict__ coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
+ int headRoom = IF_FILTER_PREC;
+ int offset = (1 << (headRoom - 1));
+ uint16_t maxVal = (1 << X265_DEPTH) - 1;
+ int cStride = 1;
+
+ src -= (N / 2 - 1) * cStride;
+
+ vector signed short vcoeff0 = vec_splats(coeff[0]);
+ vector signed short vcoeff1 = vec_splats(coeff[1]);
+ vector signed short vcoeff2 = vec_splats(coeff[2]);
+ vector signed short vcoeff3 = vec_splats(coeff[3]);
+ vector signed short vcoeff4 = vec_splats(coeff[4]);
+ vector signed short vcoeff5 = vec_splats(coeff[5]);
+ vector signed short vcoeff6 = vec_splats(coeff[6]);
+ vector signed short vcoeff7 = vec_splats(coeff[7]);
+ vector signed short voffset = vec_splats((short)offset);
+ vector signed short vheadRoom = vec_splats((short)headRoom);
+ vector signed short vmaxVal = vec_splats((short)maxVal);
+ vector signed short vzero_s16 = vec_splats( (signed short)0u);
+ vector signed int vzero_s32 = vec_splats( (signed int)0u);
+ vector unsigned char vzero_u8 = vec_splats( (unsigned char)0u );
+
+ vector signed short vsrcH, vsrcL, vsumH, vsumL;
+ vector unsigned char vsrc;
+
+ vector signed short vsrc2H, vsrc2L, vsum2H, vsum2L;
+ vector unsigned char vsrc2;
+
+ vector unsigned char vchar_to_short_maskH = {24, 0, 25, 0, 26, 0, 27, 0, 28, 0, 29, 0, 30, 0, 31, 0};
+ vector unsigned char vchar_to_short_maskL = {16, 0, 17, 0, 18, 0, 19, 0, 20, 0, 21, 0, 22, 0, 23, 0};
+
+ const pixel* __restrict__ src2 = src+srcStride;
+ pixel* __restrict__ dst2 = dst+dstStride;
+
+ int row, col;
+ for (row = 0; row < height; row+=2)
+ {
+ for (col = 0; col < width; col+=16)
+ {
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 0*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+
+ vsumH = vsrcH * vcoeff0;
+ vsumL = vsrcL * vcoeff0;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 1*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff1;
+ vsumL += vsrcL * vcoeff1;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 2*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff2;
+ vsumL += vsrcL * vcoeff2;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 3*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff3;
+ vsumL += vsrcL * vcoeff3;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 4*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff4;
+ vsumL += vsrcL * vcoeff4;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 5*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff5;
+ vsumL += vsrcL * vcoeff5;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 6*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff6;
+ vsumL += vsrcL * vcoeff6;
+
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 7*cStride]);
+ vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH );
+ vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL );
+ vsumH += vsrcH * vcoeff7;
+ vsumL += vsrcL * vcoeff7;
+
+ vector short vvalH = (vsumH + voffset) >> vheadRoom;
+ vvalH = vec_max( vvalH, vzero_s16 );
+ vvalH = vec_min( vvalH, vmaxVal );
+
+ vector short vvalL = (vsumL + voffset) >> vheadRoom;
+ vvalL = vec_max( vvalL, vzero_s16 );
+ vvalL = vec_min( vvalL, vmaxVal );
+
+ vector signed char vdst = vec_pack( vvalL, vvalH );
+ vec_xst( vdst, 0, (signed char*)&dst[col] );
+
+
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 0*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+
+ vsum2H = vsrc2H * vcoeff0;
+ vsum2L = vsrc2L * vcoeff0;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 1*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff1;
+ vsum2L += vsrc2L * vcoeff1;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 2*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff2;
+ vsum2L += vsrc2L * vcoeff2;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 3*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff3;
+ vsum2L += vsrc2L * vcoeff3;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 4*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff4;
+ vsum2L += vsrc2L * vcoeff4;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 5*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff5;
+ vsum2L += vsrc2L * vcoeff5;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 6*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff6;
+ vsum2L += vsrc2L * vcoeff6;
+
+ vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 7*cStride]);
+ vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH );
+ vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL );
+ vsum2H += vsrc2H * vcoeff7;
+ vsum2L += vsrc2L * vcoeff7;
+
+ vector short vval2H = (vsum2H + voffset) >> vheadRoom;
+ vval2H = vec_max( vval2H, vzero_s16 );
+ vval2H = vec_min( vval2H, vmaxVal );
+
+ vector short vval2L = (vsum2L + voffset) >> vheadRoom;
+ vval2L = vec_max( vval2L, vzero_s16 );
+ vval2L = vec_min( vval2L, vmaxVal );
+
+ vector signed char vdst2 = vec_pack( vval2L, vval2H );
+ vec_xst( vdst2, 0, (signed char*)&dst2[col] );
+ }
+
+ src += 2*srcStride;
+ dst += 2*dstStride;
+
+ src2 += 2*srcStride;
+ dst2 += 2*dstStride;
+ }
+}
+#endif
+
+
+
+
+namespace X265_NS {
+
+void setupFilterPrimitives_altivec(EncoderPrimitives& p)
+{
+ // interp_vert_pp_c
+ p.pu[LUMA_16x16].luma_vpp = interp_vert_pp_altivec<8, 16, 16> ;
+ p.pu[LUMA_32x8].luma_vpp = interp_vert_pp_altivec<8, 32, 8> ;
+ p.pu[LUMA_16x12].luma_vpp = interp_vert_pp_altivec<8, 16, 12> ;
+ p.pu[LUMA_16x4].luma_vpp = interp_vert_pp_altivec<8, 16, 4> ;
+ p.pu[LUMA_32x32].luma_vpp = interp_vert_pp_altivec<8, 32, 32> ;
+ p.pu[LUMA_32x16].luma_vpp = interp_vert_pp_altivec<8, 32, 16> ;
+ p.pu[LUMA_16x32].luma_vpp = interp_vert_pp_altivec<8, 16, 32> ;
+ p.pu[LUMA_32x24].luma_vpp = interp_vert_pp_altivec<8, 32, 24> ;
+ p.pu[LUMA_64x64].luma_vpp = interp_vert_pp_altivec<8, 64, 64> ;
+ p.pu[LUMA_64x32].luma_vpp = interp_vert_pp_altivec<8, 64, 32> ;
+ p.pu[LUMA_32x64].luma_vpp = interp_vert_pp_altivec<8, 32, 64> ;
+ p.pu[LUMA_64x48].luma_vpp = interp_vert_pp_altivec<8, 64, 48> ;
+ p.pu[LUMA_48x64].luma_vpp = interp_vert_pp_altivec<8, 48, 64> ;
+ p.pu[LUMA_64x16].luma_vpp = interp_vert_pp_altivec<8, 64, 16> ;
+ p.pu[LUMA_16x64].luma_vpp = interp_vert_pp_altivec<8, 16, 64> ;
+
+ // interp_hv_pp_c
+ p.pu[LUMA_32x32].luma_hvpp = interp_hv_pp_altivec<8, 32, 32> ;
+ p.pu[LUMA_32x16].luma_hvpp = interp_hv_pp_altivec<8, 32, 16> ;
+ p.pu[LUMA_32x24].luma_hvpp = interp_hv_pp_altivec<8, 32, 24> ;
+ p.pu[LUMA_32x8].luma_hvpp = interp_hv_pp_altivec<8, 32, 8> ;
+ p.pu[LUMA_64x64].luma_hvpp = interp_hv_pp_altivec<8, 64, 64> ;
+ p.pu[LUMA_64x32].luma_hvpp = interp_hv_pp_altivec<8, 64, 32> ;
+ p.pu[LUMA_32x64].luma_hvpp = interp_hv_pp_altivec<8, 32, 64> ;
+ p.pu[LUMA_64x48].luma_hvpp = interp_hv_pp_altivec<8, 64, 48> ;
+ p.pu[LUMA_64x16].luma_hvpp = interp_hv_pp_altivec<8, 64, 16> ;
+
+ // interp_horiz_pp_c
+ p.pu[LUMA_32x32].luma_hpp = interp_horiz_pp_altivec<8, 32, 32> ;
+ p.pu[LUMA_32x16].luma_hpp = interp_horiz_pp_altivec<8, 32, 16> ;
+ p.pu[LUMA_32x24].luma_hpp = interp_horiz_pp_altivec<8, 32, 24> ;
+ p.pu[LUMA_32x8].luma_hpp = interp_horiz_pp_altivec<8, 32, 8> ;
+ p.pu[LUMA_64x64].luma_hpp = interp_horiz_pp_altivec<8, 64, 64> ;
+ p.pu[LUMA_64x32].luma_hpp = interp_horiz_pp_altivec<8, 64, 32> ;
+ p.pu[LUMA_32x64].luma_hpp = interp_horiz_pp_altivec<8, 32, 64> ;
+ p.pu[LUMA_64x48].luma_hpp = interp_horiz_pp_altivec<8, 64, 48> ;
+ p.pu[LUMA_64x16].luma_hpp = interp_horiz_pp_altivec<8, 64, 16> ;
+}
+
+} // end namespace X265_NS
diff --git a/source/common/ppc/pixel_altivec.cpp b/source/common/ppc/pixel_altivec.cpp
new file mode 100644
index 0000000..0a75e5b
--- /dev/null
+++ b/source/common/ppc/pixel_altivec.cpp
@@ -0,0 +1,4321 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Steve Borho <steve at borho.org>
+ * Mandar Gurav <mandar at multicorewareinc.com>
+ * Mahesh Pittala <mahesh at multicorewareinc.com>
+ * Min Chen <min.chen at multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "common.h"
+#include "primitives.h"
+#include "x265.h"
+#include "ppccommon.h"
+
+#include <cstdlib> // abs()
+
+//using namespace X265_NS;
+
+namespace X265_NS {
+// functions are defined inside the X265_NS namespace
+
+ /* Null vector */
+#define LOAD_ZERO const vec_u8_t zerov = vec_splat_u8( 0 )
+
+#define zero_u8v (vec_u8_t) zerov
+#define zero_s8v (vec_s8_t) zerov
+#define zero_u16v (vec_u16_t) zerov
+#define zero_s16v (vec_s16_t) zerov
+#define zero_u32v (vec_u32_t) zerov
+#define zero_s32v (vec_s32_t) zerov
+
+ /* 8 <-> 16 bits conversions */
+#ifdef WORDS_BIGENDIAN
+#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( zero_u8v, (vec_u8_t) v )
+#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( zero_u8v, (vec_u8_t) v )
+#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( zero_u8v, (vec_u8_t) v )
+#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( zero_u8v, (vec_u8_t) v )
+#else
+#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( (vec_u8_t) v, zero_u8v )
+#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( (vec_u8_t) v, zero_u8v )
+#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( (vec_u8_t) v, zero_u8v )
+#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( (vec_u8_t) v, zero_u8v )
+#endif
+
+#define vec_u8_to_u16(v) vec_u8_to_u16_h(v)
+#define vec_u8_to_s16(v) vec_u8_to_s16_h(v)
+
+#if defined(__GNUC__)
+#define ALIGN_VAR_8(T, var) T var __attribute__((aligned(8)))
+#define ALIGN_VAR_16(T, var) T var __attribute__((aligned(16)))
+#define ALIGN_VAR_32(T, var) T var __attribute__((aligned(32)))
+#elif defined(_MSC_VER)
+#define ALIGN_VAR_8(T, var) __declspec(align(8)) T var
+#define ALIGN_VAR_16(T, var) __declspec(align(16)) T var
+#define ALIGN_VAR_32(T, var) __declspec(align(32)) T var
+#endif // if defined(__GNUC__)
+
+typedef uint8_t pixel;
+typedef uint32_t sum2_t ;
+typedef uint16_t sum_t ;
+#define BITS_PER_SUM (8 * sizeof(sum_t))
+
+/***********************************************************************
+ * SAD routines - altivec implementation
+ **********************************************************************/
+template<int lx, int ly>
+void inline sum_columns_altivec(vec_s32_t sumv, int* sum){}
+
+template<int lx, int ly>
+int inline sad16_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ assert(lx <= 16);
+ LOAD_ZERO;
+ vec_u8_t pix1v, pix2v;
+ vec_u8_t absv = zero_u8v;
+ vec_s32_t sumv = zero_s32v;
+ ALIGN_VAR_16(int, sum );
+
+ for( int y = 0; y < ly; y++ )
+ {
+ pix1v = /*vec_vsx_ld*/vec_xl( 0, pix1);
+ pix2v = /*vec_vsx_ld*/vec_xl( 0, pix2);
+ //print_vec_u8("pix1v", &pix1v);
+ //print_vec_u8("pix2v", &pix2v);
+
+ absv = (vector unsigned char)vec_sub(vec_max(pix1v, pix2v), vec_min(pix1v, pix2v));
+ //print_vec_u8("abs sub", &absv);
+
+ sumv = (vec_s32_t) vec_sum4s( absv, (vec_u32_t) sumv);
+ //print_vec_i("vec_sum4s 0", &sumv);
+
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+ }
+
+ sum_columns_altivec<lx, ly>(sumv, &sum);
+ //printf("<%d %d>%d\n", lx, ly, sum);
+ return sum;
+}
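For reference, the vectorized loop above computes a plain sum of absolute differences over an lx-by-ly block. A scalar sketch of the same computation (the name `sad_ref` is hypothetical, not part of this patch):

```cpp
#include <cstdint>

// Hypothetical scalar reference for sad16_altivec above: sum of absolute
// differences over an lx-by-ly block of 8-bit pixels.
template<int lx, int ly>
int sad_ref(const uint8_t* pix1, intptr_t stride1,
            const uint8_t* pix2, intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
        {
            int d = pix1[x] - pix2[x];
            sum += d < 0 ? -d : d; // per byte, same result as sub(max, min)
        }
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```

The vector version accumulates the same per-row byte differences with `vec_sum4s` and defers the horizontal reduction to `sum_columns_altivec`.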
+
+template<int lx, int ly> //to be implemented later
+int sad16_altivec(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2)
+{
+ int sum = 0;
+ return sum;
+}
+
+template<int lx, int ly>//to be implemented later
+int sad_altivec(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2)
+{
+ int sum = 0;
+ return sum;
+}
+
+template<>
+void inline sum_columns_altivec<16, 4>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<16, 8>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<16, 12>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<16, 16>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<16, 24>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<16, 32>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<16, 48>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<16, 64>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+
+template<>
+void inline sum_columns_altivec<8, 4>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sum2s( sumv, zero_s32v );
+ //print_vec_i("vec_sum2s", &sumv);
+ sumv = vec_splat( sumv, 1 );
+ //print_vec_i("vec_splat 1", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<8, 8>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sum2s( sumv, zero_s32v );
+ //print_vec_i("vec_sum2s", &sumv);
+ sumv = vec_splat( sumv, 1 );
+ //print_vec_i("vec_splat 1", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<8, 16>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sum2s( sumv, zero_s32v );
+ //print_vec_i("vec_sum2s", &sumv);
+ sumv = vec_splat( sumv, 1 );
+ //print_vec_i("vec_splat 1", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<8, 32>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_sum2s( sumv, zero_s32v );
+ //print_vec_i("vec_sum2s", &sumv);
+ sumv = vec_splat( sumv, 1 );
+ //print_vec_i("vec_splat 1", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<4, 4>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_splat( sumv, 0 );
+ //print_vec_i("vec_splat 0", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<4, 8>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_splat( sumv, 0 );
+ //print_vec_i("vec_splat 0", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<4, 16>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ sumv = vec_splat( sumv, 0 );
+ //print_vec_i("vec_splat 0", &sumv);
+ vec_ste( sumv, 0, sum );
+}
+
+template<>
+void inline sum_columns_altivec<12, 16>(vec_s32_t sumv, int* sum)
+{
+ LOAD_ZERO;
+ vec_s32_t sum1v= vec_splat( sumv, 3);
+ sumv = vec_sums( sumv, zero_s32v );
+ //print_vec_i("vec_sums", &sumv);
+ sumv = vec_splat( sumv, 3 );
+ //print_vec_i("vec_splat 3", &sumv);
+ sumv = vec_sub(sumv, sum1v);
+ vec_ste( sumv, 0, sum );
+}
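The &lt;12, 16&gt; specialization above works on a full 16-byte load even though the block is only 12 pixels wide: lane 3 of the 4-lane accumulator holds bytes 12..15, which lie outside the block, so the code totals all four lanes with `vec_sums` and then subtracts the saved lane 3. A scalar sketch of that reduction (the helper name is hypothetical, and it assumes lane 3 holds the out-of-block columns):

```cpp
// Sketch of the horizontal sum in sum_columns_altivec<12, 16>:
// total all four 32-bit lanes, then remove the lane covering the
// 4 bytes loaded past the 12-pixel block width.
static int sum_columns_12(const int lanes[4])
{
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3]; // vec_sums
    return total - lanes[3];                               // vec_sub with saved lane 3
}
```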
+
+template<int lx, int ly>
+int inline sad_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2){ return 0; }
+
+template<>
+int inline sad_altivec<24, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 32>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<8, 32>(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ //printf("<24 32>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<32, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 8>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 8>(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ //printf("<32 8>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<32, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 16>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 16>(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ //printf("<32 16>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<32, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 24>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 24>(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ //printf("<32 24>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<32, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 32>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 32>(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ //printf("<32 32>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<32, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 64>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 64>(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ //printf("<32 64>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<48, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 64>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 64>(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + sad16_altivec<16, 64>(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ //printf("<48 64>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<64, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 16>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 16>(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + sad16_altivec<16, 16>(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + sad16_altivec<16, 16>(pix1+48, stride_pix1, pix2+48, stride_pix2);
+ //printf("<64 16>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<64, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 32>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 32>(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + sad16_altivec<16, 32>(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + sad16_altivec<16, 32>(pix1+48, stride_pix1, pix2+48, stride_pix2);
+ //printf("<64 32>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<64, 48>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 48>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 48>(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + sad16_altivec<16, 48>(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + sad16_altivec<16, 48>(pix1+48, stride_pix1, pix2+48, stride_pix2);
+ //printf("<64 48>%d\n", sum);
+ return sum;
+}
+
+template<>
+int inline sad_altivec<64, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16(int, sum );
+ sum = sad16_altivec<16, 64>(pix1, stride_pix1, pix2, stride_pix2)
+ + sad16_altivec<16, 64>(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + sad16_altivec<16, 64>(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + sad16_altivec<16, 64>(pix1+48, stride_pix1, pix2+48, stride_pix2);
+ //printf("<64 64>%d\n", sum);
+ return sum;
+}
+
+/***********************************************************************
+ * SAD_X3 routines - altivec implementation
+ **********************************************************************/
+template<int lx, int ly>
+void inline sad16_x3_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ res[0] = 0;
+ res[1] = 0;
+ res[2] = 0;
+ assert(lx <= 16);
+ LOAD_ZERO;
+ vec_u8_t pix1v, pix2v, pix3v, pix4v;
+ vec_u8_t absv1_2 = zero_u8v;
+ vec_u8_t absv1_3 = zero_u8v;
+ vec_u8_t absv1_4 = zero_u8v;
+ vec_s32_t sumv0 = zero_s32v;
+ vec_s32_t sumv1 = zero_s32v;
+ vec_s32_t sumv2 = zero_s32v;
+
+ for( int y = 0; y < ly; y++ )
+ {
+ pix1v = vec_xl( 0, pix1); //@@RM vec_vsx_ld( 0, pix1);
+ pix2v = vec_xl( 0, pix2); //@@RM vec_vsx_ld( 0, pix2);
+ pix3v = vec_xl( 0, pix3); //@@RM vec_vsx_ld( 0, pix3);
+ pix4v = vec_xl( 0, pix4); //@@RM vec_vsx_ld( 0, pix4);
+
+ //@@RM : using vec_abs has 2 drawbacks here:
+ //@@RM first, it produces an incorrect result (the bytes would need to be unpacked to 16 bits first)
+ //@@RM second, it is slower than sub(max, min), as noted in Freescale's documentation
+ //@@RM absv = (vector unsigned char)vec_abs((vector signed char)vec_sub(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v));
+ absv1_2 = (vector unsigned char)vec_sub(vec_max(pix1v, pix2v), vec_min(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v));
+ sumv0 = (vec_s32_t) vec_sum4s( absv1_2, (vec_u32_t) sumv0);
+
+ absv1_3 = (vector unsigned char)vec_sub(vec_max(pix1v, pix3v), vec_min(pix1v, pix3v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix3v));
+ sumv1 = (vec_s32_t) vec_sum4s( absv1_3, (vec_u32_t) sumv1);
+
+ absv1_4 = (vector unsigned char)vec_sub(vec_max(pix1v, pix4v), vec_min(pix1v, pix4v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix4v));
+ sumv2 = (vec_s32_t) vec_sum4s( absv1_4, (vec_u32_t) sumv2);
+
+ pix1 += FENC_STRIDE;
+ pix2 += frefstride;
+ pix3 += frefstride;
+ pix4 += frefstride;
+ }
+
+ sum_columns_altivec<lx, ly>(sumv0, res+0);
+ sum_columns_altivec<lx, ly>(sumv1, res+1);
+ sum_columns_altivec<lx, ly>(sumv2, res+2);
+ //printf("<%d %d>%d %d %d\n", lx, ly, res[0], res[1], res[2]);
+}
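The sub(max, min) idiom used throughout these kernels computes |a - b| for unsigned bytes without widening and without the wrap-around that a raw subtract-then-abs would suffer. A scalar sketch of the per-byte operation (the name `absdiff_u8` is hypothetical):

```cpp
#include <cstdint>

// Per-byte equivalent of vec_sub(vec_max(a, b), vec_min(a, b)):
// the larger minus the smaller never wraps in unsigned 8-bit arithmetic.
static uint8_t absdiff_u8(uint8_t a, uint8_t b)
{
    uint8_t hi = a > b ? a : b; // vec_max
    uint8_t lo = a > b ? b : a; // vec_min
    return hi - lo;             // vec_sub, always in range 0..255
}
```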
+
+template<int lx, int ly>
+void inline sad_x3_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res){}
+
+template<>
+void inline sad_x3_altivec<24, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[3];
+ sad16_x3_altivec<16, 32>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<8, 32>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ //printf("<24 32>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<32, 8>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[3];
+ sad16_x3_altivec<16, 8>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 8>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ //printf("<32 8>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<32, 16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[3];
+ sad16_x3_altivec<16, 16>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 16>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ //printf("<32 16>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<32, 24>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[3];
+ sad16_x3_altivec<16, 24>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 24>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ //printf("<32 24>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void sad_x3_altivec<32, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+
+ const int lx = 32 ;
+ const int ly = 32 ;
+
+ vector unsigned int v_zeros = {0, 0, 0, 0} ;
+
+ vector signed short v_results_0 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_1 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_2 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+
+
+ vector signed int v_results_int_0 ;
+ vector signed int v_results_int_1 ;
+ vector signed int v_results_int_2 ;
+
+ vector unsigned char v_pix1 ;
+ vector unsigned char v_pix2 ;
+ vector unsigned char v_pix3 ;
+ vector unsigned char v_pix4 ;
+
+ vector unsigned char v_abs_diff_0 ;
+ vector unsigned char v_abs_diff_1 ;
+ vector unsigned char v_abs_diff_2 ;
+
+ vector signed short v_unpack_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ;
+
+ vector signed short v_short_0_0 , v_short_0_1 ;
+ vector signed short v_short_1_0 , v_short_1_1 ;
+ vector signed short v_short_2_0 , v_short_2_1 ;
+
+ vector signed short v_sum_0 ;
+ vector signed short v_sum_1 ;
+ vector signed short v_sum_2 ;
+
+
+
+ res[0] = 0;
+ res[1] = 0;
+ res[2] = 0;
+ for (int y = 0; y < ly; y++)
+ {
+ for (int x = 0; x < lx; x+=16)
+ {
+ v_pix1 = vec_xl(x, pix1) ;
+
+ // for(int ii=0; ii<16; ii++) { res[0] += abs(pix1[x + ii] - pix2[x + ii]); }
+ v_pix2 = vec_xl(x, pix2) ;
+ v_abs_diff_0 = vec_sub(vec_max(v_pix1, v_pix2), vec_min(v_pix1, v_pix2)) ;
+ v_short_0_0 = vec_unpackh((vector signed char)v_abs_diff_0) ;
+ v_short_0_0 = vec_and(v_short_0_0, v_unpack_mask) ;
+ v_short_0_1 = vec_unpackl((vector signed char)v_abs_diff_0) ;
+ v_short_0_1 = vec_and(v_short_0_1, v_unpack_mask) ;
+ v_sum_0 = vec_add(v_short_0_0, v_short_0_1) ;
+ v_results_0 = vec_add(v_results_0, v_sum_0) ;
+
+ // for(int ii=0; ii<16; ii++) { res[1] += abs(pix1[x + ii] - pix3[x + ii]); }
+ v_pix3 = vec_xl(x, pix3) ;
+ v_abs_diff_1 = vec_sub(vec_max(v_pix1, v_pix3), vec_min(v_pix1, v_pix3)) ;
+ v_short_1_0 = vec_unpackh((vector signed char)v_abs_diff_1) ;
+ v_short_1_0 = vec_and(v_short_1_0, v_unpack_mask) ;
+ v_short_1_1 = vec_unpackl((vector signed char)v_abs_diff_1) ;
+ v_short_1_1 = vec_and(v_short_1_1, v_unpack_mask) ;
+ v_sum_1 = vec_add(v_short_1_0, v_short_1_1) ;
+ v_results_1 = vec_add(v_results_1, v_sum_1) ;
+
+
+ // for(int ii=0; ii<16; ii++) { res[2] += abs(pix1[x + ii] - pix4[x + ii]); }
+ v_pix4 = vec_xl(x, pix4) ;
+ v_abs_diff_2 = vec_sub(vec_max(v_pix1, v_pix4), vec_min(v_pix1, v_pix4)) ;
+ v_short_2_0 = vec_unpackh((vector signed char)v_abs_diff_2) ;
+ v_short_2_0 = vec_and(v_short_2_0, v_unpack_mask) ;
+ v_short_2_1 = vec_unpackl((vector signed char)v_abs_diff_2) ;
+ v_short_2_1 = vec_and(v_short_2_1, v_unpack_mask) ;
+ v_sum_2 = vec_add(v_short_2_0, v_short_2_1) ;
+ v_results_2 = vec_add(v_results_2, v_sum_2) ;
+
+ }
+
+ pix1 += FENC_STRIDE;
+ pix2 += frefstride;
+ pix3 += frefstride;
+ pix4 += frefstride;
+ }
+
+
+ v_results_int_0 = vec_sum4s((vector signed short)v_results_0, (vector signed int)v_zeros) ;
+ v_results_int_0 = vec_sums(v_results_int_0, (vector signed int)v_zeros) ;
+ res[0] = v_results_int_0[3] ;
+
+
+ v_results_int_1 = vec_sum4s((vector signed short)v_results_1, (vector signed int)v_zeros) ;
+ v_results_int_1 = vec_sums(v_results_int_1, (vector signed int)v_zeros) ;
+ res[1] = v_results_int_1[3] ;
+
+
+ v_results_int_2 = vec_sum4s((vector signed short)v_results_2, (vector signed int)v_zeros) ;
+ v_results_int_2 = vec_sums(v_results_int_2, (vector signed int)v_zeros) ;
+ res[2] = v_results_int_2[3] ;
+
+ //printf("<32 32>%d %d %d\n", res[0], res[1], res[2]);
+
+} // end sad_x3_altivec
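The &lt;32, 32&gt; kernel above widens the unsigned byte differences to 16 bits via `vec_unpackh`/`vec_unpackl`, which sign-extend, and then masks with 0x00FF to turn the sign extension into the zero extension actually wanted. A scalar sketch of that widening step (the helper name is hypothetical):

```cpp
#include <cstdint>

// Scalar equivalent of the unpack-then-mask widening in the kernel:
// sign-extend the byte as vec_unpackh/l does, then clear the high
// byte with the 0x00FF mask to recover the unsigned value.
static int16_t widen_u8(uint8_t b)
{
    int16_t s = (int16_t)(int8_t)b; // vec_unpackh: sign-extends
    return s & 0x00FF;              // vec_and with v_unpack_mask: zero-extends
}
```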
+
+template<>
+void inline sad_x3_altivec<32, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[3];
+ sad16_x3_altivec<16, 64>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ //printf("<32 64>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<48, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[6];
+ sad16_x3_altivec<16, 64>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3);
+ sad16_x3_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, res);
+ res[0] = sum[0]+sum[3]+res[0];
+ res[1] = sum[1]+sum[4]+res[1];
+ res[2] = sum[2]+sum[5]+res[2];
+ //printf("<48 64>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<64, 16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[9];
+ sad16_x3_altivec<16, 16>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 16>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3);
+ sad16_x3_altivec<16, 16>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6);
+ sad16_x3_altivec<16, 16>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res);
+ res[0] = sum[0]+sum[3]+sum[6]+res[0];
+ res[1] = sum[1]+sum[4]+sum[7]+res[1];
+ res[2] = sum[2]+sum[5]+sum[8]+res[2];
+ //printf("<64 16>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<64, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[9];
+ sad16_x3_altivec<16, 32>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 32>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3);
+ sad16_x3_altivec<16, 32>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6);
+ sad16_x3_altivec<16, 32>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res);
+ res[0] = sum[0]+sum[3]+sum[6]+res[0];
+ res[1] = sum[1]+sum[4]+sum[7]+res[1];
+ res[2] = sum[2]+sum[5]+sum[8]+res[2];
+ //printf("<64 32>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<64, 48>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[9];
+ sad16_x3_altivec<16, 48>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 48>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3);
+ sad16_x3_altivec<16, 48>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6);
+ sad16_x3_altivec<16, 48>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res);
+ res[0] = sum[0]+sum[3]+sum[6]+res[0];
+ res[1] = sum[1]+sum[4]+sum[7]+res[1];
+ res[2] = sum[2]+sum[5]+sum[8]+res[2];
+ //printf("<64 48>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+template<>
+void inline sad_x3_altivec<64, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[9];
+ sad16_x3_altivec<16, 64>(pix1, pix2, pix3, pix4, frefstride, sum);
+ sad16_x3_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3);
+ sad16_x3_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6);
+ sad16_x3_altivec<16, 64>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res);
+ res[0] = sum[0]+sum[3]+sum[6]+res[0];
+ res[1] = sum[1]+sum[4]+sum[7]+res[1];
+ res[2] = sum[2]+sum[5]+sum[8]+res[2];
+ //printf("<64 64>%d %d %d\n", res[0], res[1], res[2]);
+}
+
+/***********************************************************************
+ * SAD_X4 routines - altivec implementation
+ **********************************************************************/
+template<int lx, int ly>
+void inline sad16_x4_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ res[0] = 0;
+ res[1] = 0;
+ res[2] = 0;
+ res[3] = 0;
+ assert(lx <= 16);
+ LOAD_ZERO;
+ vec_u8_t pix1v, pix2v, pix3v, pix4v, pix5v;
+ vec_u8_t absv1_2 = zero_u8v;
+ vec_u8_t absv1_3 = zero_u8v;
+ vec_u8_t absv1_4 = zero_u8v;
+ vec_u8_t absv1_5 = zero_u8v;
+ vec_s32_t sumv0 = zero_s32v;
+ vec_s32_t sumv1 = zero_s32v;
+ vec_s32_t sumv2 = zero_s32v;
+ vec_s32_t sumv3 = zero_s32v;
+
+ for( int y = 0; y < ly; y++ )
+ {
+ pix1v = vec_xl( 0, pix1); //@@RM vec_vsx_ld( 0, pix1);
+ pix2v = vec_xl( 0, pix2); //@@RM vec_vsx_ld( 0, pix2);
+ pix3v = vec_xl( 0, pix3); //@@RM vec_vsx_ld( 0, pix3);
+ pix4v = vec_xl( 0, pix4); //@@RM vec_vsx_ld( 0, pix4);
+ pix5v = vec_xl( 0, pix5); //@@RM vec_vsx_ld( 0, pix5);
+
+ //@@RM : using vec_abs has 2 drawbacks here:
+ //@@RM first, it produces an incorrect result (the bytes would need to be unpacked to 16 bits first)
+ //@@RM second, it is slower than sub(max, min), as noted in Freescale's documentation
+ //@@RM absv = (vector unsigned char)vec_abs((vector signed char)vec_sub(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v));
+ absv1_2 = (vector unsigned char)vec_sub(vec_max(pix1v, pix2v), vec_min(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v));
+ sumv0 = (vec_s32_t) vec_sum4s( absv1_2, (vec_u32_t) sumv0);
+
+ absv1_3 = (vector unsigned char)vec_sub(vec_max(pix1v, pix3v), vec_min(pix1v, pix3v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix3v));
+ sumv1 = (vec_s32_t) vec_sum4s( absv1_3, (vec_u32_t) sumv1);
+
+ absv1_4 = (vector unsigned char)vec_sub(vec_max(pix1v, pix4v), vec_min(pix1v, pix4v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix4v));
+ sumv2 = (vec_s32_t) vec_sum4s( absv1_4, (vec_u32_t) sumv2);
+
+ absv1_5 = (vector unsigned char)vec_sub(vec_max(pix1v, pix5v), vec_min(pix1v, pix5v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix5v));
+ sumv3 = (vec_s32_t) vec_sum4s( absv1_5, (vec_u32_t) sumv3);
+
+ pix1 += FENC_STRIDE;
+ pix2 += frefstride;
+ pix3 += frefstride;
+ pix4 += frefstride;
+ pix5 += frefstride;
+ }
+
+ sum_columns_altivec<lx, ly>(sumv0, res+0);
+ sum_columns_altivec<lx, ly>(sumv1, res+1);
+ sum_columns_altivec<lx, ly>(sumv2, res+2);
+ sum_columns_altivec<lx, ly>(sumv3, res+3);
+ //printf("<%d %d>%d %d %d %d\n", lx, ly, res[0], res[1], res[2], res[3]);
+}
+
+template<int lx, int ly>
+void inline sad_x4_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res){}
+
+
+template<>
+void inline sad_x4_altivec<24, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[4];
+ sad16_x4_altivec<16, 32>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<8, 32>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ res[3] += sum[3];
+ //printf("<24 32>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void inline sad_x4_altivec<32, 8>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[4];
+ sad16_x4_altivec<16, 8>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 8>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ res[3] += sum[3];
+ //printf("<32 8>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void sad_x4_altivec<32,16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+
+ const int lx = 32 ;
+ const int ly = 16 ;
+
+ vector unsigned int v_zeros = {0, 0, 0, 0} ;
+
+ vector signed short v_results_0 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_1 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_2 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_3 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+
+
+ vector signed int v_results_int_0 ;
+ vector signed int v_results_int_1 ;
+ vector signed int v_results_int_2 ;
+ vector signed int v_results_int_3 ;
+
+ vector unsigned char v_pix1 ;
+ vector unsigned char v_pix2 ;
+ vector unsigned char v_pix3 ;
+ vector unsigned char v_pix4 ;
+ vector unsigned char v_pix5 ;
+
+ vector unsigned char v_abs_diff_0 ;
+ vector unsigned char v_abs_diff_1 ;
+ vector unsigned char v_abs_diff_2 ;
+ vector unsigned char v_abs_diff_3 ;
+
+ vector signed short v_unpack_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ;
+
+ vector signed short v_short_0_0 , v_short_0_1 ;
+ vector signed short v_short_1_0 , v_short_1_1 ;
+ vector signed short v_short_2_0 , v_short_2_1 ;
+ vector signed short v_short_3_0 , v_short_3_1 ;
+
+ vector signed short v_sum_0 ;
+ vector signed short v_sum_1 ;
+ vector signed short v_sum_2 ;
+ vector signed short v_sum_3 ;
+
+
+ res[0] = 0;
+ res[1] = 0;
+ res[2] = 0;
+ res[3] = 0;
+ for (int y = 0; y < ly; y++)
+ {
+ for (int x = 0; x < lx; x+=16)
+ {
+ v_pix1 = vec_xl(x, pix1) ;
+
+ // for(int ii=0; ii<16; ii++) { res[0] += abs(pix1[x + ii] - pix2[x + ii]); }
+ v_pix2 = vec_xl(x, pix2) ;
+ v_abs_diff_0 = vec_sub(vec_max(v_pix1, v_pix2), vec_min(v_pix1, v_pix2)) ;
+ v_short_0_0 = vec_unpackh((vector signed char)v_abs_diff_0) ;
+ v_short_0_0 = vec_and(v_short_0_0, v_unpack_mask) ;
+ v_short_0_1 = vec_unpackl((vector signed char)v_abs_diff_0) ;
+ v_short_0_1 = vec_and(v_short_0_1, v_unpack_mask) ;
+ v_sum_0 = vec_add(v_short_0_0, v_short_0_1) ;
+ v_results_0 = vec_add(v_results_0, v_sum_0) ;
+
+ // for(int ii=0; ii<16; ii++) { res[1] += abs(pix1[x + ii] - pix3[x + ii]); }
+ v_pix3 = vec_xl(x, pix3) ;
+ v_abs_diff_1 = vec_sub(vec_max(v_pix1, v_pix3), vec_min(v_pix1, v_pix3)) ;
+ v_short_1_0 = vec_unpackh((vector signed char)v_abs_diff_1) ;
+ v_short_1_0 = vec_and(v_short_1_0, v_unpack_mask) ;
+ v_short_1_1 = vec_unpackl((vector signed char)v_abs_diff_1) ;
+ v_short_1_1 = vec_and(v_short_1_1, v_unpack_mask) ;
+ v_sum_1 = vec_add(v_short_1_0, v_short_1_1) ;
+ v_results_1 = vec_add(v_results_1, v_sum_1) ;
+
+
+ // for(int ii=0; ii<16; ii++) { res[2] += abs(pix1[x + ii] - pix4[x + ii]); }
+ v_pix4 = vec_xl(x, pix4) ;
+ v_abs_diff_2 = vec_sub(vec_max(v_pix1, v_pix4), vec_min(v_pix1, v_pix4)) ;
+ v_short_2_0 = vec_unpackh((vector signed char)v_abs_diff_2) ;
+ v_short_2_0 = vec_and(v_short_2_0, v_unpack_mask) ;
+ v_short_2_1 = vec_unpackl((vector signed char)v_abs_diff_2) ;
+ v_short_2_1 = vec_and(v_short_2_1, v_unpack_mask) ;
+ v_sum_2 = vec_add(v_short_2_0, v_short_2_1) ;
+ v_results_2 = vec_add(v_results_2, v_sum_2) ;
+
+
+ // for(int ii=0; ii<16; ii++) { res[3] += abs(pix1[x + ii] - pix5[x + ii]); }
+ v_pix5 = vec_xl(x, pix5) ;
+ v_abs_diff_3 = vec_sub(vec_max(v_pix1, v_pix5), vec_min(v_pix1, v_pix5)) ;
+ v_short_3_0 = vec_unpackh((vector signed char)v_abs_diff_3) ;
+ v_short_3_0 = vec_and(v_short_3_0, v_unpack_mask) ;
+ v_short_3_1 = vec_unpackl((vector signed char)v_abs_diff_3) ;
+ v_short_3_1 = vec_and(v_short_3_1, v_unpack_mask) ;
+ v_sum_3 = vec_add(v_short_3_0, v_short_3_1) ;
+ v_results_3 = vec_add(v_results_3, v_sum_3) ;
+ }
+
+ pix1 += FENC_STRIDE;
+ pix2 += frefstride;
+ pix3 += frefstride;
+ pix4 += frefstride;
+ pix5 += frefstride;
+ }
+
+
+ v_results_int_0 = vec_sum4s((vector signed short)v_results_0, (vector signed int)v_zeros) ;
+ v_results_int_0 = vec_sums(v_results_int_0, (vector signed int)v_zeros) ;
+ res[0] = v_results_int_0[3] ;
+
+
+ v_results_int_1 = vec_sum4s((vector signed short)v_results_1, (vector signed int)v_zeros) ;
+ v_results_int_1 = vec_sums(v_results_int_1, (vector signed int)v_zeros) ;
+ res[1] = v_results_int_1[3] ;
+
+
+ v_results_int_2 = vec_sum4s((vector signed short)v_results_2, (vector signed int)v_zeros) ;
+ v_results_int_2 = vec_sums(v_results_int_2, (vector signed int)v_zeros) ;
+ res[2] = v_results_int_2[3] ;
+
+
+ v_results_int_3 = vec_sum4s((vector signed short)v_results_3, (vector signed int)v_zeros) ;
+ v_results_int_3 = vec_sums(v_results_int_3, (vector signed int)v_zeros) ;
+ res[3] = v_results_int_3[3] ;
+ //printf("<32 16>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+} // end sad_x4_altivec
+
+template<>
+void inline sad_x4_altivec<32, 24>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[4];
+ sad16_x4_altivec<16, 24>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 24>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ res[3] += sum[3];
+ //printf("<32 24>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void sad_x4_altivec<32,32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+
+ const int lx = 32 ;
+ const int ly = 32 ;
+
+ vector unsigned int v_zeros = {0, 0, 0, 0} ;
+
+ vector signed short v_results_0 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_1 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_2 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+ vector signed short v_results_3 = {0, 0, 0, 0, 0, 0, 0, 0} ;
+
+
+ vector signed int v_results_int_0 ;
+ vector signed int v_results_int_1 ;
+ vector signed int v_results_int_2 ;
+ vector signed int v_results_int_3 ;
+
+ vector unsigned char v_pix1 ;
+ vector unsigned char v_pix2 ;
+ vector unsigned char v_pix3 ;
+ vector unsigned char v_pix4 ;
+ vector unsigned char v_pix5 ;
+
+ vector unsigned char v_abs_diff_0 ;
+ vector unsigned char v_abs_diff_1 ;
+ vector unsigned char v_abs_diff_2 ;
+ vector unsigned char v_abs_diff_3 ;
+
+ vector signed short v_unpack_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ;
+
+ vector signed short v_short_0_0 , v_short_0_1 ;
+ vector signed short v_short_1_0 , v_short_1_1 ;
+ vector signed short v_short_2_0 , v_short_2_1 ;
+ vector signed short v_short_3_0 , v_short_3_1 ;
+
+ vector signed short v_sum_0 ;
+ vector signed short v_sum_1 ;
+ vector signed short v_sum_2 ;
+ vector signed short v_sum_3 ;
+
+
+ res[0] = 0;
+ res[1] = 0;
+ res[2] = 0;
+ res[3] = 0;
+ for (int y = 0; y < ly; y++)
+ {
+ for (int x = 0; x < lx; x+=16)
+ {
+ v_pix1 = vec_xl(x, pix1) ;
+
+ // for(int ii=0; ii<16; ii++) { res[0] += abs(pix1[x + ii] - pix2[x + ii]); }
+ v_pix2 = vec_xl(x, pix2) ;
+ v_abs_diff_0 = vec_sub(vec_max(v_pix1, v_pix2), vec_min(v_pix1, v_pix2)) ;
+ v_short_0_0 = vec_unpackh((vector signed char)v_abs_diff_0) ;
+ v_short_0_0 = vec_and(v_short_0_0, v_unpack_mask) ;
+ v_short_0_1 = vec_unpackl((vector signed char)v_abs_diff_0) ;
+ v_short_0_1 = vec_and(v_short_0_1, v_unpack_mask) ;
+ v_sum_0 = vec_add(v_short_0_0, v_short_0_1) ;
+ v_results_0 = vec_add(v_results_0, v_sum_0) ;
+
+ // for(int ii=0; ii<16; ii++) { res[1] += abs(pix1[x + ii] - pix3[x + ii]); }
+ v_pix3 = vec_xl(x, pix3) ;
+ v_abs_diff_1 = vec_sub(vec_max(v_pix1, v_pix3), vec_min(v_pix1, v_pix3)) ;
+ v_short_1_0 = vec_unpackh((vector signed char)v_abs_diff_1) ;
+ v_short_1_0 = vec_and(v_short_1_0, v_unpack_mask) ;
+ v_short_1_1 = vec_unpackl((vector signed char)v_abs_diff_1) ;
+ v_short_1_1 = vec_and(v_short_1_1, v_unpack_mask) ;
+ v_sum_1 = vec_add(v_short_1_0, v_short_1_1) ;
+ v_results_1 = vec_add(v_results_1, v_sum_1) ;
+
+
+ // for(int ii=0; ii<16; ii++) { res[2] += abs(pix1[x + ii] - pix4[x + ii]); }
+ v_pix4 = vec_xl(x, pix4) ;
+ v_abs_diff_2 = vec_sub(vec_max(v_pix1, v_pix4), vec_min(v_pix1, v_pix4)) ;
+ v_short_2_0 = vec_unpackh((vector signed char)v_abs_diff_2) ;
+ v_short_2_0 = vec_and(v_short_2_0, v_unpack_mask) ;
+ v_short_2_1 = vec_unpackl((vector signed char)v_abs_diff_2) ;
+ v_short_2_1 = vec_and(v_short_2_1, v_unpack_mask) ;
+ v_sum_2 = vec_add(v_short_2_0, v_short_2_1) ;
+ v_results_2 = vec_add(v_results_2, v_sum_2) ;
+
+
+ // for(int ii=0; ii<16; ii++) { res[3] += abs(pix1[x + ii] - pix5[x + ii]); }
+ v_pix5 = vec_xl(x, pix5) ;
+ v_abs_diff_3 = vec_sub(vec_max(v_pix1, v_pix5), vec_min(v_pix1, v_pix5)) ;
+ v_short_3_0 = vec_unpackh((vector signed char)v_abs_diff_3) ;
+ v_short_3_0 = vec_and(v_short_3_0, v_unpack_mask) ;
+ v_short_3_1 = vec_unpackl((vector signed char)v_abs_diff_3) ;
+ v_short_3_1 = vec_and(v_short_3_1, v_unpack_mask) ;
+ v_sum_3 = vec_add(v_short_3_0, v_short_3_1) ;
+ v_results_3 = vec_add(v_results_3, v_sum_3) ;
+ }
+
+ pix1 += FENC_STRIDE;
+ pix2 += frefstride;
+ pix3 += frefstride;
+ pix4 += frefstride;
+ pix5 += frefstride;
+ }
+
+
+ v_results_int_0 = vec_sum4s((vector signed short)v_results_0, (vector signed int)v_zeros) ;
+ v_results_int_0 = vec_sums(v_results_int_0, (vector signed int)v_zeros) ;
+ res[0] = v_results_int_0[3] ;
+
+
+ v_results_int_1 = vec_sum4s((vector signed short)v_results_1, (vector signed int)v_zeros) ;
+ v_results_int_1 = vec_sums(v_results_int_1, (vector signed int)v_zeros) ;
+ res[1] = v_results_int_1[3] ;
+
+
+ v_results_int_2 = vec_sum4s((vector signed short)v_results_2, (vector signed int)v_zeros) ;
+ v_results_int_2 = vec_sums(v_results_int_2, (vector signed int)v_zeros) ;
+ res[2] = v_results_int_2[3] ;
+
+
+ v_results_int_3 = vec_sum4s((vector signed short)v_results_3, (vector signed int)v_zeros) ;
+ v_results_int_3 = vec_sums(v_results_int_3, (vector signed int)v_zeros) ;
+ res[3] = v_results_int_3[3] ;
+
+} // end sad_x4_altivec
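The vectorized loop above computes four SADs of one encoder block against four reference blocks in a single pass. As a reading aid, here is a minimal scalar sketch of the same computation; `sad_x4_ref` and the explicit `fencstride` parameter are names introduced here (x265 itself hard-codes the source stride as `FENC_STRIDE`), so this is an illustrative reference, not the library's API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Scalar reference for sad_x4: accumulate |pix1 - pixN| over an lx-by-ly
// block for four candidate reference blocks that share frefstride. The
// AltiVec routine produces the same four totals, 16 pixels per iteration.
template<int lx, int ly>
void sad_x4_ref(const uint8_t* pix1, intptr_t fencstride,
                const uint8_t* pix2, const uint8_t* pix3,
                const uint8_t* pix4, const uint8_t* pix5,
                intptr_t frefstride, int32_t* res)
{
    res[0] = res[1] = res[2] = res[3] = 0;
    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
        {
            res[0] += std::abs(pix1[x] - pix2[x]);
            res[1] += std::abs(pix1[x] - pix3[x]);
            res[2] += std::abs(pix1[x] - pix4[x]);
            res[3] += std::abs(pix1[x] - pix5[x]);
        }
        pix1 += fencstride;   // source advances by its own stride
        pix2 += frefstride;   // all four references share one stride
        pix3 += frefstride;
        pix4 += frefstride;
        pix5 += frefstride;
    }
}
```

This also explains the wide-block specializations below: a 48- or 64-pixel-wide SAD is just the sum of adjacent 16-wide column SADs, which is exactly how the `sum[]`/`res[]` accumulation combines the `sad16_x4_altivec<16, ...>` calls.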
+
+template<>
+void inline sad_x4_altivec<32, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[4];
+ sad16_x4_altivec<16, 64>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res);
+ res[0] += sum[0];
+ res[1] += sum[1];
+ res[2] += sum[2];
+ res[3] += sum[3];
+}
+
+template<>
+void inline sad_x4_altivec<48, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[8];
+ sad16_x4_altivec<16, 64>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+ sad16_x4_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, res);
+ res[0] = sum[0]+sum[4]+res[0];
+ res[1] = sum[1]+sum[5]+res[1];
+ res[2] = sum[2]+sum[6]+res[2];
+ res[3] = sum[3]+sum[7]+res[3];
+}
+
+template<>
+void inline sad_x4_altivec<64, 16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[12];
+ sad16_x4_altivec<16, 16>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 16>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+ sad16_x4_altivec<16, 16>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+ sad16_x4_altivec<16, 16>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+ res[0] = sum[0]+sum[4]+sum[8]+res[0];
+ res[1] = sum[1]+sum[5]+sum[9]+res[1];
+ res[2] = sum[2]+sum[6]+sum[10]+res[2];
+ res[3] = sum[3]+sum[7]+sum[11]+res[3];
+}
+
+template<>
+void inline sad_x4_altivec<64, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[12];
+ sad16_x4_altivec<16, 32>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 32>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+ sad16_x4_altivec<16, 32>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+ sad16_x4_altivec<16, 32>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+ res[0] = sum[0]+sum[4]+sum[8]+res[0];
+ res[1] = sum[1]+sum[5]+sum[9]+res[1];
+ res[2] = sum[2]+sum[6]+sum[10]+res[2];
+ res[3] = sum[3]+sum[7]+sum[11]+res[3];
+}
+
+template<>
+void inline sad_x4_altivec<64, 48>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[12];
+ sad16_x4_altivec<16, 48>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 48>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+ sad16_x4_altivec<16, 48>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+ sad16_x4_altivec<16, 48>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+ res[0] = sum[0]+sum[4]+sum[8]+res[0];
+ res[1] = sum[1]+sum[5]+sum[9]+res[1];
+ res[2] = sum[2]+sum[6]+sum[10]+res[2];
+ res[3] = sum[3]+sum[7]+sum[11]+res[3];
+}
+
+template<>
+void inline sad_x4_altivec<64, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+ int32_t sum[12];
+ sad16_x4_altivec<16, 64>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+ sad16_x4_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+ sad16_x4_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+ sad16_x4_altivec<16, 64>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+ res[0] = sum[0]+sum[4]+sum[8]+res[0];
+ res[1] = sum[1]+sum[5]+sum[9]+res[1];
+ res[2] = sum[2]+sum[6]+sum[10]+res[2];
+ res[3] = sum[3]+sum[7]+sum[11]+res[3];
+}
+
+
+/***********************************************************************
+ * SATD routines - altivec implementation
+ **********************************************************************/
+#define HADAMARD4_VEC(s0, s1, s2, s3, d0, d1, d2, d3) \
+{\
+ vec_s16_t t0, t1, t2, t3;\
+ t0 = vec_add(s0, s1);\
+ t1 = vec_sub(s0, s1);\
+ t2 = vec_add(s2, s3);\
+ t3 = vec_sub(s2, s3);\
+ d0 = vec_add(t0, t2);\
+ d2 = vec_sub(t0, t2);\
+ d1 = vec_add(t1, t3);\
+ d3 = vec_sub(t1, t3);\
+}
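`HADAMARD4_VEC` applies the same 4-point Hadamard butterfly to every 16-bit lane of its four input vectors. A one-lane scalar sketch (the function name `hadamard4` is chosen here to mirror x264/x265's scalar macro, but this standalone form is illustrative):

```cpp
#include <cassert>

// One lane of HADAMARD4_VEC: a 4-point Hadamard butterfly. The outputs are
// the products of the order-4 Hadamard matrix with (s0, s1, s2, s3), up to
// the row ordering used by x264/x265.
void hadamard4(int s0, int s1, int s2, int s3,
               int& d0, int& d1, int& d2, int& d3)
{
    int t0 = s0 + s1;
    int t1 = s0 - s1;
    int t2 = s2 + s3;
    int t3 = s2 - s3;
    d0 = t0 + t2;
    d2 = t0 - t2;
    d1 = t1 + t3;
    d3 = t1 - t3;
}
```

Applying it once per row, transposing, then once more per column (as `satd_4x4_altivec` does with `VEC_TRANSPOSE_4` in between) yields the full 2-D 4x4 Hadamard transform of the residual.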
+
+#define VEC_TRANSPOSE_4(a0,a1,a2,a3,b0,b1,b2,b3) \
+ b0 = vec_mergeh( a0, a0 ); \
+ b1 = vec_mergeh( a1, a0 ); \
+ b2 = vec_mergeh( a2, a0 ); \
+ b3 = vec_mergeh( a3, a0 ); \
+ a0 = vec_mergeh( b0, b2 ); \
+ a1 = vec_mergel( b0, b2 ); \
+ a2 = vec_mergeh( b1, b3 ); \
+ a3 = vec_mergel( b1, b3 ); \
+ b0 = vec_mergeh( a0, a2 ); \
+ b1 = vec_mergel( a0, a2 ); \
+ b2 = vec_mergeh( a1, a3 ); \
+ b3 = vec_mergel( a1, a3 )
+
+#define VEC_TRANSPOSE_8(a0,a1,a2,a3,a4,a5,a6,a7,b0,b1,b2,b3,b4,b5,b6,b7) \
+ b0 = vec_mergeh( a0, a4 ); \
+ b1 = vec_mergel( a0, a4 ); \
+ b2 = vec_mergeh( a1, a5 ); \
+ b3 = vec_mergel( a1, a5 ); \
+ b4 = vec_mergeh( a2, a6 ); \
+ b5 = vec_mergel( a2, a6 ); \
+ b6 = vec_mergeh( a3, a7 ); \
+ b7 = vec_mergel( a3, a7 ); \
+ a0 = vec_mergeh( b0, b4 ); \
+ a1 = vec_mergel( b0, b4 ); \
+ a2 = vec_mergeh( b1, b5 ); \
+ a3 = vec_mergel( b1, b5 ); \
+ a4 = vec_mergeh( b2, b6 ); \
+ a5 = vec_mergel( b2, b6 ); \
+ a6 = vec_mergeh( b3, b7 ); \
+ a7 = vec_mergel( b3, b7 ); \
+ b0 = vec_mergeh( a0, a4 ); \
+ b1 = vec_mergel( a0, a4 ); \
+ b2 = vec_mergeh( a1, a5 ); \
+ b3 = vec_mergel( a1, a5 ); \
+ b4 = vec_mergeh( a2, a6 ); \
+ b5 = vec_mergel( a2, a6 ); \
+ b6 = vec_mergeh( a3, a7 ); \
+ b7 = vec_mergel( a3, a7 )
+
+int satd_4x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diff0v, diff1v, diff2v, diff3v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v;
+ vec_s32_t satdv, satdv1, satdv2, satdv3;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ /* Hadamard H */
+ HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+ VEC_TRANSPOSE_4( temp0v, temp1v, temp2v, temp3v, diff0v, diff1v, diff2v, diff3v );
+ /* Hadamard V */
+ HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1 = vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2 = vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3 = vec_sum4s( temp3v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv += satdv2;
+
+ satdv = vec_sum2s( satdv, zero_s32v );
+ sum = vec_extract(satdv, 1);
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ satdv = vec_sum2s( satdv, zero_s32v );
+ sum = vec_extract(satdv, 1);
+#endif
+ return sum >> 1;
+}
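For reference, the whole routine above is equivalent to this scalar SATD sketch: form the residual, Hadamard-transform rows then columns, sum absolute coefficients, and halve (the vector code's final `sum >> 1`). `satd_4x4_ref` and `hadamard4_row` are names invented for this sketch; the library's scalar path uses packed 16-bit pairs instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// In-place 4-point Hadamard butterfly on one row of four coefficients.
static void hadamard4_row(int* v)
{
    int t0 = v[0] + v[1], t1 = v[0] - v[1];
    int t2 = v[2] + v[3], t3 = v[2] - v[3];
    v[0] = t0 + t2; v[2] = t0 - t2;
    v[1] = t1 + t3; v[3] = t1 - t3;
}

// Scalar reference for satd_4x4: 2-D Hadamard of the residual block,
// then the sum of absolute transform coefficients, halved.
int satd_4x4_ref(const uint8_t* pix1, intptr_t stride1,
                 const uint8_t* pix2, intptr_t stride2)
{
    int d[4][4];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            d[i][j] = pix1[i * stride1 + j] - pix2[i * stride2 + j];

    for (int i = 0; i < 4; i++)          // horizontal pass
        hadamard4_row(d[i]);
    for (int j = 0; j < 4; j++)          // vertical pass
    {
        int c[4] = { d[0][j], d[1][j], d[2][j], d[3][j] };
        hadamard4_row(c);
        d[0][j] = c[0]; d[1][j] = c[1]; d[2][j] = c[2]; d[3][j] = c[3];
    }

    int sum = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            sum += std::abs(d[i][j]);
    return sum >> 1;
}
```

The `vec_max(x, vec_sub(zero, x))` idiom in the vector code is the absolute value, and `vec_sum4s`/`vec_sum2s` perform the horizontal reduction that the scalar double loop does here.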
+
+#define HADAMARD4_x2vec(v_out0, v_out1, v_in0, v_in1, v_perm_l0_0, v_perm_l0_1) \
+{ \
+ \
+ vector unsigned int v_l0_input_0, v_l0_input_1 ; \
+ v_l0_input_0 = vec_perm((vector unsigned int)v_in0, (vector unsigned int)v_in1, v_perm_l0_0) ; \
+ v_l0_input_1 = vec_perm((vector unsigned int)v_in0, (vector unsigned int)v_in1, v_perm_l0_1) ; \
+ \
+ vector unsigned int v_l0_add_result, v_l0_sub_result ; \
+ v_l0_add_result = vec_add(v_l0_input_0, v_l0_input_1) ; \
+ v_l0_sub_result = vec_sub(v_l0_input_0, v_l0_input_1) ; \
+ \
+ vector unsigned char v_perm_l1_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17} ; \
+ vector unsigned char v_perm_l1_1 = {0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F} ; \
+ \
+ vector unsigned int v_l1_input_0, v_l1_input_1 ; \
+ v_l1_input_0 = vec_perm(v_l0_add_result, v_l0_sub_result, v_perm_l1_0) ; \
+ v_l1_input_1 = vec_perm(v_l0_add_result, v_l0_sub_result, v_perm_l1_1) ; \
+ \
+ vector unsigned int v_l1_add_result, v_l1_sub_result ; \
+ v_l1_add_result = vec_add(v_l1_input_0, v_l1_input_1) ; \
+ v_l1_sub_result = vec_sub(v_l1_input_0, v_l1_input_1) ; \
+ \
+ \
+ v_out0 = v_l1_add_result ; \
+ v_out1 = v_l1_sub_result ; \
+\
+\
+}
+
+int satd_4x8_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diff0v, diff1v, diff2v, diff3v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v;
+ vec_s32_t satdv, satdv1, satdv2, satdv3;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ /* Hadamard H */
+ HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+ VEC_TRANSPOSE_4( temp0v, temp1v, temp2v, temp3v, diff0v, diff1v, diff2v, diff3v );
+ /* Hadamard V */
+ HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv += satdv2;
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+#endif
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ /* Hadamard H */
+ HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+ VEC_TRANSPOSE_4( temp0v, temp1v, temp2v, temp3v, diff0v, diff1v, diff2v, diff3v );
+ /* Hadamard V */
+ HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv += vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1 = vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2 = vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3 = vec_sum4s( temp3v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv += satdv2;
+
+ satdv = vec_sum2s( satdv, zero_s32v );
+ sum = vec_extract(satdv, 1);
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, satdv);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ satdv = vec_sum2s( satdv, zero_s32v );
+ satdv = vec_splat( satdv, 1 );
+ vec_ste( satdv, 0, &sum );
+#endif
+ return sum >> 1;
+}
+
+#if 1
+static int satd_8x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ const vector signed short v_unsigned_short_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ;
+ vector unsigned char v_pix1_ub, v_pix2_ub ;
+ vector signed short v_pix1_ss, v_pix2_ss ;
+ vector signed short v_sub ;
+ vector signed int v_sub_sw_0, v_sub_sw_1 ;
+ vector signed int v_packed_sub_0, v_packed_sub_1 ;
+ vector unsigned int v_hadamard_result_0, v_hadamard_result_1, v_hadamard_result_2, v_hadamard_result_3 ;
+
+ // for (int i = 0; i < 4; i+=2, pix1 += 2*stride_pix1, pix2 += 2*stride_pix2)
+ // {
+ //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM);
+ //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM);
+ //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM);
+ //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM);
+
+ // Load 16 elements from each pix array
+ v_pix1_ub = vec_xl(0, pix1) ;
+ v_pix2_ub = vec_xl(0, pix2) ;
+
+ // We only care about the top 8, and in short format
+ v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ;
+ v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ;
+
+ // Undo the sign extend of the unpacks
+ v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ;
+ v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ;
+
+ // Perform the subtraction
+ v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ;
+
+ // Unpack the sub results into ints
+ v_sub_sw_0 = vec_unpackh(v_sub) ;
+ v_sub_sw_1 = vec_unpackl(v_sub) ;
+ v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ;
+
+ // Add the int sub results (compatibility with the original code)
+ v_packed_sub_0 = vec_add(v_sub_sw_0, v_sub_sw_1) ;
+
+ //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM);
+ //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM);
+ //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM);
+ //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM);
+
+ // Load 16 elements from each pix array
+ v_pix1_ub = vec_xl(stride_pix1, pix1) ;
+ v_pix2_ub = vec_xl(stride_pix2, pix2) ;
+
+ // We only care about the top 8, and in short format
+ v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ;
+ v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ;
+
+ // Undo the sign extend of the unpacks
+ v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ;
+ v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ;
+
+ // Perform the subtraction
+ v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ;
+
+ // Unpack the sub results into ints
+ v_sub_sw_0 = vec_unpackh(v_sub) ;
+ v_sub_sw_1 = vec_unpackl(v_sub) ;
+ v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ;
+
+ // Add the int sub results (compatibility with the original code)
+ v_packed_sub_1 = vec_add(v_sub_sw_0, v_sub_sw_1) ;
+
+ // original: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
+ // modified while vectorizing: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], v_packed_sub_0[0], v_packed_sub_0[1], v_packed_sub_0[2], v_packed_sub_0[3]);
+
+ // original: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], a0, a1, a2, a3);
+ // modified while vectorizing: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], v_packed_sub_1[0], v_packed_sub_1[1], v_packed_sub_1[2], v_packed_sub_1[3]);
+
+ // Go after two hadamard4(int) at once, fully utilizing the vector width
+ // Note that the hadamard4(int) provided by x264/x265 is actually two hadamard4(short) simultaneously
+ const vector unsigned char v_perm_l0_0 = {0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x08, 0x09, 0x0A, 0x0B, 0x18, 0x19, 0x1A, 0x1B} ;
+ const vector unsigned char v_perm_l0_1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x0C, 0x0D, 0x0E, 0x0F, 0x1C, 0x1D, 0x1E, 0x1F} ;
+ HADAMARD4_x2vec(v_hadamard_result_0, v_hadamard_result_1, v_packed_sub_0, v_packed_sub_1, v_perm_l0_0, v_perm_l0_1) ;
+
+ //##
+ // tmp[0][0] = v_hadamard_result_0[0] ;
+ // tmp[0][1] = v_hadamard_result_0[2] ;
+ // tmp[0][2] = v_hadamard_result_1[0] ;
+ // tmp[0][3] = v_hadamard_result_1[2] ;
+
+ // tmp[1][0] = v_hadamard_result_0[1] ;
+ // tmp[1][1] = v_hadamard_result_0[3] ;
+ // tmp[1][2] = v_hadamard_result_1[1] ;
+ // tmp[1][3] = v_hadamard_result_1[3] ;
+ //##
+
+ //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM);
+ //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM);
+ //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM);
+ //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM);
+
+ // Load 16 elements from each pix array
+ v_pix1_ub = vec_xl(2*stride_pix1, pix1) ;
+ v_pix2_ub = vec_xl(2*stride_pix2, pix2) ;
+
+ // We only care about the top 8, and in short format
+ v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ;
+ v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ;
+
+ // Undo the sign extend of the unpacks
+ v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ;
+ v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ;
+
+ // Perform the subtraction
+ v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ;
+
+ // Unpack the sub results into ints
+ v_sub_sw_0 = vec_unpackh(v_sub) ;
+ v_sub_sw_1 = vec_unpackl(v_sub) ;
+ v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ;
+
+ // Add the int sub results (compatibility with the original code)
+ v_packed_sub_0 = vec_add(v_sub_sw_0, v_sub_sw_1) ;
+
+ //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM);
+ //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM);
+ //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM);
+ //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM);
+
+ // Load 16 elements from each pix array
+ v_pix1_ub = vec_xl(3*stride_pix1, pix1) ;
+ v_pix2_ub = vec_xl(3*stride_pix2, pix2) ;
+
+ // We only care about the top 8, and in short format
+ v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ;
+ v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ;
+
+ // Undo the sign extend of the unpacks
+ v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ;
+ v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ;
+
+ // Perform the subtraction
+ v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ;
+
+ // Unpack the sub results into ints
+ v_sub_sw_0 = vec_unpackh(v_sub) ;
+ v_sub_sw_1 = vec_unpackl(v_sub) ;
+ v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ;
+
+ // Add the int sub results (compatibility with the original code)
+ v_packed_sub_1 = vec_add(v_sub_sw_0, v_sub_sw_1) ;
+
+
+ // original: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
+ // modified while vectorizing: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], v_packed_sub_0[0], v_packed_sub_0[1], v_packed_sub_0[2], v_packed_sub_0[3]);
+
+ // original: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], a0, a1, a2, a3);
+ // modified while vectorizing: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], v_packed_sub_1[0], v_packed_sub_1[1], v_packed_sub_1[2], v_packed_sub_1[3]);
+
+ // Go after two hadamard4(int) at once, fully utilizing the vector width
+ // Note that the hadamard4(int) provided by x264/x265 is actually two hadamard4(short) simultaneously
+ HADAMARD4_x2vec(v_hadamard_result_2, v_hadamard_result_3, v_packed_sub_0, v_packed_sub_1, v_perm_l0_0, v_perm_l0_1) ;
+
+ //##
+ //## tmp[2][0] = v_hadamard_result_2[0] ;
+ //## tmp[2][1] = v_hadamard_result_2[2] ;
+ //## tmp[2][2] = v_hadamard_result_3[0] ;
+ //## tmp[2][3] = v_hadamard_result_3[2] ;
+ //##
+ //## tmp[3][0] = v_hadamard_result_2[1] ;
+ //## tmp[3][1] = v_hadamard_result_2[3] ;
+ //## tmp[3][2] = v_hadamard_result_3[1] ;
+ //## tmp[3][3] = v_hadamard_result_3[3] ;
+ //##
+ // }
+ // for (int i = 0; i < 4; i++)
+ // {
+ // HADAMARD4(a0, a1, a2, a3, tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]);
+ // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
+
+ // HADAMARD4(a0, a1, a2, a3, tmp[0][1], tmp[1][1], tmp[2][1], tmp[3][1]);
+ // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
+ const vector unsigned char v_lowerloop_perm_l0_0 = {0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0A, 0x0B, 0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1A, 0x1B} ;
+ const vector unsigned char v_lowerloop_perm_l0_1 = {0x04, 0x05, 0x06, 0x07, 0x0C, 0x0D, 0x0E, 0x0F, 0x14, 0x15, 0x16, 0x17, 0x1C, 0x1D, 0x1E, 0x1F} ;
+ HADAMARD4_x2vec(v_hadamard_result_0, v_hadamard_result_2, v_hadamard_result_0, v_hadamard_result_2, v_lowerloop_perm_l0_0, v_lowerloop_perm_l0_1) ;
+
+ const vector unsigned int v_15 = {15, 15, 15, 15} ;
+ const vector unsigned int v_0x10001 = (vector unsigned int){ 0x10001, 0x10001, 0x10001, 0x10001 };
+ const vector unsigned int v_0xffff = (vector unsigned int){ 0xffff, 0xffff, 0xffff, 0xffff };
+
+
+ vector unsigned int v_hadamard_result_s_0 ;
+ v_hadamard_result_s_0 = vec_sra(v_hadamard_result_0, v_15) ;
+ v_hadamard_result_s_0 = vec_and(v_hadamard_result_s_0, v_0x10001) ;
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_hadamard_result_s_0)
+ : "v" (v_hadamard_result_s_0) , "v" (v_0xffff)
+ ) ;
+ v_hadamard_result_0 = vec_add(v_hadamard_result_0, v_hadamard_result_s_0) ;
+ v_hadamard_result_0 = vec_xor(v_hadamard_result_0, v_hadamard_result_s_0) ;
+
+ vector unsigned int v_hadamard_result_s_2 ;
+ v_hadamard_result_s_2 = vec_sra(v_hadamard_result_2, v_15) ;
+ v_hadamard_result_s_2 = vec_and(v_hadamard_result_s_2, v_0x10001) ;
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_hadamard_result_s_2)
+ : "v" (v_hadamard_result_s_2) , "v" (v_0xffff)
+ ) ;
+ v_hadamard_result_2 = vec_add(v_hadamard_result_2, v_hadamard_result_s_2) ;
+ v_hadamard_result_2 = vec_xor(v_hadamard_result_2, v_hadamard_result_s_2) ;
+
+ // HADAMARD4(a0, a1, a2, a3, tmp[0][2], tmp[1][2], tmp[2][2], tmp[3][2]);
+ // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
+
+ // HADAMARD4(a0, a1, a2, a3, tmp[0][3], tmp[1][3], tmp[2][3], tmp[3][3]);
+ // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
+
+ HADAMARD4_x2vec(v_hadamard_result_1, v_hadamard_result_3, v_hadamard_result_1, v_hadamard_result_3, v_lowerloop_perm_l0_0, v_lowerloop_perm_l0_1) ;
+
+ vector unsigned int v_hadamard_result_s_1 ;
+ v_hadamard_result_s_1 = vec_sra(v_hadamard_result_1, v_15) ;
+ v_hadamard_result_s_1 = vec_and(v_hadamard_result_s_1, v_0x10001) ;
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_hadamard_result_s_1)
+ : "v" (v_hadamard_result_s_1) , "v" (v_0xffff)
+ ) ;
+ v_hadamard_result_1 = vec_add(v_hadamard_result_1, v_hadamard_result_s_1) ;
+ v_hadamard_result_1 = vec_xor(v_hadamard_result_1, v_hadamard_result_s_1) ;
+
+ vector unsigned int v_hadamard_result_s_3 ;
+ v_hadamard_result_s_3 = vec_sra(v_hadamard_result_3, v_15) ;
+ v_hadamard_result_s_3 = vec_and(v_hadamard_result_s_3, v_0x10001) ;
+ asm ("vmuluwm %0,%1,%2"
+ : "=v" (v_hadamard_result_s_3)
+ : "v" (v_hadamard_result_s_3) , "v" (v_0xffff)
+ ) ;
+ v_hadamard_result_3 = vec_add(v_hadamard_result_3, v_hadamard_result_s_3) ;
+ v_hadamard_result_3 = vec_xor(v_hadamard_result_3, v_hadamard_result_s_3) ;
+
+// }
+
+
+ vector unsigned int v_sum_0, v_sum_1 ;
+ vector signed int v_sum ;
+
+ v_sum_0 = vec_add(v_hadamard_result_0, v_hadamard_result_2) ;
+ v_sum_1 = vec_add(v_hadamard_result_1, v_hadamard_result_3) ;
+
+ v_sum_0 = vec_add(v_sum_0, v_sum_1) ;
+
+ vector signed int v_zero = {0, 0, 0, 0} ;
+ v_sum = vec_sums((vector signed int)v_sum_0, v_zero) ;
+
+ // return (((sum_t)sum) + (sum >> BITS_PER_SUM)) >> 1;
+ return (((sum_t)v_sum[3]) + (v_sum[3] >> BITS_PER_SUM)) >> 1;
+}
+#else
+int satd_8x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v, temp4v, temp5v, temp6v, temp7v;
+ vec_s32_t satdv;
+
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff4v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff5v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff6v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff7v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ //HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v );
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v );
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+
+ satdv = vec_sums( satdv, zero_s32v );
+ satdv = vec_splat( satdv, 3 );
+ vec_ste( satdv, 0, &sum );
+
+ return sum>>1;
+}
+#endif
+
+int satd_8x8_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v, temp4v, temp5v, temp6v, temp7v;
+ vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff4v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff5v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff6v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff7v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v );
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+
+ satdv = vec_sums( satdv, zero_s32v );
+ sum = vec_extract(satdv, 3);
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v );
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+
+ satdv = vec_sums( satdv, zero_s32v );
+ satdv = vec_splat( satdv, 3 );
+ vec_ste( satdv, 0, &sum );
+#endif
+ return sum>>1;
+}
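For reference, the vector routine above computes the standard 8x8 SATD: a 2-D Hadamard transform of the residual over each 4x4 sub-block, summing absolute transform coefficients and halving at the end (the `sum >> 1`). A minimal scalar sketch that should match its result, assuming the usual 4x4-transform decomposition used by x265's C primitives (function names here are illustrative, not x265 API):

```c
#include <stdlib.h>

/* Scalar SATD of one 4x4 block: horizontal and vertical 4-point
   Hadamard of the residual, sum of absolute coefficients, halved. */
static int satd_4x4_ref(const unsigned char *pix1, int stride1,
                        const unsigned char *pix2, int stride2)
{
    int d[4][4], m[4][4], sum = 0;

    for (int i = 0; i < 4; i++)              /* residual */
        for (int j = 0; j < 4; j++)
            d[i][j] = pix1[i * stride1 + j] - pix2[i * stride2 + j];

    for (int i = 0; i < 4; i++) {            /* horizontal 4-pt Hadamard */
        int s01 = d[i][0] + d[i][1], d01 = d[i][0] - d[i][1];
        int s23 = d[i][2] + d[i][3], d23 = d[i][2] - d[i][3];
        m[i][0] = s01 + s23; m[i][1] = d01 + d23;
        m[i][2] = s01 - s23; m[i][3] = d01 - d23;
    }

    for (int j = 0; j < 4; j++) {            /* vertical pass + |.| sum */
        int s01 = m[0][j] + m[1][j], d01 = m[0][j] - m[1][j];
        int s23 = m[2][j] + m[3][j], d23 = m[2][j] - m[3][j];
        sum += abs(s01 + s23) + abs(d01 + d23)
             + abs(s01 - s23) + abs(d01 - d23);
    }
    return sum >> 1;
}

/* 8x8 SATD as the sum over the four 4x4 sub-blocks. */
static int satd_8x8_ref(const unsigned char *pix1, int stride1,
                        const unsigned char *pix2, int stride2)
{
    int sum = 0;
    for (int r = 0; r < 8; r += 4)
        for (int c = 0; c < 8; c += 4)
            sum += satd_4x4_ref(pix1 + r * stride1 + c, stride1,
                                pix2 + r * stride2 + c, stride2);
    return sum;
}
```

A unit test can compare this reference against the vectorized path on random blocks to catch regressions.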
+
+int satd_8x16_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v, temp4v, temp5v, temp6v, temp7v;
+ vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff4v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff5v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff6v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff7v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v );
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v );
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+#endif
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff4v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff5v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff6v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+ diff7v = vec_sub( pix1v, pix2v );
+ pix1 += stride_pix1;
+ pix2 += stride_pix2;
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v );
+
+ HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv += vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+
+ satdv = vec_sums( satdv, zero_s32v );
+ sum = vec_extract(satdv, 3);
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, satdv);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+
+ satdv = vec_sums( satdv, zero_s32v );
+ satdv = vec_splat( satdv, 3 );
+ vec_ste( satdv, 0, &sum );
+#endif
+ return sum >> 1;
+}
+
+#define VEC_DIFF_S16(p1,i1,p2,i2,dh,dl)\
+{\
+ pix1v = (vec_s16_t)vec_xl(0, p1);\
+ temp0v = vec_u8_to_s16_h( pix1v );\
+ temp1v = vec_u8_to_s16_l( pix1v );\
+ pix2v = (vec_s16_t)vec_xl(0, p2);\
+ temp2v = vec_u8_to_s16_h( pix2v );\
+ temp3v = vec_u8_to_s16_l( pix2v );\
+ dh = vec_sub( temp0v, temp2v );\
+ dl = vec_sub( temp1v, temp3v );\
+ p1 += i1;\
+ p2 += i2;\
+}
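The macro above loads one 16-pixel row from each source, widens the unsigned bytes to signed 16-bit, and emits the differences as two halves: `dh` for the first 8 pixels (the "high" half on big-endian POWER) and `dl` for pixels 8..15, then advances both pointers by their strides. A scalar sketch of the same row step (names illustrative):

```c
/* Scalar equivalent of one VEC_DIFF_S16 step: a 16-pixel row of
   differences split into two 8-lane signed 16-bit halves, with the
   source pointers advanced like the macro does. */
static void diff_row16_ref(const unsigned char **p1, int i1,
                           const unsigned char **p2, int i2,
                           short dh[8], short dl[8])
{
    for (int j = 0; j < 8; j++) {
        dh[j] = (short)((*p1)[j]     - (*p2)[j]);      /* pixels 0..7  */
        dl[j] = (short)((*p1)[8 + j] - (*p2)[8 + j]);  /* pixels 8..15 */
    }
    *p1 += i1;
    *p2 += i2;
}
```

Splitting into halves keeps each Hadamard stage within 8 lanes of 16-bit arithmetic, which is why the 16-wide SATD routines below run the transform once on the `diffh*` set and once on the `diffl*` set.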
+
+
+int satd_16x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+ LOAD_ZERO;
+ vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v;
+ vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v;
+
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v);
+
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diffh0v, diffh1v, diffh2v, diffh3v,
+ diffl0v, diffl1v, diffl2v, diffl3v);
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+
+ satdv = vec_sums( satdv, zero_s32v );
+ sum = vec_extract(satdv, 3);
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+
+ satdv = vec_sums( satdv, zero_s32v );
+ satdv = vec_splat( satdv, 3 );
+ vec_ste( satdv, 0, &sum );
+#endif
+ return sum >> 1;
+}
+
+int satd_16x8_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+ LOAD_ZERO;
+ vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v,
+ diffh4v, diffh5v, diffh6v, diffh7v;
+ vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v,
+ diffl4v, diffl5v, diffl6v, diffl7v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v;
+
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh4v, diffl4v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh5v, diffl5v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh6v, diffl6v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh7v, diffl7v);
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diffh0v, diffh1v, diffh2v, diffh3v,
+ diffh4v, diffh5v, diffh6v, diffh7v );
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+#endif
+
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diffl0v, diffl1v, diffl2v, diffl3v,
+ diffl4v, diffl5v, diffl6v, diffl7v );
+
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv += vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+
+ satdv = vec_sums( satdv, zero_s32v );
+ sum = vec_extract(satdv, 3);
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, satdv);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+
+ satdv = vec_sums( satdv, zero_s32v );
+ satdv = vec_splat( satdv, 3 );
+ vec_ste( satdv, 0, &sum );
+#endif
+ return sum >> 1;
+}
+
+int satd_16x16_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ ALIGN_VAR_16( int, sum );
+ LOAD_ZERO;
+ vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v,
+ diffh4v, diffh5v, diffh6v, diffh7v;
+ vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v,
+ diffl4v, diffl5v, diffl6v, diffl7v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v;
+
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh4v, diffl4v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh5v, diffl5v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh6v, diffl6v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh7v, diffl7v);
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diffh0v, diffh1v, diffh2v, diffh3v,
+ diffh4v, diffh5v, diffh6v, diffh7v );
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+#endif
+
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diffl0v, diffl1v, diffl2v, diffl3v,
+ diffl4v, diffl5v, diffl6v, diffl7v );
+
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv += vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, satdv);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+#endif
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh4v, diffl4v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh5v, diffl5v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh6v, diffl6v);
+ VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh7v, diffl7v);
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diffh0v, diffh1v, diffh2v, diffh3v,
+ diffh4v, diffh5v, diffh6v, diffh7v );
+
+ HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv += vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, satdv);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+#endif
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v );
+
+ VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v,
+ temp4v, temp5v, temp6v, temp7v,
+ diffl0v, diffl1v, diffl2v, diffl3v,
+ diffl4v, diffl5v, diffl6v, diffl7v );
+
+ HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v );
+ HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v );
+
+#if 1
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv += vec_sum4s( temp0v, zero_s32v);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv1= vec_sum4s( temp1v, zero_s32v );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv2= vec_sum4s( temp2v, zero_s32v );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv3= vec_sum4s( temp3v, zero_s32v );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv4 = vec_sum4s( temp4v, zero_s32v);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv5= vec_sum4s( temp5v, zero_s32v );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv6= vec_sum4s( temp6v, zero_s32v );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv7= vec_sum4s( temp7v, zero_s32v );
+
+ satdv += satdv1;
+ satdv2 += satdv3;
+ satdv4 += satdv5;
+ satdv6 += satdv7;
+
+ satdv += satdv2;
+ satdv4 += satdv6;
+ satdv += satdv4;
+
+ satdv = vec_sums( satdv, zero_s32v );
+ sum = vec_extract(satdv, 3);
+#else
+ temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
+ satdv = vec_sum4s( temp0v, satdv);
+
+ temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) );
+ satdv= vec_sum4s( temp1v, satdv );
+
+ temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) );
+ satdv= vec_sum4s( temp2v, satdv );
+
+ temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) );
+ satdv= vec_sum4s( temp3v, satdv );
+
+ temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) );
+ satdv = vec_sum4s( temp4v, satdv);
+
+ temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) );
+ satdv= vec_sum4s( temp5v, satdv );
+
+ temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) );
+ satdv= vec_sum4s( temp6v, satdv );
+
+ temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) );
+ satdv= vec_sum4s( temp7v, satdv );
+
+ satdv = vec_sums( satdv, zero_s32v );
+ satdv = vec_splat( satdv, 3 );
+ vec_ste( satdv, 0, &sum );
+#endif
+ return sum >> 1;
+}
+
+
+/* generic SATD entry point: declared only; every supported block size is
+   provided through the explicit specializations below */
+template<int w, int h>
+int satd_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+
+template<>
+int satd_altivec<4, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_4x4_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<4, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<4, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_4x4_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+4*stride_pix1, stride_pix1, pix2+4*stride_pix2, stride_pix2);
+
+ return satd;
+}
+
+template<>
+int satd_altivec<4, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2);
+
+ return satd;
+}
+
+template<>
+int satd_altivec<4, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2);
+
+ return satd;
+}
+
+template<>
+int satd_altivec<4, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+24*stride_pix1, stride_pix1, pix2+24*stride_pix2, stride_pix2);
+
+ return satd;
+}
+
+template<>
+int satd_altivec<4, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_altivec<4, 32>(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<4, 32>(pix1+32*stride_pix1, stride_pix1, pix2+32*stride_pix2, stride_pix2);
+
+ return satd;
+}
+
+template<>
+int satd_altivec<8, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_8x4_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<8, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<8, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x4_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<8,16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<8,24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<8,32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<8,64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+32*stride_pix1, stride_pix1, pix2+32*stride_pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+48*stride_pix1, stride_pix1, pix2+48*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<12, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_8x4_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x4_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<12, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<12, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 8*stride_pix1;
+ const pixel *pix4 = pix2 + 8*stride_pix2;
+    satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+           + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2)
+           + satd_8x4_altivec(pix3, stride_pix1, pix4, stride_pix2)
+           + satd_4x4_altivec(pix3+8, stride_pix1, pix4+8, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<12, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 8*stride_pix1;
+ const pixel *pix4 = pix2 + 8*stride_pix2;
+ satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2)
+ + satd_8x8_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_4x8_altivec(pix3+8, stride_pix1, pix4+8, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<12, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 8*stride_pix1;
+ const pixel *pix4 = pix2 + 8*stride_pix2;
+ satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2)
+ + satd_8x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_altivec<4, 16>(pix3+8, stride_pix1, pix4+8, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<12, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<4, 16>(pix1+8, stride_pix1, pix2+8, stride_pix2)
+ + satd_8x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_altivec<4, 16>(pix3+8, stride_pix1, pix4+8, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<12, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ const pixel *pix5 = pix1 + 32*stride_pix1;
+ const pixel *pix6 = pix2 + 32*stride_pix2;
+ const pixel *pix7 = pix1 + 48*stride_pix1;
+ const pixel *pix8 = pix2 + 48*stride_pix2;
+ satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<4, 16>(pix1+8, stride_pix1, pix2+8, stride_pix2)
+ + satd_8x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_altivec<4, 16>(pix3+8, stride_pix1, pix4+8, stride_pix2)
+ + satd_8x16_altivec(pix5, stride_pix1, pix6, stride_pix2)
+ + satd_altivec<4, 16>(pix5+8, stride_pix1, pix6+8, stride_pix2)
+ + satd_8x16_altivec(pix7, stride_pix1, pix8, stride_pix2)
+ + satd_altivec<4, 16>(pix7+8, stride_pix1, pix8+8, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<16, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<16, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<16, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x8_altivec(pix1+4*stride_pix1, stride_pix1, pix2+4*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<16, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ return satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2);
+}
+
+template<>
+int satd_altivec<16, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x8_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<16, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<16, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+32*stride_pix1, stride_pix1, pix2+32*stride_pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+48*stride_pix1, stride_pix1, pix2+48*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<24, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x4_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<24, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<24, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 8*stride_pix1;
+ const pixel *pix4 = pix2 + 8*stride_pix2;
+ satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x4_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_8x4_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<24, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<24, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_altivec<24, 16>(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<24, 8>(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<24, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_8x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<24, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ const pixel *pix5 = pix1 + 32*stride_pix1;
+ const pixel *pix6 = pix2 + 32*stride_pix2;
+ const pixel *pix7 = pix1 + 48*stride_pix1;
+ const pixel *pix8 = pix2 + 48*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_8x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_8x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+ + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2)
+ + satd_8x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2)
+ + satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2)
+ + satd_8x16_altivec(pix7+16, stride_pix1, pix8+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x4_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x8_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 8*stride_pix1;
+ const pixel *pix4 = pix2 + 8*stride_pix2;
+ satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x8_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2)
+ + satd_16x4_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x4_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2)
+ + satd_16x8_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x8_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x16_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 48>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ const pixel *pix5 = pix1 + 32*stride_pix1;
+ const pixel *pix6 = pix2 + 32*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x16_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2)
+ + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2)
+ + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<32, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ const pixel *pix5 = pix1 + 32*stride_pix1;
+ const pixel *pix6 = pix2 + 32*stride_pix2;
+ const pixel *pix7 = pix1 + 48*stride_pix1;
+ const pixel *pix8 = pix2 + 48*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+ + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2)
+ + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2)
+ + satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2)
+ + satd_16x16_altivec(pix7+16, stride_pix1, pix8+16, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<48, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x4_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x4_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<48, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x8_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<48, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 8*stride_pix1;
+ const pixel *pix4 = pix2 + 8*stride_pix2;
+    satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+           + satd_16x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+           + satd_16x8_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+           + satd_16x4_altivec(pix3, stride_pix1, pix4, stride_pix2)
+           + satd_16x4_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+           + satd_16x4_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<48, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<48, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 8*stride_pix1;
+ const pixel *pix4 = pix2 + 8*stride_pix2;
+    satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2)
+           + satd_16x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+           + satd_16x8_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+           + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+           + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+           + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<48, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+    satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+           + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+           + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+           + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+           + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+           + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<48, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ const pixel *pix5 = pix1 + 32*stride_pix1;
+ const pixel *pix6 = pix2 + 32*stride_pix2;
+ const pixel *pix7 = pix1 + 48*stride_pix1;
+ const pixel *pix8 = pix2 + 48*stride_pix2;
+    satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+           + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+           + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+           + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+           + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+           + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2)
+           + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2)
+           + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2)
+           + satd_16x16_altivec(pix5+32, stride_pix1, pix6+32, stride_pix2)
+           + satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2)
+           + satd_16x16_altivec(pix7+16, stride_pix1, pix8+16, stride_pix2)
+           + satd_16x16_altivec(pix7+32, stride_pix1, pix8+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_altivec<32, 4>(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<32, 4>(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_altivec<32, 8>(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<32, 8>(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_altivec<32, 12>(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<32, 12>(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ satd = satd_altivec<32, 24>(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_altivec<32, 24>(pix1+32, stride_pix1, pix2+32, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+ + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2)
+ + satd_16x16_altivec(pix3+48, stride_pix1, pix4+48, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 48>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ const pixel *pix5 = pix1 + 32*stride_pix1;
+ const pixel *pix6 = pix2 + 32*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+ + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2)
+ + satd_16x16_altivec(pix3+48, stride_pix1, pix4+48, stride_pix2)
+ + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2)
+ + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2)
+ + satd_16x16_altivec(pix5+32, stride_pix1, pix6+32, stride_pix2)
+ + satd_16x16_altivec(pix5+48, stride_pix1, pix6+48, stride_pix2);
+ return satd;
+}
+
+template<>
+int satd_altivec<64, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+ int satd = 0;
+ const pixel *pix3 = pix1 + 16*stride_pix1;
+ const pixel *pix4 = pix2 + 16*stride_pix2;
+ const pixel *pix5 = pix1 + 32*stride_pix1;
+ const pixel *pix6 = pix2 + 32*stride_pix2;
+ const pixel *pix7 = pix1 + 48*stride_pix1;
+ const pixel *pix8 = pix2 + 48*stride_pix2;
+ satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2)
+ + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2)
+ + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2)
+ + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2)
+ + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2)
+ + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2)
+ + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2)
+ + satd_16x16_altivec(pix3+48, stride_pix1, pix4+48, stride_pix2)
+ + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2)
+ + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2)
+ + satd_16x16_altivec(pix5+32, stride_pix1, pix6+32, stride_pix2)
+ + satd_16x16_altivec(pix5+48, stride_pix1, pix6+48, stride_pix2)
+ + satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2)
+ + satd_16x16_altivec(pix7+16, stride_pix1, pix8+16, stride_pix2)
+ + satd_16x16_altivec(pix7+32, stride_pix1, pix8+32, stride_pix2)
+ + satd_16x16_altivec(pix7+48, stride_pix1, pix8+48, stride_pix2);
+ return satd;
+}
+
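The long list of explicit specializations above is hand-unrolled tiling: each large block is covered by fixed-size kernel calls stepped 16 pixels across and 16 rows down. As a hedged illustration of that pattern (the names `satd_tiled` and `k16x16` are mine, not part of this patch), the same arithmetic can be sketched generically:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model of the tiling performed by the specializations above:
// a W x H SATD is the sum of a 16x16 kernel applied at every 16-aligned
// tile origin. The upstream code unrolls this by hand per block size.
template <int W, int H, typename Kernel>
int satd_tiled(const uint8_t* pix1, std::intptr_t stride1,
               const uint8_t* pix2, std::intptr_t stride2, Kernel k16x16)
{
    static_assert(W % 16 == 0 && H % 16 == 0, "model covers 16-aligned sizes");
    int sum = 0;
    for (int y = 0; y < H; y += 16)
        for (int x = 0; x < W; x += 16)
            sum += k16x16(pix1 + y * stride1 + x, stride1,
                          pix2 + y * stride2 + x, stride2);
    return sum;
}
```

For example, `satd_tiled<64, 64>` makes the same sixteen 16x16 kernel calls, at the same offsets, as the `satd_altivec<64, 64>` specialization above.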
+
+/***********************************************************************
+ * SA8D routines - altivec implementation
+ **********************************************************************/
+#define SA8D_1D_ALTIVEC( sa8d0v, sa8d1v, sa8d2v, sa8d3v, \
+ sa8d4v, sa8d5v, sa8d6v, sa8d7v ) \
+{ \
+ /* int a0 = SRC(0) + SRC(4) */ \
+ vec_s16_t a0v = vec_add(sa8d0v, sa8d4v); \
+ /* int a4 = SRC(0) - SRC(4) */ \
+ vec_s16_t a4v = vec_sub(sa8d0v, sa8d4v); \
+ /* int a1 = SRC(1) + SRC(5) */ \
+ vec_s16_t a1v = vec_add(sa8d1v, sa8d5v); \
+ /* int a5 = SRC(1) - SRC(5) */ \
+ vec_s16_t a5v = vec_sub(sa8d1v, sa8d5v); \
+ /* int a2 = SRC(2) + SRC(6) */ \
+ vec_s16_t a2v = vec_add(sa8d2v, sa8d6v); \
+ /* int a6 = SRC(2) - SRC(6) */ \
+ vec_s16_t a6v = vec_sub(sa8d2v, sa8d6v); \
+ /* int a3 = SRC(3) + SRC(7) */ \
+ vec_s16_t a3v = vec_add(sa8d3v, sa8d7v); \
+ /* int a7 = SRC(3) - SRC(7) */ \
+ vec_s16_t a7v = vec_sub(sa8d3v, sa8d7v); \
+ \
+ /* int b0 = a0 + a2 */ \
+ vec_s16_t b0v = vec_add(a0v, a2v); \
+ /* int b2 = a0 - a2; */ \
+ vec_s16_t b2v = vec_sub(a0v, a2v); \
+ /* int b1 = a1 + a3; */ \
+ vec_s16_t b1v = vec_add(a1v, a3v); \
+ /* int b3 = a1 - a3; */ \
+ vec_s16_t b3v = vec_sub(a1v, a3v); \
+ /* int b4 = a4 + a6; */ \
+ vec_s16_t b4v = vec_add(a4v, a6v); \
+ /* int b6 = a4 - a6; */ \
+ vec_s16_t b6v = vec_sub(a4v, a6v); \
+ /* int b5 = a5 + a7; */ \
+ vec_s16_t b5v = vec_add(a5v, a7v); \
+ /* int b7 = a5 - a7; */ \
+ vec_s16_t b7v = vec_sub(a5v, a7v); \
+ \
+ /* DST(0, b0 + b1) */ \
+ sa8d0v = vec_add(b0v, b1v); \
+ /* DST(1, b0 - b1) */ \
+ sa8d1v = vec_sub(b0v, b1v); \
+ /* DST(2, b2 + b3) */ \
+ sa8d2v = vec_add(b2v, b3v); \
+ /* DST(3, b2 - b3) */ \
+ sa8d3v = vec_sub(b2v, b3v); \
+ /* DST(4, b4 + b5) */ \
+ sa8d4v = vec_add(b4v, b5v); \
+ /* DST(5, b4 - b5) */ \
+ sa8d5v = vec_sub(b4v, b5v); \
+ /* DST(6, b6 + b7) */ \
+ sa8d6v = vec_add(b6v, b7v); \
+ /* DST(7, b6 - b7) */ \
+ sa8d7v = vec_sub(b6v, b7v); \
+}
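The eight-wide butterfly that `SA8D_1D_ALTIVEC` performs on vector registers can be modelled in scalar C++. This sketch is illustrative only (the name `sa8d_1d_scalar` is not part of the patch); it makes the unnormalized 8-point Hadamard structure of the macro easy to verify:

```cpp
#include <array>

// Scalar model of one SA8D_1D_ALTIVEC pass: an unnormalized 8-point
// Hadamard butterfly applied in place. The AltiVec macro performs the
// same a/b/DST arithmetic on eight vec_s16_t registers at once, i.e.
// on eight rows (or, after VEC_TRANSPOSE_8, eight columns).
static void sa8d_1d_scalar(std::array<int, 8>& s)
{
    const int a0 = s[0] + s[4], a4 = s[0] - s[4];
    const int a1 = s[1] + s[5], a5 = s[1] - s[5];
    const int a2 = s[2] + s[6], a6 = s[2] - s[6];
    const int a3 = s[3] + s[7], a7 = s[3] - s[7];

    const int b0 = a0 + a2, b2 = a0 - a2;
    const int b1 = a1 + a3, b3 = a1 - a3;
    const int b4 = a4 + a6, b6 = a4 - a6;
    const int b5 = a5 + a7, b7 = a5 - a7;

    s = { b0 + b1, b0 - b1, b2 + b3, b2 - b3,
          b4 + b5, b4 - b5, b6 + b7, b6 - b7 };
}
```

A constant input concentrates all energy in coefficient 0, and a unit impulse spreads equally to all eight outputs, the expected Hadamard behaviour.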
+
+inline int sa8d_8x8_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sum);
+
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v;
+ vec_s16_t sa8d0v, sa8d1v, sa8d2v, sa8d3v, sa8d4v, sa8d5v, sa8d6v, sa8d7v;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff4v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff5v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff6v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff7v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+
+ SA8D_1D_ALTIVEC(diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v);
+ VEC_TRANSPOSE_8(diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v,
+ sa8d0v, sa8d1v, sa8d2v, sa8d3v,
+ sa8d4v, sa8d5v, sa8d6v, sa8d7v );
+ SA8D_1D_ALTIVEC(sa8d0v, sa8d1v, sa8d2v, sa8d3v,
+ sa8d4v, sa8d5v, sa8d6v, sa8d7v );
+
+    /* accumulate the absolute values of all elements of the resulting block */
+ vec_s16_t abs0v = vec_max( sa8d0v, vec_sub( zero_s16v, sa8d0v ) );
+ vec_s16_t abs1v = vec_max( sa8d1v, vec_sub( zero_s16v, sa8d1v ) );
+ vec_s16_t sum01v = vec_add(abs0v, abs1v);
+
+ vec_s16_t abs2v = vec_max( sa8d2v, vec_sub( zero_s16v, sa8d2v ) );
+ vec_s16_t abs3v = vec_max( sa8d3v, vec_sub( zero_s16v, sa8d3v ) );
+ vec_s16_t sum23v = vec_add(abs2v, abs3v);
+
+ vec_s16_t abs4v = vec_max( sa8d4v, vec_sub( zero_s16v, sa8d4v ) );
+ vec_s16_t abs5v = vec_max( sa8d5v, vec_sub( zero_s16v, sa8d5v ) );
+ vec_s16_t sum45v = vec_add(abs4v, abs5v);
+
+ vec_s16_t abs6v = vec_max( sa8d6v, vec_sub( zero_s16v, sa8d6v ) );
+ vec_s16_t abs7v = vec_max( sa8d7v, vec_sub( zero_s16v, sa8d7v ) );
+ vec_s16_t sum67v = vec_add(abs6v, abs7v);
+
+ vec_s16_t sum0123v = vec_add(sum01v, sum23v);
+ vec_s16_t sum4567v = vec_add(sum45v, sum67v);
+
+ vec_s32_t sumblocv;
+
+    sumblocv = vec_sum4s(sum0123v, (vec_s32_t)zerov );
+    sumblocv = vec_sum4s(sum4567v, sumblocv );
+    sumblocv = vec_sums(sumblocv, (vec_s32_t)zerov );
+    sumblocv = vec_splat(sumblocv, 3);
+    vec_ste(sumblocv, 0, &sum);
+
+ return (sum + 2) >> 2;
+}
+
+
+// Stub for the int16_t (ss) overload: not yet implemented, always returns 0.
+int sa8d_8x8_altivec(const int16_t* pix1, intptr_t i_pix1)
+{
+ int sum = 0;
+ return ((sum + 2) >> 2);
+}
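The int16_t overload above is a stub that always returns 0. For orientation, a plain scalar model of what an 8x8 SA8D computes — an 8-point Hadamard transform applied to rows then columns, followed by a sum of absolute values with the same (sum + 2) >> 2 normalization — might look like the sketch below. `sa8d_8x8_ref` is a hypothetical helper for illustration, not the upstream implementation:

```cpp
#include <cstdint>
#include <cstdlib>

// Iterative 8-point Hadamard butterfly (unnormalized).
static void hadamard8(int32_t v[8])
{
    for (int step = 1; step < 8; step <<= 1)
        for (int i = 0; i < 8; i += step << 1)
            for (int j = i; j < i + step; j++)
            {
                int32_t a = v[j], b = v[j + step];
                v[j] = a + b;
                v[j + step] = a - b;
            }
}

// Scalar sketch of an 8x8 SA8D on int16 residuals (illustration only).
static int sa8d_8x8_ref(const int16_t* pix, intptr_t stride)
{
    int32_t m[8][8];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            m[i][j] = pix[i * stride + j];

    for (int i = 0; i < 8; i++)
        hadamard8(m[i]);                    // transform rows

    for (int j = 0; j < 8; j++)             // transform columns
    {
        int32_t col[8];
        for (int i = 0; i < 8; i++) col[i] = m[i][j];
        hadamard8(col);
        for (int i = 0; i < 8; i++) m[i][j] = col[i];
    }

    int sum = 0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            sum += std::abs(m[i][j]);
    return (sum + 2) >> 2;                  // same rounding as the vector code
}
```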
+
+inline int sa8d_8x16_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sum);
+ ALIGN_VAR_16(int, sum1);
+
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v;
+ vec_s16_t sa8d0v, sa8d1v, sa8d2v, sa8d3v, sa8d4v, sa8d5v, sa8d6v, sa8d7v;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff4v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff5v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff6v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff7v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+
+ SA8D_1D_ALTIVEC(diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v);
+ VEC_TRANSPOSE_8(diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v,
+ sa8d0v, sa8d1v, sa8d2v, sa8d3v,
+ sa8d4v, sa8d5v, sa8d6v, sa8d7v );
+ SA8D_1D_ALTIVEC(sa8d0v, sa8d1v, sa8d2v, sa8d3v,
+ sa8d4v, sa8d5v, sa8d6v, sa8d7v );
+
+ /* accumulation of the absolute values of all elements of the resulting block */
+ vec_s16_t abs0v = vec_max( sa8d0v, vec_sub( zero_s16v, sa8d0v ) );
+ vec_s16_t abs1v = vec_max( sa8d1v, vec_sub( zero_s16v, sa8d1v ) );
+ vec_s16_t sum01v = vec_add(abs0v, abs1v);
+
+ vec_s16_t abs2v = vec_max( sa8d2v, vec_sub( zero_s16v, sa8d2v ) );
+ vec_s16_t abs3v = vec_max( sa8d3v, vec_sub( zero_s16v, sa8d3v ) );
+ vec_s16_t sum23v = vec_add(abs2v, abs3v);
+
+ vec_s16_t abs4v = vec_max( sa8d4v, vec_sub( zero_s16v, sa8d4v ) );
+ vec_s16_t abs5v = vec_max( sa8d5v, vec_sub( zero_s16v, sa8d5v ) );
+ vec_s16_t sum45v = vec_add(abs4v, abs5v);
+
+ vec_s16_t abs6v = vec_max( sa8d6v, vec_sub( zero_s16v, sa8d6v ) );
+ vec_s16_t abs7v = vec_max( sa8d7v, vec_sub( zero_s16v, sa8d7v ) );
+ vec_s16_t sum67v = vec_add(abs6v, abs7v);
+
+ vec_s16_t sum0123v = vec_add(sum01v, sum23v);
+ vec_s16_t sum4567v = vec_add(sum45v, sum67v);
+
+ vec_s32_t sumblocv, sumblocv1;
+
+ sumblocv = vec_sum4s(sum0123v, (vec_s32_t)zerov );
+ sumblocv = vec_sum4s(sum4567v, sumblocv );
+ sumblocv = vec_sums(sumblocv, (vec_s32_t)zerov );
+ sumblocv = vec_splat(sumblocv, 3);
+ vec_ste(sumblocv, 0, &sum);
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff0v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff1v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff2v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff3v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff4v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff5v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff6v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+ pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+ pix2v = vec_u8_to_s16( vec_xl(0, pix2) );
+ diff7v = vec_sub( pix1v, pix2v );
+ pix1 += i_pix1;
+ pix2 += i_pix2;
+
+
+ SA8D_1D_ALTIVEC(diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v);
+ VEC_TRANSPOSE_8(diff0v, diff1v, diff2v, diff3v,
+ diff4v, diff5v, diff6v, diff7v,
+ sa8d0v, sa8d1v, sa8d2v, sa8d3v,
+ sa8d4v, sa8d5v, sa8d6v, sa8d7v );
+ SA8D_1D_ALTIVEC(sa8d0v, sa8d1v, sa8d2v, sa8d3v,
+ sa8d4v, sa8d5v, sa8d6v, sa8d7v );
+
+ /* accumulation of the absolute values of all elements of the resulting block */
+ abs0v = vec_max( sa8d0v, vec_sub( zero_s16v, sa8d0v ) );
+ abs1v = vec_max( sa8d1v, vec_sub( zero_s16v, sa8d1v ) );
+ sum01v = vec_add(abs0v, abs1v);
+
+ abs2v = vec_max( sa8d2v, vec_sub( zero_s16v, sa8d2v ) );
+ abs3v = vec_max( sa8d3v, vec_sub( zero_s16v, sa8d3v ) );
+ sum23v = vec_add(abs2v, abs3v);
+
+ abs4v = vec_max( sa8d4v, vec_sub( zero_s16v, sa8d4v ) );
+ abs5v = vec_max( sa8d5v, vec_sub( zero_s16v, sa8d5v ) );
+ sum45v = vec_add(abs4v, abs5v);
+
+ abs6v = vec_max( sa8d6v, vec_sub( zero_s16v, sa8d6v ) );
+ abs7v = vec_max( sa8d7v, vec_sub( zero_s16v, sa8d7v ) );
+ sum67v = vec_add(abs6v, abs7v);
+
+ sum0123v = vec_add(sum01v, sum23v);
+ sum4567v = vec_add(sum45v, sum67v);
+
+ sumblocv1 = vec_sum4s(sum0123v, (vec_s32_t)zerov );
+ sumblocv1 = vec_sum4s(sum4567v, sumblocv1 );
+ sumblocv1 = vec_sums(sumblocv1, (vec_s32_t)zerov );
+ sumblocv1 = vec_splat(sumblocv1, 3);
+ vec_ste(sumblocv1, 0, &sum1);
+
+ sum = (sum + 2) >> 2;
+ sum1 = (sum1 + 2) >> 2;
+ sum += sum1;
+ return (sum);
+}
+
+inline int sa8d_16x8_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sumh);
+ ALIGN_VAR_16(int, suml);
+
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v,
+ diffh4v, diffh5v, diffh6v, diffh7v;
+ vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v,
+ diffl4v, diffl5v, diffl6v, diffl7v;
+ vec_s16_t sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v;
+ vec_s16_t sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v;
+
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh0v,diffl0v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh1v, diffl1v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh2v, diffl2v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh3v, diffl3v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh4v, diffl4v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh5v, diffl5v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh6v, diffl6v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh7v, diffl7v);
+
+ SA8D_1D_ALTIVEC(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v);
+ VEC_TRANSPOSE_8(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v,
+ sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v );
+ SA8D_1D_ALTIVEC(sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v);
+
+ SA8D_1D_ALTIVEC(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v);
+ VEC_TRANSPOSE_8(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v,
+ sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v );
+ SA8D_1D_ALTIVEC(sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v);
+
+ /* accumulation of the absolute values of all elements of the resulting block */
+ sa8dh0v = vec_max( sa8dh0v, vec_sub( zero_s16v, sa8dh0v ) );
+ sa8dh1v = vec_max( sa8dh1v, vec_sub( zero_s16v, sa8dh1v ) );
+ vec_s16_t sumh01v = vec_add(sa8dh0v, sa8dh1v);
+
+ sa8dh2v = vec_max( sa8dh2v, vec_sub( zero_s16v, sa8dh2v ) );
+ sa8dh3v = vec_max( sa8dh3v, vec_sub( zero_s16v, sa8dh3v ) );
+ vec_s16_t sumh23v = vec_add(sa8dh2v, sa8dh3v);
+
+ sa8dh4v = vec_max( sa8dh4v, vec_sub( zero_s16v, sa8dh4v ) );
+ sa8dh5v = vec_max( sa8dh5v, vec_sub( zero_s16v, sa8dh5v ) );
+ vec_s16_t sumh45v = vec_add(sa8dh4v, sa8dh5v);
+
+ sa8dh6v = vec_max( sa8dh6v, vec_sub( zero_s16v, sa8dh6v ) );
+ sa8dh7v = vec_max( sa8dh7v, vec_sub( zero_s16v, sa8dh7v ) );
+ vec_s16_t sumh67v = vec_add(sa8dh6v, sa8dh7v);
+
+ vec_s16_t sumh0123v = vec_add(sumh01v, sumh23v);
+ vec_s16_t sumh4567v = vec_add(sumh45v, sumh67v);
+
+ vec_s32_t sumblocv_h;
+
+ sumblocv_h = vec_sum4s(sumh0123v, (vec_s32_t)zerov );
+ sumblocv_h = vec_sum4s(sumh4567v, sumblocv_h );
+ sumblocv_h = vec_sums(sumblocv_h, (vec_s32_t)zerov );
+ sumblocv_h = vec_splat(sumblocv_h, 3);
+ vec_ste(sumblocv_h, 0, &sumh);
+
+ sa8dl0v = vec_max( sa8dl0v, vec_sub( zero_s16v, sa8dl0v ) );
+ sa8dl1v = vec_max( sa8dl1v, vec_sub( zero_s16v, sa8dl1v ) );
+ vec_s16_t suml01v = vec_add(sa8dl0v, sa8dl1v);
+
+ sa8dl2v = vec_max( sa8dl2v, vec_sub( zero_s16v, sa8dl2v ) );
+ sa8dl3v = vec_max( sa8dl3v, vec_sub( zero_s16v, sa8dl3v ) );
+ vec_s16_t suml23v = vec_add(sa8dl2v, sa8dl3v);
+
+ sa8dl4v = vec_max( sa8dl4v, vec_sub( zero_s16v, sa8dl4v ) );
+ sa8dl5v = vec_max( sa8dl5v, vec_sub( zero_s16v, sa8dl5v ) );
+ vec_s16_t suml45v = vec_add(sa8dl4v, sa8dl5v);
+
+ sa8dl6v = vec_max( sa8dl6v, vec_sub( zero_s16v, sa8dl6v ) );
+ sa8dl7v = vec_max( sa8dl7v, vec_sub( zero_s16v, sa8dl7v ) );
+ vec_s16_t suml67v = vec_add(sa8dl6v, sa8dl7v);
+
+ vec_s16_t suml0123v = vec_add(suml01v, suml23v);
+ vec_s16_t suml4567v = vec_add(suml45v, suml67v);
+
+ vec_s32_t sumblocv_l;
+
+ sumblocv_l = vec_sum4s(suml0123v, (vec_s32_t)zerov );
+ sumblocv_l = vec_sum4s(suml4567v, sumblocv_l );
+ sumblocv_l = vec_sums(sumblocv_l, (vec_s32_t)zerov );
+ sumblocv_l = vec_splat(sumblocv_l, 3);
+ vec_ste(sumblocv_l, 0, &suml);
+
+ sumh = (sumh + 2) >> 2;
+ suml = (suml + 2) >> 2;
+ return (sumh + suml);
+}
+
+inline int sa8d_16x16_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sumh0);
+ ALIGN_VAR_16(int, suml0);
+
+ ALIGN_VAR_16(int, sumh1);
+ ALIGN_VAR_16(int, suml1);
+
+ ALIGN_VAR_16(int, sum);
+
+ LOAD_ZERO;
+ vec_s16_t pix1v, pix2v;
+ vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v,
+ diffh4v, diffh5v, diffh6v, diffh7v;
+ vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v,
+ diffl4v, diffl5v, diffl6v, diffl7v;
+ vec_s16_t sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v;
+ vec_s16_t sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v;
+ vec_s16_t temp0v, temp1v, temp2v, temp3v;
+
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh0v,diffl0v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh1v, diffl1v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh2v, diffl2v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh3v, diffl3v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh4v, diffl4v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh5v, diffl5v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh6v, diffl6v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh7v, diffl7v);
+
+ SA8D_1D_ALTIVEC(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v);
+ VEC_TRANSPOSE_8(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v,
+ sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v );
+ SA8D_1D_ALTIVEC(sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v);
+
+ SA8D_1D_ALTIVEC(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v);
+ VEC_TRANSPOSE_8(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v,
+ sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v );
+ SA8D_1D_ALTIVEC(sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v);
+
+ /* accumulation of the absolute values of all elements of the resulting block */
+ sa8dh0v = vec_max( sa8dh0v, vec_sub( zero_s16v, sa8dh0v ) );
+ sa8dh1v = vec_max( sa8dh1v, vec_sub( zero_s16v, sa8dh1v ) );
+ vec_s16_t sumh01v = vec_add(sa8dh0v, sa8dh1v);
+
+ sa8dh2v = vec_max( sa8dh2v, vec_sub( zero_s16v, sa8dh2v ) );
+ sa8dh3v = vec_max( sa8dh3v, vec_sub( zero_s16v, sa8dh3v ) );
+ vec_s16_t sumh23v = vec_add(sa8dh2v, sa8dh3v);
+
+ sa8dh4v = vec_max( sa8dh4v, vec_sub( zero_s16v, sa8dh4v ) );
+ sa8dh5v = vec_max( sa8dh5v, vec_sub( zero_s16v, sa8dh5v ) );
+ vec_s16_t sumh45v = vec_add(sa8dh4v, sa8dh5v);
+
+ sa8dh6v = vec_max( sa8dh6v, vec_sub( zero_s16v, sa8dh6v ) );
+ sa8dh7v = vec_max( sa8dh7v, vec_sub( zero_s16v, sa8dh7v ) );
+ vec_s16_t sumh67v = vec_add(sa8dh6v, sa8dh7v);
+
+ vec_s16_t sumh0123v = vec_add(sumh01v, sumh23v);
+ vec_s16_t sumh4567v = vec_add(sumh45v, sumh67v);
+
+ vec_s32_t sumblocv_h0;
+
+ sumblocv_h0 = vec_sum4s(sumh0123v, (vec_s32_t)zerov );
+ sumblocv_h0 = vec_sum4s(sumh4567v, sumblocv_h0 );
+ sumblocv_h0 = vec_sums(sumblocv_h0, (vec_s32_t)zerov );
+ sumblocv_h0 = vec_splat(sumblocv_h0, 3);
+ vec_ste(sumblocv_h0, 0, &sumh0);
+
+ sa8dl0v = vec_max( sa8dl0v, vec_sub( zero_s16v, sa8dl0v ) );
+ sa8dl1v = vec_max( sa8dl1v, vec_sub( zero_s16v, sa8dl1v ) );
+ vec_s16_t suml01v = vec_add(sa8dl0v, sa8dl1v);
+
+ sa8dl2v = vec_max( sa8dl2v, vec_sub( zero_s16v, sa8dl2v ) );
+ sa8dl3v = vec_max( sa8dl3v, vec_sub( zero_s16v, sa8dl3v ) );
+ vec_s16_t suml23v = vec_add(sa8dl2v, sa8dl3v);
+
+ sa8dl4v = vec_max( sa8dl4v, vec_sub( zero_s16v, sa8dl4v ) );
+ sa8dl5v = vec_max( sa8dl5v, vec_sub( zero_s16v, sa8dl5v ) );
+ vec_s16_t suml45v = vec_add(sa8dl4v, sa8dl5v);
+
+ sa8dl6v = vec_max( sa8dl6v, vec_sub( zero_s16v, sa8dl6v ) );
+ sa8dl7v = vec_max( sa8dl7v, vec_sub( zero_s16v, sa8dl7v ) );
+ vec_s16_t suml67v = vec_add(sa8dl6v, sa8dl7v);
+
+ vec_s16_t suml0123v = vec_add(suml01v, suml23v);
+ vec_s16_t suml4567v = vec_add(suml45v, suml67v);
+
+ vec_s32_t sumblocv_l0;
+
+ sumblocv_l0 = vec_sum4s(suml0123v, (vec_s32_t)zerov );
+ sumblocv_l0 = vec_sum4s(suml4567v, sumblocv_l0 );
+ sumblocv_l0 = vec_sums(sumblocv_l0, (vec_s32_t)zerov );
+ sumblocv_l0 = vec_splat(sumblocv_l0, 3);
+ vec_ste(sumblocv_l0, 0, &suml0);
+
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh0v,diffl0v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh1v, diffl1v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh2v, diffl2v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh3v, diffl3v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh4v, diffl4v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh5v, diffl5v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh6v, diffl6v);
+ VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh7v, diffl7v);
+
+ SA8D_1D_ALTIVEC(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v);
+ VEC_TRANSPOSE_8(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v,
+ sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v );
+ SA8D_1D_ALTIVEC(sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v);
+
+ SA8D_1D_ALTIVEC(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v);
+ VEC_TRANSPOSE_8(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v,
+ sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v );
+ SA8D_1D_ALTIVEC(sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v);
+
+ /* accumulation of the absolute values of all elements of the resulting block */
+ sa8dh0v = vec_max( sa8dh0v, vec_sub( zero_s16v, sa8dh0v ) );
+ sa8dh1v = vec_max( sa8dh1v, vec_sub( zero_s16v, sa8dh1v ) );
+ sumh01v = vec_add(sa8dh0v, sa8dh1v);
+
+ sa8dh2v = vec_max( sa8dh2v, vec_sub( zero_s16v, sa8dh2v ) );
+ sa8dh3v = vec_max( sa8dh3v, vec_sub( zero_s16v, sa8dh3v ) );
+ sumh23v = vec_add(sa8dh2v, sa8dh3v);
+
+ sa8dh4v = vec_max( sa8dh4v, vec_sub( zero_s16v, sa8dh4v ) );
+ sa8dh5v = vec_max( sa8dh5v, vec_sub( zero_s16v, sa8dh5v ) );
+ sumh45v = vec_add(sa8dh4v, sa8dh5v);
+
+ sa8dh6v = vec_max( sa8dh6v, vec_sub( zero_s16v, sa8dh6v ) );
+ sa8dh7v = vec_max( sa8dh7v, vec_sub( zero_s16v, sa8dh7v ) );
+ sumh67v = vec_add(sa8dh6v, sa8dh7v);
+
+ sumh0123v = vec_add(sumh01v, sumh23v);
+ sumh4567v = vec_add(sumh45v, sumh67v);
+
+ vec_s32_t sumblocv_h1;
+
+ sumblocv_h1 = vec_sum4s(sumh0123v, (vec_s32_t)zerov );
+ sumblocv_h1 = vec_sum4s(sumh4567v, sumblocv_h1 );
+ sumblocv_h1 = vec_sums(sumblocv_h1, (vec_s32_t)zerov );
+ sumblocv_h1 = vec_splat(sumblocv_h1, 3);
+ vec_ste(sumblocv_h1, 0, &sumh1);
+
+ sa8dl0v = vec_max( sa8dl0v, vec_sub( zero_s16v, sa8dl0v ) );
+ sa8dl1v = vec_max( sa8dl1v, vec_sub( zero_s16v, sa8dl1v ) );
+ suml01v = vec_add(sa8dl0v, sa8dl1v);
+
+ sa8dl2v = vec_max( sa8dl2v, vec_sub( zero_s16v, sa8dl2v ) );
+ sa8dl3v = vec_max( sa8dl3v, vec_sub( zero_s16v, sa8dl3v ) );
+ suml23v = vec_add(sa8dl2v, sa8dl3v);
+
+ sa8dl4v = vec_max( sa8dl4v, vec_sub( zero_s16v, sa8dl4v ) );
+ sa8dl5v = vec_max( sa8dl5v, vec_sub( zero_s16v, sa8dl5v ) );
+ suml45v = vec_add(sa8dl4v, sa8dl5v);
+
+ sa8dl6v = vec_max( sa8dl6v, vec_sub( zero_s16v, sa8dl6v ) );
+ sa8dl7v = vec_max( sa8dl7v, vec_sub( zero_s16v, sa8dl7v ) );
+ suml67v = vec_add(sa8dl6v, sa8dl7v);
+
+ suml0123v = vec_add(suml01v, suml23v);
+ suml4567v = vec_add(suml45v, suml67v);
+
+ vec_s32_t sumblocv_l1;
+
+ sumblocv_l1 = vec_sum4s(suml0123v, (vec_s32_t)zerov );
+ sumblocv_l1 = vec_sum4s(suml4567v, sumblocv_l1 );
+ sumblocv_l1 = vec_sums(sumblocv_l1, (vec_s32_t)zerov );
+ sumblocv_l1 = vec_splat(sumblocv_l1, 3);
+ vec_ste(sumblocv_l1, 0, &suml1);
+
+ sum = (sumh0 + suml0 + sumh1 + suml1 + 2) >> 2;
+ return sum;
+}
+
+int sa8d_16x32_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sum);
+ sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16*i_pix1, i_pix1, pix2+16*i_pix2, i_pix2);
+ return sum;
+}
+
+int sa8d_32x32_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sum);
+ int offset1, offset2;
+ offset1 = 16*i_pix1;
+ offset2 = 16*i_pix2;
+ sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16, i_pix1, pix2+16, i_pix2)
+ + sa8d_16x16_altivec(pix1+offset1, i_pix1, pix2+offset2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16+offset1, i_pix1, pix2+16+offset2, i_pix2);
+ return sum;
+}
+
+int sa8d_32x64_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sum);
+ int offset1, offset2;
+ offset1 = 16*i_pix1;
+ offset2 = 16*i_pix2;
+ sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16, i_pix1, pix2+16, i_pix2)
+ + sa8d_16x16_altivec(pix1+offset1, i_pix1, pix2+offset2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16+offset1, i_pix1, pix2+16+offset2, i_pix2)
+ + sa8d_16x16_altivec(pix1+32*i_pix1, i_pix1, pix2+32*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16+32*i_pix1, i_pix1, pix2+16+32*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+48*i_pix1, i_pix1, pix2+48*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16+48*i_pix1, i_pix1, pix2+16+48*i_pix2, i_pix2);
+ return sum;
+}
+
+int sa8d_64x64_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+{
+ ALIGN_VAR_16(int, sum);
+ int offset1, offset2;
+ offset1 = 16*i_pix1;
+ offset2 = 16*i_pix2;
+ sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16, i_pix1, pix2+16, i_pix2)
+ + sa8d_16x16_altivec(pix1+32, i_pix1, pix2+32, i_pix2)
+ + sa8d_16x16_altivec(pix1+48, i_pix1, pix2+48, i_pix2)
+ + sa8d_16x16_altivec(pix1+offset1, i_pix1, pix2+offset2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16+offset1, i_pix1, pix2+16+offset2, i_pix2)
+ + sa8d_16x16_altivec(pix1+32+offset1, i_pix1, pix2+32+offset2, i_pix2)
+ + sa8d_16x16_altivec(pix1+48+offset1, i_pix1, pix2+48+offset2, i_pix2)
+ + sa8d_16x16_altivec(pix1+32*i_pix1, i_pix1, pix2+32*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16+32*i_pix1, i_pix1, pix2+16+32*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+32+32*i_pix1, i_pix1, pix2+32+32*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+48+32*i_pix1, i_pix1, pix2+48+32*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+48*i_pix1, i_pix1, pix2+48*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+16+48*i_pix1, i_pix1, pix2+16+48*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+32+48*i_pix1, i_pix1, pix2+32+48*i_pix2, i_pix2)
+ + sa8d_16x16_altivec(pix1+48+48*i_pix1, i_pix1, pix2+48+48*i_pix2, i_pix2);
+ return sum;
+}
+
+/* Initialize entries for pixel functions defined in this file */
+void setupPixelPrimitives_altivec(EncoderPrimitives &p)
+{
+#define LUMA_PU(W, H) \
+ if (W<=16) { \
+ p.pu[LUMA_ ## W ## x ## H].sad = sad16_altivec<W, H>; \
+ p.pu[LUMA_ ## W ## x ## H].sad_x3 = sad16_x3_altivec<W, H>; \
+ p.pu[LUMA_ ## W ## x ## H].sad_x4 = sad16_x4_altivec<W, H>; \
+ } \
+ else { \
+ p.pu[LUMA_ ## W ## x ## H].sad = sad_altivec<W, H>; \
+ p.pu[LUMA_ ## W ## x ## H].sad_x3 = sad_x3_altivec<W, H>; \
+ p.pu[LUMA_ ## W ## x ## H].sad_x4 = sad_x4_altivec<W, H>; \
+ }
+
+ LUMA_PU(4, 4);
+ LUMA_PU(8, 8);
+ LUMA_PU(16, 16);
+ LUMA_PU(32, 32);
+ LUMA_PU(64, 64);
+ LUMA_PU(4, 8);
+ LUMA_PU(8, 4);
+ LUMA_PU(16, 8);
+ LUMA_PU(8, 16);
+ LUMA_PU(16, 12);
+ LUMA_PU(12, 16);
+ LUMA_PU(16, 4);
+ LUMA_PU(4, 16);
+ LUMA_PU(32, 16);
+ LUMA_PU(16, 32);
+ LUMA_PU(32, 24);
+ LUMA_PU(24, 32);
+ LUMA_PU(32, 8);
+ LUMA_PU(8, 32);
+ LUMA_PU(64, 32);
+ LUMA_PU(32, 64);
+ LUMA_PU(64, 48);
+ LUMA_PU(48, 64);
+ LUMA_PU(64, 16);
+ LUMA_PU(16, 64);
+
+ p.pu[LUMA_4x4].satd = satd_4x4_altivec;//satd_4x4;
+ p.pu[LUMA_8x8].satd = satd_8x8_altivec;//satd8<8, 8>;
+ p.pu[LUMA_8x4].satd = satd_8x4_altivec;//satd_8x4;
+ p.pu[LUMA_4x8].satd = satd_4x8_altivec;//satd4<4, 8>;
+ p.pu[LUMA_16x16].satd = satd_16x16_altivec;//satd8<16, 16>;
+ p.pu[LUMA_16x8].satd = satd_16x8_altivec;//satd8<16, 8>;
+ p.pu[LUMA_8x16].satd = satd_8x16_altivec;//satd8<8, 16>;
+ p.pu[LUMA_16x12].satd = satd_altivec<16, 12>;//satd8<16, 12>;
+ p.pu[LUMA_12x16].satd = satd_altivec<12, 16>;//satd4<12, 16>;
+ p.pu[LUMA_16x4].satd = satd_altivec<16, 4>;//satd8<16, 4>;
+ p.pu[LUMA_4x16].satd = satd_altivec<4, 16>;//satd4<4, 16>;
+ p.pu[LUMA_32x32].satd = satd_altivec<32, 32>;//satd8<32, 32>;
+ p.pu[LUMA_32x16].satd = satd_altivec<32, 16>;//satd8<32, 16>;
+ p.pu[LUMA_16x32].satd = satd_altivec<16, 32>;//satd8<16, 32>;
+ p.pu[LUMA_32x24].satd = satd_altivec<32, 24>;//satd8<32, 24>;
+ p.pu[LUMA_24x32].satd = satd_altivec<24, 32>;//satd8<24, 32>;
+ p.pu[LUMA_32x8].satd = satd_altivec<32, 8>;//satd8<32, 8>;
+ p.pu[LUMA_8x32].satd = satd_altivec<8,32>;//satd8<8, 32>;
+ p.pu[LUMA_64x64].satd = satd_altivec<64, 64>;//satd8<64, 64>;
+ p.pu[LUMA_64x32].satd = satd_altivec<64, 32>;//satd8<64, 32>;
+ p.pu[LUMA_32x64].satd = satd_altivec<32, 64>;//satd8<32, 64>;
+ p.pu[LUMA_64x48].satd = satd_altivec<64, 48>;//satd8<64, 48>;
+ p.pu[LUMA_48x64].satd = satd_altivec<48, 64>;//satd8<48, 64>;
+ p.pu[LUMA_64x16].satd = satd_altivec<64, 16>;//satd8<64, 16>;
+ p.pu[LUMA_16x64].satd = satd_altivec<16, 64>;//satd8<16, 64>;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = satd_4x4_altivec;//satd_4x4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = satd_8x8_altivec;//satd8<8, 8>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = satd_16x16_altivec;//satd8<16, 16>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = satd_altivec<32, 32>;//satd8<32, 32>;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd = satd_8x4_altivec;//satd_8x4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd = satd_4x8_altivec;//satd4<4, 8>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = satd_16x8_altivec;//satd8<16, 8>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd = satd_8x16_altivec;//satd8<8, 16>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = satd_altivec<32, 16>;//satd8<32, 16>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = satd_altivec<16, 32>;//satd8<16, 32>;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].satd = satd_altivec<16, 12>;//satd4<16, 12>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].satd = satd_altivec<12, 16>;//satd4<12, 16>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].satd = satd_altivec<16, 4>;//satd4<16, 4>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd = satd_altivec<4, 16>;//satd4<4, 16>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = satd_altivec<32, 24>;//satd8<32, 24>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].satd = satd_altivec<24, 32>;//satd8<24, 32>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = satd_altivec<32, 8>;//satd8<32, 8>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = satd_altivec<8,32>;//satd8<8, 32>;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = satd_4x8_altivec;//satd4<4, 8>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = satd_8x16_altivec;//satd8<8, 16>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = satd_altivec<16, 32>;//satd8<16, 32>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = satd_altivec<32, 64>;//satd8<32, 64>;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = satd_4x4_altivec;//satd_4x4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = satd_8x8_altivec;//satd8<8, 8>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = satd_altivec<4, 16>;//satd4<4, 16>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = satd_16x16_altivec;//satd8<16, 16>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = satd_altivec<8,32>;//satd8<8, 32>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = satd_altivec<32, 32>;//satd8<32, 32>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = satd_altivec<16, 64>;//satd8<16, 64>;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = satd_altivec<8, 12>;//satd4<8, 12>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = satd_8x4_altivec;//satd4<8, 4>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = satd_altivec<16, 24>;//satd8<16, 24>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = satd_altivec<12, 32>;//satd4<12, 32>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = satd_16x8_altivec;//satd8<16, 8>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = satd_altivec<4, 32>;//satd4<4, 32>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = satd_altivec<32, 48>;//satd8<32, 48>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = satd_altivec<24, 64>;//satd8<24, 64>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = satd_altivec<32, 16>;//satd8<32, 16>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = satd_altivec<8,64>;//satd8<8, 64>;
+
+ p.cu[BLOCK_4x4].sa8d = satd_4x4_altivec;//satd_4x4;
+ p.cu[BLOCK_8x8].sa8d = sa8d_8x8_altivec;//sa8d_8x8;
+ p.cu[BLOCK_16x16].sa8d = sa8d_16x16_altivec;//sa8d_16x16;
+ p.cu[BLOCK_32x32].sa8d = sa8d_32x32_altivec;//sa8d16<32, 32>;
+ p.cu[BLOCK_64x64].sa8d = sa8d_64x64_altivec;//sa8d16<64, 64>;
+
+ p.chroma[X265_CSP_I420].cu[BLOCK_16x16].sa8d = sa8d_8x8_altivec;//sa8d8<8, 8>;
+ p.chroma[X265_CSP_I420].cu[BLOCK_32x32].sa8d = sa8d_16x16_altivec;//sa8d16<16, 16>;
+ p.chroma[X265_CSP_I420].cu[BLOCK_64x64].sa8d = sa8d_32x32_altivec;//sa8d16<32, 32>;
+
+ p.chroma[X265_CSP_I422].cu[BLOCK_16x16].sa8d = sa8d_8x16_altivec;//sa8d8<8, 16>;
+ p.chroma[X265_CSP_I422].cu[BLOCK_32x32].sa8d = sa8d_16x32_altivec;//sa8d16<16, 32>;
+ p.chroma[X265_CSP_I422].cu[BLOCK_64x64].sa8d = sa8d_32x64_altivec;//sa8d16<32, 64>;
+
+}
+}
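As the primitives.cpp hunk below this file shows, x265 installs the C (and any assembly) primitives first and then lets setupPixelPrimitives_altivec overwrite individual function-pointer slots, so any entry left unassigned here silently keeps its generic fallback. A minimal model of that override pattern, with hypothetical types and names (the real EncoderPrimitives table is far larger):

```cpp
// Minimal model of x265's primitive-table override pattern
// (hypothetical names; illustration only).
typedef int (*satd_fn)(int);

static int satd_c(int x)       { return x; }      // portable C fallback
static int satd_altivec(int x) { return x * 2; }  // stand-in SIMD version

struct Primitives { satd_fn satd; };

// C setup fills every slot; the AltiVec setup overwrites selected ones.
static void setup_c_primitives(Primitives& p)       { p.satd = satd_c; }
static void setup_altivec_primitives(Primitives& p) { p.satd = satd_altivec; }
```

Calling setup_c_primitives and then setup_altivec_primitives leaves `p.satd` pointing at the AltiVec version, which is the order x265_setup_primitives uses for the real tables.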
diff --git a/source/common/ppc/ppccommon.h b/source/common/ppc/ppccommon.h
new file mode 100644
index 0000000..9822c4f
--- /dev/null
+++ b/source/common/ppc/ppccommon.h
@@ -0,0 +1,91 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Min Chen <min.chen at multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_PPCCOMMON_H
+#define X265_PPCCOMMON_H
+
+
+#if HAVE_ALTIVEC
+#include <altivec.h>
+
+#define vec_u8_t vector unsigned char
+#define vec_s8_t vector signed char
+#define vec_u16_t vector unsigned short
+#define vec_s16_t vector signed short
+#define vec_u32_t vector unsigned int
+#define vec_s32_t vector signed int
+
+//copy from x264
+#define LOAD_ZERO const vec_u8_t zerov = vec_splat_u8( 0 )
+
+#define zero_u8v (vec_u8_t) zerov
+#define zero_s8v (vec_s8_t) zerov
+#define zero_u16v (vec_u16_t) zerov
+#define zero_s16v (vec_s16_t) zerov
+#define zero_u32v (vec_u32_t) zerov
+#define zero_s32v (vec_s32_t) zerov
+
+/***********************************************************************
+ * 8 <-> 16 bits conversions
+ **********************************************************************/
+#ifdef WORDS_BIGENDIAN
+#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( zero_u8v, (vec_u8_t) v )
+#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( zero_u8v, (vec_u8_t) v )
+#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( zero_u8v, (vec_u8_t) v )
+#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( zero_u8v, (vec_u8_t) v )
+#else
+#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( (vec_u8_t) v, zero_u8v )
+#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( (vec_u8_t) v, zero_u8v )
+#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( (vec_u8_t) v, zero_u8v )
+#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( (vec_u8_t) v, zero_u8v )
+#endif
+
+#define vec_u8_to_u16(v) vec_u8_to_u16_h(v)
+#define vec_u8_to_s16(v) vec_u8_to_s16_h(v)
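The merge-based conversions above work because vec_mergeh(zero, v) interleaves a zero byte in front of each data byte; read as big-endian 16-bit lanes, that is exactly a zero-extension of the first eight u8 lanes, which is why the operand order is swapped on little-endian. A scalar model of the big-endian case (illustration only, not AltiVec code):

```cpp
#include <cstdint>

// Scalar model of vec_u8_to_u16_h on big-endian: vec_mergeh(zero, v)
// yields bytes 0, v[0], 0, v[1], ...; each big-endian u16 lane is then
// just v[i], i.e. a zero-extension of the high half of the vector.
static void u8_to_u16_h_model(const uint8_t v[16], uint16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)((0u << 8) | v[i]); // zero byte, then data byte
}
```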
+
+#ifdef WORDS_BIGENDIAN
+#define vec_u16_to_u32_h(v) (vec_u32_t) vec_mergeh( zero_u16v, (vec_u16_t) v )
+#define vec_u16_to_u32_l(v) (vec_u32_t) vec_mergel( zero_u16v, (vec_u16_t) v )
+#define vec_u16_to_s32_h(v) (vec_s32_t) vec_mergeh( zero_u16v, (vec_u16_t) v )
+#define vec_u16_to_s32_l(v) (vec_s32_t) vec_mergel( zero_u16v, (vec_u16_t) v )
+#else
+#define vec_u16_to_u32_h(v) (vec_u32_t) vec_mergeh( (vec_u16_t) v, zero_u16v )
+#define vec_u16_to_u32_l(v) (vec_u32_t) vec_mergel( (vec_u16_t) v, zero_u16v )
+#define vec_u16_to_s32_h(v) (vec_s32_t) vec_mergeh( (vec_u16_t) v, zero_u16v )
+#define vec_u16_to_s32_l(v) (vec_s32_t) vec_mergel( (vec_u16_t) v, zero_u16v )
+#endif
+
+#define vec_u16_to_u32(v) vec_u16_to_u32_h(v)
+#define vec_u16_to_s32(v) vec_u16_to_s32_h(v)
+
+#define vec_u32_to_u16(v) vec_pack( v, zero_u32v )
+#define vec_s32_to_u16(v) vec_packsu( v, zero_s32v )
+
+#define BITS_PER_SUM (8 * sizeof(sum_t))
+
+#endif /* HAVE_ALTIVEC */
+
+#endif /* X265_PPCCOMMON_H */
+
+
+
diff --git a/source/common/primitives.cpp b/source/common/primitives.cpp
index ddbde38..18bd7e1 100644
--- a/source/common/primitives.cpp
+++ b/source/common/primitives.cpp
@@ -243,6 +243,15 @@ void x265_setup_primitives(x265_param *param)
#endif
setupAssemblyPrimitives(primitives, param->cpuid);
#endif
+#if HAVE_ALTIVEC
+ if (param->cpuid & X265_CPU_ALTIVEC)
+ {
+ setupPixelPrimitives_altivec(primitives); // pixel_altivec.cpp, overwrite the initialization for altivec optimized functions
+ setupDCTPrimitives_altivec(primitives); // dct_altivec.cpp, overwrite the initialization for altivec optimized functions
+ setupFilterPrimitives_altivec(primitives); // ipfilter.cpp, overwrite the initialization for altivec optimized functions
+ setupIntraPrimitives_altivec(primitives); // intrapred_altivec.cpp, overwrite the initialization for altivec optimized functions
+ }
+#endif
setupAliasPrimitives(primitives);
}
diff --git a/source/common/primitives.h b/source/common/primitives.h
index ad632f0..d038f3d 100644
--- a/source/common/primitives.h
+++ b/source/common/primitives.h
@@ -115,6 +115,7 @@ typedef int (*pixelcmp_ss_t)(const int16_t* fenc, intptr_t fencstride, const in
typedef sse_t (*pixel_sse_t)(const pixel* fenc, intptr_t fencstride, const pixel* fref, intptr_t frefstride); // fenc is aligned
typedef sse_t (*pixel_sse_ss_t)(const int16_t* fenc, intptr_t fencstride, const int16_t* fref, intptr_t frefstride);
typedef sse_t (*pixel_ssd_s_t)(const int16_t* fenc, intptr_t fencstride);
+typedef int(*pixelcmp_ads_t)(int encDC[], uint32_t *sums, int delta, uint16_t *costMvX, int16_t *mvs, int width, int thresh);
typedef void (*pixelcmp_x4_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
typedef void (*pixelcmp_x3_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
typedef void (*blockfill_s_t)(int16_t* dst, intptr_t dstride, int16_t val);
@@ -217,6 +218,7 @@ struct EncoderPrimitives
pixelcmp_t sad; // Sum of Absolute Differences
pixelcmp_x3_t sad_x3; // Sum of Absolute Differences, 3 mv offsets at once
pixelcmp_x4_t sad_x4; // Sum of Absolute Differences, 4 mv offsets at once
+ pixelcmp_ads_t ads; // Absolute Differences sum
pixelcmp_t satd; // Sum of Absolute Transformed Differences (4x4 Hadamard)
filter_pp_t luma_hpp; // 8-tap luma motion compensation interpolation filters
@@ -402,6 +404,22 @@ inline int partitionFromSizes(int width, int height)
return part;
}
+/* Computes the size of the LumaPU for a given LumaPU enum */
+inline void sizesFromPartition(int part, int *width, int *height)
+{
+ X265_CHECK(part >= 0 && part <= 24, "Invalid part %d \n", part);
+ extern const uint8_t lumaPartitionMapTable[];
+ int index = 0;
+ for (int i = 0; i < 256;i++)
+ if (part == lumaPartitionMapTable[i])
+ {
+ index = i;
+ break;
+ }
+ *width = 4 * ((index >> 4) + 1);
+ *height = 4 * ((index % 16) + 1);
+}
+
inline int partitionFromLog2Size(int log2Size)
{
X265_CHECK(2 <= log2Size && log2Size <= 6, "Invalid block size\n");
@@ -412,6 +430,12 @@ void setupCPrimitives(EncoderPrimitives &p);
void setupInstrinsicPrimitives(EncoderPrimitives &p, int cpuMask);
void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask);
void setupAliasPrimitives(EncoderPrimitives &p);
+#if HAVE_ALTIVEC
+void setupPixelPrimitives_altivec(EncoderPrimitives &p);
+void setupDCTPrimitives_altivec(EncoderPrimitives &p);
+void setupFilterPrimitives_altivec(EncoderPrimitives &p);
+void setupIntraPrimitives_altivec(EncoderPrimitives &p);
+#endif
}
#if !EXPORT_C_API
diff --git a/source/common/scalinglist.cpp b/source/common/scalinglist.cpp
index aff25da..602cf98 100644
--- a/source/common/scalinglist.cpp
+++ b/source/common/scalinglist.cpp
@@ -29,64 +29,6 @@ namespace {
// file-anonymous namespace
/* Strings for scaling list file parsing */
-const char MatrixType[4][6][20] =
-{
- {
- "INTRA4X4_LUMA",
- "INTRA4X4_CHROMAU",
- "INTRA4X4_CHROMAV",
- "INTER4X4_LUMA",
- "INTER4X4_CHROMAU",
- "INTER4X4_CHROMAV"
- },
- {
- "INTRA8X8_LUMA",
- "INTRA8X8_CHROMAU",
- "INTRA8X8_CHROMAV",
- "INTER8X8_LUMA",
- "INTER8X8_CHROMAU",
- "INTER8X8_CHROMAV"
- },
- {
- "INTRA16X16_LUMA",
- "INTRA16X16_CHROMAU",
- "INTRA16X16_CHROMAV",
- "INTER16X16_LUMA",
- "INTER16X16_CHROMAU",
- "INTER16X16_CHROMAV"
- },
- {
- "INTRA32X32_LUMA",
- "",
- "",
- "INTER32X32_LUMA",
- "",
- "",
- },
-};
-const char MatrixType_DC[4][12][22] =
-{
- {
- },
- {
- },
- {
- "INTRA16X16_LUMA_DC",
- "INTRA16X16_CHROMAU_DC",
- "INTRA16X16_CHROMAV_DC",
- "INTER16X16_LUMA_DC",
- "INTER16X16_CHROMAU_DC",
- "INTER16X16_CHROMAV_DC"
- },
- {
- "INTRA32X32_LUMA_DC",
- "",
- "",
- "INTER32X32_LUMA_DC",
- "",
- "",
- },
-};
static int quantTSDefault4x4[16] =
{
@@ -124,6 +66,64 @@ static int quantInterDefault8x8[64] =
namespace X265_NS {
// private namespace
+ const char ScalingList::MatrixType[4][6][20] =
+ {
+ {
+ "INTRA4X4_LUMA",
+ "INTRA4X4_CHROMAU",
+ "INTRA4X4_CHROMAV",
+ "INTER4X4_LUMA",
+ "INTER4X4_CHROMAU",
+ "INTER4X4_CHROMAV"
+ },
+ {
+ "INTRA8X8_LUMA",
+ "INTRA8X8_CHROMAU",
+ "INTRA8X8_CHROMAV",
+ "INTER8X8_LUMA",
+ "INTER8X8_CHROMAU",
+ "INTER8X8_CHROMAV"
+ },
+ {
+ "INTRA16X16_LUMA",
+ "INTRA16X16_CHROMAU",
+ "INTRA16X16_CHROMAV",
+ "INTER16X16_LUMA",
+ "INTER16X16_CHROMAU",
+ "INTER16X16_CHROMAV"
+ },
+ {
+ "INTRA32X32_LUMA",
+ "",
+ "",
+ "INTER32X32_LUMA",
+ "",
+ "",
+ },
+ };
+ const char ScalingList::MatrixType_DC[4][12][22] =
+ {
+ {
+ },
+ {
+ },
+ {
+ "INTRA16X16_LUMA_DC",
+ "INTRA16X16_CHROMAU_DC",
+ "INTRA16X16_CHROMAV_DC",
+ "INTER16X16_LUMA_DC",
+ "INTER16X16_CHROMAU_DC",
+ "INTER16X16_CHROMAV_DC"
+ },
+ {
+ "INTRA32X32_LUMA_DC",
+ "",
+ "",
+ "INTER32X32_LUMA_DC",
+ "",
+ "",
+ },
+ };
const int ScalingList::s_numCoefPerSize[NUM_SIZES] = { 16, 64, 256, 1024 };
const int32_t ScalingList::s_quantScales[NUM_REM] = { 26214, 23302, 20560, 18396, 16384, 14564 };
@@ -312,6 +312,22 @@ bool ScalingList::parseScalingList(const char* filename)
m_scalingListDC[sizeIdc][listIdc] = data;
}
}
+ if (sizeIdc == 3)
+ {
+ for (int listIdc = 1; listIdc < NUM_LISTS; listIdc++)
+ {
+ if (listIdc % 3 != 0)
+ {
+ src = m_scalingListCoef[sizeIdc][listIdc];
+ const int *srcNextSmallerSize = m_scalingListCoef[sizeIdc - 1][listIdc];
+ for (int i = 0; i < size; i++)
+ {
+ src[i] = srcNextSmallerSize[i];
+ }
+ m_scalingListDC[sizeIdc][listIdc] = m_scalingListDC[sizeIdc - 1][listIdc];
+ }
+ }
+ }
}
fclose(fp);
diff --git a/source/common/scalinglist.h b/source/common/scalinglist.h
index 467f10f..08893b9 100644
--- a/source/common/scalinglist.h
+++ b/source/common/scalinglist.h
@@ -42,6 +42,8 @@ public:
static const int s_numCoefPerSize[NUM_SIZES];
static const int32_t s_invQuantScales[NUM_REM];
static const int32_t s_quantScales[NUM_REM];
+ static const char MatrixType[4][6][20];
+ static const char MatrixType_DC[4][12][22];
int32_t m_scalingListDC[NUM_SIZES][NUM_LISTS]; // the DC value of the matrix coefficient for 16x16
int32_t* m_scalingListCoef[NUM_SIZES][NUM_LISTS]; // quantization matrix
diff --git a/source/common/slice.h b/source/common/slice.h
index eeefed5..5b9478e 100644
--- a/source/common/slice.h
+++ b/source/common/slice.h
@@ -239,11 +239,16 @@ struct SPS
uint32_t maxLatencyIncrease;
int numReorderPics;
+ RPS spsrps[MAX_NUM_SHORT_TERM_RPS];
+ int spsrpsNum;
+ int numGOPBegin;
+
bool bUseSAO; // use param
bool bUseAMP; // use param
bool bUseStrongIntraSmoothing; // use param
bool bTemporalMVPEnabled;
- bool bDiscardOptionalVUI;
+ bool bEmitVUITimingInfo;
+ bool bEmitVUIHRDInfo;
Window conformanceWindow;
VUI vuiParameters;
@@ -282,6 +287,8 @@ struct PPS
bool bDeblockingFilterControlPresent;
bool bPicDisableDeblockingFilter;
+
+ int numRefIdxDefault[2];
};
struct WeightParam
@@ -334,6 +341,7 @@ public:
int m_sliceQp;
int m_poc;
int m_lastIDR;
+ int m_rpsIdx;
uint32_t m_colRefIdx; // never modified
@@ -347,6 +355,10 @@ public:
bool m_sLFaseFlag; // loop filter boundary flag
bool m_colFromL0Flag; // collocated picture from List0 or List1 flag
+ int m_iPPSQpMinus26;
+ int numRefIdxDefault[2];
+ int m_iNumRPSInSPS;
+
Slice()
{
m_lastIDR = 0;
@@ -356,6 +368,10 @@ public:
memset(m_refReconPicList, 0, sizeof(m_refReconPicList));
memset(m_refPOCList, 0, sizeof(m_refPOCList));
disableWeights();
+ m_iPPSQpMinus26 = 0;
+ numRefIdxDefault[0] = 1;
+ numRefIdxDefault[1] = 1;
+ m_rpsIdx = -1;
}
void disableWeights();
diff --git a/source/common/version.cpp b/source/common/version.cpp
index 062c91a..e4d7554 100644
--- a/source/common/version.cpp
+++ b/source/common/version.cpp
@@ -77,7 +77,7 @@
#define BITS "[32 bit]"
#endif
-#if defined(ENABLE_ASSEMBLY)
+#if defined(ENABLE_ASSEMBLY) || HAVE_ALTIVEC
#define ASM ""
#else
#define ASM "[noasm]"
diff --git a/source/common/yuv.cpp b/source/common/yuv.cpp
index 33f0ed0..7eebc96 100644
--- a/source/common/yuv.cpp
+++ b/source/common/yuv.cpp
@@ -47,6 +47,11 @@ bool Yuv::create(uint32_t size, int csp)
m_size = size;
m_part = partitionFromSizes(size, size);
+ for (int i = 0; i < 2; i++)
+ for (int j = 0; j < MAX_NUM_REF; j++)
+ for (int k = 0; k < INTEGRAL_PLANE_NUM; k++)
+ m_integral[i][j][k] = NULL;
+
if (csp == X265_CSP_I400)
{
CHECKED_MALLOC(m_buf[0], pixel, size * size + 8);
diff --git a/source/common/yuv.h b/source/common/yuv.h
index cb60b2d..3fb48e2 100644
--- a/source/common/yuv.h
+++ b/source/common/yuv.h
@@ -48,6 +48,7 @@ public:
int m_csp;
int m_hChromaShift;
int m_vChromaShift;
+ uint32_t *m_integral[2][MAX_NUM_REF][INTEGRAL_PLANE_NUM];
Yuv();
diff --git a/source/encoder/analysis.cpp b/source/encoder/analysis.cpp
index a54d19e..bbbac43 100644
--- a/source/encoder/analysis.cpp
+++ b/source/encoder/analysis.cpp
@@ -203,6 +203,57 @@ Mode& Analysis::compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, con
return *m_modeDepth[0].bestMode;
}
+int32_t Analysis::loadTUDepth(CUGeom cuGeom, CUData parentCTU)
+{
+ float predDepth = 0;
+ CUData* neighbourCU;
+ uint8_t count = 0;
+ int32_t maxTUDepth = -1;
+ neighbourCU = m_slice->m_refFrameList[0][0]->m_encData->m_picCTU;
+ predDepth += neighbourCU->m_refTuDepth[cuGeom.geomRecurId];
+ count++;
+ if (m_slice->isInterB())
+ {
+ neighbourCU = m_slice->m_refFrameList[1][0]->m_encData->m_picCTU;
+ predDepth += neighbourCU->m_refTuDepth[cuGeom.geomRecurId];
+ count++;
+ }
+ if (parentCTU.m_cuAbove)
+ {
+ predDepth += parentCTU.m_cuAbove->m_refTuDepth[cuGeom.geomRecurId];
+ count++;
+ if (parentCTU.m_cuAboveLeft)
+ {
+ predDepth += parentCTU.m_cuAboveLeft->m_refTuDepth[cuGeom.geomRecurId];
+ count++;
+ }
+ if (parentCTU.m_cuAboveRight)
+ {
+ predDepth += parentCTU.m_cuAboveRight->m_refTuDepth[cuGeom.geomRecurId];
+ count++;
+ }
+ }
+ if (parentCTU.m_cuLeft)
+ {
+ predDepth += parentCTU.m_cuLeft->m_refTuDepth[cuGeom.geomRecurId];
+ count++;
+ }
+ predDepth /= count;
+
+ if (predDepth == 0)
+ maxTUDepth = 0;
+ else if (predDepth < 1)
+ maxTUDepth = 1;
+ else if (predDepth >= 1 && predDepth <= 1.5)
+ maxTUDepth = 2;
+ else if (predDepth > 1.5 && predDepth <= 2.5)
+ maxTUDepth = 3;
+ else
+ maxTUDepth = -1;
+
+ return maxTUDepth;
+}
+
void Analysis::tryLossless(const CUGeom& cuGeom)
{
ModeDepth& md = m_modeDepth[cuGeom.depth];
@@ -394,6 +445,16 @@ void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, in
cacheCost[cuIdx] = md.bestMode->rdCost;
}
+ /* Save Intra CUs TU depth only when analysis mode is OFF */
+ if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4 && !m_param->analysisMode)
+ {
+ CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr);
+ int8_t maxTUDepth = -1;
+ for (uint32_t i = 0; i < cuGeom.numPartitions; i++)
+ maxTUDepth = X265_MAX(maxTUDepth, md.pred[PRED_INTRA].cu.m_tuDepth[i]);
+ ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth;
+ }
+
/* Copy best data to encData CTU and recon */
md.bestMode->cu.copyToPic(depth);
if (md.bestMode != &md.pred[PRED_SPLIT])
@@ -883,6 +944,16 @@ SplitData Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom&
ModeDepth& md = m_modeDepth[depth];
md.bestMode = NULL;
+ if (m_param->searchMethod == X265_SEA)
+ {
+ int numPredDir = m_slice->isInterP() ? 1 : 2;
+ int offset = (int)(m_frame->m_reconPic->m_cuOffsetY[parentCTU.m_cuAddr] + m_frame->m_reconPic->m_buOffsetY[cuGeom.absPartIdx]);
+ for (int list = 0; list < numPredDir; list++)
+ for (int i = 0; i < m_frame->m_encData->m_slice->m_numRefIdx[list]; i++)
+ for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++)
+ m_modeDepth[depth].fencYuv.m_integral[list][i][planes] = m_frame->m_encData->m_slice->m_refFrameList[list][i]->m_encData->m_meIntegral[planes] + offset;
+ }
+
PicYuv& reconPic = *m_frame->m_reconPic;
bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
@@ -894,6 +965,9 @@ SplitData Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom&
bool skipRectAmp = false;
bool chooseMerge = false;
+ if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4)
+ m_maxTUDepth = loadTUDepth(cuGeom, parentCTU);
+
SplitData splitData[4];
splitData[0].initSplitCUData();
splitData[1].initSplitCUData();
@@ -1400,6 +1474,18 @@ SplitData Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom&
if (m_param->rdLevel)
md.bestMode->reconYuv.copyToPicYuv(reconPic, cuAddr, cuGeom.absPartIdx);
+ if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4)
+ {
+ if (mightNotSplit)
+ {
+ CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr);
+ int8_t maxTUDepth = -1;
+ for (uint32_t i = 0; i < cuGeom.numPartitions; i++)
+ maxTUDepth = X265_MAX(maxTUDepth, md.bestMode->cu.m_tuDepth[i]);
+ ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth;
+ }
+ }
+
return splitCUData;
}
@@ -1409,6 +1495,16 @@ SplitData Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom&
ModeDepth& md = m_modeDepth[depth];
md.bestMode = NULL;
+ if (m_param->searchMethod == X265_SEA)
+ {
+ int numPredDir = m_slice->isInterP() ? 1 : 2;
+ int offset = (int)(m_frame->m_reconPic->m_cuOffsetY[parentCTU.m_cuAddr] + m_frame->m_reconPic->m_buOffsetY[cuGeom.absPartIdx]);
+ for (int list = 0; list < numPredDir; list++)
+ for (int i = 0; i < m_frame->m_encData->m_slice->m_numRefIdx[list]; i++)
+ for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++)
+ m_modeDepth[depth].fencYuv.m_integral[list][i][planes] = m_frame->m_encData->m_slice->m_refFrameList[list][i]->m_encData->m_meIntegral[planes] + offset;
+ }
+
bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
bool skipRecursion = false;
@@ -1424,6 +1520,9 @@ SplitData Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom&
md.pred[PRED_2Nx2N].rdCost = 0;
}
+ if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4)
+ m_maxTUDepth = loadTUDepth(cuGeom, parentCTU);
+
SplitData splitData[4];
splitData[0].initSplitCUData();
splitData[1].initSplitCUData();
@@ -1751,6 +1850,18 @@ SplitData Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom&
addSplitFlagCost(*md.bestMode, cuGeom.depth);
}
+ if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4)
+ {
+ if (mightNotSplit)
+ {
+ CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr);
+ int8_t maxTUDepth = -1;
+ for (uint32_t i = 0; i < cuGeom.numPartitions; i++)
+ maxTUDepth = X265_MAX(maxTUDepth, md.bestMode->cu.m_tuDepth[i]);
+ ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth;
+ }
+ }
+
/* compare split RD cost against best cost */
if (mightSplit && !skipRecursion)
checkBestMode(md.pred[PRED_SPLIT], depth);
@@ -1942,12 +2053,12 @@ void Analysis::checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGe
if (m_param->maxSlices > 1)
{
// NOTE: First row in slice can't be negative
- if ((candMvField[i][0].mv.y < m_sliceMinY) | (candMvField[i][1].mv.y < m_sliceMinY))
+ if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < m_sliceMinY)
continue;
// Last row in slice can't reference beyond the bound since that is another slice's area
// TODO: we may reference beyond the bound in future, since those areas have a chance to be finished because we use parallel slices. Research on load balancing is needed first
- if ((candMvField[i][0].mv.y > m_sliceMaxY) | (candMvField[i][1].mv.y > m_sliceMaxY))
+ if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > m_sliceMaxY)
continue;
}
@@ -2072,12 +2183,12 @@ void Analysis::checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGe
if (m_param->maxSlices > 1)
{
// NOTE: First row in slice can't be negative
- if ((candMvField[i][0].mv.y < m_sliceMinY) | (candMvField[i][1].mv.y < m_sliceMinY))
+ if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < m_sliceMinY)
continue;
// Last row in slice can't reference beyond the bound since that is another slice's area
// TODO: we may reference beyond the bound in future, since those areas have a chance to be finished because we use parallel slices. Research on load balancing is needed first
- if ((candMvField[i][0].mv.y > m_sliceMaxY) | (candMvField[i][1].mv.y > m_sliceMaxY))
+ if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > m_sliceMaxY)
continue;
}
diff --git a/source/encoder/analysis.h b/source/encoder/analysis.h
index aedcc2e..79a0aa7 100644
--- a/source/encoder/analysis.h
+++ b/source/encoder/analysis.h
@@ -116,6 +116,7 @@ public:
void destroy();
Mode& compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext);
+ int32_t loadTUDepth(CUGeom cuGeom, CUData parentCTU);
protected:
/* Analysis data for save/load mode, writes/reads data based on absPartIdx */
diff --git a/source/encoder/api.cpp b/source/encoder/api.cpp
index 49cae06..4743ea3 100644
--- a/source/encoder/api.cpp
+++ b/source/encoder/api.cpp
@@ -141,6 +141,11 @@ int x265_encoder_headers(x265_encoder *enc, x265_nal **pp_nal, uint32_t *pi_nal)
Encoder *encoder = static_cast<Encoder*>(enc);
Entropy sbacCoder;
Bitstream bs;
+ if (encoder->m_param->rc.bStatRead && encoder->m_param->bMultiPassOptRPS)
+ {
+ if (!encoder->computeSPSRPSIndex())
+ return -1;
+ }
encoder->getStreamHeaders(encoder->m_nalList, sbacCoder, bs);
*pp_nal = &encoder->m_nalList.m_nal[0];
if (pi_nal) *pi_nal = encoder->m_nalList.m_numNal;
diff --git a/source/encoder/bitcost.cpp b/source/encoder/bitcost.cpp
index f1d20d9..3506abe 100644
--- a/source/encoder/bitcost.cpp
+++ b/source/encoder/bitcost.cpp
@@ -54,16 +54,40 @@ void BitCost::setQP(unsigned int qp)
s_costs[qp][i] = s_costs[qp][-i] = (uint16_t)X265_MIN(s_bitsizes[i] * lambda + 0.5f, (1 << 15) - 1);
}
}
-
+ for (int j = 0; j < 4; j++)
+ {
+ if (!s_fpelMvCosts[qp][j])
+ {
+ ScopedLock s(s_costCalcLock);
+ if (!s_fpelMvCosts[qp][j])
+ {
+ s_fpelMvCosts[qp][j] = X265_MALLOC(uint16_t, BC_MAX_MV + 1) + (BC_MAX_MV >> 1);
+ if (!s_fpelMvCosts[qp][j])
+ {
+ x265_log(NULL, X265_LOG_ERROR, "BitCost s_fpelMvCosts buffer allocation failure\n");
+ return;
+ }
+ for (int i = -(BC_MAX_MV >> 1); i < (BC_MAX_MV >> 1); i++)
+ {
+ s_fpelMvCosts[qp][j][i] = s_costs[qp][i * 4 + j];
+ }
+ }
+ }
+ }
m_cost = s_costs[qp];
+ for (int j = 0; j < 4; j++)
+ {
+ m_fpelMvCosts[j] = s_fpelMvCosts[qp][j];
+ }
}
-
/***
* Class static data and methods
*/
uint16_t *BitCost::s_costs[BC_MAX_QP];
+uint16_t* BitCost::s_fpelMvCosts[BC_MAX_QP][4];
+
float *BitCost::s_bitsizes;
Lock BitCost::s_costCalcLock;
@@ -96,6 +120,17 @@ void BitCost::destroy()
s_costs[i] = NULL;
}
}
+ for (int i = 0; i < BC_MAX_QP; i++)
+ {
+ for (int j = 0; j < 4; j++)
+ {
+ if (s_fpelMvCosts[i][j])
+ {
+ X265_FREE(s_fpelMvCosts[i][j] - (BC_MAX_MV >> 1));
+ s_fpelMvCosts[i][j] = NULL;
+ }
+ }
+ }
if (s_bitsizes)
{
diff --git a/source/encoder/bitcost.h b/source/encoder/bitcost.h
index 68ae41e..257d9c6 100644
--- a/source/encoder/bitcost.h
+++ b/source/encoder/bitcost.h
@@ -67,6 +67,8 @@ protected:
uint16_t *m_cost;
+ uint16_t *m_fpelMvCosts[4];
+
MV m_mvp;
BitCost& operator =(const BitCost&);
@@ -84,6 +86,8 @@ private:
static uint16_t *s_costs[BC_MAX_QP];
+ static uint16_t *s_fpelMvCosts[BC_MAX_QP][4];
+
static Lock s_costCalcLock;
static void CalculateLogs();
diff --git a/source/encoder/dpb.cpp b/source/encoder/dpb.cpp
index de79fa5..985545f 100644
--- a/source/encoder/dpb.cpp
+++ b/source/encoder/dpb.cpp
@@ -92,6 +92,19 @@ void DPB::recycleUnreferenced()
m_freeList.pushBack(*curFrame);
curFrame->m_encData->m_freeListNext = m_frameDataFreeList;
m_frameDataFreeList = curFrame->m_encData;
+
+ if (curFrame->m_encData->m_meBuffer)
+ {
+ for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+ {
+ if (curFrame->m_encData->m_meBuffer[i] != NULL)
+ {
+ X265_FREE(curFrame->m_encData->m_meBuffer[i]);
+ curFrame->m_encData->m_meBuffer[i] = NULL;
+ }
+ }
+ }
+
curFrame->m_encData = NULL;
curFrame->m_reconPic = NULL;
}
diff --git a/source/encoder/encoder.cpp b/source/encoder/encoder.cpp
index 1a8402b..6021e27 100644
--- a/source/encoder/encoder.cpp
+++ b/source/encoder/encoder.cpp
@@ -74,6 +74,10 @@ Encoder::Encoder()
m_threadPool = NULL;
m_analysisFile = NULL;
m_offsetEmergency = NULL;
+ m_iFrameNum = 0;
+ m_iPPSQpMinus26 = 0;
+ m_iLastSliceQp = 0;
+ m_rpsInSpsCount = 0;
for (int i = 0; i < X265_MAX_FRAME_THREADS; i++)
m_frameEncoder[i] = NULL;
@@ -145,12 +149,6 @@ void Encoder::create()
p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0;
}
- if (!p->bEnableWavefront && p->rc.vbvBufferSize)
- {
- x265_log(p, X265_LOG_ERROR, "VBV requires wavefront parallelism\n");
- m_aborted = true;
- }
-
x265_log(p, X265_LOG_INFO, "Slices : %d\n", p->maxSlices);
char buf[128];
@@ -318,6 +316,8 @@ void Encoder::create()
if (!m_lookahead->create())
m_aborted = true;
+ initRefIdx();
+
if (m_param->analysisMode)
{
const char* name = m_param->analysisFileName;
@@ -869,6 +869,58 @@ int Encoder::encode(const x265_picture* pic_in, x265_picture* pic_out)
slice->m_endCUAddr = slice->realEndAddress(m_sps.numCUsInFrame * NUM_4x4_PARTITIONS);
}
+ if (m_param->searchMethod == X265_SEA && frameEnc->m_lowres.sliceType != X265_TYPE_B)
+ {
+ int padX = g_maxCUSize + 32;
+ int padY = g_maxCUSize + 16;
+ uint32_t numCuInHeight = (frameEnc->m_encData->m_reconPic->m_picHeight + g_maxCUSize - 1) / g_maxCUSize;
+ int maxHeight = numCuInHeight * g_maxCUSize;
+ for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+ {
+ frameEnc->m_encData->m_meBuffer[i] = X265_MALLOC(uint32_t, frameEnc->m_reconPic->m_stride * (maxHeight + (2 * padY)));
+ if (frameEnc->m_encData->m_meBuffer[i])
+ {
+ memset(frameEnc->m_encData->m_meBuffer[i], 0, sizeof(uint32_t)* frameEnc->m_reconPic->m_stride * (maxHeight + (2 * padY)));
+ frameEnc->m_encData->m_meIntegral[i] = frameEnc->m_encData->m_meBuffer[i] + frameEnc->m_encData->m_reconPic->m_stride * padY + padX;
+ }
+ else
+ x265_log(m_param, X265_LOG_ERROR, "SEA motion search: POC %d Integral buffer[%d] unallocated\n", frameEnc->m_poc, i);
+ }
+ }
+
+ if (m_param->bOptQpPPS && frameEnc->m_lowres.bKeyframe && m_param->bRepeatHeaders)
+ {
+ ScopedLock qpLock(m_sliceQpLock);
+ if (m_iFrameNum > 0)
+ {
+ // Search for the least cost
+ int64_t iLeastCost = m_iBitsCostSum[0];
+ int iLeastId = 0;
+ for (int i = 1; i < QP_MAX_MAX + 1; i++)
+ {
+ if (iLeastCost > m_iBitsCostSum[i])
+ {
+ iLeastId = i;
+ iLeastCost = m_iBitsCostSum[i];
+ }
+ }
+
+ /* If the last slice QP is close to (26 + m_iPPSQpMinus26), or the output is all-I-frame video,
+ we don't need to change m_iPPSQpMinus26. */
+ if ((abs(m_iLastSliceQp - (26 + m_iPPSQpMinus26)) > 1) && (m_iFrameNum > 1))
+ m_iPPSQpMinus26 = (iLeastId + 1) - 26;
+ m_iFrameNum = 0;
+ }
+
+ for (int i = 0; i < QP_MAX_MAX + 1; i++)
+ m_iBitsCostSum[i] = 0;
+ }
+
+ frameEnc->m_encData->m_slice->m_iPPSQpMinus26 = m_iPPSQpMinus26;
+ frameEnc->m_encData->m_slice->numRefIdxDefault[0] = m_pps.numRefIdxDefault[0];
+ frameEnc->m_encData->m_slice->numRefIdxDefault[1] = m_pps.numRefIdxDefault[1];
+ frameEnc->m_encData->m_slice->m_iNumRPSInSPS = m_sps.spsrpsNum;
+
curEncoder->m_rce.encodeOrder = frameEnc->m_encodeOrder = m_encodedFrameNum++;
if (m_bframeDelay)
{
@@ -1031,6 +1083,13 @@ void Encoder::printSummary()
x265_log(m_param, X265_LOG_INFO, "lossless compression ratio %.2f::1\n", uncompressed / m_analyzeAll.m_accBits);
}
+ if (m_param->bMultiPassOptRPS && m_param->rc.bStatRead)
+ {
+ x265_log(m_param, X265_LOG_INFO, "RPS in SPS: %d frames (%.2f%%), RPS not in SPS: %d frames (%.2f%%)\n",
+ m_rpsInSpsCount, (float)100.0 * m_rpsInSpsCount / m_rateControl->m_numEntries,
+ m_rateControl->m_numEntries - m_rpsInSpsCount,
+ (float)100.0 * (m_rateControl->m_numEntries - m_rpsInSpsCount) / m_rateControl->m_numEntries);
+ }
if (m_analyzeAll.m_numPics)
{
@@ -1353,6 +1412,7 @@ void Encoder::finishFrameStats(Frame* curFrame, FrameEncoder *curEncoder, x265_f
frameStats->qp = curEncData.m_avgQpAq;
frameStats->bits = bits;
frameStats->bScenecut = curFrame->m_lowres.bScenecut;
+ frameStats->bufferFill = m_rateControl->m_bufferFillActual;
frameStats->frameLatency = inPoc - poc;
if (m_param->rc.rateControlMode == X265_RC_CRF)
frameStats->rateFactor = curEncData.m_rateFactor;
@@ -1413,6 +1473,66 @@ void Encoder::finishFrameStats(Frame* curFrame, FrameEncoder *curEncoder, x265_f
#pragma warning(disable: 4127) // conditional expression is constant
#endif
+void Encoder::initRefIdx()
+{
+ int j = 0;
+
+ for (j = 0; j < MAX_NUM_REF_IDX; j++)
+ {
+ m_refIdxLastGOP.numRefIdxl0[j] = 0;
+ m_refIdxLastGOP.numRefIdxl1[j] = 0;
+ }
+
+ return;
+}
+
+void Encoder::analyseRefIdx(int *numRefIdx)
+{
+ int i_l0 = 0;
+ int i_l1 = 0;
+
+ i_l0 = numRefIdx[0];
+ i_l1 = numRefIdx[1];
+
+ if ((0 < i_l0) && (MAX_NUM_REF_IDX > i_l0))
+ m_refIdxLastGOP.numRefIdxl0[i_l0]++;
+ if ((0 < i_l1) && (MAX_NUM_REF_IDX > i_l1))
+ m_refIdxLastGOP.numRefIdxl1[i_l1]++;
+
+ return;
+}
+
+void Encoder::updateRefIdx()
+{
+ int i_max_l0 = 0;
+ int i_max_l1 = 0;
+ int j = 0;
+
+ i_max_l0 = 0;
+ i_max_l1 = 0;
+ m_refIdxLastGOP.numRefIdxDefault[0] = 1;
+ m_refIdxLastGOP.numRefIdxDefault[1] = 1;
+ for (j = 0; j < MAX_NUM_REF_IDX; j++)
+ {
+ if (i_max_l0 < m_refIdxLastGOP.numRefIdxl0[j])
+ {
+ i_max_l0 = m_refIdxLastGOP.numRefIdxl0[j];
+ m_refIdxLastGOP.numRefIdxDefault[0] = j;
+ }
+ if (i_max_l1 < m_refIdxLastGOP.numRefIdxl1[j])
+ {
+ i_max_l1 = m_refIdxLastGOP.numRefIdxl1[j];
+ m_refIdxLastGOP.numRefIdxDefault[1] = j;
+ }
+ }
+
+ m_pps.numRefIdxDefault[0] = m_refIdxLastGOP.numRefIdxDefault[0];
+ m_pps.numRefIdxDefault[1] = m_refIdxLastGOP.numRefIdxDefault[1];
+ initRefIdx();
+
+ return;
+}
+
void Encoder::getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs)
{
sbacCoder.setBitstream(&bs);
@@ -1429,7 +1549,7 @@ void Encoder::getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs)
list.serialize(NAL_UNIT_SPS, bs);
bs.resetBits();
- sbacCoder.codePPS(m_pps, (m_param->maxSlices <= 1));
+ sbacCoder.codePPS( m_pps, (m_param->maxSlices <= 1), m_iPPSQpMinus26);
bs.writeByteAlignment();
list.serialize(NAL_UNIT_PPS, bs);
@@ -1458,9 +1578,9 @@ void Encoder::getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs)
list.serialize(NAL_UNIT_PREFIX_SEI, bs);
}
- if (!m_param->bDiscardSEI && m_param->bEmitInfoSEI)
+ if (m_param->bEmitInfoSEI)
{
- char *opts = x265_param2string(m_param);
+ char *opts = x265_param2string(m_param, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset);
if (opts)
{
char *buffer = X265_MALLOC(char, strlen(opts) + strlen(PFX(version_str)) +
@@ -1468,7 +1588,7 @@ void Encoder::getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs)
if (buffer)
{
sprintf(buffer, "x265 (build %d) - %s:%s - H.265/HEVC codec - "
- "Copyright 2013-2015 (c) Multicoreware Inc - "
+ "Copyright 2013-2016 (c) Multicoreware Inc - "
"http://x265.org - options: %s",
X265_BUILD, PFX(version_str), PFX(build_info_str), opts);
@@ -1488,7 +1608,7 @@ void Encoder::getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs)
}
}
- if (!m_param->bDiscardSEI && (m_param->bEmitHRDSEI || !!m_param->interlaceMode))
+ if ((m_param->bEmitHRDSEI || !!m_param->interlaceMode))
{
/* Picture Timing and Buffering Period SEI require the SPS to be "activated" */
SEIActiveParameterSets sei;
@@ -1543,7 +1663,8 @@ void Encoder::initSPS(SPS *sps)
sps->bUseStrongIntraSmoothing = m_param->bEnableStrongIntraSmoothing;
sps->bTemporalMVPEnabled = m_param->bEnableTemporalMvp;
- sps->bDiscardOptionalVUI = m_param->bDiscardOptionalVUI;
+ sps->bEmitVUITimingInfo = m_param->bEmitVUITimingInfo;
+ sps->bEmitVUIHRDInfo = m_param->bEmitVUIHRDInfo;
sps->log2MaxPocLsb = m_param->log2MaxPocLsb;
int maxDeltaPOC = (m_param->bframes + 2) * (!!m_param->bBPyramid + 1) * 2;
while ((1 << sps->log2MaxPocLsb) <= maxDeltaPOC * 2)
@@ -1621,6 +1742,9 @@ void Encoder::initPPS(PPS *pps)
pps->deblockingFilterTcOffsetDiv2 = m_param->deblockingFilterTCOffset;
pps->bEntropyCodingSyncEnabled = m_param->bEnableWavefront;
+
+ pps->numRefIdxDefault[0] = 1;
+ pps->numRefIdxDefault[1] = 1;
}
void Encoder::configure(x265_param *p)
@@ -1819,6 +1943,7 @@ void Encoder::configure(x265_param *p)
m_bframeDelay = p->bframes ? (p->bBPyramid ? 2 : 1) : 0;
p->bFrameBias = X265_MIN(X265_MAX(-90, p->bFrameBias), 100);
+ p->scenecutBias = (double)(p->scenecutBias / 100);
if (p->logLevel < X265_LOG_INFO)
{
@@ -1849,6 +1974,12 @@ void Encoder::configure(x265_param *p)
if (s)
x265_log(p, X265_LOG_WARNING, "--tune %s should be used if attempting to benchmark %s!\n", s, s);
}
+ if (p->searchMethod == X265_SEA && (p->bDistributeMotionEstimation || p->bDistributeModeAnalysis))
+ {
+ x265_log(p, X265_LOG_WARNING, "Disabling pme and pmode: --pme and --pmode cannot be used with SEA motion search!\n");
+ p->bDistributeMotionEstimation = 0;
+ p->bDistributeModeAnalysis = 0;
+ }
/* some options make no sense if others are disabled */
p->bSaoNonDeblocked &= p->bEnableSAO;
@@ -1878,6 +2009,11 @@ void Encoder::configure(x265_param *p)
x265_log(p, X265_LOG_WARNING, "--rd-refine disabled, requires RD level > 4 and adaptive quant\n");
}
+ if (p->limitTU && p->tuQTMaxInterDepth < 2)
+ {
+ p->limitTU = 0;
+ x265_log(p, X265_LOG_WARNING, "limit-tu disabled, requires tu-inter-depth > 1\n");
+ }
bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv))
{
@@ -2013,6 +2149,19 @@ void Encoder::configure(x265_param *p)
p->log2MaxPocLsb = 4;
}
+ if (p->maxSlices < 1)
+ {
+ x265_log(p, X265_LOG_WARNING, "maxSlices can not be less than 1, force set to 1\n");
+ p->maxSlices = 1;
+ }
+
+ const uint32_t numRows = (p->sourceHeight + p->maxCUSize - 1) / p->maxCUSize;
+ const uint32_t slicesLimit = X265_MIN(numRows, NALList::MAX_NAL_UNITS - 1);
+ if (p->maxSlices > numRows)
+ {
+ x265_log(p, X265_LOG_WARNING, "maxSlices can not be more than min(rows, MAX_NAL_UNITS-1), force set to %d\n", slicesLimit);
+ p->maxSlices = slicesLimit;
+ }
}
void Encoder::allocAnalysis(x265_analysis_data* analysis)
@@ -2309,10 +2458,10 @@ void Encoder::printReconfigureParams()
x265_param* oldParam = m_param;
x265_param* newParam = m_latestParam;
- x265_log(newParam, X265_LOG_INFO, "Reconfigured param options, input Frame: %d\n", m_pocLast + 1);
+ x265_log(newParam, X265_LOG_DEBUG, "Reconfigured param options, input Frame: %d\n", m_pocLast + 1);
char tmp[40];
-#define TOOLCMP(COND1, COND2, STR) if (COND1 != COND2) { sprintf(tmp, STR, COND1, COND2); x265_log(newParam, X265_LOG_INFO, tmp); }
+#define TOOLCMP(COND1, COND2, STR) if (COND1 != COND2) { sprintf(tmp, STR, COND1, COND2); x265_log(newParam, X265_LOG_DEBUG, tmp); }
TOOLCMP(oldParam->maxNumReferences, newParam->maxNumReferences, "ref=%d to %d\n");
TOOLCMP(oldParam->bEnableFastIntra, newParam->bEnableFastIntra, "fast-intra=%d to %d\n");
TOOLCMP(oldParam->bEnableEarlySkip, newParam->bEnableEarlySkip, "early-skip=%d to %d\n");
@@ -2326,3 +2475,208 @@ void Encoder::printReconfigureParams()
TOOLCMP(oldParam->maxNumMergeCand, newParam->maxNumMergeCand, "max-merge=%d to %d\n");
TOOLCMP(oldParam->bIntraInBFrames, newParam->bIntraInBFrames, "b-intra=%d to %d\n");
}
+
+bool Encoder::computeSPSRPSIndex()
+{
+ RPS* rpsInSPS = m_sps.spsrps;
+ int* rpsNumInPSP = &m_sps.spsrpsNum;
+ int beginNum = m_sps.numGOPBegin;
+ int endNum;
+ RPS* rpsInRec;
+ RPS* rpsInIdxList;
+ RPS* thisRpsInSPS;
+ RPS* thisRpsInList;
+ RPSListNode* headRpsIdxList = NULL;
+ RPSListNode* tailRpsIdxList = NULL;
+ RPSListNode* rpsIdxListIter = NULL;
+ RateControlEntry *rce2Pass = m_rateControl->m_rce2Pass;
+ int numEntries = m_rateControl->m_numEntries;
+ RateControlEntry *rce;
+ int idx = 0;
+ int pos = 0;
+ int resultIdx[64];
+ memset(rpsInSPS, 0, sizeof(RPS) * MAX_NUM_SHORT_TERM_RPS);
+
+ // find all RPS data in the current GOP
+ beginNum++;
+ endNum = beginNum;
+ if (!m_param->bRepeatHeaders)
+ {
+ endNum = numEntries;
+ }
+ else
+ {
+ while (endNum < numEntries)
+ {
+ rce = &rce2Pass[endNum];
+ if (rce->sliceType == I_SLICE)
+ {
+ if (m_param->keyframeMin && (endNum - beginNum + 1 < m_param->keyframeMin))
+ {
+ endNum++;
+ continue;
+ }
+ break;
+ }
+ endNum++;
+ }
+ }
+ m_sps.numGOPBegin = endNum;
+
+ // collect all distinct RPS types
+ for (int i = beginNum; i < endNum; i++)
+ {
+ rce = &rce2Pass[i];
+ rpsInRec = &rce->rpsData;
+ rpsIdxListIter = headRpsIdxList;
+ // I-frames do not record RPS info
+ if (rce->sliceType != I_SLICE)
+ {
+ while (rpsIdxListIter)
+ {
+ rpsInIdxList = rpsIdxListIter->rps;
+ if (rpsInRec->numberOfPictures == rpsInIdxList->numberOfPictures
+ && rpsInRec->numberOfNegativePictures == rpsInIdxList->numberOfNegativePictures
+ && rpsInRec->numberOfPositivePictures == rpsInIdxList->numberOfPositivePictures)
+ {
+ for (pos = 0; pos < rpsInRec->numberOfPictures; pos++)
+ {
+ if (rpsInRec->deltaPOC[pos] != rpsInIdxList->deltaPOC[pos]
+ || rpsInRec->bUsed[pos] != rpsInIdxList->bUsed[pos])
+ break;
+ }
+ if (pos == rpsInRec->numberOfPictures) // this RPS type already exists
+ {
+ rce->rpsIdx = rpsIdxListIter->idx;
+ rpsIdxListIter->count++;
+ // re-sort the RPS list after updating this type's count
+ RPSListNode* next = rpsIdxListIter->next;
+ RPSListNode* prior = rpsIdxListIter->prior;
+ RPSListNode* iter = prior;
+ if (iter)
+ {
+ while (iter)
+ {
+ if (iter->count > rpsIdxListIter->count)
+ break;
+ iter = iter->prior;
+ }
+ if (iter)
+ {
+ prior->next = next;
+ if (next)
+ next->prior = prior;
+ else
+ tailRpsIdxList = prior;
+ rpsIdxListIter->next = iter->next;
+ rpsIdxListIter->prior = iter;
+ iter->next->prior = rpsIdxListIter;
+ iter->next = rpsIdxListIter;
+ }
+ else
+ {
+ prior->next = next;
+ if (next)
+ next->prior = prior;
+ else
+ tailRpsIdxList = prior;
+ headRpsIdxList->prior = rpsIdxListIter;
+ rpsIdxListIter->next = headRpsIdxList;
+ rpsIdxListIter->prior = NULL;
+ headRpsIdxList = rpsIdxListIter;
+ }
+ }
+ break;
+ }
+ }
+ rpsIdxListIter = rpsIdxListIter->next;
+ }
+ if (!rpsIdxListIter) // add new type of RPS
+ {
+ RPSListNode* newIdxNode = new RPSListNode();
+ if (newIdxNode == NULL)
+ goto fail;
+ newIdxNode->rps = rpsInRec;
+ newIdxNode->idx = idx++;
+ newIdxNode->count = 1;
+ newIdxNode->next = NULL;
+ newIdxNode->prior = NULL;
+ if (!tailRpsIdxList)
+ tailRpsIdxList = headRpsIdxList = newIdxNode;
+ else
+ {
+ tailRpsIdxList->next = newIdxNode;
+ newIdxNode->prior = tailRpsIdxList;
+ tailRpsIdxList = newIdxNode;
+ }
+ rce->rpsIdx = newIdxNode->idx;
+ }
+ }
+ else
+ {
+ rce->rpsIdx = -1;
+ }
+ }
+
+ // keep the most common RPS sets
+ memset(resultIdx, 0, sizeof(resultIdx));
+ if (idx > MAX_NUM_SHORT_TERM_RPS)
+ idx = MAX_NUM_SHORT_TERM_RPS;
+
+ *rpsNumInPSP = idx;
+ rpsIdxListIter = headRpsIdxList;
+ for (int i = 0; i < idx; i++)
+ {
+ resultIdx[i] = rpsIdxListIter->idx;
+ m_rpsInSpsCount += rpsIdxListIter->count;
+ thisRpsInSPS = rpsInSPS + i;
+ thisRpsInList = rpsIdxListIter->rps;
+ thisRpsInSPS->numberOfPictures = thisRpsInList->numberOfPictures;
+ thisRpsInSPS->numberOfNegativePictures = thisRpsInList->numberOfNegativePictures;
+ thisRpsInSPS->numberOfPositivePictures = thisRpsInList->numberOfPositivePictures;
+ for (pos = 0; pos < thisRpsInList->numberOfPictures; pos++)
+ {
+ thisRpsInSPS->deltaPOC[pos] = thisRpsInList->deltaPOC[pos];
+ thisRpsInSPS->bUsed[pos] = thisRpsInList->bUsed[pos];
+ }
+ rpsIdxListIter = rpsIdxListIter->next;
+ }
+
+ // remap every frame's RPS index to its SPS position
+ for (int i = beginNum; i < endNum; i++)
+ {
+ int j;
+ rce = &rce2Pass[i];
+ for (j = 0; j < idx; j++)
+ {
+ if (rce->rpsIdx == resultIdx[j])
+ {
+ rce->rpsIdx = j;
+ break;
+ }
+ }
+
+ if (j == idx)
+ rce->rpsIdx = -1;
+ }
+
+ rpsIdxListIter = headRpsIdxList;
+ while (rpsIdxListIter)
+ {
+ RPSListNode* freeIndex = rpsIdxListIter;
+ rpsIdxListIter = rpsIdxListIter->next;
+ delete freeIndex;
+ }
+ return true;
+
+fail:
+ rpsIdxListIter = headRpsIdxList;
+ while (rpsIdxListIter)
+ {
+ RPSListNode* freeIndex = rpsIdxListIter;
+ rpsIdxListIter = rpsIdxListIter->next;
+ delete freeIndex;
+ }
+ return false;
+}
+
diff --git a/source/encoder/encoder.h b/source/encoder/encoder.h
index 4d5559f..ddcc86c 100644
--- a/source/encoder/encoder.h
+++ b/source/encoder/encoder.h
@@ -26,6 +26,7 @@
#include "common.h"
#include "slice.h"
+#include "threading.h"
#include "scalinglist.h"
#include "x265.h"
#include "nal.h"
@@ -69,6 +70,24 @@ struct EncStats
void addSsim(double ssim);
};
+#define MAX_NUM_REF_IDX 64
+
+struct RefIdxLastGOP
+{
+ int numRefIdxDefault[2];
+ int numRefIdxl0[MAX_NUM_REF_IDX];
+ int numRefIdxl1[MAX_NUM_REF_IDX];
+};
+
+struct RPSListNode
+{
+ int idx;
+ int count;
+ RPS* rps;
+ RPSListNode* next;
+ RPSListNode* prior;
+};
+
class FrameEncoder;
class DPB;
class Lookahead;
@@ -136,6 +155,19 @@ public:
* one is done. Requires bIntraRefresh to be set.*/
int m_bQueuedIntraRefresh;
+ /* For optimising slice QP */
+ Lock m_sliceQpLock;
+ int m_iFrameNum;
+ int m_iPPSQpMinus26;
+ int m_iLastSliceQp;
+ int64_t m_iBitsCostSum[QP_MAX_MAX + 1];
+
+ Lock m_sliceRefIdxLock;
+ RefIdxLastGOP m_refIdxLastGOP;
+
+ Lock m_rpsInSpsLock;
+ int m_rpsInSpsCount;
+
Encoder();
~Encoder() {}
@@ -173,6 +205,11 @@ public:
void calcRefreshInterval(Frame* frameEnc);
+ void initRefIdx();
+ void analyseRefIdx(int *numRefIdx);
+ void updateRefIdx();
+ bool computeSPSRPSIndex();
+
protected:
void initVPS(VPS *vps);
diff --git a/source/encoder/entropy.cpp b/source/encoder/entropy.cpp
index 9ed62fe..044d6e2 100644
--- a/source/encoder/entropy.cpp
+++ b/source/encoder/entropy.cpp
@@ -312,19 +312,21 @@ void Entropy::codeSPS(const SPS& sps, const ScalingList& scalingList, const Prof
WRITE_FLAG(sps.bUseSAO, "sample_adaptive_offset_enabled_flag");
WRITE_FLAG(0, "pcm_enabled_flag");
- WRITE_UVLC(0, "num_short_term_ref_pic_sets");
+ WRITE_UVLC(sps.spsrpsNum, "num_short_term_ref_pic_sets");
+ for (int i = 0; i < sps.spsrpsNum; i++)
+ codeShortTermRefPicSet(sps.spsrps[i], i);
WRITE_FLAG(0, "long_term_ref_pics_present_flag");
WRITE_FLAG(sps.bTemporalMVPEnabled, "sps_temporal_mvp_enable_flag");
WRITE_FLAG(sps.bUseStrongIntraSmoothing, "sps_strong_intra_smoothing_enable_flag");
WRITE_FLAG(1, "vui_parameters_present_flag");
- codeVUI(sps.vuiParameters, sps.maxTempSubLayers, sps.bDiscardOptionalVUI);
+ codeVUI(sps.vuiParameters, sps.maxTempSubLayers, sps.bEmitVUITimingInfo, sps.bEmitVUIHRDInfo);
WRITE_FLAG(0, "sps_extension_flag");
}
-void Entropy::codePPS(const PPS& pps, bool filerAcross)
+void Entropy::codePPS( const PPS& pps, bool filerAcross, int iPPSInitQpMinus26 )
{
WRITE_UVLC(0, "pps_pic_parameter_set_id");
WRITE_UVLC(0, "pps_seq_parameter_set_id");
@@ -333,10 +335,10 @@ void Entropy::codePPS(const PPS& pps, bool filerAcross)
WRITE_CODE(0, 3, "num_extra_slice_header_bits");
WRITE_FLAG(pps.bSignHideEnabled, "sign_data_hiding_flag");
WRITE_FLAG(0, "cabac_init_present_flag");
- WRITE_UVLC(0, "num_ref_idx_l0_default_active_minus1");
- WRITE_UVLC(0, "num_ref_idx_l1_default_active_minus1");
+ WRITE_UVLC(pps.numRefIdxDefault[0] - 1, "num_ref_idx_l0_default_active_minus1");
+ WRITE_UVLC(pps.numRefIdxDefault[1] - 1, "num_ref_idx_l1_default_active_minus1");
- WRITE_SVLC(0, "init_qp_minus26");
+ WRITE_SVLC(iPPSInitQpMinus26, "init_qp_minus26");
WRITE_FLAG(pps.bConstrainedIntraPred, "constrained_intra_pred_flag");
WRITE_FLAG(pps.bTransformSkipEnabled, "transform_skip_enabled_flag");
@@ -422,7 +424,7 @@ void Entropy::codeProfileTier(const ProfileTierLevel& ptl, int maxTempSubLayers)
}
}
-void Entropy::codeVUI(const VUI& vui, int maxSubTLayers, bool bDiscardOptionalVUI)
+void Entropy::codeVUI(const VUI& vui, int maxSubTLayers, bool bEmitVUITimingInfo, bool bEmitVUIHRDInfo)
{
WRITE_FLAG(vui.aspectRatioInfoPresentFlag, "aspect_ratio_info_present_flag");
if (vui.aspectRatioInfoPresentFlag)
@@ -473,7 +475,7 @@ void Entropy::codeVUI(const VUI& vui, int maxSubTLayers, bool bDiscardOptionalVU
WRITE_UVLC(vui.defaultDisplayWindow.bottomOffset, "def_disp_win_bottom_offset");
}
- if (bDiscardOptionalVUI)
+ if (!bEmitVUITimingInfo)
WRITE_FLAG(0, "vui_timing_info_present_flag");
else
{
@@ -483,7 +485,7 @@ void Entropy::codeVUI(const VUI& vui, int maxSubTLayers, bool bDiscardOptionalVU
WRITE_FLAG(0, "vui_poc_proportional_to_timing_flag");
}
- if (bDiscardOptionalVUI)
+ if (!bEmitVUIHRDInfo)
WRITE_FLAG(0, "vui_hrd_parameters_present_flag");
else
{
@@ -614,8 +616,21 @@ void Entropy::codeSliceHeader(const Slice& slice, FrameData& encData, uint32_t s
}
#endif
- WRITE_FLAG(0, "short_term_ref_pic_set_sps_flag");
- codeShortTermRefPicSet(slice.m_rps);
+ if (slice.m_rpsIdx < 0)
+ {
+ WRITE_FLAG(0, "short_term_ref_pic_set_sps_flag");
+ codeShortTermRefPicSet(slice.m_rps, slice.m_sps->spsrpsNum);
+ }
+ else
+ {
+ WRITE_FLAG(1, "short_term_ref_pic_set_sps_flag");
+ int numBits = 0;
+ while ((1 << numBits) < slice.m_iNumRPSInSPS)
+ numBits++;
+
+ if (numBits > 0)
+ WRITE_CODE(slice.m_rpsIdx, numBits, "short_term_ref_pic_set_idx");
+ }
if (slice.m_sps->bTemporalMVPEnabled)
WRITE_FLAG(1, "slice_temporal_mvp_enable_flag");
@@ -633,7 +648,7 @@ void Entropy::codeSliceHeader(const Slice& slice, FrameData& encData, uint32_t s
if (!slice.isIntra())
{
- bool overrideFlag = (slice.m_numRefIdx[0] != 1 || (slice.isInterB() && slice.m_numRefIdx[1] != 1));
+ bool overrideFlag = (slice.m_numRefIdx[0] != slice.numRefIdxDefault[0] || (slice.isInterB() && slice.m_numRefIdx[1] != slice.numRefIdxDefault[1]));
WRITE_FLAG(overrideFlag, "num_ref_idx_active_override_flag");
if (overrideFlag)
{
@@ -673,7 +688,7 @@ void Entropy::codeSliceHeader(const Slice& slice, FrameData& encData, uint32_t s
if (!slice.isIntra())
WRITE_UVLC(MRG_MAX_NUM_CANDS - slice.m_maxNumMergeCand, "five_minus_max_num_merge_cand");
- int code = sliceQp - 26;
+ int code = sliceQp - (slice.m_iPPSQpMinus26 + 26);
WRITE_SVLC(code, "slice_qp_delta");
// TODO: Enable when pps_loop_filter_across_slices_enabled_flag==1
@@ -707,8 +722,11 @@ void Entropy::codeSliceHeaderWPPEntryPoints(const uint32_t *substreamSizes, uint
WRITE_CODE(substreamSizes[i] - 1, offsetLen, "entry_point_offset_minus1");
}
-void Entropy::codeShortTermRefPicSet(const RPS& rps)
+void Entropy::codeShortTermRefPicSet(const RPS& rps, int idx)
{
+ if (idx > 0)
+ WRITE_FLAG(0, "inter_ref_pic_set_prediction_flag");
+
WRITE_UVLC(rps.numberOfNegativePictures, "num_negative_pics");
WRITE_UVLC(rps.numberOfPositivePictures, "num_positive_pics");
int prev = 0;
diff --git a/source/encoder/entropy.h b/source/encoder/entropy.h
index da09eaf..a157e1b 100644
--- a/source/encoder/entropy.h
+++ b/source/encoder/entropy.h
@@ -142,14 +142,14 @@ public:
void codeVPS(const VPS& vps);
void codeSPS(const SPS& sps, const ScalingList& scalingList, const ProfileTierLevel& ptl);
- void codePPS(const PPS& pps, bool filerAcross);
- void codeVUI(const VUI& vui, int maxSubTLayers, bool discardOptionalVUI);
+ void codePPS( const PPS& pps, bool filerAcross, int iPPSInitQpMinus26 );
+ void codeVUI(const VUI& vui, int maxSubTLayers, bool bEmitVUITimingInfo, bool bEmitVUIHRDInfo);
void codeAUD(const Slice& slice);
void codeHrdParameters(const HRDInfo& hrd, int maxSubTLayers);
void codeSliceHeader(const Slice& slice, FrameData& encData, uint32_t slice_addr, uint32_t slice_addr_bits, int sliceQp);
void codeSliceHeaderWPPEntryPoints(const uint32_t *substreamSizes, uint32_t numSubStreams, uint32_t maxOffset);
- void codeShortTermRefPicSet(const RPS& rps);
+ void codeShortTermRefPicSet(const RPS& rps, int idx);
void finishSlice() { encodeBinTrm(1); finish(); dynamic_cast<Bitstream*>(m_bitIf)->writeByteAlignment(); }
void encodeCTU(const CUData& cu, const CUGeom& cuGeom);
diff --git a/source/encoder/frameencoder.cpp b/source/encoder/frameencoder.cpp
index 65370ba..016ab8a 100644
--- a/source/encoder/frameencoder.cpp
+++ b/source/encoder/frameencoder.cpp
@@ -50,6 +50,7 @@ FrameEncoder::FrameEncoder()
m_bAllRowsStop = false;
m_vbvResetTriggerRow = -1;
m_outStreams = NULL;
+ m_backupStreams = NULL;
m_substreamSizes = NULL;
m_nr = NULL;
m_tld = NULL;
@@ -85,6 +86,7 @@ void FrameEncoder::destroy()
delete[] m_rows;
delete[] m_outStreams;
+ delete[] m_backupStreams;
X265_FREE(m_sliceBaseRow);
X265_FREE(m_cuGeoms);
X265_FREE(m_ctuGeomMap);
@@ -121,7 +123,7 @@ bool FrameEncoder::init(Encoder *top, int numRows, int numCols)
int range = m_param->searchRange; /* fpel search */
range += !!(m_param->searchMethod < 2); /* diamond/hex range check lag */
range += NTAPS_LUMA / 2; /* subpel filter half-length */
- range += 2 + MotionEstimate::hpelIterationCount(m_param->subpelRefine) / 2; /* subpel refine steps */
+ range += 2 + (MotionEstimate::hpelIterationCount(m_param->subpelRefine) + 1) / 2; /* subpel refine steps */
m_refLagRows = /*(m_param->maxSlices > 1 ? 1 : 0) +*/ 1 + ((range + g_maxCUSize - 1) / g_maxCUSize);
// NOTE: 2 times of numRows because both Encoder and Filter in same queue
@@ -152,7 +154,7 @@ bool FrameEncoder::init(Encoder *top, int numRows, int numCols)
// 7.4.7.1 - Ceil( Log2( PicSizeInCtbsY ) ) bits
{
unsigned long tmp;
- CLZ(tmp, (numRows * numCols));
+ CLZ(tmp, (numRows * numCols - 1));
m_sliceAddrBits = (uint16_t)(tmp + 1);
}
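The CLZ change in this hunk implements the `Ceil( Log2( PicSizeInCtbsY ) )` bit count that HEVC 7.4.7.1 requires for `slice_segment_address`: with a BSR-style highest-set-bit index, `BSR(n - 1) + 1` yields exactly `ceil(log2(n))` for `n > 1`, whereas the previous `BSR(n) + 1` over-allocated one bit whenever the CTU count was an exact power of two. A standalone sketch, using a portable bit loop in place of the platform CLZ macro:

```cpp
#include <cassert>
#include <stdint.h>

// Portable stand-in for x265's CLZ macro: index of the highest set bit.
static unsigned highestSetBit(uint32_t v)
{
    unsigned idx = 0;
    while (v >>= 1)
        idx++;
    return idx;
}

// Bits needed for slice_segment_address: Ceil(Log2(numCtus)), numCtus > 1.
static unsigned sliceAddrBits(uint32_t numCtus)
{
    return highestSetBit(numCtus - 1) + 1;
}
```

For a power-of-two picture size such as 64 CTUs, addresses 0..63 need 6 bits, which the `n - 1` form produces; the old form would have written 7.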
@@ -305,6 +307,19 @@ void FrameEncoder::WeightAnalysis::processTasks(int /* workerThreadId */)
weightAnalyse(*frame->m_encData->m_slice, *frame, *master.m_param);
}
+
+uint32_t getBsLength( int32_t code )
+{
+ uint32_t ucode = (code <= 0) ? -code << 1 : (code << 1) - 1;
+
+ ++ucode;
+ unsigned long idx;
+ CLZ( idx, ucode );
+ uint32_t length = (uint32_t)idx * 2 + 1;
+
+ return length;
+}
+
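The new `getBsLength()` helper returns the bit cost of a signed Exp-Golomb (se(v)) code without writing it; with `--opt-qp-pps` the encoder accumulates these costs per candidate `init_qp_minus26` to pick the PPS QP that minimizes the summed `slice_qp_delta` bits. A self-contained replica of the mapping, assuming (as the macro does) that CLZ yields the index of the highest set bit:

```cpp
#include <cassert>
#include <stdint.h>

// Replica of getBsLength(): signed value -> codeNum (0, 1, -1, 2, -2, ...),
// then ue(v) length = 2 * floor(log2(codeNum + 1)) + 1 bits.
static uint32_t seCodeLength(int32_t code)
{
    uint32_t ucode = (code <= 0) ? (uint32_t)(-code) << 1 : ((uint32_t)code << 1) - 1;
    ++ucode;              // ue(v) operates on codeNum + 1
    unsigned idx = 0;     // highest set bit, standing in for the CLZ macro
    for (uint32_t v = ucode; v >>= 1; )
        idx++;
    return idx * 2 + 1;
}
```

se(v) of 0 is the single bit "1"; +1 and -1 each take 3 bits; +2/-2 take 5, matching the prefix/suffix structure of Exp-Golomb codes.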
void FrameEncoder::compressFrame()
{
ProfileScopeEvent(frameThread);
@@ -340,7 +355,28 @@ void FrameEncoder::compressFrame()
m_nalList.serialize(NAL_UNIT_ACCESS_UNIT_DELIMITER, m_bs);
}
if (m_frame->m_lowres.bKeyframe && m_param->bRepeatHeaders)
- m_top->getStreamHeaders(m_nalList, m_entropyCoder, m_bs);
+ {
+ if (m_param->bOptRefListLengthPPS)
+ {
+ ScopedLock refIdxLock(m_top->m_sliceRefIdxLock);
+ m_top->updateRefIdx();
+ }
+ if (m_top->m_param->rc.bStatRead && m_top->m_param->bMultiPassOptRPS)
+ {
+ ScopedLock refIdxLock(m_top->m_rpsInSpsLock);
+ if (!m_top->computeSPSRPSIndex())
+ {
+ x265_log(m_param, X265_LOG_ERROR, "computing common RPS set failed!\n");
+ m_top->m_aborted = true;
+ }
+ m_top->getStreamHeaders(m_nalList, m_entropyCoder, m_bs);
+ }
+ else
+ m_top->getStreamHeaders(m_nalList, m_entropyCoder, m_bs);
+ }
+
+ if (m_top->m_param->rc.bStatRead && m_top->m_param->bMultiPassOptRPS)
+ m_frame->m_encData->m_slice->m_rpsIdx = (m_top->m_rateControl->m_rce2Pass + m_frame->m_encodeOrder)->rpsIdx;
// Weighted Prediction parameters estimation.
bool bUseWeightP = slice->m_sliceType == P_SLICE && slice->m_pps->bUseWeightPred;
@@ -448,6 +484,19 @@ void FrameEncoder::compressFrame()
/* Clip slice QP to 0-51 spec range before encoding */
slice->m_sliceQp = x265_clip3(-QP_BD_OFFSET, QP_MAX_SPEC, qp);
+ if (m_param->bOptQpPPS && m_param->bRepeatHeaders)
+ {
+ ScopedLock qpLock(m_top->m_sliceQpLock);
+ for (int i = 0; i < (QP_MAX_MAX + 1); i++)
+ {
+ int delta = slice->m_sliceQp - (i + 1);
+ int codeLength = getBsLength( delta );
+ m_top->m_iBitsCostSum[i] += codeLength;
+ }
+ m_top->m_iFrameNum++;
+ m_top->m_iLastSliceQp = slice->m_sliceQp;
+ }
+
m_initSliceContext.resetEntropy(*slice);
m_frameFilter.start(m_frame, m_initSliceContext);
@@ -485,6 +534,8 @@ void FrameEncoder::compressFrame()
if (!m_outStreams)
{
m_outStreams = new Bitstream[numSubstreams];
+ if (!m_param->bEnableWavefront)
+ m_backupStreams = new Bitstream[numSubstreams];
m_substreamSizes = X265_MALLOC(uint32_t, numSubstreams);
if (!m_param->bEnableSAO)
for (uint32_t i = 0; i < numSubstreams; i++)
@@ -498,7 +549,7 @@ void FrameEncoder::compressFrame()
if (m_frame->m_lowres.bKeyframe)
{
- if (!m_param->bDiscardSEI && m_param->bEmitHRDSEI)
+ if (m_param->bEmitHRDSEI)
{
SEIBufferingPeriod* bpSei = &m_top->m_rateControl->m_bufPeriodSEI;
@@ -520,7 +571,7 @@ void FrameEncoder::compressFrame()
}
}
- if (!m_param->bDiscardSEI && (m_param->bEmitHRDSEI || !!m_param->interlaceMode))
+ if ((m_param->bEmitHRDSEI || !!m_param->interlaceMode))
{
SEIPictureTiming *sei = m_rce.picTimingSEI;
const VUI *vui = &slice->m_sps->vuiParameters;
@@ -556,22 +607,19 @@ void FrameEncoder::compressFrame()
}
/* Write user SEI */
- if (!m_param->bDiscardSEI)
+ for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++)
{
- for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++)
- {
- x265_sei_payload *payload = &m_frame->m_userSEI.payloads[i];
- SEIuserDataUnregistered sei;
+ x265_sei_payload *payload = &m_frame->m_userSEI.payloads[i];
+ SEIuserDataUnregistered sei;
- sei.m_payloadType = payload->payloadType;
- sei.m_userDataLength = payload->payloadSize;
- sei.m_userData = payload->payload;
+ sei.m_payloadType = payload->payloadType;
+ sei.m_userDataLength = payload->payloadSize;
+ sei.m_userData = payload->payload;
- m_bs.resetBits();
- sei.write(m_bs, *slice->m_sps);
- m_bs.writeByteAlignment();
- m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs);
- }
+ m_bs.resetBits();
+ sei.write(m_bs, *slice->m_sps);
+ m_bs.writeByteAlignment();
+ m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs);
}
/* CQP and CRF (without capped VBV) doesn't use mid-frame statistics to
@@ -606,8 +654,7 @@ void FrameEncoder::compressFrame()
const uint32_t sliceEndRow = m_sliceBaseRow[sliceId + 1] - 1;
const uint32_t row = sliceStartRow + rowInSlice;
- if (row >= m_numRows)
- break;
+ X265_CHECK(row < m_numRows, "slice row fault detected");
if (row > sliceEndRow)
continue;
@@ -626,7 +673,7 @@ void FrameEncoder::compressFrame()
refpic->m_reconRowFlag[rowIdx].waitForChange(0);
if ((bUseWeightP || bUseWeightB) && m_mref[l][ref].isWeighted)
- m_mref[l][ref].applyWeight(row + m_refLagRows, m_numRows, sliceEndRow + 1, sliceId);
+ m_mref[l][ref].applyWeight(rowIdx, m_numRows, sliceEndRow, sliceId);
}
}
@@ -666,7 +713,7 @@ void FrameEncoder::compressFrame()
refpic->m_reconRowFlag[rowIdx].waitForChange(0);
if ((bUseWeightP || bUseWeightB) && m_mref[l][ref].isWeighted)
- m_mref[list][ref].applyWeight(i + m_refLagRows, m_numRows, m_numRows, 0);
+ m_mref[list][ref].applyWeight(rowIdx, m_numRows, m_numRows, 0);
}
}
@@ -830,6 +877,11 @@ void FrameEncoder::compressFrame()
const uint32_t sliceAddr = nextSliceRow * m_numCols;
//CUData* ctu = m_frame->m_encData->getPicCTU(sliceAddr);
//const int sliceQp = ctu->m_qp[0];
+ if (m_param->bOptRefListLengthPPS)
+ {
+ ScopedLock refIdxLock(m_top->m_sliceRefIdxLock);
+ m_top->analyseRefIdx(slice->m_numRefIdx);
+ }
m_entropyCoder.codeSliceHeader(*slice, *m_frame->m_encData, sliceAddr, m_sliceAddrBits, slice->m_sliceQp);
// Find rows of current slice
@@ -853,6 +905,11 @@ void FrameEncoder::compressFrame()
}
else
{
+ if (m_param->bOptRefListLengthPPS)
+ {
+ ScopedLock refIdxLock(m_top->m_sliceRefIdxLock);
+ m_top->analyseRefIdx(slice->m_numRefIdx);
+ }
m_entropyCoder.codeSliceHeader(*slice, *m_frame->m_encData, 0, 0, slice->m_sliceQp);
// serialize each row, record final lengths in slice header
@@ -868,7 +925,7 @@ void FrameEncoder::compressFrame()
}
- if (!m_param->bDiscardSEI && m_param->decodedPictureHashSEI)
+ if (m_param->decodedPictureHashSEI)
{
int planes = (m_frame->m_param->internalCsp != X265_CSP_I400) ? 3 : 1;
if (m_param->decodedPictureHashSEI == 1)
@@ -1129,8 +1186,8 @@ void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld)
// TODO: specially case handle on first and last row
// Initialize restrict on MV range in slices
- tld.analysis.m_sliceMinY = -(int16_t)(rowInSlice * g_maxCUSize * 4) + 2 * 4;
- tld.analysis.m_sliceMaxY = (int16_t)((endRowInSlicePlus1 - 1 - row) * (g_maxCUSize * 4) - 3 * 4);
+ tld.analysis.m_sliceMinY = -(int16_t)(rowInSlice * g_maxCUSize * 4) + 3 * 4;
+ tld.analysis.m_sliceMaxY = (int16_t)((endRowInSlicePlus1 - 1 - row) * (g_maxCUSize * 4) - 4 * 4);
// Handle single row slice
if (tld.analysis.m_sliceMaxY < tld.analysis.m_sliceMinY)
@@ -1149,17 +1206,25 @@ void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld)
if (bIsVbv)
{
- if (!row)
+ if (col == 0 && !m_param->bEnableWavefront)
{
- curEncData.m_rowStat[row].diagQp = curEncData.m_avgQpRc;
- curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(curEncData.m_avgQpRc);
+ m_backupStreams[0].copyBits(&m_outStreams[0]);
+ curRow.bufferedEntropy.copyState(rowCoder);
+ curRow.bufferedEntropy.loadContexts(rowCoder);
+ }
+ if (!row && m_vbvResetTriggerRow != intRow)
+ {
+ curEncData.m_rowStat[row].rowQp = curEncData.m_avgQpRc;
+ curEncData.m_rowStat[row].rowQpScale = x265_qp2qScale(curEncData.m_avgQpRc);
}
FrameData::RCStatCU& cuStat = curEncData.m_cuStat[cuAddr];
- if (row >= col && row && m_vbvResetTriggerRow != intRow)
+ if (m_param->bEnableWavefront && row >= col && row && m_vbvResetTriggerRow != intRow)
cuStat.baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp;
+ else if (!m_param->bEnableWavefront && row && m_vbvResetTriggerRow != intRow)
+ cuStat.baseQp = curEncData.m_rowStat[row - 1].rowQp;
else
- cuStat.baseQp = curEncData.m_rowStat[row].diagQp;
+ cuStat.baseQp = curEncData.m_rowStat[row].rowQp;
/* TODO: use defines from slicetype.h for lowres block size */
uint32_t block_y = (ctu->m_cuPelY >> g_maxLog2CUSize) * noOfBlocks;
@@ -1310,21 +1375,52 @@ void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld)
if (bIsVbv)
{
// Update encoded bits, satdCost, baseQP for each CU
- curEncData.m_rowStat[row].diagSatd += curEncData.m_cuStat[cuAddr].vbvCost;
- curEncData.m_rowStat[row].diagIntraSatd += curEncData.m_cuStat[cuAddr].intraVbvCost;
+ curEncData.m_rowStat[row].rowSatd += curEncData.m_cuStat[cuAddr].vbvCost;
+ curEncData.m_rowStat[row].rowIntraSatd += curEncData.m_cuStat[cuAddr].intraVbvCost;
curEncData.m_rowStat[row].encodedBits += curEncData.m_cuStat[cuAddr].totalBits;
curEncData.m_rowStat[row].sumQpRc += curEncData.m_cuStat[cuAddr].baseQp;
curEncData.m_rowStat[row].numEncodedCUs = cuAddr;
+ // If current block is at row end checkpoint, call vbv ratecontrol.
+
+ if (!m_param->bEnableWavefront && col == numCols - 1)
+ {
+ double qpBase = curEncData.m_cuStat[cuAddr].baseQp;
+ int reEncode = m_top->m_rateControl->rowVbvRateControl(m_frame, row, &m_rce, qpBase);
+ qpBase = x265_clip3((double)m_param->rc.qpMin, (double)m_param->rc.qpMax, qpBase);
+ curEncData.m_rowStat[row].rowQp = qpBase;
+ curEncData.m_rowStat[row].rowQpScale = x265_qp2qScale(qpBase);
+ if (reEncode < 0)
+ {
+ x265_log(m_param, X265_LOG_DEBUG, "POC %d row %d - encode restart required for VBV, to %.2f from %.2f\n",
+ m_frame->m_poc, row, qpBase, curEncData.m_cuStat[cuAddr].baseQp);
+
+ m_vbvResetTriggerRow = row;
+ m_outStreams[0].copyBits(&m_backupStreams[0]);
+
+ rowCoder.copyState(curRow.bufferedEntropy);
+ rowCoder.loadContexts(curRow.bufferedEntropy);
+
+ curRow.completed = 0;
+ memset(&curRow.rowStats, 0, sizeof(curRow.rowStats));
+ curEncData.m_rowStat[row].numEncodedCUs = 0;
+ curEncData.m_rowStat[row].encodedBits = 0;
+ curEncData.m_rowStat[row].rowSatd = 0;
+ curEncData.m_rowStat[row].rowIntraSatd = 0;
+ curEncData.m_rowStat[row].sumQpRc = 0;
+ curEncData.m_rowStat[row].sumQpAq = 0;
+ }
+ }
+
// If current block is at row diagonal checkpoint, call vbv ratecontrol.
- if (row == col && row)
+ else if (m_param->bEnableWavefront && row == col && row)
{
double qpBase = curEncData.m_cuStat[cuAddr].baseQp;
- int reEncode = m_top->m_rateControl->rowDiagonalVbvRateControl(m_frame, row, &m_rce, qpBase);
+ int reEncode = m_top->m_rateControl->rowVbvRateControl(m_frame, row, &m_rce, qpBase);
qpBase = x265_clip3((double)m_param->rc.qpMin, (double)m_param->rc.qpMax, qpBase);
- curEncData.m_rowStat[row].diagQp = qpBase;
- curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(qpBase);
+ curEncData.m_rowStat[row].rowQp = qpBase;
+ curEncData.m_rowStat[row].rowQpScale = x265_qp2qScale(qpBase);
if (reEncode < 0)
{
@@ -1377,8 +1473,8 @@ void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld)
memset(&stopRow.rowStats, 0, sizeof(stopRow.rowStats));
curEncData.m_rowStat[r].numEncodedCUs = 0;
curEncData.m_rowStat[r].encodedBits = 0;
- curEncData.m_rowStat[r].diagSatd = 0;
- curEncData.m_rowStat[r].diagIntraSatd = 0;
+ curEncData.m_rowStat[r].rowSatd = 0;
+ curEncData.m_rowStat[r].rowIntraSatd = 0;
curEncData.m_rowStat[r].sumQpRc = 0;
curEncData.m_rowStat[r].sumQpAq = 0;
}
@@ -1405,7 +1501,7 @@ void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld)
ScopedLock self(curRow.lock);
if ((m_bAllRowsStop && intRow > m_vbvResetTriggerRow) ||
- (!bFirstRowInSlice && ((curRow.completed < numCols - 1) || (m_rows[row - 1].completed < numCols)) && m_rows[row - 1].completed < m_rows[row].completed + 2))
+ (!bFirstRowInSlice && ((curRow.completed < numCols - 1) || (m_rows[row - 1].completed < numCols)) && m_rows[row - 1].completed < curRow.completed + 2))
{
curRow.active = false;
curRow.busy = false;
diff --git a/source/encoder/frameencoder.h b/source/encoder/frameencoder.h
index 8bfecad..e4bb99d 100644
--- a/source/encoder/frameencoder.h
+++ b/source/encoder/frameencoder.h
@@ -184,6 +184,7 @@ public:
NoiseReduction* m_nr;
ThreadLocalData* m_tld; /* for --no-wpp */
Bitstream* m_outStreams;
+ Bitstream* m_backupStreams;
uint32_t* m_substreamSizes;
CUGeom* m_cuGeoms;
diff --git a/source/encoder/framefilter.cpp b/source/encoder/framefilter.cpp
index b9f4256..c102925 100644
--- a/source/encoder/framefilter.cpp
+++ b/source/encoder/framefilter.cpp
@@ -35,6 +35,109 @@ using namespace X265_NS;
static uint64_t computeSSD(pixel *fenc, pixel *rec, intptr_t stride, uint32_t width, uint32_t height);
static float calculateSSIM(pixel *pix1, intptr_t stride1, pixel *pix2, intptr_t stride2, uint32_t width, uint32_t height, void *buf, uint32_t& cnt);
+static void integral_init4h(uint32_t *sum, pixel *pix, intptr_t stride)
+{
+ int32_t v = pix[0] + pix[1] + pix[2] + pix[3];
+ for (int16_t x = 0; x < stride - 4; x++)
+ {
+ sum[x] = v + sum[x - stride];
+ v += pix[x + 4] - pix[x];
+ }
+}
+
+static void integral_init8h(uint32_t *sum, pixel *pix, intptr_t stride)
+{
+ int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7];
+ for (int16_t x = 0; x < stride - 8; x++)
+ {
+ sum[x] = v + sum[x - stride];
+ v += pix[x + 8] - pix[x];
+ }
+}
+
+static void integral_init12h(uint32_t *sum, pixel *pix, intptr_t stride)
+{
+ int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] +
+ pix[8] + pix[9] + pix[10] + pix[11];
+ for (int16_t x = 0; x < stride - 12; x++)
+ {
+ sum[x] = v + sum[x - stride];
+ v += pix[x + 12] - pix[x];
+ }
+}
+
+static void integral_init16h(uint32_t *sum, pixel *pix, intptr_t stride)
+{
+ int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] +
+ pix[8] + pix[9] + pix[10] + pix[11] + pix[12] + pix[13] + pix[14] + pix[15];
+ for (int16_t x = 0; x < stride - 16; x++)
+ {
+ sum[x] = v + sum[x - stride];
+ v += pix[x + 16] - pix[x];
+ }
+}
+
+static void integral_init24h(uint32_t *sum, pixel *pix, intptr_t stride)
+{
+ int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] +
+ pix[8] + pix[9] + pix[10] + pix[11] + pix[12] + pix[13] + pix[14] + pix[15] +
+ pix[16] + pix[17] + pix[18] + pix[19] + pix[20] + pix[21] + pix[22] + pix[23];
+ for (int16_t x = 0; x < stride - 24; x++)
+ {
+ sum[x] = v + sum[x - stride];
+ v += pix[x + 24] - pix[x];
+ }
+}
+
+static void integral_init32h(uint32_t *sum, pixel *pix, intptr_t stride)
+{
+ int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] +
+ pix[8] + pix[9] + pix[10] + pix[11] + pix[12] + pix[13] + pix[14] + pix[15] +
+ pix[16] + pix[17] + pix[18] + pix[19] + pix[20] + pix[21] + pix[22] + pix[23] +
+ pix[24] + pix[25] + pix[26] + pix[27] + pix[28] + pix[29] + pix[30] + pix[31];
+ for (int16_t x = 0; x < stride - 32; x++)
+ {
+ sum[x] = v + sum[x - stride];
+ v += pix[x + 32] - pix[x];
+ }
+}
+
+static void integral_init4v(uint32_t *sum4, intptr_t stride)
+{
+ for (int x = 0; x < stride; x++)
+ sum4[x] = sum4[x + 4 * stride] - sum4[x];
+}
+
+static void integral_init8v(uint32_t *sum8, intptr_t stride)
+{
+ for (int x = 0; x < stride; x++)
+ sum8[x] = sum8[x + 8 * stride] - sum8[x];
+}
+
+static void integral_init12v(uint32_t *sum12, intptr_t stride)
+{
+ for (int x = 0; x < stride; x++)
+ sum12[x] = sum12[x + 12 * stride] - sum12[x];
+}
+
+static void integral_init16v(uint32_t *sum16, intptr_t stride)
+{
+ for (int x = 0; x < stride; x++)
+ sum16[x] = sum16[x + 16 * stride] - sum16[x];
+}
+
+static void integral_init24v(uint32_t *sum24, intptr_t stride)
+{
+ for (int x = 0; x < stride; x++)
+ sum24[x] = sum24[x + 24 * stride] - sum24[x];
+}
+
+static void integral_init32v(uint32_t *sum32, intptr_t stride)
+{
+ for (int x = 0; x < stride; x++)
+ sum32[x] = sum32[x + 32 * stride] - sum32[x];
+}
+
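The `integral_init*h`/`integral_init*v` pairs added above build per-block-size sum planes for SEA motion search in two passes: each horizontal pass slides a width-W window along a row and stacks it on the previous row's totals, and the matching vertical pass then differences rows H apart, leaving the WxH block sum at each position. A minimal standalone sketch of the same two-pass idea for 4x4 blocks (hypothetical flat buffers, no border padding, unlike the padded planes the filter uses):

```cpp
#include <cassert>
#include <stdint.h>
#include <vector>

// Two-pass 4x4 block-sum plane: a 4-wide horizontal window accumulated
// down the rows, then a vertical difference of rows 4 apart.
static std::vector<uint32_t> blockSums4x4(const std::vector<uint8_t>& pix,
                                          int width, int height)
{
    std::vector<uint32_t> sum(width * height, 0);
    // Pass 1: sum[y][x] = sum over rows 0..y of pix[row][x..x+3]
    for (int y = 0; y < height; y++)
    {
        uint32_t v = pix[y * width] + pix[y * width + 1] +
                     pix[y * width + 2] + pix[y * width + 3];
        for (int x = 0; x + 4 <= width; x++)
        {
            uint32_t above = y ? sum[(y - 1) * width + x] : 0;
            sum[y * width + x] = v + above;
            if (x + 4 < width)
                v += pix[y * width + x + 4] - pix[y * width + x];
        }
    }
    // Pass 2: difference rows 4 apart -> sum of the 4x4 block at (x, y)
    std::vector<uint32_t> out((height - 3) * width, 0);
    for (int y = 0; y + 4 <= height; y++)
        for (int x = 0; x + 4 <= width; x++)
            out[y * width + x] = sum[(y + 3) * width + x] -
                                 (y ? sum[(y - 1) * width + x] : 0);
    return out;
}
```

SEA uses these planes to reject motion candidates cheaply: if the absolute difference of two block sums already exceeds the current best SAD, the full per-pixel comparison can be skipped.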
void FrameFilter::destroy()
{
X265_FREE(m_ssimBuf);
@@ -65,6 +168,7 @@ void FrameFilter::init(Encoder *top, FrameEncoder *frame, int numRows, uint32_t
m_saoRowDelay = m_param->bEnableLoopFilter ? 1 : 0;
m_lastHeight = (m_param->sourceHeight % g_maxCUSize) ? (m_param->sourceHeight % g_maxCUSize) : g_maxCUSize;
m_lastWidth = (m_param->sourceWidth % g_maxCUSize) ? (m_param->sourceWidth % g_maxCUSize) : g_maxCUSize;
+ integralCompleted.set(0);
if (m_param->bEnableSsim)
m_ssimBuf = X265_MALLOC(int, 8 * (m_param->sourceWidth / 4 + 3));
@@ -499,14 +603,19 @@ void FrameFilter::processRow(int row)
if (!ctu->m_bFirstRowInSlice)
processPostRow(row - 1);
- if (ctu->m_bLastRowInSlice)
- processPostRow(row);
-
// NOTE: slices parallelism will be execute out-of-order
- int numRowFinished;
- for(numRowFinished = 0; numRowFinished < m_numRows; numRowFinished++)
- if (!m_frame->m_reconRowFlag[numRowFinished].get())
- break;
+ int numRowFinished = 0;
+ if (m_frame->m_reconRowFlag)
+ {
+ for (numRowFinished = 0; numRowFinished < m_numRows; numRowFinished++)
+ {
+ if (!m_frame->m_reconRowFlag[numRowFinished].get())
+ break;
+
+ if (numRowFinished == row)
+ continue;
+ }
+ }
if (numRowFinished == m_numRows)
{
@@ -522,6 +631,9 @@ void FrameFilter::processRow(int row)
m_parallelFilter[0].m_sao.rdoSaoUnitRowEnd(saoParam, encData.m_slice->m_sps->numCUsInFrame);
}
}
+
+ if (ctu->m_bLastRowInSlice)
+ processPostRow(row);
}
void FrameFilter::processPostRow(int row)
@@ -656,6 +768,107 @@ void FrameFilter::processPostRow(int row)
}
} // end of (m_param->maxSlices == 1)
+ int lastRow = row == (int)m_frame->m_encData->m_slice->m_sps->numCuInHeight - 1;
+
+ /* generate integral planes for SEA motion search */
+ if (m_param->searchMethod == X265_SEA && m_frame->m_encData->m_meIntegral && m_frame->m_lowres.sliceType != X265_TYPE_B)
+ {
+ /* If WPP, other than first row, integral calculation for current row needs to wait till the
+ * integral for the previous row is computed */
+ if (m_param->bEnableWavefront && row)
+ {
+ while (m_parallelFilter[row - 1].m_frameFilter->integralCompleted.get() == 0)
+ {
+ m_parallelFilter[row - 1].m_frameFilter->integralCompleted.waitForChange(0);
+ }
+ }
+
+ int stride = (int)m_frame->m_reconPic->m_stride;
+ int padX = g_maxCUSize + 32;
+ int padY = g_maxCUSize + 16;
+ int numCuInHeight = m_frame->m_encData->m_slice->m_sps->numCuInHeight;
+ int maxHeight = numCuInHeight * g_maxCUSize;
+ int startRow = 0;
+
+ if (m_param->interlaceMode)
+ startRow = (row * g_maxCUSize >> 1);
+ else
+ startRow = row * g_maxCUSize;
+
+ int height = lastRow ? (maxHeight + g_maxCUSize * m_param->interlaceMode) : (((row + m_param->interlaceMode) * g_maxCUSize) + g_maxCUSize);
+
+ if (!row)
+ {
+ for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+ memset(m_frame->m_encData->m_meIntegral[i] - padY * stride - padX, 0, stride * sizeof(uint32_t));
+ startRow = -padY;
+ }
+
+ if (lastRow)
+ height += padY - 1;
+
+ for (int y = startRow; y < height; y++)
+ {
+ pixel *pix = m_frame->m_reconPic->m_picOrg[0] + y * stride - padX;
+ uint32_t *sum32x32 = m_frame->m_encData->m_meIntegral[0] + (y + 1) * stride - padX;
+ uint32_t *sum32x24 = m_frame->m_encData->m_meIntegral[1] + (y + 1) * stride - padX;
+ uint32_t *sum32x8 = m_frame->m_encData->m_meIntegral[2] + (y + 1) * stride - padX;
+ uint32_t *sum24x32 = m_frame->m_encData->m_meIntegral[3] + (y + 1) * stride - padX;
+ uint32_t *sum16x16 = m_frame->m_encData->m_meIntegral[4] + (y + 1) * stride - padX;
+ uint32_t *sum16x12 = m_frame->m_encData->m_meIntegral[5] + (y + 1) * stride - padX;
+ uint32_t *sum16x4 = m_frame->m_encData->m_meIntegral[6] + (y + 1) * stride - padX;
+ uint32_t *sum12x16 = m_frame->m_encData->m_meIntegral[7] + (y + 1) * stride - padX;
+ uint32_t *sum8x32 = m_frame->m_encData->m_meIntegral[8] + (y + 1) * stride - padX;
+ uint32_t *sum8x8 = m_frame->m_encData->m_meIntegral[9] + (y + 1) * stride - padX;
+ uint32_t *sum4x16 = m_frame->m_encData->m_meIntegral[10] + (y + 1) * stride - padX;
+ uint32_t *sum4x4 = m_frame->m_encData->m_meIntegral[11] + (y + 1) * stride - padX;
+
+ /*For width = 32 */
+ integral_init32h(sum32x32, pix, stride);
+ if (y >= 32 - padY)
+ integral_init32v(sum32x32 - 32 * stride, stride);
+ integral_init32h(sum32x24, pix, stride);
+ if (y >= 24 - padY)
+ integral_init24v(sum32x24 - 24 * stride, stride);
+ integral_init32h(sum32x8, pix, stride);
+ if (y >= 8 - padY)
+ integral_init8v(sum32x8 - 8 * stride, stride);
+ /*For width = 24 */
+ integral_init24h(sum24x32, pix, stride);
+ if (y >= 32 - padY)
+ integral_init32v(sum24x32 - 32 * stride, stride);
+ /*For width = 16 */
+ integral_init16h(sum16x16, pix, stride);
+ if (y >= 16 - padY)
+ integral_init16v(sum16x16 - 16 * stride, stride);
+ integral_init16h(sum16x12, pix, stride);
+ if (y >= 12 - padY)
+ integral_init12v(sum16x12 - 12 * stride, stride);
+ integral_init16h(sum16x4, pix, stride);
+ if (y >= 4 - padY)
+ integral_init4v(sum16x4 - 4 * stride, stride);
+ /*For width = 12 */
+ integral_init12h(sum12x16, pix, stride);
+ if (y >= 16 - padY)
+ integral_init16v(sum12x16 - 16 * stride, stride);
+ /*For width = 8 */
+ integral_init8h(sum8x32, pix, stride);
+ if (y >= 32 - padY)
+ integral_init32v(sum8x32 - 32 * stride, stride);
+ integral_init8h(sum8x8, pix, stride);
+ if (y >= 8 - padY)
+ integral_init8v(sum8x8 - 8 * stride, stride);
+ /*For width = 4 */
+ integral_init4h(sum4x16, pix, stride);
+ if (y >= 16 - padY)
+ integral_init16v(sum4x16 - 16 * stride, stride);
+ integral_init4h(sum4x4, pix, stride);
+ if (y >= 4 - padY)
+ integral_init4v(sum4x4 - 4 * stride, stride);
+ }
+ m_parallelFilter[row].m_frameFilter->integralCompleted.set(1);
+ }
+
if (ATOMIC_INC(&m_frameEncoder->m_completionCount) == 2 * (int)m_frameEncoder->m_numRows)
{
m_frameEncoder->m_completionEvent.trigger();
diff --git a/source/encoder/framefilter.h b/source/encoder/framefilter.h
index 5c9d12b..b492811 100644
--- a/source/encoder/framefilter.h
+++ b/source/encoder/framefilter.h
@@ -57,6 +57,8 @@ public:
int m_lastHeight;
int m_lastWidth;
+    ThreadSafeInteger integralCompleted; /* set when the integral calculation for this row has completed */
+
void* m_ssimBuf; /* Temp storage for ssim computation */
#define MAX_PFILTER_CUS (4) /* maximum CUs for every thread */
diff --git a/source/encoder/motion.cpp b/source/encoder/motion.cpp
index 2edb52a..4e36ec1 100644
--- a/source/encoder/motion.cpp
+++ b/source/encoder/motion.cpp
@@ -109,6 +109,8 @@ MotionEstimate::MotionEstimate()
blockOffset = 0;
bChromaSATD = false;
chromaSatd = NULL;
+ for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
+ integral[i] = NULL;
}
void MotionEstimate::init(int csp)
@@ -165,10 +167,12 @@ void MotionEstimate::setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset,
partEnum = partitionFromSizes(pwidth, pheight);
X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n");
sad = primitives.pu[partEnum].sad;
+ ads = primitives.pu[partEnum].ads;
satd = primitives.pu[partEnum].satd;
sad_x3 = primitives.pu[partEnum].sad_x3;
sad_x4 = primitives.pu[partEnum].sad_x4;
+
blockwidth = pwidth;
blockOffset = offset;
absPartIdx = ctuAddr = -1;
@@ -188,6 +192,7 @@ void MotionEstimate::setSourcePU(const Yuv& srcFencYuv, int _ctuAddr, int cuPart
partEnum = partitionFromSizes(pwidth, pheight);
X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n");
sad = primitives.pu[partEnum].sad;
+ ads = primitives.pu[partEnum].ads;
satd = primitives.pu[partEnum].satd;
sad_x3 = primitives.pu[partEnum].sad_x3;
sad_x4 = primitives.pu[partEnum].sad_x4;
@@ -278,12 +283,31 @@ void MotionEstimate::setSourcePU(const Yuv& srcFencYuv, int _ctuAddr, int cuPart
costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \
costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \
costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \
- COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \
- COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \
- COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \
- COPY2_IF_LT(bcost, costs[3], bmv, omv + MV(m3x, m3y)); \
+ if ((omv.y + m0y >= mvmin.y) & (omv.y + m0y <= mvmax.y)) \
+ COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \
+ if ((omv.y + m1y >= mvmin.y) & (omv.y + m1y <= mvmax.y)) \
+ COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \
+ if ((omv.y + m2y >= mvmin.y) & (omv.y + m2y <= mvmax.y)) \
+ COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \
+ if ((omv.y + m3y >= mvmin.y) & (omv.y + m3y <= mvmax.y)) \
+ COPY2_IF_LT(bcost, costs[3], bmv, omv + MV(m3x, m3y)); \
}
+#define COST_MV_X3_ABS( m0x, m0y, m1x, m1y, m2x, m2y )\
+{\
+ sad_x3(fenc, \
+ fref + (m0x) + (m0y) * stride, \
+ fref + (m1x) + (m1y) * stride, \
+ fref + (m2x) + (m2y) * stride, \
+ stride, costs); \
+ costs[0] += p_cost_mvx[(m0x) << 2]; /* no cost_mvy */\
+ costs[1] += p_cost_mvx[(m1x) << 2]; \
+ costs[2] += p_cost_mvx[(m2x) << 2]; \
+ COPY3_IF_LT(bcost, costs[0], bmv.x, m0x, bmv.y, m0y); \
+ COPY3_IF_LT(bcost, costs[1], bmv.x, m1x, bmv.y, m1y); \
+ COPY3_IF_LT(bcost, costs[2], bmv.x, m2x, bmv.y, m2y); \
+}
+
#define COST_MV_X4_DIR(m0x, m0y, m1x, m1y, m2x, m2y, m3x, m3y, costs) \
{ \
pixel *pix_base = fref + bmv.x + bmv.y * stride; \
@@ -627,6 +651,7 @@ int MotionEstimate::motionEstimate(ReferencePlanes *ref,
{
bcost = cost;
bmv = 0;
+ bmv.y = X265_MAX(X265_MIN(0, mvmax.y), mvmin.y);
}
}
@@ -659,8 +684,10 @@ int MotionEstimate::motionEstimate(ReferencePlanes *ref,
do
{
COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs);
- COPY1_IF_LT(bcost, (costs[0] << 4) + 1);
- COPY1_IF_LT(bcost, (costs[1] << 4) + 3);
+ if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
+ COPY1_IF_LT(bcost, (costs[0] << 4) + 1);
+ if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
+ COPY1_IF_LT(bcost, (costs[1] << 4) + 3);
COPY1_IF_LT(bcost, (costs[2] << 4) + 4);
COPY1_IF_LT(bcost, (costs[3] << 4) + 12);
if (!(bcost & 15))
@@ -698,36 +725,57 @@ me_hex2:
/* equivalent to the above, but eliminates duplicate candidates */
COST_MV_X3_DIR(-2, 0, -1, 2, 1, 2, costs);
bcost <<= 3;
- COPY1_IF_LT(bcost, (costs[0] << 3) + 2);
- COPY1_IF_LT(bcost, (costs[1] << 3) + 3);
- COPY1_IF_LT(bcost, (costs[2] << 3) + 4);
+ if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y))
+ COPY1_IF_LT(bcost, (costs[0] << 3) + 2);
+ if ((bmv.y + 2 >= mvmin.y) & (bmv.y + 2 <= mvmax.y))
+ {
+ COPY1_IF_LT(bcost, (costs[1] << 3) + 3);
+ COPY1_IF_LT(bcost, (costs[2] << 3) + 4);
+ }
+
COST_MV_X3_DIR(2, 0, 1, -2, -1, -2, costs);
- COPY1_IF_LT(bcost, (costs[0] << 3) + 5);
- COPY1_IF_LT(bcost, (costs[1] << 3) + 6);
- COPY1_IF_LT(bcost, (costs[2] << 3) + 7);
+ if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y))
+ COPY1_IF_LT(bcost, (costs[0] << 3) + 5);
+ if ((bmv.y - 2 >= mvmin.y) & (bmv.y - 2 <= mvmax.y))
+ {
+ COPY1_IF_LT(bcost, (costs[1] << 3) + 6);
+ COPY1_IF_LT(bcost, (costs[2] << 3) + 7);
+ }
if (bcost & 7)
{
int dir = (bcost & 7) - 2;
- bmv += hex2[dir + 1];
- /* half hexagon, not overlapping the previous iteration */
- for (int i = (merange >> 1) - 1; i > 0 && bmv.checkRange(mvmin, mvmax); i--)
+ if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y))
{
- COST_MV_X3_DIR(hex2[dir + 0].x, hex2[dir + 0].y,
- hex2[dir + 1].x, hex2[dir + 1].y,
- hex2[dir + 2].x, hex2[dir + 2].y,
- costs);
- bcost &= ~7;
- COPY1_IF_LT(bcost, (costs[0] << 3) + 1);
- COPY1_IF_LT(bcost, (costs[1] << 3) + 2);
- COPY1_IF_LT(bcost, (costs[2] << 3) + 3);
- if (!(bcost & 7))
- break;
- dir += (bcost & 7) - 2;
- dir = mod6m1[dir + 1];
bmv += hex2[dir + 1];
- }
+
+ /* half hexagon, not overlapping the previous iteration */
+ for (int i = (merange >> 1) - 1; i > 0 && bmv.checkRange(mvmin, mvmax); i--)
+ {
+ COST_MV_X3_DIR(hex2[dir + 0].x, hex2[dir + 0].y,
+ hex2[dir + 1].x, hex2[dir + 1].y,
+ hex2[dir + 2].x, hex2[dir + 2].y,
+ costs);
+ bcost &= ~7;
+
+ if ((bmv.y + hex2[dir + 0].y >= mvmin.y) & (bmv.y + hex2[dir + 0].y <= mvmax.y))
+ COPY1_IF_LT(bcost, (costs[0] << 3) + 1);
+
+ if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y))
+ COPY1_IF_LT(bcost, (costs[1] << 3) + 2);
+
+ if ((bmv.y + hex2[dir + 2].y >= mvmin.y) & (bmv.y + hex2[dir + 2].y <= mvmax.y))
+ COPY1_IF_LT(bcost, (costs[2] << 3) + 3);
+
+ if (!(bcost & 7))
+ break;
+
+ dir += (bcost & 7) - 2;
+ dir = mod6m1[dir + 1];
+ bmv += hex2[dir + 1];
+ }
+ } // if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y))
}
bcost >>= 3;
#endif // if 0
@@ -735,15 +783,21 @@ me_hex2:
/* square refine */
int dir = 0;
COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs);
- COPY2_IF_LT(bcost, costs[0], dir, 1);
- COPY2_IF_LT(bcost, costs[1], dir, 2);
+ if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
+ COPY2_IF_LT(bcost, costs[0], dir, 1);
+ if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
+ COPY2_IF_LT(bcost, costs[1], dir, 2);
COPY2_IF_LT(bcost, costs[2], dir, 3);
COPY2_IF_LT(bcost, costs[3], dir, 4);
COST_MV_X4_DIR(-1, -1, -1, 1, 1, -1, 1, 1, costs);
- COPY2_IF_LT(bcost, costs[0], dir, 5);
- COPY2_IF_LT(bcost, costs[1], dir, 6);
- COPY2_IF_LT(bcost, costs[2], dir, 7);
- COPY2_IF_LT(bcost, costs[3], dir, 8);
+ if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
+ COPY2_IF_LT(bcost, costs[0], dir, 5);
+ if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
+ COPY2_IF_LT(bcost, costs[1], dir, 6);
+ if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
+ COPY2_IF_LT(bcost, costs[2], dir, 7);
+ if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
+ COPY2_IF_LT(bcost, costs[3], dir, 8);
bmv += square1[dir];
break;
}
@@ -756,6 +810,7 @@ me_hex2:
/* refine predictors */
omv = bmv;
ucost1 = bcost;
+    X265_CHECK(((pmv.y >= mvmin.y) & (pmv.y <= mvmax.y)), "pmv outside of search range!\n");
DIA1_ITER(pmv.x, pmv.y);
if (pmv.notZero())
DIA1_ITER(0, 0);
@@ -879,7 +934,7 @@ me_hex2:
stride, costs + 4 * k); \
fref_base += 2 * dy;
#define ADD_MVCOST(k, x, y) costs[k] += p_cost_omvx[x * 4 * i] + p_cost_omvy[y * 4 * i]
-#define MIN_MV(k, x, y) COPY2_IF_LT(bcost, costs[k], dir, x * 16 + (y & 15))
+#define MIN_MV(k, dx, dy) if ((omv.y + (dy) >= mvmin.y) & (omv.y + (dy) <= mvmax.y)) { COPY2_IF_LT(bcost, costs[k], dir, dx * 16 + (dy & 15)) }
SADS(0, +0, -4, +0, +4, -2, -3, +2, -3);
SADS(1, -4, -2, +4, -2, -4, -1, +4, -1);
@@ -1043,6 +1098,161 @@ me_hex2:
break;
}
+ case X265_SEA:
+ {
+ // Successive Elimination Algorithm
+ const int16_t minX = X265_MAX(omv.x - (int16_t)merange, mvmin.x);
+ const int16_t minY = X265_MAX(omv.y - (int16_t)merange, mvmin.y);
+ const int16_t maxX = X265_MIN(omv.x + (int16_t)merange, mvmax.x);
+ const int16_t maxY = X265_MIN(omv.y + (int16_t)merange, mvmax.y);
+ const uint16_t *p_cost_mvx = m_cost_mvx - qmvp.x;
+ const uint16_t *p_cost_mvy = m_cost_mvy - qmvp.y;
+ int16_t* meScratchBuffer = NULL;
+ int scratchSize = merange * 2 + 4;
+ if (scratchSize)
+ {
+ meScratchBuffer = X265_MALLOC(int16_t, scratchSize);
+ memset(meScratchBuffer, 0, sizeof(int16_t)* scratchSize);
+ }
+
+ /* SEA is fastest in multiples of 4 */
+ int meRangeWidth = (maxX - minX + 3) & ~3;
+ int w = 0, h = 0; // Width and height of the PU
+ ALIGN_VAR_32(pixel, zero[64 * FENC_STRIDE]) = { 0 };
+ ALIGN_VAR_32(int, encDC[4]);
+ uint16_t *fpelCostMvX = m_fpelMvCosts[-qmvp.x & 3] + (-qmvp.x >> 2);
+ sizesFromPartition(partEnum, &w, &h);
+ int deltaX = (w <= 8) ? (w) : (w >> 1);
+ int deltaY = (h <= 8) ? (h) : (h >> 1);
+
+        /* Check for very small rectangular blocks that cannot be sub-divided any further */
+ bool smallRectPartition = partEnum == LUMA_4x4 || partEnum == LUMA_16x12 ||
+ partEnum == LUMA_12x16 || partEnum == LUMA_16x4 || partEnum == LUMA_4x16;
+ /* Check if vertical partition */
+ bool verticalRect = partEnum == LUMA_32x64 || partEnum == LUMA_16x32 || partEnum == LUMA_8x16 ||
+ partEnum == LUMA_4x8;
+ /* Check if horizontal partition */
+ bool horizontalRect = partEnum == LUMA_64x32 || partEnum == LUMA_32x16 || partEnum == LUMA_16x8 ||
+ partEnum == LUMA_8x4;
+        /* Check if asymmetric vertical partition */
+ bool assymetricVertical = partEnum == LUMA_12x16 || partEnum == LUMA_4x16 || partEnum == LUMA_24x32 ||
+ partEnum == LUMA_8x32 || partEnum == LUMA_48x64 || partEnum == LUMA_16x64;
+        /* Check if asymmetric horizontal partition */
+ bool assymetricHorizontal = partEnum == LUMA_16x12 || partEnum == LUMA_16x4 || partEnum == LUMA_32x24 ||
+ partEnum == LUMA_32x8 || partEnum == LUMA_64x48 || partEnum == LUMA_64x16;
+
+ int tempPartEnum = 0;
+
+ /* If a vertical rectangular partition, it is horizontally split into two, for ads_x2() */
+ if (verticalRect)
+ tempPartEnum = partitionFromSizes(w, h >> 1);
+ /* If a horizontal rectangular partition, it is vertically split into two, for ads_x2() */
+ else if (horizontalRect)
+ tempPartEnum = partitionFromSizes(w >> 1, h);
+        /* Integral planes were introduced to account for asymmetric partitions.
+         * Hence all asymmetric partitions, except those that cannot be split into legal sizes,
+         * are split into four for ads_x4() */
+ else if (assymetricVertical || assymetricHorizontal)
+ tempPartEnum = smallRectPartition ? partEnum : partitionFromSizes(w >> 1, h >> 1);
+ /* General case: Square partitions. All partitions with width > 8 are split into four
+ * for ads_x4(), for 4x4 and 8x8 we do ads_x1() */
+ else
+ tempPartEnum = (w <= 8) ? partEnum : partitionFromSizes(w >> 1, h >> 1);
+
+ /* Successive elimination by comparing DC before a full SAD,
+ * because sum(abs(diff)) >= abs(diff(sum)). */
+ primitives.pu[tempPartEnum].sad_x4(zero,
+ fenc,
+ fenc + deltaX,
+ fenc + deltaY * FENC_STRIDE,
+ fenc + deltaX + deltaY * FENC_STRIDE,
+ FENC_STRIDE,
+ encDC);
+
+ /* Assigning appropriate integral plane */
+ uint32_t *sumsBase = NULL;
+ switch (deltaX)
+ {
+ case 32: if (deltaY % 24 == 0)
+ sumsBase = integral[1];
+ else if (deltaY == 8)
+ sumsBase = integral[2];
+ else
+ sumsBase = integral[0];
+ break;
+ case 24: sumsBase = integral[3];
+ break;
+ case 16: if (deltaY % 12 == 0)
+ sumsBase = integral[5];
+ else if (deltaY == 4)
+ sumsBase = integral[6];
+ else
+ sumsBase = integral[4];
+ break;
+ case 12: sumsBase = integral[7];
+ break;
+ case 8: if (deltaY == 32)
+ sumsBase = integral[8];
+ else
+ sumsBase = integral[9];
+ break;
+ case 4: if (deltaY == 16)
+ sumsBase = integral[10];
+ else
+ sumsBase = integral[11];
+ break;
+ default: sumsBase = integral[11];
+ break;
+ }
+
+ if (partEnum == LUMA_64x64 || partEnum == LUMA_32x32 || partEnum == LUMA_16x16 ||
+ partEnum == LUMA_32x64 || partEnum == LUMA_16x32 || partEnum == LUMA_8x16 ||
+ partEnum == LUMA_4x8 || partEnum == LUMA_12x16 || partEnum == LUMA_4x16 ||
+ partEnum == LUMA_24x32 || partEnum == LUMA_8x32 || partEnum == LUMA_48x64 ||
+ partEnum == LUMA_16x64)
+ deltaY *= (int)stride;
+
+ if (verticalRect)
+ encDC[1] = encDC[2];
+
+ if (horizontalRect)
+ deltaY = deltaX;
+
+ /* ADS and SAD */
+ MV tmv;
+ for (tmv.y = minY; tmv.y <= maxY; tmv.y++)
+ {
+ int i, xn;
+ int ycost = p_cost_mvy[tmv.y] << 2;
+ if (bcost <= ycost)
+ continue;
+ bcost -= ycost;
+
+ /* ADS_4 for 16x16, 32x32, 64x64, 24x32, 32x24, 48x64, 64x48, 32x8, 8x32, 64x16, 16x64 partitions
+ * ADS_1 for 4x4, 8x8, 16x4, 4x16, 16x12, 12x16 partitions
+ * ADS_2 for all other rectangular partitions */
+ xn = ads(encDC,
+ sumsBase + minX + tmv.y * stride,
+ deltaY,
+ fpelCostMvX + minX,
+ meScratchBuffer,
+ meRangeWidth,
+ bcost);
+
+ for (i = 0; i < xn - 2; i += 3)
+ COST_MV_X3_ABS(minX + meScratchBuffer[i], tmv.y,
+ minX + meScratchBuffer[i + 1], tmv.y,
+ minX + meScratchBuffer[i + 2], tmv.y);
+
+ bcost += ycost;
+ for (; i < xn; i++)
+ COST_MV(minX + meScratchBuffer[i], tmv.y);
+ }
+ if (meScratchBuffer)
+ x265_free(meScratchBuffer);
+ break;
+ }
+
case X265_FULL_SEARCH:
{
// dead slow exhaustive search, but at least it uses sad_x4()
@@ -1099,6 +1309,7 @@ me_hex2:
if ((g_maxSlices > 1) & ((bmv.y < qmvmin.y) | (bmv.y > qmvmax.y)))
{
bmv.y = x265_min(x265_max(bmv.y, qmvmin.y), qmvmax.y);
+ bcost = subpelCompare(ref, bmv, satd) + mvcost(bmv);
}
if (!bcost)
@@ -1113,6 +1324,11 @@ me_hex2:
for (int i = 1; i <= wl.hpel_dirs; i++)
{
MV qmv = bmv + square1[i] * 2;
+
+ /* skip invalid range */
+ if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))
+ continue;
+
int cost = ref->lowresQPelCost(fenc, blockOffset, qmv, sad) + mvcost(qmv);
COPY2_IF_LT(bcost, cost, bdir, i);
}
@@ -1124,6 +1340,11 @@ me_hex2:
for (int i = 1; i <= wl.qpel_dirs; i++)
{
MV qmv = bmv + square1[i];
+
+ /* skip invalid range */
+ if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))
+ continue;
+
int cost = ref->lowresQPelCost(fenc, blockOffset, qmv, satd) + mvcost(qmv);
COPY2_IF_LT(bcost, cost, bdir, i);
}
@@ -1150,7 +1371,7 @@ me_hex2:
MV qmv = bmv + square1[i] * 2;
// check mv range for slice bound
- if ((g_maxSlices > 1) & ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y)))
+ if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))
continue;
int cost = subpelCompare(ref, qmv, hpelcomp) + mvcost(qmv);
@@ -1175,7 +1396,7 @@ me_hex2:
MV qmv = bmv + square1[i];
// check mv range for slice bound
- if ((g_maxSlices > 1) & ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y)))
+ if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))
continue;
int cost = subpelCompare(ref, qmv, satd) + mvcost(qmv);
@@ -1189,6 +1410,9 @@ me_hex2:
}
}
+ // check mv range for slice bound
+    X265_CHECK(((bmv.y >= qmvmin.y) & (bmv.y <= qmvmax.y)), "mv beyond range!\n");
+
x265_emms();
outQMv = bmv;
return bcost;
diff --git a/source/encoder/motion.h b/source/encoder/motion.h
index a47c2be..9b602ec 100644
--- a/source/encoder/motion.h
+++ b/source/encoder/motion.h
@@ -52,6 +52,7 @@ protected:
pixelcmp_t sad;
pixelcmp_x3_t sad_x3;
pixelcmp_x4_t sad_x4;
+ pixelcmp_ads_t ads;
pixelcmp_t satd;
pixelcmp_t chromaSatd;
@@ -61,6 +62,7 @@ public:
static const int COST_MAX = 1 << 28;
+ uint32_t* integral[INTEGRAL_PLANE_NUM];
Yuv fencPUYuv;
int partEnum;
bool bChromaSATD;
diff --git a/source/encoder/nal.h b/source/encoder/nal.h
index 15e542d..35f6961 100644
--- a/source/encoder/nal.h
+++ b/source/encoder/nal.h
@@ -34,6 +34,7 @@ class Bitstream;
class NALList
{
+public:
static const int MAX_NAL_UNITS = 16;
public:
diff --git a/source/encoder/ratecontrol.cpp b/source/encoder/ratecontrol.cpp
index 0649e12..d71cfeb 100644
--- a/source/encoder/ratecontrol.cpp
+++ b/source/encoder/ratecontrol.cpp
@@ -341,6 +341,8 @@ bool RateControl::init(const SPS& sps)
m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, m_param->rc.vbvBufferInit / m_param->rc.vbvBufferSize);
m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, X265_MAX(m_param->rc.vbvBufferInit, m_bufferRate / m_bufferSize));
m_bufferFillFinal = m_bufferSize * m_param->rc.vbvBufferInit;
+ m_bufferFillActual = m_bufferFillFinal;
+ m_bufferExcess = 0;
}
m_totalBits = 0;
@@ -431,7 +433,7 @@ bool RateControl::init(const SPS& sps)
}
*statsIn = '\0';
statsIn++;
- if (sscanf(opts, "#options: %dx%d", &i, &j) != 2)
+ if ((p = strstr(opts, " input-res=")) == 0 || sscanf(p, " input-res=%dx%d", &i, &j) != 2)
{
x265_log(m_param, X265_LOG_ERROR, "Resolution specified in stats file not valid\n");
return false;
@@ -457,9 +459,15 @@ bool RateControl::init(const SPS& sps)
CMP_OPT_FIRST_PASS("bframes", m_param->bframes);
CMP_OPT_FIRST_PASS("b-pyramid", m_param->bBPyramid);
CMP_OPT_FIRST_PASS("open-gop", m_param->bOpenGOP);
- CMP_OPT_FIRST_PASS("keyint", m_param->keyframeMax);
+ CMP_OPT_FIRST_PASS(" keyint", m_param->keyframeMax);
CMP_OPT_FIRST_PASS("scenecut", m_param->scenecutThreshold);
CMP_OPT_FIRST_PASS("intra-refresh", m_param->bIntraRefresh);
+ if (m_param->bMultiPassOptRPS)
+ {
+ CMP_OPT_FIRST_PASS("multi-pass-opt-rps", m_param->bMultiPassOptRPS);
+ CMP_OPT_FIRST_PASS("repeat-headers", m_param->bRepeatHeaders);
+ CMP_OPT_FIRST_PASS("min-keyint", m_param->keyframeMin);
+ }
if ((p = strstr(opts, "b-adapt=")) != 0 && sscanf(p, "b-adapt=%d", &i) && i >= X265_B_ADAPT_NONE && i <= X265_B_ADAPT_TRELLIS)
{
@@ -542,10 +550,27 @@ bool RateControl::init(const SPS& sps)
}
rce = &m_rce2Pass[encodeOrder];
m_encOrder[frameNumber] = encodeOrder;
- e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf",
- &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits,
- &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount,
- &rce->skipCuCount);
+ if (!m_param->bMultiPassOptRPS)
+ {
+ e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf",
+ &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits,
+ &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount,
+ &rce->skipCuCount);
+ }
+ else
+ {
+ char deltaPOC[128];
+ char bUsed[40];
+ memset(deltaPOC, 0, sizeof(deltaPOC));
+ memset(bUsed, 0, sizeof(bUsed));
+ e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf nump:%d numnegp:%d numposp:%d deltapoc:%s bused:%s",
+ &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits,
+ &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount,
+ &rce->skipCuCount, &rce->rpsData.numberOfPictures, &rce->rpsData.numberOfNegativePictures, &rce->rpsData.numberOfPositivePictures, deltaPOC, bUsed);
+ splitdeltaPOC(deltaPOC, rce);
+ splitbUsed(bUsed, rce);
+ rce->rpsIdx = -1;
+ }
rce->keptAsRef = true;
rce->isIdr = false;
if (picType == 'b' || picType == 'p')
@@ -598,7 +623,7 @@ bool RateControl::init(const SPS& sps)
x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.temp\n", fileName);
return false;
}
- p = x265_param2string(m_param);
+ p = x265_param2string(m_param, sps.conformanceWindow.rightOffset, sps.conformanceWindow.bottomOffset);
if (p)
fprintf(m_statFileOut, "#options: %s\n", p);
X265_FREE(p);
@@ -1649,15 +1674,18 @@ double RateControl::rateEstimateQscale(Frame* curFrame, RateControlEntry *rce)
if (m_pred[m_predType].count == 1)
qScale = x265_clip3(lmin, lmax, qScale);
m_lastQScaleFor[m_sliceType] = qScale;
- rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd);
}
- else
- rce->frameSizePlanned = qScale2bits(rce, qScale);
+ }
- /* Limit planned size by MinCR */
+ if (m_2pass)
+ rce->frameSizePlanned = qScale2bits(rce, qScale);
+ else
+ rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd);
+
+ /* Limit planned size by MinCR */
+ if (m_isVbv)
rce->frameSizePlanned = X265_MIN(rce->frameSizePlanned, rce->frameSizeMaximum);
- rce->frameSizeEstimated = rce->frameSizePlanned;
- }
+ rce->frameSizeEstimated = rce->frameSizePlanned;
rce->newQScale = qScale;
if(rce->bLastMiniGopBFrame)
@@ -1875,7 +1903,7 @@ double RateControl::rateEstimateQscale(Frame* curFrame, RateControlEntry *rce)
if ((m_curSlice->m_poc == 0 || m_lastQScaleFor[P_SLICE] < q) && !(m_2pass && !m_isVbv))
m_lastQScaleFor[P_SLICE] = q * fabs(m_param->rc.ipFactor);
- if (m_2pass && m_isVbv)
+ if (m_2pass)
rce->frameSizePlanned = qScale2bits(rce, q);
else
rce->frameSizePlanned = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
@@ -2161,7 +2189,7 @@ double RateControl::predictRowsSizeSum(Frame* curFrame, RateControlEntry* rce, d
for (uint32_t row = 0; row < maxRows; row++)
{
encodedBitsSoFar += curEncData.m_rowStat[row].encodedBits;
- rowSatdCostSoFar = curEncData.m_rowStat[row].diagSatd;
+ rowSatdCostSoFar = curEncData.m_rowStat[row].rowSatd;
uint32_t satdCostForPendingCus = curEncData.m_rowStat[row].satdForVbv - rowSatdCostSoFar;
satdCostForPendingCus >>= X265_DEPTH - 8;
if (satdCostForPendingCus > 0)
@@ -2190,7 +2218,7 @@ double RateControl::predictRowsSizeSum(Frame* curFrame, RateControlEntry* rce, d
}
refRowSatdCost >>= X265_DEPTH - 8;
- refQScale = refEncData.m_rowStat[row].diagQpScale;
+ refQScale = refEncData.m_rowStat[row].rowQpScale;
}
if (picType == I_SLICE || qScale >= refQScale)
@@ -2212,7 +2240,7 @@ double RateControl::predictRowsSizeSum(Frame* curFrame, RateControlEntry* rce, d
}
else if (picType == P_SLICE)
{
- intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].diagIntraSatd;
+ intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].rowIntraSatd;
intraCostForPendingCus >>= X265_DEPTH - 8;
/* Our QP is lower than the reference! */
double pred_intra = predictSize(rce->rowPred[1], qScale, intraCostForPendingCus);
@@ -2227,16 +2255,16 @@ double RateControl::predictRowsSizeSum(Frame* curFrame, RateControlEntry* rce, d
return totalSatdBits + encodedBitsSoFar;
}
-int RateControl::rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv)
+int RateControl::rowVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv)
{
FrameData& curEncData = *curFrame->m_encData;
double qScaleVbv = x265_qp2qScale(qpVbv);
- uint64_t rowSatdCost = curEncData.m_rowStat[row].diagSatd;
+ uint64_t rowSatdCost = curEncData.m_rowStat[row].rowSatd;
double encodedBits = curEncData.m_rowStat[row].encodedBits;
- if (row == 1)
+ if (m_param->bEnableWavefront && row == 1)
{
- rowSatdCost += curEncData.m_rowStat[0].diagSatd;
+ rowSatdCost += curEncData.m_rowStat[0].rowSatd;
encodedBits += curEncData.m_rowStat[0].encodedBits;
}
rowSatdCost >>= X265_DEPTH - 8;
@@ -2244,11 +2272,11 @@ int RateControl::rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateCo
if (curEncData.m_slice->m_sliceType != I_SLICE)
{
Frame* refFrame = curEncData.m_slice->m_refFrameList[0][0];
- if (qpVbv < refFrame->m_encData->m_rowStat[row].diagQp)
+ if (qpVbv < refFrame->m_encData->m_rowStat[row].rowQp)
{
- uint64_t intraRowSatdCost = curEncData.m_rowStat[row].diagIntraSatd;
- if (row == 1)
- intraRowSatdCost += curEncData.m_rowStat[0].diagIntraSatd;
+ uint64_t intraRowSatdCost = curEncData.m_rowStat[row].rowIntraSatd;
+ if (m_param->bEnableWavefront && row == 1)
+ intraRowSatdCost += curEncData.m_rowStat[0].rowIntraSatd;
intraRowSatdCost >>= X265_DEPTH - 8;
updatePredictor(rce->rowPred[1], qScaleVbv, (double)intraRowSatdCost, encodedBits);
}
@@ -2309,7 +2337,7 @@ int RateControl::rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateCo
}
while (qpVbv > qpMin
- && (qpVbv > curEncData.m_rowStat[0].diagQp || m_singleFrameVbv)
+ && (qpVbv > curEncData.m_rowStat[0].rowQp || m_singleFrameVbv)
&& (((accFrameBits < rce->frameSizePlanned * 0.8f && qpVbv <= prevRowQp)
|| accFrameBits < (rce->bufferFill - m_bufferSize + m_bufferRate) * 1.1)
&& (!m_param->rc.bStrictCbr ? 1 : abrOvershoot < 0)))
@@ -2329,7 +2357,7 @@ int RateControl::rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateCo
accFrameBits = predictRowsSizeSum(curFrame, rce, qpVbv, encodedBitsSoFar);
abrOvershoot = (accFrameBits + m_totalBits - m_wantedBitsWindow) / totalBitsNeeded;
}
- if (qpVbv > curEncData.m_rowStat[0].diagQp &&
+ if (qpVbv > curEncData.m_rowStat[0].rowQp &&
abrOvershoot < -0.1 && timeDone > 0.5 && accFrameBits < rce->frameSizePlanned - rcTol)
{
qpVbv -= stepSize;
@@ -2446,6 +2474,10 @@ void RateControl::updateVbv(int64_t bits, RateControlEntry* rce)
m_bufferFillFinal = X265_MAX(m_bufferFillFinal, 0);
m_bufferFillFinal += m_bufferRate;
m_bufferFillFinal = X265_MIN(m_bufferFillFinal, m_bufferSize);
+ double bufferBits = X265_MIN(bits + m_bufferExcess, m_bufferRate);
+ m_bufferExcess = X265_MAX(m_bufferExcess - bufferBits + bits, 0);
+ m_bufferFillActual += bufferBits - bits;
+ m_bufferFillActual = X265_MIN(m_bufferFillActual, m_bufferSize);
}
/* After encoding one frame, update rate control state */
@@ -2626,18 +2658,55 @@ int RateControl::writeRateControlFrameStats(Frame* curFrame, RateControlEntry* r
char cType = rce->sliceType == I_SLICE ? (curFrame->m_lowres.sliceType == X265_TYPE_IDR ? 'I' : 'i')
: rce->sliceType == P_SLICE ? 'P'
: IS_REFERENCED(curFrame) ? 'B' : 'b';
- if (fprintf(m_statFileOut,
- "in:%d out:%d type:%c q:%.2f q-aq:%.2f q-noVbv:%.2f q-Rceq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n",
- rce->poc, rce->encodeOrder,
- cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq,
- rce->qpNoVbv, rce->qRceq,
- curFrame->m_encData->m_frameStats.coeffBits,
- curFrame->m_encData->m_frameStats.mvBits,
- curFrame->m_encData->m_frameStats.miscBits,
- curFrame->m_encData->m_frameStats.percent8x8Intra * m_ncu,
- curFrame->m_encData->m_frameStats.percent8x8Inter * m_ncu,
- curFrame->m_encData->m_frameStats.percent8x8Skip * m_ncu) < 0)
- goto writeFailure;
+
+ if (!curEncData.m_param->bMultiPassOptRPS)
+ {
+ if (fprintf(m_statFileOut,
+ "in:%d out:%d type:%c q:%.2f q-aq:%.2f q-noVbv:%.2f q-Rceq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n",
+ rce->poc, rce->encodeOrder,
+ cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq,
+ rce->qpNoVbv, rce->qRceq,
+ curFrame->m_encData->m_frameStats.coeffBits,
+ curFrame->m_encData->m_frameStats.mvBits,
+ curFrame->m_encData->m_frameStats.miscBits,
+ curFrame->m_encData->m_frameStats.percent8x8Intra * m_ncu,
+ curFrame->m_encData->m_frameStats.percent8x8Inter * m_ncu,
+ curFrame->m_encData->m_frameStats.percent8x8Skip * m_ncu) < 0)
+ goto writeFailure;
+ }
+    else
+    {
+ RPS* rpsWriter = &curFrame->m_encData->m_slice->m_rps;
+ int i, num = rpsWriter->numberOfPictures;
+ char deltaPOC[128];
+ char bUsed[40];
+ memset(deltaPOC, 0, sizeof(deltaPOC));
+ memset(bUsed, 0, sizeof(bUsed));
+        int lenPOC = sprintf(deltaPOC, "deltapoc:~");
+        int lenUsed = sprintf(bUsed, "bused:~");
+
+        for (i = 0; i < num; i++)
+        {
+            /* append at the tracked offset; sprintf(buf, "%s...", buf, ...) overlaps
+             * source and destination, which is undefined behavior */
+            lenPOC += sprintf(deltaPOC + lenPOC, "%d~", rpsWriter->deltaPOC[i]);
+            lenUsed += sprintf(bUsed + lenUsed, "%d~", rpsWriter->bUsed[i]);
+        }
+
+ if (fprintf(m_statFileOut,
+ "in:%d out:%d type:%c q:%.2f q-aq:%.2f q-noVbv:%.2f q-Rceq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f nump:%d numnegp:%d numposp:%d %s %s ;\n",
+ rce->poc, rce->encodeOrder,
+ cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq,
+ rce->qpNoVbv, rce->qRceq,
+ curFrame->m_encData->m_frameStats.coeffBits,
+ curFrame->m_encData->m_frameStats.mvBits,
+ curFrame->m_encData->m_frameStats.miscBits,
+ curFrame->m_encData->m_frameStats.percent8x8Intra * m_ncu,
+ curFrame->m_encData->m_frameStats.percent8x8Inter * m_ncu,
+ curFrame->m_encData->m_frameStats.percent8x8Skip * m_ncu,
+ rpsWriter->numberOfPictures,
+ rpsWriter->numberOfNegativePictures,
+ rpsWriter->numberOfPositivePictures,
+ deltaPOC, bUsed) < 0)
+ goto writeFailure;
+ }
/* Don't re-write the data in multi-pass mode. */
if (m_param->rc.cuTree && IS_REFERENCED(curFrame) && !m_param->rc.bStatRead)
{
@@ -2730,3 +2799,48 @@ void RateControl::destroy()
X265_FREE(m_param->rc.zones);
}
+void RateControl::splitdeltaPOC(char deltapoc[], RateControlEntry *rce)
+{
+ int idx = 0, length = 0;
+ char tmpStr[128];
+ char* src = deltapoc;
+ char* buf = strstr(src, "~");
+ while (buf)
+ {
+ memset(tmpStr, 0, sizeof(tmpStr));
+ length = (int)(buf - src);
+ if (length != 0)
+ {
+ strncpy(tmpStr, src, length);
+ rce->rpsData.deltaPOC[idx] = atoi(tmpStr);
+ idx++;
+ if (idx == rce->rpsData.numberOfPictures)
+ break;
+ }
+ src += (length + 1);
+ buf = strstr(src, "~");
+ }
+}
+
+void RateControl::splitbUsed(char bused[], RateControlEntry *rce)
+{
+ int idx = 0, length = 0;
+ char tmpStr[128];
+ char* src = bused;
+ char* buf = strstr(src, "~");
+ while (buf)
+ {
+ memset(tmpStr, 0, sizeof(tmpStr));
+ length = (int)(buf - src);
+ if (length != 0)
+ {
+ strncpy(tmpStr, src, length);
+ rce->rpsData.bUsed[idx] = atoi(tmpStr) > 0;
+ idx++;
+ if (idx == rce->rpsData.numberOfPictures)
+ break;
+ }
+ src += (length + 1);
+ buf = strstr(src, "~");
+ }
+}
diff --git a/source/encoder/ratecontrol.h b/source/encoder/ratecontrol.h
index 8808a4c..c46423b 100644
--- a/source/encoder/ratecontrol.h
+++ b/source/encoder/ratecontrol.h
@@ -111,6 +111,8 @@ struct RateControlEntry
bool isIdr;
SEIPictureTiming *picTimingSEI;
HRDTiming *hrdTiming;
+ int rpsIdx;
+ RPS rpsData;
};
class RateControl
@@ -144,6 +146,8 @@ public:
double m_rateFactorMaxIncrement; /* Don't allow RF above (CRF + this value). */
double m_rateFactorMaxDecrement; /* don't allow RF below (this value). */
double m_avgPFrameQp;
+ double m_bufferFillActual;
+ double m_bufferExcess;
bool m_isFirstMiniGop;
Predictor m_pred[4]; /* Slice predictors to predict bits for each Slice type - I, P, Bref and B */
int64_t m_leadingNoBSatd;
@@ -239,7 +243,7 @@ public:
int rateControlStart(Frame* curFrame, RateControlEntry* rce, Encoder* enc);
void rateControlUpdateStats(RateControlEntry* rce);
int rateControlEnd(Frame* curFrame, int64_t bits, RateControlEntry* rce);
- int rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv);
+ int rowVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv);
int rateControlSliceType(int frameNum);
bool cuTreeReadFor2Pass(Frame* curFrame);
void hrdFullness(SEIBufferingPeriod* sei);
@@ -280,6 +284,8 @@ protected:
bool findUnderflow(double *fills, int *t0, int *t1, int over, int framesCount);
bool fixUnderflow(int t0, int t1, double adjustment, double qscaleMin, double qscaleMax);
double tuneQScaleForGrain(double rcOverflow);
+ void splitdeltaPOC(char deltapoc[], RateControlEntry *rce);
+ void splitbUsed(char deltapoc[], RateControlEntry *rce);
};
}
#endif // ifndef X265_RATECONTROL_H
diff --git a/source/encoder/reference.cpp b/source/encoder/reference.cpp
index 864ca63..9b79ca1 100644
--- a/source/encoder/reference.cpp
+++ b/source/encoder/reference.cpp
@@ -128,11 +128,12 @@ void MotionReference::applyWeight(uint32_t finishedRows, uint32_t maxNumRows, ui
intptr_t stride = reconPic->m_stride;
int width = reconPic->m_picWidth;
int height = (finishedRows - numWeightedRows) * g_maxCUSize;
- if ((finishedRows == maxNumRows) && (reconPic->m_picHeight % g_maxCUSize))
+ /* the last row may be partial height */
+ if (finishedRows == maxNumRows - 1)
{
- /* the last row may be partial height */
- height -= g_maxCUSize;
- height += reconPic->m_picHeight % g_maxCUSize;
+ const int leftRows = (reconPic->m_picHeight & (g_maxCUSize - 1));
+
+ height += leftRows ? leftRows : g_maxCUSize;
}
int cuHeight = g_maxCUSize;
@@ -172,7 +173,7 @@ void MotionReference::applyWeight(uint32_t finishedRows, uint32_t maxNumRows, ui
}
// Extending Bottom
- if (finishedRows == maxNumRows)
+ if (finishedRows == maxNumRows - 1)
{
int picHeight = reconPic->m_picHeight;
if (c) picHeight >>= reconPic->m_vChromaShift;
diff --git a/source/encoder/sao.cpp b/source/encoder/sao.cpp
index 0ea68aa..6d6b401 100644
--- a/source/encoder/sao.cpp
+++ b/source/encoder/sao.cpp
@@ -1208,10 +1208,15 @@ void SAO::rdoSaoUnitRowEnd(const SAOParam* saoParam, int numctus)
if (!saoParam->bSaoFlag[0])
m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = 1.0;
else
+ {
+ X265_CHECK(m_numNoSao[0] <= numctus, "m_numNoSao check failure!");
m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[0] / ((double)numctus);
+ }
if (!saoParam->bSaoFlag[1])
+ {
m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth] = 1.0;
+ }
else
m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[1] / ((double)numctus);
}
diff --git a/source/encoder/search.cpp b/source/encoder/search.cpp
index 7ad0784..7d9081c 100644
--- a/source/encoder/search.cpp
+++ b/source/encoder/search.cpp
@@ -67,6 +67,7 @@ Search::Search()
m_param = NULL;
m_slice = NULL;
m_frame = NULL;
+ m_maxTUDepth = -1;
}
bool Search::initSearch(const x265_param& param, ScalingList& scalingList)
@@ -93,6 +94,19 @@ bool Search::initSearch(const x265_param& param, ScalingList& scalingList)
uint32_t sizeC = sizeL >> (m_hChromaShift + m_vChromaShift);
uint32_t numPartitions = 1 << (maxLog2CUSize - LOG2_UNIT_SIZE) * 2;
+ m_limitTU = 0;
+ if (m_param->limitTU)
+ {
+ if (m_param->limitTU == 1)
+ m_limitTU = X265_TU_LIMIT_BFS;
+ else if (m_param->limitTU == 2)
+ m_limitTU = X265_TU_LIMIT_DFS;
+ else if (m_param->limitTU == 3)
+ m_limitTU = X265_TU_LIMIT_NEIGH;
+ else if (m_param->limitTU == 4)
+ m_limitTU = X265_TU_LIMIT_DFS + X265_TU_LIMIT_NEIGH;
+ }
+
/* these are indexed by qtLayer (log2size - 2) so nominally 0=4x4, 1=8x8, 2=16x16, 3=32x32
* the coeffRQT and reconQtYuv are allocated to the max CU size at every depth. The parts
* which are reconstructed at each depth are valid. At the end, the transform depth table
@@ -2131,6 +2145,13 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
int mvpIdx = selectMVP(cu, pu, amvp, list, ref);
MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];
+ if (m_param->searchMethod == X265_SEA)
+ {
+ int puX = puIdx & 1;
+ int puY = puIdx >> 1;
+ for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++)
+ m_me.integral[planes] = interMode.fencYuv->m_integral[list][ref][planes] + puX * pu.width + puY * pu.height * m_slice->m_refFrameList[list][ref]->m_reconPic->m_stride;
+ }
setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax);
int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv,
m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0);
@@ -2229,7 +2250,13 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
if (lmv.notZero())
mvc[numMvc++] = lmv;
}
-
+ if (m_param->searchMethod == X265_SEA)
+ {
+ int puX = puIdx & 1;
+ int puY = puIdx >> 1;
+ for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++)
+ m_me.integral[planes] = interMode.fencYuv->m_integral[list][ref][planes] + puX * pu.width + puY * pu.height * m_slice->m_refFrameList[list][ref]->m_reconPic->m_stride;
+ }
setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax);
int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv,
m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0);
@@ -2544,6 +2571,9 @@ void Search::setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mv
/* conditional clipping for frame parallelism */
mvmin.y = X265_MIN(mvmin.y, (int16_t)m_refLagPixels);
mvmax.y = X265_MIN(mvmax.y, (int16_t)m_refLagPixels);
+
+ /* conditional clipping for negative mv range */
+ mvmax.y = X265_MAX(mvmax.y, mvmin.y);
}
/* Note: this function overwrites the RD cost variables of interMode, but leaves the sa8d cost unharmed */
@@ -2617,8 +2647,29 @@ void Search::encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom)
m_entropyCoder.load(m_rqt[depth].cur);
+ if ((m_limitTU & X265_TU_LIMIT_DFS) && !(m_limitTU & X265_TU_LIMIT_NEIGH))
+ m_maxTUDepth = -1;
+ else if (m_limitTU & X265_TU_LIMIT_BFS)
+ memset(&m_cacheTU, 0, sizeof(TUInfoCache));
+
Cost costs;
- estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange);
+ if (m_limitTU & X265_TU_LIMIT_NEIGH)
+ {
+ /* Save and restore maxTUDepth so it is not altered between mode decisions */
+ int32_t tempDepth = m_maxTUDepth;
+ if (m_maxTUDepth != -1)
+ {
+ uint32_t splitFlag = interMode.cu.m_partSize[0] != SIZE_2Nx2N;
+ uint32_t minSize = tuDepthRange[0];
+ uint32_t maxSize = tuDepthRange[1];
+ maxSize = X265_MIN(maxSize, cuGeom.log2CUSize - splitFlag);
+ m_maxTUDepth = x265_clip3(cuGeom.log2CUSize - maxSize, cuGeom.log2CUSize - minSize, (uint32_t)m_maxTUDepth);
+ }
+ estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange);
+ m_maxTUDepth = tempDepth;
+ }
+ else
+ estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange);
uint32_t tqBypass = cu.m_tqBypass[0];
if (!tqBypass)
@@ -2867,7 +2918,57 @@ uint64_t Search::estimateNullCbfCost(sse_t dist, uint32_t psyEnergy, uint32_t tu
return m_rdCost.calcRdCost(dist, nullBits);
}
-void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& outCosts, const uint32_t depthRange[2])
+bool Search::splitTU(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& splitCost, const uint32_t depthRange[2], int32_t splitMore)
+{
+ CUData& cu = mode.cu;
+ uint32_t depth = cuGeom.depth + tuDepth;
+ uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth;
+
+ uint32_t qNumParts = 1 << (log2TrSize - 1 - LOG2_UNIT_SIZE) * 2;
+ uint32_t ycbf = 0, ucbf = 0, vcbf = 0;
+ for (uint32_t qIdx = 0, qPartIdx = absPartIdx; qIdx < 4; ++qIdx, qPartIdx += qNumParts)
+ {
+ if ((m_limitTU & X265_TU_LIMIT_DFS) && tuDepth == 0 && qIdx == 1)
+ {
+ m_maxTUDepth = cu.m_tuDepth[0];
+ // Fetch maximum TU depth of first sub partition to limit recursion of others
+ for (uint32_t i = 1; i < cuGeom.numPartitions / 4; i++)
+ m_maxTUDepth = X265_MAX(m_maxTUDepth, cu.m_tuDepth[i]);
+ }
+ estimateResidualQT(mode, cuGeom, qPartIdx, tuDepth + 1, resiYuv, splitCost, depthRange, splitMore);
+ ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1);
+ if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)
+ {
+ ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1);
+ vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1);
+ }
+ }
+ cu.m_cbf[0][absPartIdx] |= ycbf << tuDepth;
+ if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)
+ {
+ cu.m_cbf[1][absPartIdx] |= ucbf << tuDepth;
+ cu.m_cbf[2][absPartIdx] |= vcbf << tuDepth;
+ }
+
+ // The coefficient bits of each individual sub-block were already collected during the
+ // recursion above, so only the cbf values need to be re-encoded here. Note that chroma
+ // cbfs are coded differently from luma, and that the coefficients and the cbfs may be
+ // coded from contexts at different depths (e.g. coefficients at depth 2, cbfs at depth 0).
+ m_entropyCoder.load(m_rqt[depth].rqtRoot);
+ m_entropyCoder.resetBits();
+ codeInterSubdivCbfQT(cu, absPartIdx, tuDepth, depthRange);
+ uint32_t splitCbfBits = m_entropyCoder.getNumberOfWrittenBits();
+ splitCost.bits += splitCbfBits;
+
+ if (m_rdCost.m_psyRd)
+ splitCost.rdcost = m_rdCost.calcPsyRdCost(splitCost.distortion, splitCost.bits, splitCost.energy);
+ else
+ splitCost.rdcost = m_rdCost.calcRdCost(splitCost.distortion, splitCost.bits);
+
+ return ycbf || ucbf || vcbf;
+}
+
+void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& outCosts, const uint32_t depthRange[2], int32_t splitMore)
{
CUData& cu = mode.cu;
uint32_t depth = cuGeom.depth + tuDepth;
@@ -2876,6 +2977,37 @@ void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPa
bool bCheckSplit = log2TrSize > depthRange[0];
bool bCheckFull = log2TrSize <= depthRange[1];
+ bool bSaveTUData = false, bLoadTUData = false;
+ uint32_t idx = 0;
+
+ if ((m_limitTU & X265_TU_LIMIT_BFS) && splitMore >= 0)
+ {
+ if (bCheckSplit && bCheckFull && tuDepth)
+ {
+ uint32_t qNumParts = 1 << (log2TrSize - LOG2_UNIT_SIZE) * 2;
+ uint32_t qIdx = (absPartIdx / qNumParts) % 4;
+ idx = (depth - 1) * 4 + qIdx;
+ if (splitMore)
+ {
+ bLoadTUData = true;
+ bCheckFull = false;
+ }
+ else
+ {
+ bSaveTUData = true;
+ bCheckSplit = false;
+ }
+ }
+ }
+ else if (m_limitTU & X265_TU_LIMIT_DFS || m_limitTU & X265_TU_LIMIT_NEIGH)
+ {
+ if (bCheckSplit && m_maxTUDepth >= 0)
+ {
+ uint32_t log2MaxTrSize = cuGeom.log2CUSize - m_maxTUDepth;
+ bCheckSplit = log2TrSize > log2MaxTrSize;
+ }
+ }
+
bool bSplitPresentFlag = bCheckSplit && bCheckFull;
if (cu.m_partSize[0] != SIZE_2Nx2N && !tuDepth && bCheckSplit)
@@ -3194,6 +3326,8 @@ void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPa
singlePsyEnergy[TEXT_LUMA][0] = nonZeroPsyEnergyY;
cbfFlag[TEXT_LUMA][0] = !!numSigTSkipY;
bestTransformMode[TEXT_LUMA][0] = 1;
+ if (m_param->limitTU)
+ numSig[TEXT_LUMA][0] = numSigTSkipY;
uint32_t numCoeffY = 1 << (log2TrSize << 1);
memcpy(coeffCurY, m_tsCoeff, sizeof(coeff_t) * numCoeffY);
primitives.cu[partSize].copy_ss(curResiY, strideResiY, m_tsResidual, trSize);
@@ -3331,6 +3465,50 @@ void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPa
fullCost.rdcost = m_rdCost.calcPsyRdCost(fullCost.distortion, fullCost.bits, fullCost.energy);
else
fullCost.rdcost = m_rdCost.calcRdCost(fullCost.distortion, fullCost.bits);
+
+ if (m_param->limitTU && bCheckSplit)
+ {
+ // Stop recursion if the TU's energy level is minimal
+ uint32_t numCoeff = trSize * trSize;
+ if (cbfFlag[TEXT_LUMA][0] == 0)
+ bCheckSplit = false;
+ else if (numSig[TEXT_LUMA][0] < (numCoeff / 64))
+ {
+ uint32_t energy = 0;
+ for (uint32_t i = 0; i < numCoeff; i++)
+ energy += abs(coeffCurY[i]);
+ if (energy == numSig[TEXT_LUMA][0])
+ bCheckSplit = false;
+ }
+ }
+
+ if (bSaveTUData)
+ {
+ for (int plane = 0; plane < MAX_NUM_COMPONENT; plane++)
+ {
+ for (int part = 0; part < (m_csp == X265_CSP_I422) + 1; part++)
+ {
+ m_cacheTU.bestTransformMode[idx][plane][part] = bestTransformMode[plane][part];
+ m_cacheTU.cbfFlag[idx][plane][part] = cbfFlag[plane][part];
+ }
+ }
+ m_cacheTU.cost[idx] = fullCost;
+ m_entropyCoder.store(m_cacheTU.rqtStore[idx]);
+ }
+ }
+ if (bLoadTUData)
+ {
+ for (int plane = 0; plane < MAX_NUM_COMPONENT; plane++)
+ {
+ for (int part = 0; part < (m_csp == X265_CSP_I422) + 1; part++)
+ {
+ bestTransformMode[plane][part] = m_cacheTU.bestTransformMode[idx][plane][part];
+ cbfFlag[plane][part] = m_cacheTU.cbfFlag[idx][plane][part];
+ }
+ }
+ fullCost = m_cacheTU.cost[idx];
+ m_entropyCoder.load(m_cacheTU.rqtStore[idx]);
+ bCheckFull = true;
}
// code sub-blocks
@@ -3351,45 +3529,29 @@ void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPa
splitCost.bits = m_entropyCoder.getNumberOfWrittenBits();
}
- uint32_t qNumParts = 1 << (log2TrSize - 1 - LOG2_UNIT_SIZE) * 2;
- uint32_t ycbf = 0, ucbf = 0, vcbf = 0;
- for (uint32_t qIdx = 0, qPartIdx = absPartIdx; qIdx < 4; ++qIdx, qPartIdx += qNumParts)
- {
- estimateResidualQT(mode, cuGeom, qPartIdx, tuDepth + 1, resiYuv, splitCost, depthRange);
- ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1);
- if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)
- {
- ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1);
- vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1);
- }
- }
- cu.m_cbf[0][absPartIdx] |= ycbf << tuDepth;
- if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)
- {
- cu.m_cbf[1][absPartIdx] |= ucbf << tuDepth;
- cu.m_cbf[2][absPartIdx] |= vcbf << tuDepth;
- }
-
- // Here we were encoding cbfs and coefficients for splitted blocks. Since I have collected coefficient bits
- // for each individual blocks, only encoding cbf values. As I mentioned encoding chroma cbfs is different then luma.
- // But have one doubt that if coefficients are encoded in context at depth 2 (for example) and cbfs are encoded in context
- // at depth 0 (for example).
- m_entropyCoder.load(m_rqt[depth].rqtRoot);
- m_entropyCoder.resetBits();
-
- codeInterSubdivCbfQT(cu, absPartIdx, tuDepth, depthRange);
- uint32_t splitCbfBits = m_entropyCoder.getNumberOfWrittenBits();
- splitCost.bits += splitCbfBits;
-
- if (m_rdCost.m_psyRd)
- splitCost.rdcost = m_rdCost.calcPsyRdCost(splitCost.distortion, splitCost.bits, splitCost.energy);
- else
- splitCost.rdcost = m_rdCost.calcRdCost(splitCost.distortion, splitCost.bits);
-
- if (ycbf || ucbf || vcbf || !bCheckFull)
+ bool yCbCrCbf = splitTU(mode, cuGeom, absPartIdx, tuDepth, resiYuv, splitCost, depthRange, 0);
+ if (yCbCrCbf || !bCheckFull)
{
if (splitCost.rdcost < fullCost.rdcost)
{
+ if (m_limitTU & X265_TU_LIMIT_BFS)
+ {
+ uint32_t nextlog2TrSize = cuGeom.log2CUSize - (tuDepth + 1);
+ bool nextSplit = nextlog2TrSize > depthRange[0];
+ if (nextSplit)
+ {
+ m_entropyCoder.load(m_rqt[depth].rqtRoot);
+ splitCost.bits = splitCost.distortion = splitCost.rdcost = splitCost.energy = 0;
+ if (bSplitPresentFlag && (log2TrSize <= depthRange[1] && log2TrSize > depthRange[0]))
+ {
+ // Subdiv flag can be encoded at the start of analysis of split blocks.
+ m_entropyCoder.resetBits();
+ m_entropyCoder.codeTransformSubdivFlag(1, 5 - log2TrSize);
+ splitCost.bits = m_entropyCoder.getNumberOfWrittenBits();
+ }
+ splitTU(mode, cuGeom, absPartIdx, tuDepth, resiYuv, splitCost, depthRange, 1);
+ }
+ }
outCosts.distortion += splitCost.distortion;
outCosts.rdcost += splitCost.rdcost;
outCosts.bits += splitCost.bits;
diff --git a/source/encoder/search.h b/source/encoder/search.h
index 4c86f14..cbb5872 100644
--- a/source/encoder/search.h
+++ b/source/encoder/search.h
@@ -49,6 +49,8 @@
#define ProfileCounter(cu, count)
#endif
+#define NUM_SUBPART (MAX_TS_SIZE * 4) // 4 sub partitions * 4 depths
+
namespace X265_NS {
// private namespace
@@ -275,6 +277,9 @@ public:
uint32_t m_numLayers;
uint32_t m_refLagPixels;
+ int32_t m_maxTUDepth;
+ uint16_t m_limitTU;
+
int16_t m_sliceMaxY;
int16_t m_sliceMinY;
@@ -377,8 +382,17 @@ protected:
Cost() { rdcost = 0; bits = 0; distortion = 0; energy = 0; }
};
+ struct TUInfoCache
+ {
+ Cost cost[NUM_SUBPART];
+ uint32_t bestTransformMode[NUM_SUBPART][MAX_NUM_COMPONENT][2];
+ uint8_t cbfFlag[NUM_SUBPART][MAX_NUM_COMPONENT][2];
+ Entropy rqtStore[NUM_SUBPART];
+ } m_cacheTU;
+
uint64_t estimateNullCbfCost(sse_t dist, uint32_t psyEnergy, uint32_t tuDepth, TextType compId);
- void estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t depth, ShortYuv& resiYuv, Cost& costs, const uint32_t depthRange[2]);
+ bool splitTU(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& splitCost, const uint32_t depthRange[2], int32_t splitMore);
+ void estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t depth, ShortYuv& resiYuv, Cost& costs, const uint32_t depthRange[2], int32_t splitMore = -1);
// generate prediction, generate residual and recon. if bAllowSplit, find optimal RQT splits
void codeIntraLumaQT(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t absPartIdx, bool bAllowSplit, Cost& costs, const uint32_t depthRange[2]);
diff --git a/source/encoder/slicetype.cpp b/source/encoder/slicetype.cpp
index c761ab8..c7973fb 100644
--- a/source/encoder/slicetype.cpp
+++ b/source/encoder/slicetype.cpp
@@ -1617,7 +1617,7 @@ bool Lookahead::scenecutInternal(Lowres **frames, int p0, int p1, bool bRealScen
/* magic numbers pulled out of thin air */
float threshMin = (float)(threshMax * 0.25);
- double bias = 0.05;
+ double bias = m_param->scenecutBias;
if (bRealScenecut)
{
if (m_param->keyframeMin == m_param->keyframeMax)
diff --git a/source/input/y4m.cpp b/source/input/y4m.cpp
index a9adb44..105deac 100644
--- a/source/input/y4m.cpp
+++ b/source/input/y4m.cpp
@@ -280,7 +280,7 @@ bool Y4MInput::parseHeader()
{
c = ifs->get();
- if (c <= '9' && c >= '0')
+ if (c <= 'o' && c >= '0')
csp = csp * 10 + (c - '0');
else if (c == 'p')
{
@@ -300,9 +300,23 @@ bool Y4MInput::parseHeader()
break;
}
- if (d >= 8 && d <= 16)
- depth = d;
- colorSpace = (csp == 444) ? X265_CSP_I444 : (csp == 422) ? X265_CSP_I422 : X265_CSP_I420;
+ switch (csp)
+ {
+ case ('m'-'0')*100000 + ('o'-'0')*10000 + ('n'-'0')*1000 + ('o'-'0')*100 + 16:
+ colorSpace = X265_CSP_I400;
+ depth = 16;
+ break;
+
+ case ('m'-'0')*1000 + ('o'-'0')*100 + ('n'-'0')*10 + ('o'-'0'):
+ colorSpace = X265_CSP_I400;
+ depth = 8;
+ break;
+
+ default:
+ if (d >= 8 && d <= 16)
+ depth = d;
+ colorSpace = (csp == 444) ? X265_CSP_I444 : (csp == 422) ? X265_CSP_I422 : X265_CSP_I420;
+ }
break;
default:
@@ -324,7 +338,7 @@ bool Y4MInput::parseHeader()
if (width < MIN_FRAME_WIDTH || width > MAX_FRAME_WIDTH ||
height < MIN_FRAME_HEIGHT || height > MAX_FRAME_HEIGHT ||
(rateNum / rateDenom) < 1 || (rateNum / rateDenom) > MAX_FRAME_RATE ||
- colorSpace <= X265_CSP_I400 || colorSpace >= X265_CSP_COUNT)
+ colorSpace < X265_CSP_I400 || colorSpace >= X265_CSP_COUNT)
return false;
return true;
diff --git a/source/test/rate-control-tests.txt b/source/test/rate-control-tests.txt
index e12bc86..7d5a75d 100644
--- a/source/test/rate-control-tests.txt
+++ b/source/test/rate-control-tests.txt
@@ -21,6 +21,9 @@ NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-
big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
sita_1920x1080_30.yuv,--preset superfast --crf 25 --vbv-bufsize 3000 --vbv-maxrate 4000 --vbv-bufsize 5000 --hrd --crf-max 30
sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr
+BasketballDrive_1920x1080_50.y4m,--preset ultrafast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --no-wpp
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --no-wpp --aud --hrd --tune fast-decode
+sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr --no-wpp
@@ -38,4 +41,5 @@ SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv, --tune grain --preset ultrafas
RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 40 --pass 1, --preset faster --bitrate 200 --pass 2 -F4
CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --bitrate 2500 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 2500 --pass 2 -F4
RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --vbv-maxrate 1000 --vbv-bufsize 1000 --pass 1,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 700 --pass 3 -F4,--preset slow --bitrate 500 --vbv-maxrate 500 --vbv-bufsize 700 --pass 2 -F4
-
+sita_1920x1080_30.yuv, --preset ultrafast --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 1 --vbv-bufsize 7000 --vbv-maxrate 5000, --preset ultrafast --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 2 --vbv-bufsize 7000 --vbv-maxrate 5000 --repeat-headers
+sita_1920x1080_30.yuv, --preset medium --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 1 --vbv-bufsize 7000 --vbv-maxrate 5000 --repeat-headers --multi-pass-opt-rps, --preset medium --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 2 --vbv-bufsize 7000 --vbv-maxrate 5000 --repeat-headers --multi-pass-opt-rps
diff --git a/source/test/regression-tests.txt b/source/test/regression-tests.txt
index d7a457e..2b43e05 100644
--- a/source/test/regression-tests.txt
+++ b/source/test/regression-tests.txt
@@ -14,20 +14,21 @@
BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp --limit-modes
BasketballDrive_1920x1080_50.y4m,--preset veryfast --tune zerolatency --no-temporal-mvp
-BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
-BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless
+BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190 --slices 3
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless --tu-inter-depth 3 --limit-tu 1
BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-mode=save --bitrate 7000 --limit-modes,--preset medium --no-cutree --analysis-mode=load --bitrate 7000 --limit-modes
BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16 --limit-refs 1
-BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
+BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 --limit-tu 4
BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-mode=save --bitrate 7000,--preset slower --no-cutree --analysis-mode=load --bitrate 7000
-BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3
-BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-mode=save --bitrate 7000 --tskip-fast,--preset veryslow --no-cutree --analysis-mode=load --bitrate 7000 --tskip-fast
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 --limit-tu 3
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-mode=save --bitrate 7000 --tskip-fast --limit-tu 4,--preset veryslow --no-cutree --analysis-mode=load --bitrate 7000 --tskip-fast --limit-tu 4
BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop
+Coastguard-4k.y4m,--preset superfast --tune grain --pme --aq-strength 2 --merange 190
Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-mode=save --bitrate 15000,--preset veryfast --no-cutree --analysis-mode=load --bitrate 15000
-Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh
+Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh --slices 2
Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1
CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
@@ -41,13 +42,14 @@ CrowdRun_1920x1080_50_10bit_444.yuv,--preset ultrafast --weightp --no-wpp --no-o
CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers --limit-refs 2
CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1 --limit-modes
-CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut --limit-tu 1
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16 --limit-modes
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless
DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
-DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3 --tu-inter-depth 4 --limit-tu 3
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --no-cutree --analysis-mode=save --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1,--preset fast --no-cutree --analysis-mode=load --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1
FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2
FourPeople_1280x720_60.y4m,--preset veryfast --aq-mode 2 --aq-strength 1.5 --qg-size 8
FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd
@@ -61,24 +63,27 @@ Kimono1_1920x1080_24_10bit_444.yuv,--preset medium --min-cu-size 32
KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16 --limit-refs 1
KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
-KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes
+KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes --limit-tu 1
NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2
NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-mode=save --bitrate 9000,--preset slow --no-cutree --analysis-mode=load --bitrate 9000
News-4k.y4m,--preset ultrafast --no-cutree --analysis-mode=save --bitrate 15000,--preset ultrafast --no-cutree --analysis-mode=load --bitrate 15000
News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
+News-4k.y4m,--preset superfast --slices 4 --aq-mode 0
News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16
News-4k.y4m,--preset veryslow --no-rskip
+News-4k.y4m,--preset veryslow --pme --crf 40
OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp
OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
ParkScene_1920x1080_24_10bit_444.yuv,--preset superfast --weightp --lookahead-slices 4
ParkScene_1920x1080_24.y4m,--preset medium --qp 40 --rdpenalty 2 --tu-intra-depth 3
+ParkScene_1920x1080_24.y4m,--preset medium --pme --tskip-fast --tskip --min-keyint 48 --weightb --limit-refs 3
ParkScene_1920x1080_24.y4m,--preset slower --no-weightp
RaceHorses_416x240_30.y4m,--preset superfast --no-cutree
RaceHorses_416x240_30.y4m,--preset medium --tskip-fast --tskip
-RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0
-RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip --limit-refs 3
+RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0 --limit-tu 2
+RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip --limit-refs 3 --limit-tu 3
RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --tune psnr --limit-refs 1
RaceHorses_416x240_30_10bit.yuv,--preset veryfast --weightb
RaceHorses_416x240_30_10bit.yuv,--preset faster --rdoq-level 0 --dither
@@ -108,7 +113,7 @@ ducks_take_off_420_720p50.y4m,--preset slower --no-wpp
ducks_take_off_420_720p50.y4m,--preset veryslow --constrained-intra --bframes 2
mobile_calendar_422_ntsc.y4m,--preset superfast --weightp
mobile_calendar_422_ntsc.y4m,--preset medium --bitrate 500 -F4
-mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast
+mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast --limit-tu 4
mobile_calendar_422_ntsc.y4m,--preset veryslow --tskip --limit-refs 2
old_town_cross_444_720p50.y4m,--preset ultrafast --weightp --min-cu 32
old_town_cross_444_720p50.y4m,--preset superfast --weightp --min-cu 16 --limit-modes
@@ -118,6 +123,7 @@ old_town_cross_444_720p50.y4m,--preset fast --no-cutree --analysis-mode=save --b
old_town_cross_444_720p50.y4m,--preset medium --keyint -1 --no-weightp --ref 6
old_town_cross_444_720p50.y4m,--preset slow --rdoq-level 1 --early-skip --ref 7 --no-b-pyramid
old_town_cross_444_720p50.y4m,--preset slower --crf 4 --cu-lossless
+old_town_cross_444_720p50.y4m,--preset veryslow --max-tu-size 4 --min-cu-size 32 --limit-tu 4
parkrun_ter_720p50.y4m,--preset medium --no-open-gop --sao-non-deblock --crf 4 --cu-lossless
parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
silent_cif_420.y4m,--preset superfast --weightp --rect
@@ -133,6 +139,11 @@ washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4 --limit-refs 1
vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16
washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32 --limit-refs 1
washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3 --limit-modes
+washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3 --limit-modes --slices 2
+Kimono1_1920x1080_24_400.yuv,--preset ultrafast --slices 1 --weightp --tu-intra-depth 4
+Kimono1_1920x1080_24_400.yuv,--preset medium --rdoq-level 0 --limit-refs 3 --slices 2
+Kimono1_1920x1080_24_400.yuv,--preset veryslow --crf 4 --cu-lossless --slices 2 --limit-refs 3 --limit-modes
+Kimono1_1920x1080_24_400.yuv,--preset placebo --ctu 32 --max-tu-size 8 --limit-tu 2
# Main12 intraCost overflow bug test
720p50_parkrun_ter.y4m,--preset medium
@@ -141,4 +152,7 @@ washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3 --lim
CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --interlace tff
CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --interlace bff
+#SEA Implementation Test
+silent_cif_420.y4m,--preset veryslow --me 4
+big_buck_bunny_360p24.y4m,--preset superfast --me 4
# vim: tw=200
diff --git a/source/test/smoke-tests.txt b/source/test/smoke-tests.txt
index 409f1e7..fa92abb 100644
--- a/source/test/smoke-tests.txt
+++ b/source/test/smoke-tests.txt
@@ -3,10 +3,9 @@
# consider VBV tests a failure if new bitrate is more than 5% different
# from the old bitrate
# vbv-tolerance = 0.05
-
big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
-big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --pme --qg-size 16
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --qg-size 16
washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1 --qg-size 16
washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
@@ -16,9 +15,10 @@ old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size
RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --max-tu-size 8
RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
-CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16 --tu-inter-depth 2 --limit-tu 3
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=fast --weightb --interlace bff
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryslow --limit-ref 1 --limit-mode --tskip --limit-tu 1
# Main12 intraCost overflow bug test
720p50_parkrun_ter.y4m,--preset medium
diff --git a/source/x265-extras.cpp b/source/x265-extras.cpp
index dac4c1e..653d72e 100644
--- a/source/x265-extras.cpp
+++ b/source/x265-extras.cpp
@@ -64,6 +64,8 @@ FILE* x265_csvlog_open(const x265_api& api, const x265_param& param, const char*
fprintf(csvfp, "Encode Order, Type, POC, QP, Bits, Scenecut, ");
if (param.rc.rateControlMode == X265_RC_CRF)
fprintf(csvfp, "RateFactor, ");
+ if (param.rc.vbvBufferSize)
+ fprintf(csvfp, "BufferFill, ");
if (param.bEnablePsnr)
fprintf(csvfp, "Y PSNR, U PSNR, V PSNR, YUV PSNR, ");
if (param.bEnableSsim)
@@ -132,6 +134,8 @@ void x265_csvlog_frame(FILE* csvfp, const x265_param& param, const x265_picture&
fprintf(csvfp, "%d, %c-SLICE, %4d, %2.2lf, %10d, %d,", frameStats->encoderOrder, frameStats->sliceType, frameStats->poc, frameStats->qp, (int)frameStats->bits, frameStats->bScenecut);
if (param.rc.rateControlMode == X265_RC_CRF)
fprintf(csvfp, "%.3lf,", frameStats->rateFactor);
+ if (param.rc.vbvBufferSize)
+ fprintf(csvfp, "%.3lf,", frameStats->bufferFill);
if (param.bEnablePsnr)
fprintf(csvfp, "%.3lf, %.3lf, %.3lf, %.3lf,", frameStats->psnrY, frameStats->psnrU, frameStats->psnrV, frameStats->psnr);
if (param.bEnableSsim)
@@ -187,7 +191,7 @@ void x265_csvlog_frame(FILE* csvfp, const x265_param& param, const x265_picture&
fflush(stderr);
}
-void x265_csvlog_encode(FILE* csvfp, const x265_api& api, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv)
+void x265_csvlog_encode(FILE* csvfp, const char* version, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv)
{
if (!csvfp)
return;
@@ -277,7 +281,7 @@ void x265_csvlog_encode(FILE* csvfp, const x265_api& api, const x265_param& para
else
fprintf(csvfp, " -, -, -, -, -, -, -,");
- fprintf(csvfp, " %-6u, %-6u, %s\n", stats.maxCLL, stats.maxFALL, api.version_str);
+ fprintf(csvfp, " %-6u, %-6u, %s\n", stats.maxCLL, stats.maxFALL, version);
}
/* The dithering algorithm is based on Sierra-2-4A error diffusion.
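The x265-extras.cpp hunks above add a BufferFill column that is gated on `param.rc.vbvBufferSize` in both the header writer and the per-frame writer. A minimal sketch of that pattern follows (hypothetical function names, not the x265 API); the point is that both writers must test the same condition or the CSV columns drift out of alignment:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the patched CSV logging: the BufferFill column is emitted
 * in the header and in each row only when VBV is enabled, using the
 * same condition in both places. */
static void write_header(char *out, size_t n, int vbvBufferSize)
{
    snprintf(out, n, "Encode Order, Type, POC, QP, Bits, Scenecut, ");
    if (vbvBufferSize)
        snprintf(out + strlen(out), n - strlen(out), "BufferFill, ");
}

static void write_row(char *out, size_t n, int vbvBufferSize, double fill)
{
    snprintf(out, n, "1, P-SLICE, 1, 30.00, 1000, 0,");
    if (vbvBufferSize)
        snprintf(out + strlen(out), n - strlen(out), " %.3f,", fill);
}
```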
diff --git a/source/x265-extras.h b/source/x265-extras.h
index 8b90ca4..a63e178 100644
--- a/source/x265-extras.h
+++ b/source/x265-extras.h
@@ -53,7 +53,7 @@ LIBAPI void x265_csvlog_frame(FILE* csvfp, const x265_param& param, const x265_p
/* Log final encode statistics to the CSV file handle. 'argc' and 'argv' are
* intended to be command line arguments passed to the encoder. Encode
* statistics should be queried from the encoder just prior to closing it. */
-LIBAPI void x265_csvlog_encode(FILE* csvfp, const x265_api& api, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv);
+LIBAPI void x265_csvlog_encode(FILE* csvfp, const char* version, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv);
/* In-place downshift from a bit-depth greater than 8 to a bit-depth of 8, using
* the residual bits to dither each row. */
diff --git a/source/x265.cpp b/source/x265.cpp
index 26afb73..e408def 100644
--- a/source/x265.cpp
+++ b/source/x265.cpp
@@ -746,7 +746,7 @@ fail:
api->encoder_get_stats(encoder, &stats, sizeof(stats));
if (cliopt.csvfpt && !b_ctrl_c)
- x265_csvlog_encode(cliopt.csvfpt, *api, *param, stats, cliopt.csvLogLevel, argc, argv);
+ x265_csvlog_encode(cliopt.csvfpt, api->version_str, *param, stats, cliopt.csvLogLevel, argc, argv);
api->encoder_close(encoder);
int64_t second_largest_pts = 0;
diff --git a/source/x265.h b/source/x265.h
index 2e2e38c..e620ba0 100644
--- a/source/x265.h
+++ b/source/x265.h
@@ -137,6 +137,7 @@ typedef struct x265_frame_stats
double avgPsyEnergy;
double avgResEnergy;
double avgLumaLevel;
+ double bufferFill;
uint64_t bits;
int encoderOrder;
int poc;
@@ -289,6 +290,7 @@ typedef enum
X265_HEX_SEARCH,
X265_UMH_SEARCH,
X265_STAR_SEARCH,
+ X265_SEA,
X265_FULL_SEARCH
} X265_ME_METHODS;
@@ -334,6 +336,9 @@ typedef enum
#define X265_CPU_NEON 0x0000002 /* ARM NEON */
#define X265_CPU_FAST_NEON_MRC 0x0000004 /* Transfer from NEON to ARM register is fast (Cortex-A9) */
+/* IBM Power8 */
+#define X265_CPU_ALTIVEC 0x0000001
+
#define X265_MAX_SUBPEL_LEVEL 7
/* Log level */
@@ -351,6 +356,10 @@ typedef enum
#define X265_REF_LIMIT_DEPTH 1
#define X265_REF_LIMIT_CU 2
+#define X265_TU_LIMIT_BFS 1
+#define X265_TU_LIMIT_DFS 2
+#define X265_TU_LIMIT_NEIGH 4
+
#define X265_BFRAME_MAX 16
#define X265_MAX_FRAME_THREADS 16
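The new `X265_TU_LIMIT_*` constants above are bit flags (1, 2, 4), suggesting that a `--limit-tu` level is decoded into a combination of strategies. The level-to-flag mapping below is an illustrative assumption, not taken from this patch:

```c
/* Flags copied from the diff above; the switch is a hypothetical
 * decoding of --limit-tu levels into those flags, for illustration
 * only -- the encoder's actual mapping is not shown in this hunk. */
#define X265_TU_LIMIT_BFS   1
#define X265_TU_LIMIT_DFS   2
#define X265_TU_LIMIT_NEIGH 4

static int tu_limit_flags(int level)
{
    switch (level)
    {
    case 1: return X265_TU_LIMIT_BFS;
    case 2: return X265_TU_LIMIT_DFS;
    case 3: return X265_TU_LIMIT_NEIGH;
    case 4: return X265_TU_LIMIT_NEIGH | X265_TU_LIMIT_DFS;
    default: return 0; /* 0 disables early TU exit */
    }
}
```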
@@ -456,7 +465,7 @@ typedef struct x265_stats
} x265_stats;
/* String values accepted by x265_param_parse() (and CLI) for various parameters */
-static const char * const x265_motion_est_names[] = { "dia", "hex", "umh", "star", "full", 0 };
+static const char * const x265_motion_est_names[] = { "dia", "hex", "umh", "star", "sea", "full", 0 };
static const char * const x265_source_csp_names[] = { "i400", "i420", "i422", "i444", "nv12", "nv16", 0 };
static const char * const x265_video_format_names[] = { "component", "pal", "ntsc", "secam", "mac", "undef", 0 };
static const char * const x265_fullrange_names[] = { "limited", "full", 0 };
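Note that `"sea"` is inserted into `x265_motion_est_names` at the same position `X265_SEA` occupies in the enum, so string-to-method lookup by index keeps working; this is also why the new regression tests use `--me 4`. A self-contained mirror of both declarations (values copied from the diff above):

```c
#include <string.h>

/* Mirror of the patched enum and names table: the array index of each
 * name must equal the enum value of the corresponding method. */
typedef enum
{
    X265_DIA_SEARCH,
    X265_HEX_SEARCH,
    X265_UMH_SEARCH,
    X265_STAR_SEARCH,
    X265_SEA,
    X265_FULL_SEARCH
} X265_ME_METHODS;

static const char * const x265_motion_est_names[] =
    { "dia", "hex", "umh", "star", "sea", "full", 0 };
```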
@@ -823,6 +832,10 @@ typedef struct x265_param
* compressed by the DCT transforms, at the expense of much more compute */
uint32_t tuQTMaxIntraDepth;
+ /* Enable early exit decisions for inter coded blocks to avoid recursing to
+ * higher TU depths. Default: 0 */
+ uint32_t limitTU;
+
/* Set the amount of rate-distortion analysis to use within quant. 0 implies
* no rate-distortion optimization. At level 1 rate-distortion cost is used to
* find optimal rounding values for each level (and allows psy-rdoq to be
@@ -898,9 +911,9 @@ typedef struct x265_param
/* Limit modes analyzed for each CU using cost metrics from the 4 sub-CUs */
uint32_t limitModes;
- /* ME search method (DIA, HEX, UMH, STAR, FULL). The search patterns
+ /* ME search method (DIA, HEX, UMH, STAR, SEA, FULL). The search patterns
* (methods) are sorted in increasing complexity, with diamond being the
- * simplest and fastest and full being the slowest. DIA, HEX, and UMH were
+ * simplest and fastest and full being the slowest. DIA, HEX, UMH and SEA were
* adapted from x264 directly. STAR is an adaption of the HEVC reference
* encoder's three step search, while full is a naive exhaustive search. The
* default is the star search, it has a good balance of performance and
@@ -1300,15 +1313,28 @@ typedef struct x265_param
/* Maximum of the picture order count */
int log2MaxPocLsb;
- /* Dicard SEI messages when printing */
- int bDiscardSEI;
-
- /* Control removing optional vui information (timing, HRD info) to get low bitrate */
- int bDiscardOptionalVUI;
+ /* Emit VUI Timing info, an optional VUI field */
+ int bEmitVUITimingInfo;
+
+ /* Emit HRD Timing info */
+ int bEmitVUIHRDInfo;
/* Maximum count of Slices of picture, the value range is [1, maximum rows] */
unsigned int maxSlices;
+ /* Optimize QP in PPS based on statistics from previous GOP */
+ int bOptQpPPS;
+
+ /* Optimize ref list length in PPS based on stats from previous GOP */
+ int bOptRefListLengthPPS;
+
+ /* Enable storing commonly used RPS in SPS in multi pass mode */
+ int bMultiPassOptRPS;
+
+ /* This value represents the percentage difference between the inter cost and
+ * intra cost of a frame used in scenecut detection. Default 5. */
+ double scenecutBias;
+
} x265_param;
/* x265_param_alloc:
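The x265.h hunk adds `limitTU` (default 0, per its comment) and `scenecutBias` (default 5, a percentage, per its comment) to `x265_param`. A small sketch of range validation for these two fields; the struct name and the exact accepted ranges (taken from the CLI help text `<0..4>` and `<0..100.0>`) are assumptions, not the encoder's actual check:

```c
/* Hypothetical stand-in for the two new x265_param fields, with the
 * defaults stated in the patch's comments. */
typedef struct
{
    unsigned int limitTU;  /* early TU-recursion exit level, 0..4 */
    double scenecutBias;   /* inter/intra cost difference, percent */
} params_sketch;

/* Returns 0 when both fields are in their documented ranges. */
static int validate(const params_sketch *p)
{
    if (p->limitTU > 4)
        return -1;
    if (p->scenecutBias < 0.0 || p->scenecutBias > 100.0)
        return -1;
    return 0;
}
```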
diff --git a/source/x265cli.h b/source/x265cli.h
index bec9cd2..7f933c2 100644
--- a/source/x265cli.h
+++ b/source/x265cli.h
@@ -85,6 +85,7 @@ static const struct option long_options[] =
{ "max-tu-size", required_argument, NULL, 0 },
{ "tu-intra-depth", required_argument, NULL, 0 },
{ "tu-inter-depth", required_argument, NULL, 0 },
+ { "limit-tu", required_argument, NULL, 0 },
{ "me", required_argument, NULL, 0 },
{ "subme", required_argument, NULL, 'm' },
{ "merange", required_argument, NULL, 0 },
@@ -120,6 +121,7 @@ static const struct option long_options[] =
{ "min-keyint", required_argument, NULL, 'i' },
{ "scenecut", required_argument, NULL, 0 },
{ "no-scenecut", no_argument, NULL, 0 },
+ { "scenecut-bias", required_argument, NULL, 0 },
{ "intra-refresh", no_argument, NULL, 0 },
{ "rc-lookahead", required_argument, NULL, 0 },
{ "lookahead-slices", required_argument, NULL, 0 },
@@ -208,8 +210,14 @@ static const struct option long_options[] =
{ "min-luma", required_argument, NULL, 0 },
{ "max-luma", required_argument, NULL, 0 },
{ "log2-max-poc-lsb", required_argument, NULL, 8 },
- { "discard-sei", no_argument, NULL, 0 },
- { "discard-vui", no_argument, NULL, 0 },
+ { "vui-timing-info", no_argument, NULL, 0 },
+ { "no-vui-timing-info", no_argument, NULL, 0 },
+ { "vui-hrd-info", no_argument, NULL, 0 },
+ { "no-vui-hrd-info", no_argument, NULL, 0 },
+ { "opt-qp-pps", no_argument, NULL, 0 },
+ { "no-opt-qp-pps", no_argument, NULL, 0 },
+ { "opt-ref-list-length-pps", no_argument, NULL, 0 },
+ { "no-opt-ref-list-length-pps", no_argument, NULL, 0 },
{ "no-dither", no_argument, NULL, 0 },
{ "dither", no_argument, NULL, 0 },
{ "no-repeat-headers", no_argument, NULL, 0 },
@@ -229,6 +237,8 @@ static const struct option long_options[] =
{ "pass", required_argument, NULL, 0 },
{ "slow-firstpass", no_argument, NULL, 0 },
{ "no-slow-firstpass", no_argument, NULL, 0 },
+ { "multi-pass-opt-rps", no_argument, NULL, 0 },
+ { "no-multi-pass-opt-rps", no_argument, NULL, 0 },
{ "analysis-mode", required_argument, NULL, 0 },
{ "analysis-file", required_argument, NULL, 0 },
{ "strict-cbr", no_argument, NULL, 0 },
@@ -317,6 +327,7 @@ static void showHelp(x265_param *param)
H0(" --max-tu-size <32|16|8|4> Maximum TU size (WxH). Default %d\n", param->maxTUSize);
H0(" --tu-intra-depth <integer> Max TU recursive depth for intra CUs. Default %d\n", param->tuQTMaxIntraDepth);
H0(" --tu-inter-depth <integer> Max TU recursive depth for inter CUs. Default %d\n", param->tuQTMaxInterDepth);
+ H0(" --limit-tu <0..4> Enable early exit from TU recursion for inter coded blocks. Default %d\n", param->limitTU);
H0("\nAnalysis:\n");
H0(" --rd <1..6> Level of RDO in mode decision 1:least....6:full RDO. Default %d\n", param->rdLevel);
H0(" --[no-]psy-rd <0..5.0> Strength of psycho-visual rate distortion optimization, 0 to disable. Default %.1f\n", param->psyRd);
@@ -357,6 +368,7 @@ static void showHelp(x265_param *param)
H0("-i/--min-keyint <integer> Scenecuts closer together than this are coded as I, not IDR. Default: auto\n");
H0(" --no-scenecut Disable adaptive I-frame decision\n");
H0(" --scenecut <integer> How aggressively to insert extra I-frames. Default %d\n", param->scenecutThreshold);
+ H1(" --scenecut-bias <0..100.0> Bias for scenecut detection. Default %.2f\n", param->scenecutBias);
H0(" --intra-refresh Use Periodic Intra Refresh instead of IDR frames\n");
H0(" --rc-lookahead <integer> Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth);
H1(" --lookahead-slices <0..16> Number of slices to use per lookahead cost estimate. Default %d\n", param->lookaheadSlices);
@@ -448,8 +460,11 @@ static void showHelp(x265_param *param)
H0(" --[no-]aud Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
H1(" --hash <integer> Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI);
H0(" --log2-max-poc-lsb <integer> Maximum of the picture order count\n");
- H0(" --discard-sei Discard SEI packets in bitstream. Default %s\n", OPT(param->bDiscardSEI));
- H0(" --discard-vui Discard optional VUI information from the bistream. Default %s\n", OPT(param->bDiscardOptionalVUI));
+ H0(" --[no-]vui-timing-info Emit VUI timing information in the bitstream. Default %s\n", OPT(param->bEmitVUITimingInfo));
+ H0(" --[no-]vui-hrd-info Emit VUI HRD information in the bitstream. Default %s\n", OPT(param->bEmitVUIHRDInfo));
+ H0(" --[no-]opt-qp-pps Dynamically optimize QP in PPS (instead of default 26) based on QPs in previous GOP. Default %s\n", OPT(param->bOptQpPPS));
+ H0(" --[no-]opt-ref-list-length-pps Dynamically set L0 and L1 ref list length in PPS (instead of default 0) based on values in last GOP. Default %s\n", OPT(param->bOptRefListLengthPPS));
+ H0(" --[no-]multi-pass-opt-rps Enable storing commonly used RPS in SPS in multi pass mode. Default %s\n", OPT(param->bMultiPassOptRPS));
H1("\nReconstructed video options (debugging):\n");
H1("-r/--recon <filename> Reconstructed raw image YUV or Y4M output file name\n");
H1(" --recon-depth <integer> Bit-depth of reconstructed raw image file. Defaults to input bit depth, or 8 if Y4M\n");
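The x265cli.h hunks all follow the same `long_options` pattern: entries with `val` 0 are dispatched by name via the long-option index that `getopt_long` returns. A runnable sketch of that pattern with one of the new options; the table is an illustrative two-entry subset, not the full x265 option table:

```c
#include <getopt.h>
#include <stdlib.h>
#include <string.h>

/* Minimal reproduction of the x265cli.h dispatch style: val == 0
 * entries are identified by comparing the matched option's name. */
static int parse_limit_tu(int argc, char **argv)
{
    static const struct option opts[] = {
        { "limit-tu",      required_argument, NULL, 0 },
        { "scenecut-bias", required_argument, NULL, 0 },
        { NULL, 0, NULL, 0 }
    };
    int limitTU = 0, idx = 0, c;

    optind = 1; /* reset scanner so the helper can be called repeatedly */
    while ((c = getopt_long(argc, argv, "", opts, &idx)) != -1)
        if (c == 0 && !strcmp(opts[idx].name, "limit-tu"))
            limitTU = atoi(optarg);
    return limitTU;
}
```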
--
x265 packaging