c++ - SSE / Optimisation - duplicating array into larger array




I'm trying to optimize the following function. Basically, it takes a line of 32-bit ints, duplicates each int into a larger destination array, and then duplicates each line:

    for(int i = 0; i < numlines; i++)
    {
        pstartofline = pdest;
        for(int j = 0; j < intsperlinesrc; j++)
        {
            *pdest = *psrc;  // re-create pixel in fullsizebuffer
            pdest++;         // move dest ptr to next pixel
            *pdest = *psrc;  // re-create pixel in fullsizebuffer one more time
            pdest++;         // move src and dst pointers to next pixels
            psrc++;
        }
        // duplicate the line just written (2*intsperlinesrc ints of 4 bytes each) into the next line of pdest
        memcpy(pdest, pstartofline, (8*intsperlinesrc) );
        pdest = pdest + (2*intsperlinesrc);  // move pdest to start of next line
    }

Effectively, this scales the image to 2x its original size in both dimensions. It strikes me that this should benefit massively from SIMD, but I cannot seem to find the right set of intrinsics to assist me in this specific case.

Would anyone care to help me out? Or is such a simple operation so memory-limited that refactoring it with SIMD would be a waste?

Yes, this section of code ends up running in multiple threads; it is heavily multi-threaded, and I think SIMD optimization may be more helpful.

Cheers for any help / advice,

James

Your current operation is memory bandwidth bound.

If you can find a way to not process the entire image at once, but instead process it in blocks (e.g. 16x16 or 32x32 pixel blocks) and do your other computations on each block, you may be able to create operations that are less memory bandwidth bound.

But if you have to process the entire image, there are a few things you should consider in order to achieve maximum memory bandwidth:

First, memory bandwidth bound operations don't scale with the number of cores; they scale with the number of sockets. If you have a dual-socket system, the memory bandwidth is twice that of a single-socket system (assuming both sockets use the same processor, which they usually do). However, achieving twice the bandwidth can be tricky.

Second, the memcpy function is typically not optimized for copying big sizes. One of the main reasons is that many implementations of it don't use non-temporal stores. The rule of thumb for non-temporal stores is to use them when the size is larger than half the size of the slowest cache. Let's assume your processor has a 12 MB L3 cache. If the size of the destination image is larger than 6 MB, you should consider using non-temporal stores. That is the case here, since your code is writing 32 MB.
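To make the arithmetic of that rule of thumb concrete, here is a minimal sketch. The helper name `prefer_nontemporal_stores` and the constant `MiB` are hypothetical, invented for illustration, and the cache size is something you would have to look up for your own processor. It only encodes the heuristic from the 12 MB L3 example (threshold of 6 MB); it is not a measured result.

```cpp
#include <cstddef>

// Hypothetical helper (an assumption, not part of any library): returns true when
// the number of bytes written exceeds half of the last-level cache size.
constexpr bool prefer_nontemporal_stores(std::size_t bytes_written,
                                         std::size_t last_level_cache_bytes) {
    return bytes_written > last_level_cache_bytes / 2;
}

constexpr std::size_t MiB = 1024 * 1024;

// With the 12 MB L3 cache assumed in the text, the threshold is 6 MB.
static_assert(prefer_nontemporal_stores(32 * MiB, 12 * MiB),
              "a 32 MB destination is above the 6 MB threshold");
static_assert(!prefer_nontemporal_stores(4 * MiB, 12 * MiB),
              "a 4 MB destination is below the 6 MB threshold");
```

Treat the result as a starting point for benchmarking rather than a guarantee, since the best crossover point depends on the machine.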

Here is an illustration of how you can use both SSE2 and non-temporal stores:

    #include <emmintrin.h>  // SSE2 intrinsics, also provides _mm_malloc / _mm_free
    #include <cstdio>
    #include <cstdlib>

    int main() {
        int n = 16;
        int *src = (int*)_mm_malloc(n*sizeof(int), 16);    // 16 byte aligned
        int *dst = (int*)_mm_malloc(2*n*sizeof(int), 16);  // 16 byte aligned
        for(int i=0; i<n; i++) src[i] = rand();

        for(int i=0; i<n; i+=4) {
            __m128i x  = _mm_load_si128((__m128i*)&src[i]);
            __m128i lo = _mm_shuffle_epi32(x, 0x50);  // 0x50 = 1100 in base 4
            __m128i hi = _mm_shuffle_epi32(x, 0xfa);  // 0xfa = 3322 in base 4
            _mm_stream_si128((__m128i*)&dst[2*i+0], lo);  // non-temporal store
            _mm_stream_si128((__m128i*)&dst[2*i+4], hi);  // non-temporal store
            //_mm_store_si128((__m128i*)&dst[2*i+0], lo);
            //_mm_store_si128((__m128i*)&dst[2*i+4], hi);
        }
        //for(int i=0; i<n; i++) printf("%x ", src[i]); printf("\n");
        //for(int i=0; i<(2*n); i++) printf("%x ", dst[i]); printf("\n");

        _mm_free(src);
        _mm_free(dst);
    }
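The two shuffle constants can be hard to read as hex. A small sketch of the same shuffle written with the standard `_MM_SHUFFLE` macro may help: each of its four arguments is a 2-bit source lane index, listed from the highest output lane down to the lowest. The helper name `duplicate4` is hypothetical, and it uses unaligned loads and regular stores only so that it works on any pointers.

```cpp
#include <emmintrin.h>  // SSE2: _mm_shuffle_epi32, _mm_loadu_si128, _mm_storeu_si128
#include <xmmintrin.h>  // _MM_SHUFFLE

// The hex constants from the answer are the same values the macro produces.
static_assert(_MM_SHUFFLE(1, 1, 0, 0) == 0x50, "output lanes (low to high): 0,0,1,1");
static_assert(_MM_SHUFFLE(3, 3, 2, 2) == 0xfa, "output lanes (low to high): 2,2,3,3");

// Hypothetical helper for illustration: reads 4 ints from src and writes 8 ints
// to dst, with every source value appearing twice in a row.
inline void duplicate4(const int* src, int* dst) {
    __m128i x  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i lo = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 1, 0, 0));  // s0 s0 s1 s1
    __m128i hi = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 2, 2));  // s2 s2 s3 s3
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),     lo);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4), hi);
}
```

For example, calling `duplicate4` on `{1, 2, 3, 4}` fills the destination with `{1, 1, 2, 2, 3, 3, 4, 4}`.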

In your case, replace n with the number of pixels. If n is not a multiple of 4 you have to do a little cleanup, which I did not do here. Non-temporal stores have to be 16-byte aligned, which is why I aligned dst. However, src does not have to be 16-byte aligned: you can use _mm_loadu_si128 and not align src.
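Putting those remarks together, a minimal sketch of the full 2x upscale might look like the following. The function names `upscale_line_2x` and `upscale_2x` are hypothetical, and two choices here are mine rather than the answer's: each expanded vector is written straight to both destination rows instead of calling memcpy afterwards, and a final `_mm_sfence` is issued after the streaming stores. It assumes `pdest` is 16-byte aligned and `intsperlinesrc` is even, so that every destination row starts on a 16-byte boundary; `psrc` may be unaligned.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in _mm_sfence)
#include <cstddef>

// Hypothetical helper: doubles one source line of n ints horizontally and writes
// the result to two destination rows. Assumes dst0 and dst1 are 16-byte aligned.
inline void upscale_line_2x(const int* src, int* dst0, int* dst1, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i x  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i lo = _mm_shuffle_epi32(x, 0x50);  // s0 s0 s1 s1
        __m128i hi = _mm_shuffle_epi32(x, 0xfa);  // s2 s2 s3 s3
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst0 + 2 * i),     lo);
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst0 + 2 * i + 4), hi);
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst1 + 2 * i),     lo);
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst1 + 2 * i + 4), hi);
    }
    // Scalar cleanup for the last n % 4 pixels.
    for (; i < n; ++i) {
        dst0[2 * i] = dst0[2 * i + 1] = src[i];
        dst1[2 * i] = dst1[2 * i + 1] = src[i];
    }
}

// Hypothetical wrapper: scales a numlines x intsperlinesrc image to twice the
// size in both dimensions. Assumes pdest is 16-byte aligned and intsperlinesrc
// is even (so each destination row of 2*intsperlinesrc ints stays aligned).
inline void upscale_2x(const int* psrc, int* pdest,
                       std::size_t numlines, std::size_t intsperlinesrc) {
    const std::size_t dststride = 2 * intsperlinesrc;  // ints per destination row
    for (std::size_t y = 0; y < numlines; ++y) {
        int* row0 = pdest + (2 * y) * dststride;
        upscale_line_2x(psrc + y * intsperlinesrc, row0, row0 + dststride,
                        intsperlinesrc);
    }
    _mm_sfence();  // order the non-temporal stores before later stores (e.g. a "done" flag)
}
```

Whether writing both rows directly beats the memcpy approach is something to measure on your own hardware; the sketch only shows one way to handle alignment and the leftover pixels.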

Once you achieve maximum bandwidth with a single thread, and assuming you have a multi-socket system, you should try to achieve maximum bandwidth across both sockets. I don't have a lot of experience with that, but I think it can be achieved using numactl. See why-doesnt-this-code-scale-linearly for an example.

c++ arrays image-processing optimization sse
