python - numpy performance differences between Linux and Windows
I am trying to run sklearn.decomposition.TruncatedSVD() on 2 different computers and understand the performance differences.
Computer 1 (Windows 7, physical computer)

    OS Name                          Microsoft Windows 7 Professional
    System Type                      x64-based PC
    Processor                        Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 3401 MHz, 4 Core(s), 8 Logical Processor(s)
    Installed Physical Memory (RAM)  8.00 GB
    Total Physical Memory            7.89 GB

Computer 2 (Debian, on Amazon cloud)

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                8

    width: 64 bits
    capabilities: ldt16 vsyscall32
    *-core
        description: Motherboard
        physical id: 0
    *-memory
        description: System memory
        physical id: 0
        size: 29GiB
    *-cpu
        product: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
        vendor: Intel Corp.
        physical id: 1
        bus info: cpu@0
        width: 64 bits

Computer 3 (Windows 2008 R2, on Amazon cloud)

    OS Name                          Microsoft Windows Server 2008 R2 Datacenter
    Version                          6.1.7601 Service Pack 1 Build 7601
    System Type                      x64-based PC
    Processor                        Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2500 MHz, 4 Core(s), 8 Logical Processor(s)
    Installed Physical Memory (RAM)  30.0 GB
Both computers are running Python 3.2 and identical sklearn, numpy and scipy versions.

I ran cProfile as follows:

    print(vectors.shape)
    >>> (7500, 2042)

    _decomp = TruncatedSVD(n_components=680, random_state=1)
    global _o
    _o = _decomp
    cProfile.runctx('_o.fit_transform(vectors)', globals(), locals(), sort=1)
Computer 1 output

    >>> 833 function calls in 1.710 seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.767    0.767    0.782    0.782 decomp_svd.py:15(svd)
            1    0.249    0.249    0.249    0.249 {method 'enable' of '_lsprof.Profiler' objects}
            1    0.183    0.183    0.183    0.183 {method 'normal' of 'mtrand.RandomState' objects}
            6    0.174    0.029    0.174    0.029 {built-in method csr_matvecs}
            6    0.123    0.021    0.123    0.021 {built-in method csc_matvecs}
            2    0.110    0.055    0.110    0.055 decomp_qr.py:14(safecall)
            1    0.035    0.035    0.035    0.035 {built-in method dot}
            1    0.020    0.020    0.589    0.589 extmath.py:185(randomized_range_finder)
            2    0.018    0.009    0.019    0.010 function_base.py:532(asarray_chkfinite)
           24    0.014    0.001    0.014    0.001 {method 'ravel' of 'numpy.ndarray' objects}
            1    0.007    0.007    0.009    0.009 twodim_base.py:427(triu)
            1    0.004    0.004    1.710    1.710 extmath.py:232(randomized_svd)
Computer 2 output

    >>> 858 function calls in 40.145 seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            2   32.116   16.058   32.116   16.058 {built-in method dot}
            1    6.148    6.148    6.156    6.156 decomp_svd.py:15(svd)
            2    0.561    0.281    0.561    0.281 decomp_qr.py:14(safecall)
            6    0.561    0.093    0.561    0.093 {built-in method csr_matvecs}
            1    0.337    0.337    0.337    0.337 {method 'normal' of 'mtrand.RandomState' objects}
            6    0.202    0.034    0.202    0.034 {built-in method csc_matvecs}
            1    0.052    0.052    1.633    1.633 extmath.py:183(randomized_range_finder)
            1    0.045    0.045    0.054    0.054 _methods.py:73(_var)
            1    0.023    0.023    0.023    0.023 {method 'argmax' of 'numpy.ndarray' objects}
            1    0.023    0.023    0.046    0.046 extmath.py:531(svd_flip)
            1    0.016    0.016   40.145   40.145 <string>:1(<module>)
           24    0.011    0.000    0.011    0.000 {method 'ravel' of 'numpy.ndarray' objects}
            6    0.009    0.002    0.009    0.002 {method 'reduce' of 'numpy.ufunc' objects}
            2    0.008    0.004    0.009    0.004 function_base.py:532(asarray_chkfinite)
Computer 3 output

    >>> 858 function calls in 2.223 seconds

       Ordered by: internal time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.956    0.956    0.972    0.972 decomp_svd.py:15(svd)
            2    0.306    0.153    0.306    0.153 {built-in method dot}
            1    0.274    0.274    0.274    0.274 {method 'normal' of 'mtrand.RandomState' objects}
            6    0.205    0.034    0.205    0.034 {built-in method csr_matvecs}
            6    0.151    0.025    0.151    0.025 {built-in method csc_matvecs}
            2    0.133    0.067    0.133    0.067 decomp_qr.py:14(safecall)
            1    0.032    0.032    0.043    0.043 _methods.py:73(_var)
            1    0.030    0.030    0.030    0.030 {method 'argmax' of 'numpy.ndarray' objects}
           24    0.026    0.001    0.026    0.001 {method 'ravel' of 'numpy.ndarray' objects}
            2    0.019    0.010    0.020    0.010 function_base.py:532(asarray_chkfinite)
            1    0.019    0.019    0.773    0.773 extmath.py:183(randomized_range_finder)
            1    0.019    0.019    0.049    0.049 extmath.py:531(svd_flip)
Notice the {built-in method dot} difference: from 0.035 s/call to 16.058 s/call, roughly 450 times slower!!

    ------+---------+---------+---------+---------+---------------------------+-----------
    ncalls| tottime | percall | cumtime | percall | filename:lineno(function) | hardware
    ------+---------+---------+---------+---------+---------------------------+-----------
         1|   0.035 |   0.035 |   0.035 |   0.035 | {built-in method dot}     | Computer 1
         2|  32.116 |  16.058 |  32.116 |  16.058 | {built-in method dot}     | Computer 2
         2|   0.306 |   0.153 |   0.306 |   0.153 | {built-in method dot}     | Computer 3
I understand that there should be performance differences, but should they be this high?

Is there a way I can further debug this performance issue?
EDIT

I tested a new computer, Computer 3, whose hardware is similar to Computer 2 but with a different OS.

The result is 0.153 s/call for {built-in method dot}, still about 100 times faster than Linux!!
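As a quick check of the ratios quoted in the question, the per-call times for {built-in method dot} can be divided directly. This is only a small arithmetic sketch: the numbers are copied from the profiles, while the dictionary and the slowdown helper are illustrative names, not part of any library.

```python
# Per-call time (seconds) of {built-in method dot}, copied from the profiles.
percall = {
    "computer 1": 0.035,   # Windows 7, physical machine
    "computer 2": 16.058,  # Debian, Amazon cloud
    "computer 3": 0.153,   # Windows 2008 R2, Amazon cloud
}


def slowdown(slow, fast):
    """Return how many times slower `slow` is than `fast` per dot call."""
    return percall[slow] / percall[fast]


if __name__ == "__main__":
    print("computer 2 vs computer 1: %.0fx" % slowdown("computer 2", "computer 1"))
    print("computer 2 vs computer 3: %.0fx" % slowdown("computer 2", "computer 3"))
```

Computers 2 and 3 are the more telling comparison, since their hardware is similar.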
EDIT 2

Computer 1 numpy config

    >>> np.__config__.show()
    lapack_opt_info:
        libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
        library_dirs = ['c:/program files (x86)/intel/composer xe/mkl/lib/intel64']
        define_macros = [('SCIPY_MKL_H', None)]
        include_dirs = ['c:/program files (x86)/intel/composer xe/mkl/include']
    blas_opt_info:
        libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
        library_dirs = ['c:/program files (x86)/intel/composer xe/mkl/lib/intel64']
        define_macros = [('SCIPY_MKL_H', None)]
        include_dirs = ['c:/program files (x86)/intel/composer xe/mkl/include']
    openblas_info:
      NOT AVAILABLE
    lapack_mkl_info:
        libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd', 'mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
        library_dirs = ['c:/program files (x86)/intel/composer xe/mkl/lib/intel64']
        define_macros = [('SCIPY_MKL_H', None)]
        include_dirs = ['c:/program files (x86)/intel/composer xe/mkl/include']
    blas_mkl_info:
        libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
        library_dirs = ['c:/program files (x86)/intel/composer xe/mkl/lib/intel64']
        define_macros = [('SCIPY_MKL_H', None)]
        include_dirs = ['c:/program files (x86)/intel/composer xe/mkl/include']
    mkl_info:
        libraries = ['mkl_lapack95_lp64', 'mkl_blas95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'libiomp5md', 'libifportmd']
        library_dirs = ['c:/program files (x86)/intel/composer xe/mkl/lib/intel64']
        define_macros = [('SCIPY_MKL_H', None)]
        include_dirs = ['c:/program files (x86)/intel/composer xe/mkl/include']

Computer 2 numpy config

    >>> np.__config__.show()
    lapack_info:
      NOT AVAILABLE
    lapack_opt_info:
      NOT AVAILABLE
    blas_info:
        libraries = ['blas']
        library_dirs = ['/usr/lib']
        language = f77
    atlas_threads_info:
      NOT AVAILABLE
    atlas_blas_info:
      NOT AVAILABLE
    lapack_src_info:
      NOT AVAILABLE
    openblas_info:
      NOT AVAILABLE
    atlas_blas_threads_info:
      NOT AVAILABLE
    blas_mkl_info:
      NOT AVAILABLE
    blas_opt_info:
        libraries = ['blas']
        library_dirs = ['/usr/lib']
        language = f77
        define_macros = [('NO_ATLAS_INFO', 1)]
    atlas_info:
      NOT AVAILABLE
    lapack_mkl_info:
      NOT AVAILABLE
    mkl_info:
      NOT AVAILABLE
{built-in method dot} is the np.dot function, which is a NumPy wrapper around the CBLAS routines for matrix-matrix, matrix-vector and vector-vector multiplication. Your Windows machines use the heavily tuned Intel MKL version of CBLAS. Your Linux machine is using the slow old reference implementation.
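To confirm that the dense product alone accounts for the gap, np.dot can be timed in isolation on each machine. The following is a minimal sketch: time_dot is a hypothetical helper name, and the default matrix shapes are assumptions picked only to give a noticeable workload, not the shapes scikit-learn uses internally.

```python
import timeit

import numpy as np


def time_dot(m=1000, k=2000, n=500, repeats=3, seed=0):
    """Return the best wall-clock time (seconds) of one dense np.dot call.

    The shapes are illustrative assumptions; increase them to get closer
    to the size of the data in the question.
    """
    rng = np.random.RandomState(seed)
    a = rng.normal(size=(m, k))
    b = rng.normal(size=(k, n))
    best = float("inf")
    for _ in range(repeats):
        start = timeit.default_timer()
        np.dot(a, b)  # dispatched to whichever BLAS NumPy was built against
        best = min(best, timeit.default_timer() - start)
    return best


if __name__ == "__main__":
    print("best np.dot time: %.3f s" % time_dot())
```

Running the same script on both machines should show a gap of the same order as the one in the profiles if the BLAS library is the cause.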
If you install ATLAS or OpenBLAS (both available through Linux package managers) or, in fact, Intel MKL, you're likely to see massive speedups. Try sudo apt-get install libatlas-dev, check the NumPy config once again to see if it picked up ATLAS, and measure again.
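The config check can also be scripted. The sketch below captures the text printed by np.show_config() and looks for well-known tuned BLAS names on the lines that list libraries. detect_blas, numpy_config_text and the KNOWN_BLAS tuple are hypothetical names introduced here, the matching is a rough heuristic, and contextlib.redirect_stdout needs Python 3.4 or newer.

```python
import contextlib
import io

import numpy as np

# Substrings that usually identify a tuned BLAS (an assumption, not exhaustive).
KNOWN_BLAS = ("mkl", "openblas", "atlas", "blis", "accelerate")


def detect_blas(config_text):
    """Return tuned BLAS names found on 'libraries'/'name' lines of the config."""
    found = []
    for line in config_text.splitlines():
        stripped = line.strip().lstrip('"').lower()
        # Only look at lines that name libraries, so section headers such as
        # 'openblas_info: NOT AVAILABLE' are not mistaken for a match.
        if not (stripped.startswith("libraries") or stripped.startswith("name")):
            continue
        for name in KNOWN_BLAS:
            if name in stripped and name not in found:
                found.append(name)
    return found


def numpy_config_text():
    """Capture what np.show_config() prints to standard output."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


if __name__ == "__main__":
    names = detect_blas(numpy_config_text())
    print(names if names else "no tuned BLAS name found in the config")
```

An empty result, as the Computer 2 config above would give, suggests NumPy is still linked against the reference BLAS.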
Once you've decided on the right CBLAS library, you may want to recompile scikit-learn. Most of it just uses NumPy for its linear algebra needs, but some algorithms (notably k-means) use CBLAS directly.

The OS has little to do with this.
python performance numpy scikit-learn