normalization - Effect of feature scaling on accuracy




I am working on image classification using Gaussian mixture models. I have around 34,000 feature vectors belonging to 3 classes, lying in a 23-dimensional space. I performed feature scaling on both the training and testing data using different methods, and I observed that accuracy actually goes down after scaling. I performed feature scaling because there are differences of many orders of magnitude between the features. I am curious to know why this is happening; I thought feature scaling would increase accuracy, given the big differences in the features.
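For reference, a minimal sketch of the kind of scaling described above, assuming scikit-learn's StandardScaler (the question does not say which library or scaling method was actually used):

```python
# Hypothetical z-score scaling of 23-dimensional training/testing data,
# with the scaling statistics estimated on the training split only.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scales = np.logspace(0, 6, 23)                 # features spanning many orders of magnitude
X_train = rng.normal(size=(1000, 23)) * scales
X_test = rng.normal(size=(200, 23)) * scales

scaler = StandardScaler().fit(X_train)         # mean/std from the training data
X_train_scaled = scaler.transform(X_train)     # (x - mean) / std
X_test_scaled = scaler.transform(X_test)       # reuse the same statistics on the test data
```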

"I thought feature scaling would increase accuracy, given the big differences in the features."

Welcome to the real world, buddy.

In general, it is quite true that you want your features to be on the same "scale" so that you don't have some features "dominating" others. This matters especially if your machine learning algorithm is inherently "geometrical" in nature. By "geometrical", I mean that it treats the samples as points in space and relies on the distances (usually Euclidean/L2 in this case) between points to make predictions, i.e., the spatial relationships between the points matter. GMMs and SVMs are algorithms of this nature.
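To make the "dominating" point concrete, here is a tiny sketch (the feature values and units are made up for illustration):

```python
# One feature measured in large units (income) swamps the Euclidean (L2) distance,
# so the height difference barely influences which points look "close".
import numpy as np

a = np.array([1.70, 65000.0])   # [height in metres, yearly income]
b = np.array([1.95, 64000.0])
c = np.array([1.71, 64000.0])

print(np.linalg.norm(a - b))    # ~1000.0
print(np.linalg.norm(a - c))    # ~1000.0, even though a and c are almost the same height
```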

However, feature scaling can also screw things up, especially if some features are categorical/ordinal in nature and you didn't preprocess them properly when you appended them to the rest of the features. Furthermore, depending on the feature scaling method, the presence of outliers in a particular feature can screw up the scaling of that feature. For example, "min/max" or "unit variance" scaling is going to be sensitive to outliers (e.g., if one of the features encodes yearly income or cash balance and there are a few millionaires/billionaires in the dataset).
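A small sketch of that outlier sensitivity, assuming scikit-learn's MinMaxScaler and made-up income values:

```python
# One extreme value squashes everyone else into a tiny sliver of [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

income = np.array([[30_000], [45_000], [60_000], [80_000], [2_000_000_000]])  # one billionaire
print(MinMaxScaler().fit_transform(income).ravel())
# [0.0e+00 7.5e-06 1.5e-05 2.5e-05 1.0e+00] -> the ordinary incomes become nearly indistinguishable
```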

Also, when you run into a problem like this, the cause may not be obvious. It does not follow that because you performed feature scaling and the result got worse, the feature scaling is at fault. The method may have been screwed up to begin with, and the result after feature scaling just happens to be even more screwed up.

So what could be the cause(s) of the problem?

My guess is that you have high-dimensional data and not enough training samples. The GMM has to estimate a covariance matrix for each Gaussian from your data, and unless you have a lot of data relative to the dimensionality, chances are that one or more of those covariance matrices will be near-singular or singular. That means the predictions from the GMM are nonsense to begin with, because the Gaussians "blew up" and/or the EM algorithm gave up after its predefined number of iterations.

Another possibility is poor testing methodology. Perhaps you did not divide the data into proper training/validation/test sets and did not perform the testing properly, in which case the "good" performance you had originally is not credible. It is a common, natural tendency to test on the same training data the model was fitted on rather than on a validation or test set.
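A rough sketch of both points, assuming scikit-learn and a synthetic stand-in for the 23-dimensional, 3-class data (one GaussianMixture per class, scored on a held-out test set; the component counts and reg_covar value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=23, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# One mixture per class; predict the class whose mixture gives the highest log-likelihood.
models = []
for c in range(3):
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          reg_covar=1e-6, random_state=0)
    gmm.fit(X_train[y_train == c])
    if not gmm.converged_:
        print(f"warning: EM did not converge for class {c}")  # a sign the fit is suspect
    models.append(gmm)

scores = np.column_stack([m.score_samples(X_test) for m in models])
print("held-out accuracy:", np.mean(scores.argmax(axis=1) == y_test))
```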

So what can you do?

Don't use a GMM for image classification. Use a proper supervised learning algorithm, since you have known image category labels. In particular, to avoid feature scaling altogether, use a random forest or one of its variants (e.g., extremely randomized trees); a sketch is given below.

Get more training data. Unless you are classifying "simple" (i.e., "toy"/synthetic) images, or classifying into only a few image classes (e.g., <= 5; note that this is a random little number pulled out of the air), you will need a good deal of images per class. A starting point is at least a couple of hundred per class, or use a more sophisticated algorithm that exploits the structure within your data to arrive at better performance.
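As a hedged sketch of that suggestion, again assuming scikit-learn and the same synthetic stand-in data; the point is that tree ensembles are insensitive to monotone rescaling of features, so no scaling step is needed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=23, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (RandomForestClassifier(n_estimators=200, random_state=0),
            ExtraTreesClassifier(n_estimators=200, random_state=0)):
    clf.fit(X_train, y_train)                     # raw, unscaled features
    print(type(clf).__name__, clf.score(X_test, y_test))
```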

Basically, my point is: don't (just) treat machine learning as a field of black-box algorithms and a bunch of tricks to memorize and try at random. Try to understand the algorithm/math under the hood. That way, you'll be better able to diagnose the problem(s) you encounter.

Edit (in response to a request for clarification from @zee):

For papers, the one I can recall off the top of my head is A Practical Guide to Support Vector Classification, by the authors of LIBSVM. The examples therein show the importance of feature scaling for SVMs on various datasets. For example, consider the RBF/Gaussian kernel. This kernel uses the squared L2 norm, so if the features are on different scales, that will impact the kernel value.
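A tiny numerical illustration of that point, with made-up feature values and an arbitrary gamma:

```python
# K(x, z) = exp(-gamma * ||x - z||^2): the largest-scale feature dominates the kernel value.
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.70, 65000.0])     # [height in metres, yearly income]
z = np.array([1.95, 64000.0])
print(rbf(x, z))                  # underflows to 0.0: the income axis alone kills the similarity

x_s = np.array([0.20, 0.65])      # the same two points after (hypothetical) scaling
z_s = np.array([0.90, 0.64])
print(rbf(x_s, z_s))              # ~0.61: now both features contribute
```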

Also, how you represent the features matters. For example, changing a variable that represents height from metres to cm or inches will impact algorithms such as PCA (because the variance along the direction of that feature has changed). Note that this is different from "typical" scaling (e.g., min/max, z-score, etc.) in that it is a matter of representation: the person is still the same height regardless of the unit, whereas typical feature scaling "transforms" the data and thereby changes the "height" of the person. Prof. David MacKay, on the Amazon page of his book Information Theory, Inference, and Learning Algorithms, has a comment in this vein when asked why he did not include PCA in the book.
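A toy illustration of the unit-change effect on PCA, assuming scikit-learn and made-up height/weight data:

```python
# Changing height from metres to centimetres multiplies its variance by 100^2,
# which is enough to flip which axis the first principal component aligns with.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height_m = rng.normal(1.75, 0.1, size=500)   # metres (std 0.1 -> variance 0.01)
weight_kg = rng.normal(70, 5, size=500)      # kilograms (std 5 -> variance 25)

X_metres = np.column_stack([height_m, weight_kg])
X_cm = np.column_stack([height_m * 100, weight_kg])    # same people, height now in cm

print(PCA(n_components=1).fit(X_metres).components_)   # first PC roughly along the weight axis
print(PCA(n_components=1).fit(X_cm).components_)       # first PC roughly along the height axis
```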

For ordinal and categorical variables, these are mentioned briefly in Bayesian Reasoning and Machine Learning and The Elements of Statistical Learning. They mention ways to encode them as features, e.g., replacing a variable that can take 3 categories with 3 binary variables, with the one set to "1" indicating that the sample has that category. This is important for methods such as linear regression (or linear classifiers). Note that encoding categorical variables/features is not scaling per se, but it is part of the feature preprocessing step, and hence useful to know. More can be found in Hal Daumé III's book below.
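A minimal sketch of that encoding, assuming scikit-learn's OneHotEncoder (any equivalent one-hot/dummy encoding behaves the same way):

```python
# A single 3-category variable becomes 3 binary columns, exactly one "1" per sample.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colour = np.array([["red"], ["green"], ["blue"], ["green"]])
encoded = OneHotEncoder(sparse_output=False).fit_transform(colour)  # use sparse=False on older scikit-learn
print(encoded)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]   (columns ordered alphabetically: blue, green, red)
```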

The book is A Course in Machine Learning by Hal Daumé III. Search for "scaling". One of the earliest examples in the book shows how it affects KNN (which uses the L2 distance; GMMs and SVMs etc. use it too, the latter if you use the RBF/Gaussian kernel). More details are given in chapter 4, "Machine Learning in Practice". Unfortunately the images/plots do not show up in the PDF. The book also has one of the nicest treatments of feature encoding and scaling, especially if you work on natural language processing (NLP). For example, see the explanation of applying a logarithm to a feature (i.e., the log transform). That way, sums of logs become the log of a product of the features, and the "effects"/"contributions" of these features are tapered by the logarithm.
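A quick sketch of the log transform, with made-up counts of the kind you might see in NLP features:

```python
# log1p tapers the huge values so they no longer dominate, and since
# log(a) + log(b) = log(a*b), summing log features is the log of their product.
import numpy as np

counts = np.array([0, 1, 3, 50, 10_000])
print(np.log1p(counts))   # [0.     0.693  1.386  3.932  9.210]
```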

Note that the aforementioned textbooks are freely downloadable at the links above.

normalization mixture-model
