MULTIVARIATE ANALYSIS
Kittisak Buddhachat, Ph.D.
WHAT IS THE DIFFERENCE BETWEEN UNIVARIATE AND MULTIVARIATE DATA?
Univariate (Fertiliser A vs. Fertiliser B)
Aim: which fertiliser accelerates tree growth?
Independent variable: type of fertiliser.
Dependent variable: height.
Multivariate (forestation)
Aim: what are the significant factors for different growth?
Independent variables: intensity of light, pH of soil, temperature, etc.
Dependent variable: variation of height.
WHAT IS MULTIVARIATE ANALYSIS (MA)?
Definition: the statistical process of simultaneously analysing multiple independent (predictor) variables with multiple dependent (outcome or criterion) variables using matrix algebra.
For example: a study of human behaviour (like laziness and diligence), which is very complex. The observed data comprise more than one variable, such as sex, age, weight and so on. What are the dependent variables?
WHAT IS MULTIVARIATE ANALYSIS?
Matrix algebra: most multivariate analyses are correlational! Correlation can tell us which independent variables most influence the dependent variable under study.
TYPES OF DATA
Continuous variable
• a variable with an infinite number of possible values
• also known as a quantitative variable, e.g. income, age
• scale: interval & ratio
Categorical/discrete variable
• a variable whose attributes are separated from one another
• also known as a qualitative variable, e.g. marital status, gender, nationality
• scale: nominal & ordinal
Binary coding: e.g. gender coded as 0/1.
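A minimal sketch in R of the two variable types; the income and gender values are made-up illustrations:

```r
# Continuous (quantitative) variable on a ratio scale
income <- c(25000, 31000, 47000)

# Categorical (qualitative) variable on a nominal scale
gender <- factor(c("male", "female", "female"))

# Binary 0/1 coding of gender, as on the slide
as.numeric(gender == "male")   # 1 0 0
```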
WHAT CAN MA TELL US?
Types of research questions
1. Measuring significant differences between group means, e.g. multivariate analysis of variance (MANOVA).
2. Degree of relationship between variables, e.g. multiple linear regression, logistic regression.
3. Explaining underlying structure, e.g. principal component analysis (PCA).
4. Predicting membership in two or more groups from one or more variables, e.g. linear discriminant analysis (LDA).
TYPES OF MA
Supervised
• Multiple regression
• Logistic regression
• Linear discriminant analysis
Unsupervised
• Data reduction, e.g. PCA
• Clustering: hierarchical, partitional
DATA REDUCTION
Principal component analysis (PCA) is a dimension reduction approach applied to simplify the data and to visualise the most important information in the data set.
Main goals of PCA
• to identify hidden patterns in a data set
• to reduce the dimensionality of the data (noise and redundancy)
• to identify correlated variables
DATA REDUCTION
Principal component analysis (PCA) in exploratory data analysis
[Diagram: common factors, observed factors and unique factors]
DATA REDUCTION
Steps for PCA (see the sketch below)
1. PREPARE THE DATA
   • centre → subtract the mean (standardised)
   • scale → divide by the SD (normalised)
2. CALCULATE THE COVARIANCE / CORRELATION MATRIX
   • variance: var(X); covariance: cov(X, Y)
   • correlation: r = cov(X, Y) / (SD(X) × SD(Y))
3. CALCULATE EIGENVECTORS / EIGENVALUES
4. CHOOSE HOW TO ROTATE
   • orthogonal (components remain uncorrelated) → varimax (popular)
   • oblique (components allowed to correlate)
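A minimal sketch of these steps in R using prcomp(), which centres by the mean, scales by the SD when scale. = TRUE, and extracts the eigenvectors/eigenvalues; the iris data from the next slide are used for illustration (rotation, if wanted, lives in separate functions such as stats::varimax):

```r
data(iris)
x <- iris[, 1:4]                        # four continuous measurements

# Steps 1-3: centre, scale, then extract components
pca <- prcomp(x, center = TRUE, scale. = TRUE)

summary(pca)       # proportion of variance explained by each component
pca$rotation       # eigenvectors (variable loadings)
head(pca$x)        # component scores for each observation

# Scores on the first two components, coloured by species
plot(pca$x[, 1:2], col = iris$Species, xlab = "PC1", ylab = "PC2")
```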
DATA FOR MA
Case study I: Iris data set, a multivariate data set introduced by Sir Ronald Fisher (1936)
Iris setosa, Iris versicolor, Iris virginica
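The iris data ship with base R, so the case study can be inspected directly:

```r
data(iris)
str(iris)    # 150 observations: four measurements plus Species
head(iris)
```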
CLUSTERING
Clustering is an exploratory data analysis technique used to identify groups in the data set of interest. Each group contains observations with similar profiles according to specific criteria.
[Figures: k-means clustering; hierarchical clustering]
CLUSTERING
Hierarchical clustering: generates a nested cluster hierarchy by distance/dissimilarity, forming a dendrogram.
Non-hierarchical or partitional clustering: partitions the data space and finds all clusters simultaneously.
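A brief sketch in R contrasting the two families on the iris measurements; three clusters are assumed since there are three species:

```r
x <- scale(iris[, 1:4])

# Partitional: k-means finds all k clusters simultaneously
km <- kmeans(x, centers = 3, nstart = 25)
table(km$cluster, iris$Species)

# Hierarchical: nested hierarchy from pairwise distances, drawn as a dendrogram
hc <- hclust(dist(x), method = "complete")
plot(hc, labels = FALSE)
```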
CLUSTERING
Hierarchical clustering: an alternative approach for identifying groups in the data set (no pre-specified number of groups required), resulting in a tree-based representation called a dendrogram, built from dissimilarities (pairwise distances). In R, the default distance is Euclidean.
Two algorithms (sketched below):
• Agglomerative clustering: works in a bottom-up manner
• Divisive hierarchical clustering: works in a top-down manner
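A sketch of both algorithms: hclust() in base R is agglomerative, while the divisive DIANA algorithm is assumed available via the cluster package:

```r
library(cluster)                     # provides diana()

d <- dist(scale(iris[, 1:4]))        # Euclidean distance, dist()'s default

agg <- hclust(d, method = "average") # agglomerative: bottom-up merging
div <- diana(d)                      # divisive: top-down splitting

plot(agg, labels = FALSE)            # agglomerative dendrogram
pltree(div)                          # dendrogram of the divisive fit
```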
CLUSTERING
Agglomerative clustering: linkage criteria (compared in the sketch below)
• Maximum or complete linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the largest (i.e., maximum) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
• Minimum or single linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the smallest of these dissimilarities as the linkage criterion. It tends to produce long, "loose" clusters.
• Mean or average linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the average of these dissimilarities as the distance between the two clusters.
• Centroid linkage clustering: computes the dissimilarity between the centroid of cluster 1 (a mean vector of length p, the number of variables) and the centroid of cluster 2.
• Ward's minimum variance method: minimises the total within-cluster variance. At each step the pair of clusters with the minimum between-cluster distance is merged.
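A sketch comparing four of these linkage criteria with hclust() on the same distance matrix (centroid linkage is omitted because hclust() expects squared Euclidean distances for it):

```r
d <- dist(scale(iris[, 1:4]))

op <- par(mfrow = c(2, 2))
for (m in c("complete", "single", "average", "ward.D2")) {
  plot(hclust(d, method = m), main = m, labels = FALSE, xlab = "", sub = "")
}
par(op)
```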
CLUSTERING
[Figures: complete link clustering; single link clustering; group average clustering]
MULTIPLE REGRESSION
Goal: use a linear composite of two or more predictor variables
1. to predict scores on a single continuous (criterion) variable, e.g. predicting the body length of a dugong from skeletal remains;
2. to explain the nature of the single continuous criterion variable from what is known about the predictor variables, e.g. which factor influences growth (height) the most, and to what degree are the factors related to the growth of a tree?
MULTIPLE REGRESSION
Basic analysis
Simple regression → Y = bX + a (one criterion, one predictor)
Coefficient of determination: R² = SS_regression / SS_total = 1 − (SS_error / SS_total), where SS_total = SS_regression + SS_error (residual).
R² = 0.85 → 85% of the total variation in Y can be explained by the linear relationship between X and Y.
[Venn diagram: the Y-variance (e.g. height) and X-variance (e.g. light) overlap; the shared variance is R².]
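A minimal sketch of simple regression in R; light and height are simulated stand-ins for the slide's example:

```r
set.seed(1)
light  <- runif(50, 0, 10)               # hypothetical predictor
height <- 2 + 0.8 * light + rnorm(50)    # hypothetical response

fit <- lm(height ~ light)                # Y = bX + a
summary(fit)$r.squared                   # R^2 = SS_regression / SS_total
```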
MULTIPLE REGRESSION
In multiple regression, the linear composite yields a correlation (multiple R) with Y. As in simple regression, SS_error is minimised and SS_regression is maximised.
a. Uncorrelated X variables → Y = b₁X₁ + b₂X₂ + … + bᵢXᵢ + a
   R² = r²(X₁,Y) + r²(X₂,Y) + … + r²(Xₙ,Y)
[Venn diagram: X₁-variance and X₂-variance each share variance with Y but not with each other.]
MULTIPLE REGRESSION
b. Correlated X variables: the predictor variables overlap with each other, so R² is no longer the simple sum r²(X₁,Y) + r²(X₂,Y) + … + r²(Xₙ,Y). In practice, predictor variables will always be correlated to some degree with each other.
[Venn diagram: X₁ and X₂ overlap both Y and each other.]
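A quick illustration with the built-in mtcars data: because the predictors wt and hp are correlated, the multiple R² falls short of the sum of the individual squared correlations:

```r
r2_sum <- cor(mtcars$mpg, mtcars$wt)^2 + cor(mtcars$mpg, mtcars$hp)^2
r2_fit <- summary(lm(mpg ~ wt + hp, data = mtcars))$r.squared

c(sum_of_r2 = r2_sum, multiple_R2 = r2_fit)  # the sum double-counts shared variance
```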
MULTIPLE REGRESSION
Variable selection
Stepwise regression: choosing the predictor variables by an automatic procedure. Criteria for selecting predictor variables include the F-test, t-test, adjusted R², Akaike information criterion (AIC), Bayesian information criterion (BIC), etc.
Main approaches for stepwise regression (sketched below):
• Forward selection: start with no variables in the model and test the addition of each variable that improves the fit of the model.
• Backward selection: start with all candidate variables and test the deletion of each variable whose loss gives the most statistically insignificant deterioration of the model fit.
• Bidirectional elimination: a combination of the above.
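A sketch of the three approaches with R's step(), which scores candidate models by AIC; mtcars is used purely for illustration:

```r
full <- lm(mpg ~ ., data = mtcars)    # all candidate predictors
null <- lm(mpg ~ 1, data = mtcars)    # intercept only

fwd  <- step(null, scope = formula(full), direction = "forward")
bwd  <- step(full, direction = "backward")
both <- step(null, scope = formula(full), direction = "both")
```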
MULTIPLE REGRESSION
Cross-validation
• a model validation method for assessing how accurately a predictive model will perform in practice.
• In a prediction problem, a model is fitted on a dataset of known data (training data) and then tested against an unseen dataset (testing data); a model that fits the training data well but predicts the testing data poorly is said to be "overfitting".
MULTIPLE REGRESSION
Common types of cross-validation
• Exhaustive cross-validation
  – Leave-p-out cross-validation (LpOCV)
  – Leave-one-out cross-validation (LOOCV)
• Non-exhaustive cross-validation
  – k-fold cross-validation
  – 2-fold cross-validation
  – Repeated random sub-sampling validation
[Diagram: k-fold cross-validation — in each of iterations 1 … k, one fold serves as the test data and the remaining folds as the training data.]
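A minimal hand-rolled sketch of k-fold cross-validation (k = 3, with the mtcars regression from the earlier examples standing in for a real model):

```r
set.seed(1)
k <- 3
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # random fold labels

mse <- numeric(k)
for (i in 1:k) {
  train  <- mtcars[folds != i, ]                 # training data
  test   <- mtcars[folds == i, ]                 # held-out test data
  fit    <- lm(mpg ~ wt + hp, data = train)
  mse[i] <- mean((test$mpg - predict(fit, test))^2)
}
mean(mse)   # cross-validated estimate of prediction error
```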
LINEAR DISCRIMINANT ANALYSIS
Linear discriminant analysis (LDA) is a method used in statistics, pattern recognition and machine learning to find a linear combination of features which characterises or separates two or more classes of objects or events.
Possible applications
• medical studies, e.g. assessing the severity state of a disease
• forensic work, e.g. sex identification from skeletal remains
LINEAR DISCRIMINANT ANALYSIS
Steps for linear discriminant analysis (LDA), sketched below:
1. Split the data into training data and testing data.
2. Fit the classification model on the training data.
3. Cross-validate by leave-one-out classification (jackknife).
4. Select a suitable model by accuracy and precision.
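A sketch of these steps with lda() from the MASS package; CV = TRUE returns the leave-one-out (jackknife) classifications directly:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris, CV = TRUE)      # leave-one-out CV

table(Predicted = fit$class, Actual = iris$Species)  # confusion matrix
mean(fit$class == iris$Species)                      # LOOCV accuracy
```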
LINEAR DISCRIMINANT ANALYSIS
Criteria for choosing a good model
• Accuracy: overall observation error
• Precision (repeatability or reproducibility): random error
• Trueness: systematic error
LINEAR DISCRIMINANT ANALYSIS
Confusion matrices
• True positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
• True negatives (TN): we predicted no, and they don't have the disease.
• False positives (FP): we predicted yes, but they don't actually have the disease (a "Type I error").
• False negatives (FN): we predicted no, but they actually do have the disease (a "Type II error").
Worked example (165 cases; TP = 100, TN = 50):
• Accuracy: overall, how often is the classifier correct? (TP + TN)/total = (100 + 50)/165 = 0.91
• Precision: when it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91
• Sensitivity (true positive rate): TP/actual yes = 100/105 = 0.95
• Specificity (true negative rate): TN/actual no = 50/60 = 0.83; equivalent to 1 minus the false positive rate
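The slide's worked example as R arithmetic; FP = 10 and FN = 5 are inferred from the stated totals (110 predicted yes, 105 actual yes):

```r
tp <- 100; tn <- 50; fp <- 10; fn <- 5
total <- tp + tn + fp + fn   # 165

(tp + tn) / total    # accuracy:    0.91
tp / (tp + fp)       # precision:   100/110 = 0.91
tp / (tp + fn)       # sensitivity: 100/105 = 0.95
tn / (tn + fp)       # specificity: 50/60  = 0.83
```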
MODEL SELECTION
• Coefficient of determination (R²)
• R² = SS_regression / SS_total, or R² = 1 − (SS_error / SS_total)
  R² = 0.85 → 85% of the total variation in Y can be explained by the linear relationship between X and Y.
MODEL SELECTION
• Receiver Operating Characteristic (ROC) curve
• Area Under the Curve (AUC)
MODEL SELECTION
• Receiver Operating Characteristic (ROC) curve: sensitivity plotted against 1 − specificity
• Area Under the Curve (AUC) criterion:
  1.0–0.9 = excellent
  0.9–0.8 = good
  0.8–0.7 = fair
  0.7–0.6 = poor
  0.6–0.5 = fail
  Remember that < 0.5 is worse than guessing.
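A sketch of an ROC curve and its AUC using the pROC package (assumed installed); a logistic model on mtcars stands in for a real classifier:

```r
library(pROC)

fit  <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
roc1 <- roc(mtcars$am, fitted(fit))   # sensitivity vs. 1 - specificity

plot(roc1)
auc(roc1)
```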
MA FOR DUGONG DATA
Model selection
[Table: model selection results — columns: MODEL, ADJ. R², MS OF CV AT K=3, FUNCTION; rows: ALL, BASED ON R², FORWARD, BACKWARD, BOTH]
LDA FOR IVORY SPECIES
Model selection
[Table: model selection results — columns: MODEL, ACCURACY, PRECISION; rows: ALL, CROSS VALIDATION]
PLOTTING SYMBOLS IN R BASE
Plot character (pch) and line type

OPTION | DESCRIPTION
lty    | line type; see the chart below
lwd    | line width relative to the default (default = 1); 2 is twice as wide
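A short base-R sketch of these options in action:

```r
plot(1:6, pch = 1:6, xlab = "x", ylab = "y")   # six plot characters
lines(1:6, lty = 2, lwd = 2)                   # dashed line, double width
legend("topleft", legend = c("points", "line"),
       pch = c(1, NA), lty = c(NA, 2))
```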