
CLUSTER ANALYSIS
Overview
An illustrated tutorial and introduction to cluster analysis, with examples in SPSS, SAS, SAS Enterprise Miner, and Stata. Suitable for introductory graduate-level study.
The 2014 edition is a major update to the 2012 edition. Among the new features are these:
The full content is now available from Statistical Associates Publishers.
Below is the table of contents.
Table of Contents

CLUSTER ANALYSIS
Overview
  Data examples in this volume
Key Concepts and Terms
  Terminology
  Distances (proximities)
  Cluster formation
  Cluster validity
Types of cluster analysis
  Types of cluster analysis by software package
  Disjoint clustering
  Hierarchical clustering
  Overlapping clustering
  Fuzzy clustering
Hierarchical cluster analysis in SPSS
  SPSS input for hierarchical clustering
    Example
    The main "Hierarchical Cluster Analysis" dialog
    Statistics button
    Plots button
    Methods button
  SPSS output for hierarchical cluster analysis
    Proximity table
    Cluster membership table
    Agglomeration schedule
    Dendrogram
    Icicle plots
    Summary measures
Hierarchical cluster analysis in SAS
  SAS input for hierarchical cluster analysis
    Example
    Data setup
    SAS syntax
  SAS output for hierarchical cluster analysis
    Simple statistics table
    Eigenvalues of the covariance matrix table
    Root mean square coefficients
    Cluster history table
    Dendrogram
    Icicle plots
    Cluster membership table
    Saving data to file
Hierarchical cluster analysis in Stata
  Stata input for hierarchical cluster analysis
  Stata output for hierarchical cluster analysis
    Agglomeration coefficients
    Dendrogram
    Saving cluster membership values
    Cluster membership table
K-means cluster analysis
  Overview
  Example
K-means cluster analysis in SPSS
  SPSS input
    Main K-means dialog
    The Iterate button
    The Save button
    The Options button
  SPSS output for K-means cluster analysis
    The ANOVA table
    Number of cases in each cluster
    Getting different clusters
    Cluster membership table
K-means cluster analysis in SAS
  Overview
  Example
  SAS input for k-means cluster analysis
  SAS output for k-means cluster analysis
    The "Statistics for Variables" table
    Criteria for determining k
    The "Cluster Summary" table
    Cluster membership and distance values
    Crosstabulation tables
    Cluster separation plots
K-means cluster analysis in Stata
  Example
  Stata input for k-means cluster analysis
    The main k-means clustering command
    Obtaining descriptive statistics
    Obtaining distance information
    Obtaining cluster separation plots
    Comparing k-means and k-medians solutions
  Stata output for k-means cluster analysis
    Cluster membership assignments
    Descriptive statistics
    Distance coefficients
    Cluster separation plots
    Comparing k-means and k-medians solutions
Two-step cluster analysis in SPSS
  Overview
    Cluster feature tree (CF tree)
    Proximity
  Example
  SPSS input for two-step clustering
    The main two-step clustering dialog
    Options button dialog
    Output button dialog
  SPSS output for two-step clustering
    Auto-clustering table
    Cluster distribution table
    Centroids (cluster profiles) table
    Model summary
    The "Cluster Quality" graph
    The "Cluster Sizes" pie chart
    The "Predictor Importance" chart
    The "Clusters" table
    The "Cell Distribution" chart
    The "Cluster Comparison" chart
Nearest neighbor analysis in SPSS
  Overview
    Target variables
    Selecting k
    Feature variables
    Focal cases
    Case labels
    Partitions and cross-validation
  Example
  SPSS input
    The user interface
    The "Variables" tab
    The "Neighbors" tab
    The "Features" tab
    The "Partitions" tab
    The "Save" tab
    The "Output" tab
    The "Options" tab
  SPSS output
    Overview
    The "Case Processing Summary" table
    The "Predictor Space" plot
    The "Peers Chart"
    The "k Nearest Neighbors and Distances" table
    "k and Predictor Selection" plots
    "Quadrant Map" maps
    The "Error Summary" table
SAS PROC ACECLUS: Preprocessing for elliptical clusters
  Overview
  Example
  SAS input
    Overview
    Setup
    Plot of original data
    Using PROC ACECLUS to transform the data
    Plot of transformed data
    K-means clustering of transformed data
    K-means clustering of original data
  SAS output
    Plot of untransformed data
    Data transformation with PROC ACECLUS
    Plot of transformed data
    K-means (PROC FASTCLUS) results with original vs. transformed data
SAS PROC VARCLUS: Oblique principal components cluster analysis
  Overview
    The PROC VARCLUS default method
    PROC VARCLUS variations
  Example
  SAS input
  SAS output
    The dendrogram from PROC TREE
    The cluster summary table
    The R-squared table
    The standardized scoring coefficients table
    The cluster structure table
    The table of intercluster correlations
    The cluster history summary statistics table
    Cluster membership
    Cluster scores
SAS PROC MODECLUS: Nonparametric density cluster analysis
  Overview
    Interpreting p-values
  Example
  SAS input
    PROC MODECLUS specifications
    PROC MODECLUS command syntax
  SAS output
    First pass: Selecting the optimal radius
    Second pass: Generating main output
  PROC MODECLUS: Nearest neighbor analysis
    SAS syntax for nearest neighbor lists/distances
    SAS output for nearest neighbor analysis
Kohonen clustering in SAS Enterprise Miner
  Overview of Kohonen clustering
  Kohonen Clustering in SAS Enterprise Miner: Setup
  Kohonen Clustering in SAS Enterprise Miner: Modeling
    Overview
    The flow chart model
    Node overview
    The "Input Data" node
    The "SOM/Kohonen" node
    The "Segment Profile" node
  Kohonen Clustering in SAS Enterprise Miner: Output
    Results of the "Data Input" node
    Results of the "SOM/Kohonen" node
    Results of the "Segment Profile" node
Other Forms of Cluster Analysis
  Expectation maximization (EM) clustering
    Cross-classification to determine k
    Distributional characteristics
    Classification probabilities
  Q-mode factor analysis
  Multidimensional scaling
  Discriminant function analysis
  F-ratio methods
Assumptions
  Randomization
  Data level
  Independence of observations
  Data distribution
  Comparable scaling
  GLM assumptions
  Sample size
  Outliers
Frequently Asked Questions
  Should data be standardized prior to running cluster analysis?
  What are alternative linkage methods?
    SPSS
    SAS
    Stata
  What are alternative distance measures?
    SPSS
    SAS
    Stata
  It is acknowledged that k-means and hierarchical clustering are inefficient and inaccurate for large datasets, but what is the evidence that two-step clustering does better?
  Can I cluster variables instead of cases?
  Can I cluster repeated measures data?
  Isn't discriminant analysis the same as cluster analysis?
  What is the ratio of distance measure used in auto-clustering in two-step cluster analysis?
  How does SAS's PROC MODECLUS work?
  How does joining and dissolving work in SAS PROC MODECLUS?
  What is the rationale for the stability value criterion in SAS PROC MODECLUS?
  What does the content of OUTSTAT= files look like for PROC VARCLUS?
  What is BIRCH clustering?
  What is ClustanGraphics?
  What is SaTScan?
  Where can I find cluster software for R?
  How does cluster analysis compare with factor analysis and multidimensional scaling?
Acknowledgments
Bibliography

Page count: 207