A Handbook of                        Statistical                       Analyses                       Using n                                       SECOND                                       EDITION        © 2010 by Taylor and Francis Group, LLC
A Handbook of                Statistical               Analyses               Using                                     SECOND                                     EDITION             Brian S. Everitt and Ibrsten Hothorn                                    CRC Press                                    Taylor & Francis Croup                                    Boca Raton London New York                             CRC Press is an imprint of the                             Taylor & Francis Croup, an informa business                             A CHAPMAN & HALL BOOK        © 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC        Taylor & Francis Group        6000 Broken Sound Parkway NW, Suite 300        Boca Raton, FL 33487-2742        © 2010 by Taylor and Francis Group, LLC        Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business        No claim to original U.S. Government works        Printed in the United States of America on acid-free paper        10 9 8 7 6 5 4 3 2 1        International Standard Book Number: 978-1-4200-7933-3 (Paperback)        This book contains information obtained from authentic and highly regarded sources. Reasonable efforts        have been made to publish reliable data and information, but the author and publisher cannot assume        responsibility for the validity of all materials or the consequences of their use. The authors and publishers        have attempted to trace the copyright holders of all material reproduced in this publication and apologize to        copyright holders if permission to publish in this form has not been obtained. If any copyright material has        not been acknowledged please write and let us know so we may rectify in any future reprint.        Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-        ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,        including photocopying, microfilming, and recording, or in any information storage or retrieval system,        without written permission from the publishers.        For permission to photocopy or use material electronically from this work, please access www.copyright.        com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood        Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and        registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,        a separate system of payment has been arranged.        Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used        only for identification and explanation without intent to infringe.                        Library of Congress Cataloging-in-Publication Data              Everitt, Brian.               A handbook of statistical analyses using R / Brian S. Everitt and Torsten Hothorn.              -- 2nd ed.                 p. cm.               Includes bibliographical references and index.               ISBN 978-1-4200-7933-3 (pbk. : alk. paper)               1. Mathematical statistics--Data processing--Handbooks, manuals, etc. 2.  R              (Computer program language)--Handbooks, manuals, etc.  I. Hothorn, Torsten. II. Title.              QA276.45.R3E94 2010              519.50285’5133--dc22                        2009018062        Visit the Taylor & Francis Web site at        http://www.taylorandfrancis.com        and the CRC Press Web site at        http://www.crcpress.com        © 2010 by Taylor and Francis Group, LLC
Dedication                                         To our wives, Mary-Elizabeth and Carolin,                                        for their constant support and encouragement                              © 2010 by Taylor and Francis Group, LLC
Preface to Second Edition                              Like the first edition this book is intended as a guide to data analysis with                              the R system for statistical computing. New chapters on graphical displays,                              generalised additive models and simultaneous inference have been added to                              this second edition and a section on generalised linear mixed models completes                              the chapter that discusses the analysis of longitudinal data where the response                              variable does not have a normal distribution. In addition, new examples and                              additional exercises have been added to several chapters. We have also taken                              the opportunity to correct a number of errors that were present in the first                              edition. Most of these errors were kindly pointed out to us by a variety of peo-                              ple to whom we are very grateful, especially Guido Schwarzer, Mike Cheung,                              Tobias Verbeke, Yihui Xie, Lothar H¨ aberle, and Radoslav Harman.                                We learnt that many instructors use our book successfully for introductory                              courses in applied statistics. We have had the pleasure to give some courses                              based on the first edition of the book ourselves and we are happy to share                              slides covering many sections of particular chapters with our readers. L A T X                                                                                              E                              sources and PDF versions of slides covering several chapters are available from                              the second author upon request.                                A new version of the HSAUR package, now called HSAUR2 for obvious                              reasons, is available from CRAN. Basically the package vignettes have been                              updated to cover the new and modified material as well. Otherwise, the tech-                              nical infrastructure remains as described in the preface to the first edition,                              with two small exceptions: names of R add-on packages are now printed in                              bold font and we refrain from showing significance stars in model summaries.                                Lastly we would like to thank Thomas Kneib and Achim Zeileis for com-                              menting on the newly added material and again the CRC Press staff, in par-                              ticular Rob Calver, for their support during the preparation of this second                              edition.                                                           Brian S. Everitt and Torsten Hothorn                                                                   London and M¨ unchen, April 2009                              © 2010 by Taylor and Francis Group, LLC
Preface to First Edition                              This book is intended as a guide to data analysis with the R system for sta-                              tistical computing. R is an environment incorporating an implementation of                              the S programming language, which is powerful and flexible and has excellent                              graphical facilities (R Development Core Team, 2009b). In the Handbook we                              aim to give relatively brief and straightforward descriptions of how to conduct                              a range of statistical analyses using R. Each chapter deals with the analy-                              sis appropriate for one or several data sets. A brief account of the relevant                              statistical background is included in each chapter along with appropriate ref-                              erences, but our prime focus is on how to use R and how to interpret results.                              We hope the book will provide students and researchers in many disciplines                              with a self-contained means of using R to analyse their data.                                R is an open-source project developed by dozens of volunteers for more than                              ten years now and is available from the Internet under the General Public Li-                              cence. R has become the lingua franca of statistical computing. Increasingly,                              implementations of new statistical methodology first appear as R add-on pack-                              ages. In some communities, such as in bioinformatics, R already is the primary                              workhorse for statistical analyses. Because the sources of the R system are open                              and available to everyone without restrictions and because of its powerful lan-                              guage and graphical capabilities, R has started to become the main computing                              engine for reproducible statistical research (Leisch, 2002a,b, 2003, Leisch and                              Rossini, 2003, Gentleman, 2005). For a reproducible piece of research, the orig-                              inal observations, all data preprocessing steps, the statistical analysis as well                              as the scientific report form a unity and all need to be available for inspection,                              reproduction and modification by the readers.                                Reproducibility is a natural requirement for textbooks such as the Handbook                              of Statistical Analyses Using R and therefore this book is fully reproducible                              using an R version greater or equal to 2.2.1. All analyses and results, including                              figures and tables, can be reproduced by the reader without having to retype                              a single line of R code. The data sets presented in this book are collected                              in a dedicated add-on package called HSAUR accompanying this book. The                              package can be installed from the Comprehensive R Archive Network (CRAN)                              via                              R> install.packages(\"HSAUR\")                              and its functionality is attached by                              R> library(\"HSAUR\")                              The relevant parts of each chapter are available as a vignette, basically a                              © 2010 by Taylor and Francis Group, LLC
document including both the R sources and the rendered output of every                             analysis contained in the book. For example, the first chapter can be inspected                             by                             R> vignette(\"Ch_introduction_to_R\", package = \"HSAUR\")                             and the R sources are available for reproducing our analyses by                             R> edit(vignette(\"Ch_introduction_to_R\", package = \"HSAUR\"))                             An overview on all chapter vignettes included in the package can be obtained                             from                             R> vignette(package = \"HSAUR\")                             We welcome comments on the R package HSAUR, and where we think these                             add to or improve our analysis of a data set we will incorporate them into the                             package and, hopefully at a later stage, into a revised or second edition of the                             book.                               Plots and tables of results obtained from R are all labelled as ‘Figures’ in                             the text. For the graphical material, the corresponding figure also contains                             the ‘essence’ of the R code used to produce the figure, although this code may                             differ a little from that given in the HSAUR package, since the latter may                             include some features, for example thicker line widths, designed to make a                             basic plot more suitable for publication.                               We would like to thank the R Development Core Team for the R system, and                             authors of contributed add-on packages, particularly Uwe Ligges and Vince                             Carey for helpful advice on scatterplot3d and gee. Kurt Hornik, Ludwig A.                             Hothorn, Fritz Leisch and Rafael Weißbach provided good advice with some                             statistical and technical problems. We are also very grateful to Achim Zeileis                             for reading the entire manuscript, pointing out inconsistencies or even bugs                             and for making many suggestions which have led to improvements. Lastly we                             would like to thank the CRC Press staff, in particular Rob Calver, for their                             support during the preparation of the book. Any errors in the book are, of                             course, the joint responsibility of the two authors.                                                          Brian S. Everitt and Torsten Hothorn                                                               London and Erlangen, December 2005                             © 2010 by Taylor and Francis Group, LLC
List of Figures                                 1.1   Histograms of the market value and the logarithm of the                                       market value for the companies contained in the Forbes 2000                                       list.                                                   19                                 1.2   Raw scatterplot of the logarithms of market value and sales.  20                                 1.3   Scatterplot with transparent shading of points of the loga-                                       rithms of market value and sales.                       21                                 1.4   Boxplots of the logarithms of the market value for four                                       selected countries, the width of the boxes is proportional to                                       the square roots of the number of companies.            22                                 2.1   Histogram (top) and boxplot (bottom) of malignant melanoma                                       mortality rates.                                        30                                 2.2   Parallel boxplots of malignant melanoma mortality rates by                                       contiguity to an ocean.                                 31                                 2.3   Estimated densities of malignant melanoma mortality rates                                       by contiguity to an ocean.                              32                                 2.4   Scatterplot of malignant melanoma mortality rates by geo-                                       graphical location.                                     33                                 2.5   Scatterplot of malignant melanoma mortality rates against                                       latitude.                                               34                                 2.6   Bar chart of happiness.                                 35                                 2.7   Spineplot of health status and happiness.               36                                 2.8   Spinogram (left) and conditional density plot (right) of                                       happiness depending on log-income                       38                                 3.1   Boxplots of estimates of room width in feet and metres (after                                       conversion to feet) and normal probability plots of estimates                                       of room width made in feet and in metres.               55                                 3.2   R output of the independent samples t-test for the roomwidth                                       data.                                                   56                                 3.3   R output of the independent samples Welch test for the                                       roomwidth data.                                         56                                 3.4   R output of the Wilcoxon rank sum test for the roomwidth                                       data.                                                   57                                 3.5   Boxplot and normal probability plot for differences between                                       the two mooring methods.                                58                              © 2010 by Taylor and Francis Group, LLC
3.6   R output of the paired t-test for the waves data.       59                                3.7   R output of the Wilcoxon signed rank test for the waves                                      data.                                                   59                                3.8   Enhanced scatterplot of water hardness and mortality,                                      showing both the joint and the marginal distributions and,                                      in addition, the location of the city by different plotting                                      symbols.                                                60                                3.9   R output of Pearsons’ correlation coefficient for the water                                      data.                                                   61                                3.10  R output of the chi-squared test for the pistonrings data.  61                                3.11  Association plot of the residuals for the pistonrings data.  62                                3.12  R output of McNemar’s test for the rearrests data.      63                                3.13  R output of an exact version of McNemar’s test for the                                      rearrests data computed via a binomial test.            63                                4.1   An approximation for the conditional distribution of the                                      difference of mean roomwidth estimates in the feet and                                      metres group under the null hypothesis. The vertical lines                                      show the negative and positive absolute value of the test                                      statistic T obtained from the original data.            71                                4.2   R output of the exact permutation test applied to the                                      roomwidth data.                                         72                                4.3   R output of the exact conditional Wilcoxon rank sum test                                      applied to the roomwidth data.                          73                                4.4   R output of Fisher’s exact test for the suicides data.  73                                5.1   Plot of mean weight gain for each level of the two factors.  84                                5.2   R output of the ANOVA fit for the weightgain data.       85                                5.3   Interaction plot of type and source.                    86                                5.4   Plot of mean litter weight for each level of the two factors for                                      the foster data.                                        87                                5.5   Graphical presentation of multiple comparison results for the                                      foster feeding data.                                    90                                5.6   Scatterplot matrix of epoch means for Egyptian skulls data.  92                                6.1   Scatterplot of velocity and distance.                  104                                6.2   Scatterplot of velocity and distance with estimated regression                                      line (left) and plot of residuals against fitted values (right).  105                                6.3   Boxplots of rainfall.                                  107                                6.4   Scatterplots of rainfall against the continuous covariates.  108                                6.5   R output of the linear model fit for the clouds data.   109                                6.6   Regression relationship between S-Ne criterion and rainfall                                      with and without seeding.                              111                                6.7   Plot of residuals against fitted values for clouds seeding                                      data.                                                  113                             © 2010 by Taylor and Francis Group, LLC
6.8   Normal probability plot of residuals from cloud seeding model                                       clouds_lm.                                             114                                 6.9   Index plot of Cook’s distances for cloud seeding data.  115                                 7.1   Conditional density plots of the erythrocyte sedimentation                                       rate (ESR) given fibrinogen and globulin.               123                                 7.2   R output of the summary method for the logistic regression                                       model fitted to ESR and fibrigonen.                      124                                 7.3   R output of the summary method for the logistic regression                                       model fitted to ESR and both globulin and fibrinogen.    125                                 7.4   Bubbleplot of fitted values for a logistic regression model                                       fitted to the plasma data.                              126                                 7.5   R output of the summary method for the logistic regression                                       model fitted to the womensrole data.                    127                                 7.6   Fitted (from womensrole_glm_1) and observed probabilities                                       of agreeing for the womensrole data.                   129                                 7.7   R output of the summary method for the logistic regression                                       model fitted to the womensrole data.                    130                                 7.8   Fitted (from womensrole_glm_2) and observed probabilities                                       of agreeing for the womensrole data.                   131                                 7.9   Plot of deviance residuals from logistic regression model fitted                                       to the womensrole data.                                132                                 7.10  R output of the summary method for the Poisson regression                                       model fitted to the polyps data.                        133                                 7.11  R output of the print method for the conditional logistic                                       regression model fitted to the backpain data.           136                                 8.1   Three commonly used kernel functions.                  144                                 8.2   Kernel estimate showing the contributions of Gaussian kernels                                       evaluated for the individual observations with bandwidth                                       h = 0.4.                                               145                                 8.3   Epanechnikov kernel for a grid between (−1.1, −1.1) and                                       (1.1, 1.1).                                            146                                 8.4   Density estimates of the geyser eruption data imposed on a                                       histogram of the data.                                 148                                 8.5   A contour plot of the bivariate density estimate of the                                       CYGOB1 data, i.e., a two-dimensional graphical display for a                                       three-dimensional problem.                             149                                 8.6   The bivariate density estimate of the CYGOB1 data, here shown                                       in a three-dimensional fashion using the persp function.  150                                 8.7   Fitted normal density and two-component normal mixture                                       for geyser eruption data.                              152                                 8.8   Bootstrap distribution and confidence intervals for the mean                                       estimates of a two-component mixture for the geyser data.  155                              © 2010 by Taylor and Francis Group, LLC
9.1   Initial tree for the body fat data with the distribution of body                                      fat in terminal nodes visualised via boxplots.         166                                9.2   Pruned regression tree for body fat data.              167                                9.3   Observed and predicted DXA measurements.               168                                9.4   Pruned classification tree of the glaucoma data with class                                      distribution in the leaves.                            169                                9.5   Estimated class probabilities depending on two important                                      variables. The 0.5 cut-off for the estimated glaucoma proba-                                      bility is depicted as a horizontal line. Glaucomateous eyes are                                      plotted as circles and normal eyes are triangles.      172                                9.6   Conditional inference tree with the distribution of body fat                                      content shown for each terminal leaf.                  173                                9.7   Conditional inference tree with the distribution of glaucoma-                                      teous eyes shown for each terminal leaf.               174                                10.1  A linear spline function with knots at a = 1, b = 3 and c = 5.  183                                10.2  Scatterplot of year and winning time.                  187                                10.3  Scatterplot of year and winning time with fitted values from                                      a simple linear model.                                 188                                10.4  Scatterplot of year and winning time with fitted values from                                      a smooth non-parametric model.                         189                                10.5  Scatterplot of year and winning time with fitted values from                                      a quadratic model.                                     190                                10.6  Partial contributions of six exploratory covariates to the                                      predicted SO 2 concentration.                          191                                10.7  Residual plot of SO 2 concentration.                   192                                10.8  Spinograms of the three exploratory variables and response                                      variable kyphosis.                                     193                                10.9  Partial contributions of three exploratory variables with                                      confidence bands.                                       194                                11.1  ‘Bath tub’ shape of a hazard function.                 202                                11.2  Survival times comparing treated and control patients.  205                                11.3  Kaplan-Meier estimates for breast cancer patients who either                                      received a hormonal therapy or not.                    207                                11.4  R output of the summary method for GBSG2_coxph.        208                                11.5  Estimated regression coefficient for age depending on time                                      for the GBSG2 data.                                    209                                11.6  Martingale residuals for the GBSG2 data.               210                                11.7  Conditional inference tree for the GBSG2 data with the                                      survival function, estimated by Kaplan-Meier, shown for                                      every subgroup of patients identified by the tree.      211                                12.1  Boxplots for the repeated measures by treatment group for                                      the BtheB data.                                        220                             © 2010 by Taylor and Francis Group, LLC
12.2  R output of the linear mixed-effects model fit for the BtheB                                       data.                                                  222                                 12.3  R output of the asymptotic p-values for linear mixed-effects                                       model fit for the BtheB data.                           223                                 12.4  Quantile-quantile plots of predicted random intercepts and                                       residuals for the random intercept model BtheB_lmer1 fitted                                       to the BtheB data.                                     224                                 12.5  Distribution of BDI values for patients that do (circles) and                                       do not (bullets) attend the next scheduled visit.      227                                 13.1  Simulation of a positive response in a random intercept                                       logistic regression model for 20 subjects. The thick line is the                                       average over all 20 subjects.                          237                                 13.2  R output of the summary method for the btb_gee model                                       (slightly abbreviated).                                239                                 13.3  R output of the summary method for the btb_gee1 model                                       (slightly abbreviated).                                240                                 13.4  R output of the summary method for the resp_glm model.  241                                 13.5  R output of the summary method for the resp_gee1 model                                       (slightly abbreviated).                                242                                 13.6  R output of the summary method for the resp_gee2 model                                       (slightly abbreviated).                                243                                 13.7  Boxplots of numbers of seizures in each two-week period post                                       randomisation for placebo and active treatments.       244                                 13.8  Boxplots of log of numbers of seizures in each two-week period                                       post randomisation for placebo and active treatments.  245                                 13.9  R output of the summary method for the epilepsy_glm                                       model.                                                 246                                 13.10 R output of the summary method for the epilepsy_gee1                                       model (slightly abbreviated).                          247                                 13.11 R output of the summary method for the epilepsy_gee2                                       model (slightly abbreviated).                          248                                 13.12 R output of the summary method for the epilepsy_gee3                                       model (slightly abbreviated).                          249                                 13.13 R output of the summary method for the resp_lmer model                                       (abbreviated).                                         249                                 14.1  Distribution of levels of expressed alpha synuclein mRNA in                                       three groups defined by the NACP-REP1 allele lengths.   258                                 14.2  Simultaneous confidence intervals for the alpha data based                                       on the ordinary covariance matrix (left) and a sandwich                                       estimator (right).                                     261                                 14.3  Probability of damage caused by roe deer browsing for six                                       tree species. Sample sizes are given in brackets.      263                              © 2010 by Taylor and Francis Group, LLC
14.4  Regression relationship between S-Ne criterion and rainfall                                      with and without seeding. The confidence bands cover the                                      area within the dashed curves.                         265                                15.1  R output of the summary method for smokingOR.          274                                15.2  Forest plot of observed effect sizes and 95% confidence                                      intervals for the nicotine gum studies.                275                                15.3  R output of the summary method for BCG_OR.             277                                15.4  R output of the summary method for BCG_DSL.            278                                15.5  R output of the summary method for BCG_mod.            279                                15.6  Plot of observed effect size for the BCG vaccine data against                                      latitude, with a weighted least squares regression fit shown in                                      addition.                                              280                                15.7  Example funnel plots from simulated data. The asymmetry                                      in the lower plot is a hint that a publication bias might be a                                      problem.                                               281                                15.8  Funnel plot for nicotine gum data.                     282                                16.1  Scatterplot matrix for the heptathlon data (all countries).  289                                16.2  Scatterplot matrix for the heptathlon data after removing                                      observations of the PNG competitor.                    291                                16.3  Barplot of the variances explained by the principal compo-                                      nents. (with observations for PNG removed).            294                                16.4  Biplot of the (scaled) first two principal components (with                                      observations for PNG removed).                         295                                16.5  Scatterplot of the score assigned to each athlete in 1988 and                                      the first principal component.                          296                                17.1  Two-dimensional solution from classical multidimensional                                      scaling of distance matrix for water vole populations.  306                                17.2  Minimum spanning tree for the watervoles data.         308                                17.3  Two-dimensional solution from non-metric multidimensional                                      scaling of distance matrix for voting matrix.          309                                17.4  The Shepard diagram for the voting data shows some                                      discrepancies between the original dissimilarities and the                                      multidimensional scaling solution.                     310                                18.1  Bivariate data showing the presence of three clusters.  319                                18.2  Example of a dendrogram.                               321                                18.3  Darwin’s Tree of Life.                                 322                                18.4  Image plot of the dissimilarity matrix of the pottery data.  326                                18.5  Hierarchical clustering of pottery data and resulting den-                                      drograms.                                              327                                18.6  3D scatterplot of the logarithms of the three variables                                      available for each of the exoplanets.                  328                             © 2010 by Taylor and Francis Group, LLC
18.7  Within-cluster sum of squares for different numbers of clusters                                       for the exoplanet data.                                329                                 18.8  Plot of BIC values for a variety of models and a range of                                       number of clusters.                                    331                                 18.9  Scatterplot matrix of planets data showing a three-cluster                                       solution from Mclust.                                  332                                 18.10 3D scatterplot of planets data showing a three-cluster solution                                       from Mclust.                                           333                              © 2010 by Taylor and Francis Group, LLC
List of Tables                                 2.1   USmelanoma data. USA mortality rates for white males due                                       to malignant melanoma.                                  25                                 2.2   CHFLS data. Chinese Health and Family Life Survey.      28                                 2.3   household data. Household expenditure for single men and                                       women.                                                  40                                 2.4   suicides2 data. Mortality rates per 100, 000 from male                                       suicides.                                               41                                 2.5   USstates data. Socio-demographic variables for ten US                                       states.                                                 42                                 2.6   banknote data (package alr3). Swiss bank note data.     43                                 3.1   roomwidth data. Room width estimates (width) in feet and                                       in metres (unit).                                       45                                 3.2   waves data. Bending stress (root mean squared bending                                       moment in Newton metres) for two mooring methods in a                                       wave energy experiment.                                 46                                 3.3   water data. Mortality (per 100,000 males per year, mor-                                       tality) and water hardness for 61 cities in England and                                       Wales.                                                  47                                 3.4   pistonrings data. Number of piston ring failures for three                                       legs of four compressors.                               49                                 3.5   rearrests data. Rearrests of juvenile felons by type of court                                       in which they were tried.                               49                                 3.6   The general r × c table.                                52                                 3.7   Frequencies in matched samples data.                    53                                 4.1   suicides data. Crowd behaviour at threatened                                       suicides.                                               66                                 4.2   Classification system for the response variable.         66                                 4.3   Lanza data. Misoprostol randomised clinical trial from Lanza                                       (1987).                                                 66                                 4.4   Lanza data. Misoprostol randomised clinical trial from Lanza                                       et al. (1988a).                                         67                                 4.5   Lanza data. Misoprostol randomised clinical trial from Lanza                                       et al. (1988b).                                         67                              © 2010 by Taylor and Francis Group, LLC
4.6   Lanza data. Misoprostol randomised clinical trial from Lanza                                      et al. (1989).                                          67                                4.7   anomalies data. Abnormalities of the face and digits of                                      newborn infants exposed to antiepileptic drugs as assessed by                                      a paediatrician (MD) and a research assistant (RA).     68                                4.8   orallesions data. Oral lesions found in house-to-house                                      surveys in three geographic regions of rural India.     78                                5.1   weightgain data. Rat weight gain for diets differing by the                                      amount of protein (type) and source of protein (source).  79                                5.2   foster data. Foster feeding experiment for rats with different                                      genotypes of the litter (litgen) and mother (motgen).   80                                5.3   skulls data. Measurements of four variables taken from                                      Egyptian skulls of five periods.                         81                                5.4   schooldays data. Days absent from school.               95                                5.5   students data. Treatment and results of two tests in three                                      groups of students.                                     96                                6.1   hubble data. Distance and velocity for 24 galaxies.     97                                6.2   clouds data. Cloud seeding experiments in Florida – see                                      above for explanations of the variables.                98                                6.3   Analysis of variance table for the multiple linear regression                                      model.                                                 102                                7.1   plasma data. Blood plasma data.                        117                                7.2   womensrole data. Women’s role in society data.         118                                7.3   polyps data. Number of polyps for two treatment arms.  119                                                                                        ¯                                7.4   backpain data. Number of drivers (D) and non-drivers (D),                                                                     ¯                                      suburban (S) and city inhabitants (S) either suffering from a                                      herniated disc (cases) or not (controls).              120                                7.5   bladdercancer data. Number of recurrent tumours for                                      bladder cancer patients.                               137                                7.6   leuk data (package MASS). Survival times of patients                                      suffering from leukemia.                                138                                8.1   faithful data (package datasets). Old Faithful geyser waiting                                      times between two eruptions.                           139                                8.2   CYGOB1 data. Energy output and surface temperature of Star                                      Cluster CYG OB1.                                       141                                8.3   galaxies data (package MASS). Velocities of 82 galaxies.  156                                8.4   birthdeathrates data. Birth and death rates for 69 coun-                                      tries.                                                 157                                8.5   schizophrenia data. Age on onset of schizophrenia for both                                      sexes.                                                 158                             © 2010 by Taylor and Francis Group, LLC
9.1   bodyfat data (package mboost). Body fat prediction by                                       skinfold thickness, circumferences, and bone breadths.  161                                 10.1  men1500m data. Olympic Games 1896 to 2004 winners of the                                       men’s 1500m.                                           177                                 10.2  USairpollution data. Air pollution in 41 US cities.    178                                 10.3  kyphosis data (package rpart). Children who have had                                       corrective spinal surgery.                             180                                 11.1  glioma data. Patients suffering from two types of glioma                                       treated with the standard therapy or a novel radioim-                                       munotherapy (RIT).                                     197                                 11.2  GBSG2 data (package ipred). Randomised clinical trial data                                       from patients suffering from node-positive breast cancer. Only                                       the data of the first 20 patients are shown here.       199                                 11.3  mastectomy data. Survival times in months after mastectomy                                       of women with breast cancer.                           212                                 12.1  BtheB data. Data of a randomised trial evaluating the effects                                       of Beat the Blues.                                     214                                 12.2  phosphate data. Plasma inorganic phosphate levels for                                       various time points after glucose challenge.           228                                 13.1  respiratory data. Randomised clinical trial data from                                       patients suffering from respiratory illness. Only the data of                                       the first seven patients are shown here.                231                                 13.2  epilepsy data. Randomised clinical trial data from patients                                       suffering from epilepsy. Only the data of the first seven                                       patients are shown here.                               232                                 13.3  schizophrenia2 data. Clinical trial data from patients                                       suffering from schizophrenia. Only the data of the first four                                       patients are shown here.                               251                                 14.1  alpha data (package coin). Allele length and levels of                                       expressed alpha synuclein mRNA in alcohol-dependent                                       patients.                                              253                                 14.2  trees513 data (package multcomp).                      255                                 15.1  smoking data. Meta-analysis on nicotine gum showing the                                       number of quitters who have been treated (qt), the total                                       number of treated (tt) as well as the number of quitters in                                       the control group (qc) with total number of smokers in the                                       control group (tc).                                    268                              © 2010 by Taylor and Francis Group, LLC
15.2  BCG data. Meta-analysis on BCG vaccine with the following                                      data: the number of TBC cases after a vaccination with BCG                                      (BCGTB), the total number of people who received BCG (BCG)                                      as well as the number of TBC cases without vaccination                                      (NoVaccTB) and the total number of people in the study                                      without vaccination (NoVacc).                          269                                15.4  aspirin data. Meta-analysis on aspirin and myocardial                                      infarct, the table shows the number of deaths after placebo                                      (dp), the total number subjects treated with placebo (tp) as                                      well as the number of deaths after aspirin (da) and the total                                      number of subjects treated with aspirin (ta).          283                                15.5  toothpaste data. Meta-analysis on trials comparing two                                      toothpastes, the number of individuals in the study, the mean                                      and the standard deviation for each study A and B are shown.  284                                16.1  heptathlon data. Results Olympic heptathlon, Seoul, 1988.  286                                16.2  meteo data. Meteorological measurements in an 11-year                                      period.                                                297                                16.3  Correlations for calculus measurements for the six anterior                                      mandibular teeth.                                      297                                17.1  watervoles data. Water voles data – dissimilarity matrix.  300                                17.2  voting data. House of Representatives voting data.     301                                17.3  eurodist data (package datasets). Distances between Euro-                                      pean cities, in km.                                    312                                17.4  gardenflowers data. Dissimilarity matrix of 18 species of                                      gardenflowers.                                          313                                18.1  pottery data. Romano-British pottery data.             315                                18.2  planets data. Jupiter mass, period and eccentricity of                                      exoplanets.                                            317                                18.3  Number of possible partitions depending on the sample size                                      n and number of clusters k.                            322                             © 2010 by Taylor and Francis Group, LLC
Contents                              1 An Introduction to R                                            1                                 1.1 What is R?                                                 1                                 1.2 Installing R                                               2                                 1.3 Help and Documentation                                     4                                 1.4 Data Objects in R                                          5                                 1.5 Data Import and Export                                     9                                 1.6 Basic Data Manipulation                                   11                                 1.7 Computing with Data                                       14                                 1.8 Organising an Analysis                                    20                                 1.9 Summary                                                   21                              2 Data Analysis Using Graphical Displays                         25                                 2.1 Introduction                                              25                                 2.2 Initial Data Analysis                                     27                                 2.3 Analysis Using R                                          29                                 2.4 Summary                                                   38                              3 Simple Inference                                               45                                 3.1 Introduction                                              45                                 3.2 Statistical Tests                                         49                                 3.3 Analysis Using R                                          53                                 3.4 Summary                                                   63                              4 Conditional Inference                                          65                                 4.1 Introduction                                              65                                 4.2 Conditional Test Procedures                               68                                 4.3 Analysis Using R                                          70                                 4.4 Summary                                                   77                              5 Analysis of Variance                                           79                                 5.1 Introduction                                              79                                 5.2 Analysis of Variance                                      82                                 5.3 Analysis Using R                                          83                                 5.4 Summary                                                   94                              © 2010 by Taylor and Francis Group, LLC
6 Simple and Multiple Linear Regression                          97                                6.1 Introduction                                              97                                6.2 Simple Linear Regression                                  99                                6.3 Multiple Linear Regression                               100                                6.4 Analysis Using R                                         103                                6.5 Summary                                                  112                             7 Logistic Regression and Generalised Linear Models             117                                7.1 Introduction                                             117                                7.2 Logistic Regression and Generalised Linear Models        120                                7.3 Analysis Using R                                         122                                7.4 Summary                                                  136                             8 Density Estimation                                            139                                8.1 Introduction                                             139                                8.2 Density Estimation                                       141                                8.3 Analysis Using R                                         147                                8.4 Summary                                                  155                             9 Recursive Partitioning                                        161                                9.1 Introduction                                             161                                9.2 Recursive Partitioning                                   164                                9.3 Analysis Using R                                         165                                9.4 Summary                                                  174                             10 Smoothers and Generalised Additive Models                    177                                10.1 Introduction                                            177                                10.2 Smoothers and Generalised Additive Models               181                                10.3 Analysis Using R                                        186                             11 Survival Analysis                                            197                                11.1 Introduction                                            197                                11.2 Survival Analysis                                       198                                11.3 Analysis Using R                                        204                                11.4 Summary                                                 211                             12 Analysing Longitudinal Data I                                213                                12.1 Introduction                                            213                                12.2 Analysing Longitudinal Data                             216                                12.3 Linear Mixed Effects Models                              217                                12.4 Analysis Using R                                        219                                12.5 Prediction of Random Effects                             223                                12.6 The Problem of Dropouts                                 223                                12.7 Summary                                                 226                             © 2010 by Taylor and Francis Group, LLC
13 Analysing Longitudinal Data II                              231                                 13.1 Introduction                                            231                                 13.2 Methods for Non-normal Distributions                    233                                 13.3 Analysis Using R: GEE                                   238                                 13.4 Analysis Using R: Random Effects                         247                                 13.5 Summary                                                 250                              14 Simultaneous Inference and Multiple Comparisons             253                                 14.1 Introduction                                            253                                 14.2 Simultaneous Inference and Multiple Comparisons         256                                 14.3 Analysis Using R                                        257                                 14.4 Summary                                                 264                              15 Meta-Analysis                                               267                                 15.1 Introduction                                            267                                 15.2 Systematic Reviews and Meta-Analysis                    269                                 15.3 Statistics of Meta-Analysis                             271                                 15.4 Analysis Using R                                        273                                 15.5 Meta-Regression                                         276                                 15.6 Publication Bias                                        277                                 15.7 Summary                                                 279                              16 Principal Component Analysis                                285                                 16.1 Introduction                                            285                                 16.2 Principal Component Analysis                            285                                 16.3 Analysis Using R                                        288                                 16.4 Summary                                                 295                              17 Multidimensional Scaling                                    299                                 17.1 Introduction                                            299                                 17.2 Multidimensional Scaling                                299                                 17.3 Analysis Using R                                        305                                 17.4 Summary                                                 310                              18 Cluster Analysis                                            315                                 18.1 Introduction                                            315                                 18.2 Cluster Analysis                                        318                                 18.3 Analysis Using R                                        325                                 18.4 Summary                                                 334                              Bibliography                                                   335                              © 2010 by Taylor and Francis Group, LLC
CHAPTER 1                                             An Introduction to R                              1.1 What is R?                              The R system for statistical computing is an environment for data analysis and                              graphics. The root of R is the S language, developed by John Chambers and                              colleagues (Becker et al., 1988, Chambers and Hastie, 1992, Chambers, 1998)                              at Bell Laboratories (formerly AT&T, now owned by Lucent Technologies)                              starting in the 1960ies. The S language was designed and developed as a                              programming language for data analysis tasks but in fact it is a full-featured                              programming language in its current implementations.                                The development of the R system for statistical computing is heavily influ-                              enced by the open source idea: The base distribution of R and a large number                              of user contributed extensions are available under the terms of the Free Soft-                              ware Foundation’s GNU General Public License in source code form. This                              licence has two major implications for the data analyst working with R. The                              complete source code is available and thus the practitioner can investigate the                              details of the implementation of a special method, can make changes and can                              distribute modifications to colleagues. As a side-effect, the R system for statis-                              tical computing is available to everyone. All scientists, including, in particular,                              those working in developing countries, now have access to state-of-the-art tools                              for statistical data analysis without additional costs. With the help of the R                              system for statistical computing, research really becomes reproducible when                              both the data and the results of all data analysis steps reported in a paper are                              available to the readers through an R transcript file. R is most widely used for                              teaching undergraduate and graduate statistics classes at universities all over                              the world because students can freely use the statistical computing tools.                                The base distribution of R is maintained by a small group of statisticians,                              the R Development Core Team. A huge amount of additional functionality is                              implemented in add-on packages authored and maintained by a large group of                              volunteers. The main source of information about the R system is the world                              wide web with the official home page of the R project being                                                   http://www.R-project.org                              All resources are available from this page: the R system itself, a collection of                              add-on packages, manuals, documentation and more.                                The intention of this chapter is to give a rather informal introduction to                              basic concepts and data manipulation techniques for the R novice. Instead                              of a rigid treatment of the technical background, the most common tasks                                                               1                              © 2010 by Taylor and Francis Group, LLC
© 2010 by Taylor and Francis Group, LLC
INSTALLING R                                                      3                              One can change the appearance of the prompt by                              > options(prompt = \"R> \")                              and we will use the prompt R> for the display of the code examples throughout                              this book. A + sign at the very beginning of a line indicates a continuing                              command after a newline.                                Essentially, the R system evaluates commands typed on the R prompt and                              returns the results of the computations. The end of a command is indicated                              by the return key. Virtually all introductory texts on R start with an example                              using R as a pocket calculator, and so do we:                              R> x <- sqrt(25) + 2                                                                               √                              This simple statement asks the R interpreter to calculate  25 and then to add                              2. The result of the operation is assigned to an R object with variable name x.                              The assignment operator <- binds the value of its right hand side to a variable                              name on the left hand side. The value of the object x can be inspected simply                              by typing                              R> x                              [1] 7                              which, implicitly, calls the print method:                              R> print(x)                              [1] 7                              1.2.2 Packages                              The base distribution already comes with some high-priority add-on packages,                              namely                                          mgcv      KernSmooth   MASS       base                                          boot      class        cluster    codetools                                          datasets  foreign      grDevices  graphics                                          grid      lattice      methods    nlme                                          nnet      rcompgen     rpart      spatial                                          splines   stats        stats4     survival                                          tcltk     tools        utils                              Some of the packages listed here implement standard statistical functionality,                              for example linear models, classical tests, a huge collection of high-level plot-                              ting functions or tools for survival analysis; many of these will be described                              and used in later chapters. Others provide basic infrastructure, for example                              for graphic systems, code analysis tools, graphical user-interfaces or other util-                              ities.                                Packages not included in the base distribution can be installed directly                              from the R prompt. At the time of writing this chapter, 1756 user-contributed                              packages covering almost all fields of statistical methodology were available.                              Certain so-called ‘task views’ for special topics, such as statistics in the social                              sciences, environmetrics, robust statistics etc., describe important and helpful                              packages and are available from                              © 2010 by Taylor and Francis Group, LLC
4                                       AN INTRODUCTION TO R                                            http://CRAN.R-project.org/web/views/                               Given that an Internet connection is available, a package is installed by                             supplying the name of the package to the function install.packages. If,                             for example, add-on functionality for robust estimation of covariance matrices                             via sandwich estimators is required (for example in Chapter 13), the sandwich                             package (Zeileis, 2004) can be downloaded and installed via                             R> install.packages(\"sandwich\")                             The package functionality is available after attaching the package by                             R> library(\"sandwich\")                             A comprehensive list of available packages can be obtained from                                          http://CRAN.R-project.org/web/packages/                             Note that on Windows operating systems, precompiled versions of packages                             are downloaded and installed. In contrast, packages are compiled locally before                             they are installed on Unix systems.                             1.3 Help and Documentation                             Roughly, three different forms of documentation for the R system for statis-                             tical computing may be distinguished: online help that comes with the base                             distribution or packages, electronic manuals and publications work in the form                             of books etc.                               The help system is a collection of manual pages describing each user-visible                             function and data set that comes with R. A manual page is shown in a pager                             or web browser when the name of the function we would like to get help for                             is supplied to the help function                             R> help(\"mean\")                             or, for short,                             R> ?mean                             Each manual page consists of a general description, the argument list of the                             documented function with a description of each single argument, information                             about the return value of the function and, optionally, references, cross-links                             and, in most cases, executable examples. The function help.search is helpful                             for searching within manual pages. An overview on documented topics in an                             add-on package is given, for example for the sandwich package, by                             R> help(package = \"sandwich\")                             Often a package comes along with an additional document describing the pack-                             age functionality and giving examples. Such a document is called a vignette                             (Leisch, 2003, Gentleman, 2005). For example, the sandwich package vignette                             is opened using                             R> vignette(\"sandwich\", package = \"sandwich\")                               More extensive documentation is available electronically from the collection                             of manuals at                             © 2010 by Taylor and Francis Group, LLC
DATA OBJECTS IN R                                                 5                                           http://CRAN.R-project.org/manuals.html                              For the beginner, at least the first and the second document of the following                              four manuals (R Development Core Team, 2009a,c,d,e) are mandatory:                              An Introduction to R: A more formal introduction to data analysis with                                R than this chapter.                              R Data Import/Export: A very useful description of how to read and write                                various external data formats.                              R Installation and Administration: Hints for installing R on special plat-                                forms.                              Writing R Extensions: The authoritative source on how to write R pro-                                grams and packages.                                Both printed and online publications are available, the most important ones                              are Modern Applied Statistics with S (Venables and Ripley, 2002), Introductory                              Statistics with R (Dalgaard, 2002), R Graphics (Murrell, 2005) and the R                              Newsletter, freely available from                                            http://CRAN.R-project.org/doc/Rnews/                              In case the electronically available documentation and the answers to fre-                              quently asked questions (FAQ), available from                                             http://CRAN.R-project.org/faqs.html                              have been consulted but a problem or question remains unsolved, the r-help                              email list is the right place to get answers to well-thought-out questions. It is                              helpful to read the posting guide                                         http://www.R-project.org/posting-guide.html                              before starting to ask.                              1.4 Data Objects in R                              The data handling and manipulation techniques explained in this chapter will                              be illustrated by means of a data set of 2000 world leading companies, the                              Forbes 2000 list for the year 2004 collected by Forbes Magazine. This list is                              originally available from                                                    http://www.forbes.com                              and, as an R data object, it is part of the HSAUR2 package (Source: From                              Forbes.com, New York, New York, 2004. With permission.). In a first step, we                              make the data available for computations within R. The data function searches                              for data objects of the specified name (\"Forbes2000\") in the package specified                              via the package argument and, if the search was successful, attaches the data                              object to the global environment:                              R> data(\"Forbes2000\", package = \"HSAUR2\")                              R> ls()                              [1] \"x\"           \"Forbes2000\"                              © 2010 by Taylor and Francis Group, LLC
6                                       AN INTRODUCTION TO R                             The output of the ls function lists the names of all objects currently stored in                             the global environment, and, as the result of the previous command, a variable                             named Forbes2000 is available for further manipulation. The variable x arises                             from the pocket calculator example in Subsection 1.2.1.                               As one can imagine, printing a list of 2000 companies via                             R> print(Forbes2000)                               rank                  name        country       category  sales                             1    1            Citigroup United States          Banking  94.71                             2    2     General Electric United States Conglomerates 134.19                             3    3 American Intl Group United States        Insurance   76.66                               profits   assets marketvalue                             1   17.85 1264.03        255.30                             2   15.59   626.93       328.54                             3    6.46   647.66       194.87                             ...                             will not be particularly helpful in gathering some initial information about                             the data; it is more useful to look at a description of their structure found by                             using the following command                             R> str(Forbes2000)                             'data.frame':          2000 obs. of   8 variables:                              $ rank        : int   1 2 3 4 5 ...                              $ name        : chr   \"Citigroup\" \"General Electric\" ...                              $ country     : Factor w/ 61 levels \"Africa\",\"Australia\",...                              $ category    : Factor w/ 27 levels \"Aerospace & defense\",..                              $ sales       : num   94.7 134.2 ...                              $ profits     : num   17.9 15.6 ...                              $ assets      : num   1264 627 ...                              $ marketvalue: num    255 329 ...                             The output of the str function tells us that Forbes2000 is an object of class                             data.frame, the most important data structure for handling tabular statistical                             data in R. As expected, information about 2000 observations, i.e., companies,                             are stored in this object. For each observation, the following eight variables                             are available:                              rank: the ranking of the company,                              name: the name of the company,                              country: the country the company is situated in,                              category: a category describing the products the company produces,                              sales: the amount of sales of the company in billion US dollars,                              profits: the profit of the company in billion US dollars,                              assets: the assets of the company in billion US dollars,                              marketvalue: the market value of the company in billion US dollars.                             A similar but more detailed description is available from the help page for the                             Forbes2000 object:                             © 2010 by Taylor and Francis Group, LLC
DATA OBJECTS IN R                                                 7                              R> help(\"Forbes2000\")                              or                              R> ?Forbes2000                              All information provided by str can be obtained by specialised functions as                              well and we will now have a closer look at the most important of these.                                The R language is an object-oriented programming language, so every object                              is an instance of a class. The name of the class of an object can be determined                              by                              R> class(Forbes2000)                              [1] \"data.frame\"                              Objects of class data.frame represent data the traditional table-oriented way.                              Each row is associated with one single observation and each column corre-                              sponds to one variable. The dimensions of such a table can be extracted using                              the dim function                              R> dim(Forbes2000)                              [1] 2000     8                              Alternatively, the numbers of rows and columns can be found using                              R> nrow(Forbes2000)                              [1] 2000                              R> ncol(Forbes2000)                              [1] 8                              The results of both statements show that Forbes2000 has 2000 rows, i.e.,                              observations, the companies in our case, with eight variables describing the                              observations. The variable names are accessible from                              R> names(Forbes2000)                              [1] \"rank\"         \"name\"         \"country\"      \"category\"                              [5] \"sales\"        \"profits\"      \"assets\"       \"marketvalue\"                              The values of single variables can be extracted from the Forbes2000 object                              by their names, for example the ranking of the companies                              R> class(Forbes2000[,\"rank\"])                              [1] \"integer\"                              is stored as an integer variable. Brackets [] always indicate a subset of a larger                              object, in our case a single variable extracted from the whole table. Because                              data.frames have two dimensions, observations and variables, the comma is                              required in order to specify that we want a subset of the second dimension,                              i.e., the variables. The rankings for all 2000 companies are represented in a                              vector structure the length of which is given by                              R> length(Forbes2000[,\"rank\"])                              [1] 2000                              © 2010 by Taylor and Francis Group, LLC
8                                       AN INTRODUCTION TO R                             A vector is the elementary structure for data handling in R and is a set of                             simple elements, all being objects of the same class. For example, a simple                             vector of the numbers one to three can be constructed by one of the following                             commands                             R> 1:3                             [1] 1 2 3                             R> c(1,2,3)                             [1] 1 2 3                             R> seq(from = 1, to = 3, by = 1)                             [1] 1 2 3                               The unique names of all 2000 companies are stored in a character vector                             R> class(Forbes2000[,\"name\"])                             [1] \"character\"                             R> length(Forbes2000[,\"name\"])                             [1] 2000                             and the first element of this vector is                             R> Forbes2000[,\"name\"][1]                             [1] \"Citigroup\"                             Because the companies are ranked, Citigroup is the world’s largest company                             according to the Forbes 2000 list. Further details on vectors and subsetting                             are given in Section 1.6.                               Nominal measurements are represented by factor variables in R, such as the                             category of the company’s business segment                             R> class(Forbes2000[,\"category\"])                             [1] \"factor\"                             Objects of class factor and character basically differ in the way their values                             are stored internally. Each element of a vector of class character is stored as a                             character variable whereas an integer variable indicating the level of a factor                             is saved for factor objects. In our case, there are                             R> nlevels(Forbes2000[,\"category\"])                             [1] 27                             different levels, i.e., business categories, which can be extracted by                             R> levels(Forbes2000[,\"category\"])                             [1] \"Aerospace & defense\"                             [2] \"Banking\"                             [3] \"Business services & supplies\"                             ...                             © 2010 by Taylor and Francis Group, LLC
DATA IMPORT AND EXPORT                                            9                              As a simple summary statistic, the frequencies of the levels of such a factor                              variable can be found from                              R> table(Forbes2000[,\"category\"])                                       Aerospace & defense                          Banking                                                          19                            313                              Business services & supplies                                                          70                              ...                                The sales, assets, profits and market value variables are of type numeric,                              the natural data type for continuous or discrete measurements, for example                              R> class(Forbes2000[,\"sales\"])                              [1] \"numeric\"                              and simple summary statistics such as the mean, median and range can be                              found from                              R> median(Forbes2000[,\"sales\"])                              [1] 4.365                              R> mean(Forbes2000[,\"sales\"])                              [1] 9.69701                              R> range(Forbes2000[,\"sales\"])                              [1]   0.01 256.33                              The summary method can be applied to a numeric vector to give a set of useful                              summary statistics, namely the minimum, maximum, mean, median and the                              25% and 75% quartiles; for example                              R> summary(Forbes2000[,\"sales\"])                                 Min. 1st Qu.   Median     Mean 3rd Qu.     Max.                                0.010    2.018   4.365    9.697   9.548 256.300                              1.5 Data Import and Export                              In the previous section, the data from the Forbes 2000 list of the world’s largest                              companies were loaded into R from the HSAUR2 package but we will now ex-                              plore practically more relevant ways to import data into the R system. The                              most frequent data formats the data analyst is confronted with are comma sep-                              arated files, Excel spreadsheets, files in SPSS format and a variety of SQL data                              base engines. Querying data bases is a nontrivial task and requires additional                              knowledge about querying languages, and we therefore refer to the R Data                              Import/Export manual – see Section 1.3. We assume that a comma separated                              file containing the Forbes 2000 list is available as Forbes2000.csv (such a file                              is part of the HSAUR2 source package in directory HSAUR2/inst/rawdata).                              When the fields are separated by commas and each row begins with a name                              (a text format typically created by Excel), we can read in the data as follows                              using the read.table function                              © 2010 by Taylor and Francis Group, LLC
10                                      AN INTRODUCTION TO R                             R> csvForbes2000 <- read.table(\"Forbes2000.csv\",                             +      header = TRUE, sep = \",\", row.names = 1)                             The argument header = TRUE indicates that the entries in the first line of the                             text file \"Forbes2000.csv\" should be interpreted as variable names. Columns                             are separated by a comma (sep = \",\"), users of continental versions of Excel                             should take care of the character symbol coding for decimal points (by default                             dec = \".\"). Finally, the first column should be interpreted as row names but                             not as a variable (row.names = 1). Alternatively, the function read.csv can                             be used to read comma separated files. The function read.table by default                             guesses the class of each variable from the specified file. In our case, character                             variables are stored as factors                             R> class(csvForbes2000[,\"name\"])                             [1] \"factor\"                             which is only suboptimal since the names of the companies are unique. How-                             ever, we can supply the types for each variable to the colClasses argument                             R> csvForbes2000 <- read.table(\"Forbes2000.csv\",                             +      header = TRUE, sep = \",\", row.names = 1,                             +      colClasses = c(\"character\", \"integer\", \"character\",                             +           \"factor\", \"factor\", \"numeric\", \"numeric\", \"numeric\",                             +           \"numeric\"))                             R> class(csvForbes2000[,\"name\"])                             [1] \"character\"                             and check if this object is identical with our previous Forbes 2000 list object                             R> all.equal(csvForbes2000, Forbes2000)                             [1] TRUE                             The argument colClasses expects a character vector of length equal to the                             number of columns in the file. Such a vector can be supplied by the c function                             that combines the objects given in the parameter list into a vector                             R> classes <- c(\"character\", \"integer\", \"character\", \"factor\",                             +      \"factor\", \"numeric\", \"numeric\", \"numeric\", \"numeric\")                             R> length(classes)                             [1] 9                             R> class(classes)                             [1] \"character\"                               An R interface to the open data base connectivity standard (ODBC) is                             available in package RODBC and its functionality can be used to access Excel                             and Access files directly:                             R> library(\"RODBC\")                             R> cnct <- odbcConnectExcel(\"Forbes2000.xls\")                             R> sqlQuery(cnct, \"select * from \\"Forbes2000\\$\\"\")                             © 2010 by Taylor and Francis Group, LLC
BASIC DATA MANIPULATION                                          11                              The function odbcConnectExcel opens a connection to the specified Excel or                              Access file which can be used to send SQL queries to the data base engine and                              retrieve the results of the query.                                Files in SPSS format are read in a way similar to reading comma separated                              files, using the function read.spss from package foreign (which comes with                              the base distribution).                                Exporting data from R is now rather straightforward. A comma separated                              file readable by Excel can be constructed from a data.frame object via                              R> write.table(Forbes2000, file = \"Forbes2000.csv\", sep = \",\",                              +              col.names = NA)                              The function write.csv is one alternative and the functionality implemented                              in the RODBC package can be used to write data directly into Excel spread-                              sheets as well.                                Alternatively, when data should be saved for later processing in R only, R                              objects of arbitrary kind can be stored into an external binary file via                              R> save(Forbes2000, file = \"Forbes2000.rda\")                              where the extension .rda is standard. We can get the file names of all files                              with extension .rda from the working directory                              R> list.files(pattern = \"\\.rda\")                              [1] \"Forbes2000.rda\"                              and we can load the contents of the file into R by                              R> load(\"Forbes2000.rda\")                              1.6 Basic Data Manipulation                              The examples shown in the previous section have illustrated the importance of                              data.frames for storing and handling tabular data in R. Internally, a data.frame                              is a list of vectors of a common length n, the number of rows of the table. Each                              of those vectors represents the measurements of one variable and we have seen                              that we can access such a variable by its name, for example the names of the                              companies                              R> companies <- Forbes2000[,\"name\"]                              Of course, the companies vector is of class character and of length 2000. A                              subset of the elements of the vector companies can be extracted using the []                              subset operator. For example, the largest of the 2000 companies listed in the                              Forbes 2000 list is                              R> companies[1]                              [1] \"Citigroup\"                              and the top three companies can be extracted utilising an integer vector of                              the numbers one to three:                              R> 1:3                              © 2010 by Taylor and Francis Group, LLC
12                                      AN INTRODUCTION TO R                             [1] 1 2 3                             R> companies[1:3]                             [1] \"Citigroup\"             \"General Electric\"                             [3] \"American Intl Group\"                             In contrast to indexing with positive integers, negative indexing returns all                             elements that are not part of the index vector given in brackets. For example,                             all companies except those with numbers four to two-thousand, i.e., the top                             three companies, are again                             R> companies[-(4:2000)]                             [1] \"Citigroup\"             \"General Electric\"                             [3] \"American Intl Group\"                               The complete information about the top three companies can be printed in                             a similar way. Because data.frames have a concept of rows and columns, we                             need to separate the subsets corresponding to rows and columns by a comma.                             The statement                             R> Forbes2000[1:3, c(\"name\", \"sales\", \"profits\", \"assets\")]                                               name   sales profits   assets                             1            Citigroup   94.71   17.85 1264.03                             2    General Electric 134.19     15.59   626.93                             3 American Intl Group    76.66    6.46   647.66                             extracts the variables name, sales, profits and assets for the three largest                             companies. Alternatively, a single variable can be extracted from a data.frame                             by                             R> companies <- Forbes2000$name                             which is equivalent to the previously shown statement                             R> companies <- Forbes2000[,\"name\"]                               We might be interested in extracting the largest companies with respect                             to an alternative ordering. The three top selling companies can be computed                             along the following lines. First, we need to compute the ordering of the com-                             panies’ sales                             R> order_sales <- order(Forbes2000$sales)                             which returns the indices of the ordered elements of the numeric vector sales.                             Consequently the three companies with the lowest sales are                             R> companies[order_sales[1:3]]                             [1] \"Custodia Holding\"         \"Central European Media\"                             [3] \"Minara Resources\"                             The indices of the three top sellers are the elements 1998, 1999 and 2000 of                             the integer vector order_sales                             R> Forbes2000[order_sales[c(2000, 1999, 1998)],                             +              c(\"name\", \"sales\", \"profits\", \"assets\")]                             © 2010 by Taylor and Francis Group, LLC
BASIC DATA MANIPULATION                                          13                                             name  sales profits assets                              10 Wal-Mart Stores 256.33      9.05 104.91                              5                BP 232.57    10.27 177.57                              4       ExxonMobil 222.88     20.96 166.99                                Another way of selecting vector elements is the use of a logical vector being                              TRUE when the corresponding element is to be selected and FALSE otherwise.                              The companies with assets of more than 1000 billion US dollars are                              R> Forbes2000[Forbes2000$assets > 1000,                              +             c(\"name\", \"sales\", \"profits\", \"assets\")]                                               name sales profits   assets                              1           Citigroup 94.71    17.85 1264.03                              9          Fannie Mae 53.13     6.48 1019.17                              403 Mizuho Financial 24.40    -20.11 1115.90                              where the expression Forbes2000$assets > 1000 indicates a logical vector                              of length 2000 with                              R> table(Forbes2000$assets > 1000)                              FALSE  TRUE                               1997      3                              elements being either FALSE or TRUE. In fact, for some of the companies the                              measurement of the profits variable are missing. In R, missing values are                              treated by a special symbol, NA, indicating that this measurement is not avail-                              able. The observations with profit information missing can be obtained via                              R> na_profits <- is.na(Forbes2000$profits)                              R> table(na_profits)                              na_profits                              FALSE  TRUE                               1995      5                              R> Forbes2000[na_profits,                              +             c(\"name\", \"sales\", \"profits\", \"assets\")]                                                     name sales profits assets                              772                     AMP   5.40       NA  42.94                              1085                    HHG   5.68       NA  51.65                              1091                    NTL   3.50       NA  10.59                              1425       US Airways Group   5.50       NA   8.58                              1909 Laidlaw International    4.48       NA   3.98                              where the function is.na returns a logical vector being TRUE when the corre-                              sponding element of the supplied vector is NA. A more comfortable approach                              is available when we want to remove all observations with at least one miss-                              ing value from a data.frame object. The function complete.cases takes a                              data.frame and returns a logical vector being TRUE when the corresponding                              observation does not contain any missing value:                              R> table(complete.cases(Forbes2000))                              © 2010 by Taylor and Francis Group, LLC
14                                      AN INTRODUCTION TO R                             FALSE   TRUE                                 5   1995                               Subsetting data.frames driven by logical expressions may induce a lot of                             typing which can be avoided. The subset function takes a data.frame as first                             argument and a logical expression as second argument. For example, we can                             select a subset of the Forbes 2000 list consisting of all companies situated in                             the United Kingdom by                             R> UKcomp <- subset(Forbes2000, country == \"United Kingdom\")                             R> dim(UKcomp)                             [1] 137    8                             i.e., 137 of the 2000 companies are from the UK. Note that it is not neces-                             sary to extract the variable country from the data.frame Forbes2000 when                             formulating the logical expression with subset.                             1.7 Computing with Data                             1.7.1 Simple Summary Statistics                             Two functions are helpful for getting an overview about R objects: str and                             summary, where str is more detailed about data types and summary gives a                             collection of sensible summary statistics. For example, applying the summary                             method to the Forbes2000 data set,                             R> summary(Forbes2000)                             results in the following output                                   rank              name                       country                              Min.    :   1.0   Length:2000          United States :751                              1st Qu.: 500.8    Class :character     Japan          :316                              Median :1000.5    Mode   :character    United Kingdom:137                              Mean    :1000.5                        Germany        : 65                              3rd Qu.:1500.2                         France         : 63                              Max.    :2000.0                        Canada         : 56                                                                     (Other)        :612                                                 category          sales                              Banking                 : 313   Min.    :  0.010                              Diversified financials: 158     1st Qu.:   2.018                              Insurance               : 112   Median :   4.365                              Utilities               : 110   Mean    :  9.697                              Materials               :  97   3rd Qu.:   9.547                              Oil & gas operations    :  90   Max.    :256.330                              (Other)                 :1120                                 profits               assets           marketvalue                              Min.    :-25.8300    Min.   :   0.270    Min.    :  0.02                              1st Qu.:   0.0800    1st Qu.:   4.025    1st Qu.:   2.72                              Median :   0.2000    Median :   9.345    Median :   5.15                              Mean    :  0.3811    Mean   :  34.042    Mean    : 11.88                              3rd Qu.:   0.4400    3rd Qu.:  22.793    3rd Qu.: 10.60                             © 2010 by Taylor and Francis Group, LLC
COMPUTING WITH DATA                                              15                               Max.   : 20.9600    Max.    :1264.030    Max.   :328.54                               NA's   :   5.0000                              From this output we can immediately see that most of the companies are                              situated in the US and that most of the companies are working in the banking                              sector as well as that negative profits, or losses, up to 26 billion US dollars                              occur.                                Internally, summary is a so-called generic function with methods for a multi-                              tude of classes, i.e., summary can be applied to objects of different classes and                              will report sensible results. Here, we supply a data.frame object to summary                              where it is natural to apply summary to each of the variables in this data.frame.                              Because a data.frame is a list with each variable being an element of that list,                              the same effect can be achieved by                              R> lapply(Forbes2000, summary)                                The members of the apply family help to solve recurring tasks for each                              element of a data.frame, matrix, list or for each level of a factor. It might be                              interesting to compare the profits in each of the 27 categories. To do so, we                              first compute the median profit for each category from                              R> mprofits <- tapply(Forbes2000$profits,                              +                      Forbes2000$category, median, na.rm = TRUE)                              a command that should be read as follows. For each level of the factor cat-                              egory, determine the corresponding elements of the numeric vector profits                              and supply them to the median function with additional argument na.rm =                              TRUE. The latter one is necessary because profits contains missing values                              which would lead to a non-sensible result of the median function                              R> median(Forbes2000$profits)                              [1] NA                              The three categories with highest median profit are computed from the vector                              of sorted median profits                              R> rev(sort(mprofits))[1:3]                                       Oil & gas operations            Drugs & biotechnology                                                         0.35                            0.35                              Household & personal products                                                         0.31                              where rev rearranges the vector of median profits sorted from smallest to                              largest. Of course, we can replace the median function with mean or whatever                              is appropriate in the call to tapply. In our situation, mean is not a good choice,                              because the distributions of profits or sales are naturally skewed. Simple graph-                              ical tools for the inspection of the empirical distributions are introduced later                              on and in Chapter 2.                              1.7.2 Customising Analyses                              In the preceding sections we have done quite complex analyses on our data                              using functions available from R. However, the real power of the system comes                              © 2010 by Taylor and Francis Group, LLC
16                                      AN INTRODUCTION TO R                             to light when writing our own functions for our own analysis tasks. Although                             R is a full-featured programming language, writing small helper functions for                             our daily work is not too complicated. We’ll study two example cases.                               At first, we want to add a robust measure of variability to the location                             measures computed in the previous subsection. In addition to the median                             profit, computed via                             R> median(Forbes2000$profits, na.rm = TRUE)                             [1] 0.2                             we want to compute the inter-quartile range, i.e., the difference between                             the 3rd and 1st quartile. Although a quick search in the manual pages (via                             help(\"interquartile\")) brings function IQR to our attention, we will ap-                             proach this task without making use of this tool, but using function quantile                             for computing sample quantiles only.                               A function in R is nothing but an object, and all objects are created equal.                             Thus, we ‘just’ have to assign a function object to a variable. A function                             object consists of an argument list, defining arguments and possibly default                             values, and a body defining the computations. The body starts and ends with                             braces. Of course, the body is assumed to be valid R code. In most cases we                             expect a function to return an object, therefore, the body will contain one or                             more return statements the arguments of which define the return values.                               Returning to our example, we’ll name our function iqr. The iqr function                             should operate on numeric vectors, therefore it should have an argument x.                             This numeric vector will be passed on to the quantile function for computing                                                                                       st                             the sample quartiles. The required difference between the 3 rd  and 1 quartile                             can then be computed using diff. The definition of our function reads as                             follows                             R> iqr <- function(x) {                             +      q <- quantile(x, prob = c(0.25, 0.75), names = FALSE)                             +      return(diff(q))                             +  }                             A simple test on simulated data from a standard normal distribution shows                             that our first function actually works, a comparison with the IQR function                             shows that the result is correct:                             R> xdata <- rnorm(100)                             R> iqr(xdata)                             [1] 1.495980                             R> IQR(xdata)                             [1] 1.495980                             However, when the numeric vector contains missing values, our function fails                             as the following example shows:                             R> xdata[1] <- NA                             R> iqr(xdata)                             © 2010 by Taylor and Francis Group, LLC
COMPUTING WITH DATA                                              17                              Error in quantile.default(x, prob = c(0.25, 0.75)):                                missing values and NaN's not allowed if 'na.rm' is FALSE                                In order to make our little function more flexible it would be helpful to                              add all arguments of quantile to the argument list of iqr. The copy-and-                              paste approach that first comes to mind is likely to lead to inconsistencies                              and errors, for example when the argument list of quantile changes. Instead,                              the dot argument, a wildcard for any argument, is more appropriate and we                              redefine our function accordingly:                              R> iqr <- function(x, ...) {                              +      q <- quantile(x, prob = c(0.25, 0.75), names = FALSE,                              +                     ...)                              +      return(diff(q))                              +  }                              R> iqr(xdata, na.rm = TRUE)                              [1] 1.503438                              R> IQR(xdata, na.rm = TRUE)                              [1] 1.503438                                Now, we can assess the variability of the profits using our new iqr tool:                              R> iqr(Forbes2000$profits, na.rm = TRUE)                              [1] 0.36                              Since there is no difference between functions that have been written by one of                              the R developers and user-created functions, we can compute the inter-quartile                              range of profits for each of the business categories by using our iqr function                              inside a tapply statement;                              R> iqr_profits <- tapply(Forbes2000$profits,                              +                         Forbes2000$category, iqr, na.rm = TRUE)                              and extract the categories with the smallest and greatest variability                              R> levels(Forbes2000$category)[which.min(iqr_profits)]                              [1] \"Hotels restaurants & leisure\"                              R> levels(Forbes2000$category)[which.max(iqr_profits)]                              [1] \"Drugs & biotechnology\"                              We observe less variable profits in tourism enterprises compared with profits                              in the pharmaceutical industry.                                As other members of the apply family, tapply is very helpful when the same                              task is to be done more than one time. Moreover, its use is more convenient                              compared to the usage of for loops. For the sake of completeness, we will                              compute the category-wise inter-quartile range of the profits using a for loop.                                Like a function, a for loop consists of a body, i.e., a chain of R commands                              to be executed. In addition, we need a set of values and a variable that iterates                              over this set. Here, the set we are interested in is the business categories:                              © 2010 by Taylor and Francis Group, LLC
18                                      AN INTRODUCTION TO R                             R> bcat <- Forbes2000$category                             R> iqr_profits2 <- numeric(nlevels(bcat))                             R> names(iqr_profits2) <- levels(bcat)                             R> for (cat in levels(bcat)) {                             +      catprofit <- subset(Forbes2000, category == cat)$profit                             +      this_iqr <- iqr(catprofit, na.rm = TRUE)                             +      iqr_profits2[levels(bcat) == cat] <- this_iqr                             +  }                             Compared to the usage of tapply, the above code is rather complicated. At                             first, we have to set up a vector for storing the results and assign the appro-                             priate names to it. Next, inside the body of the for loop, the iqr function has                             to be called on the appropriate subset of all companies of the current business                             category cat. The corresponding inter-quartile range must then be assigned                             to the correct vector element in the result vector. Luckily, such complicated                             constructs will be used in only one of the remaining chapters of the book and                             are almost always avoidable in practical data analyses.                             1.7.3 Simple Graphics                             The degree of skewness of a distribution can be investigated by constructing                             histograms using the hist function. (More sophisticated alternatives such as                             smooth density estimates will be considered in Chapter 8.) For example, the                             code for producing Figure 1.1 first divides the plot region into two equally                             spaced rows (the layout function) and then plots the histograms of the raw                             market values in the upper part using the hist function. The lower part of                             the figure depicts the histogram for the log transformed market values which                             appear to be more symmetric.                               Bivariate relationships of two continuous variables are usually depicted as                             scatterplots. In R, regression relationships are specified by so-called model                             formulae which, in a simple bivariate case, may look like                             R> fm <- marketvalue ~ sales                             R> class(fm)                             [1] \"formula\"                             with the dependent variable on the left hand side and the independent variable                             on the right hand side. The tilde separates left and right hand sides. Such a                             model formula can be passed to a model function (for example to the linear                             model function as explained in Chapter 6). The plot generic function imple-                             ments a formula method as well. Because the distributions of both market                             value and sales are skewed we choose to depict their logarithms. A raw scat-                             terplot of 2000 data points (Figure 1.2) is rather uninformative due to areas                             with very high density. This problem can be avoided by choosing a transparent                             color for the dots as shown in Figure 1.3.                               If the independent variable is a factor, a boxplot representation is a natural                             choice. For four selected countries, the distributions of the logarithms of the                             © 2010 by Taylor and Francis Group, LLC
COMPUTING WITH DATA                                              19                              R> layout(matrix(1:2, nrow = 2))                              R> hist(Forbes2000$marketvalue)                              R> hist(log(Forbes2000$marketvalue))                                              Histogram of Forbes2000$marketvalue                                 Frequency  1000                                      0                                          0     50    100    150    200    250    300    350                                                       Forbes2000$marketvalue                                            Histogram of log(Forbes2000$marketvalue)                                      800                                 Frequency  400                                      0                                         −4       −2        0         2        4        6                                                      log(Forbes2000$marketvalue)                              Figure 1.1  Histograms of the market value and the logarithm of the market value                                         for the companies contained in the Forbes 2000 list.                              market value may be visually compared in Figure 1.4. Prior to calling the                              plot function on our data, we have to remove empty levels from the country                              variable, because otherwise the x-axis would show all and not only the selected                              countries. This task is most easily performed by subsetting the corresponding                              factor with additional argument drop = TRUE. Here, the width of the boxes                              are proportional to the square root of the number of companies for each coun-                              try and extremely large or small market values are depicted by single points.                              More elaborate graphical methods will be discussed in Chapter 2.                              © 2010 by Taylor and Francis Group, LLC
20                                      AN INTRODUCTION TO R                             R> plot(log(marketvalue) ~ log(sales), data = Forbes2000,                             +       pch = \".\")                                     6                                     4                                 log(marketvalue)  2  0                                     −2                                     −4                                           −4       −2        0        2        4                                                             log(sales)                                Figure 1.2  Raw scatterplot of the logarithms of market value and sales.                             1.8 Organising an Analysis                             Although it is possible to perform an analysis typing all commands directly                             on the R prompt it is much more comfortable to maintain a separate text file                             collecting all steps necessary to perform a certain data analysis task. Such an                             R transcript file, for example called analysis.R created with your favourite                             text editor, can be sourced into R using the source command                             R> source(\"analysis.R\", echo = TRUE)                             When all steps of a data analysis, i.e., data preprocessing, transformations,                             simple summary statistics and plots, model building and inference as well                             as reporting, are collected in such an R transcript file, the analysis can be                             © 2010 by Taylor and Francis Group, LLC
© 2010 by Taylor and Francis Group, LLC
22                                      AN INTRODUCTION TO R                             R> tmp <- subset(Forbes2000,                             +      country %in% c(\"United Kingdom\", \"Germany\",                             +                      \"India\", \"Turkey\"))                             R> tmp$country <- tmp$country[,drop = TRUE]                             R> plot(log(marketvalue) ~ country, data = tmp,                             +       ylab = \"log(marketvalue)\", varwidth = TRUE)                                     4                                 log(marketvalue)  2                                     0                                     −2                                           Germany       India      Turkey  United Kingdom                                                              country                             Figure 1.4  Boxplots of the logarithms of the market value for four selected coun-                                        tries, the width of the boxes is proportional to the square roots of the                                        number of companies.                             © 2010 by Taylor and Francis Group, LLC
SUMMARY                                                          23                              examples of these functions and those that produce more interesting graphics                              in later chapters.                              Exercises                               Ex. 1.1 Calculate the median profit for the companies in the US and the                                median profit for the companies in the UK, France and Germany.                               Ex. 1.2 Find all German companies with negative profit.                               Ex. 1.3 To which business category do most of the Bermuda island companies                                belong?                               Ex. 1.4 For the 50 companies in the Forbes data set with the highest profits,                                plot sales against assets (or some suitable transformation of each variable),                                labelling each point with the appropriate country name which may need                                to be abbreviated (using abbreviate) to avoid making the plot look too                                ‘messy’.                               Ex. 1.5 Find the average value of sales for the companies in each country                                in the Forbes data set, and find the number of companies in each country                                with profits above 5 billion US dollars.                              © 2010 by Taylor and Francis Group, LLC
CHAPTER 2                                     Data Analysis Using Graphical                               Displays: Malignant Melanoma in the                                      USA and Chinese Health and                                                     Family Life                              2.1 Introduction                              Fisher and Belle (1993) report mortality rates due to malignant melanoma                              of the skin for white males during the period 1950–1969, for each state on                              the US mainland. The data are given in Table 2.1 and include the number of                              deaths due to malignant melanoma in the corresponding state, the longitude                              and latitude of the geographic centre of each state, and a binary variable                              indicating contiguity to an ocean, that is, if the state borders one of the                              oceans. Questions of interest about these data include: how do the mortality                              rates compare for ocean and non-ocean states? and how are mortality rates                              affected by latitude and longitude?                                  Table 2.1: USmelanoma data. USA mortality rates for white                                              males due to malignant melanoma.                                                      mortality  latitude   longitude   ocean                                 Alabama                    219       33.0        87.0     yes                                 Arizona                    160       34.5       112.0     no                                 Arkansas                   170       35.0        92.5     no                                 California                 182       37.5       119.5     yes                                 Colorado                   149       39.0       105.5     no                                 Connecticut                159       41.8        72.8     yes                                 Delaware                   200       39.0        75.5     yes                                 District of Columbia       177       39.0        77.0     no                                 Florida                    197       28.0        82.0     yes                                 Georgia                    214       33.0        83.5     yes                                 Idaho                      116       44.5       114.0     no                                 Illinois                   124       40.0        89.5     no                                 Indiana                    128       40.2        86.2     no                                 Iowa                       128       42.2        93.8     no                                 Kansas                     166       38.5        98.5     no                                 Kentucky                   147       37.8        85.0     no                                 Louisiana                  190       31.2        91.8     yes                                                              25                              © 2010 by Taylor and Francis Group, LLC
26                  DATA ANALYSIS USING GRAPHICAL DISPLAYS                                           Table 2.1: USmelanoma data (continued).                                                     mortality   latitude  longitude   ocean                                 Maine                     117        45.2       69.0     yes                                 Maryland                  162        39.0       76.5     yes                                 Massachusetts             143        42.2       71.8     yes                                 Michigan                  117        43.5       84.5     no                                 Minnesota                 116        46.0       94.5     no                                 Mississippi               207        32.8       90.0     yes                                 Missouri                  131        38.5       92.0     no                                 Montana                   109        47.0      110.5     no                                 Nebraska                  122        41.5       99.5     no                                 Nevada                    191        39.0      117.0     no                                 New Hampshire             129        43.8       71.5     yes                                 New Jersey                159        40.2       74.5     yes                                 New Mexico                141        35.0      106.0     no                                 New York                  152        43.0       75.5     yes                                 North Carolina            199        35.5       79.5     yes                                 North Dakota              115        47.5      100.5     no                                 Ohio                      131        40.2       82.8     no                                 Oklahoma                  182        35.5       97.2     no                                 Oregon                    136        44.0      120.5     yes                                 Pennsylvania              132        40.8       77.8     no                                 Rhode Island              137        41.8       71.5     yes                                 South Carolina            178        33.8       81.0     yes                                 South Dakota               86        44.8      100.0     no                                 Tennessee                 186        36.0       86.2     no                                 Texas                     229        31.5       98.0     yes                                 Utah                      142        39.5      111.5     no                                 Vermont                   153        44.0       72.5     yes                                 Virginia                  166        37.5       78.5     yes                                 Washington                117        47.5      121.0     yes                                 West Virginia             136        38.8       80.8     no                                 Wisconsin                 110        44.5       90.2     no                                 Wyoming                   134        43.0      107.5     no                             Source: From Fisher, L. D., and Belle, G. V., Biostatistics. A Methodology                             for the Health Sciences, John Wiley & Sons, Chichester, UK, 1993. With                             permission.                               Contemporary China is on the leading edge of a sexual revolution, with                             tremendous regional and generational differences that provide unparalleled                             natural experiments for analysis of the antecedents and outcomes of sexual                             behaviour. The Chinese Health and Family Life Study, conducted 1999–2000                             as a collaborative research project of the Universities of Chicago, Beijing, and                             © 2010 by Taylor and Francis Group, LLC
INITIAL DATA ANALYSIS                                            27                              North Carolina, provides a baseline from which to anticipate and track future                              changes. Specifically, this study produces a baseline set of results on sexual                              behaviour and disease patterns, using a nationally representative probability                              sample. The Chinese Health and Family Life Survey sampled 60 villages and                              urban neighbourhoods chosen in such a way as to represent the full geographi-                              cal and socioeconomic range of contemporary China excluding Hong Kong and                              Tibet. Eighty-three individuals were chosen at random for each location from                              official registers of adults aged between 20 and 64 years to target a sample of                              5000 individuals in total. Here, we restrict our attention to women with cur-                              rent male partners for whom no information was missing, leading to a sample                              of 1534 women with the following variables (see Table 2.2 for example data                              sets):                               R_edu: level of education of the responding woman,                               R_income: monthly income (in yuan) of the responding woman,                               R_health: health status of the responding woman in the last year,                               R_happy: how happy was the responding woman in the last year,                               A_edu: level of education of the woman’s partner,                               A_income: monthly income (in yuan) of the woman’s partner.                              In the list above the income variables are continuous and the remaining vari-                              ables are categorical with ordered categories. The income variables are based                              on (partially) imputed measures. All information, including the partner’s in-                              come, are derived from a questionnaire answered by the responding woman                              only. Here, we focus on graphical displays for inspecting the relationship of                              these health and socioeconomic variables of heterosexual women and their                              partners.                              2.2 Initial Data Analysis                              According to Chambers et al. (1983), “there is no statistical tool that is as                              powerful as a well chosen graph”. Certainly, the analysis of most (probably                              all) data sets should begin with an initial attempt to understand the general                              characteristics of the data by graphing them in some hopefully useful and in-                              formative manner. The possible advantages of graphical presentation methods                              are summarised by Schmid (1954); they include the following                              • In comparison with other types of presentation, well-designed charts are                                more effective in creating interest and in appealing to the attention of the                                reader.                              • Visual relationships as portrayed by charts and graphs are more easily                                grasped and more easily remembered.                              • The use of charts and graphs saves time, since the essential meaning of                                large measures of statistical data can be visualised at a glance.                              • Charts and graphs provide a comprehensive picture of a problem that makes                              © 2010 by Taylor and Francis Group, LLC
28                  DATA ANALYSIS USING GRAPHICAL DISPLAYS                                       A_income  500  800  700  700  400  900  300  800  200  600  200  400  500  200  800  500  500  600  0  200  .  .  .                                       A_edu  school  school  school  school  school  college  school  school  college  school  school  school  school  school  college  University  school  school  school  school  .  .  .                                  Survey.  high  high  high  Elementary  high  Junior  high  high  Junior  high  high  high  high  high  Junior  high  high  high  high                                  Life   Senior  Senior  Junior  Junior  Junior  Senior  Senior  Junior  Junior  Junior  Senior  Senior  Junior  Junior  Junior                                  Family  R_happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  happy  .  .  .                                  and                   Very  too  too  too       too  Very                                  Health  Somewhat  Somewhat  Somewhat  Somewhat  Somewhat  Somewhat  Not  Not  Somewhat  Not  Somewhat  Somewhat  Somewhat  Somewhat  Somewhat  Somewhat  Not  Somewhat                                  Chinese  Good  Fair  Good  Fair  Fair  good  Good  Fair  Good  good  Fair  Good  Fair  Fair  good  good  .  .  .                                  data.  R_health    Excellent  Not  Not  Excellent  Excellent  Not  Excellent  Not                                  CHFLS  R_income  900  500  800  300  300  500  0  100  200  400  300  0  200  300  3000  0  500  0  0  500  .  .  .                                  2.2:                                  Table  R_edu  school  school  school  school  school  school  school  school  school  school  school  school  school  school  college  college  school  school  school  school  .  .  .                                         high  high  high  high  high  high  high  high  high  high  high  high  high  high  Junior  Junior  high  high  high  high                                         Senior  Senior  Senior  Junior  Junior  Senior  Junior  Junior  Junior  Senior  Junior  Junior  Junior  Senior  Senior  Junior  Senior  Junior  © 2010 by Taylor and Francis Group, LLC                                         2  3  10  11  22  23  24  25  26  32  33  35  36  37  38  39  40  41  55  56  57
ANALYSIS USING R                                                 29                                for a more complete and better balanced understanding than could be de-                                rived from tabular or textual forms of presentation.                              • Charts and graphs can bring out hidden facts and relationships and can                                stimulate, as well as aid, analytical thinking and investigation.                              Graphs are very popular; it has been estimated that between 900 billion (9 ×                                11                                                     12                              10 ) and 2 trillion (2 × 10 ) images of statistical graphics are printed each                              year. Perhaps one of the main reasons for such popularity is that graphical                              presentation of data often provides the vehicle for discovering the unexpected;                              the human visual system is very powerful in detecting patterns, although the                              following caveat from the late Carl Sagan (in his book Contact) should be                              kept in mind:                                Humans are good at discerning subtle patterns that are really there, but equally                                so at imagining them when they are altogether absent.                              During the last two decades a wide variety of new methods for displaying data                              graphically have been developed; these will hunt for special effects in data,                              indicate outliers, identify patterns, diagnose models and generally search for                              novel and perhaps unexpected phenomena. Large numbers of graphs may be                              required and computers are generally needed to supply them for the same                              reasons they are used for numerical analyses, namely that they are fast and                              they are accurate.                                So, because the machine is doing the work the question is no longer“shall we                              plot?” but rather “what shall we plot?” There are many exciting possibilities                              including dynamic graphics but graphical exploration of data usually begins,                              at least, with some simpler, well-known methods, for example, histograms,                              barcharts, boxplots and scatterplots. Each of these will be illustrated in this                              chapter along with more complex methods such as spinograms and trellis plots.                              2.3 Analysis Using R                              2.3.1 Malignant Melanoma                              We might begin to examine the malignant melanoma data in Table 2.1 by con-                              structing a histogram or boxplot for all the mortality rates in Figure 2.1. The                              plot, hist and boxplot functions have already been introduced in Chapter 1                              and we want to produce a plot where both techniques are applied at once.                              The layout function organises two independent plots on one plotting device,                              for example on top of each other. Using this relatively simple technique (more                              advanced methods will be introduced later) we have to make sure that the                              x-axis is the same in both graphs. This can be done by computing a plausible                              range of the data, later to be specified in a plot via the xlim argument:                              R> xr <- range(USmelanoma$mortality) * c(0.9, 1.1)                              R> xr                              [1]  77.4 251.9                              Now, plotting both the histogram and the boxplot requires setting up the                              plotting device with equal space for two independent plots on top of each other.                              © 2010 by Taylor and Francis Group, LLC
                                
                                
                                Search
                            
                            Read the Text Version
- 1
 - 2
 - 3
 - 4
 - 5
 - 6
 - 7
 - 8
 - 9
 - 10
 - 11
 - 12
 - 13
 - 14
 - 15
 - 16
 - 17
 - 18
 - 19
 - 20
 - 21
 - 22
 - 23
 - 24
 - 25
 - 26
 - 27
 - 28
 - 29
 - 30
 - 31
 - 32
 - 33
 - 34
 - 35
 - 36
 - 37
 - 38
 - 39
 - 40
 - 41
 - 42
 - 43
 - 44
 - 45
 - 46
 - 47
 - 48
 - 49
 - 50
 - 51
 - 52
 - 53
 - 54
 - 55
 - 56
 - 57
 - 58
 - 59
 - 60
 - 61
 - 62
 - 63
 - 64
 - 65
 - 66
 - 67
 - 68
 - 69
 - 70
 - 71
 - 72
 - 73
 - 74
 - 75
 - 76
 - 77
 - 78
 - 79
 - 80
 - 81
 - 82
 - 83
 - 84
 - 85
 - 86
 - 87
 - 88
 - 89
 - 90
 - 91
 - 92
 - 93
 - 94
 - 95
 - 96
 - 97
 - 98
 - 99
 - 100
 - 101
 - 102
 - 103
 - 104
 - 105
 - 106
 - 107
 - 108
 - 109
 - 110
 - 111
 - 112
 - 113
 - 114
 - 115
 - 116
 - 117
 - 118
 - 119
 - 120
 - 121
 - 122
 - 123
 - 124
 - 125
 - 126
 - 127
 - 128
 - 129
 - 130
 - 131
 - 132
 - 133
 - 134
 - 135
 - 136
 - 137
 - 138
 - 139
 - 140
 - 141
 - 142
 - 143
 - 144
 - 145
 - 146
 - 147
 - 148
 - 149
 - 150
 - 151
 - 152
 - 153
 - 154
 - 155
 - 156
 - 157
 - 158
 - 159
 - 160
 - 161
 - 162
 - 163
 - 164
 - 165
 - 166
 - 167
 - 168
 - 169
 - 170
 - 171
 - 172
 - 173
 - 174
 - 175
 - 176
 - 177
 - 178
 - 179
 - 180
 - 181
 - 182
 - 183
 - 184
 - 185
 - 186
 - 187
 - 188
 - 189
 - 190
 - 191
 - 192
 - 193
 - 194
 - 195
 - 196
 - 197
 - 198
 - 199
 - 200
 - 201
 - 202
 - 203
 - 204
 - 205
 - 206
 - 207
 - 208
 - 209
 - 210
 - 211
 - 212
 - 213
 - 214
 - 215
 - 216
 - 217
 - 218
 - 219
 - 220
 - 221
 - 222
 - 223
 - 224
 - 225
 - 226
 - 227
 - 228
 - 229
 - 230
 - 231
 - 232
 - 233
 - 234
 - 235
 - 236
 - 237
 - 238
 - 239
 - 240
 - 241
 - 242
 - 243
 - 244
 - 245
 - 246
 - 247
 - 248
 - 249
 - 250
 - 251
 - 252
 - 253
 - 254
 - 255
 - 256
 - 257
 - 258
 - 259
 - 260
 - 261
 - 262
 - 263
 - 264
 - 265
 - 266
 - 267
 - 268
 - 269
 - 270
 - 271
 - 272
 - 273
 - 274
 - 275
 - 276
 - 277
 - 278
 - 279
 - 280
 - 281
 - 282
 - 283
 - 284
 - 285
 - 286
 - 287
 - 288
 - 289
 - 290
 - 291
 - 292
 - 293
 - 294
 - 295
 - 296
 - 297
 - 298
 - 299
 - 300
 - 301
 - 302
 - 303
 - 304
 - 305
 - 306
 - 307
 - 308
 - 309
 - 310
 - 311
 - 312
 - 313
 - 314
 - 315
 - 316
 - 317
 - 318
 - 319
 - 320
 - 321
 - 322
 - 323
 - 324
 - 325
 - 326
 - 327
 - 328
 - 329
 - 330
 - 331
 - 332
 - 333
 - 334
 - 335
 - 336
 - 337
 - 338
 - 339
 - 340
 - 341
 - 342
 - 343
 - 344
 - 345
 - 346
 - 347
 - 348
 - 349
 - 350
 - 351
 - 352
 - 353
 - 354
 - 355
 - 356
 - 357
 - 358
 - 359
 - 360
 - 361