Home Explore Statistical analysis with R

Statistical analysis with R

Published by andiny.clock, 2014-07-25 10:34:09

Description: Like the first edition this book is intended as a guide to data analysis with
the R system for statistical computing. New chapters on graphical displays,
generalised additive models and simultaneous inference have been added to
this second edition and a section on generalised linear mixed models completes
the chapter that discusses the analysis of longitudinal data where the response
variable does not have a normal distribution. In addition, new examples and
additional exercises have been added to several chapters. We have also taken
the opportunity to correct a number of errors that were present in the first
edition. Most of these errors were kindly pointed out to us by a variety of people to whom we are very grateful, especially Guido Schwarzer, Mike Cheung,
T obias V erbeke, Yihui Xie, Lothar H¨ aberle, and Radoslav Harman.
We learnt that many instructors use our book successfully for introductory
courses in applied statistics. We have had the pleasure to give some courses
based on

Read the Text Version

Pages:

A Handbook of Statistical Analyses Using n SECOND EDITION © 2010 by Taylor and Francis Group, LLC

A Handbook of Statistical Analyses Using SECOND EDITION Brian S. Everitt and Ibrsten Hothorn CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor & Francis Croup, an informa business A CHAPMAN & HALL BOOK © 2010 by Taylor and Francis Group, LLC

Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2010 by Taylor and Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1 International Standard Book Number: 978-1-4200-7933-3 (Paperback) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit- ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright. com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Library of Congress Cataloging-in-Publication Data Everitt, Brian. A handbook of statistical analyses using R / Brian S. Everitt and Torsten Hothorn. -- 2nd ed. p. cm. Includes bibliographical references and index. ISBN 978-1-4200-7933-3 (pbk. : alk. paper) 1. Mathematical statistics--Data processing--Handbooks, manuals, etc. 2. R (Computer program language)--Handbooks, manuals, etc. I. Hothorn, Torsten. II. Title. QA276.45.R3E94 2010 519.50285’5133--dc22 2009018062 Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com © 2010 by Taylor and Francis Group, LLC

Dedication To our wives, Mary-Elizabeth and Carolin, for their constant support and encouragement © 2010 by Taylor and Francis Group, LLC

Preface to Second Edition Like the ﬁrst edition this book is intended as a guide to data analysis with the R system for statistical computing. New chapters on graphical displays, generalised additive models and simultaneous inference have been added to this second edition and a section on generalised linear mixed models completes the chapter that discusses the analysis of longitudinal data where the response variable does not have a normal distribution. In addition, new examples and additional exercises have been added to several chapters. We have also taken the opportunity to correct a number of errors that were present in the ﬁrst edition. Most of these errors were kindly pointed out to us by a variety of peo- ple to whom we are very grateful, especially Guido Schwarzer, Mike Cheung, Tobias Verbeke, Yihui Xie, Lothar H¨ aberle, and Radoslav Harman. We learnt that many instructors use our book successfully for introductory courses in applied statistics. We have had the pleasure to give some courses based on the ﬁrst edition of the book ourselves and we are happy to share slides covering many sections of particular chapters with our readers. L A T X E sources and PDF versions of slides covering several chapters are available from the second author upon request. A new version of the HSAUR package, now called HSAUR2 for obvious reasons, is available from CRAN. Basically the package vignettes have been updated to cover the new and modiﬁed material as well. Otherwise, the tech- nical infrastructure remains as described in the preface to the ﬁrst edition, with two small exceptions: names of R add-on packages are now printed in bold font and we refrain from showing signiﬁcance stars in model summaries. Lastly we would like to thank Thomas Kneib and Achim Zeileis for com- menting on the newly added material and again the CRC Press staﬀ, in par- ticular Rob Calver, for their support during the preparation of this second edition. Brian S. Everitt and Torsten Hothorn London and M¨ unchen, April 2009 © 2010 by Taylor and Francis Group, LLC

Preface to First Edition This book is intended as a guide to data analysis with the R system for sta- tistical computing. R is an environment incorporating an implementation of the S programming language, which is powerful and ﬂexible and has excellent graphical facilities (R Development Core Team, 2009b). In the Handbook we aim to give relatively brief and straightforward descriptions of how to conduct a range of statistical analyses using R. Each chapter deals with the analy- sis appropriate for one or several data sets. A brief account of the relevant statistical background is included in each chapter along with appropriate ref- erences, but our prime focus is on how to use R and how to interpret results. We hope the book will provide students and researchers in many disciplines with a self-contained means of using R to analyse their data. R is an open-source project developed by dozens of volunteers for more than ten years now and is available from the Internet under the General Public Li- cence. R has become the lingua franca of statistical computing. Increasingly, implementations of new statistical methodology ﬁrst appear as R add-on pack- ages. In some communities, such as in bioinformatics, R already is the primary workhorse for statistical analyses. Because the sources of the R system are open and available to everyone without restrictions and because of its powerful lan- guage and graphical capabilities, R has started to become the main computing engine for reproducible statistical research (Leisch, 2002a,b, 2003, Leisch and Rossini, 2003, Gentleman, 2005). For a reproducible piece of research, the orig- inal observations, all data preprocessing steps, the statistical analysis as well as the scientiﬁc report form a unity and all need to be available for inspection, reproduction and modiﬁcation by the readers. Reproducibility is a natural requirement for textbooks such as the Handbook of Statistical Analyses Using R and therefore this book is fully reproducible using an R version greater or equal to 2.2.1. All analyses and results, including ﬁgures and tables, can be reproduced by the reader without having to retype a single line of R code. The data sets presented in this book are collected in a dedicated add-on package called HSAUR accompanying this book. The package can be installed from the Comprehensive R Archive Network (CRAN) via R> install.packages(\"HSAUR\") and its functionality is attached by R> library(\"HSAUR\") The relevant parts of each chapter are available as a vignette, basically a © 2010 by Taylor and Francis Group, LLC

document including both the R sources and the rendered output of every analysis contained in the book. For example, the ﬁrst chapter can be inspected by R> vignette(\"Ch_introduction_to_R\", package = \"HSAUR\") and the R sources are available for reproducing our analyses by R> edit(vignette(\"Ch_introduction_to_R\", package = \"HSAUR\")) An overview on all chapter vignettes included in the package can be obtained from R> vignette(package = \"HSAUR\") We welcome comments on the R package HSAUR, and where we think these add to or improve our analysis of a data set we will incorporate them into the package and, hopefully at a later stage, into a revised or second edition of the book. Plots and tables of results obtained from R are all labelled as ‘Figures’ in the text. For the graphical material, the corresponding ﬁgure also contains the ‘essence’ of the R code used to produce the ﬁgure, although this code may diﬀer a little from that given in the HSAUR package, since the latter may include some features, for example thicker line widths, designed to make a basic plot more suitable for publication. We would like to thank the R Development Core Team for the R system, and authors of contributed add-on packages, particularly Uwe Ligges and Vince Carey for helpful advice on scatterplot3d and gee. Kurt Hornik, Ludwig A. Hothorn, Fritz Leisch and Rafael Weißbach provided good advice with some statistical and technical problems. We are also very grateful to Achim Zeileis for reading the entire manuscript, pointing out inconsistencies or even bugs and for making many suggestions which have led to improvements. Lastly we would like to thank the CRC Press staﬀ, in particular Rob Calver, for their support during the preparation of the book. Any errors in the book are, of course, the joint responsibility of the two authors. Brian S. Everitt and Torsten Hothorn London and Erlangen, December 2005 © 2010 by Taylor and Francis Group, LLC

List of Figures 1.1 Histograms of the market value and the logarithm of the market value for the companies contained in the Forbes 2000 list. 19 1.2 Raw scatterplot of the logarithms of market value and sales. 20 1.3 Scatterplot with transparent shading of points of the loga- rithms of market value and sales. 21 1.4 Boxplots of the logarithms of the market value for four selected countries, the width of the boxes is proportional to the square roots of the number of companies. 22 2.1 Histogram (top) and boxplot (bottom) of malignant melanoma mortality rates. 30 2.2 Parallel boxplots of malignant melanoma mortality rates by contiguity to an ocean. 31 2.3 Estimated densities of malignant melanoma mortality rates by contiguity to an ocean. 32 2.4 Scatterplot of malignant melanoma mortality rates by geo- graphical location. 33 2.5 Scatterplot of malignant melanoma mortality rates against latitude. 34 2.6 Bar chart of happiness. 35 2.7 Spineplot of health status and happiness. 36 2.8 Spinogram (left) and conditional density plot (right) of happiness depending on log-income 38 3.1 Boxplots of estimates of room width in feet and metres (after conversion to feet) and normal probability plots of estimates of room width made in feet and in metres. 55 3.2 R output of the independent samples t-test for the roomwidth data. 56 3.3 R output of the independent samples Welch test for the roomwidth data. 56 3.4 R output of the Wilcoxon rank sum test for the roomwidth data. 57 3.5 Boxplot and normal probability plot for diﬀerences between the two mooring methods. 58 © 2010 by Taylor and Francis Group, LLC

3.6 R output of the paired t-test for the waves data. 59 3.7 R output of the Wilcoxon signed rank test for the waves data. 59 3.8 Enhanced scatterplot of water hardness and mortality, showing both the joint and the marginal distributions and, in addition, the location of the city by diﬀerent plotting symbols. 60 3.9 R output of Pearsons’ correlation coeﬃcient for the water data. 61 3.10 R output of the chi-squared test for the pistonrings data. 61 3.11 Association plot of the residuals for the pistonrings data. 62 3.12 R output of McNemar’s test for the rearrests data. 63 3.13 R output of an exact version of McNemar’s test for the rearrests data computed via a binomial test. 63 4.1 An approximation for the conditional distribution of the diﬀerence of mean roomwidth estimates in the feet and metres group under the null hypothesis. The vertical lines show the negative and positive absolute value of the test statistic T obtained from the original data. 71 4.2 R output of the exact permutation test applied to the roomwidth data. 72 4.3 R output of the exact conditional Wilcoxon rank sum test applied to the roomwidth data. 73 4.4 R output of Fisher’s exact test for the suicides data. 73 5.1 Plot of mean weight gain for each level of the two factors. 84 5.2 R output of the ANOVA ﬁt for the weightgain data. 85 5.3 Interaction plot of type and source. 86 5.4 Plot of mean litter weight for each level of the two factors for the foster data. 87 5.5 Graphical presentation of multiple comparison results for the foster feeding data. 90 5.6 Scatterplot matrix of epoch means for Egyptian skulls data. 92 6.1 Scatterplot of velocity and distance. 104 6.2 Scatterplot of velocity and distance with estimated regression line (left) and plot of residuals against ﬁtted values (right). 105 6.3 Boxplots of rainfall. 107 6.4 Scatterplots of rainfall against the continuous covariates. 108 6.5 R output of the linear model ﬁt for the clouds data. 109 6.6 Regression relationship between S-Ne criterion and rainfall with and without seeding. 111 6.7 Plot of residuals against ﬁtted values for clouds seeding data. 113 © 2010 by Taylor and Francis Group, LLC

6.8 Normal probability plot of residuals from cloud seeding model clouds_lm. 114 6.9 Index plot of Cook’s distances for cloud seeding data. 115 7.1 Conditional density plots of the erythrocyte sedimentation rate (ESR) given ﬁbrinogen and globulin. 123 7.2 R output of the summary method for the logistic regression model ﬁtted to ESR and ﬁbrigonen. 124 7.3 R output of the summary method for the logistic regression model ﬁtted to ESR and both globulin and ﬁbrinogen. 125 7.4 Bubbleplot of ﬁtted values for a logistic regression model ﬁtted to the plasma data. 126 7.5 R output of the summary method for the logistic regression model ﬁtted to the womensrole data. 127 7.6 Fitted (from womensrole_glm_1) and observed probabilities of agreeing for the womensrole data. 129 7.7 R output of the summary method for the logistic regression model ﬁtted to the womensrole data. 130 7.8 Fitted (from womensrole_glm_2) and observed probabilities of agreeing for the womensrole data. 131 7.9 Plot of deviance residuals from logistic regression model ﬁtted to the womensrole data. 132 7.10 R output of the summary method for the Poisson regression model ﬁtted to the polyps data. 133 7.11 R output of the print method for the conditional logistic regression model ﬁtted to the backpain data. 136 8.1 Three commonly used kernel functions. 144 8.2 Kernel estimate showing the contributions of Gaussian kernels evaluated for the individual observations with bandwidth h = 0.4. 145 8.3 Epanechnikov kernel for a grid between (−1.1, −1.1) and (1.1, 1.1). 146 8.4 Density estimates of the geyser eruption data imposed on a histogram of the data. 148 8.5 A contour plot of the bivariate density estimate of the CYGOB1 data, i.e., a two-dimensional graphical display for a three-dimensional problem. 149 8.6 The bivariate density estimate of the CYGOB1 data, here shown in a three-dimensional fashion using the persp function. 150 8.7 Fitted normal density and two-component normal mixture for geyser eruption data. 152 8.8 Bootstrap distribution and conﬁdence intervals for the mean estimates of a two-component mixture for the geyser data. 155 © 2010 by Taylor and Francis Group, LLC

9.1 Initial tree for the body fat data with the distribution of body fat in terminal nodes visualised via boxplots. 166 9.2 Pruned regression tree for body fat data. 167 9.3 Observed and predicted DXA measurements. 168 9.4 Pruned classiﬁcation tree of the glaucoma data with class distribution in the leaves. 169 9.5 Estimated class probabilities depending on two important variables. The 0.5 cut-oﬀ for the estimated glaucoma proba- bility is depicted as a horizontal line. Glaucomateous eyes are plotted as circles and normal eyes are triangles. 172 9.6 Conditional inference tree with the distribution of body fat content shown for each terminal leaf. 173 9.7 Conditional inference tree with the distribution of glaucoma- teous eyes shown for each terminal leaf. 174 10.1 A linear spline function with knots at a = 1, b = 3 and c = 5. 183 10.2 Scatterplot of year and winning time. 187 10.3 Scatterplot of year and winning time with ﬁtted values from a simple linear model. 188 10.4 Scatterplot of year and winning time with ﬁtted values from a smooth non-parametric model. 189 10.5 Scatterplot of year and winning time with ﬁtted values from a quadratic model. 190 10.6 Partial contributions of six exploratory covariates to the predicted SO 2 concentration. 191 10.7 Residual plot of SO 2 concentration. 192 10.8 Spinograms of the three exploratory variables and response variable kyphosis. 193 10.9 Partial contributions of three exploratory variables with conﬁdence bands. 194 11.1 ‘Bath tub’ shape of a hazard function. 202 11.2 Survival times comparing treated and control patients. 205 11.3 Kaplan-Meier estimates for breast cancer patients who either received a hormonal therapy or not. 207 11.4 R output of the summary method for GBSG2_coxph. 208 11.5 Estimated regression coeﬃcient for age depending on time for the GBSG2 data. 209 11.6 Martingale residuals for the GBSG2 data. 210 11.7 Conditional inference tree for the GBSG2 data with the survival function, estimated by Kaplan-Meier, shown for every subgroup of patients identiﬁed by the tree. 211 12.1 Boxplots for the repeated measures by treatment group for the BtheB data. 220 © 2010 by Taylor and Francis Group, LLC

12.2 R output of the linear mixed-eﬀects model ﬁt for the BtheB data. 222 12.3 R output of the asymptotic p-values for linear mixed-eﬀects model ﬁt for the BtheB data. 223 12.4 Quantile-quantile plots of predicted random intercepts and residuals for the random intercept model BtheB_lmer1 ﬁtted to the BtheB data. 224 12.5 Distribution of BDI values for patients that do (circles) and do not (bullets) attend the next scheduled visit. 227 13.1 Simulation of a positive response in a random intercept logistic regression model for 20 subjects. The thick line is the average over all 20 subjects. 237 13.2 R output of the summary method for the btb_gee model (slightly abbreviated). 239 13.3 R output of the summary method for the btb_gee1 model (slightly abbreviated). 240 13.4 R output of the summary method for the resp_glm model. 241 13.5 R output of the summary method for the resp_gee1 model (slightly abbreviated). 242 13.6 R output of the summary method for the resp_gee2 model (slightly abbreviated). 243 13.7 Boxplots of numbers of seizures in each two-week period post randomisation for placebo and active treatments. 244 13.8 Boxplots of log of numbers of seizures in each two-week period post randomisation for placebo and active treatments. 245 13.9 R output of the summary method for the epilepsy_glm model. 246 13.10 R output of the summary method for the epilepsy_gee1 model (slightly abbreviated). 247 13.11 R output of the summary method for the epilepsy_gee2 model (slightly abbreviated). 248 13.12 R output of the summary method for the epilepsy_gee3 model (slightly abbreviated). 249 13.13 R output of the summary method for the resp_lmer model (abbreviated). 249 14.1 Distribution of levels of expressed alpha synuclein mRNA in three groups deﬁned by the NACP-REP1 allele lengths. 258 14.2 Simultaneous conﬁdence intervals for the alpha data based on the ordinary covariance matrix (left) and a sandwich estimator (right). 261 14.3 Probability of damage caused by roe deer browsing for six tree species. Sample sizes are given in brackets. 263 © 2010 by Taylor and Francis Group, LLC

14.4 Regression relationship between S-Ne criterion and rainfall with and without seeding. The conﬁdence bands cover the area within the dashed curves. 265 15.1 R output of the summary method for smokingOR. 274 15.2 Forest plot of observed eﬀect sizes and 95% conﬁdence intervals for the nicotine gum studies. 275 15.3 R output of the summary method for BCG_OR. 277 15.4 R output of the summary method for BCG_DSL. 278 15.5 R output of the summary method for BCG_mod. 279 15.6 Plot of observed eﬀect size for the BCG vaccine data against latitude, with a weighted least squares regression ﬁt shown in addition. 280 15.7 Example funnel plots from simulated data. The asymmetry in the lower plot is a hint that a publication bias might be a problem. 281 15.8 Funnel plot for nicotine gum data. 282 16.1 Scatterplot matrix for the heptathlon data (all countries). 289 16.2 Scatterplot matrix for the heptathlon data after removing observations of the PNG competitor. 291 16.3 Barplot of the variances explained by the principal compo- nents. (with observations for PNG removed). 294 16.4 Biplot of the (scaled) ﬁrst two principal components (with observations for PNG removed). 295 16.5 Scatterplot of the score assigned to each athlete in 1988 and the ﬁrst principal component. 296 17.1 Two-dimensional solution from classical multidimensional scaling of distance matrix for water vole populations. 306 17.2 Minimum spanning tree for the watervoles data. 308 17.3 Two-dimensional solution from non-metric multidimensional scaling of distance matrix for voting matrix. 309 17.4 The Shepard diagram for the voting data shows some discrepancies between the original dissimilarities and the multidimensional scaling solution. 310 18.1 Bivariate data showing the presence of three clusters. 319 18.2 Example of a dendrogram. 321 18.3 Darwin’s Tree of Life. 322 18.4 Image plot of the dissimilarity matrix of the pottery data. 326 18.5 Hierarchical clustering of pottery data and resulting den- drograms. 327 18.6 3D scatterplot of the logarithms of the three variables available for each of the exoplanets. 328 © 2010 by Taylor and Francis Group, LLC

18.7 Within-cluster sum of squares for diﬀerent numbers of clusters for the exoplanet data. 329 18.8 Plot of BIC values for a variety of models and a range of number of clusters. 331 18.9 Scatterplot matrix of planets data showing a three-cluster solution from Mclust. 332 18.10 3D scatterplot of planets data showing a three-cluster solution from Mclust. 333 © 2010 by Taylor and Francis Group, LLC

List of Tables 2.1 USmelanoma data. USA mortality rates for white males due to malignant melanoma. 25 2.2 CHFLS data. Chinese Health and Family Life Survey. 28 2.3 household data. Household expenditure for single men and women. 40 2.4 suicides2 data. Mortality rates per 100, 000 from male suicides. 41 2.5 USstates data. Socio-demographic variables for ten US states. 42 2.6 banknote data (package alr3). Swiss bank note data. 43 3.1 roomwidth data. Room width estimates (width) in feet and in metres (unit). 45 3.2 waves data. Bending stress (root mean squared bending moment in Newton metres) for two mooring methods in a wave energy experiment. 46 3.3 water data. Mortality (per 100,000 males per year, mor- tality) and water hardness for 61 cities in England and Wales. 47 3.4 pistonrings data. Number of piston ring failures for three legs of four compressors. 49 3.5 rearrests data. Rearrests of juvenile felons by type of court in which they were tried. 49 3.6 The general r × c table. 52 3.7 Frequencies in matched samples data. 53 4.1 suicides data. Crowd behaviour at threatened suicides. 66 4.2 Classiﬁcation system for the response variable. 66 4.3 Lanza data. Misoprostol randomised clinical trial from Lanza (1987). 66 4.4 Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1988a). 67 4.5 Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1988b). 67 © 2010 by Taylor and Francis Group, LLC

4.6 Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1989). 67 4.7 anomalies data. Abnormalities of the face and digits of newborn infants exposed to antiepileptic drugs as assessed by a paediatrician (MD) and a research assistant (RA). 68 4.8 orallesions data. Oral lesions found in house-to-house surveys in three geographic regions of rural India. 78 5.1 weightgain data. Rat weight gain for diets diﬀering by the amount of protein (type) and source of protein (source). 79 5.2 foster data. Foster feeding experiment for rats with diﬀerent genotypes of the litter (litgen) and mother (motgen). 80 5.3 skulls data. Measurements of four variables taken from Egyptian skulls of ﬁve periods. 81 5.4 schooldays data. Days absent from school. 95 5.5 students data. Treatment and results of two tests in three groups of students. 96 6.1 hubble data. Distance and velocity for 24 galaxies. 97 6.2 clouds data. Cloud seeding experiments in Florida – see above for explanations of the variables. 98 6.3 Analysis of variance table for the multiple linear regression model. 102 7.1 plasma data. Blood plasma data. 117 7.2 womensrole data. Women’s role in society data. 118 7.3 polyps data. Number of polyps for two treatment arms. 119 ¯ 7.4 backpain data. Number of drivers (D) and non-drivers (D), ¯ suburban (S) and city inhabitants (S) either suﬀering from a herniated disc (cases) or not (controls). 120 7.5 bladdercancer data. Number of recurrent tumours for bladder cancer patients. 137 7.6 leuk data (package MASS). Survival times of patients suﬀering from leukemia. 138 8.1 faithful data (package datasets). Old Faithful geyser waiting times between two eruptions. 139 8.2 CYGOB1 data. Energy output and surface temperature of Star Cluster CYG OB1. 141 8.3 galaxies data (package MASS). Velocities of 82 galaxies. 156 8.4 birthdeathrates data. Birth and death rates for 69 coun- tries. 157 8.5 schizophrenia data. Age on onset of schizophrenia for both sexes. 158 © 2010 by Taylor and Francis Group, LLC

9.1 bodyfat data (package mboost). Body fat prediction by skinfold thickness, circumferences, and bone breadths. 161 10.1 men1500m data. Olympic Games 1896 to 2004 winners of the men’s 1500m. 177 10.2 USairpollution data. Air pollution in 41 US cities. 178 10.3 kyphosis data (package rpart). Children who have had corrective spinal surgery. 180 11.1 glioma data. Patients suﬀering from two types of glioma treated with the standard therapy or a novel radioim- munotherapy (RIT). 197 11.2 GBSG2 data (package ipred). Randomised clinical trial data from patients suﬀering from node-positive breast cancer. Only the data of the ﬁrst 20 patients are shown here. 199 11.3 mastectomy data. Survival times in months after mastectomy of women with breast cancer. 212 12.1 BtheB data. Data of a randomised trial evaluating the eﬀects of Beat the Blues. 214 12.2 phosphate data. Plasma inorganic phosphate levels for various time points after glucose challenge. 228 13.1 respiratory data. Randomised clinical trial data from patients suﬀering from respiratory illness. Only the data of the ﬁrst seven patients are shown here. 231 13.2 epilepsy data. Randomised clinical trial data from patients suﬀering from epilepsy. Only the data of the ﬁrst seven patients are shown here. 232 13.3 schizophrenia2 data. Clinical trial data from patients suﬀering from schizophrenia. Only the data of the ﬁrst four patients are shown here. 251 14.1 alpha data (package coin). Allele length and levels of expressed alpha synuclein mRNA in alcohol-dependent patients. 253 14.2 trees513 data (package multcomp). 255 15.1 smoking data. Meta-analysis on nicotine gum showing the number of quitters who have been treated (qt), the total number of treated (tt) as well as the number of quitters in the control group (qc) with total number of smokers in the control group (tc). 268 © 2010 by Taylor and Francis Group, LLC

15.2 BCG data. Meta-analysis on BCG vaccine with the following data: the number of TBC cases after a vaccination with BCG (BCGTB), the total number of people who received BCG (BCG) as well as the number of TBC cases without vaccination (NoVaccTB) and the total number of people in the study without vaccination (NoVacc). 269 15.4 aspirin data. Meta-analysis on aspirin and myocardial infarct, the table shows the number of deaths after placebo (dp), the total number subjects treated with placebo (tp) as well as the number of deaths after aspirin (da) and the total number of subjects treated with aspirin (ta). 283 15.5 toothpaste data. Meta-analysis on trials comparing two toothpastes, the number of individuals in the study, the mean and the standard deviation for each study A and B are shown. 284 16.1 heptathlon data. Results Olympic heptathlon, Seoul, 1988. 286 16.2 meteo data. Meteorological measurements in an 11-year period. 297 16.3 Correlations for calculus measurements for the six anterior mandibular teeth. 297 17.1 watervoles data. Water voles data – dissimilarity matrix. 300 17.2 voting data. House of Representatives voting data. 301 17.3 eurodist data (package datasets). Distances between Euro- pean cities, in km. 312 17.4 gardenflowers data. Dissimilarity matrix of 18 species of gardenﬂowers. 313 18.1 pottery data. Romano-British pottery data. 315 18.2 planets data. Jupiter mass, period and eccentricity of exoplanets. 317 18.3 Number of possible partitions depending on the sample size n and number of clusters k. 322 © 2010 by Taylor and Francis Group, LLC

Contents 1 An Introduction to R 1 1.1 What is R? 1 1.2 Installing R 2 1.3 Help and Documentation 4 1.4 Data Objects in R 5 1.5 Data Import and Export 9 1.6 Basic Data Manipulation 11 1.7 Computing with Data 14 1.8 Organising an Analysis 20 1.9 Summary 21 2 Data Analysis Using Graphical Displays 25 2.1 Introduction 25 2.2 Initial Data Analysis 27 2.3 Analysis Using R 29 2.4 Summary 38 3 Simple Inference 45 3.1 Introduction 45 3.2 Statistical Tests 49 3.3 Analysis Using R 53 3.4 Summary 63 4 Conditional Inference 65 4.1 Introduction 65 4.2 Conditional Test Procedures 68 4.3 Analysis Using R 70 4.4 Summary 77 5 Analysis of Variance 79 5.1 Introduction 79 5.2 Analysis of Variance 82 5.3 Analysis Using R 83 5.4 Summary 94 © 2010 by Taylor and Francis Group, LLC

6 Simple and Multiple Linear Regression 97 6.1 Introduction 97 6.2 Simple Linear Regression 99 6.3 Multiple Linear Regression 100 6.4 Analysis Using R 103 6.5 Summary 112 7 Logistic Regression and Generalised Linear Models 117 7.1 Introduction 117 7.2 Logistic Regression and Generalised Linear Models 120 7.3 Analysis Using R 122 7.4 Summary 136 8 Density Estimation 139 8.1 Introduction 139 8.2 Density Estimation 141 8.3 Analysis Using R 147 8.4 Summary 155 9 Recursive Partitioning 161 9.1 Introduction 161 9.2 Recursive Partitioning 164 9.3 Analysis Using R 165 9.4 Summary 174 10 Smoothers and Generalised Additive Models 177 10.1 Introduction 177 10.2 Smoothers and Generalised Additive Models 181 10.3 Analysis Using R 186 11 Survival Analysis 197 11.1 Introduction 197 11.2 Survival Analysis 198 11.3 Analysis Using R 204 11.4 Summary 211 12 Analysing Longitudinal Data I 213 12.1 Introduction 213 12.2 Analysing Longitudinal Data 216 12.3 Linear Mixed Eﬀects Models 217 12.4 Analysis Using R 219 12.5 Prediction of Random Eﬀects 223 12.6 The Problem of Dropouts 223 12.7 Summary 226 © 2010 by Taylor and Francis Group, LLC

13 Analysing Longitudinal Data II 231 13.1 Introduction 231 13.2 Methods for Non-normal Distributions 233 13.3 Analysis Using R: GEE 238 13.4 Analysis Using R: Random Eﬀects 247 13.5 Summary 250 14 Simultaneous Inference and Multiple Comparisons 253 14.1 Introduction 253 14.2 Simultaneous Inference and Multiple Comparisons 256 14.3 Analysis Using R 257 14.4 Summary 264 15 Meta-Analysis 267 15.1 Introduction 267 15.2 Systematic Reviews and Meta-Analysis 269 15.3 Statistics of Meta-Analysis 271 15.4 Analysis Using R 273 15.5 Meta-Regression 276 15.6 Publication Bias 277 15.7 Summary 279 16 Principal Component Analysis 285 16.1 Introduction 285 16.2 Principal Component Analysis 285 16.3 Analysis Using R 288 16.4 Summary 295 17 Multidimensional Scaling 299 17.1 Introduction 299 17.2 Multidimensional Scaling 299 17.3 Analysis Using R 305 17.4 Summary 310 18 Cluster Analysis 315 18.1 Introduction 315 18.2 Cluster Analysis 318 18.3 Analysis Using R 325 18.4 Summary 334 Bibliography 335 © 2010 by Taylor and Francis Group, LLC

CHAPTER 1 An Introduction to R 1.1 What is R? The R system for statistical computing is an environment for data analysis and graphics. The root of R is the S language, developed by John Chambers and colleagues (Becker et al., 1988, Chambers and Hastie, 1992, Chambers, 1998) at Bell Laboratories (formerly AT&T, now owned by Lucent Technologies) starting in the 1960ies. The S language was designed and developed as a programming language for data analysis tasks but in fact it is a full-featured programming language in its current implementations. The development of the R system for statistical computing is heavily inﬂu- enced by the open source idea: The base distribution of R and a large number of user contributed extensions are available under the terms of the Free Soft- ware Foundation’s GNU General Public License in source code form. This licence has two major implications for the data analyst working with R. The complete source code is available and thus the practitioner can investigate the details of the implementation of a special method, can make changes and can distribute modiﬁcations to colleagues. As a side-eﬀect, the R system for statis- tical computing is available to everyone. All scientists, including, in particular, those working in developing countries, now have access to state-of-the-art tools for statistical data analysis without additional costs. With the help of the R system for statistical computing, research really becomes reproducible when both the data and the results of all data analysis steps reported in a paper are available to the readers through an R transcript ﬁle. R is most widely used for teaching undergraduate and graduate statistics classes at universities all over the world because students can freely use the statistical computing tools. The base distribution of R is maintained by a small group of statisticians, the R Development Core Team. A huge amount of additional functionality is implemented in add-on packages authored and maintained by a large group of volunteers. The main source of information about the R system is the world wide web with the oﬃcial home page of the R project being http://www.R-project.org All resources are available from this page: the R system itself, a collection of add-on packages, manuals, documentation and more. The intention of this chapter is to give a rather informal introduction to basic concepts and data manipulation techniques for the R novice. Instead of a rigid treatment of the technical background, the most common tasks 1 © 2010 by Taylor and Francis Group, LLC

© 2010 by Taylor and Francis Group, LLC

INSTALLING R 3 One can change the appearance of the prompt by > options(prompt = \"R> \") and we will use the prompt R> for the display of the code examples throughout this book. A + sign at the very beginning of a line indicates a continuing command after a newline. Essentially, the R system evaluates commands typed on the R prompt and returns the results of the computations. The end of a command is indicated by the return key. Virtually all introductory texts on R start with an example using R as a pocket calculator, and so do we: R> x <- sqrt(25) + 2 √ This simple statement asks the R interpreter to calculate 25 and then to add 2. The result of the operation is assigned to an R object with variable name x. The assignment operator <- binds the value of its right hand side to a variable name on the left hand side. The value of the object x can be inspected simply by typing R> x [1] 7 which, implicitly, calls the print method: R> print(x) [1] 7 1.2.2 Packages The base distribution already comes with some high-priority add-on packages, namely mgcv KernSmooth MASS base boot class cluster codetools datasets foreign grDevices graphics grid lattice methods nlme nnet rcompgen rpart spatial splines stats stats4 survival tcltk tools utils Some of the packages listed here implement standard statistical functionality, for example linear models, classical tests, a huge collection of high-level plot- ting functions or tools for survival analysis; many of these will be described and used in later chapters. Others provide basic infrastructure, for example for graphic systems, code analysis tools, graphical user-interfaces or other util- ities. Packages not included in the base distribution can be installed directly from the R prompt. At the time of writing this chapter, 1756 user-contributed packages covering almost all ﬁelds of statistical methodology were available. Certain so-called ‘task views’ for special topics, such as statistics in the social sciences, environmetrics, robust statistics etc., describe important and helpful packages and are available from © 2010 by Taylor and Francis Group, LLC

4 AN INTRODUCTION TO R http://CRAN.R-project.org/web/views/ Given that an Internet connection is available, a package is installed by supplying the name of the package to the function install.packages. If, for example, add-on functionality for robust estimation of covariance matrices via sandwich estimators is required (for example in Chapter 13), the sandwich package (Zeileis, 2004) can be downloaded and installed via R> install.packages(\"sandwich\") The package functionality is available after attaching the package by R> library(\"sandwich\") A comprehensive list of available packages can be obtained from http://CRAN.R-project.org/web/packages/ Note that on Windows operating systems, precompiled versions of packages are downloaded and installed. In contrast, packages are compiled locally before they are installed on Unix systems. 1.3 Help and Documentation Roughly, three diﬀerent forms of documentation for the R system for statis- tical computing may be distinguished: online help that comes with the base distribution or packages, electronic manuals and publications work in the form of books etc. The help system is a collection of manual pages describing each user-visible function and data set that comes with R. A manual page is shown in a pager or web browser when the name of the function we would like to get help for is supplied to the help function R> help(\"mean\") or, for short, R> ?mean Each manual page consists of a general description, the argument list of the documented function with a description of each single argument, information about the return value of the function and, optionally, references, cross-links and, in most cases, executable examples. The function help.search is helpful for searching within manual pages. An overview on documented topics in an add-on package is given, for example for the sandwich package, by R> help(package = \"sandwich\") Often a package comes along with an additional document describing the pack- age functionality and giving examples. Such a document is called a vignette (Leisch, 2003, Gentleman, 2005). For example, the sandwich package vignette is opened using R> vignette(\"sandwich\", package = \"sandwich\") More extensive documentation is available electronically from the collection of manuals at © 2010 by Taylor and Francis Group, LLC

DATA OBJECTS IN R 5 http://CRAN.R-project.org/manuals.html For the beginner, at least the ﬁrst and the second document of the following four manuals (R Development Core Team, 2009a,c,d,e) are mandatory: An Introduction to R: A more formal introduction to data analysis with R than this chapter. R Data Import/Export: A very useful description of how to read and write various external data formats. R Installation and Administration: Hints for installing R on special plat- forms. Writing R Extensions: The authoritative source on how to write R pro- grams and packages. Both printed and online publications are available, the most important ones are Modern Applied Statistics with S (Venables and Ripley, 2002), Introductory Statistics with R (Dalgaard, 2002), R Graphics (Murrell, 2005) and the R Newsletter, freely available from http://CRAN.R-project.org/doc/Rnews/ In case the electronically available documentation and the answers to fre- quently asked questions (FAQ), available from http://CRAN.R-project.org/faqs.html have been consulted but a problem or question remains unsolved, the r-help email list is the right place to get answers to well-thought-out questions. It is helpful to read the posting guide http://www.R-project.org/posting-guide.html before starting to ask. 1.4 Data Objects in R The data handling and manipulation techniques explained in this chapter will be illustrated by means of a data set of 2000 world leading companies, the Forbes 2000 list for the year 2004 collected by Forbes Magazine. This list is originally available from http://www.forbes.com and, as an R data object, it is part of the HSAUR2 package (Source: From Forbes.com, New York, New York, 2004. With permission.). In a ﬁrst step, we make the data available for computations within R. The data function searches for data objects of the speciﬁed name (\"Forbes2000\") in the package speciﬁed via the package argument and, if the search was successful, attaches the data object to the global environment: R> data(\"Forbes2000\", package = \"HSAUR2\") R> ls() [1] \"x\" \"Forbes2000\" © 2010 by Taylor and Francis Group, LLC

6 AN INTRODUCTION TO R The output of the ls function lists the names of all objects currently stored in the global environment, and, as the result of the previous command, a variable named Forbes2000 is available for further manipulation. The variable x arises from the pocket calculator example in Subsection 1.2.1. As one can imagine, printing a list of 2000 companies via R> print(Forbes2000) rank name country category sales 1 1 Citigroup United States Banking 94.71 2 2 General Electric United States Conglomerates 134.19 3 3 American Intl Group United States Insurance 76.66 profits assets marketvalue 1 17.85 1264.03 255.30 2 15.59 626.93 328.54 3 6.46 647.66 194.87 ... will not be particularly helpful in gathering some initial information about the data; it is more useful to look at a description of their structure found by using the following command R> str(Forbes2000) 'data.frame': 2000 obs. of 8 variables: $ rank : int 1 2 3 4 5 ... $ name : chr \"Citigroup\" \"General Electric\" ... $ country : Factor w/ 61 levels \"Africa\",\"Australia\",... $ category : Factor w/ 27 levels \"Aerospace & defense\",.. $ sales : num 94.7 134.2 ... $ profits : num 17.9 15.6 ... $ assets : num 1264 627 ... $ marketvalue: num 255 329 ... The output of the str function tells us that Forbes2000 is an object of class data.frame, the most important data structure for handling tabular statistical data in R. As expected, information about 2000 observations, i.e., companies, are stored in this object. For each observation, the following eight variables are available: rank: the ranking of the company, name: the name of the company, country: the country the company is situated in, category: a category describing the products the company produces, sales: the amount of sales of the company in billion US dollars, profits: the proﬁt of the company in billion US dollars, assets: the assets of the company in billion US dollars, marketvalue: the market value of the company in billion US dollars. A similar but more detailed description is available from the help page for the Forbes2000 object: © 2010 by Taylor and Francis Group, LLC

DATA OBJECTS IN R 7 R> help(\"Forbes2000\") or R> ?Forbes2000 All information provided by str can be obtained by specialised functions as well and we will now have a closer look at the most important of these. The R language is an object-oriented programming language, so every object is an instance of a class. The name of the class of an object can be determined by R> class(Forbes2000) [1] \"data.frame\" Objects of class data.frame represent data the traditional table-oriented way. Each row is associated with one single observation and each column corre- sponds to one variable. The dimensions of such a table can be extracted using the dim function R> dim(Forbes2000) [1] 2000 8 Alternatively, the numbers of rows and columns can be found using R> nrow(Forbes2000) [1] 2000 R> ncol(Forbes2000) [1] 8 The results of both statements show that Forbes2000 has 2000 rows, i.e., observations, the companies in our case, with eight variables describing the observations. The variable names are accessible from R> names(Forbes2000) [1] \"rank\" \"name\" \"country\" \"category\" [5] \"sales\" \"profits\" \"assets\" \"marketvalue\" The values of single variables can be extracted from the Forbes2000 object by their names, for example the ranking of the companies R> class(Forbes2000[,\"rank\"]) [1] \"integer\" is stored as an integer variable. Brackets [] always indicate a subset of a larger object, in our case a single variable extracted from the whole table. Because data.frames have two dimensions, observations and variables, the comma is required in order to specify that we want a subset of the second dimension, i.e., the variables. The rankings for all 2000 companies are represented in a vector structure the length of which is given by R> length(Forbes2000[,\"rank\"]) [1] 2000 © 2010 by Taylor and Francis Group, LLC

8 AN INTRODUCTION TO R A vector is the elementary structure for data handling in R and is a set of simple elements, all being objects of the same class. For example, a simple vector of the numbers one to three can be constructed by one of the following commands R> 1:3 [1] 1 2 3 R> c(1,2,3) [1] 1 2 3 R> seq(from = 1, to = 3, by = 1) [1] 1 2 3 The unique names of all 2000 companies are stored in a character vector R> class(Forbes2000[,\"name\"]) [1] \"character\" R> length(Forbes2000[,\"name\"]) [1] 2000 and the ﬁrst element of this vector is R> Forbes2000[,\"name\"][1] [1] \"Citigroup\" Because the companies are ranked, Citigroup is the world’s largest company according to the Forbes 2000 list. Further details on vectors and subsetting are given in Section 1.6. Nominal measurements are represented by factor variables in R, such as the category of the company’s business segment R> class(Forbes2000[,\"category\"]) [1] \"factor\" Objects of class factor and character basically diﬀer in the way their values are stored internally. Each element of a vector of class character is stored as a character variable whereas an integer variable indicating the level of a factor is saved for factor objects. In our case, there are R> nlevels(Forbes2000[,\"category\"]) [1] 27 diﬀerent levels, i.e., business categories, which can be extracted by R> levels(Forbes2000[,\"category\"]) [1] \"Aerospace & defense\" [2] \"Banking\" [3] \"Business services & supplies\" ... © 2010 by Taylor and Francis Group, LLC

DATA IMPORT AND EXPORT 9 As a simple summary statistic, the frequencies of the levels of such a factor variable can be found from R> table(Forbes2000[,\"category\"]) Aerospace & defense Banking 19 313 Business services & supplies 70 ... The sales, assets, proﬁts and market value variables are of type numeric, the natural data type for continuous or discrete measurements, for example R> class(Forbes2000[,\"sales\"]) [1] \"numeric\" and simple summary statistics such as the mean, median and range can be found from R> median(Forbes2000[,\"sales\"]) [1] 4.365 R> mean(Forbes2000[,\"sales\"]) [1] 9.69701 R> range(Forbes2000[,\"sales\"]) [1] 0.01 256.33 The summary method can be applied to a numeric vector to give a set of useful summary statistics, namely the minimum, maximum, mean, median and the 25% and 75% quartiles; for example R> summary(Forbes2000[,\"sales\"]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.010 2.018 4.365 9.697 9.548 256.300 1.5 Data Import and Export In the previous section, the data from the Forbes 2000 list of the world’s largest companies were loaded into R from the HSAUR2 package but we will now ex- plore practically more relevant ways to import data into the R system. The most frequent data formats the data analyst is confronted with are comma sep- arated ﬁles, Excel spreadsheets, ﬁles in SPSS format and a variety of SQL data base engines. Querying data bases is a nontrivial task and requires additional knowledge about querying languages, and we therefore refer to the R Data Import/Export manual – see Section 1.3. We assume that a comma separated ﬁle containing the Forbes 2000 list is available as Forbes2000.csv (such a ﬁle is part of the HSAUR2 source package in directory HSAUR2/inst/rawdata). When the ﬁelds are separated by commas and each row begins with a name (a text format typically created by Excel), we can read in the data as follows using the read.table function © 2010 by Taylor and Francis Group, LLC

10 AN INTRODUCTION TO R R> csvForbes2000 <- read.table(\"Forbes2000.csv\", + header = TRUE, sep = \",\", row.names = 1) The argument header = TRUE indicates that the entries in the ﬁrst line of the text ﬁle \"Forbes2000.csv\" should be interpreted as variable names. Columns are separated by a comma (sep = \",\"), users of continental versions of Excel should take care of the character symbol coding for decimal points (by default dec = \".\"). Finally, the ﬁrst column should be interpreted as row names but not as a variable (row.names = 1). Alternatively, the function read.csv can be used to read comma separated ﬁles. The function read.table by default guesses the class of each variable from the speciﬁed ﬁle. In our case, character variables are stored as factors R> class(csvForbes2000[,\"name\"]) [1] \"factor\" which is only suboptimal since the names of the companies are unique. How- ever, we can supply the types for each variable to the colClasses argument R> csvForbes2000 <- read.table(\"Forbes2000.csv\", + header = TRUE, sep = \",\", row.names = 1, + colClasses = c(\"character\", \"integer\", \"character\", + \"factor\", \"factor\", \"numeric\", \"numeric\", \"numeric\", + \"numeric\")) R> class(csvForbes2000[,\"name\"]) [1] \"character\" and check if this object is identical with our previous Forbes 2000 list object R> all.equal(csvForbes2000, Forbes2000) [1] TRUE The argument colClasses expects a character vector of length equal to the number of columns in the ﬁle. Such a vector can be supplied by the c function that combines the objects given in the parameter list into a vector R> classes <- c(\"character\", \"integer\", \"character\", \"factor\", + \"factor\", \"numeric\", \"numeric\", \"numeric\", \"numeric\") R> length(classes) [1] 9 R> class(classes) [1] \"character\" An R interface to the open data base connectivity standard (ODBC) is available in package RODBC and its functionality can be used to access Excel and Access ﬁles directly: R> library(\"RODBC\") R> cnct <- odbcConnectExcel(\"Forbes2000.xls\") R> sqlQuery(cnct, \"select * from \\"Forbes2000\\$\\"\") © 2010 by Taylor and Francis Group, LLC

BASIC DATA MANIPULATION 11 The function odbcConnectExcel opens a connection to the speciﬁed Excel or Access ﬁle which can be used to send SQL queries to the data base engine and retrieve the results of the query. Files in SPSS format are read in a way similar to reading comma separated ﬁles, using the function read.spss from package foreign (which comes with the base distribution). Exporting data from R is now rather straightforward. A comma separated ﬁle readable by Excel can be constructed from a data.frame object via R> write.table(Forbes2000, file = \"Forbes2000.csv\", sep = \",\", + col.names = NA) The function write.csv is one alternative and the functionality implemented in the RODBC package can be used to write data directly into Excel spread- sheets as well. Alternatively, when data should be saved for later processing in R only, R objects of arbitrary kind can be stored into an external binary ﬁle via R> save(Forbes2000, file = \"Forbes2000.rda\") where the extension .rda is standard. We can get the ﬁle names of all ﬁles with extension .rda from the working directory R> list.files(pattern = \"\\.rda\") [1] \"Forbes2000.rda\" and we can load the contents of the ﬁle into R by R> load(\"Forbes2000.rda\") 1.6 Basic Data Manipulation The examples shown in the previous section have illustrated the importance of data.frames for storing and handling tabular data in R. Internally, a data.frame is a list of vectors of a common length n, the number of rows of the table. Each of those vectors represents the measurements of one variable and we have seen that we can access such a variable by its name, for example the names of the companies R> companies <- Forbes2000[,\"name\"] Of course, the companies vector is of class character and of length 2000. A subset of the elements of the vector companies can be extracted using the [] subset operator. For example, the largest of the 2000 companies listed in the Forbes 2000 list is R> companies[1] [1] \"Citigroup\" and the top three companies can be extracted utilising an integer vector of the numbers one to three: R> 1:3 © 2010 by Taylor and Francis Group, LLC

12 AN INTRODUCTION TO R [1] 1 2 3 R> companies[1:3] [1] \"Citigroup\" \"General Electric\" [3] \"American Intl Group\" In contrast to indexing with positive integers, negative indexing returns all elements that are not part of the index vector given in brackets. For example, all companies except those with numbers four to two-thousand, i.e., the top three companies, are again R> companies[-(4:2000)] [1] \"Citigroup\" \"General Electric\" [3] \"American Intl Group\" The complete information about the top three companies can be printed in a similar way. Because data.frames have a concept of rows and columns, we need to separate the subsets corresponding to rows and columns by a comma. The statement R> Forbes2000[1:3, c(\"name\", \"sales\", \"profits\", \"assets\")] name sales profits assets 1 Citigroup 94.71 17.85 1264.03 2 General Electric 134.19 15.59 626.93 3 American Intl Group 76.66 6.46 647.66 extracts the variables name, sales, profits and assets for the three largest companies. Alternatively, a single variable can be extracted from a data.frame by R> companies <- Forbes2000$name which is equivalent to the previously shown statement R> companies <- Forbes2000[,\"name\"] We might be interested in extracting the largest companies with respect to an alternative ordering. The three top selling companies can be computed along the following lines. First, we need to compute the ordering of the com- panies’ sales R> order_sales <- order(Forbes2000$sales) which returns the indices of the ordered elements of the numeric vector sales. Consequently the three companies with the lowest sales are R> companies[order_sales[1:3]] [1] \"Custodia Holding\" \"Central European Media\" [3] \"Minara Resources\" The indices of the three top sellers are the elements 1998, 1999 and 2000 of the integer vector order_sales R> Forbes2000[order_sales[c(2000, 1999, 1998)], + c(\"name\", \"sales\", \"profits\", \"assets\")] © 2010 by Taylor and Francis Group, LLC

BASIC DATA MANIPULATION 13 name sales profits assets 10 Wal-Mart Stores 256.33 9.05 104.91 5 BP 232.57 10.27 177.57 4 ExxonMobil 222.88 20.96 166.99 Another way of selecting vector elements is the use of a logical vector being TRUE when the corresponding element is to be selected and FALSE otherwise. The companies with assets of more than 1000 billion US dollars are R> Forbes2000[Forbes2000$assets > 1000, + c(\"name\", \"sales\", \"profits\", \"assets\")] name sales profits assets 1 Citigroup 94.71 17.85 1264.03 9 Fannie Mae 53.13 6.48 1019.17 403 Mizuho Financial 24.40 -20.11 1115.90 where the expression Forbes2000$assets > 1000 indicates a logical vector of length 2000 with R> table(Forbes2000$assets > 1000) FALSE TRUE 1997 3 elements being either FALSE or TRUE. In fact, for some of the companies the measurement of the profits variable are missing. In R, missing values are treated by a special symbol, NA, indicating that this measurement is not avail- able. The observations with proﬁt information missing can be obtained via R> na_profits <- is.na(Forbes2000$profits) R> table(na_profits) na_profits FALSE TRUE 1995 5 R> Forbes2000[na_profits, + c(\"name\", \"sales\", \"profits\", \"assets\")] name sales profits assets 772 AMP 5.40 NA 42.94 1085 HHG 5.68 NA 51.65 1091 NTL 3.50 NA 10.59 1425 US Airways Group 5.50 NA 8.58 1909 Laidlaw International 4.48 NA 3.98 where the function is.na returns a logical vector being TRUE when the corre- sponding element of the supplied vector is NA. A more comfortable approach is available when we want to remove all observations with at least one miss- ing value from a data.frame object. The function complete.cases takes a data.frame and returns a logical vector being TRUE when the corresponding observation does not contain any missing value: R> table(complete.cases(Forbes2000)) © 2010 by Taylor and Francis Group, LLC

14 AN INTRODUCTION TO R FALSE TRUE 5 1995 Subsetting data.frames driven by logical expressions may induce a lot of typing which can be avoided. The subset function takes a data.frame as ﬁrst argument and a logical expression as second argument. For example, we can select a subset of the Forbes 2000 list consisting of all companies situated in the United Kingdom by R> UKcomp <- subset(Forbes2000, country == \"United Kingdom\") R> dim(UKcomp) [1] 137 8 i.e., 137 of the 2000 companies are from the UK. Note that it is not neces- sary to extract the variable country from the data.frame Forbes2000 when formulating the logical expression with subset. 1.7 Computing with Data 1.7.1 Simple Summary Statistics Two functions are helpful for getting an overview about R objects: str and summary, where str is more detailed about data types and summary gives a collection of sensible summary statistics. For example, applying the summary method to the Forbes2000 data set, R> summary(Forbes2000) results in the following output rank name country Min. : 1.0 Length:2000 United States :751 1st Qu.: 500.8 Class :character Japan :316 Median :1000.5 Mode :character United Kingdom:137 Mean :1000.5 Germany : 65 3rd Qu.:1500.2 France : 63 Max. :2000.0 Canada : 56 (Other) :612 category sales Banking : 313 Min. : 0.010 Diversified financials: 158 1st Qu.: 2.018 Insurance : 112 Median : 4.365 Utilities : 110 Mean : 9.697 Materials : 97 3rd Qu.: 9.547 Oil & gas operations : 90 Max. :256.330 (Other) :1120 profits assets marketvalue Min. :-25.8300 Min. : 0.270 Min. : 0.02 1st Qu.: 0.0800 1st Qu.: 4.025 1st Qu.: 2.72 Median : 0.2000 Median : 9.345 Median : 5.15 Mean : 0.3811 Mean : 34.042 Mean : 11.88 3rd Qu.: 0.4400 3rd Qu.: 22.793 3rd Qu.: 10.60 © 2010 by Taylor and Francis Group, LLC

COMPUTING WITH DATA 15 Max. : 20.9600 Max. :1264.030 Max. :328.54 NA's : 5.0000 From this output we can immediately see that most of the companies are situated in the US and that most of the companies are working in the banking sector as well as that negative proﬁts, or losses, up to 26 billion US dollars occur. Internally, summary is a so-called generic function with methods for a multi- tude of classes, i.e., summary can be applied to objects of diﬀerent classes and will report sensible results. Here, we supply a data.frame object to summary where it is natural to apply summary to each of the variables in this data.frame. Because a data.frame is a list with each variable being an element of that list, the same eﬀect can be achieved by R> lapply(Forbes2000, summary) The members of the apply family help to solve recurring tasks for each element of a data.frame, matrix, list or for each level of a factor. It might be interesting to compare the proﬁts in each of the 27 categories. To do so, we ﬁrst compute the median proﬁt for each category from R> mprofits <- tapply(Forbes2000$profits, + Forbes2000$category, median, na.rm = TRUE) a command that should be read as follows. For each level of the factor cat- egory, determine the corresponding elements of the numeric vector profits and supply them to the median function with additional argument na.rm = TRUE. The latter one is necessary because profits contains missing values which would lead to a non-sensible result of the median function R> median(Forbes2000$profits) [1] NA The three categories with highest median proﬁt are computed from the vector of sorted median proﬁts R> rev(sort(mprofits))[1:3] Oil & gas operations Drugs & biotechnology 0.35 0.35 Household & personal products 0.31 where rev rearranges the vector of median proﬁts sorted from smallest to largest. Of course, we can replace the median function with mean or whatever is appropriate in the call to tapply. In our situation, mean is not a good choice, because the distributions of proﬁts or sales are naturally skewed. Simple graph- ical tools for the inspection of the empirical distributions are introduced later on and in Chapter 2. 1.7.2 Customising Analyses In the preceding sections we have done quite complex analyses on our data using functions available from R. However, the real power of the system comes © 2010 by Taylor and Francis Group, LLC

16 AN INTRODUCTION TO R to light when writing our own functions for our own analysis tasks. Although R is a full-featured programming language, writing small helper functions for our daily work is not too complicated. We’ll study two example cases. At ﬁrst, we want to add a robust measure of variability to the location measures computed in the previous subsection. In addition to the median proﬁt, computed via R> median(Forbes2000$profits, na.rm = TRUE) [1] 0.2 we want to compute the inter-quartile range, i.e., the diﬀerence between the 3rd and 1st quartile. Although a quick search in the manual pages (via help(\"interquartile\")) brings function IQR to our attention, we will ap- proach this task without making use of this tool, but using function quantile for computing sample quantiles only. A function in R is nothing but an object, and all objects are created equal. Thus, we ‘just’ have to assign a function object to a variable. A function object consists of an argument list, deﬁning arguments and possibly default values, and a body deﬁning the computations. The body starts and ends with braces. Of course, the body is assumed to be valid R code. In most cases we expect a function to return an object, therefore, the body will contain one or more return statements the arguments of which deﬁne the return values. Returning to our example, we’ll name our function iqr. The iqr function should operate on numeric vectors, therefore it should have an argument x. This numeric vector will be passed on to the quantile function for computing st the sample quartiles. The required diﬀerence between the 3 rd and 1 quartile can then be computed using diff. The deﬁnition of our function reads as follows R> iqr <- function(x) { + q <- quantile(x, prob = c(0.25, 0.75), names = FALSE) + return(diff(q)) + } A simple test on simulated data from a standard normal distribution shows that our ﬁrst function actually works, a comparison with the IQR function shows that the result is correct: R> xdata <- rnorm(100) R> iqr(xdata) [1] 1.495980 R> IQR(xdata) [1] 1.495980 However, when the numeric vector contains missing values, our function fails as the following example shows: R> xdata[1] <- NA R> iqr(xdata) © 2010 by Taylor and Francis Group, LLC

COMPUTING WITH DATA 17 Error in quantile.default(x, prob = c(0.25, 0.75)): missing values and NaN's not allowed if 'na.rm' is FALSE In order to make our little function more ﬂexible it would be helpful to add all arguments of quantile to the argument list of iqr. The copy-and- paste approach that ﬁrst comes to mind is likely to lead to inconsistencies and errors, for example when the argument list of quantile changes. Instead, the dot argument, a wildcard for any argument, is more appropriate and we redeﬁne our function accordingly: R> iqr <- function(x, ...) { + q <- quantile(x, prob = c(0.25, 0.75), names = FALSE, + ...) + return(diff(q)) + } R> iqr(xdata, na.rm = TRUE) [1] 1.503438 R> IQR(xdata, na.rm = TRUE) [1] 1.503438 Now, we can assess the variability of the proﬁts using our new iqr tool: R> iqr(Forbes2000$profits, na.rm = TRUE) [1] 0.36 Since there is no diﬀerence between functions that have been written by one of the R developers and user-created functions, we can compute the inter-quartile range of proﬁts for each of the business categories by using our iqr function inside a tapply statement; R> iqr_profits <- tapply(Forbes2000$profits, + Forbes2000$category, iqr, na.rm = TRUE) and extract the categories with the smallest and greatest variability R> levels(Forbes2000$category)[which.min(iqr_profits)] [1] \"Hotels restaurants & leisure\" R> levels(Forbes2000$category)[which.max(iqr_profits)] [1] \"Drugs & biotechnology\" We observe less variable proﬁts in tourism enterprises compared with proﬁts in the pharmaceutical industry. As other members of the apply family, tapply is very helpful when the same task is to be done more than one time. Moreover, its use is more convenient compared to the usage of for loops. For the sake of completeness, we will compute the category-wise inter-quartile range of the proﬁts using a for loop. Like a function, a for loop consists of a body, i.e., a chain of R commands to be executed. In addition, we need a set of values and a variable that iterates over this set. Here, the set we are interested in is the business categories: © 2010 by Taylor and Francis Group, LLC

18 AN INTRODUCTION TO R R> bcat <- Forbes2000$category R> iqr_profits2 <- numeric(nlevels(bcat)) R> names(iqr_profits2) <- levels(bcat) R> for (cat in levels(bcat)) { + catprofit <- subset(Forbes2000, category == cat)$profit + this_iqr <- iqr(catprofit, na.rm = TRUE) + iqr_profits2[levels(bcat) == cat] <- this_iqr + } Compared to the usage of tapply, the above code is rather complicated. At ﬁrst, we have to set up a vector for storing the results and assign the appro- priate names to it. Next, inside the body of the for loop, the iqr function has to be called on the appropriate subset of all companies of the current business category cat. The corresponding inter-quartile range must then be assigned to the correct vector element in the result vector. Luckily, such complicated constructs will be used in only one of the remaining chapters of the book and are almost always avoidable in practical data analyses. 1.7.3 Simple Graphics The degree of skewness of a distribution can be investigated by constructing histograms using the hist function. (More sophisticated alternatives such as smooth density estimates will be considered in Chapter 8.) For example, the code for producing Figure 1.1 ﬁrst divides the plot region into two equally spaced rows (the layout function) and then plots the histograms of the raw market values in the upper part using the hist function. The lower part of the ﬁgure depicts the histogram for the log transformed market values which appear to be more symmetric. Bivariate relationships of two continuous variables are usually depicted as scatterplots. In R, regression relationships are speciﬁed by so-called model formulae which, in a simple bivariate case, may look like R> fm <- marketvalue ~ sales R> class(fm) [1] \"formula\" with the dependent variable on the left hand side and the independent variable on the right hand side. The tilde separates left and right hand sides. Such a model formula can be passed to a model function (for example to the linear model function as explained in Chapter 6). The plot generic function imple- ments a formula method as well. Because the distributions of both market value and sales are skewed we choose to depict their logarithms. A raw scat- terplot of 2000 data points (Figure 1.2) is rather uninformative due to areas with very high density. This problem can be avoided by choosing a transparent color for the dots as shown in Figure 1.3. If the independent variable is a factor, a boxplot representation is a natural choice. For four selected countries, the distributions of the logarithms of the © 2010 by Taylor and Francis Group, LLC

COMPUTING WITH DATA 19 R> layout(matrix(1:2, nrow = 2)) R> hist(Forbes2000$marketvalue) R> hist(log(Forbes2000$marketvalue)) Histogram of Forbes2000$marketvalue Frequency 1000 0 0 50 100 150 200 250 300 350 Forbes2000$marketvalue Histogram of log(Forbes2000$marketvalue) 800 Frequency 400 0 −4 −2 0 2 4 6 log(Forbes2000$marketvalue) Figure 1.1 Histograms of the market value and the logarithm of the market value for the companies contained in the Forbes 2000 list. market value may be visually compared in Figure 1.4. Prior to calling the plot function on our data, we have to remove empty levels from the country variable, because otherwise the x-axis would show all and not only the selected countries. This task is most easily performed by subsetting the corresponding factor with additional argument drop = TRUE. Here, the width of the boxes are proportional to the square root of the number of companies for each coun- try and extremely large or small market values are depicted by single points. More elaborate graphical methods will be discussed in Chapter 2. © 2010 by Taylor and Francis Group, LLC

20 AN INTRODUCTION TO R R> plot(log(marketvalue) ~ log(sales), data = Forbes2000, + pch = \".\") 6 4 log(marketvalue) 2 0 −2 −4 −4 −2 0 2 4 log(sales) Figure 1.2 Raw scatterplot of the logarithms of market value and sales. 1.8 Organising an Analysis Although it is possible to perform an analysis typing all commands directly on the R prompt it is much more comfortable to maintain a separate text ﬁle collecting all steps necessary to perform a certain data analysis task. Such an R transcript ﬁle, for example called analysis.R created with your favourite text editor, can be sourced into R using the source command R> source(\"analysis.R\", echo = TRUE) When all steps of a data analysis, i.e., data preprocessing, transformations, simple summary statistics and plots, model building and inference as well as reporting, are collected in such an R transcript ﬁle, the analysis can be © 2010 by Taylor and Francis Group, LLC

© 2010 by Taylor and Francis Group, LLC

22 AN INTRODUCTION TO R R> tmp <- subset(Forbes2000, + country %in% c(\"United Kingdom\", \"Germany\", + \"India\", \"Turkey\")) R> tmp$country <- tmp$country[,drop = TRUE] R> plot(log(marketvalue) ~ country, data = tmp, + ylab = \"log(marketvalue)\", varwidth = TRUE) 4 log(marketvalue) 2 0 −2 Germany India Turkey United Kingdom country Figure 1.4 Boxplots of the logarithms of the market value for four selected coun- tries, the width of the boxes is proportional to the square roots of the number of companies. © 2010 by Taylor and Francis Group, LLC

SUMMARY 23 examples of these functions and those that produce more interesting graphics in later chapters. Exercises Ex. 1.1 Calculate the median proﬁt for the companies in the US and the median proﬁt for the companies in the UK, France and Germany. Ex. 1.2 Find all German companies with negative proﬁt. Ex. 1.3 To which business category do most of the Bermuda island companies belong? Ex. 1.4 For the 50 companies in the Forbes data set with the highest proﬁts, plot sales against assets (or some suitable transformation of each variable), labelling each point with the appropriate country name which may need to be abbreviated (using abbreviate) to avoid making the plot look too ‘messy’. Ex. 1.5 Find the average value of sales for the companies in each country in the Forbes data set, and ﬁnd the number of companies in each country with proﬁts above 5 billion US dollars. © 2010 by Taylor and Francis Group, LLC

CHAPTER 2 Data Analysis Using Graphical Displays: Malignant Melanoma in the USA and Chinese Health and Family Life 2.1 Introduction Fisher and Belle (1993) report mortality rates due to malignant melanoma of the skin for white males during the period 1950–1969, for each state on the US mainland. The data are given in Table 2.1 and include the number of deaths due to malignant melanoma in the corresponding state, the longitude and latitude of the geographic centre of each state, and a binary variable indicating contiguity to an ocean, that is, if the state borders one of the oceans. Questions of interest about these data include: how do the mortality rates compare for ocean and non-ocean states? and how are mortality rates aﬀected by latitude and longitude? Table 2.1: USmelanoma data. USA mortality rates for white males due to malignant melanoma. mortality latitude longitude ocean Alabama 219 33.0 87.0 yes Arizona 160 34.5 112.0 no Arkansas 170 35.0 92.5 no California 182 37.5 119.5 yes Colorado 149 39.0 105.5 no Connecticut 159 41.8 72.8 yes Delaware 200 39.0 75.5 yes District of Columbia 177 39.0 77.0 no Florida 197 28.0 82.0 yes Georgia 214 33.0 83.5 yes Idaho 116 44.5 114.0 no Illinois 124 40.0 89.5 no Indiana 128 40.2 86.2 no Iowa 128 42.2 93.8 no Kansas 166 38.5 98.5 no Kentucky 147 37.8 85.0 no Louisiana 190 31.2 91.8 yes 25 © 2010 by Taylor and Francis Group, LLC

26 DATA ANALYSIS USING GRAPHICAL DISPLAYS Table 2.1: USmelanoma data (continued). mortality latitude longitude ocean Maine 117 45.2 69.0 yes Maryland 162 39.0 76.5 yes Massachusetts 143 42.2 71.8 yes Michigan 117 43.5 84.5 no Minnesota 116 46.0 94.5 no Mississippi 207 32.8 90.0 yes Missouri 131 38.5 92.0 no Montana 109 47.0 110.5 no Nebraska 122 41.5 99.5 no Nevada 191 39.0 117.0 no New Hampshire 129 43.8 71.5 yes New Jersey 159 40.2 74.5 yes New Mexico 141 35.0 106.0 no New York 152 43.0 75.5 yes North Carolina 199 35.5 79.5 yes North Dakota 115 47.5 100.5 no Ohio 131 40.2 82.8 no Oklahoma 182 35.5 97.2 no Oregon 136 44.0 120.5 yes Pennsylvania 132 40.8 77.8 no Rhode Island 137 41.8 71.5 yes South Carolina 178 33.8 81.0 yes South Dakota 86 44.8 100.0 no Tennessee 186 36.0 86.2 no Texas 229 31.5 98.0 yes Utah 142 39.5 111.5 no Vermont 153 44.0 72.5 yes Virginia 166 37.5 78.5 yes Washington 117 47.5 121.0 yes West Virginia 136 38.8 80.8 no Wisconsin 110 44.5 90.2 no Wyoming 134 43.0 107.5 no Source: From Fisher, L. D., and Belle, G. V., Biostatistics. A Methodology for the Health Sciences, John Wiley & Sons, Chichester, UK, 1993. With permission. Contemporary China is on the leading edge of a sexual revolution, with tremendous regional and generational diﬀerences that provide unparalleled natural experiments for analysis of the antecedents and outcomes of sexual behaviour. The Chinese Health and Family Life Study, conducted 1999–2000 as a collaborative research project of the Universities of Chicago, Beijing, and © 2010 by Taylor and Francis Group, LLC

INITIAL DATA ANALYSIS 27 North Carolina, provides a baseline from which to anticipate and track future changes. Speciﬁcally, this study produces a baseline set of results on sexual behaviour and disease patterns, using a nationally representative probability sample. The Chinese Health and Family Life Survey sampled 60 villages and urban neighbourhoods chosen in such a way as to represent the full geographi- cal and socioeconomic range of contemporary China excluding Hong Kong and Tibet. Eighty-three individuals were chosen at random for each location from oﬃcial registers of adults aged between 20 and 64 years to target a sample of 5000 individuals in total. Here, we restrict our attention to women with cur- rent male partners for whom no information was missing, leading to a sample of 1534 women with the following variables (see Table 2.2 for example data sets): R_edu: level of education of the responding woman, R_income: monthly income (in yuan) of the responding woman, R_health: health status of the responding woman in the last year, R_happy: how happy was the responding woman in the last year, A_edu: level of education of the woman’s partner, A_income: monthly income (in yuan) of the woman’s partner. In the list above the income variables are continuous and the remaining vari- ables are categorical with ordered categories. The income variables are based on (partially) imputed measures. All information, including the partner’s in- come, are derived from a questionnaire answered by the responding woman only. Here, we focus on graphical displays for inspecting the relationship of these health and socioeconomic variables of heterosexual women and their partners. 2.2 Initial Data Analysis According to Chambers et al. (1983), “there is no statistical tool that is as powerful as a well chosen graph”. Certainly, the analysis of most (probably all) data sets should begin with an initial attempt to understand the general characteristics of the data by graphing them in some hopefully useful and in- formative manner. The possible advantages of graphical presentation methods are summarised by Schmid (1954); they include the following • In comparison with other types of presentation, well-designed charts are more eﬀective in creating interest and in appealing to the attention of the reader. • Visual relationships as portrayed by charts and graphs are more easily grasped and more easily remembered. • The use of charts and graphs saves time, since the essential meaning of large measures of statistical data can be visualised at a glance. • Charts and graphs provide a comprehensive picture of a problem that makes © 2010 by Taylor and Francis Group, LLC

28 DATA ANALYSIS USING GRAPHICAL DISPLAYS A_income 500 800 700 700 400 900 300 800 200 600 200 400 500 200 800 500 500 600 0 200 . . . A_edu school school school school school college school school college school school school school school college University school school school school . . . Survey. high high high Elementary high Junior high high Junior high high high high high Junior high high high high Life Senior Senior Junior Junior Junior Senior Senior Junior Junior Junior Senior Senior Junior Junior Junior Family R_happy happy happy happy happy happy happy happy happy happy happy happy happy happy happy happy happy happy happy happy happy . . . and Very too too too too Very Health Somewhat Somewhat Somewhat Somewhat Somewhat Somewhat Not Not Somewhat Not Somewhat Somewhat Somewhat Somewhat Somewhat Somewhat Not Somewhat Chinese Good Fair Good Fair Fair good Good Fair Good good Fair Good Fair Fair good good . . . data. R_health Excellent Not Not Excellent Excellent Not Excellent Not CHFLS R_income 900 500 800 300 300 500 0 100 200 400 300 0 200 300 3000 0 500 0 0 500 . . . 2.2: Table R_edu school school school school school school school school school school school school school school college college school school school school . . . high high high high high high high high high high high high high high Junior Junior high high high high Senior Senior Senior Junior Junior Senior Junior Junior Junior Senior Junior Junior Junior Senior Senior Junior Senior Junior © 2010 by Taylor and Francis Group, LLC 2 3 10 11 22 23 24 25 26 32 33 35 36 37 38 39 40 41 55 56 57

ANALYSIS USING R 29 for a more complete and better balanced understanding than could be de- rived from tabular or textual forms of presentation. • Charts and graphs can bring out hidden facts and relationships and can stimulate, as well as aid, analytical thinking and investigation. Graphs are very popular; it has been estimated that between 900 billion (9 × 11 12 10 ) and 2 trillion (2 × 10 ) images of statistical graphics are printed each year. Perhaps one of the main reasons for such popularity is that graphical presentation of data often provides the vehicle for discovering the unexpected; the human visual system is very powerful in detecting patterns, although the following caveat from the late Carl Sagan (in his book Contact) should be kept in mind: Humans are good at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent. During the last two decades a wide variety of new methods for displaying data graphically have been developed; these will hunt for special eﬀects in data, indicate outliers, identify patterns, diagnose models and generally search for novel and perhaps unexpected phenomena. Large numbers of graphs may be required and computers are generally needed to supply them for the same reasons they are used for numerical analyses, namely that they are fast and they are accurate. So, because the machine is doing the work the question is no longer“shall we plot?” but rather “what shall we plot?” There are many exciting possibilities including dynamic graphics but graphical exploration of data usually begins, at least, with some simpler, well-known methods, for example, histograms, barcharts, boxplots and scatterplots. Each of these will be illustrated in this chapter along with more complex methods such as spinograms and trellis plots. 2.3 Analysis Using R 2.3.1 Malignant Melanoma We might begin to examine the malignant melanoma data in Table 2.1 by con- structing a histogram or boxplot for all the mortality rates in Figure 2.1. The plot, hist and boxplot functions have already been introduced in Chapter 1 and we want to produce a plot where both techniques are applied at once. The layout function organises two independent plots on one plotting device, for example on top of each other. Using this relatively simple technique (more advanced methods will be introduced later) we have to make sure that the x-axis is the same in both graphs. This can be done by computing a plausible range of the data, later to be speciﬁed in a plot via the xlim argument: R> xr <- range(USmelanoma$mortality) * c(0.9, 1.1) R> xr [1] 77.4 251.9 Now, plotting both the histogram and the boxplot requires setting up the plotting device with equal space for two independent plots on top of each other. © 2010 by Taylor and Francis Group, LLC

Pages:

andiny.clock

Statistical analysis with R

Like this book? You can publish your book online for free in a few minutes!

Create your own flipbook

TOP SEARCH

business design fashion music health life sports home marketing children

Statistical analysis with R

Read the Text Version

andiny.clock

TOP SEARCH

RELATED PUBLICATIONS