Important Announcement
PubHTML5 Scheduled Server Maintenance on (GMT) Sunday, June 26th, 2:00 am - 8:00 am.
PubHTML5 site will be inoperative during the times indicated!

Home Explore data-wrangling-cheatsheet

data-wrangling-cheatsheet

Published by karuna.chemiz, 2023-02-09 13:57:27

Description: data-wrangling-cheatsheet

Search

Read the Text Version

Data Wrangling Tidy Data - A foundation for wrangling in R with dplyr and tidyr F MA F MA Tidy data complements R’s vectorized M* A F operations. R will automatically preserve M*A Cheat Sheet In a tidy & observations as you manipulate variables. data set: No other format works as intuitively with R. Syntax - Helpful conventions for wrangling Each variable is saved Each observation is in its own column saved in its own row Reshaping Data - Change the layout of a data set dplyr::tbl_df(iris) dplyr::data_frame(a = 1:3, b = 4:6) Converts data to tbl class. tbl’s are easier to examine than 111wAwwAA000www101503tidyr::gather(cases, \"year\", \"n\", 2:4) wwww (optimized). data frames. R displays only the data that fits onscreen: 1A010Gather columns into rows. Combine vectors into data frame 14111w4510w050p00149711w45100w5p00097tidyr::separate(storms, date, c(\"y\", \"m\", \"d\")) ww dplyr::arrange(mtcars, mpg) 111AAA000101503tidyr::spread(pollution, size, amount) Order rows by values of a column Source: local data frame [150 x 5] Separate one column into several. 1A010Spread rows into columns. (low to high). dplyr::arrange(mtcars, desc(mpg)) Sepal.Length Sepal.Width Petal.Length 1 5.1 3.5 1.4 2 4.9 3.0 1.4 Order rows by values of a column 3 4.7 3.2 1.3 4 4.6 3.1 1.5 1411w45100w5p0007914111w4510w050p0079tidyr::unite(data, col, ..., sep) (high to low). 5 5.0 3.6 1.4 Unite several columns into one. dplyr::rename(tb, y = year) .. ... ... ... Variables not shown: Petal.Width (dbl), Rename the columns of a data Species (fctr) frame. dplyr::glimpse(iris) Subset Observations (Rows) Subset Variables (Columns) Information dense summary of tbl data. utils::View(iris) 11w10w10w10w10w0 wwww 1141100w500p1004p00197w5790dplyr::select(iris, Sepal.Width, Petal.Length, Species) View data set in spreadsheet-like display (note capital V). dplyr::filter(iris, Sepal.Length > 7) Select columns by name or helper function. dplyr::%>% Extract rows that meet logical criteria. Passes object on left hand side as first argument (or . Helper functions for select - ?select argument) of function on righthand side. dplyr::distinct(iris) x %>% f(y) is the same as f(x, y) Remove duplicate rows. select(iris, contains(\".\")) y %>% f(x, ., z) is the same as f(x, y, z ) Select columns whose name contains a character string. dplyr::sample_frac(iris, 0.5, replace = TRUE) \"Piping\" with %>% makes code more readable, e.g. Randomly select fraction of rows. select(iris, ends_with(\"Length\")) Select columns whose name ends with a character string. iris %>% dplyr::sample_n(iris, 10, replace = TRUE) group_by(Species) %>% Randomly select n rows. select(iris, everything()) summarise(avg = mean(Sepal.Width)) %>% Select every column. arrange(avg) dplyr::slice(iris, 10:15) Select rows by position. select(iris, matches(\".t.\")) Select columns whose name matches a regular expression. dplyr::top_n(storms, 2, date) Select and order top n entries (by group if grouped data). select(iris, num_range(\"x\", 1:5)) Select columns named x1, x2, x3, x4, x5. Logic in R - ?Comparison, ?base::Logic select(iris, one_of(c(\"Species\", \"Genus\"))) < Less than != Not equal to Select columns whose names are in a group of names. > Greater than %in% Group membership select(iris, starts_with(\"Sepal\")) Select columns whose name starts with a character string. == Equal to is.na Is NA select(iris, Sepal.Length:Petal.Width) <= Less than or equal to !is.na Is not NA Select all columns between Sepal.Length and Petal.Width (inclusive). >= Greater than or equal to &,|,!,xor,any,all Boolean operators select(iris, -Species) Select all columns except Species. RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com devtools::install_github(\"rstudio/EDAWR\") for data sets Learn more with browseVignettes(package = c(\"dplyr\", \"tidyr\")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15

Summarise Data Make New Variables Combine Data Sets dplyr::summarise(iris, avg = mean(Sepal.Length)) dplyr::mutate(iris, sepal = Sepal.Length + Sepal. Width) a b = Summarise data into single row of values. Compute and append one or more new columns. x1 x2 x1 x3 dplyr::summarise_each(iris, funs(mean)) dplyr::mutate_each(iris, funs(min_rank)) A1 Apply summary function to each column. Apply window function to each column. B2 + AT C3 BF dplyr::count(iris, Species, wt = Sepal.Length) dplyr::transmute(iris, sepal = Sepal.Length + Sepal. Width) DT Count number of rows with each unique value of Compute one or more new columns. Drop original columns. Mutating Joins variable (with or without weights). window x1 x2 x3 dplyr::left_join(a, b, by = \"x1\") summary function A1 T Join matching rows from b to a. function B2 F Mutate uses window functions, functions that take a vector of C 3 NA dplyr::right_join(a, b, by = \"x1\") Summarise uses summary functions, functions that values and return another vector of values, such as: Join matching rows from a to b. take a vector of values and return a single value, such as: x1 x3 x2 AT 1 dplyr::inner_join(a, b, by = \"x1\") BF 2 Join data. Retain only rows in both sets. D T NA x1 x2 x3 A1 T B2 F x1 x2 x3 dplyr::full_join(a, b, by = \"x1\") A1 T Join data. Retain all values, all rows. B2 F C 3 NA D NA T dplyr::first min dplyr::lead dplyr::cumall Filtering Joins First value of a vector. Minimum value in a vector. Copy with values shifted by 1. Cumulative all x1 x2 dplyr::semi_join(a, b, by = \"x1\") dplyr::last max dplyr::lag dplyr::cumany A1 All rows in a that have a match in b. Last value of a vector. Maximum value in a vector. Copy with values lagged by 1. Cumulative any B2 dplyr::nth mean dplyr::dense_rank dplyr::cummean x1 x2 dplyr::anti_join(a, b, by = \"x1\") Nth value of a vector. Mean value of a vector. Ranks with no gaps. Cumulative mean C3 All rows in a that do not have a match in b. dplyr::n median dplyr::min_rank cumsum y z # of values in a vector. Median value of a vector. Ranks. Ties get min rank. Cumulative sum x1 x2 x1 x2 = dplyr::n_distinct var dplyr::percent_rank cummax A 1 # of distinct values in Variance of a vector. Ranks rescaled to [0, 1]. Cumulative max B 2 + B2 a vector. C 3 C3 sd dplyr::row_number cummin D4 IQR Standard deviation of a Ranks. Ties got to first value. Cumulative min Set Operations IQR of a vector. vector. dplyr::ntile cumprod x1 x2 dplyr::intersect(y, z) Group Data Bin vector into n buckets. Cumulative prod B2 Rows that appear in both y and z. C3 dplyr::group_by(iris, Species) dplyr::between pmax dplyr::union(y, z) Group data into rows with the same value of Species. Are values between a and b? Element-wise max x1 x2 Rows that appear in either or both y and z. A1 dplyr::ungroup(iris) dplyr::cume_dist pmin B2 dplyr::setdiff(y, z) Remove grouping information from data frame. Cumulative distribution. Element-wise min C3 Rows that appear in y but not z. D4 x1 x2 A1 Binding iris %>% group_by(Species) %>% summarise(…) iris %>% group_by(Species) %>% mutate(…) x1 x2 dplyr::bind_rows(y, z) Compute separate summary row for each group. Compute new variables by group. A1 Append z to y as new rows. B2 ir Cir C3 dplyr::bind_cols(y, z) B2 Append z to y as new columns. C3 Caution: matches rows by position. D4 x1 x2 x1 x2 A1B2 B2C3 C3D4 RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com devtools::install_github(\"rstudio/EDAWR\") for data sets Learn more with browseVignettes(package = c(\"dplyr\", \"tidyr\")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15


Like this book? You can publish your book online for free in a few minutes!
Create your own flipbook