Cleaning data is an essential part of the data analysis process, as it ensures that the data is accurate, consistent, and ready for analysis. The following is a cheat sheet of common techniques for cleaning data in R:
Check for missing values: Use the is.na() function to check for missing values in a data frame or vector.
Remove duplicates: Use the duplicated() function to flag rows that repeat an earlier row in a data frame, or use the unique() function to return the data frame with those repeated rows dropped.
Convert data types: Use the as.numeric(), as.character(), and as.factor() functions to convert data to numeric, character, or factor data types.
Rename columns: Use the rename() function from the "dplyr" package to rename the columns of a data frame.
Replace values: Use the replace() function to replace certain values in a data frame or vector with other values.
Combine columns: Use the unite() function from the "tidyr" package to combine multiple columns into a single column.
Split columns: Use the separate() function from the "tidyr" package to split a single column into multiple columns.
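As a rough sketch of the base R items in the list above (missing values, duplicates, and replacing values), the snippet below uses a small made-up data frame. The data and the "unknown" placeholder are assumptions for illustration only.

```r
# Hypothetical example data; the values are made up for illustration.
mydata <- data.frame(
  x = c(1, 2, 2, NA, 5),
  y = c("a", "b", "b", "d", "unknown"),
  stringsAsFactors = FALSE
)

# Count missing values per column.
na_counts <- colSums(is.na(mydata))

# duplicated() flags rows that repeat an earlier row; negate it to drop them.
deduped <- mydata[!duplicated(mydata), ]

# unique() returns the same de-duplicated rows in a single call.
deduped_alt <- unique(mydata)

# replace() swaps the values at the positions selected by a logical index.
# Here the placeholder string "unknown" (an assumption) is turned into NA.
deduped$y <- replace(deduped$y, deduped$y == "unknown", NA)
```

Indexing with !duplicated() is useful when you also want to inspect which rows were flagged. unique() is the shorter option when you only need the result.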
Here are three examples of using these techniques to clean data in R:
Removing missing values:
library(dplyr)

mydata %>%
  filter(!is.na(x)) %>%
  filter(!is.na(y))
This keeps only the rows where neither x nor y is NA.
Converting data types:
mydata %>%
  mutate(x = as.numeric(x),
         y = as.character(y))
This makes x numeric and y character.
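One caveat when converting types: calling as.numeric() directly on a factor returns the internal level codes, not the labels. A minimal sketch, using a made-up factor f whose labels look like numbers:

```r
# Hypothetical factor whose labels look like numbers.
f <- factor(c("10", "20", "10", "30"))

# Direct conversion returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Converting to character first recovers the labelled values.
values <- as.numeric(as.character(f))
```

If x is stored as a factor, converting it to character first avoids silently ending up with level codes.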
Renaming and combining columns:
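library(tidyr)

mydata %>%
  rename(x_new = x,
         y_new = y) %>%
  unite(z, x_new, y_new)
This renames x to x_new and y to y_new, then combines them into a single column z. By default, unite() joins the values with an underscore and drops the two input columns.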
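Splitting works the other way around. The following sketch shows separate() from "tidyr" undoing a combined column; the data frame named combined and its values are assumptions for illustration.

```r
library(tidyr)

# Hypothetical data: one column holding two values joined by "_".
combined <- data.frame(z = c("1_a", "2_b", "3_c"), stringsAsFactors = FALSE)

# Split z into columns x and y at the underscore.
# convert = TRUE asks tidyr to re-guess each new column's type.
split_cols <- separate(combined, z, into = c("x", "y"), sep = "_", convert = TRUE)
```

Recent tidyr releases mark separate() as superseded in favor of functions such as separate_wider_delim(), so it may be worth checking the documentation for the version you have installed.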