Thursday, December 8, 2022

Cleaning data in a nutshell


Cleaning data is an essential part of the data analysis process: it helps make sure the data is accurate, consistent, and ready for analysis. The following is a cheat sheet of common techniques for cleaning data in R:

Check for missing values: Use the is.na() function to check for missing values in a data frame or vector.

Remove duplicates: Use the duplicated() function to flag duplicate rows in a data frame. To drop them, either subset with !duplicated() or call unique(), which returns the data frame with duplicate rows removed.

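Convert data types: Use the as.numeric(), as.character(), and as.factor() functions to convert data to numeric, character, or factor data types. Note that as.numeric() on a factor returns the internal level codes, so convert with as.character() first.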

Rename columns: Use the rename() function from the "dplyr" package to rename the columns of a data frame. The syntax is rename(new_name = old_name).

Replace values: Use the replace() function to replace certain values in a data frame or vector with other values.

Combine columns: Use the unite() function from the "tidyr" package to combine multiple columns into a single column.

Split columns: Use the separate() function from the "tidyr" package to split a single column into multiple columns.
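
As a rough sketch, here is how the missing-value, duplicate, and replace items above might look in base R. The data frame, its id and score columns, and the -999 placeholder are made up for illustration.

```r
# Hypothetical toy data: the third row repeats the second, and -999 is an
# assumed placeholder code for "missing".
df <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  score = c(10, 8, 8, -999, NA)
)

# Count missing values per column.
na_counts <- colSums(is.na(df))

# duplicated() flags the second and later copies of a row;
# negating it keeps the first occurrence of each row.
deduped <- df[!duplicated(df), ]

# unique() drops the duplicate rows in one step.
deduped_alt <- unique(df)

# replace() swaps the values at the selected positions.
# %in% is used instead of == so that existing NAs are not selected.
deduped$score <- replace(deduped$score, deduped$score %in% -999, NA)
```

Whether a placeholder such as -999 should become NA depends on the data set, so treat that step as an assumption rather than a rule.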

Here are three examples of using these techniques to clean data in R:


Removing missing values:

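library(dplyr)  # provides %>% and filter()

mydata %>%
  filter(!is.na(x)) %>%
  filter(!is.na(y))

This keeps only the rows where neither x nor y is NA. The result is not saved unless you assign it, for example back to mydata.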
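
For a version that can be run as is, here is a minimal sketch with a made-up mydata (the values are assumptions for illustration). It also shows drop_na() from the "tidyr" package, which is one possible shorthand for the same filtering.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one NA in each column.
mydata <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d")
)

# Keep rows where both x and y are present, and store the result.
cleaned <- mydata %>%
  filter(!is.na(x), !is.na(y))

# drop_na() removes rows that have NA in any of the listed columns.
cleaned_alt <- mydata %>%
  drop_na(x, y)
```

Both versions leave the two complete rows. Passing several conditions to one filter() call combines them with a logical AND, so it matches the two chained filter() calls.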


Converting data types:

mydata %>%
  mutate(x = as.numeric(x),
         y = as.character(y))

This converts x to numeric and y to character.
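
One thing to watch when converting: if a column is a factor, as.numeric() returns the internal level codes, not the numbers shown in the labels. The sketch below uses a made-up factor to show the difference.

```r
# Hypothetical factor whose labels look like numbers.
f <- factor(c("10", "20", "10", "35"))

# Direct conversion gives the level codes (positions in levels(f)).
codes <- as.numeric(f)

# Going through character first recovers the labelled values.
values <- as.numeric(as.character(f))

# Strings that are not numbers become NA (R also emits a warning,
# suppressed here to keep the example quiet).
mixed <- suppressWarnings(as.numeric(c("1", "two", "3")))
```

Here codes is 1 2 1 3, while values is 10 20 10 35, so it is worth checking the class of a column with class() before converting it.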


Renaming and combining columns:

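library(tidyr)  # provides unite(); rename() and %>% come from dplyr

mydata %>%
  rename(x_new = x,
         y_new = y) %>%
  unite(z, x_new, y_new)

This renames x to x_new and y to y_new, then combines them into a single column z. By default, unite() joins the values with an underscore and drops the original columns.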
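
The split technique from the cheat sheet is the reverse step. As a small sketch with made-up data (the values and the "_" separator are assumptions for illustration), separate() can turn the combined column back into two:

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame with two character columns.
mydata <- data.frame(
  x = c("a", "b"),
  y = c("1", "2")
)

# Rename, then combine into z with an explicit separator.
combined <- mydata %>%
  rename(x_new = x, y_new = y) %>%
  unite(z, x_new, y_new, sep = "_")

# Split z back into two columns at the underscore.
restored <- combined %>%
  separate(z, into = c("x_new", "y_new"), sep = "_")
```

This round trip only works cleanly when the separator does not appear inside the values themselves. Also, the restored columns are character, so numbers would need converting again.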



Binomial Distribution in very simple words

The binomial distribution is a probability distribution that describes the outcome of a series of independent "yes/no" experiments...