Kai's Study Note in San Francisco: 8.4 Compare two dataframes + Only focus on the differences

Tuesday, August 4, 2015

8.4 Compare two dataframes + Only focus on the differences

# Find elements that are not identical in two dataframes
# Useful link: http://stackoverflow.com/questions/3171426/compare-two-data-frames-to-find-the-rows-in-data-frame-1-that-are-not-present-in

library(xlsx)     # not needed for the examples below
library(proto)    # dependency of sqldf
library(sqldf)    # SQL queries on data frames
library(compare)  # compare()

x <- c(1,2,3,4,5)
y1 <- c(1,2,3,4,5)
y2 <- c(1,2,3,4,6)

d1 <- data.frame(x, y1)
d2 <- data.frame(x, y2)

compare(d1, d2)  # reports FALSE here because the last values differ (5 vs 6)

# SQLDF solution
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

# Rows in a1 that are not in a2
a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

# Rows in a1 that are also in a2
a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

# DPLYR solution
library(dplyr)

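# Rows of a1 with no match in a2
anti_join(a1, a2, by = c("a", "b"))
# Rows of a1 that have a match in a2
semi_join(a1, a2, by = c("a", "b"))
# All rows from both data frames
full_join(a1, a2, by = c("a", "b"))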

(8.6 update) To focus only on the columns that actually differ, instead of scrolling past a bunch of identical ones (age, education, and other fields you expect to match), the trick is to add an extra row below the original dataframe holding one logical value per column, indicating whether that column is identical in the two dataframes. The identical() function in R can do that. After that, it is just a matter of dataframe subsetting.
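
Here is a minimal sketch of that idea, under a few assumptions. The data frames `old` and `new` and their columns are made up for illustration. The flags are kept in a separate logical vector (`same`) rather than appended as an extra row, because binding a logical row onto the data would coerce the column types.

```r
# Hypothetical example data: two versions of the same table.
old <- data.frame(id = 1:3,
                  age = c(30, 41, 25),
                  edu = c("BS", "MS", "BS"),
                  score = c(80, 75, 90),
                  stringsAsFactors = FALSE)
new <- old
new$score[2] <- 78  # introduce a single difference

# One logical per column: TRUE when the column is identical in both frames.
# Assumes both data frames have the same columns in the same order.
same <- mapply(identical, old, new)

# Names of the columns that differ.
diff_cols <- names(same)[!same]

# Subset both frames to the differing columns and view them side by side.
diff_view <- cbind(
  setNames(old[diff_cols], paste0(diff_cols, "_old")),
  setNames(new[diff_cols], paste0(diff_cols, "_new"))
)
diff_view
```

Note that identical() is strict: a column stored as integer and the same values stored as double count as different, so it may help to check column types first if a column is flagged unexpectedly.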

Pretty cool, huh?
