作者celestialgod (攸蓝)
看板R_Language
标题Re: [讨论] 每行抓取特定值
时间Mon Jul 20 14:16:38 2015
※ 引述《fifish89 (OMG)》之铭言:
: [问题类型]:
: 想针对每行判别是否有包含某特定值
: [软体熟悉度]:
: 使用者(已经有用R 做过不少作品)
: [问题叙述]:
: 资料格式如下:
: ID D1 D2 D3 D4....D50
: A 123 23 ......... 55
: B 24 005
: C 504 . . 002
: D 002 . . .
: . . . ...
: H . . 002 ...
: . . . ...
: XX 410 . ...
: ============================
: CODE1<-有指定某些数字一
: CODE2<-有指定某些数字二
: .
: .
: .
: CODE15<-有指定某些数字十五
: 这是医院资料,
: 每个人若最多有50个诊断栏位(D1-->D50),
: 而CODE1...CODE15是每个疾病的指定疾病码,
: 想要判断每个人是否有罹患这些疾病。
: EX:
: 假设CODE1=002(中风(stroke)疾病码),
: 我就会去判断每个诊断栏位中(D1-->D50)是否有002这个疾病码,
: ID=C,在D4中有抓到中风码
: ID=D,在D1中有抓到中风码
: ID=H,在D3中有抓到中风码
: 希望输出结果为
: ID stroke ....
: A 0
: B 0
: C 1
: D 1
: . .
: . .
: H 1
: . .
: XX .
程式好读版:
http://pastebin.com/MN73bPPH
library(data.table)
library(tidyr)
library(dplyr)
library(magrittr)
N = 1e5
dat = data.table(ID = paste0("P", 1:N))
for (i in 1:50)
eval(parse(text = paste0("dat %<>% mutate(D", i,
"=paste0(sample(LETTERS, N, TRUE),",
"sample(as.character(1:100), N, TRUE)))")))
# single search
code1 = "D5"
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID) %>%
summarise(stroke = any(code == code1)) %>% distinct()
proc.time() - st
# user system elapsed
# 0.42 0.05 0.47
# multiple search
codes_search = c("D5", "E55", "Z2", "A96")
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID)
for (i in 1:length(codes_search))
eval(parse(text = paste0("a %<>% mutate(codes_search_", i,
"=codes_search[", i, "])")))
b = a %>% group_by(ID) %>% summarise_(.dots = paste0(
"any(code == codes_search_", 1:length(codes_search),")")) %>%
setnames(c("ID", paste0("stroke", 1:length(codes_search))))
proc.time() - st
# user system elapsed
# 2.74 0.26 3.01
## list search
diseases = vector('list', 15)
for (i in 1:length(diseases))
diseases[[i]] = paste0(sample(LETTERS, i, TRUE),
sample(1:100, i, TRUE))
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID)
out_list = llply(1:length(diseases), function(i){
a %>% setkey(ID) %>% group_by(ID) %>%
summarise(stroke = any(code %in% diseases[[i]])) %>%
setnames("stroke", paste0("stroke_", i))
})
out = Reduce(function(x, y) merge(x, y), out_list)
proc.time() - st
# user system elapsed
# 8.88 0.36 9.23
## faster way
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID)
eval(parse(text = paste("a %<>% summarise(", paste0("stroke_",
1:length(diseases) ,"=any(diseases[[", 1:length(diseases),
"]] %in% code)", collapse = ","), ")")))
proc.time() - st
# user system elapsed
# 5.79 0.05 5.85
## the fastest way
st = proc.time()
out2 = vector('list', length(diseases))
for (i in 1:length(diseases))
out2[[i]] = rowMeans(sweep(as.matrix(dat[,2:51, with = FALSE]), 2,
1:(ncol(dat)-1), function(x, y){matrix(x %in% diseases[[i]], nrow(dat))})) > 0
out2 = do.call(cbind, out2) %>% data.table %>% cbind(dat$ID, .) %>%
setnames(c("ID", paste0("stroke_", 1:length(diseases)))) %>% tbl_dt(FALSE)
proc.time() - st
# user system elapsed
# 3.20 0.52 3.71
all.equal(out2, a) # TRUE## the fastest way
关於multiple search部分,我稍微说明一下
发现summarise_里面lazyeval部分的environment设定在data内部
可能是避免一些问题才这样设计
所以没办法直接用.dots把global variables塞进去跑
只好先手动mutate进去之後
再一次summarise
--
※ 发信站: 批踢踢实业坊(ptt.cc), 来自: 123.205.27.107
※ 文章网址: https://webptt.com/cn.aspx?n=bbs/R_Language/M.1437373001.A.DE6.html
1F:→ andrew43: C大有空可以为大家介绍这几套你常用的资料整理套件技术 07/20 22:32
2F:→ andrew43: 这些东西非常好用,但没接触的朋友应该都看不懂。 07/20 22:33
3F:→ andrew43: 像我也是边看边猜,但目前用不到就没认真吸收。XD 07/20 22:35
4F:→ celestialgod: 有空来写dplyr跟tidyr的详细用法 07/20 23:07
5F:推 fifish89: 感谢楼上各位高手,也努力学习dplyr技术中... 07/21 09:55
※ 编辑: celestialgod (123.205.27.107), 07/22/2015 14:46:55