Re: [讨论] 每行抓取特定值

时间Mon Jul 20 14:16:38 2015

※ 引述《fifish89 (OMG)》之铭言： : [问题类型]: : 想针对每行判别是否有包含某特定值 : [软体熟悉度]: : 使用者(已经有用R 做过不少作品) : [问题叙述]: : 资料格式如下: : ID D1 D2 D3 D4....D50 : A 123 23 ......... 55 : B 24 005 : C 504 . . 002 : D 002 . . . : . . . ... : H . . 002 ... : . . . ... : XX 410 . ... : ============================ : CODE1<-有指定某些数字一 : CODE2<-有指定某些数字二 : . : . : . : CODE15<-有指定某些数字十五 : 这是医院资料， : 每个人若最多有50个诊断栏位(D1-->D50)， : 而CODE1...CODE15是每个疾病的指定疾病码， : 想要判断每个人是否有罹患这些疾病。 : EX: : 假设CODE1=002(中风(stroke)疾病码)， : 我就会去判断每个诊断栏位中(D1-->D50)是否有002这个疾病码， : ID=C,在D4中有抓到中风码 : ID=D,在D1中有抓到中风码 : ID=H,在D3中有抓到中风码 : 希望输出结果为 : ID stroke .... : A 0 : B 0 : C 1 : D 1 : . . : . . : H 1 : . . : XX . 程式好读版：http://pastebin.com/MN73bPPH library(data.table) library(tidyr) library(dplyr) library(magrittr) N = 1e5 dat = data.table(ID = paste0("P", 1:N)) for (i in 1:50) eval(parse(text = paste0("dat %<>% mutate(D", i, "=paste0(sample(LETTERS, N, TRUE),", "sample(as.character(1:100), N, TRUE)))"))) # single search code1 = "D5" st = proc.time() a = dat %>% gather(D, code, -ID) %>% group_by(ID) %>% summarise(stroke = any(code == code1)) %>% distinct() proc.time() - st # user system elapsed # 0.42 0.05 0.47 # multiple search codes_search = c("D5", "E55", "Z2", "A96") st = proc.time() a = dat %>% gather(D, code, -ID) %>% group_by(ID) for (i in 1:length(codes_search)) eval(parse(text = paste0("a %<>% mutate(codes_search_", i, "=codes_search[", i, "])"))) b = a %>% group_by(ID) %>% summarise_(.dots = paste0( "any(code == codes_search_", 1:length(codes_search),")")) %>% setnames(c("ID", paste0("stroke", 1:length(codes_search)))) proc.time() - st # user system elapsed # 2.74 0.26 3.01 ## list search diseases = vector('list', 15) for (i in 1:length(diseases)) diseases[[i]] = paste0(sample(LETTERS, i, TRUE), sample(1:100, i, TRUE)) st = proc.time() a = dat %>% gather(D, code, -ID) %>% group_by(ID) out_list = llply(1:length(diseases), function(i){ a %>% setkey(ID) %>% group_by(ID) %>% summarise(stroke = any(code %in% diseases[[i]])) %>% setnames("stroke", paste0("stroke_", i)) }) out = Reduce(function(x, y) merge(x, y), out_list) proc.time() - st # user system elapsed # 8.88 0.36 9.23 ## faster way st = proc.time() a = dat %>% gather(D, code, -ID) %>% group_by(ID) eval(parse(text = paste("a %<>% summarise(", paste0("stroke_", 1:length(diseases) ,"=any(diseases[[", 1:length(diseases), "]] %in% code)", collapse = ","), ")"))) proc.time() - st # user system elapsed # 5.79 0.05 5.85 ## the fastest way st = proc.time() out2 = vector('list', length(diseases)) for (i in 1:length(diseases)) out2[[i]] = rowMeans(sweep(as.matrix(dat[,2:51, with = FALSE]), 2, 1:(ncol(dat)-1), function(x, y){matrix(x %in% diseases[[i]], nrow(dat))})) > 0 out2 = do.call(cbind, out2) %>% data.table %>% cbind(dat$ID, .) %>% setnames(c("ID", paste0("stroke_", 1:length(diseases)))) %>% tbl_dt(FALSE) proc.time() - st # user system elapsed # 3.20 0.52 3.71 all.equal(out2, a) # TRUE## the fastest way 关於multiple search部分，我稍微说明一下发现summarise_里面lazyeval部分的environment设定在data内部可能是避免一些问题才这样设计所以没办法直接用.dots把global variables塞进去跑只好先手动mutate进去之後再一次summarise --

※ 发信站: 批踢踢实业坊(ptt.cc), 来自: 123.205.27.107 ※ 文章网址: https://webptt.com/cn.aspx?n=bbs/R_Language/M.1437373001.A.DE6.html

1^F：→ andrew43: C大有空可以为大家介绍这几套你常用的资料整理套件技术 07/20 22:32

2^F：→ andrew43: 这些东西非常好用，但没接触的朋友应该都看不懂。 07/20 22:33

3^F：→ andrew43: 像我也是边看边猜，但目前用不到就没认真吸收。XD 07/20 22:35

4^F：→ celestialgod: 有空来写dplyr跟tidyr的详细用法 07/20 23:07

5^F：推 fifish89: 感谢楼上各位高手，也努力学习dplyr技术中... 07/21 09:55

※ 编辑: celestialgod (123.205.27.107), 07/22/2015 14:46:55

	[问题/行为] 猫晚上进房间会不会有憋尿问题
	Re: [闲聊] 选了错误的女孩成为魔法少女 XDDDDDDDDDD
	[正妹] 瑞典一张
	[心得] EMS高领长版毛衣.墨小楼MC1002
	[分享] 丹龙隔热纸GE55+33+22
	[问题] 清洗洗衣机
	[寻物] 窗台下的空间
	[闲聊] 双极の女神1 木魔爵
	[售车] 新竹 1997 march 1297cc 白色四门
	[讨论] 能从照片感受到摄影者心情吗
	[狂贺] 贺贺贺贺贺！岛村卯月！总选举NO.1
	[难过] 羡慕白皮肤的女生
	阅读文章
	[黑特]
	[问题] SBK S1安装於安全帽位置
	[分享] 旧woo100绝版开箱!!
	Re: [无言] 关於小包卫生纸
	[开箱] E5-2683V3 RX480Strix 快睿C1 简单测试
	[心得] 苍の海贼龙地狱执行者16PT
	[售车] 1999年Virage iO 1.8EXi
	[心得] 挑战33 LV10 狮子座pt solo
	[闲聊] 手把手教你不被桶之新手主购教学
	[分享] Civic Type R 量产版官方照无预警流出
	[售车] Golf 4 2.0 银色自排
	[出售] Graco提篮汽座（有底座）2000元诚可议
	[问题] 请问补牙材质掉了还能再补吗?(台中半年内
	[问题] 44th 单曲生写竟然都给重复的啊啊！
	[心得] 华南红卡/icash 核卡
	[问题] 拔牙矫正这样正常吗
	[赠送] 老莫高业初业 102年版
	[情报] 三大行动支付本季掀战火
	[宝宝] 博客来Amos水蜡笔5/1特价五折
	Re: [心得] 新鲜人一些面试分享
	[心得] 苍の海贼龙地狱麒麟25PT
	Re: [闲聊] (君の名は。雷慎入) 君名二创漫画翻译
	Re: [闲聊] OGN中场影片：失踪人口局 (英文字幕)
	[问题] 台湾大哥大4G讯号差
	[出售] [全国]全新千寻侘草LED灯, 水草

WEB批踢踢(PTT)

R_Language 板

Re: [讨论] 每行抓取特定值

热门看板

赞助商连结