Author: ctr1 (【积π】)
Board: DataScience
Title: [Question] Running groupby on a large dataset
Time: Thu Jan 16 16:37:52 2020
Language: Python 3.7
Row count: 27.3 million rows, about 1.5 GB
File format: CSV; a sample of the dataset is shown below
I want to run a groupby:
# Count logins per user per day, then write the result to disk.
df_login_count = df.groupby(['year', 'month', 'day', 'userid'], as_index=False)['count'].count()
df_login_count.to_csv('login_count.csv', index=False)
But the dataset is so large that this takes forever to run.
Could the more experienced folks here suggest a better approach,
or at least point me to some keywords to search for?
Thanks in advance.
year month day time host SessionID user user_id
2019 Mar 27 23:21:16 clftp1 ftpd[5376]: USER fXXex
2019 Mar 27 23:21:16 clftp1 ftpd[5379]: USER umX
2019 Mar 27 23:21:17 clftp1 ftpd[5380]: USER umX
2019 Mar 27 23:21:17 clftp1 ftpd[5383]: USER umX
2019 Mar 27 23:21:18 clftp1 ftpd[5385]: USER umX
2019 Mar 27 23:21:18 clftp1 ftpd[5388]: USER umX
2019 Mar 27 23:21:19 clftp1 ftpd[5389]: USER umX
2019 Mar 27 23:21:19 clftp1 ftpd[5392]: USER umX
2019 Mar 27 23:21:20 clftp1 ftpd[5394]: USER umX
2019 Mar 27 23:21:23 clftp1 ftpd[5402]: USER dXX_ft
2019 Mar 27 23:21:45 clftp1 ftpd[5462]: USER sXXXon
2019 Mar 27 23:21:51 clftp1 ftpd[5476]: USER oXXX_m
2019 Mar 27 23:21:59 clftp1 ftpd[5497]: USER sXXXon
2019 Mar 27 23:22:01 clftp1 ftpd[5503]: USER sXXXon
2019 Mar 27 23:22:02 clftp1 ftpd[5505]: USER sXXXon
2019 Mar 27 23:22:04 clftp1 ftpd[5509]: USER sXXXon
2019 Mar 27 23:22:26 clftp1 ftpd[5559]: USER vtXXXrm
2019 Mar 27 23:22:27 clftp1 ftpd[5563]: USER vtXXXrm
2019 Mar 27 23:22:28 clftp1 ftpd[5568]: USER vtXXXrm
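
For reference, here is the same aggregation written as a sketch with two
common pandas speedups (the file name and dtypes are assumptions, not from
the original run): read only the key columns, store them as category dtype,
and count rows with size() instead of counting a separate column.

import pandas as pd

# Read only the grouping keys; category dtype shrinks the repeated
# string values and speeds up the groupby considerably.
df = pd.read_csv(
    'ftpd_parsed.csv',                        # assumed file name
    usecols=['year', 'month', 'day', 'userid'],
    dtype={'month': 'category', 'userid': 'category'},
)

# size() counts rows per group directly, so no 'count' column is needed;
# observed=True keeps unused category combinations out of the result.
df_login_count = (
    df.groupby(['year', 'month', 'day', 'userid'], observed=True)
      .size()
      .reset_index(name='count')
)
df_login_count.to_csv('login_count.csv', index=False)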
--
※ Posted from: PTT (ptt.cc), from: 114.137.193.101 (Taiwan)
※ Article URL: https://webptt.com/cn.aspx?n=bbs/DataScience/M.1579163874.A.E1C.html
1F:推 ebullient: Try concatenating the time and user id into one string and computing nunique (sketch 1 below) 01/16 20:58
2F:推 drajan: Give modin a try (sketch 2 below) 01/17 00:07
3F:推 CPBLWANG5566: Load it into sqlite or a database and do it there; no 01/18 10:57
4F:→ CPBLWANG5566: need to insist on pandas. 27M rows is small for a database (sketch 3 below) 01/18 10:57
5F:→ youngman77: Without pandas.. in bash: cut -f1,2,3,8|sort|uniq -c 01/18 20:34
6F:→ ctr1: Thanks for the tips above, I'll try modin and a db separately 01/19 01:32
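
Sketch 1, for ebullient's suggestion in 1F: build one string key per row and
aggregate on the single Series. This is a sketch of what the comment seems to
mean, assuming the parsed columns from the post; nunique counts distinct
(day, user) pairs, value_counts counts logins per pair.

# One string key per login, e.g. '2019-Mar-27-umX'.
key = (df['year'].astype(str) + '-' + df['month'].astype(str) + '-'
       + df['day'].astype(str) + '-' + df['userid'].astype(str))

n_pairs = key.nunique()           # distinct (day, user) combinations
login_count = key.value_counts()  # logins per (day, user) key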
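
Sketch 2, for drajan's modin suggestion in 2F: modin parallelizes pandas
across all cores behind the same API, so only the import changes (one install
option is pip install "modin[ray]"; the file name below is an assumption).

# Only the import differs from plain pandas; the rest is unchanged.
import modin.pandas as pd

df = pd.read_csv('ftpd_parsed.csv')  # assumed file name
df_login_count = df.groupby(['year', 'month', 'day', 'userid'],
                            as_index=False)['count'].count()
df_login_count.to_csv('login_count.csv', index=False)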
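
Sketch 3, for the sqlite/database suggestion in 3F-4F: stream the CSV into
SQLite in chunks so the file never has to fit in RAM, then let the database
do the grouping (the file, table, and database names are assumptions).

import sqlite3
import pandas as pd

con = sqlite3.connect('logins.db')

# Stream the CSV into SQLite one million rows at a time.
for chunk in pd.read_csv('ftpd_parsed.csv', chunksize=1_000_000):
    chunk.to_sql('logins', con, if_exists='append', index=False)

# The GROUP BY runs inside SQLite; only the result comes back to pandas.
df_login_count = pd.read_sql(
    'SELECT year, month, day, userid, COUNT(*) AS count '
    'FROM logins GROUP BY year, month, day, userid',
    con,
)
df_login_count.to_csv('login_count.csv', index=False)
con.close()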