[Question] pickle cannot serialize > 4GB

Time: Fri Jan 18 00:50:00 2019

First, thank you to everyone reading this; the post may be a bit long. I am a complete Python beginner, so some of my terminology may be imprecise. Apologies in advance if that causes confusion.

The problem, in short:

    OverflowError: cannot serialize a bytes object larger than 4 GiB

************* From the GitHub author, explaining the cause of this problem *************

Hi, this is a common problem and stems from some of the patents having a crazily large amount of text in them. Reduce the size of the sample on which you're running inference. E.g., instead of 20% (0.2), reduce it to 0.05 to start with and try ratcheting it up slowly.

********* Conclusion: the patent data is too large *********
Reference: https://github.com/google/patents-public-data/issues/16

***** How do I split the data? The project stores everything in an object called td. Typing td in Python only prints

    <train_data.LandscapeTrainingDataUtil at 0x1369595c0>

I have no idea how to split it, and I don't even know what it looks like.

----------------- Long post below -----------------

I downloaded a project from GitHub that uses machine learning to find patents related to a given field:
https://github.com/google/patents-public-data/blob/master/models/landscaping/README.md

Following the instructions in LandscapeNotebook.ipynb, the whole workflow ran very smoothly. Then the problem appeared. The notebook is an example with a relatively small sample, as shown here:

    subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = \
        expander.sample_for_inference(td, 0.02)

With the parameter set to 0.02 (a random 2% sample of td), it succeeds. What I want, however, is the result of running on the whole dataset, i.e. 1 (100%). In fact it already fails at 20%. When the parameter is too large, I get the "OverflowError: cannot serialize a bytes object larger than 4 GiB" error.

After searching Google, I found (or thought of) several possible solutions:

1.) Replace the pickle module with sklearn's joblib (failed)

    from sklearn.externals import joblib
    joblib.dump(clf, 'filename.pkl')

Reference: https://stackoverflow.com/questions/48074419/how-to-pickle-files-2-gib-by-splitting-them-into-smaller-fragments

2.) Pass protocol=4 to pickle.dump() (failed, or did I put it in the wrong place?)

expansion.py contains this code:

    pickle.dump(
        (training_data_full_df, seed_patents_df, l1_patents_df, l2_patents_df, anti_seed_patents),
        outfile)

I tried putting protocol=4 in the following places (both failed):

    pickle.dump((training_data_full_df, seed_patents_df, l1_patents_df, l2_patents_df, anti_seed_patents, protocol=4), outfile)

or

    pickle.dump(
        (training_data_full_df, seed_patents_df, l1_patents_df, l2_patents_df, anti_seed_patents),
        outfile, protocol=4)

Reference: https://github.com/stan-dev/pystan/issues/197

3.) multiprocessing (not tried, but I have two questions about this code)

As I understand it, you create a pickle4reducer module like this:

    from multiprocessing.reduction import ForkingPickler, AbstractReducer

    class ForkingPickler4(ForkingPickler):
        def __init__(self, *args):
            # args arrives as a tuple, so copy it to a list before forcing protocol 4
            args = list(args)
            if len(args) > 1:
                args[1] = 4
            else:
                args.append(4)
            super().__init__(*args)

        @classmethod
        def dumps(cls, obj, protocol=4):
            return ForkingPickler.dumps(obj, protocol)

    def dump(obj, file, protocol=4):
        ForkingPickler4(file, protocol).dump(obj)

    class Pickle4Reducer(AbstractReducer):
        ForkingPickler = ForkingPickler4
        register = ForkingPickler4.register
        dump = dump

and put the following code in the "main program":

    import pickle4reducer
    import multiprocessing as mp

    ctx = mp.get_context()
    ctx.reducer = pickle4reducer.Pickle4Reducer()

    with mp.Pool(4) as p:
        # do something

My questions:
a. For this project I assume the "main program" is expansion.py. But where exactly in it should this go?
b. What am I supposed to write in place of "do something" after "with mp.Pool(4) as p:"?

Reference: https://stackoverflow.com/questions/51562221/python-multiprocessing-overflowerrorcannot-serialize-a-bytes-object-larger-t

4. Keep each write below 4GB and write the file in a loop (not tried)

    import pickle
    import os.path

    file_path = "pkl.pkl"
    n_bytes = 2**31
    max_bytes = 2**31 - 1
    data = bytearray(n_bytes)

    ## write: pickle to bytes in memory, then write the bytes in chunks of max_bytes
    bytes_out = pickle.dumps(data)
    with open(file_path, 'wb') as f_out:
        for idx in range(0, len(bytes_out), max_bytes):
            f_out.write(bytes_out[idx:idx+max_bytes])

    ## read: read the file back in chunks of max_bytes, then unpickle
    bytes_in = bytearray(0)
    input_size = os.path.getsize(file_path)
    with open(file_path, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    data2 = pickle.loads(bytes_in)

    assert(data == data2)

Where should I paste this? Do I need to change anything?

Reference: https://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb

5. Start a remote machine on Google Cloud Platform with as much CPU and RAM as it allows, i.e. brute force? I have a feeling that is not the real issue, though. Judging from what was posted in issue 24658, it looks like a bug of unknown origin. Or is this bug something caused by the machine's own computing capacity?
PS: my computer is a Mac Pro / 8 GB RAM / i5 processor.
Reference: https://bugs.python.org/issue24658

6. Anything else???

Thank you all. The post really is a bit long....

--
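On option 2: `protocol` is a keyword argument of `pickle.dump` itself, so it belongs after the file object rather than inside the tuple being pickled, which is the form of the second placement tried above. Protocol 4 was added in Python 3.4 and is not the default on Python 3.5, so it has to be requested explicitly there. The sketch below uses small placeholder values in place of the project's DataFrames and an in-memory buffer in place of the real output file; whether this change alone resolves the error in this project is not confirmed in the thread.

```python
import io
import pickle

# Placeholder stand-ins for the five objects pickled in expansion.py
# (the real ones are DataFrames; these values are invented for illustration).
training_data_full = {"pub_num": ["US-1", "US-2"]}
seed_patents = ["US-1"]
l1_patents = ["US-2"]
l2_patents = []
anti_seed_patents = ["US-9"]

payload = (training_data_full, seed_patents, l1_patents, l2_patents, anti_seed_patents)

# protocol=4 is passed to pickle.dump, after the file object.
outfile = io.BytesIO()
pickle.dump(payload, outfile, protocol=4)

# Read the object back to confirm the round trip.
outfile.seek(0)
restored = pickle.load(outfile)
print(restored == payload)
```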

※ Origin: PTT (ptt.cc), From: 141.23.163.194
※ Article URL: https://webptt.com/cn.aspx?n=bbs/Python/M.1547743809.A.053.html

1^F：push Neisseria: Haven't read it yet, but my first guess is a file system issue 01/18 08:58

2^F：→ magines: I don't quite understand, but thank you anyway ^^ 01/18 09:39

3^F：→ acer1832a: Is your Python a 32-bit or a 64-bit install? 01/18 17:06

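acer, hello: I looked it up online and ran this code:

    >>> import struct
    >>> print(struct.calcsize("P") * 8)

The result is 64, so it is 64-bit. The Python version is 3.5.6 and the processor is a Core i5. Thank you, acer.

acer, hello: at the start of the post I included the author's explanation of the cause, which is that the data in td to be loaded (for training) is too large. But when I type td, Python prints

    <train_data.LandscapeTrainingDataUtil at 0x1369595c0>

This does not look like the DataFrames I am familiar with. How do I view its contents? How do I split it? Thanks.
※ Edited: magines (109.41.192.113), 01/18/2019 17:51:14
※ Edited: magines (109.41.192.113), 01/18/2019 18:09:50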
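A printout like `<train_data.LandscapeTrainingDataUtil at 0x...>` is the default representation of a class instance, so its contents have to be inspected through its attributes. The sketch below shows one generic way to do that with `type()` and `vars()`. `FakeTrainingDataUtil` and its attribute names are invented stand-ins, since the real attributes of `LandscapeTrainingDataUtil` are not shown in this thread.

```python
class FakeTrainingDataUtil:
    """Hypothetical stand-in for td; the attribute names are invented."""

    def __init__(self):
        self.texts = ["abstract one", "abstract two"]
        self.labels = [1, 0]


td = FakeTrainingDataUtil()

# Which class is this object an instance of?
print(type(td))

# vars() returns the instance's attribute dictionary; list each attribute
# with its type and length to see where the data actually lives.
attribute_names = sorted(vars(td))
for name in attribute_names:
    value = getattr(td, name)
    print(name, type(value).__name__, len(value))
```

On the real object, the same loop would need a guard for attributes that have no length.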

4^F：→ benson415: LandscapeTrainingDataUtil is a class :) 01/18 20:28

5^F：→ benson415: The problem isn't only the protocol; you also need to dump by batch 01/18 20:29

6^F：→ benson415: You can use a buffer to hold each batch, then read or write it 01/18 20:31
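One possible reading of "dump by batch" is to pickle the data as several smaller objects written one after another to the same file, so that no single pickle is very large. The sketch below is an interpretation of that suggestion, not code from the project; the helper names `dump_in_batches` and `load_batches` are made up, and an in-memory buffer stands in for the real file.

```python
import io
import pickle


def dump_in_batches(items, outfile, batch_size):
    """Pickle items as consecutive batches into a single file object."""
    for start in range(0, len(items), batch_size):
        pickle.dump(items[start:start + batch_size], outfile, protocol=4)


def load_batches(infile):
    """Yield one batch per pickle.load call until the file is exhausted."""
    while True:
        try:
            yield pickle.load(infile)
        except EOFError:
            return


records = ["patent-%d" % i for i in range(10)]

buffer = io.BytesIO()
dump_in_batches(records, buffer, batch_size=4)

buffer.seek(0)
batches = list(load_batches(buffer))
restored = [record for batch in batches for record in batch]
print(len(batches), restored == records)
```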

Benson, hello: I will search further using the keywords you gave. Thank you!
※ Edited: magines (109.41.192.113), 01/18/2019 20:44:33

7^F：push alen84204: What about splitting the original data (the training sample)? Cut it into 10 parts and run them separately 01/20 01:50
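The "split into 10 parts" idea can be sketched generically as follows. `split_into_parts` is a made-up helper and `run_inference` is a hypothetical placeholder for whatever is run on each part; how the project's own sample would actually be divided is not shown in this thread.

```python
def split_into_parts(items, n_parts):
    """Split a list into n_parts contiguous pieces of near-equal size."""
    size, remainder = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `remainder` parts each get one extra item.
        end = start + size + (1 if i < remainder else 0)
        parts.append(items[start:end])
        start = end
    return parts


def run_inference(part):
    """Hypothetical placeholder for the per-part work."""
    return [len(pub_num) for pub_num in part]


pub_nums = ["US-%04d" % i for i in range(25)]
parts = split_into_parts(pub_nums, 10)

# Process one part at a time and collect the results.
results = []
for part in parts:
    results.extend(run_inference(part))

print([len(part) for part in parts])
```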

alen, hello: the solution I ended up with was based on https://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb , which basically combines the clues from the previous commenters. Thanks.
※ Edited: magines (109.41.3.215), 01/24/2019 01:02:23
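The approach in that Stack Overflow post can be wrapped into a pair of helpers, sketched below. The names `dump_chunked` and `load_chunked` are made up, and combining the chunked writes with `protocol=4` is an assumption about how the clues fit together, since the thread does not show the final code. The demo uses a tiny chunk size so that it runs quickly.

```python
import os
import pickle
import tempfile

MAX_BYTES = 2**31 - 1  # chunk size used in the Stack Overflow answer


def dump_chunked(obj, file_path, max_bytes=MAX_BYTES):
    """Pickle obj in memory, then write the bytes in chunks of max_bytes."""
    data = pickle.dumps(obj, protocol=4)
    with open(file_path, "wb") as f_out:
        for idx in range(0, len(data), max_bytes):
            f_out.write(data[idx:idx + max_bytes])


def load_chunked(file_path, max_bytes=MAX_BYTES):
    """Read the file in chunks of max_bytes, then unpickle the result."""
    buf = bytearray()
    with open(file_path, "rb") as f_in:
        while True:
            chunk = f_in.read(max_bytes)
            if not chunk:
                break
            buf += chunk
    return pickle.loads(buf)


# Demo with a small object and a 64-byte chunk size.
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
original = {"texts": ["a", "b", "c"], "ids": list(range(1000))}
dump_chunked(original, path, max_bytes=64)
restored = load_chunked(path, max_bytes=64)
print(restored == original)
```

In this project, that would presumably mean calling something like `dump_chunked` in place of the `pickle.dump(..., outfile)` call in expansion.py, with `load_chunked` on the reading side; that placement is a guess.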

