作者RockLee (Now of all times)
看板Prob_Solve
标题[问题] Canopy-Clustering using Map-Reduce
时间Thu Sep 20 14:44:31 2012
小弟最近刚开始看一些 Map-Reduce 相关的东西.
在 Google 的 Cluster Computing and MapReduce 的 Lecture 4 中提到的应用是,
将 Canopy-Clustering 用 Map-Reduce 来做分散运算.
不过 Lecture 4 讲得很简略, 後来找到 K-Means Clustering in Map Reduce 这个网页,
有 Canopy-Clustering using Map-Reduce 演算法的 pseudo-code.
不过小弟对该网页的 pseudo-code 有两点疑问:
(1) Reduce Part of Round 1 检查为何是 if dist(canopy, finalCanopy) < 0.5 * T2?
这样似乎违反一个 canopy center 范围 T2 内不会再有另一个 canopy center 的描述.
是否应像小弟改的那样, 改为 if dist(canopy, finalCanopy) < T2,
然後距离过近的话, 继续检查 listOfCanopies 的下一个.
(2) 要做分散运算的话, 是否应该满足 T1 > 2 * T2 的条件,
才能保证任何一个 data point,
至少被一个 Round 1 找出来的 canopy center 范围 T1 包含?
相关连结:
(1) Google: Cluster Computing and MapReduce
http://code.google.com/intl/en/edu/submissions/mapreduce-minilecture/listing.html
(2) K-Means Clustering in Map Reduce
http://horicky.blogspot.tw/2011/04/k-means-clustering-in-map-reduce.html
(3) Canopy-Clustering Paper
http://www.kamalnigam.com/papers/canopy-kdd00.pdf
Paper 中对 Canopy Creation 的描述:
Start with a list of the data points in any order,
and with two distance thresholds, T1 and T2, where T1 > T2.
Pick a point off the list and approximately measure its distance to all other
points.
Put all points that are within distance threshold T1 into a canopy.
Remove from the list all points that are within distance threshold T2.
Repeat until the list is empty.
/* Reduce Part of Round 1: Identify Canopies, modified by me */
finalCanopyList = {}
reduce(key, listOfCanopies)
{
scanCanopy:
for canopy in listOfCanopies
{
for finalCanopy in finalCanopyList
{
if dist(canopy, finalCanopy) < T2
{
continue scanCanopy
}
}
finalCanopyList.add(canopy)
}
writeToS3(finalCanopyList)
}
--
※ 发信站: 批踢踢实业坊(ptt.cc)
◆ From: 59.126.30.95
※ 编辑: RockLee 来自: 59.126.30.95 (09/20 14:49)