Author: dragon0139 (豆子)
Board: Statistics
Title: [Question] How to determine the optimal number of clusters
Time: Fri Jun 25 12:36:19 2021
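Regarding cluster analysis: I need to divide my data into several clusters based on scores on 6 variables.
For now I am running a two-stage approach in SPSS.
I have already produced a dendrogram using Ward's method with Euclidean distance, and it suggests that 4 to 5 clusters would be appropriate.
But before running k-means, how do I decide whether 4 or 5 clusters is best?
(The paper I am following tried 4 to 9 clusters and concluded that 6 was best, but I don't understand where the 6 came from!)
The paper's approach is to compute Cohen's kappa.
Does that mean computing it between the two cluster assignments obtained from the initial cluster centers and the final cluster centers? (Or have I misread the paper?)
I'm confused about why it can be done this way.
In general, how does one decide on the optimal number of clusters?
Addendum --
The paper describes it like this: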
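For readers who want to reproduce the hierarchical first stage outside SPSS, here is a minimal sketch in Python using SciPy. It assumes the data sit in a NumPy array X with one row per case and one column per variable; the function names ward_summary and flat_labels are made up for illustration. The merge heights it returns are only a rough numerical way of reading the dendrogram: a large jump between the heights for k and k + 1 suggests that going below k clusters merges groups that are far apart.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ward_summary(X, max_k=9):
    """Ward linkage on Euclidean distances, plus heights of the last merges."""
    Z = linkage(X, method="ward")
    # Z[-(k - 1), 2] is the height of the merge that reduces k clusters to k - 1.
    heights = {k: Z[-(k - 1), 2] for k in range(2, max_k + 1)}
    return Z, heights


def flat_labels(Z, k):
    """Flat cluster labels (1..k) obtained by cutting the tree into k clusters."""
    return fcluster(Z, t=k, criterion="maxclust")
```

The cluster means computed from flat_labels(Z, k) could then serve as initial centers for k-means, which is one common way to run the second stage.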
This procedure begins by randomly assigning the sample into two groups.
The cluster centers of each group from the first step are used as
initial cluster centers for a series of k-means analyses that
assign participants to clusters ranging from four to nine.
Then, another set of k-means analyses are computed for each group,
but in this case, the cluster centers from the opposite group are used to
assign participants to the clusters.
The two sets of k-means yield two sets of cluster assignments per group.
The sets within a group are then compared via Cohen’s kappa (Cohen, 1960)
to determine the reliability of cluster assignment,
or in other words, the degree to which participants in each group
are assigned to the same cluster given different initial cluster centers.
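As I read the quoted passage, kappa compares, within each half, the assignment from that half's own centers with the assignment from the opposite half's centers (not initial vs. final centers). Below is a rough sketch of that idea in Python with scikit-learn, not the paper's exact procedure. Assumptions on my part: cases are assigned to the opposite half's centers by nearest center (KMeans.predict) instead of re-running k-means from them, k-means uses its default initialization rather than centers from a hierarchical step, and cluster labels are matched with the Hungarian algorithm before kappa is computed, because label numbers are arbitrary. The names aligned_kappa and split_half_kappa are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import cohen_kappa_score, confusion_matrix


def aligned_kappa(own, cross, k):
    """Cohen's kappa between two labelings after matching cluster labels."""
    table = confusion_matrix(own, cross, labels=np.arange(k))
    # Hungarian matching: pair labels so that the total overlap is maximal.
    rows, cols = linear_sum_assignment(-table)
    mapping = np.empty(k, dtype=int)
    mapping[cols] = rows  # relabel 'cross' clusters in terms of 'own' labels
    return cohen_kappa_score(own, mapping[cross])


def split_half_kappa(X, k_values=range(4, 10), seed=0):
    """Mean split-half kappa for each candidate number of clusters."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half_a, half_b = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
    results = {}
    for k in k_values:
        km_a = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(half_a)
        km_b = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(half_b)
        # Own-center labels vs. labels from the opposite half's centers.
        kappa_a = aligned_kappa(km_a.labels_, km_b.predict(half_a), k)
        kappa_b = aligned_kappa(km_b.labels_, km_a.predict(half_b), k)
        results[k] = (kappa_a + kappa_b) / 2
    return results
```

Calling split_half_kappa(X) on a cases-by-variables array returns one averaged kappa per candidate k from 4 to 9; the k with the highest agreement would be the most replicable solution under this criterion, which is presumably how a value such as 6 gets picked.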
Thanks for any pointers!
--
※ Origin: 批踢踢實業坊 (ptt.cc), from: 118.232.16.73 (Taiwan)
※ Article URL: https://webptt.com/m.aspx?n=bbs/Statistics/M.1624595784.A.B2D.html
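1F:→ andrew43: There is no absolute answer. There are many different indices to begin with. 06/25 13:06
2F:→ andrew43: As for the method in this paper, it looks like the sample is randomly split into two groups and k-means is run on each, 06/25 13:12
3F:→ andrew43: and then you see which number of clusters gives more consistent results between the two k-means runs. 06/25 13:12
4F:→ andrew43: For a given number of clusters, the centers from the k-means done first are deliberately forced to be the centers of the one done later. 06/25 13:16
5F:→ andrew43: "First" and "later" correspond to the two groups the sample was randomly split into. 06/25 13:18
6F:→ dragon0139: OK, thank you for your help! 06/25 17:39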