本文實例講述了Python聚類算法之凝聚層次聚類。分享給大家供大家參考,具體如下:
凝聚層次聚類:所謂凝聚的,指的是該算法初始時,將每個點作為一個簇,每一步合并兩個最接近的簇。另外即使到最后,對于噪音點或是離群點也往往還是各占一簇的,除非過度合并。對于這里的“最接近”,有下面三種定義。我在實現是使用了MIN,該方法在合并時,只要依次取當前最近的點對,如果這個點對當前不在一個簇中,將所在的兩個簇合并就行:
單鏈(MIN):定義簇的鄰近度為不同兩個簇的兩個最近的點之間的距離。
全鏈(MAX):定義簇的鄰近度為不同兩個簇的兩個最遠的點之間的距離。
組平均:定義簇的鄰近度為取自兩個不同簇的所有點對鄰近度的平均值。
- # scoding=utf-8
- # Agglomerative Hierarchical Clustering(AHC)
- import pylab as pl
- from operator import itemgetter
- from collections import OrderedDict,Counter
- points = [[int(eachpoint.split('#')[0]), int(eachpoint.split('#')[1])] for eachpoint in open("points","r")]
- # 初始時每個點指派為單獨一簇
- groups = [idx for idx in range(len(points))]
- # 計算每個點對之間的距離
- disP2P = {}
- for idx1,point1 in enumerate(points):
- for idx2,point2 in enumerate(points):
- if (idx1 < idx2):
- distance = pow(abs(point1[0]-point2[0]),2) + pow(abs(point1[1]-point2[1]),2)
- disP2P[str(idx1)+"#"+str(idx2)] = distance
- # 按距離降序將各個點對排序
- disP2P = OrderedDict(sorted(disP2P.iteritems(), key=itemgetter(1), reverse=True))
- # 當前有的簇個數
- groupNum = len(groups)
- # 過分合并會帶入噪音點的影響,當簇數減為finalGroupNum時,停止合并
- finalGroupNum = int(groupNum*0.1)
- while groupNum > finalGroupNum:
- # 選取下一個距離最近的點對
- twopoins,distance = disP2P.popitem()
- pointA = int(twopoins.split('#')[0])
- pointB = int(twopoins.split('#')[1])
- pointAGroup = groups[pointA]
- pointBGroup = groups[pointB]
- # 當前距離最近兩點若不在同一簇中,將點B所在的簇中的所有點合并到點A所在的簇中,此時當前簇數減1
- if(pointAGroup != pointBGroup):
- for idx in range(len(groups)):
- if groups[idx] == pointBGroup:
- groups[idx] = pointAGroup
- groupNum -= 1
- # 選取規模最大的3個簇,其他簇歸為噪音點
- wantGroupNum = 3
- finalGroup = Counter(groups).most_common(wantGroupNum)
- finalGroup = [onecount[0] for onecount in finalGroup]
- dropPoints = [points[idx] for idx in range(len(points)) if groups[idx] not in finalGroup]
- # 打印規模最大的3個簇中的點
- group1 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalGroup[0]]
- group2 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalGroup[1]]
- group3 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalGroup[2]]
- pl.plot([eachpoint[0] for eachpoint in group1], [eachpoint[1] for eachpoint in group1], 'or')
- pl.plot([eachpoint[0] for eachpoint in group2], [eachpoint[1] for eachpoint in group2], 'oy')
- pl.plot([eachpoint[0] for eachpoint in group3], [eachpoint[1] for eachpoint in group3], 'og')
- # 打印噪音點,黑色
- pl.plot([eachpoint[0] for eachpoint in dropPoints], [eachpoint[1] for eachpoint in dropPoints], 'ok')
- pl.show()
運行效果截圖如下:
希望本文所述對大家Python程序設計有所幫助。
新聞熱點
疑難解答