Python聚類算法之DBSACN實例分析

2020-01-04 17:56:04

字體：大中小

來源：轉載

供稿：網友

這篇文章主要介紹了Python聚類算法之DBSACN,結合實例形式詳細分析了DBSACN算法的原理與具體實現技巧,具有一定參考借鑒價值,需要的朋友可以參考下

本文實例講述了Python聚類算法之DBSACN。分享給大家供大家參考，具體如下：

DBSCAN：

是一種簡單的，基于密度的聚類算法。本次實現中，DBSCAN使用了基于中心的方法。在基于中心的方法中，每個數據點的密度通過對以該點為中心以邊長為2*EPs的網格(鄰域)內的其他數據點的個數來度量。根據數據點的密度分為三類點:

核心點：該點在鄰域內的密度超過給定的閥值MinPs。

邊界點：該點不是核心點，但是其鄰域內包含至少一個核心點。

噪音點：不是核心點，也不是邊界點。

有了以上對數據點的劃分，聚合可以這樣進行：各個核心點與其鄰域內的所有核心點放在同一個簇中，把邊界點跟其鄰域內的某個核心點放在同一個簇中。

# scoding=utf-8

import pylab as pl

from collections import defaultdict,Counter

points = [[int(eachpoint.split("#")[0]), int(eachpoint.split("#")[1])] for eachpoint in open("points","r")]

# 計算每個數據點相鄰的數據點，鄰域定義為以該點為中心以邊長為2*EPs的網格

Eps = 10

surroundPoints = defaultdict(list)

for idx1,point1 in enumerate(points):

for idx2,point2 in enumerate(points):

if (idx1 < idx2):

if(abs(point1[0]-point2[0])<=Eps and abs(point1[1]-point2[1])<=Eps):

surroundPoints[idx1].append(idx2)

surroundPoints[idx2].append(idx1)

# 定義鄰域內相鄰的數據點的個數大于4的為核心點

MinPts = 5

corePointIdx = [pointIdx for pointIdx,surPointIdxs in surroundPoints.iteritems() if len(surPointIdxs)>=MinPts]

# 鄰域內包含某個核心點的非核心點，定義為邊界點

borderPointIdx = []

for pointIdx,surPointIdxs in surroundPoints.iteritems():

if (pointIdx not in corePointIdx):

for onesurPointIdx in surPointIdxs:

if onesurPointIdx in corePointIdx:

borderPointIdx.append(pointIdx)

break

# 噪音點既不是邊界點也不是核心點

noisePointIdx = [pointIdx for pointIdx in range(len(points)) if pointIdx not in corePointIdx and pointIdx not in borderPointIdx]

corePoint = [points[pointIdx] for pointIdx in corePointIdx]

borderPoint = [points[pointIdx] for pointIdx in borderPointIdx]

noisePoint = [points[pointIdx] for pointIdx in noisePointIdx]

# pl.plot([eachpoint[0] for eachpoint in corePoint], [eachpoint[1] for eachpoint in corePoint], 'or')

# pl.plot([eachpoint[0] for eachpoint in borderPoint], [eachpoint[1] for eachpoint in borderPoint], 'oy')

# pl.plot([eachpoint[0] for eachpoint in noisePoint], [eachpoint[1] for eachpoint in noisePoint], 'ok')

groups = [idx for idx in range(len(points))]

# 各個核心點與其鄰域內的所有核心點放在同一個簇中

for pointidx,surroundIdxs in surroundPoints.iteritems():

for oneSurroundIdx in surroundIdxs:

if (pointidx in corePointIdx and oneSurroundIdx in corePointIdx and pointidx < oneSurroundIdx):

for idx in range(len(groups)):

if groups[idx] == groups[oneSurroundIdx]:

groups[idx] = groups[pointidx]

# 邊界點跟其鄰域內的某個核心點放在同一個簇中

for pointidx,surroundIdxs in surroundPoints.iteritems():

for oneSurroundIdx in surroundIdxs:

if (pointidx in borderPointIdx and oneSurroundIdx in corePointIdx):

groups[pointidx] = groups[oneSurroundIdx]

break

# 取簇規模最大的5個簇

wantGroupNum = 3

finalGroup = Counter(groups).most_common(3)

finalGroup = [onecount[0] for onecount in finalGroup]

group1 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalGroup[0]]

group2 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalGroup[1]]

group3 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalGroup[2]]

pl.plot([eachpoint[0] for eachpoint in group1], [eachpoint[1] for eachpoint in group1], 'or')

pl.plot([eachpoint[0] for eachpoint in group2], [eachpoint[1] for eachpoint in group2], 'oy')

pl.plot([eachpoint[0] for eachpoint in group3], [eachpoint[1] for eachpoint in group3], 'og')

# 打印噪音點，黑色

pl.plot([eachpoint[0] for eachpoint in noisePoint], [eachpoint[1] for eachpoint in noisePoint], 'ok')

pl.show()