基础知识¶

层次聚类通过对数据集在不同层次进行划分,从而形成树形的聚类结构。数据集的划分可采用“自底向上”的聚合(agglomerative)策略,也可采用“自顶向下”的分拆(divisive)策略。“自底而上”的算法开始时把每一个原始数据看作一个单一的聚类簇,然后不断聚合小的聚类簇成为大的聚类。“自顶向下”的算法开始把所有数据看作一个聚类,通过不断分割大的聚类直到每一个单一的数据都被划分。

层次聚类缺点¶

  1. 合并或分裂点选择困难,一旦决定,前一步的合并和分裂的操作就不可撤销
  2. 伸缩性差,合并或者分裂的决定需要检查和估算大量的簇和对象

Kmeans¶

DBSCAN( Density-Based Spatial Clustering of Applications with Noise)

DBSCAN 需要两个参数距离ε (eps) 和形成高密度区域所需要的最少点数 (minPts),它由一个任意未被访问的点开始,然后探索这个点的 ε-邻域,如果 ε-邻域里有足够的点,则建立一个新的聚类,否则这个点被标签为杂音。注意这个点之后可能被发现在其它点的 ε-邻域里,而该 ε-邻域可能有足够的点届时这个点会被加入该聚类中。如果一个点位于一个聚类的密集区域里,它的 ε-邻域里的点也属于该聚类,当这些新的点被加进聚类后,如果它(们)也在密集区域里,它(们)的 ε-邻域里的点也会被加进聚类里。这个过程将一直重复,直至不能再加进更多的点为止,这样,一个密度连结的聚类被完整地找出来。然后,一个未曾被访问的点将被探索,从而发现一个新的聚类或杂音。DBSCAN 可以找出任何形状的聚类,甚至能找出一个聚类,它包围但不连接另一个聚类,另外,由于 MinPts 参数,single-link effect (不同聚类以一点或极幼的线相连而被当成一个聚类)能有效地被避免。

  1. 核心点 如果一个点 p 在距离 ε 范围内有至少 minPts 个点(包括自己),则这个点被称为核心点
  2. 可达点 点 B 和点 C 不是核心点,但它们可由 A 经其他核心点可达,所以也属于同一个聚类。
  3. 局外点 点 N 是局外点,它既不是核心点,又不由其他点可达

BIRCH(Balanced Iterative ReducingandClustering using Hierarchies)

针对层次聚类不适于解决大量数据的问题,Zhang 等人提出了一种基于聚类特征树(Clustering Feature Tree)的层次聚类算法:BIRCH算法(Balanced Iterative Reducing and Clustering Using Hierarchies)。BIRCH算法实现了只需要一次对数据集的遍历就可以完成聚类。其聚类的主要过程就是将所有数据依次插入构建聚类特征树的过程,而树上的每一个节点所包含的数据构成了一个对应的聚类簇。同时对于包含数据极少的节点我们可以认为它是异常点从而进行去除。

CURE ( Clustering Using Representative)

不同于一个质心或者单个点来代表一个类,CURE 算法中每个簇有多个代表点,这使得 CURE算法可以适应非球形的几何形状。CURE 算法还通过“收缩因子”减少离群点对聚类效果的影响。

  • 代表点的选取
  1. 首先选择簇中距离质心最远的点做为第一个点
  2. 然后依次选择距离已选到的点最远的点
  3. 直到选到c 个点为止,这些点尽量得分散,捕获了簇的形状和大小

ROCK (RObust Clustering using linKs)

A Hierarchical ClusteringAlgorithm for Categorical Attributes主要用在categorical的数据类型上;

Chameleon1¶

(A Hierarchical Clustering AlgorithmUsing Dynamic Modeling)Chameleon,变色龙算法。算法如其名,它可以自动地、适应地合并簇,对各种奇葩的形状也能应对自如。

它里用到的linkage是kNN(k-nearest-neighbor)算法, K-NN是一种基于实例的学习,或者是局部近似和将所有计算推迟到分类之后的惰性学习。k-近邻算法是所有的机器学习算法中最简 单的之一。无论是分类还是回归,衡量邻居的权重都非常有用,使较近邻居的权重比较远邻居的权重大。例如,一种常见的加权 方案是给每个邻居权重赋值为1/ d,其中d是到邻居的距离。邻居都取自一组已经正确分类(在回归的情况下,指属性值正确)的对象。虽然没要求明确的训练步骤,但这也可以当作是此算法的一个训练样本集。k-nn算法的缺点是对数据的局部结构非常敏感。并以linkage构建一个graph.

Chameleon使用hMetis算法根据最小化截断的边的权重之和来分隔k近邻图。Chameleon的聚类效果被认为非常强大,比BIRCH好用,但运算复杂度很高,O(n^2)。它是由George Karypis 教授提出的一种基于动态建模(Dynamic modeling)思想的层次聚类算法。该算法通过计算两个聚类簇之间的相似度和互连性从而克服CURE和ROCK等算法的缺点。变色龙算法首先将所有的数据根据他们之间的距离划分成众多的小的簇,通过计算相近簇之间的相似度和互连性不断合并形成大的聚类簇。

[1]Chameleon聚类算法的Weka实现

R操作¶

标准化数据矩阵¶

在R里面我用scale标准化,每次scale只能读取处理标准化dataframe数据的一列, read.csv加载进来的数据就是dataframe类型,然而scale处理的是numeric matric对象。

怎么把dataframe类型转为matric类型呢,用


data<-data.matrix(data)

data_scale<-scale(data)

scale(x, center = TRUE, scale = TRUE)

默认情况下:scale函数是将一组数的每个数都减去这组数的平均值后再除以这组数的均方根。¶

  • If center is TRUE then centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns,

    • if center is FALSE, no centering is done.
  • If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations

  • if center is TRUE, and the root mean square(均方根) otherwise. If scale is FALSE, no scaling is done.

每个列数据标准化之后,得到下表¶

X1 X2 X3 X4 X5 X6 X7 X8
1 3.90366495000079 -1.51495896020302 -2.17635522268827 0.322329185610152 0.382445911299242 1.08987468404102 2.64090577059576 0.684491814022795
2 0.442398553372214 1.55268488189912 -1.71778441371367 -0.80582296402538 1.95998682484506 1.69422800414596 1.73718862572623 0.283813678985061
3 -0.192680778460667 0.131711720805258 0.591846611975485 -1.61164592805076 0.128905369369434 0.175452433334835 -1.47159057137507 3.39165091454191
4 0.477196568123692 0.751228382611171 -1.21984362309052 -1.28931674244061 0.974777740423126 0.0911240630876348 1.14061864354945 0.509453298845228
5 0.285356171056844 1.41219969093198 -2.14805804837838 -1.45048133524568 3.3104269912777 2.59902817158035 1.72305402291422 -0.311007469936935
6 -0.69625598477924 2.35414138118706 1.1848010372874 -0.966987556830456 -0.411250590895777 -0.920600153288398 -0.595918273354763 -0.461778108378969
7 -0.2748358410157 -0.480895833248172 0.899704209951727 1.28931674244061 0.344042863314632 -0.919843360222077 -0.596366990904351 -0.461261770576085
8 -0.780854378423711 1.47898773253931 1.02279132587284 -1.12815214963553 -0.322179647140896 0.184750176721065 -0.914507733561996 -0.449902338912644
9 -0.411662382024195 -0.388774396548409 0.77656117076123 0.966987556830456 -0.612514732427687 0.0235532535946855 -0.346655674558822 -0.439059245052087
10 -0.337428692236829 -0.455562438155737 0.642233477693304 1.12815214963553 -0.309512673198328 0.176857906172289 -0.241880126730106 -0.42821615119153
11 -0.562806531964426 -0.349622785951009 -0.263723486378472 -0.644658371220304 -0.404213383149906 0.340541435088009 -0.412841513123001 -0.466941486407805
12 -0.961800942607716 0.0718327869504112 0.536370728743435 1.61164592805076 -0.763211509728553 -1.36635126935149 -0.595693914579969 -0.459712757167434
13 -0.688568051287635 -0.400289576135879 0.435429227499636 1.45048133524568 -0.635938581067515 -0.814432897413076 -0.50191194671615 -0.466425148604922
14 -0.556752673405247 -0.890836226562122 0.284324553615565 0.80582296402538 -0.695754846907418 0.299025929735541 -0.224380142296187 2.1426297693663
15 -0.56224627770188 0.0280751045180236 0.726062458504637 0 -0.459907855881513 -0.599071213397147 -0.805020651462609 -0.462294446181852
16 -0.214935322778473 -0.819442113119805 0.317878515247853 -0.161164592805076 -0.365810335165296 -1.32548444377015 -0.472520947218159 -0.461778108378969
17 0.0844894553156418 -0.803320861697346 0.463390862193209 -0.322329185610152 -0.152180814308498 -0.64188407829188 -0.149893029064637 -0.387941802566603
18 0.441931674820092 -0.628290131967795 -0.0453990426910512 0.644658371220304 -0.641769410342665 -0.564799298822323 -0.0626174656698387 -0.461261770576085
19 0.384163235304221 -0.948412124499475 0.0493908989201624 -0.483493778415228 -0.290110086127569 -0.769998333090512 0.260234811258477 -0.431830515811715
20 0.0705920370808162 0.541652114119208 -0.0287898316830688 0.483493778415228 -0.825038406352991 0.158154306104641 -0.209123745610207 -0.442673609672272
21 0.151035211611403 -0.642108347472759 -0.330831409643048 0.161164592805076 -0.211192827834588 1.08987468404102 0.098920852181716 -0.419954746345391

欧式距离矩阵¶

d <- dist(data_scale, method = "euclidean")
dx<-data.matrix(d)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
1 0 5.16768267071901 7.48575722380227 4.90086394693087 6.12275772592537 8.09640951038881 6.69025332851862 7.64790400773649 6.42965937999364 6.20998945646654 6.14264643893194 7.3717192454791 6.80372283820118 6.22073581208283 6.89962806714961 6.4267245042567 5.89065870675833 5.46108060778468 5.39781372911768 5.93964641852189 5.12932030573575
2 5.16768267071901 0 5.83272589412447 2.25225283979474 1.90826680191786 5.37114802388519 5.59785637654296 4.91661167750756 5.30444831865893 5.08211290816887 4.40200542936744 6.15371412798448 5.78178784410306 5.46602629347795 5.30000955702435 5.53039795032183 4.95995546358709 5.03231561184843 4.82851623282771 4.48411320945066 4.00239788142592
3 7.48575722380227 5.83272589412447 0 4.48017472717933 7.02228258660968 4.80299676161744 5.074377565519 4.22516475929107 4.84921899234108 4.91636652132477 4.28449483333421 5.45536869016055 5.22684475786161 3.30572244247894 4.35964836798575 4.6291202175384 4.40558755755516 4.94463148395993 4.66412719794801 4.71363327825505 4.76002192603086
4 4.90086394693087 2.25225283979474 4.48017472717933 0 3.7574551412246 4.09097107749106 4.31332875516147 3.74198922709707 4.09901718872613 3.98481171186931 2.99226346672802 4.80947716476779 4.47170084361266 4.21686481747946 3.78375190257866 3.73267127454576 3.25841377997733 3.52950701732329 3.02659939246861 3.27763590359809 3.04876530550837
5 6.12275772592537 1.90826680191786 7.02228258660968 3.7574551412246 0 6.69563771108826 6.87583855904284 6.10737590673844 6.6670483917486 6.41537605001404 5.6133008480163 7.59290692337563 7.17800686648422 6.94066673451484 6.62238698794796 6.83619037928493 6.24886277810613 6.46836606640369 6.1493693074572 5.98716296880147 5.24089375539703
6 8.09640951038881 5.37114802388519 4.80299676161744 4.09097107749106 6.69563771108826 0 3.73599912731914 1.46855208714037 3.5363937721449 3.74812176096848 3.33981140430848 3.55975196049012 3.75010372123482 4.79504612279738 2.59297539585995 3.44752521226362 3.44446489777236 3.84256514655836 3.79055914844463 2.98911001160155 4.22277948500448
7 6.69025332851862 5.59785637654296 5.074377565519 4.31332875516147 6.87583855904284 3.73599912731914 0 3.42371530913812 1.4191957130731 1.36142069424736 2.7159076897842 1.56164071183728 1.18305553912539 3.21643180763394 1.68134298712308 1.80121993654563 1.88389453509044 1.796400847503 2.38285070380383 2.31419735389588 2.79762141062436
8 7.64790400773649 4.91661167750756 4.22516475929107 3.74198922709707 6.10737590673844 1.46855208714037 3.42371530913812 0 2.91683565243723 3.102896227269 2.35864530332377 3.52960174794504 3.4353842032372 4.15956683893917 2.03941125471757 3.08435029425612 2.86718163832637 3.40727481058496 3.30235502697041 2.4612959361684 3.27438570813145
9 6.42965937999364 5.30444831865893 4.84921899234108 4.09901718872613 6.6670483917486 3.5363937721449 1.4191957130731 2.91683565243723 0 0.424732022346578 1.96290629218087 1.73351607765932 1.07450893630448 2.70264162847394 1.32500659751976 1.89853028716554 1.69542661663572 1.41198106083333 2.16361796555593 1.43625594440237 1.93738992880826
10 6.20998945646654 5.08211290816887 4.91636652132477 3.98481171186931 6.41537605001404 3.74812176096848 1.36142069424736 3.102896227269 0.424732022346578 0 2.02289928406865 1.90433006913973 1.19637195137402 2.69132387884786 1.58340647987955 2.05657005804155 1.77200649819164 1.42741697429807 2.20488238158622 1.5146527922077 1.76487862426198
11 6.14264643893194 4.40200542936744 4.28449483333421 2.99226346672802 5.6133008480163 3.33981140430848 2.7159076897842 2.35864530332377 1.96290629218087 2.02289928406865 0 3.02386024780175 2.50846647877445 3.10311153281185 1.60552686605985 1.92210223944254 1.5367135644914 1.94831245750701 1.75496373584515 1.66611356552123 1.45321632468476
12 7.3717192454791 6.15371412798448 5.45536869016055 4.80947716476779 7.59290692337563 3.55975196049012 1.56164071183728 3.52960174794504 1.73351607765932 1.90433006913973 3.02386024780175 0 0.814476121578169 3.38995070955314 1.87602781252214 2.17199190337697 2.59003938731198 2.16230548480846 2.96515297691796 2.31438238830753 3.38022082639339
13 6.80372283820118 5.78178784410306 5.22684475786161 4.47170084361266 7.17800686648422 3.75010372123482 1.18305553912539 3.4353842032372 1.07450893630448 1.19637195137402 2.50846647877445 0.814476121578169 0 2.97051988059554 1.59908620459353 1.82928017433938 2.07305637824893 1.57032404064268 2.45847764284691 1.9188840861979 2.68008560568578
14 6.22073581208283 5.46602629347795 3.30572244247894 4.21686481747946 6.94066673451484 4.79504612279738 3.21643180763394 4.15956683893917 2.70264162847394 2.69132387884786 3.10311153281185 3.38995070955314 2.97051988059554 0 3.11028730045238 3.26351629553388 3.05184658938868 2.95914165313176 3.28271643237101 3.06073623929525 2.98134794320285
15 6.89962806714961 5.30000955702435 4.35964836798575 3.78375190257866 6.62238698794796 2.59297539585995 1.68134298712308 2.03941125471757 1.32500659751976 1.58340647987955 1.60552686605985 1.87602781252214 1.59908620459353 3.11028730045238 0 1.29560230615043 1.34671014816217 1.74222205623268 1.93251481826925 1.59059825132493 2.41541203910721
16 6.4267245042567 5.53039795032183 4.6291202175384 3.73267127454576 6.83619037928493 3.44752521226362 1.80121993654563 3.08435029425612 1.89853028716554 2.05657005804155 1.92210223944254 2.17199190337697 1.82928017433938 3.26351629553388 1.29560230615043 0 0.871521983121097 1.439496786239 1.18476042172799 2.22523799989627 2.62226306472392
17 5.89065870675833 4.95995546358709 4.40558755755516 3.25841377997733 6.24886277810613 3.44446489777236 1.88389453509044 2.86718163832637 1.69542661663572 1.77200649819164 1.5367135644914 2.59003938731198 2.07305637824893 3.05184658938868 1.34671014816217 0.871521983121097 0 1.26923784109928 0.716794709372098 1.9493692653925 1.99006961385727
18 5.46108060778468 5.03231561184843 4.94463148395993 3.52950701732329 6.46836606640369 3.84256514655836 1.796400847503 3.40727481058496 1.41198106083333 1.42741697429807 1.94831245750701 2.16230548480846 1.57032404064268 2.95914165313176 1.74222205623268 1.439496786239 1.26923784109928 0 1.28778818285901 1.45291669656696 1.83062629900573
19 5.39781372911768 4.82851623282771 4.66412719794801 3.02659939246861 6.1493693074572 3.79055914844463 2.38285070380383 3.30235502697041 2.16361796555593 2.20488238158622 1.75496373584515 2.96515297691796 2.45847764284691 3.28271643237101 1.93251481826925 1.18476042172799 0.716794709372098 1.28778818285901 0 2.15123828300878 2.04935182425302
20 5.93964641852189 4.48411320945066 4.71363327825505 3.27763590359809 5.98716296880147 2.98911001160155 2.31419735389588 2.4612959361684 1.43625594440237 1.5146527922077 1.66611356552123 2.31438238830753 1.9188840861979 3.06073623929525 1.59059825132493 2.22523799989627 1.9493692653925 1.45291669656696 2.15123828300878 0 1.71557629716017
21 5.12932030573575 4.00239788142592 4.76002192603086 3.04876530550837 5.24089375539703 4.22277948500448 2.79762141062436 3.27438570813145 1.93738992880826 1.76487862426198 1.45321632468476 3.38022082639339 2.68008560568578 2.98134794320285 2.41541203910721 2.62226306472392 1.99006961385727 1.83062629900573 2.04935182425302 1.71557629716017 0

确定最佳的簇数¶

函数fviz_nbclust()¶

fviz_nbclust()用于划分聚类分析中,使用silhouette(轮廓系数),WSS(簇内平方误差和)确定和可视化最佳的簇数

  library(factoextra)

  fviz_nbclust(data_scale, kmeans, method ="wss")
  `

"silhouette" (for average silhouette width)轮廓系数.

"wss" (for total within sum of square)

故选择K=5,分五类。

层次聚类法在不同linkage方法下¶

par(mfrow =c(2,2))  
for (i in c( "complete", "average","ward.D", "single")){
     result_hc <- hclust(d, method = i)
     plot(result_hc,cex = 0.8,hang = -1,main = i)}

K-means和DBSCAN聚类对比¶

kmeans DBSCAN

拓展¶

  • Single Linkage(单连接):方法是将两个组合数据点中距离最近的两个数据点间的距离作为这两个组合数据点的距离。这种方法容易受到极端值的影响。两个不相似的组合数据点可能由于其中的某个极端的数据点距离较近而组合在一起。

该方法的缺点是,它倾向于产生长而细的簇,其中同一簇的相邻元素之间的距离很小,但是在簇相对两端的元素可能比其他簇的两个元素彼此相距更远。这可能会导致难以定义可以有效细分数据的类。 image.png