DBSCAN 需要两个参数距离ε (eps) 和形成高密度区域所需要的最少点数 (minPts),它由一个任意未被访问的点开始,然后探索这个点的 ε-邻域,如果 ε-邻域里有足够的点,则建立一个新的聚类,否则这个点被标签为杂音。注意这个点之后可能被发现在其它点的 ε-邻域里,而该 ε-邻域可能有足够的点届时这个点会被加入该聚类中。如果一个点位于一个聚类的密集区域里,它的 ε-邻域里的点也属于该聚类,当这些新的点被加进聚类后,如果它(们)也在密集区域里,它(们)的 ε-邻域里的点也会被加进聚类里。这个过程将一直重复,直至不能再加进更多的点为止,这样,一个密度连结的聚类被完整地找出来。然后,一个未曾被访问的点将被探索,从而发现一个新的聚类或杂音。DBSCAN 可以找出任何形状的聚类,甚至能找出一个聚类,它包围但不连接另一个聚类,另外,由于 MinPts 参数,single-link effect (不同聚类以一点或极幼的线相连而被当成一个聚类)能有效地被避免。
针对层次聚类不适于解决大量数据的问题,Zhang 等人提出了一种基于聚类特征树(Clustering Feature Tree)的层次聚类算法:BIRCH算法(Balanced Iterative Reducing and Clustering Using Hierarchies)。BIRCH算法实现了只需要一次对数据集的遍历就可以完成聚类。其聚类的主要过程就是将所有数据依次插入构建聚类特征树的过程,而树上的每一个节点所包含的数据构成了一个对应的聚类簇。同时对于包含数据极少的节点我们可以认为它是异常点从而进行去除。
不同于一个质心或者单个点来代表一个类,CURE 算法中每个簇有多个代表点,这使得 CURE算法可以适应非球形的几何形状。CURE 算法还通过“收缩因子”减少离群点对聚类效果的影响。
A Hierarchical ClusteringAlgorithm for Categorical Attributes主要用在categorical的数据类型上;
(A Hierarchical Clustering AlgorithmUsing Dynamic Modeling)Chameleon,变色龙算法。算法如其名,它可以自动地、适应地合并簇,对各种奇葩的形状也能应对自如。
它里用到的linkage是kNN(k-nearest-neighbor)算法, K-NN是一种基于实例的学习,或者是局部近似和将所有计算推迟到分类之后的惰性学习。k-近邻算法是所有的机器学习算法中最简 单的之一。无论是分类还是回归,衡量邻居的权重都非常有用,使较近邻居的权重比较远邻居的权重大。例如,一种常见的加权 方案是给每个邻居权重赋值为1/ d,其中d是到邻居的距离。邻居都取自一组已经正确分类(在回归的情况下,指属性值正确)的对象。虽然没要求明确的训练步骤,但这也可以当作是此算法的一个训练样本集。k-nn算法的缺点是对数据的局部结构非常敏感。并以linkage构建一个graph.
Chameleon使用hMetis算法根据最小化截断的边的权重之和来分隔k近邻图。Chameleon的聚类效果被认为非常强大,比BIRCH好用,但运算复杂度很高,O(n^2)。它是由George Karypis 教授提出的一种基于动态建模(Dynamic modeling)思想的层次聚类算法。该算法通过计算两个聚类簇之间的相似度和互连性从而克服CURE和ROCK等算法的缺点。变色龙算法首先将所有的数据根据他们之间的距离划分成众多的小的簇,通过计算相近簇之间的相似度和互连性不断合并形成大的聚类簇。
[1]Chameleon聚类算法的Weka实现在R里面我用scale标准化,每次scale只能读取处理标准化dataframe数据的一列, read.csv加载进来的数据就是dataframe类型,然而scale处理的是numeric matric对象。
怎么把dataframe类型转为matric类型呢,用
data<-data.matrix(data)
data_scale<-scale(data)
scale(x, center = TRUE, scale = TRUE)
If center is TRUE then centering is done by subtracting the column means (omitting NAs) of x from their corresponding columns,
If scale is TRUE then scaling is done by dividing the (centered) columns of x by their standard deviations
if center is TRUE, and the root mean square(均方根) otherwise. If scale is FALSE, no scaling is done.
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | |
---|---|---|---|---|---|---|---|---|
1 | 3.90366495000079 | -1.51495896020302 | -2.17635522268827 | 0.322329185610152 | 0.382445911299242 | 1.08987468404102 | 2.64090577059576 | 0.684491814022795 |
2 | 0.442398553372214 | 1.55268488189912 | -1.71778441371367 | -0.80582296402538 | 1.95998682484506 | 1.69422800414596 | 1.73718862572623 | 0.283813678985061 |
3 | -0.192680778460667 | 0.131711720805258 | 0.591846611975485 | -1.61164592805076 | 0.128905369369434 | 0.175452433334835 | -1.47159057137507 | 3.39165091454191 |
4 | 0.477196568123692 | 0.751228382611171 | -1.21984362309052 | -1.28931674244061 | 0.974777740423126 | 0.0911240630876348 | 1.14061864354945 | 0.509453298845228 |
5 | 0.285356171056844 | 1.41219969093198 | -2.14805804837838 | -1.45048133524568 | 3.3104269912777 | 2.59902817158035 | 1.72305402291422 | -0.311007469936935 |
6 | -0.69625598477924 | 2.35414138118706 | 1.1848010372874 | -0.966987556830456 | -0.411250590895777 | -0.920600153288398 | -0.595918273354763 | -0.461778108378969 |
7 | -0.2748358410157 | -0.480895833248172 | 0.899704209951727 | 1.28931674244061 | 0.344042863314632 | -0.919843360222077 | -0.596366990904351 | -0.461261770576085 |
8 | -0.780854378423711 | 1.47898773253931 | 1.02279132587284 | -1.12815214963553 | -0.322179647140896 | 0.184750176721065 | -0.914507733561996 | -0.449902338912644 |
9 | -0.411662382024195 | -0.388774396548409 | 0.77656117076123 | 0.966987556830456 | -0.612514732427687 | 0.0235532535946855 | -0.346655674558822 | -0.439059245052087 |
10 | -0.337428692236829 | -0.455562438155737 | 0.642233477693304 | 1.12815214963553 | -0.309512673198328 | 0.176857906172289 | -0.241880126730106 | -0.42821615119153 |
11 | -0.562806531964426 | -0.349622785951009 | -0.263723486378472 | -0.644658371220304 | -0.404213383149906 | 0.340541435088009 | -0.412841513123001 | -0.466941486407805 |
12 | -0.961800942607716 | 0.0718327869504112 | 0.536370728743435 | 1.61164592805076 | -0.763211509728553 | -1.36635126935149 | -0.595693914579969 | -0.459712757167434 |
13 | -0.688568051287635 | -0.400289576135879 | 0.435429227499636 | 1.45048133524568 | -0.635938581067515 | -0.814432897413076 | -0.50191194671615 | -0.466425148604922 |
14 | -0.556752673405247 | -0.890836226562122 | 0.284324553615565 | 0.80582296402538 | -0.695754846907418 | 0.299025929735541 | -0.224380142296187 | 2.1426297693663 |
15 | -0.56224627770188 | 0.0280751045180236 | 0.726062458504637 | 0 | -0.459907855881513 | -0.599071213397147 | -0.805020651462609 | -0.462294446181852 |
16 | -0.214935322778473 | -0.819442113119805 | 0.317878515247853 | -0.161164592805076 | -0.365810335165296 | -1.32548444377015 | -0.472520947218159 | -0.461778108378969 |
17 | 0.0844894553156418 | -0.803320861697346 | 0.463390862193209 | -0.322329185610152 | -0.152180814308498 | -0.64188407829188 | -0.149893029064637 | -0.387941802566603 |
18 | 0.441931674820092 | -0.628290131967795 | -0.0453990426910512 | 0.644658371220304 | -0.641769410342665 | -0.564799298822323 | -0.0626174656698387 | -0.461261770576085 |
19 | 0.384163235304221 | -0.948412124499475 | 0.0493908989201624 | -0.483493778415228 | -0.290110086127569 | -0.769998333090512 | 0.260234811258477 | -0.431830515811715 |
20 | 0.0705920370808162 | 0.541652114119208 | -0.0287898316830688 | 0.483493778415228 | -0.825038406352991 | 0.158154306104641 | -0.209123745610207 | -0.442673609672272 |
21 | 0.151035211611403 | -0.642108347472759 | -0.330831409643048 | 0.161164592805076 | -0.211192827834588 | 1.08987468404102 | 0.098920852181716 | -0.419954746345391 |
d <- dist(data_scale, method = "euclidean")
dx<-data.matrix(d)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 5.16768267071901 | 7.48575722380227 | 4.90086394693087 | 6.12275772592537 | 8.09640951038881 | 6.69025332851862 | 7.64790400773649 | 6.42965937999364 | 6.20998945646654 | 6.14264643893194 | 7.3717192454791 | 6.80372283820118 | 6.22073581208283 | 6.89962806714961 | 6.4267245042567 | 5.89065870675833 | 5.46108060778468 | 5.39781372911768 | 5.93964641852189 | 5.12932030573575 |
2 | 5.16768267071901 | 0 | 5.83272589412447 | 2.25225283979474 | 1.90826680191786 | 5.37114802388519 | 5.59785637654296 | 4.91661167750756 | 5.30444831865893 | 5.08211290816887 | 4.40200542936744 | 6.15371412798448 | 5.78178784410306 | 5.46602629347795 | 5.30000955702435 | 5.53039795032183 | 4.95995546358709 | 5.03231561184843 | 4.82851623282771 | 4.48411320945066 | 4.00239788142592 |
3 | 7.48575722380227 | 5.83272589412447 | 0 | 4.48017472717933 | 7.02228258660968 | 4.80299676161744 | 5.074377565519 | 4.22516475929107 | 4.84921899234108 | 4.91636652132477 | 4.28449483333421 | 5.45536869016055 | 5.22684475786161 | 3.30572244247894 | 4.35964836798575 | 4.6291202175384 | 4.40558755755516 | 4.94463148395993 | 4.66412719794801 | 4.71363327825505 | 4.76002192603086 |
4 | 4.90086394693087 | 2.25225283979474 | 4.48017472717933 | 0 | 3.7574551412246 | 4.09097107749106 | 4.31332875516147 | 3.74198922709707 | 4.09901718872613 | 3.98481171186931 | 2.99226346672802 | 4.80947716476779 | 4.47170084361266 | 4.21686481747946 | 3.78375190257866 | 3.73267127454576 | 3.25841377997733 | 3.52950701732329 | 3.02659939246861 | 3.27763590359809 | 3.04876530550837 |
5 | 6.12275772592537 | 1.90826680191786 | 7.02228258660968 | 3.7574551412246 | 0 | 6.69563771108826 | 6.87583855904284 | 6.10737590673844 | 6.6670483917486 | 6.41537605001404 | 5.6133008480163 | 7.59290692337563 | 7.17800686648422 | 6.94066673451484 | 6.62238698794796 | 6.83619037928493 | 6.24886277810613 | 6.46836606640369 | 6.1493693074572 | 5.98716296880147 | 5.24089375539703 |
6 | 8.09640951038881 | 5.37114802388519 | 4.80299676161744 | 4.09097107749106 | 6.69563771108826 | 0 | 3.73599912731914 | 1.46855208714037 | 3.5363937721449 | 3.74812176096848 | 3.33981140430848 | 3.55975196049012 | 3.75010372123482 | 4.79504612279738 | 2.59297539585995 | 3.44752521226362 | 3.44446489777236 | 3.84256514655836 | 3.79055914844463 | 2.98911001160155 | 4.22277948500448 |
7 | 6.69025332851862 | 5.59785637654296 | 5.074377565519 | 4.31332875516147 | 6.87583855904284 | 3.73599912731914 | 0 | 3.42371530913812 | 1.4191957130731 | 1.36142069424736 | 2.7159076897842 | 1.56164071183728 | 1.18305553912539 | 3.21643180763394 | 1.68134298712308 | 1.80121993654563 | 1.88389453509044 | 1.796400847503 | 2.38285070380383 | 2.31419735389588 | 2.79762141062436 |
8 | 7.64790400773649 | 4.91661167750756 | 4.22516475929107 | 3.74198922709707 | 6.10737590673844 | 1.46855208714037 | 3.42371530913812 | 0 | 2.91683565243723 | 3.102896227269 | 2.35864530332377 | 3.52960174794504 | 3.4353842032372 | 4.15956683893917 | 2.03941125471757 | 3.08435029425612 | 2.86718163832637 | 3.40727481058496 | 3.30235502697041 | 2.4612959361684 | 3.27438570813145 |
9 | 6.42965937999364 | 5.30444831865893 | 4.84921899234108 | 4.09901718872613 | 6.6670483917486 | 3.5363937721449 | 1.4191957130731 | 2.91683565243723 | 0 | 0.424732022346578 | 1.96290629218087 | 1.73351607765932 | 1.07450893630448 | 2.70264162847394 | 1.32500659751976 | 1.89853028716554 | 1.69542661663572 | 1.41198106083333 | 2.16361796555593 | 1.43625594440237 | 1.93738992880826 |
10 | 6.20998945646654 | 5.08211290816887 | 4.91636652132477 | 3.98481171186931 | 6.41537605001404 | 3.74812176096848 | 1.36142069424736 | 3.102896227269 | 0.424732022346578 | 0 | 2.02289928406865 | 1.90433006913973 | 1.19637195137402 | 2.69132387884786 | 1.58340647987955 | 2.05657005804155 | 1.77200649819164 | 1.42741697429807 | 2.20488238158622 | 1.5146527922077 | 1.76487862426198 |
11 | 6.14264643893194 | 4.40200542936744 | 4.28449483333421 | 2.99226346672802 | 5.6133008480163 | 3.33981140430848 | 2.7159076897842 | 2.35864530332377 | 1.96290629218087 | 2.02289928406865 | 0 | 3.02386024780175 | 2.50846647877445 | 3.10311153281185 | 1.60552686605985 | 1.92210223944254 | 1.5367135644914 | 1.94831245750701 | 1.75496373584515 | 1.66611356552123 | 1.45321632468476 |
12 | 7.3717192454791 | 6.15371412798448 | 5.45536869016055 | 4.80947716476779 | 7.59290692337563 | 3.55975196049012 | 1.56164071183728 | 3.52960174794504 | 1.73351607765932 | 1.90433006913973 | 3.02386024780175 | 0 | 0.814476121578169 | 3.38995070955314 | 1.87602781252214 | 2.17199190337697 | 2.59003938731198 | 2.16230548480846 | 2.96515297691796 | 2.31438238830753 | 3.38022082639339 |
13 | 6.80372283820118 | 5.78178784410306 | 5.22684475786161 | 4.47170084361266 | 7.17800686648422 | 3.75010372123482 | 1.18305553912539 | 3.4353842032372 | 1.07450893630448 | 1.19637195137402 | 2.50846647877445 | 0.814476121578169 | 0 | 2.97051988059554 | 1.59908620459353 | 1.82928017433938 | 2.07305637824893 | 1.57032404064268 | 2.45847764284691 | 1.9188840861979 | 2.68008560568578 |
14 | 6.22073581208283 | 5.46602629347795 | 3.30572244247894 | 4.21686481747946 | 6.94066673451484 | 4.79504612279738 | 3.21643180763394 | 4.15956683893917 | 2.70264162847394 | 2.69132387884786 | 3.10311153281185 | 3.38995070955314 | 2.97051988059554 | 0 | 3.11028730045238 | 3.26351629553388 | 3.05184658938868 | 2.95914165313176 | 3.28271643237101 | 3.06073623929525 | 2.98134794320285 |
15 | 6.89962806714961 | 5.30000955702435 | 4.35964836798575 | 3.78375190257866 | 6.62238698794796 | 2.59297539585995 | 1.68134298712308 | 2.03941125471757 | 1.32500659751976 | 1.58340647987955 | 1.60552686605985 | 1.87602781252214 | 1.59908620459353 | 3.11028730045238 | 0 | 1.29560230615043 | 1.34671014816217 | 1.74222205623268 | 1.93251481826925 | 1.59059825132493 | 2.41541203910721 |
16 | 6.4267245042567 | 5.53039795032183 | 4.6291202175384 | 3.73267127454576 | 6.83619037928493 | 3.44752521226362 | 1.80121993654563 | 3.08435029425612 | 1.89853028716554 | 2.05657005804155 | 1.92210223944254 | 2.17199190337697 | 1.82928017433938 | 3.26351629553388 | 1.29560230615043 | 0 | 0.871521983121097 | 1.439496786239 | 1.18476042172799 | 2.22523799989627 | 2.62226306472392 |
17 | 5.89065870675833 | 4.95995546358709 | 4.40558755755516 | 3.25841377997733 | 6.24886277810613 | 3.44446489777236 | 1.88389453509044 | 2.86718163832637 | 1.69542661663572 | 1.77200649819164 | 1.5367135644914 | 2.59003938731198 | 2.07305637824893 | 3.05184658938868 | 1.34671014816217 | 0.871521983121097 | 0 | 1.26923784109928 | 0.716794709372098 | 1.9493692653925 | 1.99006961385727 |
18 | 5.46108060778468 | 5.03231561184843 | 4.94463148395993 | 3.52950701732329 | 6.46836606640369 | 3.84256514655836 | 1.796400847503 | 3.40727481058496 | 1.41198106083333 | 1.42741697429807 | 1.94831245750701 | 2.16230548480846 | 1.57032404064268 | 2.95914165313176 | 1.74222205623268 | 1.439496786239 | 1.26923784109928 | 0 | 1.28778818285901 | 1.45291669656696 | 1.83062629900573 |
19 | 5.39781372911768 | 4.82851623282771 | 4.66412719794801 | 3.02659939246861 | 6.1493693074572 | 3.79055914844463 | 2.38285070380383 | 3.30235502697041 | 2.16361796555593 | 2.20488238158622 | 1.75496373584515 | 2.96515297691796 | 2.45847764284691 | 3.28271643237101 | 1.93251481826925 | 1.18476042172799 | 0.716794709372098 | 1.28778818285901 | 0 | 2.15123828300878 | 2.04935182425302 |
20 | 5.93964641852189 | 4.48411320945066 | 4.71363327825505 | 3.27763590359809 | 5.98716296880147 | 2.98911001160155 | 2.31419735389588 | 2.4612959361684 | 1.43625594440237 | 1.5146527922077 | 1.66611356552123 | 2.31438238830753 | 1.9188840861979 | 3.06073623929525 | 1.59059825132493 | 2.22523799989627 | 1.9493692653925 | 1.45291669656696 | 2.15123828300878 | 0 | 1.71557629716017 |
21 | 5.12932030573575 | 4.00239788142592 | 4.76002192603086 | 3.04876530550837 | 5.24089375539703 | 4.22277948500448 | 2.79762141062436 | 3.27438570813145 | 1.93738992880826 | 1.76487862426198 | 1.45321632468476 | 3.38022082639339 | 2.68008560568578 | 2.98134794320285 | 2.41541203910721 | 2.62226306472392 | 1.99006961385727 | 1.83062629900573 | 2.04935182425302 | 1.71557629716017 | 0 |
"silhouette" (for average silhouette width)轮廓系数. |
"wss" (for total within sum of square) |
故选择K=5,分五类。
par(mfrow =c(2,2))
for (i in c( "complete", "average","ward.D", "single")){
result_hc <- hclust(d, method = i)
plot(result_hc,cex = 0.8,hang = -1,main = i)}
kmeans | DBSCAN |
该方法的缺点是,它倾向于产生长而细的簇,其中同一簇的相邻元素之间的距离很小,但是在簇相对两端的元素可能比其他簇的两个元素彼此相距更远。这可能会导致难以定义可以有效细分数据的类。