16. K-means & silhouette coefficient 01

HicKee 2023. 3. 10. 16:25

K-means: distance-based

Characteristics

Assumes clusters are roughly circular (spherical)

All features should be on the same scale

Sensitive to outliers
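
The scale requirement above can be illustrated with a small sketch. The toy data, the names `X_demo`, `group`, and `agreement`, and the use of `StandardScaler` in a pipeline are assumptions made for this illustration, not part of the original notebook; other scalers could be used just as well.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two groups separated along feature 0,
# while feature 1 is uninformative noise on a much larger scale.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)
X_demo = np.column_stack([
    group * 5.0 + rng.normal(0.0, 0.5, 200),  # informative, small scale
    rng.normal(0.0, 1000.0, 200),             # noise, large scale
])

def agreement(labels, truth):
    """Share of samples matching the true grouping, ignoring label order."""
    acc = np.mean(labels == truth)
    return max(acc, 1.0 - acc)

# Without scaling, Euclidean distances are dominated by the large-scale feature.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# Standardize every feature to zero mean and unit variance, then cluster.
scaled_model = make_pipeline(StandardScaler(),
                             KMeans(n_clusters=2, n_init=10, random_state=0))
scaled_labels = scaled_model.fit_predict(X_demo)

print("agreement without scaling:", agreement(raw_labels, group))
print("agreement with scaling:   ", agreement(scaled_labels, group))
```

Because K-means relies on Euclidean distance, a feature with a much larger numeric range can dominate the result, which is why scaling usually comes first.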

 

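Choosing an appropriate number of clusters with the inertia value

inertia

The sum of squared distances between each data point and the centroid of its cluster; it measures how cohesive the clusters are.

For a given number of clusters, a smaller value means tighter clusters, so the clustering is considered better

Available through the inertia_ attribute of a fitted KMeans model

Compute inertia for each candidate number of clusters; the point where the sharp drop levels off (the elbow) is taken as the appropriate number of clusters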
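
The procedure above can be sketched as follows. This is a minimal illustration, assuming blob data like the data generated later in this post; the names `X_demo`, `ks`, `inertias`, and `model5` and the candidate range 1 to 10 are choices made here, not part of the original notebook. The last lines cross-check by hand that `inertia_` is the sum of squared distances to the assigned centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Demo data (assumption): blobs with 5 centers
X_demo, _ = make_blobs(n_samples=1000, n_features=2, centers=5, random_state=40)

# Fit K-means for each candidate k and record inertia_
ks = list(range(1, 11))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=123).fit(X_demo)
    inertias.append(model.inertia_)

# Look for the k after which the decrease becomes small (the elbow)
for k, value in zip(ks, inertias):
    print(f"k={k:2d}  inertia={value:.1f}")

# Cross-check for k=5: sum of squared distances to each sample's centroid
model5 = KMeans(n_clusters=5, n_init=10, random_state=123).fit(X_demo)
manual_inertia = ((X_demo - model5.cluster_centers_[model5.labels_]) ** 2).sum()
print(np.isclose(manual_inertia, model5.inertia_))
```

Plotting `inertias` against `ks` with matplotlib makes the elbow easier to spot than reading the printed values.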

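Evaluation

Silhouette index

Silhouette coefficient

*   Measures how close each observation is to the data in its own cluster
    and how far it is from the nearest other cluster

*   Takes values between -1 and 1; the closer to 1, the better
    *   Close to 1: the observation fits well in its own cluster
        (it lies well inside the cluster, near its center)
    *   Close to 0: the observation lies on the boundary between clusters
    *   Close to -1: the observation is probably assigned to the wrong cluster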
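
To make the definition concrete, here is a hand computation for a single observation using the standard formula s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The demo data and the names `X_demo`, `labels`, `i`, `a`, `b`, and `s_i` are assumptions for this sketch; the result is compared with scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Demo data and clustering (assumption: 3 well-populated clusters)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

i = 0  # index of the observation to inspect
dist_i = np.linalg.norm(X_demo - X_demo[i], axis=1)  # distances from i to all points
own = labels == labels[i]

# a: mean distance to the other points in the same cluster (i itself excluded)
a = dist_i[own].sum() / (own.sum() - 1)

# b: smallest mean distance to the points of any other cluster
b = min(dist_i[labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_i = (b - a) / max(a, b)
print(s_i, silhouette_samples(X_demo, labels)[i])
```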

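silhouette_samples()

*   Returns the silhouette coefficient of each individual observation

silhouette_score()

*   Returns the mean of the silhouette coefficients over all observations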
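
A short sketch of how the two functions relate, assuming demo blobs and the variable names `X_demo`, `labels`, `sample_scores`, and `overall_score` (none of which come from the original notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X_demo)

sample_scores = silhouette_samples(X_demo, labels)  # one value per observation
overall_score = silhouette_score(X_demo, labels)    # a single float

print(sample_scores.shape)
# The score is the mean of the per-sample coefficients
print(overall_score, sample_scores.mean())
```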

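Signs of a good clustering

*   The mean silhouette coefficient is close to 1
*   The mean silhouette coefficient of each individual cluster
    should not deviate much from the overall mean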
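
The second criterion can be checked by averaging the per-sample coefficients within each cluster. This is a minimal sketch under assumed demo data; `cluster_means` and `overall_mean` are names introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X_demo, _ = make_blobs(n_samples=1000, n_features=2, centers=5, random_state=40)
labels = KMeans(n_clusters=5, n_init=10, random_state=123).fit_predict(X_demo)

sample_scores = silhouette_samples(X_demo, labels)
overall_mean = sample_scores.mean()

# Mean silhouette coefficient of each cluster and its gap from the overall mean
cluster_means = {int(c): float(sample_scores[labels == c].mean())
                 for c in np.unique(labels)}
for c, m in cluster_means.items():
    print(f"cluster {c}: mean={m:.3f}  gap={m - overall_mean:+.3f}")
```

A cluster whose mean sits far below the overall mean is a hint that it may be split badly or merged with a neighbor, even when the overall score looks acceptable.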

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

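# Generate 1000 two-dimensional samples around 5 centers
X, y = make_blobs(n_samples=1000,
                  n_features=2,
                  centers=5,
                  random_state=40)
y.shape  # (1000,)
# Scatter plot of the raw data, before any clustering
plt.scatter(X[:, 0], X[:, 1],
            c='gray',
            edgecolors='black',
            marker='o')
plt.show()
# 5 clusters are generated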

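# Fit K-means with 5 clusters, starting from randomly chosen centroids
kmc = KMeans(n_clusters=5,
             init='random',
             max_iter=100,
             random_state=123)
kmc.fit(X)
label_kmc = kmc.labels_  # cluster index assigned to each sample
print(label_kmc)
# Collect the features, the true blob labels, and the K-means labels in one DataFrame
kmc_columns = ['kmc_comp1', 'kmc_comp2']
X_kmc_df = pd.DataFrame(X, columns=kmc_columns)
X_kmc_df['target'] = y
X_kmc_df['label_kmc'] = label_kmc
X_kmc_df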
# Plot the samples grouped by the K-means cluster label
markers = ['o', 'x', '^', 's', '*']
for i, mark in enumerate(markers):
  df_i = X_kmc_df[X_kmc_df['label_kmc'] == i]
  X1 = df_i['kmc_comp1']
  X2 = df_i['kmc_comp2']
  plt.scatter(X1, X2, marker=mark, label=i)

plt.xlabel('kmc_component1')
plt.ylabel('kmc_component2')
plt.legend()
plt.show()

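# Plot the same samples grouped by the true label from make_blobs
# (K-means cluster indices need not match the true label numbering)
markers = ['o', 'x', '^', 's', '*']
for i, mark in enumerate(markers):
  df_i = X_kmc_df[X_kmc_df['target'] == i]
  X1 = df_i['kmc_comp1']
  X2 = df_i['kmc_comp2']
  plt.scatter(X1, X2, marker=mark, label=i)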

plt.xlabel('kmc_component1')
plt.ylabel('kmc_component2')
plt.legend()
plt.show()

# Silhouette score
silhouette_score(X, label_kmc)

0.6639397841888107
