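{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Exercise 7 - K-means and PCA (Principal Component Analysis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook covers a Python-based solution to the seventh programming exercise of the Coursera Machine Learning course. See the [exercise text](ex7.pdf) for detailed instructions and formulas.\n", "\n", "Code modified and annotated by: Huang Haiguang (黄海广), haiguang2000@qq.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise we implement K-means clustering and use it to compress an image. We start with a simple 2D dataset to see how K-means works, then apply it to image compression. We also experiment with principal component analysis and see how it can be used to find a low-dimensional representation of face images." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-means clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first implement K-means and apply it to a simple two-dimensional dataset to build intuition for how it works. K-means is an iterative, unsupervised clustering algorithm that groups similar instances into clusters. It starts from an initial guess for each cluster's centroid, then repeatedly assigns every instance to its nearest centroid and recomputes each centroid from the instances assigned to it. The first piece to implement is a function that finds the closest centroid for each instance in the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sb\n", "from scipy.io import loadmat\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def find_closest_centroids(X, centroids):\n", "    # For each row of X, return the index of its nearest centroid.\n", "    m = X.shape[0]\n", "    k = centroids.shape[0]\n", "    idx = np.zeros(m)\n", "\n", "    for i in range(m):\n", "        # np.inf instead of a magic constant, so any real distance is smaller\n", "        min_dist = np.inf\n", "        for j in range(k):\n", "            # squared Euclidean distance from example i to centroid j\n", "            dist = np.sum((X[i,:] - centroids[j,:]) ** 2)\n", "            if dist < min_dist:\n", "                min_dist = dist\n", "                idx[i] = j\n", "\n", "    return idx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test the function to make sure it works as expected, using the test case provided in the exercise." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 2., 1.])" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = loadmat('data/ex7data2.mat')\n", "X = data['X']\n", "initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])\n", "\n", "idx = find_closest_centroids(X, 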
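initial_centroids)\n", "idx[0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output matches the expected values in the exercise text (remember that our arrays are zero-indexed rather than one-indexed, so each value is one lower than in the exercise). Next we need a function that computes the centroid of each cluster. A centroid is simply the mean of all examples currently assigned to that cluster." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "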
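\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>X1</th><th>X2</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>1.842080</td><td>4.607572</td></tr>\n",
"    <tr><th>1</th><td>5.658583</td><td>4.799964</td></tr>\n",
"    <tr><th>2</th><td>6.352579</td><td>3.290854</td></tr>\n",
"    <tr><th>3</th><td>2.904017</td><td>4.612204</td></tr>\n",
"    <tr><th>4</th><td>3.231979</td><td>4.939894</td></tr>\n",
"  </tbody>\n", "</table>\n", "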