{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Exercise 1 - Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was originally written by another author; the assignment text is in the root directory: [assignment file](ex1.pdf)\n", "\n", "Code modified and annotated by 黄海广, haiguang2000@qq.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression with one variable" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Population</th><th>Profit</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>6.1101</td><td>17.5920</td></tr>\n", "<tr><th>1</th><td>5.5277</td><td>9.1302</td></tr>\n", "<tr><th>2</th><td>8.5186</td><td>13.6620</td></tr>\n", "<tr><th>3</th><td>7.0032</td><td>11.8540</td></tr>\n", "<tr><th>4</th><td>5.8598</td><td>6.8233</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Population Profit\n", "0 6.1101 17.5920\n", "1 5.5277 9.1302\n", "2 8.5186 13.6620\n", "3 7.0032 11.8540\n", "4 5.8598 6.8233" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = 'ex1data1.txt'\n", "data = pd.read_csv(path, header=None, names=['Population', 'Profit'])\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Population</th><th>Profit</th></tr></thead>\n", "<tbody>\n", "<tr><th>count</th><td>97.000000</td><td>97.000000</td></tr>\n", "<tr><th>mean</th><td>8.159800</td><td>5.839135</td></tr>\n", "<tr><th>std</th><td>3.869884</td><td>5.510262</td></tr>\n", "<tr><th>min</th><td>5.026900</td><td>-2.680700</td></tr>\n", "<tr><th>25%</th><td>5.707700</td><td>1.986900</td></tr>\n", "<tr><th>50%</th><td>6.589400</td><td>4.562300</td></tr>\n", "<tr><th>75%</th><td>8.578100</td><td>7.046700</td></tr>\n", "<tr><th>max</th><td>22.203000</td><td>24.147000</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Population Profit\n", "count 97.000000 97.000000\n", "mean 8.159800 5.839135\n", "std 3.869884 5.510262\n", "min 5.026900 -2.680700\n", "25% 5.707700 1.986900\n", "50% 6.589400 4.562300\n", "75% 8.578100 7.046700\n", "max 22.203000 24.147000" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what the data looks like." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "data.plot(kind='scatter', x='Population', y='Profit', figsize=(12,8))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's implement linear regression using gradient descent to minimize the cost function. The equations implemented in the code below are described in detail in the `ex1.pdf` file in the exercise folder." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we create a cost function that takes the parameters θ as its argument:\n", "$$J\\left( \\theta \\right)=\\frac{1}{2m}\\sum\\limits_{i=1}^{m}{{{\\left( {{h}_{\\theta }}\\left( {{x}^{(i)}} \\right)-{{y}^{(i)}} \\right)}^{2}}}$$\n", "where: \\\\[{{h}_{\\theta }}\\left( x \\right)={{\\theta }^{T}}x={{\\theta }_{0}}{{x}_{0}}+{{\\theta }_{1}}{{x}_{1}}+{{\\theta }_{2}}{{x}_{2}}+...+{{\\theta }_{n}}{{x}_{n}}\\\\] " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def computeCost(X, y, theta):\n", "    # X: (m, n) matrix, y: (m, 1) matrix, theta: (1, n) matrix\n", "    inner = np.power(((X * theta.T) - y), 2)\n", "    return np.sum(inner) / (2 * len(X))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add a column of ones to the training set so that we can use a vectorized solution to compute the cost and the gradients." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "data.insert(0, 'Ones', 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's initialize some variables." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# set X (training data) and y (target variable)\n", "cols = data.shape[1]\n", "X = data.iloc[:,0:cols-1]  # X: all rows, every column except the last\n", "y = data.iloc[:,cols-1:cols]  # y: all rows, last column only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check that X (training set) and y (target variable) look correct." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Ones</th><th>Population</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>6.1101</td></tr>\n", "<tr><th>1</th><td>1</td><td>5.5277</td></tr>\n", "<tr><th>2</th><td>1</td><td>8.5186</td></tr>\n", "<tr><th>3</th><td>1</td><td>7.0032</td></tr>\n", "<tr><th>4</th><td>1</td><td>5.8598</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Ones Population\n", "0 1 6.1101\n", "1 1 5.5277\n", "2 1 8.5186\n", "3 1 7.0032\n", "4 1 5.8598" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.head()  # head() shows the first 5 rows" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Profit</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>17.5920</td></tr>\n", "<tr><th>1</th><td>9.1302</td></tr>\n", "<tr><th>2</th><td>13.6620</td></tr>\n", "<tr><th>3</th><td>11.8540</td></tr>\n", "<tr><th>4</th><td>6.8233</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Profit\n", "0 17.5920\n", "1 9.1302\n", "2 13.6620\n", "3 11.8540\n", "4 6.8233" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cost function expects numpy matrices, so we need to convert X and y before we can use them. We also need to initialize theta." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "X = np.matrix(X.values)\n", "y = np.matrix(y.values)\n", "theta = np.matrix(np.array([0,0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "theta is a (1, 2) matrix." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[0, 0]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the dimensions." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((97, 2), (1, 2), (97, 1))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape, theta.shape, y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the cost function (with theta initialized to 0)." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32.072733877455676" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "computeCost(X, y, theta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Batch gradient descent\n", "$${{\\theta }_{j}}:={{\\theta }_{j}}-\\alpha \\frac{\\partial }{\\partial {{\\theta }_{j}}}J\\left( \\theta \\right)$$" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def gradientDescent(X, y, theta, alpha, iters):\n", "    temp = np.matrix(np.zeros(theta.shape))\n", "    parameters = int(theta.ravel().shape[1])\n", "    cost = np.zeros(iters)\n", "    \n", "    for i in range(iters):\n", "        # residuals h(x) - y for every training example\n", "        error = (X * theta.T) - y\n", "        \n", "        for j in range(parameters):\n", "            term = np.multiply(error, X[:,j])\n", "            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))\n", "            \n", "        theta = temp\n", "        # record the cost after this update\n", "        cost[i] = computeCost(X, y, theta)\n", "        \n", "    return theta, cost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize some additional variables: the learning rate α and the number of iterations to perform." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "alpha = 0.01\n", "iters = 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's run the gradient descent algorithm to fit our parameters θ to the training set." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[-3.24140214, 1.1272942 ]])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g, cost = gradientDescent(X, y, theta, alpha, iters)\n", "g" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can compute the cost (error) of the trained model using the fitted parameters." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.515955503078912" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "computeCost(X, y, g)" ] }, 
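{ "cell_type": "markdown", "metadata": {}, "source": [ "As a worked step connecting the update rule to the `gradientDescent` code above (a sketch of the standard derivation, added for illustration and not taken from the original exercise text): differentiating the cost $J\\left( \\theta \\right)$ defined earlier with respect to ${{\\theta }_{j}}$ gives\n", "$$\\frac{\\partial }{\\partial {{\\theta }_{j}}}J\\left( \\theta \\right)=\\frac{1}{m}\\sum\\limits_{i=1}^{m}{\\left( {{h}_{\\theta }}\\left( {{x}^{(i)}} \\right)-{{y}^{(i)}} \\right)x_{j}^{(i)}}$$\n", "so each iteration performs\n", "$${{\\theta }_{j}}:={{\\theta }_{j}}-\\frac{\\alpha }{m}\\sum\\limits_{i=1}^{m}{\\left( {{h}_{\\theta }}\\left( {{x}^{(i)}} \\right)-{{y}^{(i)}} \\right)x_{j}^{(i)}}$$\n", "for all $j$ simultaneously. In the code, `error` holds the residuals ${{h}_{\\theta }}\\left( {{x}^{(i)}} \\right)-{{y}^{(i)}}$, `np.sum(term)` is the sum over $i$, and `alpha / len(X)` is $\\alpha /m$." ] }, 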
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now let's plot the linear model together with the data to see visually how well it fits." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = np.linspace(data.Population.min(), data.Population.max(), 100)\n", "f = g[0, 0] + (g[0, 1] * x)\n", "\n", "fig, ax = plt.subplots(figsize=(12,8))\n", "ax.plot(x, f, 'r', label='Prediction')\n", "ax.scatter(data.Population, data.Profit, label='Training Data')\n", "ax.legend(loc=2)\n", "ax.set_xlabel('Population')\n", "ax.set_ylabel('Profit')\n", "ax.set_title('Predicted Profit vs. Population Size')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the gradient descent function also returns a vector with the cost at each training iteration, we can plot that as well. Note that the cost decreases at every iteration, as expected for a convex optimization problem with a sufficiently small learning rate." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(12,8))\n", "ax.plot(np.arange(iters), cost, 'r')\n", "ax.set_xlabel('Iterations')\n", "ax.set_ylabel('Cost')\n", "ax.set_title('Error vs. Training Epoch')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression with multiple variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise 1 also includes a housing price data set with 2 variables (size of the house, number of bedrooms) and a target (price of the house). Let's analyze this data set using the techniques we have already applied." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Size</th><th>Bedrooms</th><th>Price</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>2104</td><td>3</td><td>399900</td></tr>\n", "<tr><th>1</th><td>1600</td><td>3</td><td>329900</td></tr>\n", "<tr><th>2</th><td>2400</td><td>3</td><td>369000</td></tr>\n", "<tr><th>3</th><td>1416</td><td>2</td><td>232000</td></tr>\n", "<tr><th>4</th><td>3000</td><td>4</td><td>539900</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Size Bedrooms Price\n", "0 2104 3 399900\n", "1 1600 3 329900\n", "2 2400 3 369000\n", "3 1416 2 232000\n", "4 3000 4 539900" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = 'ex1data2.txt'\n", "data2 = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])\n", "data2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this task we add another preprocessing step: feature normalization. This is very easy with pandas." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>Size</th><th>Bedrooms</th><th>Price</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>0.130010</td><td>-0.223675</td><td>0.475747</td></tr>\n", "<tr><th>1</th><td>-0.504190</td><td>-0.223675</td><td>-0.084074</td></tr>\n", "<tr><th>2</th><td>0.502476</td><td>-0.223675</td><td>0.228626</td></tr>\n", "<tr><th>3</th><td>-0.735723</td><td>-1.537767</td><td>-0.867025</td></tr>\n", "<tr><th>4</th><td>1.257476</td><td>1.090417</td><td>1.595389</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Size Bedrooms Price\n", "0 0.130010 -0.223675 0.475747\n", "1 -0.504190 -0.223675 -0.084074\n", "2 0.502476 -0.223675 0.228626\n", "3 -0.735723 -1.537767 -0.867025\n", "4 1.257476 1.090417 1.595389" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2 = (data2 - data2.mean()) / data2.std()\n", "data2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's repeat the preprocessing steps from part 1 and run the linear regression procedure on the new data set." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-1.10910099e-16 8.78503652e-01 -4.69166570e-02]]\n" ] }, { "data": { "text/plain": [ "0.1307033696077189" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# add ones column\n", "data2.insert(0, 'Ones', 1)\n", "\n", "# set X (training data) and y (target variable)\n", "cols = data2.shape[1]\n", "X2 = data2.iloc[:,0:cols-1]\n", "y2 = data2.iloc[:,cols-1:cols]\n", "\n", "# convert to matrices and initialize theta\n", "X2 = np.matrix(X2.values)\n", "y2 = np.matrix(y2.values)\n", "theta2 = np.matrix(np.array([0,0,0]))\n", "\n", "# perform linear regression on the data set\n", "g2, cost2 = gradientDescent(X2, y2, theta2, alpha, iters)\n", "\n", "print(g2)\n", "# get the cost (error) of the model\n", "computeCost(X2, y2, g2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a quick look at the training progress for this one as well." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(12,8))\n", "ax.plot(np.arange(iters), cost2, 'r')\n", "ax.set_xlabel('Iterations')\n", "ax.set_ylabel('Cost')\n", "ax.set_title('Error vs. Training Epoch')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of implementing these algorithms from scratch, we can also use scikit-learn's linear regression. Let's apply scikit-learn's linear regression to the data from part 1 and see how it performs." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression()" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import linear_model\n", "model = linear_model.LinearRegression()\n", "model.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predictions of the scikit-learn model" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = np.array(X[:, 1].A1)\n", "f = model.predict(X).flatten()\n", "\n", "fig, ax = plt.subplots(figsize=(12,8))\n", "ax.plot(x, f, 'r', label='Prediction')\n", "ax.scatter(data.Population, data.Profit, label='Training Data')\n", "ax.legend(loc=2)\n", "ax.set_xlabel('Population')\n", "ax.set_ylabel('Profit')\n", "ax.set_title('Predicted Profit vs. Population Size')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Normal equation\n", "The normal equation finds the parameters that minimize the cost function by solving $\\frac{\\partial }{\\partial {{\\theta }_{j}}}J\\left( \\theta \\right)=0$ for every $j$.\n", "Suppose the training feature matrix is X (including ${{x}_{0}}=1$) and the training targets form the vector y; the normal equation then gives $\\theta ={{\\left( {{X}^{T}}X \\right)}^{-1}}{{X}^{T}}y$.\n", "The superscript T denotes the matrix transpose and the superscript -1 denotes the matrix inverse. Let $A={{X}^{T}}X$; then ${{\\left( {{X}^{T}}X \\right)}^{-1}}={{A}^{-1}}$.\n", "\n", "Comparison of gradient descent and the normal equation:\n", "\n", "Gradient descent: requires choosing a learning rate α, needs many iterations, still works well when the number of features n is large, and applies to many kinds of models.\n", "\n", "Normal equation: no learning rate α to choose and the solution is computed in one step, but it requires computing ${{\\left( {{X}^{T}}X \\right)}^{-1}}$. This is expensive when n is large, because matrix inversion takes $O(n^3)$ time, although it is usually acceptable when $n$ is below 10000. It applies only to linear models and is not suitable for other models such as logistic regression." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# normal equation\n", "def normalEqn(X, y):\n", "    theta = np.linalg.inv(X.T@X)@X.T@y  # X.T@X is equivalent to X.T.dot(X)\n", "    return theta" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[-3.89578088],\n", " [ 1.19303364]])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "final_theta2=normalEqn(X, y)  # differs somewhat from the batch gradient descent theta, which had not fully converged after 1000 iterations\n", "final_theta2" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# gradient descent gave matrix([[-3.24140214, 1.1272942 ]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In exercise 2, we will look at logistic regression for classification problems." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 
[], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }