使用神经网络进行机器学习

发表于 2020-04-13 更新于 2020-12-27 分类于机器学习本文字数： 18k 阅读时长 ≈ 16 分钟

数据介绍

本次练习所用的数据集有5000个训练样本，每个样本对应于20x20大小的灰度图像。这些训练样本包括了9-0共十个数字的手写图像。这些样本中每个像素都用浮点数表示。加载得到的数据中，每幅图像都被展开为一个400维的向量，构成了数据矩阵中的一行。完整的训练数据是一个5000x400的矩阵，其每一行为一个训练样本（数字的手写图像）。数据中，对应于数字”0”的图像被标记为”10”，而数字”1”到”9”按照其自然顺序被分别标记为”1”到”9”。数据集保存在NN_data.mat.

模型表示

我们准备训练的神经网络是一个三层的结构，一个输入层，一个隐层以及一个输出层。由于我们训练样本（图像）是20x20的，所以输入层单元数为400（不考虑额外的偏置项，如果考虑单元个数需要+1）。在我们的程序中，数据会被加载到变量 $X$ 和 $y$ 里。

本项练习提供了一组训练好的网络参数 $(\Theta^{(1)}, \Theta^{(2)})$ 。这些数据存储在数据文件 NN_weights.mat，在程序中被加载到变量 Theta1 与 Theta2 中。参数的维度对应于第二层有25个单元、10个输出单元（对应于10个数字的类别）的网络。

import numpy as np
import scipy.io as sio
from scipy.optimize import fmin_cg
import matplotlib.pyplot as plt

def display_data(data, img_width=20):
    """将图像数据 data 按照矩阵形式显示出来"""
    plt.figure()
    # 计算数据尺寸相关数据
    n_rows, n_cols = data.shape
    img_height = n_cols // img_width

    # 计算显示行数与列数
    disp_rows = int(np.sqrt(n_rows))
    disp_cols = (n_rows + disp_rows - 1) // disp_rows

    # 图像行与列之间的间隔
    pad = 1
    disp_array = np.ones((pad + disp_rows*(img_height + pad),
                          pad + disp_cols*(img_width + pad)))

    idx = 0
    for row in range(disp_rows):
        for col in range(disp_cols):
            if idx > m:
                break
            # 复制图像块
            rb = pad + row*(img_height + pad)
            cb = pad + col*(img_width + pad)
            disp_array[rb:rb+img_height, cb:cb+img_width] = data[idx].reshape((img_height, -1), order='F')
            # 获得图像块的最大值，对每个训练样本分别归一化
            max_val = np.abs(data[idx].max())
            disp_array[rb:rb+img_height, cb:cb+img_width] /= max_val
            idx += 1

    plt.imshow(disp_array)

    plt.gray()
    plt.axis('off')
    plt.savefig('data-array.png', dpi=150)
    plt.show()

前向传播与代价函数

现在你需要实现神经网络的代价函数及其梯度。首先需要使得函数 nn_cost_function 能够返回正确的代价值。

神经网络的代价函数（不包括正则化项）的定义为：

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[-y_k^{(i)} \log\left((h_{\theta}(x^{(i)}))_k\right) -(1 - y_k^{(i)}) \log\left(1 - (h_{\theta}(x^{(i)}))_k\right) \right]$

其中 $h_{\theta}(x^{(i)})$ 的计算如神经网络结构图所示， $K=10$ 是所有可能的类别数。这里的 $y$ 使用了one-hot 的表达方式。

运行程序，使用预先训练好的网络参数，确认你得到的代价函数是正确的。（正确的代价约为0.287629）。

代价函数的正则化

神经网络包括正则化项的代价函数为: </br>

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \sum_{k=1}^{K} \left[-y_k^{(i)} \log\left((h_{\theta}(x^{(i)}))_k\right) -(1 - y_k^{(i)}) \log\left(1 - (h_{\theta}(x^{(i)}))_k\right) \right] + \frac{\lambda}{2m} \left[\sum_{j=1}^{25} \sum_{k=1}^{400} (\Theta_{j,k}^{(1)})^2 +\sum_{j=1}^{10} \sum_{k=1}^{25} (\Theta_{j,k}^{(2)})^2 \right]$

注意在上面式子中，正则化项的加和形式与练习中设定的网络结构一致。但是你的代码实现要保证能够用于任意大小的神经网络。
此外，还需要注意，对应于偏置项的参数不能包括在正则化项中。对于矩阵 Theta1 与 Theta2 而言，这些项对应于矩阵的第一列。

运行程序，使用预先训练好的权重数据，设置正则化系数$\lambda=1$ (lmb) 确认你得到的代价函数是正确的。（正确的代价约为0.383770）。

此步练习需要你补充实现 nn_cost_function 。

def convert_to_one_hot(y,c):
    v = np.eye(c+1)[y.reshape(-1)].T  # 11是因为有个非常讨厌的编码10出现
    v = np.delete(v, 0, axis = 0)    # 编码后是11位制的，第一行都是0删掉
    return v.T

def nn_cost_function(nn_params, *args):
    """神经网络的损失函数"""
    # Unpack parameters from *args
    input_layer_size, hidden_layer_size, num_labels, lmb, X, y = args
    # Unroll weights of neural networks from nn_params
    Theta1 = nn_params[:hidden_layer_size*(input_layer_size + 1)]
    Theta1 = Theta1.reshape((hidden_layer_size, input_layer_size + 1))
    Theta2 = nn_params[hidden_layer_size*(input_layer_size + 1):]
    Theta2 = Theta2.reshape((num_labels, hidden_layer_size + 1))
    # 设置变量
    m = X.shape[0]
    # You need to return the following variable correctly
    J = 0.0
    # ====================== 你的代码 ======================
    a_1 = np.hstack((np.ones((m, 1)), X))   # Add bias as 1

    Z_2 = np.dot(Theta1, a_1.T) 
    a_2 = sigmoid(Z_2)
    a_2 = a_2.T
    a_2 = np.hstack((np.ones((m, 1)), a_2)) # Add bias as 1

    Z_3 = np.dot(Theta2, a_2.T)
    hypothesis = sigmoid(Z_3)
   
    one_hot_y = convert_to_one_hot(y, num_labels)  # 这里需要对y进行one-hot编码

    Theta1_without_bias = np.delete(Theta1, 0, axis = 1) # 删除 input add bias as 1 在第一列
    Theta2_without_bias = np.delete(Theta2, 0, axis = 1)

    Regular = lmb / (2*m) * (np.square(Theta1_without_bias).sum() 
                           + np.square(Theta2_without_bias).sum() ) # 正则项, 偏置项不要正则
    loss = 0.0    
    for i in range(m): # 样本数量
        for j in range(num_labels): # 标签类别
            loss += (-one_hot_y[i,j] * np.log(hypothesis[j,i]) - (1 - one_hot_y[i,j]) * np.log(1 - hypothesis[j,i]))

    J = 1/m * loss + Regular
    #print('current cost: ', J)
    # ======================================================
    return J

误差反传训练算法 (Backpropagation)

现在你需要实现误差反传训练算法。误差反传算法的思想大致可以描述如下。对于一个训练样本 $(x^{(t)}, y^{(t)})$ ，我们首先使用前向传播计算网络中所有单元（神经元）的激活值（activation），包括假设输出 $h_{\Theta}(x)$ 。那么，对于第 $l$ 层的第 $j$ 个节点，我们期望计算出一个“误差项” $\delta_{j}^{(l)}$ 用于衡量该节点对于输出的误差的“贡献”。

对于输出节点，我们可以直接计算网络的激活值与真实目标值之间的误差。对于我们所训练的第3层为输出层的网络，这个误差定义了 $\delta_{j}^{(3)}$ 。对于隐层单元，需要根据第 $l+1$ 层的节点的误差的加权平均来计算 $\delta_{j}^{(l)}$ 。

下面是误差反传训练算法的细节（如图3所示）。你需要在一个循环中实现步骤1至4。循环的每一步处理一个训练样本。第5步将累积的梯度除以 $m$ 以得到神经网络代价函数的梯度。

设输入层的值 $a^{(1)}$ 为第 $t$ 个训练样本 $x^{(t)}$ 。执行前向传播，计算第2层与第3层各节点的激活值( $z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)}$ )。注意你需要在 $a^{(1)}$ 与 $a^{(2)}$ 增加一个全部为 +1 的向量，以确保包括了偏置项。在 numpy 中可以使用函数 ones ， hstack, vstack 等完成（向量化版本）。
对第3层中的每个输出单元 $k$ ，计算
$\delta_{k}^{(3)} = a_{k}^{(3)} - y_k$
其中 $y_k \in \{0, 1\}$ 表示当前训练样本是否是第 $k$ 类。
对隐层 $l=2$ , 计算
$\delta^{(2)} = \left( \Theta^{(2)} \right)^T \delta^{(3)} .* g^{\prime} (z^{(2)})$
其中$g^{\prime}$ 表示 Sigmoid 函数的梯度， .* 在 numpy 中是通常的逐个元素相乘的乘法，矩阵乘法应当使用 numpy.dot 函数。
使用下式将当前样本梯度进行累加：
$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$
在 numpy 中，数组可以使用 += 运算。
计算神经网络代价函数的（未正则化的）梯度，
$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}$

这里，你需要（部分）完成函数 nn_grad_function 。程序将使用函数 check_nn_gradients 来检查你的实现是否正确。在使用循环的方式完成函数 nn_grad_function 后，建议尝试使用向量化的方式重新实现这个函数。

神经网络的正则化

你正确实现了误差反传训练算法之后，应当在梯度中加入正则化项。

假设你在误差反传算法中计算了 $\Delta_{ij}^{(l)}$ ，你需要增加的正则化项为

$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} \qquad \text{for } j = 0 \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} \qquad \text{for } j \geq 1$

注意你不应该正则化 $\Theta^{(l)}$ 的第一列，因其对应于偏置项。

此步练习需要你补充实现函数 nn_grad_function 。


def nn_grad_function(nn_params, *args):
    """神经网络的损失函数梯度计算 """
    
    # 获得参数信息
    input_layer_size, hidden_layer_size, num_labels, lmb, X, y = args
    # 得到各个参数的权重值
    Theta1 = nn_params[:hidden_layer_size*(input_layer_size + 1)]
    Theta1 = Theta1.reshape((hidden_layer_size, input_layer_size + 1))
    Theta2 = nn_params[hidden_layer_size*(input_layer_size + 1):]
    Theta2 = Theta2.reshape((num_labels, hidden_layer_size + 1))

    # 设置变量
    m = X.shape[0]

    # ====================== 你的代码 =====================
    a_1 = np.hstack((np.ones((m, 1)), X)) # Add bias as 1

    Z_2 = np.dot(Theta1, a_1.T)
    a_2 = sigmoid(Z_2)
    a_2 = a_2.T
    a_2 = np.hstack((np.ones((m, 1)), a_2)) # Add bias as 1

    Z_3 = np.dot(Theta2, a_2.T)
    a_3 = sigmoid(Z_3)
    a_3 = a_3.T

    one_hot_y = convert_to_one_hot(y, num_labels) # 这里需要对y进行one-hot编码

    Delta_2 = np.zeros_like(Theta2)
    Delta_1 = np.zeros_like(Theta1)

    for i in range(m): # 对每一个样本
        delta_3 = a_3[i] - one_hot_y[i]
        grad = np.dot(Theta2.T, delta_3)
        part_grad = np.delete(grad, 0)    # 把偏置项删掉, 我不确定是否应该这样做
        delta_2 = part_grad * sigmoid_gradient(Z_2[:,i])
        Delta_2 = Delta_2 + np.dot(delta_3.reshape(-1, 1), np.matrix(a_2[i]))
        Delta_1 = Delta_1 + np.dot(delta_2.reshape(-1, 1), np.matrix(a_1[i]))
    
    Theta2_grad = 1/m * Delta_2 + lmb/m * Theta2
    Theta1_grad = 1/m * Delta_1 + lmb/m * Theta1
    # =====================================================
    
    grad = np.hstack((Theta1_grad.flatten(), Theta2_grad.flatten()))
    grad = np.array(grad).flatten() # 如果不加这一行，会在optimize.py等多处报错，例如deltak = numpy.dot(gfk, gfk)，error：(dim 1) != 1 (dim 0)
    return grad

误差反传训练算法

`Sigmoid` 函数及其梯度

Sigmoid 函数定义为

$\text{sigmoid}(z) = g(z) = \frac{1}{1+\exp(-z)}$

Sigmoid 函数的梯度可以按照下式进行计算

$g^{\prime}(z) = \frac{d}{dz} g(z) = g(z)(1-g(z))$

为验证你的实现是正确的，以下事实可供你参考。当 $z=0$ 是，梯度的精确值为 0.25 。当 $z$ 的值很大（可正可负）时，梯度值接近于0。

这里，你需要补充完成函数 sigmoid 与 sigmoid_gradient 。你需要保证实现的函数的输入参数可以为矢量和矩阵( numpy.ndarray)。

网络参数的随机初始化

训练神经网络时，使用随机数初始化网络参数非常重要。一个非常有效的随机初始化策略为，在范围 $[ -\epsilon_{init}, \epsilon_{init} ]$ 内按照均匀分布随机选择参数 $\Theta^{(l)}$ 的初始值。这里你需要设置 $\epsilon_{init} = 0.12$ 。这个范围保证了参数较小且训练过程高效。

你需要补充实现函数 rand_initialize_weigths 。

对于一般的神经网络，如果第 $l$ 层的输入单元数为 $L_{in}$ ，输出单元数为 $L_{out}$ ，则 $\epsilon_{init} = {\sqrt{6}}/{\sqrt{L_{in} + L_{out}}}$ 可以做为有效的指导策略。

1
2
3

def sigmoid(z):
    """Sigmoid 函数"""
    return 1.0/(1.0 + np.exp(-np.asarray(z)))

def sigmoid_gradient(z):
    """计算Sigmoid 函数的梯度"""
    g = np.zeros_like(z)
    # ======================　你的代码 ======================
    
    # 计算Sigmoid 函数的梯度g的值
    g = sigmoid(z)*(1.0-sigmoid(z))
    # =======================================================
    return g

def rand_initialize_weights(L_in, L_out):
    """ 初始化网络层权重参数"""

    # You need to return the following variables correctly
    W = np.zeros((L_out, 1 + L_in))
    # ====================== 你的代码 ======================
    #epsilon_init = 0.12
    epsilon_init = np.sqrt(6.0) / np.sqrt(L_in + L_out)
    print('epsilon_init: ', epsilon_init)
    #初始化网络层的权重参数
    M, N = W.shape
    for i in range(M):
        for j in range(N):
            W[i,j] = np.random.uniform(-epsilon_init, epsilon_init)
    #print(W)
    # ======================================================
    return W

def debug_initialize_weights(fan_out, fan_in):
    """Initalize the weights of a layer with
    fan_in incoming connections and
    fan_out outgoing connection using a fixed strategy."""

    W = np.linspace(1, fan_out*(fan_in+1), fan_out*(fan_in+1))
    W = 0.1*np.sin(W).reshape(fan_out, fan_in + 1)
    return W

def compute_numerical_gradient(cost_func, theta):
    """Compute the numerical gradient of the given cost_func
    at parameter theta"""

    numgrad = np.zeros_like(theta)
    perturb = np.zeros_like(theta)
    eps = 1.0e-4
    for idx in range(len(theta)):
        perturb[idx] = eps
        loss1 = cost_func(theta - perturb)
        loss2 = cost_func(theta + perturb)
        numgrad[idx] = (loss2 - loss1)/(2*eps)
        perturb[idx] = 0.0
    return numgrad

检查梯度

在神经网络中，需要最小化代价函数 $J(\Theta)$ 。为了检查梯度计算是否正确，考虑把参数 $\Theta^{(1)}$ 和 $\Theta^{(2)}$ 展开为一个长的向量 $\theta$ 。假设函数 $f_i(\theta)$ 表示 $\frac{\partial}{\partial \theta_i} J(\theta)$ 。

令

$\theta^{(i+)} = \theta + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \epsilon \\ \vdots \\ 0 \end{bmatrix} \qquad \theta^{(i-)} = \theta - \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \epsilon \\ \vdots \\ 0 \end{bmatrix}$

上式中， $\theta^{(i+)}$ 除了第 $i$ 个元素增加了 $\epsilon$ 之外，其他元素均与 $\theta$ 相同。类似的， $\theta^{(i-)}$ 中仅第 $i$ 个元素减少了 $\epsilon$ 。可以使用数值近似验证 $f_i(\theta)$ 计算是否正确：

$f_i(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon}$

如果设 $\epsilon=10^{-4}$ ，通常上式左右两端的差异出现于第4位有效数字之后（经常会有更高的精度）。

在练习的程序代码中，函数 compute_numerical_gradient 已经实现，建议你认真阅读该函数并理解其实现原理与方案。

之后，程序将执行 check_nn_gradients 函数。该函数将创建一个较小的神经网络用于检测你的误差反传训练算法所计算得到的梯度是否正确。如果你的实现是正确的，你得到的梯度与数值梯度之后的绝对误差（各分量的绝对值差之和）应当小于 $10^{-9}$ 。

def check_nn_gradients(lmb=0.0):
    """Creates a small neural network to check the backgropagation
    gradients."""
    input_layer_size, hidden_layer_size = 3, 5
    num_labels, m = 3, 5

    Theta1 = debug_initialize_weights(hidden_layer_size, input_layer_size)
    Theta2 = debug_initialize_weights(num_labels, hidden_layer_size)

    X = debug_initialize_weights(m, input_layer_size - 1)
    y = np.array([1 + (t % num_labels) for t in range(m)])
    nn_params = np.hstack((Theta1.flatten(), Theta2.flatten()))

    cost_func = lambda x: nn_cost_function(x,
                                           input_layer_size,
                                           hidden_layer_size,
                                           num_labels, lmb, X, y)
    grad = nn_grad_function(nn_params,
                            input_layer_size, hidden_layer_size,
                            num_labels, lmb, X, y)
    numgrad = compute_numerical_gradient(cost_func, nn_params)
    print(np.vstack((numgrad, grad)).T, np.sum(np.abs(numgrad - grad)))
    print('The above two columns you get should be very similar.')
    print('(Left-Your Numerical Gradient, Right-Analytical Gradient)')

def predict(Theta1, Theta2, X):
    """模型预测"""
   
    m = X.shape[0]
    # num_labels = Theta2.shape[0]

    p = np.zeros((m,1), dtype=int)
    # ====================== 你的代码============================
    
    # 神经网络模型预测
    
    a_1 = np.hstack((np.ones((m, 1)), X))     # Add bias as 1

    Z_2 = np.dot(Theta1, a_1.T) 
    a_2 = sigmoid(Z_2)
    a_2 = a_2.T
    a_2 = np.hstack((np.ones((m, 1)), a_2)) # Add bias as 1

    Z_3 = np.dot(Theta2, a_2.T)
    hypothesis = sigmoid(Z_3)
    hypothesis = hypothesis.T

    one_hot_to_val = np.argmax(hypothesis, axis=1) + 1.0
    p = one_hot_to_val.reshape(-1, 1)

    ok = 0
    for i in range(m):
        if p[i] == y[i]:
            ok = ok + 1

    print("ok",ok)

    acc_rate = ok * 1.0 / m

    # ============================================================
    return acc_rate

# Parameters
input_layer_size = 400          # 20x20 大小的输入图像，图像内容为手写数字
hidden_layer_size = 25          # 25 hidden units
num_labels = 10                 # 10 类标号 从1到10

加载数据集

# =========== 第一部分 ===============
# 加载训练数据
print("Loading and Visualizing Data...")
data = sio.loadmat('data/NN_data.mat')
X, y = data['X'], data['y']

m = X.shape[0]

# 随机选取100个数据显示
rand_indices = np.array(range(m))
np.random.shuffle(rand_indices)
X_sel = X[rand_indices[:100]]

display_data(X_sel)

Loading and Visualizing Data...

png

加载神经网络模型的权重

# =========== 第二部分 ===============
print('Loading Saved Neural Network Parameters ...')

# Load the weights into variables Theta1 and Theta2
data = sio.loadmat('data/NN_weights.mat')
Theta1, Theta2 = data['Theta1'], data['Theta2']

# print Theta1.shape, (hidden_layer_size, input_layer_size + 1)
# print Theta2.shape, (num_labels, hidden_layer_size + 1)

Loading Saved Neural Network Parameters ...

# ================ Part 3: Compute Cost (Feedforward) ================

print('\nFeedforward Using Neural Network ...')

# Weight regularization parameter (we set this to 0 here).
lmb = 0.0

nn_params = np.hstack((Theta1.flatten(), Theta2.flatten()))
J = nn_cost_function(nn_params,
                     input_layer_size, hidden_layer_size,
                     num_labels, lmb, X, y)

print('Cost at parameters (loaded from PRML_NN_weights): %f ' % J)
print('(this value should be about 0.287629)')

Feedforward Using Neural Network ...
Cost at parameters (loaded from PRML_NN_weights): 0.287629 
(this value should be about 0.287629)

# =============== Part 4: Implement Regularization ===============
print('Checking Cost Function (w/ Regularization) ... ')
lmb = 1.0

J = nn_cost_function(nn_params,
                     input_layer_size, hidden_layer_size,
                     num_labels, lmb, X, y)

print('Cost at parameters (loaded from PRML_NN_weights): %f ' % J)
print('(this value should be about 0.383770)')

Checking Cost Function (w/ Regularization) ... 
Cost at parameters (loaded from PRML_NN_weights): 0.383770 
(this value should be about 0.383770)

# ================ Part 5: Sigmoid Gradient  ================
print('Evaluating sigmoid gradient...')

g = sigmoid_gradient([1, -0.5, 0, 0.5, 1])
print('Sigmoid gradient evaluated at [1 -0.5 0 0.5 1]:  ', g)

Evaluating sigmoid gradient...
Sigmoid gradient evaluated at [1 -0.5 0 0.5 1]:   [0.19661193 0.23500371 0.25       0.23500371 0.19661193]

神经网络参数初始化

#  ================ Part 6: Initializing Pameters ================
print('Initializing Neural Network Parameters ...')
initial_Theta1 = rand_initialize_weights(input_layer_size, hidden_layer_size)
initial_Theta2 = rand_initialize_weights(hidden_layer_size, num_labels)

# Unroll parameters
initial_nn_params = np.hstack((initial_Theta1.flatten(),
                               initial_Theta2.flatten()))

Initializing Neural Network Parameters ...
epsilon_init:  0.1188177051572009
epsilon_init:  0.4140393356054125


# =============== Part 7: Implement Backpropagation ===============
print('Checking Backpropagation... ')

# Check gradients by running checkNNGradients
check_nn_gradients()

Checking Backpropagation... 
[[ 1.27220311e-02  1.27220311e-02]
 [ 1.58832809e-04  1.58832809e-04]
 [ 2.17690452e-04  2.17690455e-04]
 [ 7.64045027e-05  7.64045009e-05]
 [ 6.46352264e-03  6.46352265e-03]
 [ 2.34983744e-05  2.34983735e-05]
 [-3.74199116e-05 -3.74199098e-05]
 [-6.39345021e-05 -6.39345006e-05]
 [-5.74199923e-03 -5.74199923e-03]
 [-1.34052016e-04 -1.34052019e-04]
 [-2.59146269e-04 -2.59146269e-04]
 [-1.45982635e-04 -1.45982634e-04]
 [-1.26792390e-02 -1.26792390e-02]
 [-1.67913183e-04 -1.67913187e-04]
 [-2.41809017e-04 -2.41809017e-04]
 [-9.33867517e-05 -9.33867522e-05]
 [-7.94573534e-03 -7.94573535e-03]
 [-4.76254503e-05 -4.76254501e-05]
 [-2.64923861e-06 -2.64923844e-06]
 [ 4.47626736e-05  4.47626708e-05]
 [ 1.09347722e-01  1.09347722e-01]
 [ 5.67965185e-02  5.67965185e-02]
 [ 5.25298306e-02  5.25298306e-02]
 [ 5.53542907e-02  5.53542907e-02]
 [ 5.59290833e-02  5.59290833e-02]
 [ 5.23534682e-02  5.23534682e-02]
 [ 1.08133003e-01  1.08133003e-01]
 [ 5.67319602e-02  5.67319602e-02]
 [ 5.14442931e-02  5.14442931e-02]
 [ 5.48296085e-02  5.48296085e-02]
 [ 5.56926532e-02  5.56926532e-02]
 [ 5.11795651e-02  5.11795651e-02]
 [ 3.06270372e-01  3.06270372e-01]
 [ 1.59463135e-01  1.59463135e-01]
 [ 1.45570264e-01  1.45570264e-01]
 [ 1.56700533e-01  1.56700533e-01]
 [ 1.56043968e-01  1.56043968e-01]
 [ 1.45771544e-01  1.45771544e-01]] 9.96691174908528e-11
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)

# =============== Part 8: Implement Regularization ===============
print('Checking Backpropagation (w/ Regularization) ... ')
# Check gradients by running checkNNGradients
lmb = 3.0
check_nn_gradients(lmb)

Checking Backpropagation (w/ Regularization) ... 
[[ 0.01272203  0.06321029]
 [ 0.05471668  0.05471668]
 [ 0.00868489  0.00868489]
 [-0.04533175 -0.04533175]
 [ 0.00646352 -0.05107193]
 [-0.01674143 -0.01674143]
 [ 0.03938178  0.03938178]
 [ 0.05929756  0.05929756]
 [-0.005742    0.01898511]
 [-0.03277532 -0.03277532]
 [-0.06025856 -0.06025856]
 [-0.03234036 -0.03234036]
 [-0.01267924  0.01253078]
 [ 0.05926853  0.05926853]
 [ 0.03877546  0.03877546]
 [-0.01736759 -0.01736759]
 [-0.00794574 -0.06562958]
 [-0.04510686 -0.04510686]
 [ 0.00898998  0.00898998]
 [ 0.05482148  0.05482148]
 [ 0.10934772  0.15983598]
 [ 0.11135436  0.11135436]
 [ 0.06099703  0.06099703]
 [ 0.00994614  0.00994614]
 [-0.00160637 -0.00160637]
 [ 0.03558854  0.03558854]
 [ 0.108133    0.1475522 ]
 [ 0.11609346  0.11609346]
 [ 0.0761714   0.0761714 ]
 [ 0.02218834  0.02218834]
 [-0.00430676 -0.00430676]
 [ 0.01898519  0.01898519]
 [ 0.30627037  0.33148039]
 [ 0.21889958  0.21889958]
 [ 0.18458753  0.18458753]
 [ 0.13942633  0.13942633]
 [ 0.09836012  0.09836012]
 [ 0.10071231  0.10071231]] 0.33076217369064975
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)

训练神经网络

# =================== Part 8: Training NN ===================
print('Training Neural Network...')

lmb, maxiter = 1.0, 50
args = (input_layer_size, hidden_layer_size, num_labels, lmb, X, y)
nn_params, cost_min, _, _, _ = fmin_cg(nn_cost_function,
                                       initial_nn_params,
                                       fprime=nn_grad_function,
                                       args=args,
                                       maxiter=maxiter,
                                       full_output=True)

Theta1 = nn_params[:hidden_layer_size*(input_layer_size + 1)]
Theta1 = Theta1.reshape((hidden_layer_size, input_layer_size + 1))
Theta2 = nn_params[hidden_layer_size*(input_layer_size + 1):]
Theta2 = Theta2.reshape((num_labels, hidden_layer_size + 1))

Training Neural Network...
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.449704
         Iterations: 50
         Function evaluations: 99
         Gradient evaluations: 99

模型预测

# ================= Part 9: Implement Predict =================

pred = predict(Theta1, Theta2, X)
# print(pred.shape, y.shape)
# print(np.hstack((pred, y)))

print('Training Set Accuracy:', pred)

ok 4796
Training Set Accuracy: 0.9592