Constructing a Convolutional Neural Network with a Suitable … 243

trained on it. In our work, this type of dataset is regarded as having low complexity, and a simpler model can perform well on it. Three methods are proposed in Sect. 3 to evaluate the complexity of a dataset, which reflects the lowest capacity required for a model to perform well on that dataset. The methods are based on the number of variances in scales and locations, and on the consistency between predictions from different models.

Optimization of model capacity. The definition of optimal capacity was given in [57]: the optimal capacity corresponds to the boundary between the under-fitting regime and the over-fitting regime. Existing research on capacity adjustment can be divided into three categories. The first category focuses on evaluating the contributions of height, width and the order of computations in a CNN to its expressive power [58, 59]. Developmental learning [60] tried to let a dataset be enlarged continuously until the capacity of a CNN is no longer expressive enough. The second category of work tried to compute the information-theoretic quantities inside a CNN [61, 62]. The third category focuses on evaluating model capacity with mathematical methods. Typical works are based on algebraic topology [63] and linear threshold functions [4].

Methods in the first category are implemented by increasing either the width or the height of a CNN and fine-tuning the entire network. It is well known that deeper neurons allow for new compositions of existing neurons while wider neurons allow for the discovery of additional task-specific clues [58]. However, the increase in capacity comes at the cost of efficiency. Reference [59] proposed introducing computations of higher orders to better utilize existing features without introducing extra parameters. High-order functions enrich the hypothesis space and improve the performance of existing network models such as ResNet [11] and WRN [64] on CIFAR10 and CIFAR100 [65].
However, over-fitting occurs more easily.

The second category of work aims at describing the learning of a CNN with the Information Bottleneck Principle. The entropies and mutual information within convolutional layers are computed using heuristic statistical physics methods under the assumption that weight matrices are independent. The behaviors of entropies and mutual information throughout the process of learning are studied in [61], and the conclusion is drawn that the relationship between compression and generalization is elusive.

The third type of work addresses the design of a CNN architecture by explaining how the architecture is determined by the topological complexity of a dataset. An empirical characterization of the topological capacity of a neural network model was provided in [63]. The topological phase transitions in neural networks upon the increase in datasets' complexity were also introduced. Moreover, [4] provided ways of estimating the capacity of typical neuronal models, including linear and polynomial threshold gates, linear and polynomial threshold gates with constrained weights, and ReLU neurons. However, only fully-connected neural networks were discussed in [4, 63].

Additionally, only very simple networks have been studied, and none of the three categories of work has focused on matching the capacity of a CNN to the complexity of a high-level vision task such as segmentation. In our work, the analysis of capacity on complex models and high-level vision tasks is included. The details are given in Sect. 4.
244 Y. Jiang and Z. Chi

2 Techniques to Fully Explore the Potential of Low-Capacity Networks

2.1 Methodology

2.1.1 Training Strategies

1. Layer-wise re-training

In this part, we propose a layer-wise training scheme which outperforms existing end-to-end training schemes as well as the existing layer-wise training schemes proposed in [28]. It is common practice to train a neural network in an end-to-end fashion on the ImageNet dataset and then fine-tune it on segmentation tasks. However, training stage-by-stage can boost the performance of a CNN model. Suppose a CNN is divided into two parts: a feature extractor and task-specific layers. We propose to train the feature extractor layer-by-layer (Fig. 3).

Denote $x_K$ as the output of the feature extractor and $z_K$ the prediction from the task-specific layers, where $K$ is the number of layers in the feature extractor. In Training Step 1, the original feature extractor and the task-specific layers are trained together. In each step that follows, an additional layer is added to the feature extractor, and the task-specific layers, which have a fixed structure, are re-built on top of the newly added layer. $H$ layers are added one by one. The outputs of the feature extractor and those of the task-specific layers are denoted as $x_{K+h}$ and $z_{K+h}$, $h = 1, \ldots, H$. Denote the training dataset as $\{(x_n^0, y_n)\}$, $n = 1, \ldots, N$, where $N$ is the number of training samples. $\theta_h$, $h = 0, \ldots, H$, denotes the parameters of the feature extractor after the $h$-th layer is added, and $\gamma_h$, $h = 0, \ldots, H$, denotes the parameters of the task-specific layers upon adding the $h$-th layer and fine-tuning the overall network. $\theta_0$ and $\gamma_0$ correspond to the original network.
The training process in the $h$-th step ($h \ge 1$) can be formalized as the minimization of the soft-max loss:

$$\theta_h^*, \gamma_h^* = \arg\min_{\theta_h, \gamma_h} \frac{1}{N} \sum_n L_{softmax}\left(z_{K+h}\left(x_n^0; \theta_h, \gamma_h\right), y_n\right) \tag{2}$$

Fig. 3 The proposed scheme for adding layers to a CNN and re-training. (Panels: Training Step 1, Training Step 2, …, Training Step H+1. The input image passes through the feature extractor (K layers) and Added Layers 1, …, H; the outputs $x_K, x_{K+1}, \ldots, x_{K+H}$ feed the task-specific layers, which produce $z_K, z_{K+1}, \ldots, z_{K+H}$.)
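As a rough illustration of the schedule behind Fig. 3, the sketch below lists the training phases that follow the addition of each layer and compares the total iteration count with an end-to-end budget. It assumes the two phases per added layer (added layer only, then the whole network) and the 40,000 iterations per phase reported in this section for Algorithm 1. The names `Phase`, `layerwise_schedule` and `end_to_end_budget` are hypothetical, and the training of the original network (Training Step 1) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    added_layers: int   # h: number of layers added to the feature extractor so far
    trainable: str      # which parameters are updated in this phase
    iterations: int     # number of training iterations in this phase


def layerwise_schedule(num_added: int, iters_per_phase: int = 40_000) -> List[Phase]:
    """List the phases for adding `num_added` layers one by one.

    Assumption: each added layer is followed by two phases, first the
    added layer alone, then the overall network.
    """
    phases = []
    for h in range(1, num_added + 1):
        phases.append(Phase(h, "added layer only", iters_per_phase))
        phases.append(Phase(h, "whole network", iters_per_phase))
    return phases


def end_to_end_budget(num_added: int, iters_per_phase: int = 40_000) -> int:
    """Iterations for an end-to-end run with the same total budget."""
    return iters_per_phase * num_added * 2


if __name__ == "__main__":
    for phase in layerwise_schedule(2):
        print(phase)
    print("end-to-end iterations:", end_to_end_budget(2))
```

With two added layers, the four phases sum to the same number of iterations as the single end-to-end run, so the two schemes are compared under an equal training budget.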
The process of training is shown in Algorithm 1. As shown in Algorithm 1, the network is re-trained twice upon the addition of each convolutional layer (Step 3 and Step 4). In Step 3, only the parameters in the newly added layer are optimized, while in Step 4 the overall network is re-trained. $\theta_h - \theta_h \cap \theta_{h-1}$ denotes the parameters in Layer $h$. It is shown in [28] that re-training upon the addition of each layer contributes to improvements in accuracy and outperforms end-to-end training:

$$\hat{R}\left(z_{K+h}; \theta_{h-1}^*, (\theta_h - \theta_h \cap \theta_{h-1})^*, \gamma_{h-1}^*\right) \le \hat{R}\left(z_{K+h-1}; \theta_{h-1}^*, \gamma_{h-1}^*\right) \tag{3}$$

where $\hat{R}\left(z_{K+h}; \theta_h^*, \gamma_h^*\right) = \frac{1}{N} \sum_n L_{softmax}\left(z_{K+h}\left(x_n^0; \theta_h^*, \gamma_h^*\right), y_n\right)$ and $h \ge 1$. Different from [28], where only Step 3 is conducted for each added layer, all the layers in the CNN are optimized again after optimizing the single added layer:

$$\hat{R}\left(z_{K+h}; \theta_h^*, \gamma_h^*\right) \le \hat{R}\left(z_{K+h}; \theta_{h-1}^*, (\theta_h - \theta_h \cap \theta_{h-1})^*, \gamma_{h-1}^*\right) \tag{4}$$

$$\hat{R}\left(z_{K+h}; \theta_h^*, \gamma_h^*\right) \le \hat{R}\left(z_{K+h-1}; \theta_{h-1}^*, \gamma_{h-1}^*\right) \tag{5}$$

Because Step 4 starts from the parameters obtained in Step 3, Stochastic Gradient Descent (SGD) can at least keep the loss from increasing, which gives (4); (5) then follows from (3) and (4). From (4) and (5) it can be inferred that by optimizing the complete CNN after the optimization of the added layer in each step, the loss can be lower than when only the added layer is optimized.

It is already shown in [28] that by adding one layer and training the added layer in each step, a network with 11 convolutional layers performs as well as VGG-13
[9] with 13 layers and the same width but trained end-to-end. Moreover, the layer-wise trained network outperforms VGG-11 [9] with 11 layers and the same width but trained end-to-end. As a result, the two-step optimization scheme proposed in Algorithm 1 contributes to larger improvements over end-to-end training schemes than the one-step scheme proposed in [28].

A comparison was made on the Look into Person dataset [66]. There are 30,462 images for training, 10,000 images for validation and 10,000 images for testing. Layers are added to the pre-trained baseline model with 25 layers [67]. Upon the addition of each layer, training was conducted for 40,000 iterations and the batch size was set to 10. The end-to-end trained network with the same structure was trained for 40,000 × (number of layers added) × 2 iterations. For instance, if two layers are added to the baseline model, the network constructed with Algorithm 1 is trained for 40,000 iterations for optimizing Layer 1 only, 40,000 iterations for optimizing the overall network, 40,000 iterations for optimizing Layer 2 only and 40,000 iterations for optimizing the overall network. The end-to-end counterpart is trained at once for 160,000 iterations. The initial learning rate is 2e-4, and a polynomial learning rate policy is adopted with its power set to 0.9. The preprocessing techniques proposed in [68] are applied to both networks. Comparisons are shown in Table 1.

From Table 1, it can be inferred that the network trained with Algorithm 1 outperforms the network with the same capacity but trained end-to-end. Moreover, the networks trained with Algorithm 1 (2-step optimization) outperform those trained using the scheme proposed in [28] (1-step optimization for each layer). As a result, Algorithm 1 can better explore the potential of a CNN without increasing the number of parameters.

Table 1 mIOU (%) of CNNs on the Look into Person dataset [66]

| Training scheme | Layers added | mIOU (%) |
|---|---|---|
| Layer-wise trained (2-step optimization proposed in Algorithm 1) | H = 0 | 45.27 |
| Layer-wise trained (2-step optimization proposed in Algorithm 1) | H = 1 | 46.61 |
| Layer-wise trained (2-step optimization proposed in Algorithm 1) | H = 2 | 47.53 |
| Layer-wise trained (2-step optimization proposed in Algorithm 1) | H = 3 | 47.93 |
| Layer-wise trained (1-step optimization proposed in [28]) | H = 3 | 47.26 |
| End-to-end trained | H = 3 | 47.12 |
| End-to-end trained | H = 4 | 47.15 |

2. Teacher-student method

In vision tasks, it is noted in [69] that a larger model is better at extracting the structures from huge datasets than smaller models. However, the knowledge acquired by a large model can be transferred to a small model, and the smaller model trained in this way outperforms the same network trained directly on the hard ground truth labels. Reference [69] proposed the way of transferring the generalization capability
of a larger model to a smaller model: the method takes the activations from a large model, i.e. the soft-max activations, as soft targets for training a smaller model. The soft targets contain the similarity structures over training examples. For instance, a dog is more similar to a cat than to a flower, and such similarities are not provided by hard labels. The advantage of soft targets comes from this additional information, which is not included in hard targets.

To solve the problem that the probabilities of some classes are too small to influence the cross-entropy cost functions of smaller networks, a distillation process is proposed in [69]. The temperature $T$ of the final soft-max function can be raised until the values are suitable for training smaller networks:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{6}$$

where $q_i$ denotes the probability that an input image or a pixel belongs to class $i$, and $z_i$ denotes the corresponding activation. The higher the temperature $T$ is, the softer the distribution over classes becomes. The training data for distilling knowledge into a small model includes both original training images and unlabeled images. On the training set, the target is a weighted sum of the soft target produced by the large model, which uses a high temperature in the soft-max function, and the hard target given by the ground truth labels, which uses a temperature equal to 1. The weight of the former should be larger than that of the latter, and the magnitude of the first term should be multiplied by $T^2$ to cancel the influence brought by the changes in temperature.

The teacher-student method is applied in our work, as will be addressed in Sect. 3.
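A minimal NumPy sketch of the temperature-scaled soft-max in Eq. (6) and of a distillation objective that combines a soft term and a hard term is given below. The function names, the temperature `T=4.0` and the weight `soft_weight=0.9` are illustrative assumptions, not values taken from this chapter or from [69]; the only constraints carried over from the text are that the soft term has the larger weight and is scaled by $T^2$.

```python
import numpy as np


def log_softmax(logits, T=1.0):
    """Numerically stable log of the soft-max in Eq. (6) at temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def softmax_with_temperature(logits, T=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    return np.exp(log_softmax(logits, T))


def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=4.0, soft_weight=0.9):
    """Weighted sum of a soft-target term and a hard-target term.

    T and soft_weight are assumed example values.
    """
    # Soft term: cross-entropy between teacher and student at temperature T.
    p_teacher = softmax_with_temperature(teacher_logits, T)
    soft_term = -(p_teacher * log_softmax(student_logits, T)).sum(axis=-1).mean()
    # Hard term: ordinary cross-entropy with ground truth labels at T = 1.
    rows = np.arange(len(hard_labels))
    hard_term = -log_softmax(student_logits, 1.0)[rows, hard_labels].mean()
    # The soft term is multiplied by T^2 to compensate for the temperature.
    return soft_weight * (T ** 2) * soft_term + (1.0 - soft_weight) * hard_term


if __name__ == "__main__":
    logits = np.array([[1.0, 2.0, 3.0]])
    print(softmax_with_temperature(logits, T=1.0))
    print(softmax_with_temperature(logits, T=10.0))  # softer distribution
```

Raising `T` flattens the distribution, so small class probabilities contribute more to the soft term than they would at `T = 1`.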
2.1.2 Strategies of Data Augmentation

Unlike the scenarios in a classification task, where the gap between training accuracy and test accuracy is lower than 3% as long as the training set and the test set have similar distributions, the gap in a segmentation task is always larger than 10%. This indicates that a pixel-specific task is more sensitive to (small) variations than an image-specific task. The variations in the image segmentation task include the changes in poses, color, illumination and so forth. In our work, two methods are proposed to bridge the gap between the training data and test data. The variances in the test data are studied to enrich the training data. With the variances of the training data being more comprehensive, the potential of a model can be better explored.

As was introduced in [29], incomplete supervision concerns the situation in which a small amount of labeled data and abundant unlabeled data are available. In that case, the labeled data is insufficient to train a good learner. The unlabeled data are exploited by semi-supervised learning in addition to labeled data to improve learning performance.
Transductive learning proposed by [29] is a special type of semi-supervised learning and assumes that the available unlabeled data is exactly the test data. The two methods of data augmentation proposed in our work are similar to transductive learning in that they exploit unlabeled test data to improve performance.

1. Bridge the gap between poses

The first method for data augmentation generates training images which have similar poses to those from the test set. Besides a CNN model for segmentation, a model for human pose analysis was trained, as is introduced in our previous work [68]. Skeleton detection is a simpler task than person part segmentation, and there are more training data samples available. As a result, the predictions from the model for pose analysis tend to be more generalizable.

In the proposed method, each image is first divided into regions, each of which contains one person. For each region, a vector describing the pose is predicted. An algorithm is developed to find, for each person in the test data, the person from the training data with the most similar pose. Upon matching each pair of people, a homography transformation is conducted on the people from training images to make them more similar to those in test images. The ground truth masks for training images are transformed in the same way. The transformations produce mocked test images which are similar to test images but with labels. Some examples are shown in Fig. 4. People have been segmented out for clear demonstration. As can be seen from Fig. 4, a mocked test set with labels is generated based on training data. The transformations are based on the coordinates of predicted joints.

Fig. 4 Examples showing six pairs of images containing people with similar poses. Within each pair, the left image is from the test set and the right one is from the training set. We have searched the training set to find the person with the most similar pose to each person in each test image. Homography transformations are conducted on the foreground regions from training images
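The pose matching step can be sketched as a nearest-neighbor search over pose descriptors. The sketch below follows the descriptor that this chapter defines with Fig. 5 (offsets of the joints from their mean position, normalized by the maximal distance between pairs of joints) and the similarity measure given with it (sum of the Euclidean distances between the vertical and the horizontal offset vectors). It assumes that the two "histograms" are per-joint offset vectors; all function names are hypothetical, and the homography step is not covered.

```python
import numpy as np


def pose_descriptor(joints):
    """Return (vertical_offsets, horizontal_offsets) for a (J, 2) array of (x, y) joints."""
    joints = np.asarray(joints, dtype=np.float64)
    center = joints.mean(axis=0)                      # mean of the predicted joints
    diffs = joints[:, None, :] - joints[None, :, :]   # all pairwise differences
    scale = np.sqrt((diffs ** 2).sum(axis=-1)).max()  # maximal pairwise joint distance
    if scale == 0.0:
        raise ValueError("all joints coincide; the pose cannot be normalized")
    offsets = (joints - center) / scale
    return offsets[:, 1], offsets[:, 0]


def pose_distance(joints_a, joints_b):
    """Sum of Euclidean distances between the two pairs of offset vectors."""
    va, ha = pose_descriptor(joints_a)
    vb, hb = pose_descriptor(joints_b)
    return float(np.linalg.norm(va - vb) + np.linalg.norm(ha - hb))


def most_similar_pose(test_joints, train_joints_list):
    """Index of the training person whose pose is closest to the test person."""
    distances = [pose_distance(test_joints, t) for t in train_joints_list]
    return int(np.argmin(distances))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    test_pose = rng.random((14, 2))
    train_poses = [rng.random((14, 2)), 3.0 * test_pose + 5.0]
    print(most_similar_pose(test_pose, train_poses))
```

Because the offsets are centered and normalized, the distance is unaffected by translating or uniformly scaling a person, which is why the scaled and shifted copy in the usage example is the closest match.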
Fig. 5 The proposed descriptor for describing poses. a The indices of joints. b The offsets of different joints from the center. The offsets are normalized with respect to the maximal distance between different pairs of joints. The top histogram describes the normalized offsets in the vertical direction while the bottom one describes those in the horizontal direction

The mocked test set can enrich the training set and bridge the gap between training data and test data. Experiments will show that the involvement of the mocked test set during training improves the performance of segmentation.

To evaluate the similarity in poses, a feature descriptor is developed in our work. First, the 14 joints are defined following the discussion in [70]. The 14 joints are nose, neck, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee and right ankle. A center point can be computed by averaging the coordinates of the predicted joints; it is shown by the red dot in Fig. 5a. Figure 5 describes the proposed descriptor: two histograms are computed to describe the pose of each person in the form of horizontal and vertical translations. The similarity in pose is evaluated by the sum of Euclidean distances between the two pairs of histogram vectors.

For a test image with multiple people, its mocked counterpart is composed of foreground regions from different training images. An example is shown in Pair 6 in Fig. 4. The existence of other types of variances, such as rotations, scaling or changes in illumination, may still cause a gap between training data and test data. As a result, data augmentation in color variances is still required.

2. Bridge the gap in color

Besides the variances in pose, the differences in color also contribute to the gap between the training set and the test set.
In our work, a self-supervised CNN for colorization is adopted to learn on the test set. The model is used to colorize the grayscale versions of training images based on the distributions of colors which are learned from the test set. It is already shown in Fig. 4 that the training images can be used to generate mocked test images. Similarly, with the joint usage of the colorization model, the mocked test images can show more similarity in color to real test images.

Similar to human parsing, colorization is also a pixel-specific task. As a result, our model for segmentation can be used to conduct colorization by making changes to the loss function. According to the method proposed in [71], an RGB image is first converted to the Lab format. The L-channel corresponds to lightness while the a-channel and b-channel correspond to colors. As a result, the L-channel can be treated as the input and the remaining two channels as ground truth outputs. Suppose the height and width of an input image are $H$ and $W$, respectively. Denote $X \in \mathbb{R}^{H \times W \times 1}$ as an input image and $Y \in \mathbb{R}^{H \times W \times 2}$ the ground truth.

As is addressed in [71], colorization can be divided into two steps. In the first step, a distribution over ab-values $\hat{Z} \in [0, 1]^{H \times W \times Q}$ is predicted for each pixel, where $Q$ is the number of possible ab pairs; the distributions are converted to color values in the second step. The loss used for the first step is a weighted cross-entropy:

$$L(\hat{Z}, Z) = -\sum_{h,w} weight(h, w) \sum_q Z_{h,w,q} \log \hat{Z}_{h,w,q} \tag{7}$$

where $weight(h, w)$ denotes the weights on pixels; pixels with the most frequently appearing colors are assigned lower weights. All pixels are assumed to be independent and identically distributed according to an equally weighted sum of the color distribution $p$ from the ImageNet training set and a uniform distribution $\frac{1}{Q}$:

$$weight(h, w) \propto \left(0.5\,p + \frac{0.5}{Q}\right)^{-1} \tag{8}$$

In the second step, the predicted distribution mask $\hat{Z} \in [0, 1]^{H \times W \times Q}$ is converted to color values using a soft-max function with temperature equal to 0.38 for each pixel.
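The two colorization steps can be sketched in NumPy as follows: the weights of Eq. (8), the weighted cross-entropy of Eq. (7), and a temperature-based conversion of the predicted distributions to ab values. Several details are assumptions made for illustration rather than statements from this chapter: the weights are normalized so that their expectation under $p$ equals 1, the weight of a pixel is looked up at its ground truth color bin, and the final ab value is taken as the expectation over hypothetical bin centers after re-normalizing the distribution at temperature 0.38.

```python
import numpy as np


def rebalancing_weights(p, mix=0.5):
    """Per-bin weights proportional to ((1 - mix) * p + mix / Q)^(-1), as in Eq. (8)."""
    p = np.asarray(p, dtype=np.float64)
    q = p.size
    w = 1.0 / ((1.0 - mix) * p + mix / q)
    return w / (p * w).sum()  # assumed normalization: expected weight under p is 1


def weighted_cross_entropy(z_hat, z, bin_weights, eps=1e-12):
    """Eq. (7): z_hat and z have shape (H, W, Q); bin_weights has shape (Q,)."""
    pixel_weight = bin_weights[z.argmax(axis=-1)]  # assumed: weight of the true bin
    ce = -(z * np.log(np.clip(z_hat, eps, None))).sum(axis=-1)
    return float((pixel_weight * ce).sum())


def anneal(z_hat, T=0.38, eps=1e-12):
    """Re-normalize each pixel's distribution with a soft-max at temperature T."""
    log_z = np.log(np.clip(z_hat, eps, None)) / T
    log_z = log_z - log_z.max(axis=-1, keepdims=True)
    e = np.exp(log_z)
    return e / e.sum(axis=-1, keepdims=True)


def decode_ab(z_hat, bin_centers, T=0.38):
    """Expected ab value per pixel: (H, W, Q) @ (Q, 2) -> (H, W, 2)."""
    return anneal(z_hat, T) @ np.asarray(bin_centers, dtype=np.float64)


if __name__ == "__main__":
    p = np.array([0.7, 0.2, 0.1])  # toy color distribution over Q = 3 bins
    print(rebalancing_weights(p))  # rarer bins receive larger weights
    z_hat = np.array([[[0.5, 0.3, 0.2]]])
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    print(decode_ab(z_hat, centers))
```

A temperature below 1 sharpens each distribution before averaging, which moves the decoded color closer to the most probable bin while still blending neighboring bins.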
The two parts mentioned above can produce mocked test images which are similar both in colors and poses to real test images. The involvement of the generated images during training can improve the performance of human parsing, as is shown in Table 2.

Table 2 Improvements on mIOU (%) brought by data augmentation on poses and colors

| Method | mIOU (%) |
|---|---|
| Original DeepLab-V2 [50] | 64.94 |
| DeepLab-V2 [50] with data augmentation applied | 65.41 |
| Original segmentation module proposed in [68] | 67.43 |
| Segmentation module proposed in [68] with data augmentation applied | 67.94 |