I have just finished Andrew Ng’s new Deep Learning courses on Coursera (courses 1-3). They are part of his new project, DeepLearning.ai.
The courses include a series of interviews called Heroes of Deep Learning, which are very helpful for a newcomer entering the field. Several of those masters said that a newcomer should build a whole deep network from scratch, perhaps using only Python and NumPy, to understand things better. So, after the courses, I decided to build one on my own.
First of all, I made a to-do list of the functions and algorithms mentioned in the courses. I planned to implement most of them.
Second, I decided to begin with the MNIST dataset, the “Hello, World” of datasets. Later I will switch to other datasets to test my model. Maybe this is too ambitious, but I want to build a transferable model; I think that is the right direction toward general thinking.
With these decided, I could start the project.
I implemented a basic model made up of these functions:
ReLU(X)
softmax(X)
forward_propagation_each_layer(W, A_prev, b, activation_function=ReLU)
loss(Y_hat, Y)
cost(loss)
predict(Y_hat)
accuracy(Y_predict, Y)
backpropagate_cost(Y, AL)
backpropagate_softmax(AL, dAL, Y=None, ZL=None)
backpropagate_linear(dZ, W, A_prev)
backpropagate_ReLU(dA, Z)
forwardpropagation_all(X)
backpropagate_all(X, Y)
update_parameters()
model(X, Y, learning_rate=0.01, print_every=100, iteration=500, hidden_layers=, batch_size=128)
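To give a feeling for the forward half of that list, here is a minimal NumPy sketch of the first three functions. The bodies and the return values are my assumptions, not the notebook's code; I assume the course layout where each column of a matrix is one example.

```python
import numpy as np

def ReLU(X):
    # Element-wise max(0, x).
    return np.maximum(0, X)

def softmax(X):
    # Column-wise softmax; each column is one example (assumed layout).
    # Subtracting the column maximum avoids overflow in exp().
    shifted = X - np.max(X, axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=0, keepdims=True)

def forward_propagation_each_layer(W, A_prev, b, activation_function=ReLU):
    # One layer: linear step Z = W A_prev + b, then the activation.
    # Returning both Z and A is an assumption; Z is needed later for backprop.
    Z = W @ A_prev + b
    A = activation_function(Z)
    return Z, A
```

With hidden layers using the default ReLU and the last layer called with activation_function=softmax, every column of the final output sums to 1.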
I have to say, the math was complex for me. In my first implementation, I had almost 10 bugs in the calculations!
A few concepts were not concrete for me: which loss function should I use for multi-class classification? What is the derivative of softmax? When should I divide the result by m (the number of examples)?
After maybe 10 hours of debugging, during which I even implemented TensorFlow alternatives for a bunch of the functions, the model finally worked!
Day 2’s notebook <- Here are the code and formulas. (I found that Jupyter Notebook is great for commenting code!)
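For the first and the last of those questions, the usual answer is cross-entropy on top of softmax, with the division by m done once, in the cost. A small sketch under my own assumptions (one-hot Y, one column per example, and an eps guard that is not from the notebook):

```python
import numpy as np

def loss(Y_hat, Y, eps=1e-12):
    # Per-example cross-entropy, shape (1, m).
    # Y is one-hot, Y_hat holds softmax probabilities; eps guards log(0).
    return -np.sum(Y * np.log(Y_hat + eps), axis=0, keepdims=True)

def cost(losses):
    # The cost is the mean of the per-example losses.
    # This is the only place where we divide by m.
    m = losses.shape[1]
    return np.sum(losses) / m
```

Because the cost is an average, every gradient derived from it carries that single 1/m factor. Dividing by m again in a later layer would count it twice.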
The training set accuracy was already 1.0, so I looked at the dev set accuracy: 0.6? Oh, that’s bad. So I had a variance problem.
I tried to implement regularization, but it seemed to help little with the variance problem.
Then I suddenly figured out why: I was training on random batches, and the randomly drawn batches did not cover all the training examples, so I was wasting a lot of them.
So I changed to mini-batches: define a mini-batch size and use one segment of the training data at a time, which ensures every single example is used for training.
I also noticed a very interesting aspect of mini-batches: they have a very good effect on the variance problem, especially when the network is relatively shallow. I think it acts just like dropout! The network cannot depend on any single example!
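A minimal sketch of that idea, assuming one column per example. The function name and the shuffle at the start of each pass are my additions for illustration; plain sequential segments would cover the data as well.

```python
import numpy as np

def iterate_mini_batches(X, Y, batch_size=128, rng=None):
    # Yield (X_batch, Y_batch) pairs that together cover every example once.
    # Hypothetical helper; X and Y are assumed to have shape (features, m).
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[1]
    permutation = rng.permutation(m)  # new order for this pass
    for start in range(0, m, batch_size):
        idx = permutation[start:start + batch_size]
        # The last batch may be smaller than batch_size.
        yield X[:, idx], Y[:, idx]
```

One full loop over the generator is one epoch, and no example is skipped or used twice within it.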
Thanks to mini-batch, my dev accuracy jumped to 0.98, and test set accuracy was also 0.98. Not bad for today’s work!
Day 3’s notebook <- Here is mini-batch, regularization, train/dev/test accuracy.
Since the code worked but was ugly… I decided to refactor it.
Day 4’s notebook <- I had refactored half of the code.
Refactoring done. I am happy that the code looks clean now.
During refactoring, I attempted to calculate dL/dA for Z=WA+b, but since Z, W and A are all matrices, I failed to understand the matrix-by-matrix derivative. According to Wikipedia, the result seems to be a rank-four tensor. Luckily, in a neural network the final $L$ is a scalar, and the scalar-by-matrix derivative is much easier to understand.
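For the scalar case the result is short. Since $L$ is a scalar, dL/dW has the same shape as W and dL/dA has the same shape as A, and the chain rule through Z=WA+b reduces to two matrix products. A sketch of backpropagate_linear under my assumptions (one column per example, and dZ already carrying any 1/m factor from the cost):

```python
import numpy as np

def backpropagate_linear(dZ, W, A_prev):
    # Gradients of a scalar loss L through Z = W @ A_prev + b.
    # dZ is dL/dZ and has the same shape as Z.
    dW = dZ @ A_prev.T                      # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True)  # b is broadcast over columns, so sum them
    dA_prev = W.T @ dZ                      # same shape as A_prev
    return dW, db, dA_prev
```

A quick sanity check is to compare shapes: each gradient must match the shape of the variable it belongs to, which already rules out most wrong transposes.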
I also noticed that in TensorFlow, the final loss function is a vector! So they must understand the matrix-by-matrix derivative!
Day 5’s notebook <- Now the code is runnable and clean.
As there was still a variance problem, and L2 regularization did not seem to help much, I decided to implement dropout.
I chose “inverted dropout”, which Andrew Ng introduced in the course.
I just watched a video comparing algebra and geometry. It said that calculus and algebra give you great power to solve a problem by just computing, but geometry has its own beauty: sometimes it can solve a problem in a very simple way. Today, I felt that the dropout technique is analogous to geometry: simple, effective, and beautiful.
Now, after 100 iterations of learning from the training set, the dev accuracy rose to 98.34%.
There’s another lovely feature I added to the code: during training, I can press the notebook’s stop button, change some of the hyperparameters (anything except the architecture, i.e. the hidden layers), and run the cell again with Ctrl+Enter. The parameters W and b are kept, not re-initialized, so training continues without restarting from the beginning.
But by now I was spending more and more time running the program on the CPU. Actually I am lucky to have MKL for NumPy, so I can use all of my CPU cores, but it still felt like a bit of a waste of time. Maybe I will implement all this in TensorFlow and use my GPU to save time. And TensorFlow has automatic gradient computation …
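Roughly, inverted dropout zeroes each activation with probability 1 - keep_prob and divides the survivors by keep_prob, so the expected activation stays the same and nothing has to be rescaled at test time. A minimal sketch; the function names and the keep_prob default are my own choices for illustration:

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8, rng=None):
    # Inverted dropout on the activations A (training time only).
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(A.shape) < keep_prob  # True = unit is kept
    A_dropped = A * mask / keep_prob        # rescale so E[A_dropped] == A
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob=0.8):
    # Reuse the same mask and scaling on the gradient.
    return dA * mask / keep_prob
```

At dev and test time the dropout step is simply skipped.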
Day 6’s notebook <- Dropout version
This was the last day. I implemented some gradient checking functions to double-check my understanding and code.
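The idea of gradient checking is to compare the backprop gradient with a two-sided finite-difference estimate of the same cost. A generic sketch; the helper names and the epsilon value are my assumptions, and theta must be a float array:

```python
import numpy as np

def numerical_gradient(f, theta, epsilon=1e-7):
    # Approximate df/dtheta with (f(theta + eps) - f(theta - eps)) / (2 eps),
    # one entry at a time. f maps an array to a scalar cost.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        original = theta.flat[i]
        theta.flat[i] = original + epsilon
        plus = f(theta)
        theta.flat[i] = original - epsilon
        minus = f(theta)
        theta.flat[i] = original  # restore the entry
        grad.flat[i] = (plus - minus) / (2 * epsilon)
    return grad

def relative_difference(grad, grad_approx):
    # Near zero when the analytic gradient matches the estimate.
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator
```

A relative difference that is orders of magnitude above the epsilon scale usually points at a bug in the analytic gradient. Dropout should be switched off during the check, otherwise the cost changes between calls.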
It was quite strange to find that I had misunderstood backpropagation for softmax. When I tested it, there was a relatively large difference between the calculated result and the approximate result. So I looked for more information online, but the only explanations I could understand covered the softmax function for a vector (one sample).
Finally, I did it the way TensorFlow does: calculate softmax and the loss function at the same time. That formula is really simple. But I still felt I did not fully understand the principles.
Day 7’s notebook <- After correcting some functions, the result … showed no improvement … WHY…
I thought I was kind of stuck here, so I decided to move on and try some new ideas. Maybe I will come back and refine all this code later when I have gained more knowledge.
Thank you for reading, and you are welcome to leave a message below.
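For reference, when softmax and cross-entropy are taken together, the gradient with respect to the last layer's Z collapses to (softmax(Z) - Y) / m. A minimal sketch of that combined step, assuming one-hot Y and one column per example; the function name is mine, not the notebook's:

```python
import numpy as np

def softmax_cross_entropy(ZL, Y):
    # Returns the mean cross-entropy cost and dL/dZL in one step.
    m = Y.shape[1]
    # Log-softmax computed with the max-shift trick for stability.
    shifted = ZL - np.max(ZL, axis=0, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=0, keepdims=True))
    cost = -np.sum(Y * log_probs) / m
    # Combined gradient: softmax output minus the labels, averaged over m.
    dZL = (np.exp(log_probs) - Y) / m
    return cost, dZL
```

The full softmax Jacobian never has to be built, which is why this route is both simpler and harder to get wrong than backpropagating through softmax and the loss separately.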