<p>Sida Liu — The world is so complex that we cannot stop learning. (Feed generated by Jekyll, 2020-08-14, https://liusida.github.io/feed.xml)</p>
<h2><a href="https://liusida.github.io/cuda/2020/08/02/pycuda">How to use PyCUDA to bring significant speedup</a> (2020-08-02)</h2>
<p>Imagine that we have designed a computational experiment in Python, waited three days for the results, and then, unfortunately, discovered a typo or a small bug in the source code. What would we say when we restart the experiment? I would hope that the experiment could be rerun in half an hour.</p>
<p>It is possible, by parallelizing the code.</p>
<p>CUDA is a C++-like program language for parallel programs which can run on Nvidia GPU. <a href="https://developer.nvidia.com/cuda-toolkit">CUDA website</a></p>
<p>PyCUDA is an open source Python interface to compile CUDA source code on the fly and execute it. <a href="https://documen.tician.de/pycuda/">PyCUDA documentation</a></p>
<p>Here we show an example of using CUDA and PyCUDA to rewrite a Python program.</p>
<p>Source code: <a href="https://github.com/liusida/JacksCarRental-via-PyCUDA">GitHub repo</a></p>
<p>The file <code class="language-plaintext highlighter-rouge">car_rental.py</code> is a Python program. It is slow because it contains huge nested loops. We can confirm this by searching for the keywords <code class="language-plaintext highlighter-rouge">while True</code> and <code class="language-plaintext highlighter-rouge">for</code>.</p>
<p>The file <code class="language-plaintext highlighter-rouge">car_rental_cuda.py</code> is the CUDA-optimized version of the original program. <code class="language-plaintext highlighter-rouge">gpu_policy_evaluation</code> and <code class="language-plaintext highlighter-rouge">gpu_policy_improvement</code> are two kernels (CUDA entry points) that run 21*21 (num_state=21) threads in parallel. The Python code prepares the pre-defined constant variables, reads in the CUDA source file <code class="language-plaintext highlighter-rouge">car_rental_cuda.py.cu</code>, compiles it on the fly, and exposes the kernels as Python functions.</p>
<p>By running them, we get the results in the <code class="language-plaintext highlighter-rouge">images/</code> folder. The CUDA version takes only 6 seconds, while the original version would take more than an hour.</p>
<h2><a href="https://liusida.github.io/old/2018/07/02/nuance-in-monty-hall-paradox">Nuance in Monty Hall Paradox</a> (2018-07-02)</h2>
<p>Marilyn vos Savant made a mistake. She knew the game show Let’s Make a Deal so well that she assumed the rules of the game show also applied to the question she was asked.</p>
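<p>The actual kernels are CUDA, but the payoff of parallelizing nested loops can be illustrated with plain numpy vectorization. Below is a hypothetical toy update (not code from the repo) that shows the same 21x21 state grid being processed element-by-element versus all at once:</p>

```python
import numpy as np

num_state = 21  # same grid size the kernels use: 21*21 states

def slow_update(V):
    # nested Python loops: one state at a time
    out = np.empty_like(V)
    for i in range(num_state):
        for j in range(num_state):
            out[i, j] = 0.9 * V[i, j] + 1.0
    return out

def fast_update(V):
    # the whole 21x21 grid updated in one step,
    # analogous to launching 21*21 CUDA threads at once
    return 0.9 * V + 1.0

V = np.arange(num_state * num_state, dtype=float).reshape(num_state, num_state)
assert np.allclose(slow_update(V), fast_update(V))
```

The CUDA version pushes the same idea further: each of the 21*21 threads handles one state on the GPU.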
<p>On Marilyn vos Savant’s website, the original question can be found at <a href="http://marilynvossavant.com/game-show-problem/">http://marilynvossavant.com/game-show-problem/</a>:</p>
<p>“Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, “Do you want to pick door #2?” Is it to your advantage to switch your choice of doors?”</p>
<p>Interestingly, the original question never states any established rules of Let’s Make a Deal, which is critical to this question. Examination of the original question reveals that the host is not required to open another door; he could simply open the door that was chosen. If the host is free to choose between opening the chosen door directly and opening another door and offering a switch, then his choice to open another door is probably intended to lead you away from winning the car. (Unfortunately, the rule that you win whatever is behind your door is not mentioned in the question either.)</p>
<p>For this reason, the “Marilyn’s question” is different from “Monty Hall Paradox.”</p>
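<p>Under the standard Monty Hall rules, where the host always opens a different, goat-hiding door and always offers the switch, a quick simulation (my sketch, not part of the original post) shows why switching wins about 2/3 of the time:</p>

```python
import random

def play(switch, trials=100_000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # the host always opens a door that hides a goat and is not the pick
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=True), play(switch=False))  # close to 2/3 and 1/3
```

“Marilyn’s question”, by contrast, does not pin down the host’s protocol, so this 2/3 answer does not automatically apply.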
<p>So how do we solve “Marilyn’s question”?</p>
<h2><a href="https://liusida.github.io/old/2018/06/27/what-is-mathematics">What is Mathematics According to Keith Devlin</a> (2018-06-27)</h2>
<p>This morning I read the background material for Professor Keith Devlin’s course. It struck a chord with me, so I excerpted a few passages that I found inspiring, for future reference.</p>
<p>In Keith Devlin’s book ‘Introduction to Mathematical Thinking’, he writes,</p>
<p>“Virtually nothing (with just two further advances, both from the 17th century: calculus and probability theory) from the last three hundred years has found its way into the classroom. Yet most of the mathematics used in today’s world was developed in the last two hundred years! As a result, anyone whose view of mathematics is confined to what is typically taught in schools is unlikely to appreciate that research in mathematics is a thriving, worldwide activity, or to accept that mathematics permeates, often to a considerable extent, most walks of present-day life and society.”</p>
<p>The mathematics covered from elementary school through high school is essentially all at least three hundred years old, and saying three hundred years is generous: the only pieces from even three hundred years ago are calculus and probability, and the rest is older still. Yet most of the mathematics used in society today was developed within the last two hundred years. So anyone who thinks mathematics is just what our school textbooks teach cannot appreciate the latest developments in mathematical research.</p>
<p>“…mathematical notation no more is mathematics than musical notation is music. … In 1623, Galileo wrote, ‘The great book of nature can be read only by those who know the language in which it was written. And this language is mathematics.’…”</p>
<p>Saying that mathematical notation is mathematics is like saying that musical notation is music: notation is merely the language of mathematics, not mathematics itself. Yet, as Galileo said, the book of nature is written in mathematics, so learning the language of mathematics matters a great deal.</p>
<p>“As one of the greatest creations of human civilization, mathematics should be taught alongside science, literature, history, and art in order to pass along the jewels of our culture from one generation to the next. We humans are far more than the jobs we do and the careers we pursue.”</p>
<p>Mathematics, as one of the greatest creations of humanity, should be taught alongside science, literature, history, and art, and passed on to the next generation; it is among the most precious things we have. We humans are more than workers doing jobs; we are worth more than that.</p>
<p>“…those skills (use mathematics as a tool) fall into two categories. … The second category comprises people who can take a new problem, say in manufacturing, identify and describe key features of the problem mathematically, and use that mathematical description to analyze the problem in a precise fashion. … I propose to give them one (name): innovative mathematical thinkers.”</p>
<p>There are two categories of mathematics users: the first takes an already well-defined mathematical problem and works out how to compute the answer; the second takes a problem encountered in real life, formulates it mathematically, and analyzes it in a precise fashion. Devlin would call such people innovative mathematical thinkers.</p>
<h2><a href="https://liusida.github.io/2018/05/29/dynamic-nn">Dynamic NN Allowing Additional Evidence?</a> (2018-05-29)</h2>
<p>We have the traditional Neural Network (NN), with a static structure like this:</p>
<p>signal -> input -> hidden layer -> prediction =?= truth</p>
<p>-> means to propagate forward
=?= means to minimize difference</p>
<p>What if we have already trained such a model and a new piece of evidence (a new signal) appears? Can we add that signal to the model dynamically, without abandoning what has already been trained?</p>
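<p>One mechanical approach (my sketch, not the Bayesian route the post has in mind) is to widen the input layer with zero-initialized weights for the new signal; the pre-trained model’s behavior is then unchanged until the new weights receive gradient updates:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# a "trained" single hidden layer: 4 inputs -> 8 hidden units
W = rng.normal(size=(8, 4))

def hidden(W, x):
    return np.maximum(0.0, W @ x)  # ReLU activation

x_old = rng.normal(size=4)
before = hidden(W, x_old)

# widen to 5 inputs: the new weight column starts at zero
W_wide = np.hstack([W, np.zeros((8, 1))])
x_new = np.concatenate([x_old, [0.7]])  # old signals plus the new evidence

after = hidden(W_wide, x_new)
# old behavior is preserved; the new column can now be trained on new data
assert np.allclose(before, after)
```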
<p>I think we should use Bayesian theory, but I haven’t figured out how yet.</p>
<h2><a href="https://liusida.github.io/2017/09/27/highest-IQ">Why does the person with highest IQ not become the most successful one?</a> (2017-09-27)</h2>
<p>I heard about the “<a href="https://euler.epfl.ch/files/content/sites/euler/files/users/144617/public/LubinskiPersson.pdf">Study of Mathematically Precocious Youth After 35 Years</a>” years ago, but after studying machine learning, especially the generalization problem, I think I have glimpsed some possible reasons.</p>
<p>While a human is learning, the process is more or less like machine learning. Talent, as measured by an IQ test, can somehow indicate that the person has a more complex brain, just as some neural networks have more complex architectures. Unfortunately, overfitting often occurs when a model with a more complex architecture learns. When a model is stuck overfitting, the training error goes down steadily while the validation error grows. This is called the generalization problem. In machine learning we have several tricks that partially solve it. Here is a list of methods and their analogies in human learning.</p>
<ul>
<li>
<p>More data. This is the best method for both machine learning and human learning. If we are smart youths, we should just keep learning new stuff; then we can avoid overfitting to the knowledge we have already learned, and generalize better.</p>
</li>
<li>
<p>Dropout. This is my favorite method. Don’t study all the time. Do something else, or just do nothing, maybe sleep. And then we will find our ability of generalization improved. Sounds nice!</p>
</li>
<li>
<p>Adding noise to input. This is also a practical method. If we need to study something several times to master a particular idea, then after we feel confident enough, we can add some noise, e.g. use an alternative textbook, or focus on different details.</p>
</li>
<li>
<p>L2 regularization. This pushes all the unimportant weights toward zero. When we study, after several rounds of learning, we could keep only what we are confident about: if we are not 100% sure of something, don’t rely on it.</p>
</li>
</ul>
<h2><a href="https://liusida.github.io/2017/09/24/use-tensorflow-to-compute-gradient">Use Tensorflow to Compute Gradient</a> (2017-09-24)</h2>
<p>In most Tensorflow tutorials, we use minimize(loss) to update the parameters of the model automatically.</p>
<p>In fact, minimize() combines two steps: computing the gradients, and applying them to update the parameters.</p>
<p>Let’s take a look at an example:</p>
\[Y = (100 - 3W - B)^2\]
<p>What are the gradients with respect to W and B <strong>when W=1.0, B=1.0</strong>?</p>
<p>We can calculate them by hand:</p>
<p>let \(N = 100 - 3W - B\), so that \(Y = N^2\)</p>
\[\frac{\partial{Y}}{\partial{W}} =
\frac{\partial{Y}}{\partial{N}} * \frac{\partial{N}}{\partial{W}} =
2N * (-3) = -600 + 18W + 6B = -576\]
\[\frac{\partial{Y}}{\partial{B}} =
\frac{\partial{Y}}{\partial{N}} * \frac{\partial{N}}{\partial{B}} =
2N * (-1) = -200 + 6W + 2B = -192\]
<p>OK, now let’s use Tensorflow to compute that:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="c1"># make an example:
# Y = (100 - W X - B)^2
</span><span class="n">X</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="mf">3.</span><span class="p">)</span>
<span class="n">W</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">Variable</span><span class="p">(</span><span class="mf">1.</span><span class="p">)</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">Variable</span><span class="p">(</span><span class="mf">1.</span><span class="p">)</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="mi">100</span> <span class="o">-</span> <span class="n">W</span><span class="o">*</span><span class="n">X</span> <span class="o">-</span> <span class="n">B</span><span class="p">)</span>
<span class="c1">#the lr here is not about gradient computing. it only effect when appling
</span><span class="n">Ops</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">GradientDescentOptimizer</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.001</span><span class="p">)</span>
<span class="n">grads_and_vars</span> <span class="o">=</span> <span class="n">Ops</span><span class="p">.</span><span class="n">compute_gradients</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span>
<span class="c1"># we can modify the gradient here and then:
# Op_update = Ops.apply_gradients(grads_and_vars)
</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">global_variables_initializer</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">grads_and_vars</span><span class="p">))</span>
</code></pre></div></div>
<p>run it, and we get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(-576.0, 1.0), (-192.0, 1.0)]
</code></pre></div></div>
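<p>These numbers are easy to sanity-check without Tensorflow: a central finite difference on the same function (a quick check I am adding here, not part of the original post) reproduces both gradients:</p>

```python
def Y(W, B):
    return (100 - 3 * W - B) ** 2

def num_grad(f, args, i, eps=1e-4):
    # central difference with respect to the i-th argument
    plus = list(args); plus[i] += eps
    minus = list(args); minus[i] -= eps
    return (f(*plus) - f(*minus)) / (2 * eps)

dW = num_grad(Y, [1.0, 1.0], 0)
dB = num_grad(Y, [1.0, 1.0], 1)
print(dW, dB)  # approximately -576 and -192
```

For a quadratic like this, the central difference is exact up to floating-point rounding, so it agrees with compute_gradients to many digits.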
<p>So next time your professor asks you to implement back-propagation for some complex network by yourself, maybe this trick can help you double-check your implementation. Hooray!</p>
<h2><a href="https://liusida.github.io/2017/09/23/scree-of-pca">Scree of PCA (Principal Component Analysis)</a> (2017-09-23)</h2>
<p>I learned the concept of PCA today, and found that this method of reducing dimensionality is quite terse.</p>
<p>If we apply PCA to a 40-d dataset to reduce it to 2-d, it simply chooses the 2 most “principal” components, i.e. the 2 most “important” directions, and drops the others.</p>
<p>So, before we do PCA, we’d better draw a scree plot, showing the proportion of variance explained by each component.</p>
<p>Take a look at <a href="https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/">this implementation</a>.</p>
<p><img src="/images/2017-09-23-scree-of-pca/proportion-of-variance.png" alt="img" /></p>
<p>In this example, I think it is quite safe to drop the components after PC30, i.e. we can use PCA to reduce the dataset to 30-d quite safely. (We may then use t-SNE, a more time-consuming method.)</p>
<h2><a href="https://liusida.github.io/2017/09/09/learning-rate-too-large">Learning Rate is Too Large</a> (2017-09-09)</h2>
<p>What if I see a training accuracy scalar graph like this:</p>
<p><img src="/images/2017-09-09-learning-rate-too-large/accuracy-1.png" alt="Accuracy" /></p>
<p>The mini-batch training accuracy curve drifts downward over time after reaching a relatively high point. That might tell me the learning rate is too large.</p>
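<p>This behavior is easy to reproduce with a toy example (mine, not from any training run here): gradient descent on f(x) = x², where a small step size converges but a too-large one makes every step overshoot the minimum.</p>

```python
def gradient_descent(lr, steps=20, x=1.0):
    for _ in range(steps):
        x = x - lr * 2 * x   # the gradient of f(x) = x^2 is 2x
    return x

print(gradient_descent(0.1))  # about 0.0115: shrinks toward the minimum
print(gradient_descent(1.1))  # about 38.3: each step overshoots and the iterate grows
```

With lr=1.1 the update is x ← −1.2x, so the magnitude grows by 20% per step; that is the loss getting bigger and bigger.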
<p>When the learning rate is too large, the optimizer cannot make the loss converge by applying gradient updates to the variables: every step overshoots, and the loss becomes bigger and bigger.</p>
<h2><a href="https://liusida.github.io/2017/09/08/manipulating-tensorboard">Manipulating Tensorboard</a> (2017-09-08)</h2>
<p>Tensorboard is a very useful tool for visualizing the logs of Tensorflow. It is now an independent project on GitHub; here’s the <a href="https://github.com/tensorflow/tensorboard">link</a>.</p>
<p>In the past, if we were doing small projects, we usually printed some log information on the screen or wrote them into log files. The disadvantage is that if there are so many outputs, we can easily get lost in them.</p>
<p>So I think Tensorboard is a very helpful tool since it can re-organize log information, and present it in a Web form.</p>
<p>There are several features that I think are worth talking about:</p>
<h3 id="1-organize-logs-into-sub-directories">1 Organize logs into sub-directories.</h3>
<p>I select one directory for all Tensorboard logs, say ~/tensorboard_logdir. Then, for a project called Project_One, I make a sub-directory inside it, and every time I run the program, I write log files into a sub-sub-directory named after the current time.</p>
<p>When I run <code class="language-plaintext highlighter-rouge">tensorboard --logdir=~/tensorboard_logdir</code>, I get this:</p>
<p><img src="/images/2017-09-08-tensorboard/tensorboard_project_one.png" alt="Project One" /></p>
<h3 id="2-the-histogram">2 The Histogram.</h3>
<p>We usually draw histograms as bars. Tensorboard doesn’t.</p>
<p>According to dandelionmane, the developer of Tensorboard, <a href="https://github.com/tensorflow/tensorflow/issues/5381">the y-axis of the histogram is Density</a>, but I think it is Frequency. For example, I created an array with 7 items, [1,2,3,4,5,6,7], wrote it to a histogram, and got this interesting result:</p>
<p><img src="/images/2017-09-08-tensorboard/seven_items.png" alt="Seven Items" /></p>
<p>Adding up the values of all the points, I got 7. There is an interesting phenomenon on the right: the item [7] is represented by three points of roughly 0.3 each, which add up to 1. So I think that with few samples the histogram has trouble representing the data, but with enough samples it looks smooth and nice:</p>
<p><img src="/images/2017-09-08-tensorboard/thousand_items.png" alt="Thousand Items" /></p>
<h3 id="3-flush-if-using-ipython-jupyter-notebook">3 Flush if using Ipython (Jupyter) Notebook.</h3>
<p>Suppose we write some log files in a Notebook. Because the program has not exited, the files may not have been fully written; because of this, I have sometimes found only part of the information when running a program in a Notebook. So please call tf.summary.FileWriter.flush() or close() every time, to make sure the information is fully written out.</p>
<h3 id="4-in-histogram-z-axis-denotes-iterations">4 In histogram, z-axis denotes iterations</h3>
<p>Along the z-axis, we can observe how a distribution changes over iterations. I wrote a small piece of code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span> <span class="c1"># version: r1.3
</span>
<span class="n">W1</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">random_uniform</span><span class="p">([</span><span class="mi">10000</span><span class="p">]))</span>
<span class="n">W2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">random_normal</span><span class="p">([</span><span class="mi">10000</span><span class="p">],</span> <span class="n">stddev</span><span class="o">=</span><span class="mf">0.13</span><span class="p">))</span>
<span class="n">w1_hist</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">histogram</span><span class="p">(</span><span class="s">"from_uniform_distribution"</span><span class="p">,</span> <span class="n">W1</span><span class="p">)</span>
<span class="n">w2_hist</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">histogram</span><span class="p">(</span><span class="s">"from_normal_distribution"</span><span class="p">,</span> <span class="n">W2</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">W1</span><span class="o">-</span><span class="n">W2</span><span class="p">)</span>
<span class="n">train_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">AdamOptimizer</span><span class="p">(</span><span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.01</span><span class="p">).</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
<span class="n">summary_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">merge_all</span><span class="p">()</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">writer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">FileWriter</span><span class="p">(</span><span class="s">'/tmp/tensorboard/'</span><span class="p">,</span> <span class="n">sess</span><span class="p">.</span><span class="n">graph</span><span class="p">)</span>
<span class="n">tf</span><span class="p">.</span><span class="n">global_variables_initializer</span><span class="p">().</span><span class="n">run</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">m</span><span class="p">,</span><span class="n">_</span> <span class="o">=</span> <span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">summary_op</span><span class="p">,</span> <span class="n">train_op</span><span class="p">])</span>
<span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">10</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
<span class="n">writer</span><span class="p">.</span><span class="n">add_summary</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
<span class="n">v1</span><span class="p">,</span> <span class="n">v2</span> <span class="o">=</span> <span class="n">sess</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">W1</span><span class="p">,</span> <span class="n">W2</span><span class="p">])</span>
<span class="n">writer</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
<p>We can see those two distributions merged together during training:</p>
<p><img src="/images/2017-09-08-tensorboard/merged_distributions.png" alt="Merged Distributions" /></p>
<p>In fact, after the last iteration W1 and W2 are almost the same, so why do the two histograms look different? Because the y-axis scales differ: in from_normal_distribution, the initial maximum value is higher than in the from_uniform_distribution graph.</p>
<h2><a href="https://liusida.github.io/2017/08/24/deep-scratch">Implement a Deep Neural Network using Python and Numpy</a> (2017-08-24)</h2>
<p>I have just finished <a href="https://www.coursera.org/specializations/deep-learning">Andrew Ng’s new Coursera courses on Deep Learning (1-3)</a>. They are part of his new project, <a href="https://www.deeplearning.ai/">DeepLearning.ai</a>.</p>
<p>In those courses, there is a series of interviews, <a href="https://youtu.be/-eyhCTvrEtE?list=PLfsVAYSMwsksjfpy8P2t_I52mugGeA5gR">Heroes of Deep Learning</a>, which is very helpful for a newbie entering this field. Several of those masters advise a newbie to build a whole deep network from scratch, using just Python and Numpy, to understand things better. So, after the courses, I decided to build one on my own.</p>
<h2 id="day-1">Day 1</h2>
<p>I created a new repository named <a href="https://github.com/liusida/DeepScratch">Deep Scratch</a> in github, and a <a href="https://github.com/liusida/DeepScratch/blob/master/main.ipynb">main.ipynb</a> file, to do this job.</p>
<p>First of all, I made a todo list of the functions and algorithms mentioned in the courses. I planned to implement most of them.</p>
<p>Second, I decided to begin with the MNIST dataset. It is the “Hello, World” dataset, but I will switch to other datasets to test my model. In my mind, maybe too ambitiously, I want to build a transferable model; I think that’s the right direction toward general thinking.</p>
<p>After these steps, I could start the project.</p>
<h2 id="day-2">Day 2</h2>
<p>I implemented a basic model, including those functions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ReLU</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">softmax</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">forward_propagation_each_layer</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">A_prev</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">activation_function</span><span class="o">=</span><span class="n">ReLU</span><span class="p">)</span>
<span class="n">loss</span><span class="p">(</span><span class="n">Y_hat</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">cost</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="n">Y_hat</span><span class="p">)</span>
<span class="n">accuracy</span><span class="p">(</span><span class="n">Y_predict</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">backpropagate_cost</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="n">AL</span><span class="p">)</span>
<span class="n">backpropagate_softmax</span><span class="p">(</span><span class="n">AL</span><span class="p">,</span> <span class="n">dAL</span><span class="p">,</span> <span class="n">Y</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">ZL</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
<span class="n">backpropagate_linear</span><span class="p">(</span><span class="n">dZ</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">A_prev</span><span class="p">)</span>
<span class="n">backpropagate_ReLU</span><span class="p">(</span><span class="n">dA</span><span class="p">,</span> <span class="n">Z</span><span class="p">)</span>
<span class="n">forwardpropagation_all</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">backpropagate_all</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span>
<span class="n">update_parameters</span><span class="p">()</span>
<span class="n">model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">print_every</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">iteration</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">hidden_layers</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">],</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">)</span>
</code></pre></div></div>
<p>I have to say, the math was complex for me. The first time I implemented it, I had almost 10 bugs in the calculations!</p>
<p>There were a few concepts I was fuzzy on, such as: what loss function should I use for multi-class classification? What is the derivative of softmax? When should I divide the result by m (the number of examples)?</p>
<p>After maybe 10 hours of debugging, during which I even implemented a bunch of Tensorflow-equivalent functions for comparison, the model finally worked!</p>
<p><a href="https://github.com/liusida/DeepScratch/blob/day2/main.ipynb">Day 2’s notebook</a> <- Here were the code and formula. (I found that Jupyter Notebook is great to comment codes!)</p>
<h2 id="day-3">Day 3</h2>
<p>The training set accuracy was already 1.0, so I looked at the dev set accuracy: 0.6? Oh, that’s bad. So I had a variance problem.</p>
<p>I tried to implement regularization, but it seemed to help little with this variance problem.</p>
<p>Then suddenly I figured out why: because I was training on randomly sampled batches, the sampling did not cover all training examples, so many of them were wasted.</p>
<p>So I changed to mini-batches: define a mini-batch size and use a consecutive segment of the training data each time, which ensures every single example is used for training.</p>
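<p>Splitting into consecutive segments can be sketched like this (my illustration; the shapes follow the column-per-example convention used in these courses, not necessarily the repo’s exact code):</p>

```python
import numpy as np

def mini_batches(X, Y, batch_size=128):
    # columns are examples, matching the (features, m) layout
    m = X.shape[1]
    for start in range(0, m, batch_size):
        yield X[:, start:start + batch_size], Y[:, start:start + batch_size]

X = np.arange(2 * 300, dtype=float).reshape(2, 300)
Y = np.arange(300, dtype=float).reshape(1, 300)
batches = list(mini_batches(X, Y))
assert len(batches) == 3                            # batches of 128, 128, 44
assert sum(b[0].shape[1] for b in batches) == 300   # every example used exactly once
```

Shuffling the columns once per epoch before slicing keeps the batches varied without dropping any example.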
<p>I also realized a very interesting aspect of mini-batches: they help a lot with the variance problem, especially when the network is relatively shallow. I think they act just like dropout: the network cannot depend on any single piece of data!</p>
<p>Thanks to mini-batch, my dev accuracy jumped to 0.98, and test set accuracy was also 0.98. Not bad for today’s work!</p>
<p><a href="https://github.com/liusida/DeepScratch/blob/day3/main.ipynb">Day 3’s notebook</a> <- Here is mini-batch, regularization, train/dev/test accuracy.</p>
<h2 id="day-4">Day 4</h2>
<p>Since the code worked but was ugly… I decided to refactor it.</p>
<p><a href="https://github.com/liusida/DeepScratch/blob/day4/main.ipynb">Day 4’s notebook</a> <- I had refactored half of the code.</p>
<h2 id="day-5">Day 5</h2>
<p>Refactor done. I was happy that the code looks clean now.</p>
<p>During refactoring, I attempted to calculate the derivative of <strong>Z=WA+b</strong>, i.e. <strong>dL/dA</strong>, but since <strong>Z, W, A</strong> are all matrices, I failed to understand matrix-by-matrix derivatives. According to Wikipedia, the result seems to be a rank-four tensor. So it was lucky that in a neural network the final $L$ is a scalar, and the derivative of a scalar by a matrix is much easier to understand.</p>
<p>I also noticed that in Tensorflow, the final loss function is a vector! So they must understand what the derivative of a matrix by a matrix is!</p>
<p><a href="https://github.com/liusida/DeepScratch/blob/day5/main.ipynb">Day 5’s notebook</a> <- Now the code is runable and clean.</p>
<h2 id="day-6">Day 6</h2>
<p>As there was still a variance problem, and L2 regularization did not seem to help much, I decided to implement dropout.</p>
<p>I chose “inverted dropout”, which was introduced in Andrew Ng’s course.</p>
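<p>Inverted dropout zeroes activations with probability 1 − keep_prob and rescales the survivors by 1/keep_prob, so the expected activation is unchanged and nothing special is needed at test time. A minimal sketch (mine, not the notebook’s exact code):</p>

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    mask = rng.random(A.shape) < keep_prob   # keep each unit with probability keep_prob
    return A * mask / keep_prob              # rescale so the expected output equals A

rng = np.random.default_rng(0)
A = np.ones((50, 200))
dropped = inverted_dropout(A, 0.8, rng)
assert np.allclose(inverted_dropout(A, 1.0, rng), A)  # keep_prob=1 is the identity
assert abs(dropped.mean() - 1.0) < 0.02               # expectation roughly preserved
```

The same mask must be reused in the backward pass so the gradient flows only through the kept units.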
<p>I just watched a video comparing algebra and geometry. It says that calculus and algebra give you great power to solve a problem by just computing, but geometry has its own beauty: it can sometimes solve a problem in a very simple way. Today, I felt that the dropout technique is analogous to geometry: simple, effective, and beautiful.</p>
<p>Now, after 100 iterations of learning from the training set, the dev accuracy rose to 98.34%.</p>
<p>There’s another lovely feature I added to the code: during training, I can press the Notebook’s stop button, change some of the hyperparameters (anything except the architecture, i.e. the hidden layers), and re-run the cell with Ctrl+Enter. The parameters W and b are kept, not re-initialized, so training continues without restarting from the beginning.</p>
<p>By now, though, I was spending more and more time running the program on the CPU (I am lucky to have MKL for numpy, so I can use all my CPU cores), and it felt like a bit of a waste of time. Maybe I will implement all this in Tensorflow and use my GPU to save time. And Tensorflow has automatic gradient computation…</p>
<p><a href="https://github.com/liusida/DeepScratch/blob/day6/main.ipynb">Day 6’s notebook</a> <- Dropout version</p>
<h2 id="day-7">Day 7</h2>
<p>This was the last day. I implemented some gradient checking functions to double-check my understanding and code.</p>
<p>It was quite strange to find that I had misunderstood back-propagation for softmax. When I tested it, there was a relatively large difference between the calculated and approximated results. I looked for more information online, but only found <a href="http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/">an explanation of the softmax function and its derivative for a single sample (a vector)</a> that I could understand.</p>
<p>Finally, I did it the way Tensorflow does: compute softmax and the loss function together. That formula is really simple, but I still felt I did not fully understand the principles behind it.</p>
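<p>The “really simple” combined formula is dZ = (A − Y)/m, where A is the softmax output and Y the one-hot labels. A finite-difference check (my sketch of this standard result, not the notebook’s code) confirms it:</p>

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - Z.max(axis=0, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(Z, Y):
    m = Y.shape[1]
    return -np.sum(Y * np.log(softmax(Z))) / m

def dZ_combined(Z, Y):
    # softmax + cross-entropy backward in one step
    return (softmax(Z) - Y) / Y.shape[1]

Z = np.array([[0.2, -1.0], [1.5, 0.3], [-0.7, 2.2]])  # 3 classes, 2 examples
Y = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])    # one-hot labels

# numerical check of one entry via central differences
eps = 1e-6
E = np.zeros_like(Z); E[1, 0] = eps
numeric = (cross_entropy(Z + E, Y) - cross_entropy(Z - E, Y)) / (2 * eps)
assert abs(numeric - dZ_combined(Z, Y)[1, 0]) < 1e-5
```

Combining the two steps avoids ever forming the full Jacobian of softmax, which is why the separate-derivative route is so much harder to get right.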
<p><a href="https://github.com/liusida/DeepScratch/blob/day7/main.ipynb">Day 7’s notebook</a> <- After correct some functions, now the result … had no improvement … WHY…</p>
<p>I felt kind of stuck here, so I decided to move on and try some new ideas. Maybe I will come back and refine all this code later, when I have gained more knowledge.</p>
<p>Thank you for reading, and feel free to leave a message below.</p>