tag:blogger.com,1999:blog-66403542996996647362024-03-13T12:04:52.426-07:00Lili's Sharings Hi everyone. This blog is used to record my "ahh-ha!" moments, along the way of solving problems in the areas of data science, operations research, programming, database, full stack web development, cloud computing system and services (particularly database service). Hope you will find some insights here. Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.comBlogger19125tag:blogger.com,1999:blog-6640354299699664736.post-52801717317104370022016-07-08T03:58:00.000-07:002016-10-25T07:13:18.477-07:00My Learning Notes on Artificial Neural Network<div class="MsoNormal" style="text-align: justify;">
<i><span style="font-family: "times"; mso-bidi-font-family: Times;">This post summarizes some key ideas for understanding the
intuitions/theories behind artificial neural networks and their implementation, based
on my own learning experience. My approach is to ask questions and then
explore the answers. It turns out that almost every
question about artificial neural networks deserves much more effort to dig
into. <o:p></o:p></span></i></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<i><span style="font-family: "times"; mso-bidi-font-family: Times;">Note that this post DOES NOT aim to introduce the artificial neural
network algorithm systematically. Many excellent, comprehensive tutorials on
deep learning and neural networks can be found online. Some are listed in the
post. </span></i><i><span style="font-family: "times"; mso-bidi-font-family: Times;">Feel free to correct me if anything in this post is incorrect.
After all, I am still learning.</span></i></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<span style="font-family: "times"; mso-bidi-font-family: Times;">In
March 2016, AlphaGo defeated a human champion at the ancient Chinese board game "Go".
A key to AlphaGo's win is the deep learning algorithm
built into it. Beyond that, deep learning's achievements in applications
such as computer vision, speech
recognition, and natural language processing have made it very attractive. Deep learning
grew out of artificial neural networks, with some additional techniques
included. So to learn deep learning, I think it is reasonable to learn
artificial neural networks first. <o:p></o:p></span></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<span style="font-family: "times"; mso-bidi-font-family: Times;">As
I worked to grasp the intuitions/theories and implementation of
artificial neural networks, I found that they
make sense once we understand the following key ideas. </span><o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
</div>
<ol>
<li><span style="text-indent: -0.25in;">Activation Function</span></li>
<li><span style="text-indent: -0.25in;">Cost Function</span></li>
<li><span style="text-indent: -0.25in;">Gradient Descent Algorithm</span></li>
<li><span style="text-indent: -0.25in;">Backpropagation Algorithm</span></li>
<li><span style="text-indent: -0.25in;">Artificial Neural Network Architecture</span></li>
</ol>
<div class="MsoNormal" style="text-align: justify;">
<b style="mso-bidi-font-weight: normal;">Activation Function<o:p></o:p></b></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
The ultimate goal of an artificial neural network, or of any
supervised machine learning algorithm, is to find a functional relationship
that maps the inputs to the outputs as accurately as possible. An advantage of
artificial neural networks over many other machine learning algorithms is that they
can approximate a very wide range of relationships, including nonlinear ones.
Here is <a href="http://neuralnetworksanddeeplearning.com/chap4.html">a visual
proof that neural nets can compute any function</a>. The more accurately an
algorithm models the functional relationship between inputs and outputs, the
more accurately the resulting model can predict unknown outputs
from known inputs. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
What makes an artificial neural network able
to represent such functions is the nonlinear activation function; without it, stacked layers would collapse into a single linear map. Commonly used activation
functions are the log-sigmoid function, the tan-sigmoid function, and the softmax function.<br />
<br />
To see why the activation function gives the neural network this flexibility, let's take a look at a plot of the sigmoid function.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYLXI9viFk2Xzj_kIcO2J6BOyJe-L__JPuR0TIm-XzqiOUcS4UeN9tGb3YMAZp0jOhBOe1ZIj9AgNP76XrQPhkLvRFbQa9EaoF37G-XdnMeUoZEDSJ6AS-N76UWOcB1_kEo83Ribc5-cfM/s1600/sigmoid.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYLXI9viFk2Xzj_kIcO2J6BOyJe-L__JPuR0TIm-XzqiOUcS4UeN9tGb3YMAZp0jOhBOe1ZIj9AgNP76XrQPhkLvRFbQa9EaoF37G-XdnMeUoZEDSJ6AS-N76UWOcB1_kEo83Ribc5-cfM/s320/sigmoid.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<i>source: <span style="text-align: justify;">http://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html</span></i></div>
<br />
As shown in the above figure, roughly speaking, we can notice the following properties of the sigmoid function.<br />
<br />
For -1 &lt;= z &lt;= 1, the function is approximately linear.<br />
For -5 &lt; z &lt; -1 and 1 &lt; z &lt; 5, the function is clearly nonlinear.<br />
For z &lt;= -5 and z &gt;= 5, the function is nearly constant (close to 0 or close to 1).<br />
<br />
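A rough numerical check of these three regimes, as a minimal Python sketch (the function names are my own, chosen for illustration only):<br />

```python
import math

def sigmoid(z):
    """Log-sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Derivative of the log-sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Print the value and the slope at points in each of the three regimes.
for z in (-6, -3, -1, 0, 1, 3, 6):
    print(f"z={z:+d}  sigmoid={sigmoid(z):.4f}  slope={sigmoid_slope(z):.4f}")
```

The slope is largest around z = 0 and shrinks toward zero as |z| grows, which matches the "almost linear, then curved, then flat" reading of the figure.<br />
<br />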
Back to the artificial neural network, it is made up of neurons and links between
neurons. On each neuron, there are two steps to be done.</div>
<div class="MsoListParagraphCxSpFirst" style="text-align: justify; text-indent: -0.25in;">
</div>
<ol>
<li><span style="text-indent: -0.25in;">Sum up the weighted inputs.</span><span style="text-align: justify; text-indent: -0.25in;"><span style="font-family: "times new roman"; line-height: normal;"> </span></span></li>
<li><span style="text-align: justify; text-indent: -0.25in;">Apply the activation function to the sum
of the weighted inputs. With a simple step activation, the neuron fires only if the sum reaches a threshold value.</span></li>
</ol>
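<div class="MsoNormal" style="text-align: justify;">
The two steps can be sketched in a few lines of Python. This is only an illustration: the function name, the bias term, and the choice of the log-sigmoid as the activation are my assumptions, not something fixed by the steps themselves.</div>

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    # Step 1: sum up the weighted inputs (plus an assumed bias term).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function (log-sigmoid here) to the sum.
    return 1.0 / (1.0 + math.exp(-z))

print(neuron_output([1.0, 2.0], [0.5, -0.25]))
```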
<div class="MsoNormal" style="text-align: justify;">
In some tutorials, these two steps are not spelled out
explicitly. But I find it easier to understand what a neuron does when it is broken into two
steps. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
For more details about the activation function, refer to <a href="https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions">artificial
neural networks/activation function in wikibooks</a>. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
At this point, it is natural to ask “Why do we use the
above functions as activation functions in a neural network?” We already have some idea of how they can map different kinds of relationships. Another benefit is that they constrain the output to a bounded range: from 0 to 1 for the log-sigmoid and softmax functions, and from -1 to 1 for the tan-sigmoid function. Why do
we want a smooth, bounded output? Based on <a href="http://neuralnetworksanddeeplearning.com/chap1.html">the section “Sigmoid
Neurons” in Chapter 1, Using neural nets to recognize handwritten digits, of
Michael Nielsen’s book Neural Networks and Deep Learning</a>, one reason is
that small changes in the weights and biases cause only small
changes in the output. We can also notice that an output between 0 and 1 is easy to interpret as a probability.</div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<b style="mso-bidi-font-weight: normal;">Cost Function</b><br />
<b style="mso-bidi-font-weight: normal;"><br /></b>
In parametric models, such as regression and neural networks, different parameters give different predictions/outputs. We want to find the set of parameters that minimizes the difference between the actual values and the predictions.<br />
<br />
How do we measure the difference between actual values and predictions (i.e. the error)?<br />
<br />
In general, there are two ways.<br />
<br />
1. The sum of squared differences between the predicted output and the actual value of each observation in the training set.<br />
2. The sum of the negative log-likelihoods of the actual value given the predicted output, over each observation in the training set.<br />
<br />
The function that describes the squared error or the negative log-likelihood is called the cost function, and we want to minimize it.<br />
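<br />
As a minimal Python sketch of the two measures: the function names, the clipping constant eps, and the sample numbers are made up for illustration, and the log-likelihood version assumes binary (0/1) actual values with predictions read as probabilities.<br />

```python
import math

def sum_squared_error(actual, predicted):
    # Sum of squared differences over all observations.
    return sum((y - p) ** 2 for y, p in zip(actual, predicted))

def negative_log_likelihood(actual, predicted, eps=1e-12):
    # Assumes actual values are 0 or 1 and predictions are probabilities.
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

actual = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]   # predictions close to the actual values
bad = [0.4, 0.6, 0.3, 0.5]    # predictions far from the actual values

print(sum_squared_error(actual, good), sum_squared_error(actual, bad))
print(negative_log_likelihood(actual, good), negative_log_likelihood(actual, bad))
```

Both measures come out smaller for the better predictions, which is all a cost function needs to do.<br />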
<b style="mso-bidi-font-weight: normal;"><br /></b>
<b style="mso-bidi-font-weight: normal;">Gradient Descent
Algorithm<o:p></o:p></b></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
At this point, we want to minimize some cost function, which for an artificial neural network is complicated and has no closed-form solution.<br />
<br />
How can we perform the optimization?<br />
<br />
The gradient descent algorithm does the magic! It is an optimization technique
that gradually approaches a local minimum of a complex function. Here is a
beautiful and commonly used analogy for what the gradient descent algorithm
does. A complex function can be visualized as adjacent mountains and valleys,
with tops and bottoms. Imagine we are standing at some point on the side of
a valley and want to climb down to the bottom. But the fog is very heavy and
our sight is limited. We cannot see a whole path down to the bottom
and walk along it. All we can do is choose a next step
that brings us down a little bit. There are many directions in which we could take that
step, but the direction that brings us down the most is favored.
The gradient descent algorithm finds that direction and then moves us down by
one step. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
Mathematically speaking, the gradient descent algorithm
finds that direction by calculating the partial derivatives of the function with respect to each parameter (together, the gradient) and stepping in the opposite direction. The
function here is the <b style="mso-bidi-font-weight: normal;">cost
function</b>. This brings out another
advantage of the commonly used activation functions mentioned earlier – their
derivatives are easy to compute.<o:p></o:p></div>
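<div class="MsoNormal" style="text-align: justify;">
Here is a minimal Python sketch of the idea. The bowl-shaped cost with its minimum at w = 3, b = -1 is a toy function I made up so that the partial derivatives can be written by hand; a real neural network cost is far more complicated.</div>

```python
def cost(w, b):
    # Toy cost function: a bowl whose bottom is at w = 3, b = -1.
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def gradient(w, b):
    # Partial derivatives of the toy cost with respect to w and b.
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)

def gradient_descent(w, b, step_size=0.1, steps=100):
    for _ in range(steps):
        dw, db = gradient(w, b)
        # Step against the gradient, i.e. in the downhill direction.
        w -= step_size * dw
        b -= step_size * db
    return w, b

w, b = gradient_descent(0.0, 0.0)
print(w, b, cost(w, b))
```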
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
There are two questions related to the gradient descent
algorithm.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
</div>
<ol>
<li><span style="text-indent: -0.25in;">What step size should we choose? If our step
size is too small, it takes a long time to reach the bottom. But if our step size is
too big, we may step over the bottom and keep overshooting it.</span></li>
<li><span style="text-indent: -0.25in;">What initial position (i.e. weights, biases) should
we choose? We must start from somewhere. A set of weights and
biases in the artificial neural network gives an initial position.</span></li>
</ol>
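<div class="MsoNormal" style="text-align: justify;">
The step size question can be tried out on a toy problem. The sketch below minimizes f(x) = x^2 (gradient 2x) from the same starting point with three step sizes; the function and the numbers are my own choices for illustration.</div>

```python
def run(step_size, start=1.0, steps=50):
    # Gradient descent on f(x) = x**2, whose gradient is 2 * x.
    x = start
    for _ in range(steps):
        x -= step_size * 2.0 * x
    return x

for step_size in (0.001, 0.1, 1.1):
    print(step_size, run(step_size))
```

<div class="MsoNormal" style="text-align: justify;">
With the tiny step size, x has barely moved toward 0 after 50 steps. With the moderate one, it is very close to 0. With the large one, every step jumps over the bottom and lands farther away than before.</div>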
<div class="MsoNormal" style="text-align: justify;">
The rough answer I have for these two questions at the
moment is to do some experiments. I guess that as our experience grows, we can
judge these two choices better and more easily. Moreover, more systematic approaches
may be found in the literature.<o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
There is one question related to the cost function. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
</div>
<ol>
<li><span style="text-indent: -0.25in;">Why do we favor the quadratic form of the
difference between predicted output and actual value as the cost function?</span></li>
</ol>
<div class="MsoNormal" style="text-align: justify;">
The first reason must be that it measures the
prediction accuracy. As for other reasons, when Michael Nielsen explains
why he introduces the quadratic cost in his book, he mentions that the
quadratic function is smooth, which makes it easy to figure out how to make small changes
to the weights and biases in order to improve the output. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<b style="mso-bidi-font-weight: normal;">Backpropagation Algorithm<o:p></o:p></b></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
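The backpropagation algorithm is a set of rules for computing how much each
weight and bias of an artificial neural network contributes to the
prediction error. It propagates the error backward from the output layer, neuron by neuron, using the chain rule, and gradient descent then uses the resulting partial derivatives to update the
weights and biases. <o:p></o:p></div>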
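<div class="MsoNormal" style="text-align: justify;">
To make this concrete, here is a minimal Python sketch for the smallest network I could think of: one input, one sigmoid hidden neuron, and one sigmoid output neuron, with the quadratic cost 0.5 * (output - y)^2. The network shape, the names, and the step size are all my assumptions for illustration; this is not taken from any particular library or textbook.</div>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)      # hidden neuron activation
    out = sigmoid(w2 * h + b2)    # output neuron activation
    return h, out

def cost(params, x, y):
    _, out = forward(params, x)
    return 0.5 * (out - y) ** 2

def backprop(params, x, y):
    """Gradient of the cost with respect to [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    h, out = forward(params, x)
    # Error at the output neuron (derivative of cost w.r.t. its weighted sum).
    delta_out = (out - y) * out * (1.0 - out)
    # Error passed back to the hidden neuron through the weight w2.
    delta_hid = delta_out * w2 * h * (1.0 - h)
    return [delta_hid * x, delta_hid, delta_out * h, delta_out]

def train(params, x, y, step_size=0.5, steps=200):
    params = list(params)
    for _ in range(steps):
        grads = backprop(params, x, y)
        # Gradient descent update using the backpropagated gradients.
        params = [p - step_size * g for p, g in zip(params, grads)]
    return params

start = [0.3, -0.1, 0.7, 0.2]
trained = train(start, x=1.5, y=1.0)
print(cost(start, 1.5, 1.0), cost(trained, 1.5, 1.0))
```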
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
Chapter 7, Neural Networks, in the book <a href="https://www.amazon.com/Discovering-Knowledge-Data-Introduction-Mining/dp/0471666572/ref=sr_1_2?s=books&ie=UTF8&qid=1467973839&sr=1-2&keywords=discovering+knowledge+in+data+an+introduction+to+data+mining">Discovering
Knowledge in Data: An Introduction to Data Mining</a> works through a simple example of performing backpropagation in a neural network by hand. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
<b style="mso-bidi-font-weight: normal;">Artificial Neural
Network Architecture<o:p></o:p></b></div>
<div class="MsoNormal" style="text-align: justify;">
<br /></div>
<div class="MsoNormal" style="text-align: justify;">
In constructing an artificial neural network,
the hidden layers raise two hard questions that the input
layer and output layer do not. Note that the input layer can be decided based on the
feature engineering and the output layer is pretty straightforward. <o:p></o:p></div>
<div class="MsoNormal" style="text-align: justify;">
</div>
<ol>
<li><span style="text-indent: -0.25in;">How many hidden layers should we include in
order to achieve the best performance?</span></li>
<li><span style="text-indent: -0.25in;">How many neurons should we include in each
hidden layer?</span></li>
</ol>
<br />
<div class="MsoNormal" style="text-align: justify;">
Adding more sigmoid hidden layers and neurons increases the capacity of the neural network, but it also tends to cause overfitting.<br />
<br />
Many tutorials mention heuristics that help with this choice. I think it really takes a lot of experimentation and domain knowledge to make a good choice. </div>
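<br />
To make the capacity point a bit more concrete, here is a minimal sketch of my own that counts the trainable weights and biases of a fully connected network. The layer sizes (784 inputs, 10 outputs, and the hidden layers in between) are made-up values for illustration, and parameter count is only a rough proxy for capacity.<br />

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of neurons per layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron in the next layer
    return total


# Hypothetical architectures, chosen only for comparison.
small = count_parameters([784, 30, 10])
large = count_parameters([784, 100, 100, 10])
print(small, large)
```

The larger network has several times more parameters to fit, which is one reason it can memorize the training data more easily.<br />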
Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com0tag:blogger.com,1999:blog-6640354299699664736.post-42000450732224739562016-05-17T04:04:00.001-07:002016-05-21T03:32:23.360-07:00A Guide to Database Service Trove for OpenStack Liberty on Ubuntu 14.04Let's start with a brief introduction to the four components of Trove and their functions in the whole OpenStack environment.<br />
<br />
We can imagine the whole of OpenStack as a construction company that helps people build houses. The virtual machine servers built by OpenStack are analogous to the houses built by this company, and the components of OpenStack are like its departments, which provide different services to keep the company running well. The Keystone department is responsible for verifying customers' identities. The Glance department is responsible for providing the blueprint of the house. The Nova department is responsible for the construction work. The Cinder department is responsible for providing bricks.<br />
<br />
In this construction company, we are going to establish a department called Trove. It is responsible for installing a piece of intelligent equipment in the house. As in any department, different people take on different responsibilities. The Trove department has three full-time guys, plus a technician who is hired for each job.<br />
<br />
Trove-api: He acts like the department's front desk. All requests from customers come to him first, and he is responsible for handling and dispatching them. For example, suppose a customer asks to have a piece of intelligent equipment installed but doesn't have a house to put it in, and permits the Trove department to do everything needed to install it. Trove-api will pass the request on to the manager, Trove-taskmanager.<br />
<br />
Trove-taskmanager: He acts like the department manager. He has the right to ask other departments (Nova, Cinder, Glance, etc.) to help him get the house built, because an appropriate house has to exist before the intelligent equipment can be installed. So Trove-taskmanager first asks the other departments to build the house. Once the house is ready, Trove-taskmanager gets the message. He then hires a new technician called Trove-guestagent and sends him to the house to install the intelligent equipment.<br />
<br />
Trove-conductor: He acts like the department secretary. The technician Trove-guestagent reports the progress of the installation to him, such as what errors he encounters and whether the installation completed successfully.<br />
<br />
Trove-guestagent: He acts like the technician who is responsible for installing the intelligent equipment. He sends installation progress messages to the secretary, Trove-conductor.<br />
<br />
These four guys talk to each other by cell phone, which is the role the RabbitMQ message queue plays in OpenStack.<br />
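<br />
The flow above can be sketched as a toy simulation. This is only my own illustration of who sends messages to whom: the function name, the message fields, and the 'ACTIVE' status are assumptions, and a plain Python deque stands in for the RabbitMQ queue. It is not how Trove is actually implemented.<br />

```python
from collections import deque


def provision_database(request):
    """Walk one request through the four Trove roles described above."""
    bus = deque()  # stand-in for the RabbitMQ message queue
    log = []       # statuses the conductor has recorded so far

    # trove-api: accepts the customer request and hands it to the manager.
    bus.append(("taskmanager", {"name": request["name"]}))

    while bus:
        recipient, message = bus.popleft()
        if recipient == "taskmanager":
            # Asks Nova/Cinder/Glance for a server ("the house"),
            # then sends the guest agent to work inside it.
            server = "server-for-" + message["name"]
            bus.append(("guestagent", {"server": server}))
        elif recipient == "guestagent":
            # Installs the datastore and reports progress to the conductor.
            bus.append(("conductor", {"status": "BUILD"}))
            bus.append(("conductor", {"status": "ACTIVE"}))
        elif recipient == "conductor":
            # Records each progress update.
            log.append(message["status"])
    return log


print(provision_database({"name": "db1"}))
```

Note that Trove-api and Trove-guestagent never talk to each other directly in this sketch; everything goes through the queue, as in the cell phone analogy.<br />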
<b><br /></b>
The deployment below assumes that an OpenStack environment has already been set up and runs correctly with the services Keystone, Horizon, Glance, Nova, Cinder, and Swift. Note that Swift is optional for a Trove deployment unless you need the backup operation.<br />
<b><br /></b>
Note: Replace the capitalized placeholders such as TROVE_DBPASS, NETWORK_LABEL, RABBIT_USERNAME, RABBIT_PASS, and TROVE_PASS with appropriate values.<br />
<b><br /></b>
<b>Prerequisites </b><br />
<div>
<ol>
<li>Create the database for trove.</li>
<ul>
<li>$ mysql -u root -p</li>
<li>> CREATE DATABASE trove;</li>
<li>> GRANT ALL PRIVILEGES ON trove.* TO 'trove'@'localhost' IDENTIFIED BY 'TROVE_DBPASS';</li>
<li>> GRANT ALL PRIVILEGES ON trove.* TO 'trove'@'%' IDENTIFIED BY 'TROVE_DBPASS';</li>
<li>> FLUSH PRIVILEGES;</li>
</ul>
<li>Source the admin credentials of OpenStack environment.</li>
<ul>
<li>$ source admin-openrc.sh</li>
</ul>
<li>Create the service credentials for trove.</li>
<ul>
<li>$ openstack user create --domain default --password-prompt trove</li>
<li>$ openstack role add --project service --user trove admin</li>
<li>$ openstack service create --name trove --description "OpenStack Database Service" database</li>
</ul>
<li>Create the database service endpoints.</li>
<ul>
<li>$ openstack endpoint create --region RegionOne database public http://controller:8779/v1.0/%\(tenant_id\)s </li>
<li>$ openstack endpoint create --region RegionOne database internal http://controller:8779/v1.0/%\(tenant_id\)s </li>
<li>$ openstack endpoint create --region RegionOne database admin http://controller:8779/v1.0/%\(tenant_id\)s </li>
</ul>
</ol>
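A note on the endpoint URLs in step 4: the backslashes only stop the shell from interpreting the parentheses, so what gets stored is the template ending in %(tenant_id)s. As far as I understand, the tenant id is filled in later when the service catalog is built for a user. The minimal sketch below shows that substitution with ordinary Python string formatting; the function name and the tenant id are made up for illustration.<br />

```python
# The template as it is stored once the shell has removed the backslashes.
ENDPOINT_TEMPLATE = "http://controller:8779/v1.0/%(tenant_id)s"


def endpoint_for(tenant_id):
    """Fill the tenant id into the endpoint template."""
    return ENDPOINT_TEMPLATE % {"tenant_id": tenant_id}


# "abc123" is a hypothetical tenant id.
print(endpoint_for("abc123"))
```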
<div>
<b>Install and Configure Components</b></div>
<div>
<ol>
<li>Install the packages.</li>
<ul>
<li># apt-get install python-trove python-troveclient trove-common trove-api trove-taskmanager trove-conductor</li>
</ul>
<li>Edit the /etc/trove/trove.conf file.</li>
<ul>
<li>[database]</li>
<ul>
<li>connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove</li>
</ul>
<li>[DEFAULT]</li>
<ul>
<li>verbose = True</li>
<li>debug = True</li>
<li>rpc_backend = rabbit</li>
<li>auth_strategy = keystone</li>
<li>trove_auth_url = http://controller:5000/v2.0</li>
<li>nova_compute_url = http://controller:8774/v2</li>
<li>cinder_url = http://controller:8776/v2</li>
<li>neutron_url = http://controller:9696/</li>
<li>add_addresses = True</li>
<li>network_label_regex = ^NETWORK_LABEL$</li>
<li>log_dir = /var/log/trove/</li>
<li>log_file = trove-api.log</li>
</ul>
<li>[oslo_messaging_rabbit]</li>
<ul>
<li>rabbit_host = controller</li>
<li>rabbit_userid = RABBIT_USERNAME</li>
<li>rabbit_password = RABBIT_PASS</li>
</ul>
<li>[keystone_authtoken]</li>
<ul>
<li>auth_uri = http://controller:5000</li>
<li>auth_url = http://controller:35357</li>
<li>auth_plugin = password</li>
<li>project_domain_id = default</li>
<li>user_domain_id = default</li>
<li>project_name = service</li>
<li>username = trove</li>
<li>password = TROVE_PASS</li>
</ul>
</ul>
<li>Edit the /etc/trove/trove-taskmanager.conf file.</li>
<ul>
<li>[database]</li>
<ul>
<li>connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove</li>
</ul>
<li>[DEFAULT]</li>
<ul>
<li>verbose = True</li>
<li>debug = True</li>
<li>rpc_backend = rabbit</li>
<li>trove_auth_url = http://controller:5000/v2.0</li>
<li>nova_compute_url = http://controller:8774/v2</li>
<li>cinder_url = http://controller:8776/v2</li>
<li>neutron_url = http://controller:9696/</li>
<li>nova_proxy_admin_user = admin</li>
<li>nova_proxy_admin_pass = admin</li>
<li>nova_proxy_admin_tenant_name = admin</li>
<li>nova_proxy_admin_tenant_id = *************</li>
<li>taskmanager_manager = trove.taskmanager.manager.Manager</li>
<li>add_addresses = True</li>
<li>network_label_regex = ^NETWORK_LABEL$</li>
<li>log_dir = /var/log/trove/</li>
<li>log_file = trove-taskmanager.log</li>
<li>guest_config = /etc/trove/trove-guestagent.conf</li>
<li>guest_info = guest_info.conf</li>
<li>injected_config_location = /etc/trove/conf.d</li>
<li>cloudinit_location = /etc/trove/cloudinit</li>
</ul>
<li>[oslo_messaging_rabbit]</li>
<ul>
<li>rabbit_host = controller</li>
<li>rabbit_userid = RABBIT_USERNAME</li>
<li>rabbit_password = RABBIT_PASS</li>
</ul>
</ul>
<li>Edit the /etc/trove/trove-conductor.conf file.</li>
<ul>
<li>[database]</li>
<ul>
<li>connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove</li>
</ul>
<li>[DEFAULT]</li>
<ul>
<li>verbose = True</li>
<li>debug = True</li>
<li>trove_auth_url = http://controller:5000/v2.0</li>
<li>rpc_backend = rabbit</li>
<li>log_dir = /var/log/trove/</li>
<li>log_file = trove-conductor.log</li>
</ul>
<li>[oslo_messaging_rabbit]</li>
<ul>
<li>rabbit_host = controller</li>
<li>rabbit_userid = RABBIT_USERNAME</li>
<li>rabbit_password = RABBIT_PASS</li>
</ul>
</ul>
<li>Edit the /etc/trove/trove-guestagent.conf file.</li>
<ul>
<li>[DEFAULT]</li>
<ul>
<li>verbose = True</li>
<li>debug = True</li>
<li>trove_auth_url = http://controller:5000/v2.0</li>
<li>nova_proxy_admin_user = admin</li>
<li>nova_proxy_admin_pass = admin</li>
<li>nova_proxy_admin_tenant_name = admin</li>
<li>nova_proxy_admin_tenant_id = *************</li>
<li>rpc_backend = rabbit</li>
<li>log_dir = /var/log/trove/</li>
<li>log_file = trove-guestagent.log</li>
<li>datastore_registry_ext = vertica:trove.guestagent.datastore.experimental.vertica.manager.Manager</li>
<li>(Following the notes above datastore_registry_ext in the configuration file, add the datastores you want.)</li>
</ul>
<li>[oslo_messaging_rabbit]</li>
<ul>
<li>rabbit_host = controller</li>
<li>rabbit_userid = RABBIT_USERNAME</li>
<li>rabbit_password = RABBIT_PASS</li>
</ul>
</ul>
<li>Edit the /etc/init/trove-taskmanager.conf file.</li>
<ul>
<li>--exec /usr/bin/trove-taskmanager -- --config-file=/etc/trove/<span style="color: red;">trove-taskmanager.conf </span>${DAEMON_ARGS}</li>
</ul>
<li>Edit the /etc/init/trove-conductor.conf file.</li>
<ul>
<li>--exec /usr/bin/trove-conductor -- --config-file=/etc/trove/<span style="color: red;">trove-conductor.conf</span> ${DAEMON_ARGS}</li>
</ul>
<li>Populate the trove service database.</li>
<ul>
<li> # trove-manage db_sync</li>
</ul>
</ol>
</div>
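A common mistake with the files above is leaving one of the capitalized placeholders in place. Here is a minimal sketch of a checker, assuming the files are plain INI text that Python's configparser can read; the function name and the sample text are my own and are not part of Trove.<br />

```python
import configparser

PLACEHOLDERS = ("TROVE_DBPASS", "NETWORK_LABEL", "RABBIT_USERNAME",
                "RABBIT_PASS", "TROVE_PASS")


def find_unreplaced(config_text):
    """Return (section, option, placeholder) for every leftover placeholder."""
    # default_section is renamed so that [DEFAULT] is treated as a normal
    # section; interpolation is off because values may contain '%'.
    parser = configparser.ConfigParser(default_section="__none__",
                                       interpolation=None)
    parser.read_string(config_text)
    leftovers = []
    for section in parser.sections():
        for option, value in parser.items(section):
            for placeholder in PLACEHOLDERS:
                if placeholder in value:
                    leftovers.append((section, option, placeholder))
    return leftovers


# Hypothetical file content with one forgotten placeholder.
sample = """
[DEFAULT]
network_label_regex = ^private$
[database]
connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove
[oslo_messaging_rabbit]
rabbit_password = s3cret
"""
print(find_unreplaced(sample))
```

To check a real file, read its content and pass the text to find_unreplaced().<br />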
<div>
<b>Finalize Installation</b></div>
</div>
<div>
<ol>
<li>Restart the database services.</li>
<ul>
<li># service trove-api restart</li>
<li># service trove-taskmanager restart</li>
<li># service trove-conductor restart</li>
</ul>
<li>Remove the SQLite database file.</li>
<ul>
<li># rm -f /var/lib/trove/trove.sqlite</li>
</ul>
</ol>
<div>
<br /></div>
</div>
<div>
<b>Build database guest image for OpenStack Trove.</b><br />
<b><br /></b>
Reference:<br />
<a href="http://www.dbta.com/BigDataQuarterly/Articles/Building-a-database-guest-image-for-OpenStack-Trove-107368.aspx">Building a guest database image for OpenStack Trove</a><br />
<a href="http://lovelearning9.blogspot.hk/2015/11/build-trove-guest-image-manually-in.html">Build Trove guest image manually in OpenStack</a></div>
<div>
<br /></div>
<div>
<b>Configure the database guest. </b></div>
<div>
<ol>
<li>Edit the /etc/trove/trove-guestagent.conf file.</li>
<ul>
<li>[DEFAULT]</li>
<ul>
<li>verbose = True</li>
<li>debug = True</li>
<li>trove_auth_url = http://controller:5000/v2.0</li>
<li>nova_proxy_admin_user = admin</li>
<li>nova_proxy_admin_pass = admin</li>
<li>nova_proxy_admin_tenant_name = admin</li>
<li>nova_proxy_admin_tenant_id = *************</li>
<li>rpc_backend = rabbit</li>
<li>log_dir = /var/log/trove/</li>
<li>log_file = trove-guestagent.log</li>
<li>datastore_registry_ext = vertica:trove.guestagent.datastore.experimental.vertica.manager.Manager</li>
<li>(Following the notes above datastore_registry_ext in the configuration file, add the datastores you want.)</li>
</ul>
<li>[oslo_messaging_rabbit]</li>
<ul>
<li>rabbit_host = controller</li>
<li>rabbit_userid = RABBIT_USERNAME</li>
<li>rabbit_password = RABBIT_PASS</li>
</ul>
<ul>
</ul>
</ul>
<li>Edit the /etc/init/trove-guestagent.conf file.</li>
<ul>
<li>--exec /usr/bin/trove-guestagent -- <span style="color: red;">--config-file=/etc/trove/conf.d/guest_info.conf</span> <span style="color: red;">--config-file=/etc/trove/</span><span style="color: red;">trove-guestagent.conf</span> ${DAEMON_ARGS}</li>
</ul>
</ol>
</div>
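Why does the guest agent need two --config-file arguments? My understanding is that the options from all the given files are merged, so the guest id injected through guest_info.conf ends up next to the settings from trove-guestagent.conf. The sketch below imitates that merging with Python's configparser. It is an illustration of the idea, not the configuration loader Trove really uses, and the guest id and file contents are made up.<br />

```python
import configparser


def load_layered(*config_texts):
    """Merge several INI texts; later texts override earlier ones."""
    parser = configparser.ConfigParser(default_section="__none__",
                                       interpolation=None)
    for text in config_texts:
        parser.read_string(text)
    return parser


# Hypothetical contents of the two files on the guest.
guest_info = "[DEFAULT]\nguest_id = 1234-abcd\n"
guestagent = "[DEFAULT]\nrpc_backend = rabbit\nlog_file = trove-guestagent.log\n"

merged = load_layered(guest_info, guestagent)
only_agent = load_layered(guestagent)

print(merged["DEFAULT"]["guest_id"])
print("guest_id" in only_agent["DEFAULT"])
```

With only trove-guestagent.conf loaded, guest_id is missing, which matches the fourth problem in the debugging section.<br />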
<div>
<b>Debug Common Problems.</b><br />
<b><br /></b>
Read the logs to find the specific error information. The first place to check is trove-taskmanager.log on the controller node and then trove-guestagent.log on the guest.<br />
<br />
<ol>
<li>No corresponding Nova instances are launched.</li>
<ul>
<li>Check the configuration values for nova_proxy_admin_user, nova_proxy_admin_pass, nova_proxy_admin_tenant_name, nova_proxy_admin_tenant_id.</li>
</ul>
<li>Trove taskmanager, trove conductor, and trove guestagent cannot talk to each other via RPC.</li>
<ul>
<li>Check their logs for the message "connected to AMQP". If there is no such message, check the configuration values of rabbit_host, rabbit_userid, rabbit_password, and rpc_backend.</li>
</ul>
<li>The guest_info.conf cannot be injected into the guest.</li>
<ol>
<li><span id="docs-internal-guid-a945d1ba-be53-c09d-93cf-20e9e777055b"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "arial"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">1. Install the following packages on compute nodes</span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;"># apt-get install libguestfs-tools python-libguestfs</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;"># sudo update-guestfs-appliance</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;"># sudo usermod -a -G kvm yourlogin</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;"># chmod 0644 /boot/vmlinuz*</span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "arial"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">2. Add the following configuration options in /etc/nova/nova-compute.conf on</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "arial"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">compute nodes.</span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">[DEFAULT]</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">compute_driver=libvirt.LibvirtDriver</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">rootwrap_config = /etc/nova/rootwrap.conf</span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">[libvirt]</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">virt_type=qemu</span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">inject_partition = -1</span></div>
<br /><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;">
<span style="background-color: white; color: #222222; font-family: "arial"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;">3. Restart nova-compute service on compute nodes</span></div>
<br /><span style="background-color: white; color: #222222; font-family: "consolas"; font-size: 13.333333333333332px; vertical-align: baseline; white-space: pre-wrap;"># service nova-compute restart</span></span></li>
</ol>
<li>Trove guestagent cannot get the guest id from guest_info.conf and the Trove instance is stuck in 'BUILD' status forever, even though the guest info has already been injected into the guest.</li>
<ul>
<li>Check the /etc/init/trove-guestagent.conf file on the guest and make sure <span style="color: red;">/etc/trove/conf.d/guest_info.conf </span>is passed as one of the configuration files.</li>
</ul>
</ol>
</div>
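For the second problem, the log check can be scripted. This is a minimal sketch: the sample log lines are made up, and I match the message case-insensitively because I am not sure of the exact wording in every release.<br />

```python
def has_amqp_connection(log_text):
    """Return True if any log line reports a connection to AMQP."""
    return any("connected to amqp" in line.lower()
               for line in log_text.splitlines())


# Hypothetical log lines, for illustration only.
sample_log = "\n".join([
    "2016-05-17 04:04:00 INFO Starting trove-taskmanager",
    "2016-05-17 04:04:01 INFO Connected to AMQP server on controller:5672",
])
print(has_amqp_connection(sample_log))
```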
Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com11tag:blogger.com,1999:blog-6640354299699664736.post-81052932155086905482016-04-24T06:50:00.000-07:002016-05-02T21:19:59.437-07:00MySQL Router: A High Availability Solution for MySQLNowadays more and more companies are starting to use cloud computing systems and services. Cloud service users want their service level guaranteed. Take a database cloud service as an example: users expect the database on the cloud to be available whenever they use it. But there can always be something that makes the database unavailable, such as a network failure or an unexpected database shutdown. So a big problem for cloud service providers is how to guarantee the service level declared in the service level agreement. High availability solutions can solve that problem. "<a href="https://en.wikipedia.org/wiki/High_availability">High Availability</a>" is a fundamental and important concept in cloud computing. It means the service/server on the cloud is available/up at a high rate, such as 99.99% of the time in a given year.<br />
<br />
In this post, I would like to introduce one high availability solution for MySQL -- <a href="https://dev.mysql.com/doc/mysql-router/en/">MySQL Router</a>. First, I will describe what it is. And then I will show how to use it.<br />
<br />
What is MySQL Router?<br />
<br />
MySQL Router is middleware that sits between a MySQL client and MySQL servers and redirects a query from the client to a server. Usually we connect a MySQL client directly to a MySQL server, so why do we want this middleware? The reason is that, with MySQL Router, we can set up more than one backend MySQL server. When one server is down, MySQL Router can automatically redirect the query to another available server, which helps guarantee the service level.<br />
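<br />
The failover idea can be sketched in a few lines. This is a toy model of "use the first server that is up", not MySQL Router's actual implementation; the function name, the server names, and the is_up check are all hypothetical.<br />

```python
def pick_destination(destinations, is_up):
    """Return the first destination that is reachable, or None."""
    for destination in destinations:
        if is_up(destination):
            return destination
    return None


# Hypothetical backend servers; pretend the first one is down.
servers = ["server-a:3306", "server-b:3306"]
down = {"server-a:3306"}
chosen = pick_destination(servers, lambda d: d not in down)
print(chosen)
```

In a real setup the is_up check would be an actual connection attempt rather than a lookup in a set.<br />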
<br />
How to use MySQL Router?<br />
<br />
1. Set up MySQL backend servers.<br />
<br />
I launched two Ubuntu Linux virtual machines and installed a MySQL server on each of them. Another setup scenario is introduced in the <a href="https://dev.mysql.com/doc/mysql-router/en/mysql-router-installation-postinstallation.html">MySQL Router tutorial</a>. In real practice, the MySQL backend servers are set up in a replication topology.<br />
<br />
Command to install the MySQL server on Ubuntu:<br />
shell> sudo apt-get install mysql-server<br />
<br />
2. Set up MySQL Router.<br />
<br />
-- Download the MySQL APT repository.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>shell> wget http://dev.mysql.com/get/mysql-apt-config_0.7.2-1_all.deb<br />
<br />
-- Install the MySQL APT repository.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>shell> sudo dpkg -i mysql-apt-config_0.7.2-1_all.deb<br />
<br />
-- Update the APT repository.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>shell> sudo apt-get update<br />
<br />
-- Install MySQL Router.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>shell> sudo apt-get install mysql-router<br />
<br />
-- Install MySQL Utilities.<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>shell> sudo apt-get install mysql-utilities<br />
<br />
3. Configure MySQL Router.<br />
<br />
The configuration file is /etc/mysqlrouter/mysqlrouter.ini.<br />
<br />
The configuration instructions can be found <a href="https://dev.mysql.com/doc/mysql-router/en/mysql-router-configuration.html">here</a>. The most important part is the option 'destinations' in the [routing] section. Each entry has the format host_ip:port, and entries are separated by commas. For example, if there are two MySQL servers on the hosts 192.168.25.200 and 192.168.25.201 respectively, 'destinations' can be set as follows.<br />
<br />
destinations = 192.168.25.200:3306, 192.168.25.201:3306<br />
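<br />
To double check the format, the value can be split into (host, port) pairs. This is a minimal sketch of my own, assuming every entry carries an explicit port as in the example above; it is not how MySQL Router parses the option.<br />

```python
def parse_destinations(value):
    """Split 'host:port, host:port' into a list of (host, port) tuples."""
    pairs = []
    for item in value.split(","):
        # rpartition splits on the last ':' so the port is always the tail.
        host, _, port = item.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


print(parse_destinations("192.168.25.200:3306, 192.168.25.201:3306"))
```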
<br />
4. Start MySQL Router.<br />
<br />
shell> sudo mysqlrouter -c /etc/mysqlrouter/mysqlrouter.ini<br />
<br />
5. Set up MySQL client to connect to MySQL Router.<br />
<br />
A Python script is used here as the MySQL client. Based on the documentation, the client must be executed on the same machine where MySQL Router is running.<br />
<br />
(About the 'cnx' statement: -- change the user and password to your own in the 'cnx' statement.<br />
-- By default, the host and port of MySQL Router are 'localhost' and 7001, unless they are configured by the option 'bind_address' in mysqlrouter.ini. )<br />
<br />
shell> python<br />
>>> import mysql.connector<br />
>>> cnx = mysql.connector.connect(host='localhost', port=7001, user='root', password='secret')<br />
>>> cur = cnx.cursor()<br />
>>> cur.execute("SHOW DATABASES")<br />
>>> print(cur.fetchall())<br />
<br />
6. Play with MySQL Router.<br />
<br />
-- Create different databases on each MySQL server under the same user specified in the 'cnx' statement. <br />
-- Execute the code in Step 5. And check the output.<br />
-- Stop a MySQL server, re-execute code from 'cnx', and examine the difference.<br />
<br />
More details can be found in the part of <a href="https://dev.mysql.com/doc/mysql-router/en/mysql-router-installation-postinstallation.html">Testing the Router</a>.Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com0tag:blogger.com,1999:blog-6640354299699664736.post-4922078544397448182016-04-16T08:54:00.000-07:002016-04-17T22:47:25.868-07:00Google Charts: a great way to draw interactive and realtime charts on JSON data requested from REST APIData visualization is helpful not only for data analysts but also for system administrators. Data analysts use it to untangle the relationships between correlated features, while system administrators use it to monitor the status of the system and its users. Many plotting packages have been developed to visualize data, such as ggplot2 in R and matplotlib in Python, and they are very powerful for making elegant charts. But the charts made by those packages are STATIC. In some cases, we want charts that respond to users' interaction or refresh automatically to track the latest data in the data set. This is where <a href="https://developers.google.com/chart/">Google Charts</a> comes in. Another benefit of Google Charts is that the charts can be showcased on the web! Moreover, Google Charts makes it easy to use AJAX to load JSON data retrieved from a REST API into the chart data table.<br />
<br />
In this post, I will give a complete example of how to use Google Charts to draw a realtime chart that reads data from a REST API and refreshes itself automatically. Using Google Charts requires familiarity with HTML, JavaScript, and CSS. I think the <a href="https://developers.google.com/chart/interactive/docs/quick_start">guide </a>to Google Charts makes the technique very approachable for new learners. For people who find the guide hard to follow, some tutorial resources are provided at the end of the post to help catch up on the required basic concepts.<br />
<br />
The code of the example can be found in <a href="https://github.com/Lili-Updating/projects/blob/master/google_charts_example/google_chart_example.html">my GitHub</a>. My workflow for this program is as follows.<br />
<br />
<ol>
<li>Prepare a REST API that provides JSON data to serve as the chart data for visualization. A <a href="https://api.github.com/repos/hadley/ggplot2/issues">public GitHub API</a> mentioned in this <a href="https://cran.r-project.org/web/packages/jsonlite/vignettes/json-apis.html">post</a> is used. It contains data about issues in the ggplot2 repository on GitHub. We can also create our own REST API, as mentioned in my last post.</li>
<li>Use AJAX to read JSON data from REST API to a JavaScript variable, which will be transformed to the format of Google Chart data table. <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrx4WTfV9kpQnWzGpQxIG4a1qTEqRZl9SuWXevj8RyYZQoRorpNiSOgNHArEodmfnLUEkl7rUnaCTzK55XEOTcuZkdm-fYPE3Ji23pJ_M_qt7gjLK7bQQGwCPCG2HY8n0PHbmCUD6ZGCIL/s1600/Screen+Shot+2016-04-16+at+11.16.48+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrx4WTfV9kpQnWzGpQxIG4a1qTEqRZl9SuWXevj8RyYZQoRorpNiSOgNHArEodmfnLUEkl7rUnaCTzK55XEOTcuZkdm-fYPE3Ji23pJ_M_qt7gjLK7bQQGwCPCG2HY8n0PHbmCUD6ZGCIL/s1600/Screen+Shot+2016-04-16+at+11.16.48+PM.png" /></a></li>
<li>Set up Google Chart refresh effect by using the function <a href="https://developers.google.com/chart/interactive/docs/reference#methods_11">setRefreshInterval()</a>. <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwpNM7dL6nlRiDuZd6rY6E8YcgJHw8nelaToMTVq_EUNOkTHhvxH-qZ_BV7VMBKLPTM7AQg-QHGR6pODYXVmLxxpnUdqh32Ilww3owTRZNzhSFZ7IR11K9ZFgr1PmiegXeh0zfRcExEK01/s1600/Screen+Shot+2016-04-16+at+11.22.39+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwpNM7dL6nlRiDuZd6rY6E8YcgJHw8nelaToMTVq_EUNOkTHhvxH-qZ_BV7VMBKLPTM7AQg-QHGR6pODYXVmLxxpnUdqh32Ilww3owTRZNzhSFZ7IR11K9ZFgr1PmiegXeh0zfRcExEK01/s1600/Screen+Shot+2016-04-16+at+11.22.39+PM.png" /></a></li>
<li>Set up Google Chart option <a href="https://developers.google.com/chart/interactive/docs/animation#null">animation</a>. <div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2CNsVBWN-foO3886b6yrCg_fLy1SSTHx9VoLFa7jJ7GEwGMmLDjFLcH3PokyInX2m2O3DTGQzsNpOhpFUlAoiBHQEwEbDuC4Drbtbh5Hudnj3_UKPvLVnkGJqhLq-qM275WkPHAhO-Tvl/s1600/Screen+Shot+2016-04-16+at+11.24.50+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2CNsVBWN-foO3886b6yrCg_fLy1SSTHx9VoLFa7jJ7GEwGMmLDjFLcH3PokyInX2m2O3DTGQzsNpOhpFUlAoiBHQEwEbDuC4Drbtbh5Hudnj3_UKPvLVnkGJqhLq-qM275WkPHAhO-Tvl/s1600/Screen+Shot+2016-04-16+at+11.24.50+PM.png" /></a></div>
</li>
<li>Draw the chart. <div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoviyCis9LXh8SM3Tyai67MtHJDyHrhlZLJ3T7QDctwX-lTAs2MkNq12cNNO7xSXeehWj7yooMLMl0ZQvOAPi13huluKzaggUSkxyKv0Ne07yLNhycVECcMuAfhgVTMdkR3rC_HtdkEmHj/s1600/Screen+Shot+2016-04-16+at+11.56.35+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoviyCis9LXh8SM3Tyai67MtHJDyHrhlZLJ3T7QDctwX-lTAs2MkNq12cNNO7xSXeehWj7yooMLMl0ZQvOAPi13huluKzaggUSkxyKv0Ne07yLNhycVECcMuAfhgVTMdkR3rC_HtdkEmHj/s1600/Screen+Shot+2016-04-16+at+11.56.35+PM.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlT0gNa6osGTAhOyTh7eBfxLHABGTQDCCj94wscv0Fa1rbNvPf8ZIqiWjboiSQD1IQgBwdxPF1h1ijTHGcAoL9-US4Vu3z9l2t9zJNTyNXrLvZLsa0Qo85l3suWCdidQ_j4bU0sq1ynUu5/s1600/Screen+Shot+2016-04-16+at+11.56.45+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlT0gNa6osGTAhOyTh7eBfxLHABGTQDCCj94wscv0Fa1rbNvPf8ZIqiWjboiSQD1IQgBwdxPF1h1ijTHGcAoL9-US4Vu3z9l2t9zJNTyNXrLvZLsa0Qo85l3suWCdidQ_j4bU0sq1ynUu5/s1600/Screen+Shot+2016-04-16+at+11.56.45+PM.png" /></a></div>
</li>
</ol>
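The transformation in step 2 is the heart of the example, so here is a minimal sketch of the idea. My real code is JavaScript (see the GitHub link above); this Python version only illustrates how a JSON list of issues becomes a header row plus data rows, which is the array shape that google.visualization.arrayToDataTable() accepts. The sample JSON is made up, and I assume each issue object has the 'number' and 'comments' fields.<br />

```python
import json


def issues_to_rows(issues_json):
    """Turn a JSON array of issues into a header row plus one row per issue."""
    issues = json.loads(issues_json)
    rows = [["Issue", "Comments"]]  # header row with the column labels
    for issue in issues:
        rows.append(["#%d" % issue["number"], issue["comments"]])
    return rows


# Hypothetical response body, trimmed to the two fields used here.
sample = '[{"number": 7, "comments": 2}, {"number": 9, "comments": 0}]'
print(issues_to_rows(sample))
```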
<div>
For people who have no experience with HTML, CSS, JavaScript, and AJAX, I recommend the following tutorial resources. Three weeks ago, I knew only a little HTML and absolutely nothing about CSS, JavaScript, AJAX, or the concept of front-end development. I worked through the following tutorials, picked up all of these in three weeks, and finished the user interface development task I had at work. </div>
<div>
<ul>
<li><a href="http://www.tutorialspoint.com/">tutorialspoint</a> (English)</li>
<li><a href="https://www.udacity.com/course/front-end-web-developer-nanodegree--nd001">Front-End Web Developer Nanodegree courses on Udacity</a> (English)</li>
<li><a href="http://www.maiziedu.com/course/python/">Python Web Development on maiziedu</a> (Chinese) </li>
</ul>
<div>
For people who want to learn more about REST API development, Python full-stack web development, and HTTP, the above resources will also be helpful. </div>
</div>
<div>
<br /></div>
<div>
Last but not least, I would like to share a little bit about my motivation for learning web development. After reading many posts and studying current successful business applications, my feeling is that machine learning and web development go hand in hand. Many machine learning systems are deployed in web applications, and a complete web solution may contain machine learning algorithms as the solution to a specific problem. I guess that makes sense if you just think about what Amazon does. </div>
Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com0tag:blogger.com,1999:blog-6640354299699664736.post-25532320297702683872016-03-27T01:29:00.001-07:002016-04-03T21:48:52.528-07:00RESTful web service and OpenStack dashboard <i>In the past three weeks, I have worked on developing a RESTful web service and integrating it into the OpenStack dashboard. In this post, I will share some knowledge about them and also the tools (i.e. Flask framework, Flask RESTful framework) that helped me finish this task. </i><br />
<i><br /></i>
Nowadays many people benefit from the convenience of cloud computing services on OpenStack, Amazon Web Services, and many other cloud providers. Some may wonder how they can get a service such as a virtual machine just by clicking a few buttons in the browser. This relies on the technique of RESTful web services.<br />
<i><br /></i>
In a RESTful web service, everything can be categorized as either a resource or an operation/method on a resource. A resource is represented by a URI. There are four main kinds of operations on a resource, namely GET, POST, PUT, and DELETE, and each operation is carried out by an HTTP request. The response to an HTTP request is serialized data in JSON or XML format, which makes the exchange of information between machines efficient.<br />
<br />
To learn more about RESTful web services, many online tutorials are available. When I started to study this concept, I read lots of posts, and I found the following mapping helpful. For people who have object-oriented (OO) programming experience, the resource-oriented architecture of a RESTful web service can be understood through the OO paradigm. In OO, there are only objects and methods on objects, where a method on an object may be invoked by another object. But note that RESTful web services and OO work very differently. The former operates through HTTP requests, while the latter is just a programming paradigm.<br />
<br />
Because web application and service development involves a lot of repetitive work, many frameworks have been created to handle that work automatically, so that developers can focus on their own parts. The frameworks <a href="http://flask.pocoo.org/">Flask</a> and <a href="http://flask-restful-cn.readthedocs.org/en/0.3.4/">Flask RESTful</a> have helped me develop the RESTful web service. After exploring other Python web frameworks such as Django and Django REST, I chose these two because I found them fairly easy to learn and use, and suitable for my case. Another difference between Flask and Django is that Flask lets developers choose their own object-relational mapper (ORM), while Django provides a default ORM. At the moment, I feel more comfortable using <a href="http://www.sqlalchemy.org/">SQLAlchemy</a> as the ORM. The documentation of these frameworks provides example project code to help people get started, so a full example project is not reproduced here. Note that, as mentioned in the Flask RESTful documentation, the Python package <a href="http://docs.python-requests.org/en/master/">requests</a> is an easy way to test whether the RESTful web service or its URIs work as expected.<br />
<br />
After the RESTful web service is developed and ready for use, a user-friendly front-end dashboard will be wanted by both users and administrators. Because the RESTful web service in my case is related to OpenStack, I customized the OpenStack dashboard to host it, since system administrators and users generally prefer to do everything in one dashboard. To figure out how to integrate an external web service into the OpenStack dashboard, I mainly referred to two posts, one from <a href="http://keithtenzer.com/2015/02/16/building-custom-dashboards-in-openstack-horizon/">Keith</a> and the <a href="http://docs.openstack.org/developer/horizon/topics/tutorial.html">OpenStack tutorial</a>, which answered most of my questions. <br />
<br />
For people who want to develop their own front-end dashboard, knowledge of HTML, CSS, and JavaScript is required. HTML is in charge of the content of a webpage. CSS is in charge of the style of a webpage. And JavaScript makes a webpage more interactive. I don't have much experience to share on this aspect. But I believe that as long as we know what we want to achieve, we will find the bridge to it. Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com0tag:blogger.com,1999:blog-6640354299699664736.post-90537711341796188112016-02-27T03:19:00.000-08:002016-10-25T05:20:33.928-07:00Automating system administrative tasks of Vertica cluster on OpenStack with Python <div style="text-align: justify;">
<i>In this post, I would like to share some experience on how to automate system administration tasks with Python. This is the first time I have done such tasks. One of the big lessons I have learned is how to break a complex task down into a sequence of small tasks. </i></div>
<div style="text-align: justify;">
<i><br /></i></div>
<div style="text-align: justify;">
To meet project needs, I have recently been working on some system administration tasks. One of them is to automate the following process with a Python program. </div>
<br />
<ul>
<li style="text-align: justify;">Create a Vertica cluster on OpenStack Trove. </li>
<li style="text-align: justify;">Migrate certain users and their tables to this newly created cluster from the cluster they currently sit on. This task can be broken down into the following sequence of tasks. </li>
<ul>
<li>Connect to the current Vertica cluster and <a href="https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/EXPORT_OBJECTS.htm">export objects</a> of those users into a sql file on the machine where the python program is executed.</li>
<li>Transfer the sql file to the target Vertica cluster. </li>
<li>Connect to the target Vertica cluster with the dbadmin credentials to <a href="https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/Statements/CREATEUSER.htm">create users</a> and <a href="https://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/GRANT/GRANTDatabase.htm%3FTocPath%3DSQL%2520Reference%2520Manual%7CSQL%2520Statements%7CGRANT%2520Statements%7C_____2">grant database privilege to them</a>. </li>
<li>Connect to the target Vertica cluster with each user's credentials to <a href="https://community.dev.hpe.com/t5/Vertica-Forum/load-a-sql-file-into-vertica/td-p/218217">create their objects (schemas and tables) from sql file</a>.</li>
<li><a href="https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPYFROMVERTICA.htm">Copy table data</a> to the target Vertica cluster from the current Vertica cluster.</li>
</ul>
</ul>
<br />
<div style="text-align: justify;">
The reason why we are interested in launching a Vertica cluster is that Vertica is a massively parallel processing database (MPPDB). The MPPDB-as-a-service on the cloud would be very attractive for companies who perform analytical tasks on huge amounts of data. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The following Python packages have been used to implement the above tasks. </div>
<br />
<ul>
<li style="text-align: justify;"><a href="https://github.com/openstack/python-troveclient">troveclient</a>: Call Trove Python API. </li>
<li style="text-align: justify;"><a href="https://pypi.python.org/pypi/vertica-python">vertica-python</a>: Interact with the Vertica server on the cluster. </li>
<li style="text-align: justify;"><a href="https://docs.python.org/2/library/subprocess.html">subprocess</a>: Execute a command line on the local machine. </li>
<li style="text-align: justify;"><a href="http://www.paramiko.org/">paramiko</a>: Execute a command line on the remote machine (i.e. virtual machines that host Vertica server on OpenStack)</li>
</ul>
<div style="text-align: justify;">
Let's see how things work out with examples. </div>
<br />
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
1. As mentioned in <a href="http://www.ibm.com/developerworks/cloud/library/cl-openstack-pythonapis/">this article</a>, OpenStack is a popular open source cloud operating system for deploying infrastructure as a service, and cloud-based tasks can be automated by working with the OpenStack Python APIs. The author provides examples of the Keystone API, the Nova API, and so on, but doesn't cover the Trove API. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
An example of using the Trove API can be found in the screenshot below. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqYz7m7Ji06npPMauH6JGZU-hdPlvKVGrwNc5upk06oss4JX6dGlpK22j8zfdJXAzAKZ-z0g1h21oha-9DuMyLnUYjgWcsyj2cX5_LgY3TZeF8mGlzmDQ-bVVL-4pg8reYWA_EjIZb7Mfr/s1600/Screen+Shot+2016-02-27+at+6.07.43+PM.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqYz7m7Ji06npPMauH6JGZU-hdPlvKVGrwNc5upk06oss4JX6dGlpK22j8zfdJXAzAKZ-z0g1h21oha-9DuMyLnUYjgWcsyj2cX5_LgY3TZeF8mGlzmDQ-bVVL-4pg8reYWA_EjIZb7Mfr/s640/Screen+Shot+2016-02-27+at+6.07.43+PM.png" width="640" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
2. Most database management systems adopt the client-server paradigm, which means we can interact with the server through a client. The Python client package for Vertica is vertica-python. </div>
<div>
<br /></div>
<div>
The screenshot below provides an example of how to use vertica-python to export objects. </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga6c2atDlgxDWr0gT7t5jAUMR927p3069eFQ-DPEYFWLp9XTlSqNirLrA3Bs_JQMpLqpr6aLuD2JywzhJd6tVRraRn0kD-3FsVbbEreO1KxyYmqO0Dt0OCzXSzDVtf6QD6IE7pU0W42sQD/s1600/Screen+Shot+2016-02-27+at+6.31.44+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga6c2atDlgxDWr0gT7t5jAUMR927p3069eFQ-DPEYFWLp9XTlSqNirLrA3Bs_JQMpLqpr6aLuD2JywzhJd6tVRraRn0kD-3FsVbbEreO1KxyYmqO0Dt0OCzXSzDVtf6QD6IE7pU0W42sQD/s640/Screen+Shot+2016-02-27+at+6.31.44+PM.png" width="640" /></a></div>
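<div style="text-align: justify;">
For readers who prefer text to screenshots, here is a minimal sketch of the same idea. It is not a transcription of the screenshot. The helper names, the two-argument form of EXPORT_OBJECTS (an empty destination so that the script comes back as a query result), and the contents of <i>conn_info</i> are assumptions for illustration; check the Vertica documentation linked above for the exact arguments your version accepts. </div>

```python
def build_export_objects_sql(scope):
    """Return SQL asking Vertica to export the DDL of the given scope.

    Assumption: an empty destination ('') makes EXPORT_OBJECTS return the
    script as a query result instead of writing a file on the server.
    """
    escaped = scope.replace("'", "''")  # escape single quotes for SQL
    return "SELECT EXPORT_OBJECTS('', '{0}')".format(escaped)


def export_objects(conn_info, scope, output_path):
    """Run EXPORT_OBJECTS on the source cluster and save the DDL locally.

    conn_info is a dict such as
    {"host": ..., "port": 5433, "user": ..., "password": ..., "database": ...}.
    """
    # Third-party package; imported here so the helper above works without it.
    import vertica_python

    connection = vertica_python.connect(**conn_info)
    try:
        cursor = connection.cursor()
        cursor.execute(build_export_objects_sql(scope))
        ddl = cursor.fetchone()[0]
    finally:
        connection.close()
    with open(output_path, "w") as sql_file:
        sql_file.write(ddl)
```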
<div>
<br /></div>
<div>
<br /></div>
<br />
<div style="text-align: justify;">
3. As mentioned above, one small task is to transfer the SQL file from the local machine to a remote machine (the target Vertica cluster). <span class="n" style="color: #333333; font-family: "consolas" , "menlo" , "liberation mono" , "courier" , monospace; font-size: 12px;">subprocess</span><span class="o" style="color: #333333; font-family: "consolas" , "menlo" , "liberation mono" , "courier" , monospace; font-size: 12px; font-weight: 700;">.</span><span class="n" style="color: #333333; font-family: "consolas" , "menlo" , "liberation mono" , "courier" , monospace; font-size: 12px;">check_call() </span>is used to run the transfer command on the local machine. </div>
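<div style="text-align: justify;">
A minimal sketch of such a transfer is given below. It assumes that the file is copied with the <i>scp</i> command line tool and that SSH login uses a key pair file; the function names and parameters are hypothetical and only serve the example. </div>

```python
import subprocess


def build_scp_command(local_path, remote_user, remote_host, remote_path, key_file=None):
    """Build the scp command as a list of arguments (no shell involved)."""
    command = ["scp"]
    if key_file:
        command += ["-i", key_file]  # private key used for SSH login
    target = "{0}@{1}:{2}".format(remote_user, remote_host, remote_path)
    command += [local_path, target]
    return command


def transfer_file(local_path, remote_user, remote_host, remote_path, key_file=None):
    """Copy a local file to the remote machine.

    check_call raises subprocess.CalledProcessError if scp exits with a
    non-zero status, so a failed transfer does not go unnoticed.
    """
    subprocess.check_call(
        build_scp_command(local_path, remote_user, remote_host, remote_path, key_file)
    )
```

<div style="text-align: justify;">
Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces. </div>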
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
4. Another small task is to execute a command line on the target Vertica cluster to create a user's objects from the SQL file. So how can we execute a command line on a remote machine?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Below is a screenshot of the code that uses paramiko to do that. Note that in this case, SSH login to the remote machine is set up with a key pair file, so the parameter "KEY_FILE_PATH_LOCAL" is needed. If SSH login is set up with a password, refer to the paramiko documentation and make the corresponding changes. </div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmvDRnFySXemzN8PCAgmaK7FI6L_MJ0uomAsI7jQzOPjWDRDxvZfSoR6CZG8vccOk-_mcesSQ9IeTNsEbS6sLQ8Ic0GB4WTR5t0oVkiTWwaW8jDaNp1CqMf_0Y9v8C5gC3_LnjIF3Gb8fJ/s1600/Screen+Shot+2016-02-27+at+7.01.22+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmvDRnFySXemzN8PCAgmaK7FI6L_MJ0uomAsI7jQzOPjWDRDxvZfSoR6CZG8vccOk-_mcesSQ9IeTNsEbS6sLQ8Ic0GB4WTR5t0oVkiTWwaW8jDaNp1CqMf_0Y9v8C5gC3_LnjIF3Gb8fJ/s640/Screen+Shot+2016-02-27+at+7.01.22+PM.png" width="640" /></a></div>
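<div style="text-align: justify;">
A text sketch of the same approach follows. It is not a transcription of the screenshot. It assumes key-based SSH login and that the <i>vsql</i> client is on the remote user's PATH; the function names and parameters are hypothetical. </div>

```python
import shlex


def build_vsql_command(db_user, db_password, sql_path):
    """Build the vsql command that runs a SQL file as the given database user."""
    return "vsql -U {0} -w {1} -f {2}".format(
        shlex.quote(db_user), shlex.quote(db_password), shlex.quote(sql_path)
    )


def run_remote_command(host, ssh_user, key_file_path, command):
    """Run a command on a remote machine over SSH.

    Returns (exit_status, stdout_text, stderr_text).
    """
    # Third-party package; imported here so the helper above works without it.
    import paramiko

    client = paramiko.SSHClient()
    # Accept unknown host keys automatically. This is convenient for freshly
    # created virtual machines but weakens security.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=ssh_user, key_filename=key_file_path)
    try:
        _stdin, stdout, stderr = client.exec_command(command)
        out = stdout.read().decode()
        err = stderr.read().decode()
        exit_status = stdout.channel.recv_exit_status()
        return exit_status, out, err
    finally:
        client.close()
```

<div style="text-align: justify;">
Checking the returned exit status lets the calling program stop the migration early if creating the objects fails. </div>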
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This task requires knowledge of the operating system, the database system (i.e. Vertica), and Python. I am glad that I made it, since growth always starts with small steps. There is still room to optimize the code, which is what I will work on next. </div>
Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com0tag:blogger.com,1999:blog-6640354299699664736.post-71717970537138250282016-02-21T05:36:00.001-08:002016-02-22T03:25:44.730-08:00An Initial Exploration of Electricity Price Forecasting<div style="text-align: justify;">
<i>One month ago, I decided to explore the problem of electricity price forecasting, to gain some knowledge of the electricity market and to sharpen my skills in data management and modeling. This post describes what I have done and achieved in this project, covering problem definition, literature review, data collection, data exploration, feature creation, model implementation, model evaluation, and result discussion. Feature engineering and Spark are two skills I aim to gain and improve in this project. </i></div>
<div style="text-align: justify;">
<i><br /></i></div>
<div style="text-align: justify;">
The price of a commodity in a market influences the behavior of all market participants, from suppliers to consumers. So knowledge about the future price plays a decisive role in selling and buying decisions, for suppliers to make profits and for consumers to save costs.<br />
<br />
Electricity is the commodity in the electricity market. In a deregulated electricity market, generating companies (GENCOs) submit production bids one day ahead. When GENCOs decide on their bids, neither the electricity load nor the price for the coming day is known. So those decisions rely on forecasts of electricity load and price. Electricity load forecasting has reached an advanced stage in both industry and academia, with low enough prediction error, while electricity price forecasting is not as mature in terms of tools and algorithms. That is because the components of electricity price are more complicated than those of electricity load. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
1. Literature Review</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<i>"If I have been able to see further, it was only because I stood on the shoulder of giants." -- Newton</i></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I mainly referred to the review paper (<a href="http://www.sciencedirect.com/science/article/pii/S0142061508000884">Electricity Price Forecasting in Deregulated Markets: A Review and Evaluation</a>), which summarizes both price-influencing factors and models. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
2. Exploratory Data Analysis</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The <a href="http://mis.nyiso.com/public/P-2Alist.htm">locational based marginal price (LBMP) in the day-ahead market provided by the New York Independent System Operator (ISO)</a> is used. Because of limited computational resources, this project only forecasts the price of the "WEST" zone for the coming day.<br />
<br />
<i>"Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." -- Pedro Domingos in "<a href="http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf">A Few Things to Know about Machine Learning</a>"</i></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Based on the scatter plot and correlation coefficient, the following variables are used as model inputs. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
- Marginal cost losses</div>
<div style="text-align: justify;">
- The square of marginal cost losses</div>
<div style="text-align: justify;">
- Marginal cost congestion</div>
<div style="text-align: justify;">
- The square of marginal cost congestion</div>
<div style="text-align: justify;">
- The average price in the past 1 day</div>
<div style="text-align: justify;">
- The average price in the past 7 days</div>
<div style="text-align: justify;">
- The average price in the past 28 days</div>
<div style="text-align: justify;">
- The average price in the past 364 days</div>
<div style="text-align: justify;">
- The standard deviation of price in the past 1 day</div>
<div style="text-align: justify;">
- The standard deviation of price in the past 7 days</div>
<div style="text-align: justify;">
- The standard deviation of price in the past 28 days</div>
<div style="text-align: justify;">
- The standard deviation of price in the past 364 days</div>
<div style="text-align: justify;">
- The ratio of average price in the past 1 day over average price in the past 7 days</div>
<div style="text-align: justify;">
- The ratio of average price in the past 7 days over average price in the past 364 days</div>
<div style="text-align: justify;">
- The ratio of average price in the past 28 days over average price in the past 364 days</div>
<div style="text-align: justify;">
- The ratio of standard deviation in the past 1 day over standard deviation in the past 7 days</div>
<div style="text-align: justify;">
- The ratio of standard deviation in the past 7 days over standard deviation in the past 364 days</div>
<div style="text-align: justify;">
- The ratio of standard deviation in the past 28 days over standard deviation in the past 364 days</div>
<div style="text-align: justify;">
- The year </div>
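<div style="text-align: justify;">
The window-based variables in the list above can be sketched as follows. This is a simplified illustration in plain Python, not the Spark code of the project. It assumes hourly prices ordered from oldest to newest and uses the population standard deviation; the function and key names are made up for this example. </div>

```python
from statistics import mean, pstdev

HOURS_PER_DAY = 24


def window_features(hourly_prices, days):
    """Return (average, standard deviation) of the prices in the past `days` days."""
    window = hourly_prices[-days * HOURS_PER_DAY:]
    return mean(window), pstdev(window)


def build_features(hourly_prices, windows=(1, 7)):
    """Build average, standard deviation, and ratio features for each window."""
    features = {}
    for days in windows:
        avg, std = window_features(hourly_prices, days)
        features["avg_{0}d".format(days)] = avg
        features["std_{0}d".format(days)] = std
    # Ratio of the short-window average over the long-window average.
    short, long_ = windows[0], windows[-1]
    features["avg_{0}d_over_{1}d".format(short, long_)] = (
        features["avg_{0}d".format(short)] / features["avg_{0}d".format(long_)]
    )
    return features


# Made-up prices: six days at 10.0 followed by one day at 20.0.
prices = [10.0] * (6 * HOURS_PER_DAY) + [20.0] * HOURS_PER_DAY
print(build_features(prices))
```

<div style="text-align: justify;">
The ratio features compare recent behavior against a longer baseline, so a value well above 1 signals that prices have recently risen. </div>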
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
3. Model Development</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Linear regression is used as the initial model for exploration. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
4. Result Presentation and Analysis </div>
<br />
The mean absolute percentage error (MAPE) is used as the performance metric. The MAPE is currently around 53%, which is high. The solution can be improved in the following respects.<br />
<ul>
<li>Create more informative input variables, such as electricity load. The New York ISO provides electricity load data at 5-minute intervals. The data have been retrieved but are still being processed, which includes imputing missing values and aggregating to the hourly level. </li>
<li>Use other models like neural network. </li>
</ul>
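<div>
For reference, the MAPE mentioned above can be computed as in the sketch below. The function name is my own choice for this example, and the sketch assumes that no actual price is zero, since the error is divided by the actual value. </div>

```python
def mean_absolute_percentage_error(actual, predicted):
    """Return the MAPE in percent for two equally long sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the actual value.
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up numbers: each prediction is off by 50% of the actual value.
print(mean_absolute_percentage_error([100.0, 200.0], [150.0, 100.0]))
```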
<div>
The code developed for this project can be found <a href="https://github.com/Lili-Updating/projects/tree/master/electricity_price_prediction">here</a>. The computation was run on Spark, and special attention was paid to feature engineering. There is still a lot to do in this project in order to improve forecasting accuracy. I will try to continue this topic if I get enough time and energy. </div>
<br />Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com0tag:blogger.com,1999:blog-6640354299699664736.post-76261216093994976812016-02-03T07:24:00.002-08:002016-02-03T20:31:56.154-08:00Multi-tenant Management for Database Service on the Cloud Computing Platform <i>In this post, the concept of multi-tenancy of database service on the cloud computing platform is discussed, as well as its benefits, technical challenge, and solution architecture. And the example of car renting is given to help people understand in a more general way. The reason why I wrote this post is that I am currently working on the project of implementing parallel database service on the cloud computing platform at an extremely low cost through careful multi-tenancy management. And I realize that what I have learned from this project is not only knowledge in database and big data system but also the application of management science. </i><br />
<br />
Multi-tenancy is one of the key problems faced by cloud service providers. For a database service on the cloud computing platform, multi-tenancy means that more than one tenant can be deployed on a single database server. If it is managed well, a lot of cost and energy can be saved. But we cannot deploy tenants together at random, because we have to meet an objective more important than cost saving, which is to satisfy the service level every tenant requests. A good solution architecture must be able to handle that challenge.<br />
<br />
To help people who are unfamiliar with multi-tenancy or database services on the cloud understand the concept and its benefits, consider the following real-world example. The cloud computing platform can be thought of as a car renting company, and the tenants on the platform as the customers of that company. The only difference between them is the service. The cloud computing platform provides computing services such as a database service, while a car renting company provides a car renting service. Let's imagine that this car renting company has two customers. If the renting schedules of these two customers overlap, the company has to assign one car to each customer. In this case, two cars have to be available in order to serve them. But if their renting schedules do not overlap, the company can assign the same car to both, and only one car needs to be available. The renting company has to buy enough cars to satisfy all customers. If customers with non-overlapping renting schedules are assigned to a single car, the total number of cars the company needs to buy is reduced. This is a common strategy for a renting company to save money. So it can also be a very good strategy for the cloud computing platform.<br />
<br />
But this strategy is challenging on the cloud computing platform, because it is not scheduled in advance when a tenant uses the service. And a tenant's service level agreement may not be satisfied in every time period if the tenant is deployed on a database server together with other tenants. If we don't know when a tenant uses the service, how can we know which tenants should be put together? So in order to realize multi-tenancy, the first step is to know roughly, or predict, tenants' behavior, such as when they use the service, based on their records.<br />
<br />
Prof. Lo's group came up with a system architecture as a solution for multi-tenancy of parallel database as service on the cloud computing platform in 2013, which can be found in the paper <a href="http://www4.comp.polyu.edu.hk/~cscllo/research/sigmod13.pdf">Parallel Analytics as A Service</a>. The solution can be understood in the following simplified car renting scenario.<br />
<br />
Let's assume that the car renting company serves customers who frequently rent cars and whose schedules are not known in advance. When a customer comes in, there must be a car available for him or her. As discussed earlier, we first want to know roughly when a customer rents a car, based on his or her records, in order to find which customers can be assigned to the same car in different time periods, which in turn reduces the total number of cars the company needs to have. But how? The solution in the paper suggests that when a new customer comes in, we assign a car that only that customer can use for a certain period, such as one month. So during this month, whenever the customer wants to use a car, there is always one available, and the time periods when the customer uses the car are recorded. As time goes on, we collect all customers' records. Based on these records, we can predict when a customer rents a car and then decide, through some analysis, which customers can use the same car without schedule conflicts.<br />
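<br />
The last step of this scenario can be sketched in a few lines of Python. This is only a toy first-fit heuristic written for the car renting example, not the algorithm from the paper. The schedules, the customer names, and the function names are made up, and each usage period is a (start, end) pair of hours with the end excluded.<br />
<br />
```python
def overlaps(a, b):
    """Return True if two (start, end) periods share any time."""
    return a[0] < b[1] and b[0] < a[1]


def conflict(schedule_a, schedule_b):
    """Return True if any period of one schedule overlaps one of the other."""
    return any(overlaps(x, y) for x in schedule_a for y in schedule_b)


def assign_cars(schedules):
    """Assign each customer to the first car whose users never conflict.

    schedules maps a customer name to a list of recorded usage periods.
    Returns a list of cars, each car being a list of customer names.
    """
    cars = []
    for customer, schedule in schedules.items():
        for car in cars:
            if not any(conflict(schedule, schedules[other]) for other in car):
                car.append(customer)
                break
        else:
            # No existing car is free of conflicts, so a new car is needed.
            cars.append([customer])
    return cars


recorded = {"alice": [(9, 12)], "bob": [(13, 17)], "carol": [(10, 14)]}
print(assign_cars(recorded))
```
<br />
Here alice and bob never use a car at the same time, so they can share one, while carol overlaps with both and needs her own.<br />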
<br />
Back to the cloud computing platform, the general logic of the solution is the same. The proposed system architecture has four components: Tenant Activity Monitor, Deployment Advisor, Deployment Master, and Query Router. I am mainly in charge of implementing the first three components, which is almost done, and my teammate has finished the implementation of the Query Router. Hopefully this will benefit more people as an open source project pretty soon. Compared with a car renting service, a database service on the cloud computing platform involves many more technical details. Take this as an example. If a customer of the car renting company is assigned to a different car, he or she just goes to get it. But on the cloud computing platform, if a tenant is assigned to another database server, all its data has to be migrated to that server. But this is not the main point of this post.<br />
<br />
Considering that the active tenant ratio is very low (i.e. 10% in IBM's database-as-a-service), multi-tenancy of parallel database service on the cloud computing platform is very attractive. For any kind of service provider, such as a cloud service provider or a car renting company, resources can be utilized more efficiently and costs can be reduced a lot through careful management based on analytical methods. In July 2015, I took the job of implementing parallel database service on OpenStack as a research associate with Prof. Lo because of his group's reputation in database and big data systems, which I wanted to learn more about in order to realize my career goal. Now, besides knowledge of database and big data systems, this project also gives me a view of how resource management is applied, which helps me further understand the spirit of <b>management science</b>.<br />
<br />Anonymoushttp://www.blogger.com/profile/08099599579233144958noreply@blogger.com1tag:blogger.com,1999:blog-6640354299699664736.post-86232232454906194502015-12-27T03:17:00.000-08:002015-12-31T16:43:45.673-08:00The First Python Project in Data Science: Stock Price Prediction<div style="text-align: justify;">
<span style="font-family: inherit;"><i>In this post, I will explain what I have done in my first Python project in data science, stock price prediction, together with the code. I started to learn how to use Python for data analysis in my after-work hours at the beginning of December. So this post also serves as a summary of what I have learned in the past three weeks. I hope this post can also give readers some insight into the whole process of getting a data science project done with Python from beginning to end. I want to thank Vik from Dataquest for the comments he provided, which I have really benefited from. </i></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">My code for this project can be found <a href="https://github.com/Lili-Updating/projects/tree/master/stock_price_prediction" target="_blank">here</a> on GitHub. It can be customized by changing the settings in </span><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">settings.py</span><span style="font-family: inherit; font-size: x-small;"> </span><span style="font-family: inherit;">and it automatically handles the data acquisition and prediction tasks in this project. It can also be easily modified to adapt to other users' needs. </span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><b>1. Get and store data.</b></span><br />
<span style="font-family: inherit;"><b><br /></b></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">There are basically three kinds of data sources. </span></div>
<div style="text-align: justify;">
</div>
<ul>
<li style="text-align: justify;"><span style="font-family: inherit;">Files, such as EXCEL, CSV, and TEXT. </span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">Database, such as SQLite, MySQL, and MongoDB. </span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">Web.</span></li>
</ul>
<div style="text-align: justify;">
<span style="font-family: inherit;">The historical data used in this project is web data from <a href="https://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices" target="_blank">Yahoo Finance Historical Prices</a>. There are basically two ways to retrieve web data. </span></div>
<ul>
<li style="text-align: justify;"><span style="font-family: inherit;">Web API, such as Twitter API, Facebook API, and OpenNotify API. </span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">Web Scraping. </span></li>
</ul>
<div style="text-align: justify;">
<div style="text-align: start;">
<span style="font-family: inherit;">I chose the web scraping method to retrieve historical data, since I just learned it and wanted to practice it. In the code, the program </span><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">data_acquisition.py</span><span style="font-family: inherit; font-size: x-small;"> </span><span style="font-family: inherit;">scrapes historical data and stores them in either csv file or mysql database. </span><span style="font-family: inherit;">The function </span><span style="background-color: white; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">get_query_input</span><span style="background-color: white; color: #333333; font-family: inherit; font-size: 12px; white-space: pre;">() </span><span style="font-family: inherit;">will get the date range of historical data that the user wants to retrieve (i.e. from 2015-01-01 to 2015-12-25), form the url with parameters extracted from user inputs, and return that url. 
With that url, we then can get data by sending HTTP request to it and scraping it, which is what the function </span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">scrape_page_data</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">pageUrl</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">)</span><span style="font-family: inherit;"> does. </span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">aggregate_data_to_mysql</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">total_table</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">pageUrl</span><span style="background-color: 
white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">output_table</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">database_connection</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">) </span><span style="background-color: white; color: #333333; text-align: start; white-space: pre;"><span style="font-family: inherit;">will store the scraped data in mysql </span></span><span style="color: #333333; font-family: inherit;"><span style="white-space: pre;">database, while </span></span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">aggregate_data_to_csv</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">total_table</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">, </span><span class="pl-smi" 
style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">pageUrl</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">output_file</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; white-space: pre;">) </span><span style="color: #333333;"><span style="white-space: pre;">will store the scraped data in a csv file. </span></span><span style="font-family: inherit; text-align: justify;">Notice that these two functions are recursive functions. That is because all historical data requested may not be in the same webpage, and if not, we need to go to the next page url and keep scraping data. So the question is how we can scrape data in a webpage and then go to next page and continue this process until all historical data requested is retrieved.</span></div>
</div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">To scrape data from a webpage, we first look at its HTML source code. In Safari, right-click the webpage and select "Show Page Source". This requires a little knowledge of HTML, but it can be picked up quickly. By inspecting the HTML source code of the <a href="https://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices" target="_blank">stock price page</a>, we find that the data we want to retrieve is in the table named "<span style="background-color: white; color: #183691; font-size: 12px; white-space: pre;">yfnc_datamodoutline1</span>". CSS selectors make it easy to select the rows in the table. The next page URL can be obtained from the <i>href</i> attribute of <i>&lt;a rel="next" href=</i><span style="font-family: inherit;"><i>, </i>where the <i>&lt;a</i> tag indicates a link in HTML. </span></span></div>
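<div style="text-align: justify;">
The scrape-then-follow-the-next-link idea can be sketched as below. This is only an illustration and not the code from the project: it uses Python's standard html.parser instead of CSS selectors, it collects every table row without filtering on the table name, and the class PriceTableParser, the function scrape_all, and the two fake pages in PAGES are all made up for the example.</div>

```python
from html.parser import HTMLParser


class PriceTableParser(HTMLParser):
    """Collect table rows and the rel="next" link from one page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.next_url = None  # href of <a rel="next">, if present
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_all(url, fetch):
    """Scrape rows from url, then recurse into the rel="next" page if there is one."""
    parser = PriceTableParser()
    parser.feed(fetch(url))
    if parser.next_url is None:
        return parser.rows
    return parser.rows + scrape_all(parser.next_url, fetch)


# Two fake pages with made-up values stand in for the real site (no network needed).
PAGES = {
    "page1": '<table class="yfnc_datamodoutline1">'
             '<tr><td>2015-12-11</td><td>100.0</td></tr></table>'
             '<a rel="next" href="page2">Next</a>',
    "page2": '<table class="yfnc_datamodoutline1">'
             '<tr><td>2015-12-10</td><td>101.0</td></tr></table>',
}
all_rows = scrape_all("page1", PAGES.get)
print(all_rows)
```

<div style="text-align: justify;">
The recursion stops on the first page that has no rel="next" link. In a real scraper, fetch would download the page, and a stop condition on the requested date range would also be needed.</div>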
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><i>While writing this post, I found the <a href="https://code.google.com/p/yahoo-finance-managed/wiki/YahooFinanceAPIs" target="_blank">Yahoo! Finance APIs</a>. Next time I need this data, I will try them to see whether they make things easier. When working with data, the first thing to check is probably whether the website provides an API.</i> </span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><b>2. Explore data.</b></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">The purpose of data exploration may fall into one of the categories below. </span></div>
<div style="text-align: justify;">
</div>
<ul>
<li style="text-align: justify;"><span style="font-family: inherit;">Missing values and outliers</span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">The pattern of the target variable</span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">Potential predictors</span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">The relationship between the target variable and potential predictors</span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">The distribution of variables</span></li>
</ul>
<div style="text-align: justify;">
<span style="font-family: inherit;">Data visualization and statistics (e.g. correlation) are good ways to explore data. One specific way</span><span style="font-family: inherit;"> to check for missing values can be found </span><a href="http://stackoverflow.com/questions/28199524/best-way-to-count-the-number-of-rows-with-missing-values-in-a-pandas-dataframe" style="font-family: inherit;" target="_blank">here</a><span style="font-family: inherit;">. It turns out there are no missing values in the historical data. </span>For time series data, we set the date as the index and sort the data in ascending order, which can be done through the functions <a href="http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.set_index.html" target="_blank">pandas.DataFrame.set_index()</a> and <a href="http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.sort_index.html" target="_blank">pandas.DataFrame.sort_index()</a>. </div>
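<div style="text-align: justify;">
A minimal sketch of these two steps is shown below, assuming the data has columns called Date, Close and Volume. The rows are made up, and one Volume value is left empty on purpose so that the missing-value count has something to find.</div>

```python
import pandas as pd

# Made-up rows, deliberately out of order, with one missing Volume.
df = pd.DataFrame({
    "Date": ["2015-12-09", "2015-12-07", "2015-12-08"],
    "Close": [102.0, 100.0, 101.0],
    "Volume": [1200.0, None, 1100.0],
})
df["Date"] = pd.to_datetime(df["Date"])

# Use the date as the index and sort it in ascending order.
df = df.set_index("Date").sort_index()

# Count the rows that contain at least one missing value.
rows_with_missing = int(df.isnull().any(axis=1).sum())
print(rows_with_missing)
```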
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">The objective of this project is to predict the stock price. Take the "Close" price as an example. To predict it, we are interested in its own pattern and in its relationship with other factors. One question is which factors may influence our target variable (the "Close" price). We can gain insight into those driving factors by studying the problem more deeply and by researching the related literature and existing models. Frankly speaking, I knew almost nothing about the stock market before I started this project, so I took the indicators of the "Close" price from the project information provided by dataquest. </span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;"><span id="docs-internal-guid-2dd04215-e299-63f7-a090-dee7bd9f1eb9" style="font-family: inherit;">
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The average price from the past five days.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The average price for the past month.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The average price for the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The ratio between the average price for the past five days, and the average price for the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The standard deviation of the price over the past five days.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The standard deviation of the price over the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The ratio between the standard deviation for the past five days, and the standard deviation for the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The average volume over the past five days.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The average volume over the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The ratio between the average volume for the past five days, and the average volume for the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The standard deviation of the average volume over the past five days.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The standard deviation of the average volume over the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: justify;">
<span style="vertical-align: baseline; white-space: pre-wrap;">The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.</span></div>
</li>
<li dir="ltr" style="list-style-type: disc; text-align: justify; vertical-align: baseline;"><span style="vertical-align: baseline; white-space: pre-wrap;">The year component of the date.</span></li>
</span></ul>
<div style="text-align: justify;">
<span style="font-family: inherit;"><span style="white-space: pre-wrap;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">In Pandas, there are <a href="http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments" target="_blank">functions to compute moving (rolling) statistics</a>, such as rolling_mean and rolling_std. But we need to shift the column of rolling statistics forward by one day, for the following reason. We want the average price from the past five days for 2015-12-12 to be the mean of prices from 2015-12-07 to 2015-12-11, but the value that rolling_mean returns for 2015-12-12 is the mean of prices from 2015-12-08 to 2015-12-12. So the average price from the past five days for 2015-12-12 is actually the value that rolling_mean returns for 2015-12-11. </span></div>
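<div style="text-align: justify;">
The shift can be checked on a tiny made-up series. Note one assumption: the sketch uses the Series.rolling(...).mean() form, which replaced the older rolling_mean function in later Pandas releases. The variable names are only for illustration.</div>

```python
import pandas as pd

# Six made-up daily prices, 2015-12-07 through 2015-12-12.
dates = pd.date_range("2015-12-07", periods=6, freq="D")
close = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=dates)

# Rolling mean over a 5-row window; the window includes the current day.
rolling_5 = close.rolling(window=5).mean()

# Shift by one row so that each date only sees the previous five days.
avg_past_5 = rolling_5.shift(1)

day = pd.Timestamp("2015-12-12")
print(rolling_5.loc[day])   # mean of 12-08..12-12 = 40.0
print(avg_past_5.loc[day])  # mean of 12-07..12-11 = 30.0
```

<div style="text-align: justify;">
Without the shift, the indicator for a given day would already contain that day's price, which is exactly the value we are trying to predict.</div>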
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><b>3. Clean data. </b></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">When computing some indicators, a year of historical data is required. So those indicators for the earliest year will be missing, indicated as NaN in Python. We will just remove rows with NaN values.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">This step and following steps are done in the program </span><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">prediction</span><span style="font-family: 'Courier New', Courier, monospace; font-size: xx-small; text-align: start;">.py</span><span style="font-family: inherit;">. First, users' prediction requests will be retrieved by </span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">get_prediction_req</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">data_storage_method</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">). 
</span><span style="background-color: white; color: #333333; text-align: start; white-space: pre;"><span style="font-family: inherit;">Then historical data will be read</span><span style="font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px;"> </span><span style="font-family: inherit;">and loaded to a DataFrame </span></span><span style="background-color: white; color: #333333; text-align: start; white-space: pre;"><span style="font-family: inherit;">either from mysql database or csv file with the function </span></span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">read_data_from_mysql</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">database_connection</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">historical_data_table</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">) </span><span style="background-color: white; color: #333333; font-family: inherit; text-align: start; white-space: pre;">or </span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation 
Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">read_data_from_csv</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">historical_data_file</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">)</span><span style="background-color: white; color: #333333; font-family: inherit; text-align: start; white-space: pre;">. </span></div>
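<div style="text-align: justify;">
Removing the rows with NaN values is a single call in Pandas. A small sketch with made-up numbers follows; the column name avg_past_2 is hypothetical and stands in for any indicator that needs earlier history.</div>

```python
import pandas as pd

# The first two rows have no indicator value because there is not enough history yet.
df = pd.DataFrame({
    "Close": [10.0, 20.0, 30.0, 40.0],
    "avg_past_2": [float("nan"), float("nan"), 15.0, 25.0],
})

# Drop every row that contains at least one NaN.
clean = df.dropna(axis=0)
print(len(df), len(clean))
```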
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><b>4. Build the predictive model. </b></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">Some common predictive methods are regression, classification, decision trees, neural networks, and support vector machines. </span><span style="font-family: inherit;">For this project, multiple linear regression and random forest are chosen as the initial methods. Both are available in the </span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">sklearn </span><span style="background-color: white; color: #333333; text-align: start; white-space: pre;"><span style="font-family: inherit;">package.</span></span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;"> </span></div>
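<div style="text-align: justify;">
As a rough sketch of how the two sklearn interfaces are used, both models share the same fit/predict pattern. The toy features and target below are made up, and the settings (n_estimators=20, random_state=0) are arbitrary choices for the example, not the ones used in the project.</div>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy data: the target is an exact linear function of two made-up features.
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0

models = {
    "regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)                      # train on features X and target y
    print(name, model.predict(X[:3]))    # predict for the first three rows
```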
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><b>5. Validate and select the model. </b></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">To validate the model, two things need to be done. </span></div>
<div style="text-align: justify;">
</div>
<ul>
<li style="text-align: justify;"><span style="font-family: inherit;">Split the historical data into a training set and a test set. </span></li>
<li style="text-align: justify;"><span style="font-family: inherit;">Choose an error performance measurement and calculate it. </span></li>
</ul>
<div style="text-align: justify;">
<span style="font-family: inherit;">The function is </span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">predict</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">df</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">prediction_start_date</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">prediction_end_date</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">predict_tommorrow</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 
12px; text-align: start; white-space: pre;">)</span><span style="font-family: inherit;">, which performs predictive tasks and calculates the error </span>measurement after <span style="font-family: inherit;">calculating the indicators and cleaning the data. </span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">The big question here is how to split the historical data into a training set and a test set. Suppose the user wants to predict the price from 2015-01-03 to 2015-12-27. The training and test sets are split using the backtesting technique. Let's say the earliest date in the historical data is 1950-01-03. To predict the price for 2015-01-03, the training set is the historical data from 1950-01-03 to 2015-01-02, and the test set is 2015-01-03. To predict the price for 2015-01-04, the training set is the historical data from 1950-01-03 to 2015-01-03, and the test set is 2015-01-04. This process repeats until we have predicted values for every date from 2015-01-03 to 2015-12-27. In the code, this is done by looping over the index of the prediction period. </span></div>
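<div style="text-align: justify;">
The loop described above can be sketched as follows. This is a simplified illustration, not the predict function from the project: the function name backtest, the single feature column x, and the toy data are assumptions, and the model is refit from scratch for every date.</div>

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: ten days where y is an exact linear function of one made-up feature.
dates = pd.date_range("2015-01-01", periods=10, freq="D")
df = pd.DataFrame({"x": list(range(10))}, index=dates)
df["y"] = 2.0 * df["x"] + 1.0


def backtest(df, start, end, make_model):
    """Predict each date in [start, end] using only the rows before that date."""
    predictions = {}
    for date in df.loc[start:end].index:
        train = df[df.index < date]   # everything strictly before the date
        test = df.loc[[date]]         # the single row to predict
        model = make_model().fit(train[["x"]], train["y"])
        predictions[date] = float(model.predict(test[["x"]])[0])
    return pd.Series(predictions)


predicted = backtest(df, "2015-01-08", "2015-01-10", LinearRegression)
print(predicted)
```

<div style="text-align: justify;">
Because the training set grows by one day at each step, no prediction ever uses information from its own day or from the future.</div>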
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
The mean absolute error (MAE) is picked as the error performance measure. The model with the smaller MAE will be chosen to predict the price for tomorrow. The MAEs and the predicted value for tomorrow will be written to the file "predicted_value_for_tommorrow". The actual and predicted values for the test data will be stored in either a mysql database or a csv file by the function <span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">write_prediction_to_mysql</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">df_prediction</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">database_connection</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">predicted_output_table</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">) </span><span 
style="background-color: white; color: #333333; text-align: start; white-space: pre;"><span style="font-family: inherit;">or </span></span><span class="pl-en" style="box-sizing: border-box; color: #795da3; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">write_prediction_to_csv</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">(</span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">df_prediction</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">, </span><span class="pl-smi" style="box-sizing: border-box; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">predicted_output_file</span><span style="background-color: white; color: #333333; font-family: Consolas, 'Liberation Mono', Menlo, Courier, monospace; font-size: 12px; text-align: start; white-space: pre;">) </span><span style="background-color: white; color: #333333; text-align: start; white-space: pre;"><span style="font-family: inherit;">for further analysis, which may give some insights on where the model doesn't perform well and how to improve it. </span></span></div>
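<div style="text-align: justify;">
For reference, the MAE is the average of |actual - predicted| over the test dates. A small sketch of computing it and picking the model with the smaller value is given below. The numbers are made up, and the helper is written by hand only to show the formula.</div>

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Made-up actual values and predictions from two models.
actual = [100.0, 102.0, 101.0]
predictions = {
    "regression": [101.0, 100.0, 104.0],     # absolute errors 1, 2, 3
    "random_forest": [100.0, 101.0, 103.0],  # absolute errors 0, 1, 2
}

maes = {name: mean_absolute_error(actual, p) for name, p in predictions.items()}
best = min(maes, key=maes.get)  # the model with the smaller MAE
print(maes, best)
```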
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><b>6. Present the results. </b></span></div>
<div style="text-align: justify;">
<br />
<span style="font-family: inherit;">Besides predicting the value for tomorrow, I ran a simple experiment using the full-year data for 2014 and 2015 as the test sets, respectively. The resulting MAEs are as follows. </span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse; width: 223px;">
<!--StartFragment-->
<colgroup><col style="mso-width-alt: 3968; mso-width-source: userset; width: 93pt;" width="93"></col>
<col span="2" style="width: 65pt;" width="65"></col>
</colgroup><tbody>
<tr height="15" style="height: 15.0pt;">
<td class="xl63" height="15" style="height: 15.0pt; width: 93pt;" width="93"></td>
<td class="xl64" style="width: 65pt;" width="65"><span style="font-family: inherit;">2014</span></td>
<td class="xl64" style="width: 65pt;" width="65"><span style="font-family: inherit;">2015</span></td>
</tr>
<tr height="15" style="height: 15.0pt;">
<td class="xl63" height="15" style="height: 15.0pt;"><span style="font-family: inherit;">Regression</span></td>
<td class="xl64"><span style="font-family: inherit;">15.62</span></td>
<td class="xl64"><span style="font-family: inherit;">19.98</span></td>
</tr>
<tr height="15" style="height: 15.0pt;">
<td class="xl63" height="15" style="height: 15.0pt;"><span style="font-family: inherit;">Random Forest</span></td>
<td class="xl64"><span style="font-family: inherit;">13.99</span></td>
<td class="xl64"><span style="font-family: inherit;">19.08</span></td>
</tr>
<!--EndFragment-->
</tbody></table>
<br /></div>
<div style="text-align: justify;">
<span style="font-family: inherit;">We can have two simple findings. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">- The models perform differently in different years. </span><span style="font-family: inherit;">The MAEs show that both models perform better for the stock market in 2014.</span><br />
<span style="font-family: inherit;">- Random forest seems to perform better than multiple linear regression based on their MAE. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">The time series plots of actual and predicted values for 2014 and 2015 can be found below. As shown, the overall trend is captured well. One potential improvement for the regression model is that its predicted peaks lag slightly behind the actual ones. One potential improvement for the random forest model is that its predicted values are generally lower than the actual values. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">I will tweak algorithms more later and then present more findings. After all, the initial main purpose of this project is to get myself familiar with the full lifecycle of a data science project and Python packages/functions that are frequently used in a data science project. </span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgorG2r5WvEV3ROcH9vhqjuLuGOZXQqDoq-Y3mkToUYDxRAbK2fL5gYXh60juwAW3UbSos4DpkwV7KWZClZUbwGjvF8aP8I93Sc8kw8wuhfAbCfO7dda7gVGY6_Sdfox3x9jNC4EZoSIBs0/s1600/stock-prediction-2015.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgorG2r5WvEV3ROcH9vhqjuLuGOZXQqDoq-Y3mkToUYDxRAbK2fL5gYXh60juwAW3UbSos4DpkwV7KWZClZUbwGjvF8aP8I93Sc8kw8wuhfAbCfO7dda7gVGY6_Sdfox3x9jNC4EZoSIBs0/s400/stock-prediction-2015.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEje1Qg2cdUjoIz1E1cnDfjy2q4kPJSQyd5Pt6mj_rT8St8MmealxnfIt3Z31G4-ZoCqvDlOVrjNTlu3FpOPZBC5hyphenhyphenHcjip7iutRPEPfqt1vPN_eaMmlRTU7JJF0SQCt3QLrjrNNzqNkgeRY/s1600/stock_price_prediction.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEje1Qg2cdUjoIz1E1cnDfjy2q4kPJSQyd5Pt6mj_rT8St8MmealxnfIt3Z31G4-ZoCqvDlOVrjNTlu3FpOPZBC5hyphenhyphenHcjip7iutRPEPfqt1vPN_eaMmlRTU7JJF0SQCt3QLrjrNNzqNkgeRY/s400/stock_price_prediction.png" width="400" /></a></div>
</div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: inherit;"><b><br /></b></span></div>
A Simple but Complete Guide for OpenStack Trove (posted 2015-12-25)<div style="text-align: justify;">
<i>This post describes the framework and some details that help learners get started with OpenStack and its database service Trove. I started studying OpenStack and its Trove project from scratch five months ago. If I can do it, you can do it too. I appreciate all the guidance I have received, especially from my teammates, with whom I have worked closely every weekday. </i></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Trove is the database-as-a-service project on OpenStack. It helps users save on database infrastructure costs and eliminates administrative tasks such as deployment, configuration, and backups. To understand its benefit, imagine that you need a database server for your business. Traditionally, you have to set up the infrastructure first, install the database package, configure it, and maintain it, which costs a lot of money and time. With OpenStack Trove, you just open an account, execute some commands or click some tabs to launch a database server, and pay only for what you use.<br />
<br />
Before we start this exciting journey, I would like to highlight the following documentation and reference. </div>
<div style="text-align: justify;">
</div>
<ul>
<li><a href="http://docs.openstack.org/liberty/" target="_blank">OpenStack Documentation</a>, where you can find installation guide, admin user guide, end user guide, and command line reference. That will cover almost all tasks performed on OpenStack and its service components. Reading the documentation carefully can always help us avoid some naive mistakes and save us much time.</li>
<li><a href="http://www.amazon.com/OpenStack-Trove-Amrith-Kumar/dp/1484212223/ref=sr_1_1?ie=UTF8&qid=1450600307&sr=8-1&keywords=openstack+trove" target="_blank">OpenStack Trove</a>, where you can find comprehensive details about Trove. </li>
</ul>
<div>
The following steps can help you deploy OpenStack Trove and then operate it from scratch. </div>
<div>
<br /></div>
<div style="text-align: justify;">
1. Set up an OpenStack environment and add the identity service (Keystone), image service (Glance), compute service (Nova), dashboard (Horizon), networking service (Neutron or Nova-network), and block storage service (Cinder). </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This can be done by following the OpenStack Installation Guide, which can be found in the <a href="http://docs.openstack.org/liberty/" target="_blank">OpenStack documentation</a>. Be sure to choose the right version of the installation guide for your operating system (Ubuntu 14.04, Red Hat Enterprise Linux 7, CentOS 7, openSUSE 13.2, or SUSE Linux Enterprise Server 12) and for the version of OpenStack (Juno, Kilo, or Liberty) you want to deploy. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
2. Add the database service (Trove) to OpenStack. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This part is not covered in the official installation documentation. For Ubuntu users, follow the steps and commands provided in my earlier <a href="http://lovelearning9.blogspot.hk/2015_10_01_archive.html" target="_blank">post</a>. For other Linux distributions, I expect the steps in that post still apply, with the commands changed accordingly. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
3. Obtain or build a Trove guest image. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This is the first step in using Trove. There are currently three ways. </div>
<br />
<ul>
<li style="text-align: justify;">Download a pre-built guest image from <a href="http://tarballs.openstack.org/trove/images/ubuntu/" target="_blank">here</a>. This method is DevStack based.</li>
<li style="text-align: justify;">Build a guest image using the OpenStack Trove tools (Disk Image Builder, redstack). I tried these two tools on DevStack and on my own OpenStack deployment. With them I got a working image on DevStack but not on my own OpenStack; at the time I tried, there was probably a bug related to their DevStack dependency. By a working image, I mean one that can be used to launch a Trove instance that reaches active status. I am not sure whether the tools work well now. After struggling with them for one month, I came up with a customized way. </li>
<li style="text-align: justify;"><b>Build a guest image in a customized way. The way I used can be found in this <a href="http://lovelearning9.blogspot.hk/2015/11/build-trove-guest-image-manually-in.html" target="_blank">post</a>. I got a working image by performing those steps. That post also provides some insights into how the image works.</b></li>
</ul>
<div style="text-align: justify;">
For more details about the first two ways, see this <a href="http://www.dbta.com/BigDataQuarterly/Articles/Building-a-database-guest-image-for-OpenStack-Trove-107368.aspx" target="_blank">post</a>. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
4. Add the Trove guest image to the datastore. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The image we get from the previous step is just a QCOW2 file. To tell Trove where it is and let Trove use it, we must add it to a Trove datastore by performing steps 2-6 in the <a href="http://docs.openstack.org/admin-guide-cloud/database.html#create-a-datastore" target="_blank">database service chapter of OpenStack Administration Guide</a>. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
5. Launch a Trove instance. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This can be done either through the dashboard or through the Trove command-line client. For the latter, refer to the <a href="http://docs.openstack.org/cli-reference/content/troveclient_commands.html#troveclient_subcommand_create" target="_blank">"trove create" command in the OpenStack Command Line Reference</a>. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
6. Debug. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
OpenStack Trove users can run into some common errors. To name a few:</div>
<div style="text-align: justify;">
<ul>
<li>Status goes to "ERROR" shortly after the creation. </li>
<li>Status is stuck at "BUILD" and goes to "ERROR" after the timeout is reached. </li>
<li>No host is assigned.</li>
</ul>
</div>
<div style="text-align: justify;">
To figure out the cause, the starting point should always be the Trove logs: trove-api.log, trove-taskmanager.log, and trove-conductor.log, which by default are in the directory /var/log/trove on the Trove controller node, and trove-guestagent.log, which by default is in the directory /var/log/trove/ on the guest. </div>
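<div style="text-align: justify;">
A minimal Python sketch of that first pass over the controller-side logs: it collects every line containing "ERROR" from the three log files named above. The function name find_error_lines and the plain substring match are assumptions made for illustration, not part of Trove; adjust the directory and the marker to what your deployment actually logs. </div>

```python
import os

# Log file names taken from the text above; the directory is the stated default.
TROVE_LOG_DIR = "/var/log/trove"
TROVE_LOG_FILES = ["trove-api.log", "trove-taskmanager.log", "trove-conductor.log"]


def find_error_lines(log_dir=TROVE_LOG_DIR, log_files=TROVE_LOG_FILES, marker="ERROR"):
    """Return (file name, line number, text) for every line containing marker."""
    hits = []
    for name in log_files:
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # a service may not have written its log yet
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((name, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for name, number, text in find_error_lines():
        print("%s:%d: %s" % (name, number, text))
```

<div style="text-align: justify;">
The same function can be pointed at a copy of trove-guestagent.log fetched from the guest by passing a different log_dir and log_files. </div>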
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
With the clues the log files provide, Google and <a href="https://ask.openstack.org/en/questions/" target="_blank">Ask OpenStack</a> are good places to track down the errors. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Good luck and enjoy! </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<b>Web Scraping and Data Analysis with Python</b> (published 2015-12-20, updated 2015-12-22)<br />
<div style="text-align: justify;">
Recently I have been working on a project about predicting stock prices. Thanks to the guidance I received from Dataquest, I was excited to learn how to use web scraping to collect data and how to use various Python packages to analyze data and build models. I wish I had known about these techniques back in 2013, when I was involved in a project that required me to collect and analyze a large amount of web data. If I had known they existed, I would definitely have learned them. Programming the data collection and analysis tasks in Python would have made my work more efficient and accurate: mistakes would have been less likely, and anything that went wrong would have been easier to trace.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Over the past month and a half, I have read through the following three books. I found them very helpful for learning how to automate data-related tasks in Python.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
1. <a href="http://www.pythonlearn.com/book.php" target="_blank">Python for Informatics</a></div>
<div style="text-align: justify;">
2. <a href="http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793/ref=sr_1_1?ie=UTF8&qid=1450603625&sr=8-1&keywords=python+for+data+analysis" target="_blank">Python for Data Analysis</a></div>
<div style="text-align: justify;">
3. <a href="http://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/ref=sr_1_1?ie=UTF8&qid=1450603684&sr=8-1&keywords=web+scraping+with+python" target="_blank">Web Scraping with Python</a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Take a look at these books and take away what you need. You are also welcome to check my <a href="https://github.com/Lili-Updating/projects/blob/master/stock_price_prediction.py" target="_blank">code</a> on GitHub for the stock price prediction project. The program produces correct results, but I am still working on improving and updating it.</div>
<b>Build Trove Guest Image Manually in OpenStack</b> (published 2015-11-19, updated 2016-04-06)<br />
I came up with this way of manually building an image for OpenStack Trove, based on my own knowledge and intuition, after struggling to use Disk Image Builder in my OpenStack environment.<br />
<br />
The first step in using Trove is to obtain or build a Trove image. Compared to a regular OpenStack virtual machine image used for Nova, a Trove image has two extra essential components: the Trove guest agent (or the ability to obtain it) and a database server package.<br />
<br />
If you already have an OpenStack environment set up, you can build a Trove guest image by following the steps below.<br />
<br />
1. Launch a Nova instance with a flavor that satisfies the system requirements (CPU, RAM, swap disk, etc.) of the database server you want to install. Perform steps 2-6 on this Nova instance.<br />
<br />
2. Install and configure cloud-init on the Nova instance launched in Step 1.<br />
<br />
# apt-get install cloud-init cloud-utils cloud-initramfs-growroot cloud-initramfs-rescuevol<br />
# echo 'manage_etc_hosts: True' > /etc/cloud/cloud.cfg.d/10_etc_hosts.cfg<br />
<br />
3. Set up basic environment for trove-guestagent on the Nova instance launched in Step 1.<br />
<br />
# apt-get update<br />
# apt-get install ubuntu-cloud-keyring<br />
# echo "deb http://ubuntu-cloud.archive.canonical.com/ubuntu" \<br />
"trusty-updates/kilo main" > /etc/apt/sources.list.d/cloudarchive-kilo.list<br />
<br />
(Note: change "kilo" to the version of your OpenStack.)<br />
<br />
# apt-get update && apt-get dist-upgrade<br />
<br />
4. Install Trove and Trove-guestagent on the Nova instance launched in Step 1.<br />
<br />
# apt-get install python-trove python-troveclient trove-common trove-guestagent<br />
<br />
5. Configure Trove-guestagent on the Nova instance launched in Step 1.<br />
<br />
Edit /etc/trove/trove-guestagent.conf.<br />
<br />
[DEFAULT]<br />
rabbit_host = controller<br />
rabbit_userid =<i> RABBIT_USER</i><br />
rabbit_password = <i>RABBIT_PASS</i><br />
nova_proxy_admin_user = admin<br />
nova_proxy_admin_pass = admin<br />
nova_proxy_admin_tenant_name = admin<br />
trove_auth_url = http://controller:35357/v2.0<br />
log_file = trove-guestagent.log<br />
<br />
(Replace <i>RABBIT_USER</i> and <i>RABBIT_PASS</i> with your own values.)<br />
Refer to Step 1 in the <a href="http://docs.openstack.org/admin-guide/database.html" target="_blank">database chapter of OpenStack Administration Guide</a>.<br />
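<br />
Before taking the snapshot, it can help to confirm that none of the options above was left out. The following is only a sketch: it reads the file with Python's standard configparser, which approximates but is not the parser Trove itself uses, and the helper name missing_options is hypothetical.<br />

```python
import configparser

# Option names copied from the [DEFAULT] block above.
REQUIRED_OPTIONS = [
    "rabbit_host", "rabbit_userid", "rabbit_password",
    "nova_proxy_admin_user", "nova_proxy_admin_pass",
    "nova_proxy_admin_tenant_name", "trove_auth_url", "log_file",
]


def missing_options(conf_text, required=REQUIRED_OPTIONS):
    """Return the required [DEFAULT] options that are absent or empty."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    defaults = parser.defaults()
    return [name for name in required if not defaults.get(name, "").strip()]


if __name__ == "__main__":
    with open("/etc/trove/trove-guestagent.conf") as handle:
        print(missing_options(handle.read()))
```

An empty list means every option from the block above has a value; it says nothing about whether the values are correct.<br />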
<br />
6. Upload the database server package to the Nova instance launched in Step 1.<br />
<br />
Note: After uploading the database server package, you can either install it manually or leave it uninstalled. If you don't install it, Trove-guestagent should install it when a Trove instance boots, but the details differ from one database server to another. Refer to the source code for the database you want to use under<a href="https://github.com/openstack/trove/tree/stable/kilo/trove/guestagent/datastore" target="_blank"> /trove/guestagent/datastore</a> to find the default path to which the package should be uploaded. For example, if you use Vertica, the package should be uploaded to the root directory '/' by default, based on the value of the option "INSTALL_VERTICA" in the source file <a href="https://github.com/openstack/trove/blob/stable/liberty/trove/guestagent/datastore/experimental/vertica/system.py">/trove/guestagent/datastore/experimental/vertica/system.py</a>. The community edition of Vertica is free and can be downloaded <a href="https://my.vertica.com/community/">here</a>.<br />
<br />
7. In the OpenStack dashboard, shut off this Nova instance and take a snapshot of it.<br />
<br />
Note: In OpenStack, a snapshot is an image.<br />
<br />
8. Using the Trove client, add the snapshot to a datastore.<br />
<br />
Refer to Steps 2-6 in the <a href="http://docs.openstack.org/admin-guide-cloud/database.html#create-a-datastore" target="_blank">database chapter of OpenStack Administration Guide</a>.<br />
<br />
9. Restart all Trove services on the controller node.<br />
<br />
# service trove-api restart<br />
# service trove-taskmanager restart<br />
# service trove-conductor restart<br />
<br />
10. Launch a Trove instance.<br />
<br />
Refer to the <a href="http://docs.openstack.org/cli-reference/content/troveclient_commands.html#troveclient_subcommand_create" target="_blank">OpenStack Command Line Reference</a>.<br />
<br />
<b>Add Database Service (Trove) for OpenStack - Kilo - Ubuntu</b> (published 2015-10-27, updated 2016-04-06)<br />
Note: Replace the values written in italic capitals, such as <i>TROVE_DBPASS</i>, <i>TROVE_PASS</i>, <i>RABBIT_USER</i>, <i>RABBIT_PASS</i>, and <i>NETWORK_LABEL</i>. Also change controller to the IP address of the OpenStack controller node, if necessary.<br />
<br />
1. Prepare trove database<br />
<br />
$ mysql -u root -p<br />
mysql> CREATE DATABASE trove;<br />
mysql> GRANT ALL PRIVILEGES ON trove.* TO trove@'localhost' IDENTIFIED BY '<i>TROVE_DBPASS</i>';<br />
mysql> GRANT ALL PRIVILEGES ON trove.* TO trove@'%' IDENTIFIED BY '<i>TROVE_DBPASS</i>';<br />
mysql> FLUSH PRIVILEGES;<br />
<br />
2. Install required Trove components<br />
<br />
# apt-get install python-trove python-troveclient trove-common trove-api trove-taskmanager trove-conductor<br />
<br />
3. Prepare OpenStack<br />
<br />
$ source ~/admin-openrc.sh<br />
$ keystone user-create --name trove --pass <i>TROVE_PASS</i><br />
$ keystone user-role-add --user trove --tenant service --role admin<br />
<br />
4. Add the following configuration options to [filter:authtoken] section in /etc/trove/api-paste.ini<br />
<br />
[filter:authtoken]<br />
<br />
auth_uri = http://controller:5000<br />
auth_url = http://controller:35357<br />
auth_plugin = password<br />
project_domain_id = default<br />
user_domain_id = default<br />
project_name = service<br />
username = trove<br />
password = <i>TROVE_PASS</i><br />
<br />
5. Edit the following configuration options in the [DEFAULT] section of each of the following files<br />
<br />
/etc/trove/trove.conf<br />
/etc/trove/trove-taskmanager.conf<br />
/etc/trove/trove-conductor.conf<br />
<br />
[DEFAULT]<br />
log_dir = /var/log/trove<br />
trove_auth_url = http://controller:5000/v2.0<br />
nova_compute_url = http://controller:8774/v2<br />
cinder_url = http://controller:8776/v2<br />
swift_url = http://controller:8080/v1/AUTH_<br />
notifier_queue_hostname = controller<br />
control_exchange = trove<br />
rabbit_host = controller<br />
rabbit_userid = <i>RABBIT_USER</i><br />
rabbit_password = <i>RABBIT_PASS</i><br />
rabbit_virtual_host= /<br />
rpc_backend = trove.openstack.common.rpc.impl_kombu<br />
sql_connection = mysql://trove:<i>TROVE_DBPASS</i>@controller/trove<br />
<br />
(note: comment out any old configuration options, such as trove_auth_url and connection)<br />
<br />
6. Edit the following configuration options in [DEFAULT] section in /etc/trove/trove.conf<br />
<br />
[DEFAULT]<br />
add_addresses = True<br />
network_label_regex = ^<i>NETWORK_LABEL</i>$<br />
<br />
(note: replace NETWORK_LABEL with the label of the network you want to connect to. Find it with "nova net-list" or "neutron net-list".)<br />
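<br />
The ^ and $ around the label matter because the option is a regular expression. The sketch below uses Python's re module to show the difference between an anchored and an unanchored pattern. It assumes the pattern is applied with ordinary Python regex semantics; the labels "private", "private-ext" and "my-private" and the helper visible_networks are made up for illustration.<br />

```python
import re


def visible_networks(network_label_regex, labels):
    """Return the labels that the pattern matches, using re.search."""
    pattern = re.compile(network_label_regex)
    return [label for label in labels if pattern.search(label)]


# Hypothetical network labels, as "nova net-list" might report them.
labels = ["private", "private-ext", "my-private"]

print(visible_networks("^private$", labels))  # ['private']
print(visible_networks("private", labels))    # all three labels
```

Without the anchors, any network whose label merely contains the text would match as well.<br />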
<br />
7. Edit the following configuration options in [DEFAULT] section in /etc/trove/trove-taskmanager.conf<br />
<br />
[DEFAULT]<br />
nova_proxy_admin_user = admin<br />
nova_proxy_admin_pass = admin<br />
nova_proxy_admin_tenant_name = admin <br />
taskmanager_manager = trove.taskmanager.manager.Manager<br />
log_file = trove-taskmanager.log<br />
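<br />
Since the steps above rely on replacing several placeholders by hand, a quick check that none was left behind can save a debugging round later. This is a sketch only: the helper name leftover_placeholders is hypothetical, and it treats lines starting with # as comments.<br />

```python
import re

# Placeholder names listed in the note at the top of this post.
PLACEHOLDERS = ["TROVE_DBPASS", "TROVE_PASS", "RABBIT_USER", "RABBIT_PASS", "NETWORK_LABEL"]
PLACEHOLDER_RE = re.compile(r"\b(%s)\b" % "|".join(PLACEHOLDERS))


def leftover_placeholders(conf_text):
    """Return (line number, placeholder) pairs found on uncommented lines."""
    found = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # commented-out options are ignored
        for match in PLACEHOLDER_RE.finditer(line):
            found.append((number, match.group(1)))
    return found


if __name__ == "__main__":
    for path in ["/etc/trove/trove.conf",
                 "/etc/trove/trove-taskmanager.conf",
                 "/etc/trove/trove-conductor.conf",
                 "/etc/trove/api-paste.ini"]:
        with open(path) as handle:
            print(path, leftover_placeholders(handle.read()))
```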
<br />
8. Initialize database<br />
<br />
# trove-manage db_sync<br />
<br />
9. Edit /etc/init/trove-conductor.conf so that the following option matches<br />
<br />
--exec /usr/bin/trove-conductor -- --config-file=/etc/trove/trove-conductor.conf ${DAEMON_ARGS}<br />
<br />
10. Edit /etc/init/trove-taskmanager.conf so that the following option matches<br />
<br />
--exec /usr/bin/trove-taskmanager -- --config-file=/etc/trove/trove-taskmanager.conf ${DAEMON_ARGS}<br />
<br />
11. Configure the Trove Endpoint in Keystone<br />
<br />
$ keystone service-create --name trove --type database --description "OpenStack Database Service"<br />
$ keystone endpoint-create \<br />
--service-id $(keystone service-list | awk '/ trove / {print $2}') \<br />
--publicurl http://controller:8779/v1.0/%\(tenant_id\)s \<br />
--internalurl http://controller:8779/v1.0/%\(tenant_id\)s \<br />
--adminurl http://controller:8779/v1.0/%\(tenant_id\)s \<br />
--region regionOne<br />
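<br />
The %\(tenant_id\)s part of each URL is a Python-style named placeholder; the backslashes only keep the shell from interpreting the parentheses. As far as I understand, the placeholder is filled in with the tenant ID of whoever requests the service catalog. The sketch below shows that substitution with plain Python string formatting. The tenant ID "0123abcd" is made up, and expand_endpoint is a hypothetical helper, not a Keystone function.<br />

```python
# The URL template as stored once the shell has removed the backslashes
# that protect the parentheses.
ENDPOINT_TEMPLATE = "http://controller:8779/v1.0/%(tenant_id)s"


def expand_endpoint(template, tenant_id):
    """Fill the %(tenant_id)s placeholder with Python %-formatting."""
    return template % {"tenant_id": tenant_id}


# "0123abcd" is a made-up tenant ID used only for illustration.
print(expand_endpoint(ENDPOINT_TEMPLATE, "0123abcd"))
# prints http://controller:8779/v1.0/0123abcd
```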
<br />
12. Restart the Trove Services<br />
<br />
$ sudo service trove-api restart<br />
$ sudo service trove-taskmanager restart<br />
$ sudo service trove-conductor restart<br />
<br />
<b>Introduction to MySQL and SQL for Absolute Beginners or as Quick Refresher</b> (published 2015-07-09)<br />
Database management systems (DBMS) have become ever more important in the data era. MySQL is an open-source relational database management system (RDBMS). It can be downloaded for free <a href="http://dev.mysql.com/downloads/" target="_blank">here</a>. Instructions for installation, testing, and simple operations are provided <a href="http://www.elated.com/articles/mysql-for-absolute-beginners/" target="_blank">here</a>. More SQL commands can be found on the website of <a href="http://www.sqlcourse.com/" target="_blank">SQLCourse</a>, which is also an interactive online SQL training platform.<br />
<br />
<b>An Introductory Book to Business Intelligence</b> (published 2015-07-06)<br />
Business intelligence (BI) has been a hot concept, and I had been wondering what exactly it is. Now I have a satisfactory answer from the book <a href="http://smile.amazon.com/Business-Intelligence-Decisions-through-Analytics-ebook/dp/B004UMOYBA/ref=sr_1_1?ie=UTF8&qid=1436191311&sr=8-1&keywords=business+intelligence+making+decisions+through+data+analytics" target="_blank">Business Intelligence: Making Decisions through Data Analytics</a>. This book gives a thorough and systematic introduction to BI and its tools. Basically, BI and its tools are derived from the following four areas: statistics and econometrics, operations research, artificial intelligence, and database technologies. This book successfully connects what I have already learned to BI.<br />
<br />
I think this book is a good place to start exploring BI.<br />
<br />
<b>A Tutorial Website for Simple Learning - Tutorialspoint</b> (published 2015-06-25)<br />
The tutorials library provided by <a href="http://www.tutorialspoint.com/index.htm" target="_blank">Tutorialspoint</a> covers a broad range of computer-related technology topics and focuses on the main points. Self-learners may find the topics they want to learn more accessible and grasp the key points in a short time. It also provides an online terminal and IDEs for practice.<br />
<br />
The Hadoop tutorial quickly helped me understand the function, structure, and operation of Hadoop as a solution for big data.<br />
<br />
<b>An Introductory Book to Data Mining</b> (published 2015-06-21, updated 2015-06-26)<br />
<a href="http://www.amazon.com/Discovering-Knowledge-Data-Introduction-Applications/dp/0470908742/ref=sr_1_1?ie=UTF8&qid=1434934653&sr=8-1&keywords=discovering+knowledge+in+data" target="_blank">Discovering Knowledge in Data: An Introduction to Data Mining</a> is a pretty interesting and straightforward book for beginners in data mining. I read five chapters of it in a single afternoon. It is that engaging.<br>
<br>
The course materials (i.e. data sets, homework, project) of the <a href="https://datamining.bus.utk.edu/syllabus.asp" target="_blank">data mining course</a> provided by the University of Tennessee can also serve as a very good practical reference.<br>
<br>
<b>SAS Global Certification Program</b> (published 2015-06-15)<br />
This is the era of big data. People with data analysis and modeling skills can find plenty of job opportunities. The <a href="http://support.sas.com/certify/" target="_blank">SAS Global Certification Program</a> provides a bridge into this profession by getting them certified.<br />
<br />
Based on my own experience, preparing for a certification exam helps you study the software and the theory specifically and thoroughly within an organized time frame.<br />
<br />
Currently I hold the following certificates.<br />
<br />
- SAS Certified Base Programmer for SAS 9 (08/2014)<br />
- SAS Certified Advanced Programmer for SAS 9 (12/2014)<br />
- SAS Certified Statistical Business Analyst Using SAS 9: Regression and Modeling (05/2015)<br />
- SAS Certified Predictive Modeler Using Enterprise Miner 7 (06/2015)<br />
<br />
I mainly used the SAS online tutor and course notes. If you need any materials for study purposes only, please leave a comment.<br />
<br />
<b>An Introductory Example to Lagrangian Relaxation</b> (published 2015-06-09)<br />
The example provided <a href="http://mat.gsia.cmu.edu/classes/mstc/relax/node9.html" target="_blank">here</a> makes Lagrangian relaxation (LR) no longer a mystery to me. Each step becomes very clear. I have read a lot of material on LR, and this is one of the best for beginners who want to learn how LR works. <br />
<br />
An implementation of LR with C++ in ILOG CPLEX can be found <a href="http://ieor.berkeley.edu/~atamturk/ieor264/samples/concert/lagrangian.cpp" target="_blank">here</a>. <br />
<br />
That makes me think that there are probably no truly unsolvable problems; we just haven't found the right way to approach them.