Lili's Sharings

My Learning Notes on Artificial Neural Network

2016-07-08T03:58:00.000-07:00

This post summarizes some key ideas for understanding the intuitions, theories, and implementation of artificial neural networks, based on my own learning experience. My approach is to ask questions and then explore the answers. It turns out that almost every question about artificial neural networks deserves much more effort to dig into.

Note that this post DOES NOT aim to introduce the artificial neural network algorithm systematically. Many excellent, comprehensive tutorials on deep learning and neural networks can be found online. Some are listed in the post. Feel free to correct me if anything in this post is incorrect. After all, I am still learning.

In March 2016, AlphaGo beat a human champion at the ancient Chinese board game Go. A key to AlphaGo's win is the deep learning algorithm it uses. Beyond that, deep learning's achievements in applications such as computer vision, speech recognition, and natural language processing have made it very attractive. Deep learning is derived from the artificial neural network, with some additional techniques included. To learn deep learning, I think it is reasonable to learn artificial neural networks first.

As I work to grasp the intuitions, theories, and implementation of artificial neural networks, I have found that they make sense once we understand the following key ideas.

Activation Function
Cost Function
Gradient Descent Algorithm
Backpropagation Algorithm
Artificial Neural Network Architecture

Activation Function

The ultimate goal of an artificial neural network, or of any supervised machine learning algorithm, is to find a functional relationship that maps the inputs to the outputs as accurately as possible. An advantage of the artificial neural network over other machine learning algorithms is that it is capable of representing any kind of relationship, especially nonlinear ones. Here is a visual proof that neural nets can compute any function. The more accurately an algorithm models the functional relationship between inputs and outputs, the more accurately the resulting model can predict unknown outputs from known inputs.

The technique that makes an artificial neural network able to represent any function is the activation function. Commonly used activation functions are the log-sigmoid, tan-sigmoid, and softmax functions.

To further explain why the activation function makes the neural network capable of representing any kind of relationship, let's take a look at the visualization of sigmoid function.

source: http://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html

As shown in the above figure, roughly speaking, we can notice the following properties of the sigmoid function.

For -1 <= z <= 1, the function maps a linear relationship.
For -5 < z < -1 and 1 < z < 5, the function maps a nonlinear relationship.
For z <= -5 and z >= 5, the function maps a constant relationship.
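The three regimes above can be checked numerically. This is a minimal sketch, assuming the function in the figure is the standard logistic sigmoid 1 / (1 + e^(-z)); the sample points are arbitrary choices of mine.

```python
import math

def sigmoid(z):
    """Standard logistic (log-sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

# Near zero the curve stays close to the straight line 0.5 + z/4;
# far from zero it flattens out towards 0 or 1.
for z in (-10, -5, -1, 0, 1, 5, 10):
    print(f"z = {z:>3}  sigmoid(z) = {sigmoid(z):.4f}")
```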

Back to the artificial neural network, it is made up of neurons and links between neurons. On each neuron, there are two steps to be done.

Sum up the weighted inputs.
Calculate the activation function with the sum of the weighted inputs if the threshold value is reached.

In some tutorials, these two steps are not talked about explicitly. But I feel it is easy to understand what a neuron does in two steps.
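The two steps can be sketched in a few lines. This is only an illustration: the name neuron_output is mine, the sigmoid is used as the activation function, and the threshold is modeled as a bias term added to the weighted sum, which is one common convention rather than the only one.

```python
import math

def neuron_output(inputs, weights, bias):
    """Two steps of a single neuron: weighted sum, then activation."""
    # Step 1: sum up the weighted inputs (the bias plays the role of the threshold).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: pass the sum through the activation function (sigmoid here).
    return 1.0 / (1.0 + math.exp(-z))

# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 = 0, so the sigmoid gives 0.5.
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```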

For more details about the activation function, refer to artificial neural networks/activation function in wikibooks.

At this moment, it is natural to ask “Why do we use the above functions as activation functions in neural network?” We already have some idea about how they can map different kinds of relationship. Another benefit is that they constrain the output to a bounded range: from 0 to 1 for the log-sigmoid and softmax functions, and from -1 to 1 for the tan-sigmoid function. And then why do we want to constrain the output to the range from 0 to 1? Based on the section “Sigmoid Neurons” in Chapter 1 Using neural nets to recognize handwritten digits of Michael Nielsen’s book Neural Networks and Deep Learning, one reason is that small changes in the weights and bias cause only small changes in the output. We can also notice that an output between 0 and 1 is easy to interpret as a probability.

Cost Function

In parametric models, such as regression and neural networks, different parameters give different predictions/outputs. We want to find the set of parameters that minimizes the difference between actual values and predictions.

How do we measure the difference between actual values and predictions (i.e. the error)?

In general, there are two ways.

1. Sum of squared difference between predicted output and actual value of each observation in the training set.
2. Sum of negative log-likelihood between predicted output and actual value of each observation in the training set.

The function that describes the squared error or negative log-likelihood is called the cost function. And we want to minimize it.
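Both measures are short to write down. The sketch below assumes binary labels (0 or 1) and predicted probabilities strictly between 0 and 1 for the negative log-likelihood; the function names and the sample numbers are mine, not from any library.

```python
import math

def sum_squared_error(actual, predicted):
    """Sum of squared differences over all observations."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted))

def negative_log_likelihood(actual, predicted):
    """Negative log-likelihood for binary labels and predicted probabilities."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(actual, predicted))

actual = [1, 0, 1]
good = [0.9, 0.1, 0.8]   # predictions close to the actual values
bad = [0.4, 0.6, 0.3]    # predictions far from the actual values

# Both costs are smaller for the better predictions.
print(sum_squared_error(actual, good), sum_squared_error(actual, bad))
print(negative_log_likelihood(actual, good), negative_log_likelihood(actual, bad))
```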

Gradient Descent Algorithm

At this point, we want to minimize some cost function, which is complicated and, in cases like the artificial neural network, has no closed-form solution.

How can we perform the optimization process?

The gradient descent algorithm can work the magic! It is an optimization technique used to gradually reach a local minimum of a complex function. Here is a beautiful and commonly used analogy for what the gradient descent algorithm does. A complex function can be visualized as adjacent mountains and valleys. There are tops and bottoms. Imagine we are standing at some point in a valley and want to climb down to the bottom. But the fog is very heavy and our sight is limited. We cannot see a whole path down to the bottom and simply follow it. All we can do is choose a next step that brings us down a little bit. There are many directions in which we could take that step, but the direction that brings us down the most is favored. The gradient descent algorithm finds that direction and then moves us down by one step.

Mathematically speaking, the gradient descent algorithm finds that direction by calculating the partial derivative of the function. What the function means here is the cost function. This brings another advantage of commonly used activation functions mentioned earlier – their partial derivatives are easy to compute.
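To make the idea concrete, here is a minimal sketch of gradient descent on a toy one-dimensional cost, f(x) = (x - 3)^2, whose derivative is 2(x - 3). The toy function, the step size, and the number of steps are my own choices for illustration; a real neural network has many parameters and a far more complicated cost.

```python
def gradient_descent(grad, start, step_size, n_steps):
    """Repeatedly take a small step against the gradient."""
    x = start
    for _ in range(n_steps):
        x = x - step_size * grad(x)
    return x

def grad_f(x):
    """Derivative of the toy cost f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)

x_min = gradient_descent(grad_f, start=0.0, step_size=0.1, n_steps=100)
print(x_min)  # very close to 3, the bottom of the valley
```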

There are two questions related to the gradient descent algorithm.

What step size should we choose? If our step size is too small, it takes long to reach the bottom. But if our step size is too big, we may step over the bottom and miss it forever.
What initial position (i.e. weights, biases) should we choose? We must start from somewhere. A set of parameters of weights and biases in the artificial neural network gives an initial position.

The rough answer I have for these two questions at the moment is to do some experiments. I guess that as our experience grows, it becomes easier to judge these two choices well. Moreover, more scientific approaches may be found in the literature.
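A small experiment of that kind, again on the toy cost f(x) = (x - 3)^2, shows both failure modes of the step size. The three step sizes and the fixed budget of 20 steps are arbitrary values I picked for the illustration.

```python
def distance_from_bottom(step_size, n_steps=20, start=0.0):
    """Run gradient descent on f(x) = (x - 3)^2 and report how far we end from x = 3."""
    x = start
    for _ in range(n_steps):
        x -= step_size * 2.0 * (x - 3.0)
    return abs(x - 3.0)

# Too small: still far away after 20 steps. Too big: each step overshoots further.
for step_size in (0.01, 0.1, 1.1):
    print(step_size, distance_from_bottom(step_size))
```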

There is one question related to the cost function.

Why do we favor the quadratic format of the difference between predicted output and actual value as the cost function?

The first reason must be that it helps measure the prediction accuracy. As for other reasons, when Michael Nielsen discusses why we introduce the quadratic cost in his book, he mentions that the quadratic function is smooth, which makes it easy to figure out how to make small changes to the weights and biases in order to improve the output.

Backpropagation Algorithm

The backpropagation algorithm is a set of rules to update the weights and biases of an artificial neural network by partitioning the prediction error into all neurons with the aid of gradient descent on each neuron.

Chapter 7 Neural Networks in the book Discovering Knowledge in Data: An Introduction to Data Mining illustrates with a simple example how to perform backpropagation in a neural network by hand.
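For readers who prefer code to hand calculation, below is a minimal sketch of backpropagation on the smallest network I could think of: one input, one sigmoid hidden neuron, one sigmoid output neuron, and the quadratic cost 0.5 * (output - target)^2. The network shape, the function names, and the numbers are my own assumptions; it is not the example from the book.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: return the hidden activation and the output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    o = sigmoid(w2 * h + b2)
    return h, o

def cost(params, x, y):
    _, o = forward(params, x)
    return 0.5 * (o - y) ** 2

def backprop(params, x, y):
    """Gradient of the cost with respect to [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    h, o = forward(params, x)
    # Error at the output neuron: dC/do times the sigmoid derivative o * (1 - o).
    delta_o = (o - y) * o * (1.0 - o)
    # Share of that error passed back to the hidden neuron through w2.
    delta_h = delta_o * w2 * h * (1.0 - h)
    return [delta_h * x, delta_h, delta_o * h, delta_o]

def train(params, x, y, step_size=0.5, n_steps=1000):
    """Plain gradient descent using the backpropagated gradients."""
    params = list(params)
    for _ in range(n_steps):
        grads = backprop(params, x, y)
        params = [p - step_size * g for p, g in zip(params, grads)]
    return params

start = [0.3, -0.1, 0.8, 0.2]
print(cost(start, 1.5, 1.0), cost(train(start, 1.5, 1.0), 1.5, 1.0))
```

The error at the output is computed first and then split backwards onto the hidden neuron, which is the "partitioning" of the prediction error described above.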

Artificial Neural Network Architecture

In the construction process of an artificial neural network, there are two hard questions related to hidden layers, compared to the input layer and output layer. Note that the input layer can be decided based on the feature engineering and the output layer is pretty straightforward.

How many hidden layers should we include in order to achieve the best?
How many neurons should we include in each hidden layer?

More sigmoid hidden layers and neurons will increase the capacity of the neural network but will also tend to cause overfitting.

Many tutorials mention that some heuristics help. I guess it really takes lots of experiments and domain knowledge to make a good choice.

A Guide to Database Service Trove for OpenStack Liberty on Ubuntu 14.04

2016-05-17T04:04:00.001-07:00

Let's start with some brief introduction of four components of Trove and their functions in the whole OpenStack environment.

We can imagine the whole OpenStack as a construction company which helps people build houses. Note that virtual machine servers built by OpenStack are analogous to houses built by this construction company. The components of OpenStack are like departments of the company, which provide different services to make the company run well. The Keystone department is responsible for verifying customers' identities. The Glance department is responsible for providing the blueprint of the house. The Nova department is responsible for construction work. The Cinder department is responsible for providing bricks.

In this construction company, we are going to establish a department called Trove. It is responsible for installing a piece of intelligent equipment in the house. It is common sense that people take different responsibilities in a department. So in the department of Trove, there are three full-time guys working there.

Trove-api: He acts like the department front-desk. All requests from customers come to him first. He is responsible for handling and dispatching requests. For example, suppose a customer requests the installation of a piece of intelligent equipment but has no house to place it in, and permits the Trove department to do everything needed to install that equipment. Trove-api will tell the manager Trove-taskmanager about the request.

Trove-taskmanager: He acts like the department manager. He has the right to request other departments (Nova, Cinder, Glance, etc) to help him finish the work of building the house. Because in order to install the intelligent equipment, an appropriate house has to be built first. So Trove-taskmanager will ask other departments to help build the house first. Once the house is ready, Trove-taskmanager will get the message. He will hire a new technician guy called Trove-guestagent and send him to go to the house and install the intelligent equipment.

Trove-conductor: He acts like the department secretary. The technician guy Trove-guestagent will update the progress of installation of the intelligent equipment, such as what error he encounters, and whether the installation is completed successfully.

Trove-guestagent: He acts like the technician guy who is responsible for installing the intelligent equipment. He sends the installation progress message to the secretary Trove-conductor.

These four guys talk to each other by cell phone, which is the role the Rabbit client plays in OpenStack.

The deployment below assumes that an OpenStack environment has been set up and runs correctly with services Keystone, Horizon, Glance, Nova, Cinder, and Swift. Note that Swift is optional for Trove deployment unless the backup operation is expected.

Note: Replace capital words like TROVE_DBPASS, NETWORK_LABEL, RABBIT_USERNAME, RABBIT_PASS, TROVE_PASS with appropriate values.
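It is easy to leave one of these placeholders behind in a configuration file. The following is a small sketch of a check that can be run over the text of each edited file; the function name find_unreplaced is my own, and the list only covers the placeholders named in the note above.

```python
import re

# Placeholder names taken from the note above.
PLACEHOLDERS = ("TROVE_DBPASS", "NETWORK_LABEL", "RABBIT_USERNAME",
                "RABBIT_PASS", "TROVE_PASS")

def find_unreplaced(config_text):
    """Return the placeholders that still appear verbatim in the text."""
    return [name for name in PLACEHOLDERS
            if re.search(r"\b" + name + r"\b", config_text)]

sample = (
    "connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove\n"
    "rabbit_password = s3cret\n"
)
print(find_unreplaced(sample))
```

To check a real file, read it first, for example with open("/etc/trove/trove.conf").read(), and pass the text to find_unreplaced.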

Prerequisites

Create the database for trove.

$ mysql -u root -p
> CREATE DATABASE trove;
> GRANT ALL PRIVILEGES ON trove.* TO 'trove'@'localhost' IDENTIFIED BY 'TROVE_DBPASS';
> GRANT ALL PRIVILEGES ON trove.* TO 'trove'@'%' IDENTIFIED BY 'TROVE_DBPASS';
> FLUSH PRIVILEGES;

Source the admin credentials of OpenStack environment.

$ source admin-openrc.sh

Create the service credentials for trove.

$ openstack user create --domain default --password-prompt trove
$ openstack role add --project service --user trove admin
$ openstack service create --name trove --description "OpenStack Database Service" database

Create the database service endpoints.

$ openstack endpoint create --region RegionOne database public http://controller:8779/v1.0/%\(tenant_id\)s
$ openstack endpoint create --region RegionOne database internal http://controller:8779/v1.0/%\(tenant_id\)s
$ openstack endpoint create --region RegionOne database admin http://controller:8779/v1.0/%\(tenant_id\)s

Install and Configure Components

Install the packages.

# apt-get install python-trove python-troveclient trove-common trove-api trove-taskmanager trove-conductor

Edit the /etc/trove/trove.conf file.

[database]

connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove

[DEFAULT]

verbose = True
debug = True
rpc_backend = rabbit
auth_strategy = keystone
trove_auth_url = http://controller:5000/v2.0
nova_compute_url = http://controller:8774/v2
cinder_url = http://controller:8776/v2
neutron_url = http://controller:9696/
add_addresses = True
network_label_regex = ^NETWORK_LABEL$
log_dir = /var/log/trove/
log_file = trove-api.log

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

[keystone_authtoken]

auth_uri = http://controller:5000
auth_url = http://controller:35357
auth_plugin = password
project_domain_id = default
user_domain_id = default
project_name = service
username = trove
password = TROVE_PASS

Edit the /etc/trove/trove-taskmanager.conf file.

[database]

connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove

[DEFAULT]

verbose = True
debug = True
rpc_backend = rabbit
trove_auth_url = http://controller:5000/v2.0
nova_compute_url = http://controller:8774/v2
cinder_url = http://controller:8776/v2
neutron_url = http://controller:9696/
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
nova_proxy_admin_tenant_id = *************
taskmanager_manager = trove.taskmanager.manager.Manager
add_addresses = True
network_label_regex = ^NETWORK_LABEL$
log_dir = /var/log/trove/
log_file = trove-taskmanager.log
guest_config = /etc/trove/trove-guestagent.conf
guest_info = guest_info.conf
injected_config_location = /etc/trove/conf.d
cloudinit_location = /etc/trove/cloudinit

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

Edit the /etc/trove/trove-conductor.conf file.

[database]

connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove

[DEFAULT]

verbose = True
debug = True
trove_auth_url = http://controller:5000/v2.0
rpc_backend = rabbit
log_dir = /var/log/trove/
log_file = trove-conductor.log

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

Edit the /etc/trove/trove-guestagent.conf file.

[DEFAULT]

verbose = True
debug = True
trove_auth_url = http://controller:5000/v2.0
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
nova_proxy_admin_tenant_id = *************
rpc_backend = rabbit
log_dir = /var/log/trove/
log_file = trove-guestagent.log
datastore_registry_ext = vertica:trove.guestagent.datastore.experimental.vertica.manager.Manager
(Based on the notes above datastore_registry_ext in the configuration file, add datastore you want.)

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

Edit the /etc/init/trove-taskmanager.conf file.

--exec /usr/bin/trove-taskmanager -- --config-file=/etc/trove/trove-taskmanager.conf ${DAEMON_ARGS}

Edit the /etc/init/trove-conductor.conf file.

--exec /usr/bin/trove-conductor -- --config-file=/etc/trove/trove-conductor.conf ${DAEMON_ARGS}

Populate the trove service database.

# trove-manage db_sync

Finalize Installation

Restart the database services.

# service trove-api restart
# service trove-taskmanager restart
# service trove-conductor restart

Remove the SQLite database file.

# rm -f /var/lib/trove/trove.sqlite

Build database guest image for OpenStack Trove.

Reference:
Building a guest database image for OpenStack Trove
Build Trove guest image manually in OpenStack

Configure the database guest.

Edit the /etc/trove/trove-guestagent.conf file.

[DEFAULT]

verbose = True
debug = True
trove_auth_url = http://controller:5000/v2.0
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
nova_proxy_admin_tenant_id = *************
rpc_backend = rabbit
log_dir = /var/log/trove/
log_file = trove-guestagent.log
datastore_registry_ext = vertica:trove.guestagent.datastore.experimental.vertica.manager.Manager
(Based on the notes above datastore_registry_ext in the configuration file, add datastore you want.)

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = openstack
rabbit_password = RABBIT_PASS

Edit the /etc/init/trove-guestagent.conf file.

--exec /usr/bin/trove-guestagent -- --config-file=/etc/trove/conf.d/guest_info.conf --config-file=/etc/trove/trove-guestagent.conf ${DAEMON_ARGS}

Debug Common Problems.

Read logs and find out specific information. The first place to check is trove-taskmanager.log on the controller node and then trove-guestagent.log on the guest.

No corresponding Nova instances are launched.

Check the configuration values for nova_proxy_admin_user, nova_proxy_admin_pass, nova_proxy_admin_tenant_name, nova_proxy_admin_tenant_id.

Trove taskmanager, trove conductor, and trove guestagent cannot talk to each other via RPC.

Check their logs to see whether there is a "connected to AMQP" message. If there is no such message, check the configuration values of rabbit_host, rabbit_userid, rabbit_password, and rpc_backend.
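Searching the logs by hand gets tedious, so here is a small sketch of the same check in Python. The function names are mine, the match is case-insensitive on the phrase quoted above, and the exact wording of the log message may differ between versions.

```python
def has_amqp_connection(log_lines):
    """True if any log line mentions a successful AMQP connection."""
    return any("connected to amqp" in line.lower() for line in log_lines)

def check_log_file(path):
    """Apply the check to a log file, e.g. /var/log/trove/trove-taskmanager.log."""
    with open(path, errors="replace") as log_file:
        return has_amqp_connection(log_file)

print(has_amqp_connection(["INFO Connected to AMQP server on controller:5672"]))
```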

The guest_info.conf cannot be injected to the guest.

1. Install the following packages on compute nodes

# apt-get install libguestfs-tools python-libguestfs

# sudo update-guestfs-appliance

# sudo usermod -a -G kvm yourlogin

# chmod 0644 /boot/vmlinuz*

2. Add the following configuration options in /etc/nova/nova-compute.conf on compute nodes.

[DEFAULT]

compute_driver=libvirt.LibvirtDriver

rootwrap_config = /etc/nova/rootwrap.conf

[libvirt]

virt_type=qemu

inject_partition = -1

3. Restart nova-compute service on compute nodes

# service nova-compute restart

Trove guestagent cannot get guest id from guest_info.conf and Trove instance is stuck at 'BUILD' status forever, but guest info is already injected to the guest.

Check the /etc/init/trove-guestagent.conf file on the guest and make sure /etc/trove/conf.d/guest_info.conf is passed as one of the configuration files.

MySQL Router: A High Availability Solution for MySQL

2016-04-24T06:50:00.000-07:00

Nowadays more and more companies use cloud computing systems and services. Cloud service users want their service level guaranteed. Take a database cloud service as an example. Users expect the database on the cloud to be available whenever they use it. But something can always make the database unavailable, such as a network failure or an unexpected database shutdown. So a big problem for cloud service providers is how to guarantee the service level declared in the service level agreement. High availability solutions address that problem. "High availability" is a fundamental and important concept in cloud computing. It means the service or server on the cloud is available (up) at a high rate, such as 99.99% of the time in a given year.

In this post, I would like to introduce one high availability solution for MySQL -- MySQL Router. First, I will describe what it is. And then I will show how to use it.

What is MySQL Router?

MySQL Router is middleware between a MySQL client and a MySQL server, which redirects a query from the MySQL client to a MySQL server. What we usually do is use a MySQL client to connect directly to a MySQL server. So why do we want this middleware? The reason is that, with MySQL Router, we can set up more than one backend MySQL server. When one server is down, MySQL Router can automatically redirect the query to another available server, which helps guarantee the service level.

How to use MySQL Router?

1. Set up MySQL backend servers.

What I did is to launch two virtual machines with Ubuntu Linux system and install a MySQL server on each virtual machine. Another setup scenario is introduced in MySQL Router tutorial. In real practice, the MySQL backend servers are set in the replication topology.

Command of installing MySQL server on Ubuntu:
shell> sudo apt-get install mysql-server

2. Set up MySQL Router.

-- Download the MySQL APT repository.
shell> wget http://dev.mysql.com/get/mysql-apt-config_0.7.2-1_all.deb

-- Install the MySQL APT repository.
shell> sudo dpkg -i mysql-apt-config_0.7.2-1_all.deb

-- Update the APT repository.
shell> sudo apt-get update

-- Install MySQL Router.
shell> sudo apt-get install mysql-router

-- Install MySQL Utilities.
shell> sudo apt-get install mysql-utilities

3. Configure MySQL Router.

The configuration is under /etc/mysqlrouter/mysqlrouter.ini.

The configuration instructions can be found here. The most important part is the option 'destinations' in the [routing] section. Its format is host_ip:port. For example, if there are two MySQL servers on the host 192.168.25.200/201 respectively, the 'destinations' can be set as follows.

destinations = 192.168.25.200:3306, 192.168.25.201:3306
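Before starting the router, it can help to confirm that every destination is reachable. Below is a minimal sketch that parses a 'destinations' value in the host_ip:port format above and tries a TCP connection to each server; the function names and the 2-second timeout are my own choices.

```python
import socket

def parse_destinations(value):
    """Turn 'host:port, host:port' into a list of (host, port) tuples."""
    servers = []
    for item in value.split(","):
        host, _, port = item.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers

def is_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

servers = parse_destinations("192.168.25.200:3306, 192.168.25.201:3306")
print(servers)
```

Calling is_reachable(host, port) for each pair in servers then shows which backends the router will be able to use.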

4. Start MySQL Router.

shell> sudo mysqlrouter -c /etc/mysqlrouter/mysqlrouter.ini

5. Set up MySQL client to connect to MySQL Router.

A python script will be used here as MySQL client. Based on the documentation, the client must be executed on the same machine where the MySQL Router is running.

(About the 'cnx' statement:
-- Change the user and password to your own in the 'cnx' statement.
-- By default, the host and port of MySQL Router are 'localhost' and 7001, unless they are configured by the option 'bind_address' in mysqlrouter.ini.)

shell> python
>>> import mysql.connector
>>> cnx = mysql.connector.connect(host='localhost', port=7001, user='root', password='secret')
>>> cur = cnx.cursor()
>>> cur.execute("SHOW DATABASES")
>>> print(cur.fetchall())

6. Play with MySQL Router.

-- Create different databases on each MySQL server under the same user specified in the 'cnx' statement.
-- Execute the code in Step 5. And check the output.
-- Stop a MySQL server, re-execute code from 'cnx', and examine the difference.

More details can be found in the part of Testing the Router.
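To repeat the experiment several times, the Step 5 statements can be wrapped in a function. This is a sketch under my own naming: list_databases takes the connect function as an argument, so it works with mysql.connector.connect as used in Step 5 and opens a fresh connection on every call.

```python
def list_databases(connect, **connect_args):
    """Open a new connection, run SHOW DATABASES, and return the database names."""
    cnx = connect(**connect_args)
    try:
        cur = cnx.cursor()
        cur.execute("SHOW DATABASES")
        return [row[0] for row in cur.fetchall()]
    finally:
        # Always close, so the next call goes through the router again.
        cnx.close()
```

With the router running, calling list_databases(mysql.connector.connect, host='localhost', port=7001, user='root', password='secret') before and after stopping one MySQL server should return different lists, because each server holds different databases.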

Google Charts: a great way to draw interactive and realtime charts on JSON data requested from REST API

2016-04-16T08:54:00.000-07:00

Data visualization is helpful not only for data analysts but also for system administrators. Data analysts use data visualization to untangle correlations between features, while system administrators use it to monitor the status of systems and users. Many plotting packages have been developed to visualize data, such as ggplot2 in R and matplotlib in Python, which are very powerful for making elegant charts. But the charts made by those packages are STATIC. In some cases, we want charts that respond to users' interaction or refresh automatically in order to follow the latest data in the data set. Here Google Charts can play the role. Another benefit of Google Charts is that the charts can be showcased on the web! Moreover, Google Charts makes it easy to use AJAX to attach JSON data retrieved from a REST API to the chart data table.

In this post, I will give a complete example of how to use Google Charts to draw a realtime chart, which reads data from a REST API and automatically refreshes itself. Using Google Charts requires familiarity with HTML, JavaScript, and CSS. I think the Google Charts guide makes it very easy for new learners to approach the technique. For people who find the guide hard to follow, some tutorial resources are provided at the end of the post to help catch up on the required basic concepts.

The code of the example can be found in my GitHub. My working flow on this program is as follows.

Prepare a REST API which provides JSON data to serve as chart data for visualization. A public GitHub API mentioned in this post is used. It contains some data about issues related to ggplot2 repository on GitHub. We can create our own REST API, as mentioned in my last post.
Use AJAX to read JSON data from REST API to a JavaScript variable, which will be transformed to the format of Google Chart data table.
Set up Google Chart refresh effect by using the function setRefreshInterval().
Set up Google Chart option animation.
Draw the chart.
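The transformation in the second step is the part that took me the longest to picture. The example program does it in JavaScript; the sketch below shows the same reshaping in Python purely as an illustration. It assumes each issue object carries a "state" field, as GitHub issue objects do, and builds the cols/rows structure of a Google Charts data table; the function name and column labels are mine.

```python
import json
from collections import Counter

def issues_to_datatable(issues):
    """Count issues per state and shape the result like a Google Charts data table."""
    counts = Counter(issue["state"] for issue in issues)
    return {
        "cols": [
            {"id": "state", "label": "State", "type": "string"},
            {"id": "count", "label": "Issues", "type": "number"},
        ],
        "rows": [{"c": [{"v": state}, {"v": n}]}
                 for state, n in sorted(counts.items())],
    }

issues = [{"state": "open"}, {"state": "closed"}, {"state": "open"}]
print(json.dumps(issues_to_datatable(issues), indent=2))
```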

For people who have no experience with HTML, CSS, JavaScript, and AJAX, I would recommend the following tutorial resources. Three weeks ago, I only knew a little bit about HTML and absolutely nothing about CSS, JavaScript, AJAX, or the concept of front end development. I worked through the following tutorials in three weeks and finished my user interface development task at work.

tutorialspoint (English)
Front-End Web Developer Nanodegree courses on Udacity (English)
Python Web Development on maiziedu (Chinese)

For people who want to learn more about REST API development, Python full-stack web development, and HTTP, the above resources will also be helpful.

Last but not least, I would like to share a little bit about my motivation to learn web development. After reading many posts and studying current successful business applications, my feeling is that machine learning and web development go hand in hand. Many machine learning systems are deployed in web applications. A complete web solution may contain some machine learning algorithms as the solution to a specific problem. I guess that makes sense by just thinking about what Amazon does.

RESTful web service and OpenStack dashboard

2016-03-27T01:29:00.001-07:00

In the past three weeks, I have worked on the task of developing a RESTful web service and integrating it to OpenStack dashboard. In this post, I will share some knowledge about them and also the tools (i.e. Flask framework, Flask RESTful framework) that help me finish this task.

Nowadays many people benefit from the convenience of cloud computing services on OpenStack, Amazon Web Services, and many other cloud providers. Some may wonder how they can get a service like a virtual machine by just clicking several buttons in the browser. This relies on the technique of RESTful web service.

In the concept of RESTful web service, everything can be categorized as either a resource or an operation/method on a resource. A resource is represented by a URI. And there are four kinds of operations on a resource, namely, GET, POST, PUT, and DELETE. An operation is carried out by an HTTP request. In RESTful web service, a response to an HTTP request is serialized data in JSON or XML format, which makes the transport of information between machines efficient.
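The resource/operation split can be sketched without any web framework. The toy class below keeps resources in a dictionary keyed by URI and maps the four operations onto it, returning a status code and JSON text. Everything here (the class name, the status codes chosen, POST creating a resource directly at the given URI) is a simplification of mine; a real service would receive these requests over HTTP.

```python
import json

class ResourceStore:
    """In-memory stand-in for a RESTful service: URI -> resource."""

    def __init__(self):
        self._items = {}

    def handle(self, method, uri, body=None):
        """Return (status_code, json_text) for one request."""
        if method == "GET":
            if uri not in self._items:
                return 404, json.dumps({"error": "not found"})
            return 200, json.dumps(self._items[uri])
        if method == "POST":  # create a new resource
            if uri in self._items:
                return 409, json.dumps({"error": "already exists"})
            self._items[uri] = body
            return 201, json.dumps(body)
        if method == "PUT":  # create or replace a resource
            self._items[uri] = body
            return 200, json.dumps(body)
        if method == "DELETE":
            if uri not in self._items:
                return 404, json.dumps({"error": "not found"})
            del self._items[uri]
            return 204, ""
        return 405, json.dumps({"error": "method not allowed"})

service = ResourceStore()
print(service.handle("POST", "/servers/1", {"name": "vm-1"}))
print(service.handle("GET", "/servers/1"))
```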

To learn more about RESTful web service, many online tutorials are available. When I started to study this concept, I read lots of posts, and I found the following mapping helpful. For people who have object-oriented (OO) programming experience, the resource-oriented architecture of RESTful web service may be understood through the OO paradigm. In OO, there are only objects and methods on objects, where a method on an object may be invoked by another object. But note that the working mechanisms of RESTful web service and OO are very different. The former works through HTTP requests, while the latter is just a programming paradigm.

Because web application and service development involves a lot of repetitive work, many frameworks have been created to do that work automatically for developers, so that developers can focus on their own parts. The frameworks Flask and Flask RESTful have helped me develop RESTful web service. I chose these two frameworks because, after exploring other Python web frameworks like Django and Django REST, I found them fairly easy to learn and use, and suitable for my case. Another difference between Flask and Django is that the developer is allowed to choose their own object relational mapper (ORM) in Flask, while Django provides a default ORM. At the moment, I feel more comfortable using SQLAlchemy as the ORM in development. Their documentation provides example project code to help get started, so example code will not be provided here. Note that, as mentioned in the Flask RESTful documentation, the Python package requests can serve as an easy way to test whether the RESTful web service or URIs work as expected.

After RESTful web service is developed and ready for use, a user-friendly front-end dashboard will also be desired not only by users but also by administrators. Because the RESTful web service in my case is related to OpenStack, the OpenStack dashboard is customized to host it, considering the fact that system administrators and users always want to do everything in just one dashboard. To figure out how to integrate an external web service to OpenStack dashboard, I have mainly referred to two posts from Keith and OpenStack tutorial respectively, which answer most of my questions.

For people who want to develop their own front-end dashboard, knowledge of HTML, CSS, and JavaScript will be required. HTML is in charge of the content of a webpage. CSS is in charge of the style of a webpage. And JavaScript makes a webpage more interactive. I don't have much experience to share on this aspect. But I believe as long as we know what we want to achieve, we will find the bridge to it.

Automating system administrative tasks of Vertica cluster on OpenStack with Python

2016-02-27T03:19:00.000-08:00

In this post, I would like to share some experiences on how to automate system administrative tasks with Python. This is the first time I have done such tasks. One of the big lessons I have learned is how to break a complex task down into small sequential tasks.

Recently, based on the needs of the project, I have been working on some system administrative tasks. One task is to automate the following process with a Python program.

Create a Vertica cluster on OpenStack Trove.
Migrate certain users and their tables to this newly created cluster from the cluster they currently sit on. This task can be broken down into the following sequential tasks.

Connect to the current Vertica cluster and export objects of those users into a sql file on the machine where the python program is executed.
Transfer the sql file to the target Vertica cluster.
Connect to the target Vertica cluster with the dbadmin credentials to create users and grant database privilege to them.
Connect to the target Vertica cluster with each user's credentials to create their objects (schemas and tables) from sql file.
Copy table data to the target Vertica cluster from the current Vertica cluster.

The reason why we are interested in launching a Vertica cluster is that Vertica is a massively parallel processing database (MPPDB). The MPPDB-as-a-service on the cloud would be very attractive for companies who perform analytical tasks on huge amounts of data.

The following Python packages have been used to implement the above tasks.

troveclient: Call Trove Python API.
python-vertica: Interact with Vertica server on the cluster.
subprocess: Execute a command line on the local machine.
paramiko: Execute a command line on the remote machine (i.e. virtual machines that host Vertica server on OpenStack)

Let's see how things work out with examples.

1. As mentioned in this article, OpenStack is a popular open source cloud operating system for deploying infrastructure as a service, and cloud-based tasks can be automated by working with the OpenStack Python APIs. The author provides examples for the Keystone API, the Nova API, and so on, but doesn't cover the Trove API.

An example of the Trove API can be found in the screenshot below.

2. Most database management systems adopt the client-server paradigm, which means we can interact with the server through a client. The Python client package for Vertica is python_vertica.

The screenshot below provides an example of how to use python_vertica to export objects.

3. As mentioned in the above section, one small task is to transfer the SQL file from the local machine to a remote machine (the target Vertica cluster). subprocess.check_call() is used for this.
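The transfer step can be sketched as follows. This is a minimal example that assumes scp is the transfer tool and that SSH access uses a key file; the post does not say which command was actually passed to subprocess.check_call(), and the function names, host, and paths here are hypothetical.

```python
import subprocess


def build_scp_command(local_path, remote_user, remote_host, remote_path, key_file=None):
    """Return the scp argument list for copying one local file to a remote host."""
    command = ["scp"]
    if key_file:
        # -i selects the private key used for SSH authentication.
        command += ["-i", key_file]
    command += [local_path, "{}@{}:{}".format(remote_user, remote_host, remote_path)]
    return command


def transfer_file(local_path, remote_user, remote_host, remote_path, key_file=None):
    """Copy the file with scp; check_call raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(
        build_scp_command(local_path, remote_user, remote_host, remote_path, key_file)
    )
```

Passing the command as a list (rather than one shell string) avoids quoting problems when a path contains spaces.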

4. Another small task is to execute a command line that creates a user's objects from the SQL file on the target Vertica cluster. So how can we execute a command line on a remote machine?

Below is the code screenshot that uses paramiko to do that. Note that in this case, SSH login to the remote machine is set up with a key pair file, so the parameter "KEY_FILE_PATH_LOCAL" is needed. If SSH login is set up with a password, refer to the paramiko documentation and make the corresponding changes.
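A rough sketch of running a command on a remote machine with paramiko and a key file might look like the following. The function names, the vsql path, and the choice of vsql as the remote command are assumptions for illustration and not the code from the screenshot; database authentication is left out.

```python
import shlex


def build_vsql_command(sql_file, user, vsql_path="/opt/vertica/bin/vsql"):
    """Build a shell-safe vsql command that runs the statements in sql_file.

    The vsql path is an assumed default install location.
    """
    # -U sets the database user and -f reads SQL statements from a file.
    return "{} -U {} -f {}".format(
        shlex.quote(vsql_path), shlex.quote(user), shlex.quote(sql_file)
    )


def run_remote_command(host, ssh_user, key_file_path, command):
    """Run command on host over SSH and return (exit_status, stdout, stderr)."""
    import paramiko  # third-party; imported here so the helper above works without it

    client = paramiko.SSHClient()
    # Accept unknown host keys automatically; fine for a sketch, risky in production.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=ssh_user, key_filename=key_file_path)
    try:
        _stdin, stdout, stderr = client.exec_command(command)
        # Read the output first, then ask for the exit status of the finished command.
        out = stdout.read().decode()
        err = stderr.read().decode()
        return stdout.channel.recv_exit_status(), out, err
    finally:
        client.close()
```

Checking the returned exit status lets the calling program stop the migration early if object creation fails for one user.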

This task requires knowledge of the operating system, the database system (i.e. Vertica), and Python. I am glad that I made it, since growth always starts with small steps. There is still some room to optimize the code, which is what I will work on.

An Initial Exploration of Electricity Price Forecasting

2016-02-21T05:36:00.001-08:00

One month ago, I decided to explore the problem of electricity price forecasting to gain some knowledge of the electricity market and sharpen my skills in data management and modeling. This post describes what I have done and achieved in this project, from problem definition, literature review, data collection, data exploration, feature creation, and model implementation to model evaluation and result discussion. Feature engineering and Spark are the two skills I aim to gain and improve in this project.

The price of a commodity in a market influences the behavior of all market participants, from suppliers to consumers. So knowledge about the future price plays a decisive role in selling and buying decisions, for suppliers to make profits and for consumers to save costs.

Electricity is the commodity in the electricity market. In a deregulated electricity market, generating companies (GENCOs) submit production bids one day ahead. When GENCOs decide on the bids, neither the electricity load nor the price for the coming day is known. So those decisions rely on forecasts of electricity load and price. Electricity load forecasting has reached an advanced stage in both industry and academia, with low enough prediction error, while electricity price forecasting is not as mature in terms of tools and algorithms. That is because the components of electricity price are more complicated than those of electricity load.

1. Literature Review

"If I have seen further, it is by standing on the shoulders of giants." -- Newton

This project mainly refers to the review paper (Electricity Price Forecasting in Deregulated Markets: A Review and Evaluation), which summarizes both price-influencing factors and models.

2. Exploratory Data Analysis

The locational based marginal price (LBMP) in the day-ahead market provided by the New York Independent System Operator (ISO) is used. Because of limited computational resources, this project only forecasts the price of the "WEST" zone for the coming day.

"Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." -- Pedro Domingos in "A Few Useful Things to Know about Machine Learning"

Based on the scatter plots and correlation coefficients, the following variables are used as model inputs.

- Marginal cost losses

- The square of marginal cost losses

- Marginal cost congestion

- The square of marginal cost congestion

- The average price in the past 1 day

- The average price in the past 7 days

- The average price in the past 28 days

- The average price in the past 364 days

- The standard deviation of price in the past 1 day

- The standard deviation of price in the past 7 days

- The standard deviation of price in the past 28 days

- The standard deviation of price in the past 364 days

- The ratio of average price in the past 1 day over average price in the past 7 days

- The ratio of average price in the past 7 days over average price in the past 364 days

- The ratio of average price in the past 28 days over average price in the past 364 days

- The ratio of standard deviation in the past 1 day over standard deviation in the past 7 days

- The ratio of standard deviation in the past 7 days over standard deviation in the past 364 days

- The ratio of standard deviation in the past 28 days over standard deviation in the past 364 days

- The year

3. Model Development

Linear regression is used as the initial model for exploration.

4. Result Presentation and Analysis

The mean absolute percentage error (MAPE) is used as the performance measure. The MAPE is currently around 53%, which is high. The solution can be improved in the following respects.

Create more informative input variables, like electricity load. The electricity load data from the New York ISO is provided at 5-minute intervals. The data has been retrieved, but is still being processed, for example imputing missing values and aggregating to the hourly level.
Use other models like neural network.
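For reference, the MAPE used above can be computed as in the sketch below. The function name is hypothetical, and skipping hours whose actual price is zero is an assumption made here to avoid division by zero, not something the project code is known to do.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Return the MAPE in percent, ignoring points whose actual value is zero."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Percentage error is undefined when the actual value is zero, so skip those points.
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to evaluate")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Relative errors are 10%, 10%, and 50%, so the MAPE is their mean, about 23.3%.
example_mape = mean_absolute_percentage_error([100, 200, 50], [110, 180, 75])
```

Because each error is divided by the actual value, hours with very small prices can dominate the MAPE, which is worth keeping in mind when reading a figure like 53%.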

The code developed for this project can be found here. The computation was run on Spark, and special attention was paid to feature engineering. There is still a lot to do in this project to improve forecasting accuracy. I will try to continue this topic if I get enough time and energy.

Multi-tenant Management for Database Service on the Cloud Computing Platform

2016-02-03T07:24:00.002-08:00

In this post, the concept of multi-tenancy for database service on the cloud computing platform is discussed, as well as its benefits, technical challenges, and solution architecture. A car renting example is given to help people understand it in a more general way. The reason I wrote this post is that I am currently working on a project that implements a parallel database service on the cloud computing platform at an extremely low cost through careful multi-tenancy management. And I realize that what I have learned from this project is not only knowledge of database and big data systems but also the application of management science.

Multi-tenancy is one of the key problems faced by cloud service providers. For a database service on the cloud computing platform, multi-tenancy means that more than one tenant can be deployed on a single database server. If it is managed well, a lot of cost and energy can be saved. But we cannot deploy tenants together at random, because we have to meet a more important objective than cost saving, which is to satisfy the service level every tenant requests. A good solution architecture must be able to handle that challenge.

To help people who are unfamiliar with the concept of multi-tenancy or database service on the cloud understand what it is and what its benefits are, we can think about the following real-world example. The cloud computing platform can be thought of as a car renting company, and the tenants on the cloud computing platform can be thought of as its customers. The only difference between them is the service: the cloud computing platform provides computing services like database service, while a car renting company provides the car renting service. Let's imagine that there are two customers for this car renting company. If the renting schedules of these two customers overlap, the company has to assign one car to each customer, so two cars have to be available in order to serve them. But if the renting schedules of these two customers do not overlap, the company can assign the same car to both, and only one car is required. The renting company has to buy enough cars to satisfy all customers. If customers who have non-overlapping renting schedules are assigned to a single car, the total number of cars the company needs to buy will be reduced. This is a common strategy for a renting company to save money. So it can also be a very good strategy for the cloud computing platform.

But this strategy is challenging on the cloud computing platform, because tenants do not schedule in advance when they will use the service. And a tenant's service level agreement may not be satisfied in every time period if it is deployed on a database server with other tenants. If we don't know when a tenant uses the service, how can we know which tenants should be put together? So in order to realize multi-tenancy, the first step is to learn or predict tenants' behavior, such as when they use the service, based on their records.

Prof. Lo's group came up with a system architecture as a solution for multi-tenancy of parallel database as service on the cloud computing platform in 2013, which can be found in the paper Parallel Analytics as A Service. The solution can be understood in the following simplified car renting scenario.

Let's assume that the car renting company serves customers who frequently rent cars and whose schedules are not known in advance. When a customer comes in, there must be a car available for him or her. As discussed earlier, we want to first learn roughly when a customer rents a car based on his or her records, in order to find which customers can be assigned to use the same car in different time periods, which in turn reduces the total number of cars the car renting company needs to have. But how? The solution in the paper suggests that when a new customer comes in, we assign a car to that customer, and only that customer can use that car for a certain period, such as one month. So in this one month, whenever the customer wants to use the car, there is always a car available. And during this month, the time periods when the customer uses the car are recorded. As time goes on, we collect all customers' records. Based on the records, we can predict when each customer rents a car and then decide, through some analysis, which customers can use the same car without schedule conflicts.
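To make the last step concrete, here is a small sketch of grouping customers whose recorded usage periods never overlap. It uses a simple first-fit greedy rule chosen for illustration; it is not the algorithm from the paper, and the function names and sample schedule are made up.

```python
def overlaps(intervals_a, intervals_b):
    """Return True if any (start, end) interval in one list intersects one in the other."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in intervals_a for s2, e2 in intervals_b)


def assign_cars(usage):
    """Greedily place each customer on the first car with no schedule conflict.

    usage maps a customer name to a list of (start, end) usage intervals.
    Returns a list of customer groups, one group per car.
    """
    cars = []  # each car tracks its customers and the intervals already taken
    for customer, intervals in sorted(usage.items()):
        for car in cars:
            if not overlaps(car["busy"], intervals):
                car["customers"].append(customer)
                car["busy"].extend(intervals)
                break
        else:
            # No existing car is free at the needed times, so add a new one.
            cars.append({"customers": [customer], "busy": list(intervals)})
    return [car["customers"] for car in cars]


# Hours of the day during which each (made-up) customer typically uses a car.
recorded_usage = {"alice": [(9, 12)], "bob": [(11, 14)], "carol": [(13, 17)]}
groups = assign_cars(recorded_usage)  # alice and carol can share; bob needs his own car
```

In the database setting, a car corresponds to a database server and the intervals to the periods when a tenant is active.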

Back to the cloud computing platform case, the general logic of the solution is the same. In the proposed system architecture, there are four components, which are Tenant Activity Monitor, Deployment Advisor, Deployment Master, and Query Router. I am mainly in charge of implementing the first three components, which are almost done, and my teammate has finished the implementation of the query router. Hopefully this will benefit more people as an open source project pretty soon. Compared with the car renting service, there are many more technical details involved in the database service on the cloud computing platform. Take this as an example. If a customer of the car renting company is assigned to a different car, he or she just goes to get it. But on the cloud computing platform, if a tenant is assigned to another database server, all its data has to be migrated to that database server. But this is not the main point of this post.

Considering the fact that the active tenant ratio is very low (e.g. 10% in IBM's database-as-a-service), multi-tenancy of parallel database service on the cloud computing platform is very attractive. For any kind of service provider, like a cloud service provider or a car renting company, resources can be utilized more efficiently and cost can be reduced a lot through careful management with analytical methods. In July 2015, I took the job of implementing parallel database service on OpenStack as a research associate with Prof. Lo because of the reputation of his group in database and big data systems, which I wanted to learn more about in order to realize my career goal. Now, besides the knowledge of database and big data systems, this project also provides a vision of the application of resource management, which helps me further understand the spirit of management science.

The First Python Project in Data Science: Stock Price Prediction

2015-12-27T03:17:00.000-08:00

In this post, I will explain what I have done in my first Python project in data science - stock price prediction, combined with the code. I started to learn how to use Python for data analysis work in my after-work hours at the beginning of December. So this post will also serve as a summary of what I have learned in the past three weeks. I hope this post can also give readers some insights into the whole process of getting a data science project done with Python from beginning to end. I want to thank Vik from dataquest for the comments he has provided, which I have really benefited from.

My code for this project can be found here on GitHub. It can be customized by changing settings in settings.py and automatically processes the data acquisition and prediction tasks in this project. It can also be easily modified to adapt to more users' needs.

1. Get and store data.

There are basically three kinds of data sources.

Files, such as EXCEL, CSV, and TEXT.
Database, such as SQLite, MySQL, and MongoDB.
Web.

The historical data used in this project is web data from Yahoo Finance Historical Prices. There are basically two ways to retrieve web data.

Web API, such as Twitter API, Facebook API, and OpenNotify API.
Web Scraping.

I chose the web scraping method to retrieve historical data, since I had just learned it and wanted to practice it. In the code, the program data_acquisition.py scrapes historical data and stores it in either a csv file or a MySQL database. The function get_query_input() gets the date range of historical data that the user wants to retrieve (e.g. from 2015-01-01 to 2015-12-25), forms the url with parameters extracted from user inputs, and returns that url. With that url, we can then get data by sending an HTTP request to it and scraping the response, which is what the function scrape_page_data(pageUrl) does. aggregate_data_to_mysql(total_table, pageUrl, output_table, database_connection) stores the scraped data in a MySQL database, while aggregate_data_to_csv(total_table, pageUrl, output_file) stores the scraped data in a csv file. Notice that these two functions are recursive. That is because all the historical data requested may not be on the same webpage, and if not, we need to go to the next page url and keep scraping data. So the question is how we can scrape data from a webpage and then go to the next page and continue this process until all the historical data requested is retrieved.

To scrape data from a webpage, we first take a look at its HTML source code. In Safari, right click the webpage and select "show page source". This may require a little knowledge of HTML, but that can be picked up quickly. By investigating the HTML source code of the stock price page, we can find that the data we want to retrieve is in the table named "yfnc_datamodoutline1". CSS selectors make it easy for us to select the rows in the table. And the next page url can be obtained from the href attribute of <a rel="next" href=, where the tag <a indicates a link in HTML.

While writing this post, I found the Yahoo! Finance APIs. Next time I need this data, I will play with them to see how they would make things easier. I guess when working with web data, the first thing to check is whether the website provides APIs.
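The idea of pulling out table rows and the next-page link can be sketched with the standard library's HTMLParser, as below. This is only an illustration: the project itself uses CSS selectors, the class name PriceTableParser is hypothetical, the sample page and its numbers are made up, and for brevity the parser collects every table row on the page rather than filtering by table name.

```python
from html.parser import HTMLParser


class PriceTableParser(HTMLParser):
    """Collect table cell text row by row, plus the href of the rel="next" link."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.next_url = None
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")
        elif tag == "a" and attrs.get("rel") == "next":
            # The link to the next page of historical data.
            self.next_url = attrs.get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# A made-up page fragment with the same shape as the structure described above.
SAMPLE_PAGE = """
<table class="yfnc_datamodoutline1">
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>2015-12-24</td><td>100.00</td></tr>
  <tr><td>2015-12-23</td><td>101.50</td></tr>
</table>
<a rel="next" href="/page2">Next</a>
"""

parser = PriceTableParser()
parser.feed(SAMPLE_PAGE)
# parser.rows now holds the header and data rows; parser.next_url is "/page2".
```

A scraper would keep requesting parser.next_url and feeding a fresh parser until next_url comes back as None, which is the stopping condition for the recursive functions.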

2. Explore data.

The purpose of data exploration may fall into one or more of the categories below.

Missing values and outliers
The pattern of the target variable
Potential predictors
The relationship between the target variable and potential predictors
The distribution of variables

Data visualization and statistics (e.g. correlation) are good ways to explore data. One specific way of checking for missing values can be found here. It turns out there are no missing values in the historical data. For time series data, we set the date as the index and sort the data in ascending order, which can be done through the functions pandas.DataFrame.set_index() and pandas.DataFrame.sort_index().

The objective of this project is to predict stock price. Take "Close" price as the example. To predict it, we will be interested in its own pattern and the relationship between it and other factors. One question is what factors may influence our target variable ("Close" price). We can get some insights into those driving factors by studying the problem more deeply and doing some research on the related literature and available models. Frankly speaking, I knew almost nothing about the stock market before I started to work on this project. So I got the indicators of "Close" price from the project information received from dataquest.

The average price from the past five days.

The average price for the past month.

The average price for the past year.

The ratio between the average price for the past five days, and the average price for the past year.

The standard deviation of the price over the past five days.

The standard deviation of the price over the past year.

The ratio between the standard deviation for the past five days, and the standard deviation for the past year.

The average volume over the past five days.

The average volume over the past year.

The ratio between the average volume for the past five days, and the average volume for the past year.

The standard deviation of the average volume over the past five days.

The standard deviation of the average volume over the past year.

The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.

The year component of the date.

In Pandas, there are functions to compute moving (rolling) statistics, such as rolling_mean and rolling_std. But we need to shift the column of rolling statistics forward by one day. The reason is as follows. We want the average price from the past five days for 2015-12-12 to be the mean of prices from 2015-12-07 to 2015-12-11, but the value rolling_mean returns for 2015-12-12 is the mean of prices from 2015-12-08 to 2015-12-12, which includes the day we are trying to predict. So the average price from the past five days for 2015-12-12 is actually the value rolling_mean returns for 2015-12-11.
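The shift can be illustrated without pandas. The sketch below uses a hypothetical helper and made-up prices to compute, for each day, the mean of the previous window days only. In recent pandas versions, where rolling_mean has been replaced by the rolling() method, the same result should come from series.rolling(5).mean().shift(1).

```python
def past_window_mean(values, window):
    """For each position i, average values[i - window:i]; None until enough history exists."""
    result = []
    for i in range(len(values)):
        if i < window:
            # Not enough earlier observations to fill the window.
            result.append(None)
        else:
            past = values[i - window:i]  # excludes values[i] itself
            result.append(sum(past) / window)
    return result


prices = [10, 11, 12, 13, 14, 15, 16]
five_day_average = past_window_mean(prices, 5)
# -> [None, None, None, None, None, 12.0, 13.0]
```

The value for the sixth day (12.0) is the mean of the first five prices, so the indicator never uses the price it is meant to help predict.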

3. Clean data.

When computing some indicators, a year of historical data is required. So those indicators for the earliest year will be missing, indicated as NaN in Python. We will just remove rows with NaN values.

This step and following steps are done in the program prediction.py. First, users' prediction requests will be retrieved by get_prediction_req(data_storage_method). Then historical data will be read and loaded to a DataFrame either from mysql database or csv file with the function read_data_from_mysql(database_connection, historical_data_table) or read_data_from_csv(historical_data_file).

4. Build the predictive model.

Some common predictive methods are regression, classification, decision tree, neural network, and support vector machine. For this project, multiple linear regression and random forest are chosen as the initial methods, whose interfaces are provided in the sklearn package.

5. Validate and select the model.

To validate the model, two things need to be done.

Split the historical data into train set and test set.
Choose an error performance measurement and calculate it.

The function is predict(df, prediction_start_date, prediction_end_date, predict_tommorrow), which performs predictive tasks and calculates the error measurement after calculating the indicators and cleaning the data.

The big question here is how to split the historical data into a train set and a test set for prediction. For example, suppose the user wants to predict the price from 2015-01-03 to 2015-12-27. The train and test sets are split based on the backtesting technique. Let's say the earliest date in the historical data is 1950-01-03. To predict the price for 2015-01-03, the train set will be the historical data from 1950-01-03 to 2015-01-02, and the test set will be 2015-01-03. And to predict the price for 2015-01-04, the train set will be the historical data from 1950-01-03 to 2015-01-03, and the test set will be 2015-01-04. Repeat this process until we get all predicted values from 2015-01-03 to 2015-12-27. In the code, this part is done by looping over the index set of the prediction period.
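The backtesting split described above can be sketched as a generator that yields one (train, test) pair per prediction date. The function name and the short date range are hypothetical; the project code loops over a DataFrame index instead.

```python
from datetime import date, timedelta


def walk_forward_splits(dates, prediction_start, prediction_end):
    """Yield (train_dates, test_date) pairs, one per date in the prediction period.

    Each train set contains every available date strictly before the test date.
    """
    ordered = sorted(dates)
    for test_date in ordered:
        if prediction_start <= test_date <= prediction_end:
            train_dates = [d for d in ordered if d < test_date]
            yield train_dates, test_date


# Five consecutive days of history, predicting the third and fourth day.
history = [date(2015, 1, 1) + timedelta(days=i) for i in range(5)]
splits = list(walk_forward_splits(history, date(2015, 1, 3), date(2015, 1, 4)))
```

Each later split has a train set one day longer than the previous one, so the model is always refit on everything known before the day being predicted.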

The mean absolute error (MAE) is picked as the error performance measure. The model with the smaller MAE will be chosen to predict the price for tomorrow. The MAEs and the predicted value for tomorrow will be written to the file "predicted_value_for_tommorrow". And the actual and predicted values for the test data will be stored in either a MySQL database or a csv file by the function write_prediction_to_mysql(df_prediction, database_connection, predicted_output_table) or write_prediction_to_csv(df_prediction, predicted_output_file) for further analysis, which may give some insights into where the model doesn't perform well and how to improve it.
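The selection rule can be sketched in a few lines. The helper names and the numbers are made up for illustration; sklearn.metrics also provides a mean_absolute_error function that could replace the hand-written one.

```python
def mean_absolute_error(actual, predicted):
    """Return the average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pick_best_model(actual, predictions_by_model):
    """Return (best_model_name, scores), where the best model has the smallest MAE."""
    scores = {
        name: mean_absolute_error(actual, predicted)
        for name, predicted in predictions_by_model.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


actual_close = [10, 12, 14]
best_model, mae_scores = pick_best_model(
    actual_close,
    {"regression": [11, 13, 16], "random_forest": [10, 13, 15]},
)
# random_forest wins here: its MAE is 2/3, against 4/3 for regression.
```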

6. Present the results.

Besides predicting the value for tomorrow, a simple experiment is done by using the whole year of data in 2014 and in 2015 as the test set, respectively. The MAE results are as follows.

Model	2014	2015
Regression	15.62	19.98
Random Forest	13.99	19.08

We can have two simple findings.

- The models perform differently in different years. The MAE shows that both models perform better for the stock market in the year 2014.
- Random forest seems to perform better than multiple linear regression based on their MAE.

The time series plots for 2014 and 2015 with actual and predicted values can be found below. As shown, the trend is captured well on the whole. One potential improvement point for the regression model is that its predicted peaks lag slightly behind the actual ones. And one potential improvement point for the random forest model is that its predicted values are generally smaller than the actual values.

I will tweak algorithms more later and then present more findings. After all, the initial main purpose of this project is to get myself familiar with the full lifecycle of a data science project and Python packages/functions that are frequently used in a data science project.

A Simple but Complete Guide for OpenStack Trove

2015-12-25T17:36:00.000-08:00

This post illustrates the framework and some details that help learners get started with OpenStack and its database service Trove. I started to study OpenStack and its Trove project from scratch five months ago. If I can do it, you can also do it. I appreciate the guidance I have received in every way, especially from my teammates, whom I have worked closely with every weekday.

Trove is the database-as-a-service project on OpenStack, which can help users save on database infrastructure costs and eliminate administrative tasks like deployment, configuration, backups, and so on. To understand its benefit, imagine that you need a database server for your business. Traditionally, you have to set up the infrastructure first, install the database package, configure it, and maintain it, which costs lots of money and time. With OpenStack Trove, all you do is open an account, execute some commands or click some tabs to launch a database server, and only pay for what you use.

Before we start this exciting journey, I would like to highlight the following documentation and reference.

OpenStack Documentation, where you can find installation guide, admin user guide, end user guide, and command line reference. That will cover almost all tasks performed on OpenStack and its service components. Reading the documentation carefully can always help us avoid some naive mistakes and save us much time.
OpenStack Trove, where you can find comprehensive details about Trove.

The following steps can help you deploy OpenStack Trove and then operate it from scratch.

1. Set up an OpenStack environment and add the identity service (Keystone), image service (Glance), compute service (Nova), dashboard (Horizon), networking service (Neutron or Nova-network), and block storage service (Cinder).

This can be done by following OpenStack Installation Guide, which can be found in OpenStack documentation. Be sure to choose the right version of the installation guide for your operating system (Ubuntu 14.04, Red Hat Enterprise Linux 7, CentOS 7, openSUSE 13.2, and SUSE Linux Enterprise Server 12) and the version of OpenStack (Juno, Kilo, and Liberty) you want to deploy.

2. Add database as a service (Trove) on OpenStack.

This part is not provided in the official installation documentation. For Ubuntu users, follow the steps and commands provided in my earlier post. For other Linux operating system users, I guess it will still be a good way to follow steps in that post and change commands correspondingly.

3. Obtain or build a Trove guest image.

This is the first step to use Trove. There are currently 3 ways.

Download a pre-built guest image from here. This method is DevStack based.
Build a guest image using OpenStack Trove tools (Disk Image Builder, redstack). I tried these two tools on DevStack and on my own OpenStack, respectively. Using these two tools, I got a working image on DevStack, but failed to get a working image on my OpenStack. At the time I tried, there was probably some bug related to their DevStack dependency. By a working image, I mean an image that can be used to launch a Trove instance successfully with active status. I am not sure whether the tools work well now. After struggling with the tools for one month, I came up with a customized way.
Build a guest image in a customized way. The way I used can be found in this post. I got a working image by performing those steps. That post also provides some insights into how the image works.

For more details about the first two ways, see this post.

4. Add the Trove guest image to the datastore.

The image we get from the above is just a QCOW2 file. In order to tell Trove where it is and let Trove use it, we must add it to the Trove datastore by performing steps 2-6 in the database service chapter of the OpenStack Administration Guide.

5. Launch a Trove instance.

This can be done either through dashboard or Trove command line client. For the latter, refer to "trove create" command in OpenStack Command Line Reference.

6. Debug.

There are some common errors that OpenStack Trove users can encounter. Just to name some.

Status goes to "Error" shortly after the creation.
Status is stuck at "Build" and goes to "ERROR" after reaching the timeout value.
No host is assigned.

To figure out the reason, the starting point should always be the trove logs, including trove-api.log, trove-taskmanager.log, and trove-conductor.log, which by default are in the directory /var/log/trove on the trove controller node, and also trove-guestagent.log, which by default is in the directory /var/log/trove/ on the guest.

With the insights the log files provide, Google and Ask OpenStack are good places to help pin down the errors.
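As a starting point for reading those logs, a small script can list the lines that look like errors. This is a rough sketch: the function name is hypothetical, and it assumes that error lines contain the text "ERROR" or "TRACE", which may not hold for every logging configuration.

```python
import os

# Controller-side log files named above; the guest agent log lives on the guest instead.
TROVE_LOG_NAMES = ("trove-api.log", "trove-taskmanager.log", "trove-conductor.log")


def find_error_lines(log_dir="/var/log/trove", log_names=TROVE_LOG_NAMES,
                     markers=("ERROR", "TRACE")):
    """Return (file_name, line_number, line) for every line containing a marker."""
    hits = []
    for name in log_names:
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # skip logs that do not exist on this node
        with open(path, errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((name, line_number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for file_name, line_number, line in find_error_lines():
        print("{}:{}: {}".format(file_name, line_number, line))
```

The lines just before the first reported error usually show which request or instance the failure belongs to.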

Good luck and enjoy!

Web Scraping and Data Analysis with Python

2015-12-20T01:31:00.000-08:00

Recently I have been working on a project about predicting stock price. Thanks to the guidance received from dataquest, I was excited to grasp how to use web scraping techniques to collect data and how to utilize various Python packages to analyze data and build models. I wish I had known about these techniques back in 2013, when I was involved in a project requiring me to collect a large amount of web data and analyze it. If I had known of their existence, I would definitely have learned them. That would have made my work more efficient and accurate, since programming the data collection and analysis tasks in Python makes mistakes less likely and easier to track down when something goes wrong.

In the past one and a half months, I have read through the following three books. I have found them very helpful in empowering myself with techniques for automating data-related tasks in Python.

1. Python for Informatics

2. Python for Data Analysis

3. Web Scraping with Python

Just take a look at these books and take away what you need. And you are welcome to check my code on GitHub for the stock price prediction project. The program produces correct results, but I am still working on making it better and keeping it updated.

Build Trove Guest Image Manually in OpenStack

2015-11-19T19:59:00.000-08:00

I came up with this solution to manually building an image for OpenStack Trove based on my knowledge and intuition, after struggling with using Disk Image Builder on my OpenStack environment.

The first step to use Trove is to obtain or build a Trove image. Compared to regular OpenStack virtual machine image used for Nova, there are two extra essential components on a Trove image: Trove guest agent (or the ability of getting Trove guest agent) and a database server package.

If you already have an OpenStack environment set up, you can build a Trove guest image following the below steps.

1. Launch a Nova instance with a flavor that satisfies the system requirements (CPU, RAM, swap disk, etc.) of the database server you want to install. Perform steps 2-6 on this Nova instance.

2. Install and configure cloud-init on the Nova instance launched in Step 1.

# apt-get install cloud-init cloud-utils cloud-initramfs-growroot cloud-initramfs-rescuevol
# echo 'manage_etc_hosts: True' > /etc/cloud/cloud.cfg.d/10_etc_hosts.cfg

3. Set up basic environment for trove-guestagent on the Nova instance launched in Step 1.

# apt-get update
# apt-get install ubuntu-cloud-keyring
# echo "deb http://ubuntu-cloud.archive.canonical.com/ubuntu" \
"trusty-updates/kilo main" > /etc/apt/sources.list.d/cloudarchive-kilo.list

(Note: change "kilo" to the version of your OpenStack.)

# apt-get update && apt-get dist-upgrade

4. Install Trove and Trove-guestagent on the Nova instance launched in Step 1.

# apt-get install python-trove python-troveclient trove-common trove-guestagent

5. Configure Trove-guestagent on the Nova instance launched in Step 1.

Edit /etc/trove/trove-guestagent.conf.

[DEFAULT]
rabbit_host = controller
rabbit_userid = RABBIT_USER
rabbit_password = RABBIT_PASS
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
trove_auth_url = http://controller:35357/v2.0
log_file = trove-guestagent.log

(Change RABBIT_USER and RABBIT_PASS to your own values.)
Refer to Step 1 in the database chapter of OpenStack Administration Guide.

6. Upload the database server package to the Nova instance launched in Step 1.

Note: After uploading the database server package, you can choose to install it manually or not. If you don't install it, Trove-guestagent should install it at the boot time of a Trove instance. But different database servers have different scenarios. Refer to the source code of the database you want to use under /trove/guestagent/datastore to find out the default path of the database server package, where the package should be uploaded to. For example, if you use Vertica, the default path that Vertica package should be uploaded to is root directory '/' based on the value of option "INSTALL_VERTICA" in the source file /trove/guestagent/datastore/experimental/vertica/system.py. And the community edition of Vertica is free and can be downloaded here.

7. In the OpenStack dashboard, shut off this Nova instance and take a snapshot of it.

Note: In OpenStack, a snapshot is an image.

8. Through the trove client, add the snapshot to the datastore.

Refer to Step 2-6 in the database chapter of OpenStack Administration Guide.

9. Restart all Trove services on the controller node.

# service trove-api restart
# service trove-taskmanager restart
# service trove-conductor restart

10. Launch a Trove instance.

Refer to OpenStack Command Line Reference.

Add Database Service (Trove) for OpenStack - Kilo - Ubuntu

2015-10-27T03:06:00.001-07:00

Note: Change the values in CAPITALS and italics, such as TROVE_DBPASS, TROVE_PASS, RABBIT_USER, RABBIT_PASS, and NETWORK_LABEL. Also change controller to the IP address of the OpenStack controller node, if necessary.

1. Prepare trove database

$ mysql -u root -p
mysql> CREATE DATABASE trove;
mysql> GRANT ALL PRIVILEGES ON trove.* TO trove@'localhost' IDENTIFIED BY 'TROVE_DBPASS';
mysql> GRANT ALL PRIVILEGES ON trove.* TO trove@'%' IDENTIFIED BY 'TROVE_DBPASS';
mysql> FLUSH PRIVILEGES;

2. Install required Trove components

# apt-get install python-trove python-troveclient trove-common trove-api trove-taskmanager trove-conductor

3. Prepare OpenStack

$ source ~/admin-openrc.sh
$ keystone user-create --name trove --pass TROVE_PASS
$ keystone user-role-add --user trove --tenant service --role admin

4. Add the following configuration options to [filter:authtoken] section in /etc/trove/api-paste.ini

[filter:authtoken]

auth_uri = http://controller:5000
auth_url = http://controller:35357
auth_plugin = password
project_domain_id = default
user_domain_id = default
project_name = service
username = trove
password = TROVE_PASS

5. Edit the following configuration options in the [DEFAULT] section of the following files

/etc/trove/trove.conf
/etc/trove/trove-taskmanager.conf
/etc/trove/trove-conductor.conf

[DEFAULT]
log_dir = /var/log/trove
trove_auth_url = http://controller:5000/v2.0
nova_compute_url = http://controller:8774/v2
cinder_url = http://controller:8776/v2
swift_url = http://controller:8080/v1/AUTH_
notifier_queue_hostname = controller
control_exchange = trove
rabbit_host = controller
rabbit_userid = RABBIT_USER
rabbit_password = RABBIT_PASS
rabbit_virtual_host= /
rpc_backend = trove.openstack.common.rpc.impl_kombu
sql_connection = mysql://trove:TROVE_DBPASS@controller/trove

(Note: comment out any old configuration options, such as trove_auth_url and connection.)
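
The sql_connection value packs the driver, user, password, host, and database name into a single URL, so a typo in any part breaks the database connection. The sketch below only splits the URL to show which part is which. It uses Python's standard urllib.parse rather than the library Trove itself relies on, and describe_connection is a made-up helper name.

```python
from urllib.parse import urlsplit


def describe_connection(url):
    """Split a driver://user:password@host/database URL into named parts."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "database": parts.path.lstrip("/"),
    }


info = describe_connection("mysql://trove:TROVE_DBPASS@controller/trove")
for name, value in info.items():
    print(name, "=", value)
```

The user, password, and database shown here should match the GRANT statements from Step 1.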

6. Edit the following configuration options in [DEFAULT] section in /etc/trove/trove.conf

[DEFAULT]
add_addresses = True
network_label_regex = ^NETWORK_LABEL$

(Note: replace NETWORK_LABEL with the label of the network you want to connect to. Find it with "nova net-list" or "neutron net-list".)
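
The ^ and $ anchors matter when several network labels share a prefix. The sketch below illustrates the difference with Python's re module. It assumes the option is applied as an ordinary regular expression search against each label, and the labels and the matching_labels helper are invented for the example.

```python
import re


def matching_labels(pattern, labels):
    """Return the labels that the regular expression finds a match in."""
    return [label for label in labels if re.search(pattern, label)]


# Hypothetical network labels, as "nova net-list" might report them.
labels = ["private", "private-backup", "public"]

print(matching_labels(r"^private$", labels))  # ['private']
print(matching_labels(r"private", labels))    # ['private', 'private-backup']
```

Without the anchors, the pattern also picks up "private-backup", which is probably not what you want.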

7. Edit the following configuration options in [DEFAULT] section in /etc/trove/trove-taskmanager.conf

[DEFAULT]
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
taskmanager_manager = trove.taskmanager.manager.Manager
log_file = trove-taskmanager.log

8. Initialize database

# trove-manage db_sync

9. Edit /etc/init/trove-conductor.conf so that the following option matches

--exec /usr/bin/trove-conductor -- --config-file=/etc/trove/trove-conductor.conf ${DAEMON_ARGS}

10. Edit /etc/init/trove-taskmanager.conf so that the following option matches

--exec /usr/bin/trove-taskmanager -- --config-file=/etc/trove/trove-taskmanager.conf ${DAEMON_ARGS}

11. Configure the Trove Endpoint in Keystone

$ keystone service-create --name trove --type database --description "OpenStack Database Service"
$ keystone endpoint-create \
--service-id $(keystone service-list | awk '/ trove / {print $2}') \
--publicurl http://controller:8779/v1.0/%\(tenant_id\)s \
--internalurl http://controller:8779/v1.0/%\(tenant_id\)s \
--adminurl http://controller:8779/v1.0/%\(tenant_id\)s \
--region regionOne
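
Each endpoint URL ends in a placeholder that is filled in per tenant when the service catalog is built. Written out, the placeholder is the Python-style template %(tenant_id)s; on an unquoted shell command line the parentheses have to be backslash-escaped so that the shell passes them through literally. The sketch below shows the substitution with plain Python string formatting. It is only an illustration of the template syntax, not Keystone's actual code, and the tenant ID and the expand_endpoint helper are made up.

```python
# The template as it should be stored in Keystone (no shell escaping here).
template = "http://controller:8779/v1.0/%(tenant_id)s"


def expand_endpoint(url_template, tenant_id):
    """Fill the %(tenant_id)s placeholder using Python's % formatting."""
    return url_template % {"tenant_id": tenant_id}


print(expand_endpoint(template, "abc123"))  # http://controller:8779/v1.0/abc123
```

If the stored URL does not contain the literal text %(tenant_id)s, for example because the shell expanded part of it, delete the endpoint and create it again.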

12. Restart the Trove Services

$ sudo service trove-api restart
$ sudo service trove-taskmanager restart
$ sudo service trove-conductor restart

Introduction to MySQL and SQL for Absolute Beginners or as Quick Refresher

2015-07-09T18:44:00.000-07:00

The database management system (DBMS) has become far more important in the data era. MySQL is an open source relational database management system (RDBMS). It can be freely downloaded here. Instructions for installation, testing, and simple operations are provided here. More SQL commands can be found on the website of SQLCourse, which is also an interactive online SQL training platform.

An Introductory Book to Business Intelligence

2015-07-06T07:11:00.001-07:00

Business intelligence (BI) has been a hot concept, and I have been wondering what exactly it is. I have now found a satisfactory answer in the book Business Intelligence: Making Decisions through Data Analytics. This book gives a thorough and systematic introduction to BI and its tools. Basically, BI and its tools are derived from the following four areas: statistics and econometrics, operations research, artificial intelligence, and database technologies. This book successfully connects what I already know to BI.

I guess it would be a good idea to start to explore BI from this book.

A Tutorial Website for Simple Learning - Tutorialspoint

2015-06-25T14:07:00.001-07:00

The tutorials library provided by Tutorialspoint covers a broad range of computer-related topics and focuses on the main points. Self-learners may find the topic they want to learn more accessible and grasp the key points in a short time. It also provides an online terminal and IDEs for practice.

The tutorial of Hadoop quickly led me to understand the function, structure, and operation of Hadoop as a solution to big data.

An Introductory Book to Data Mining

2015-06-21T18:05:00.001-07:00

Discovering Knowledge in Data: An Introduction to Data Mining is a pretty interesting and straightforward book for beginners in data mining. I read 5 chapters of this book just this afternoon. It is so attractive.

The course materials (i.e. data sets, homework, project) of the data mining course provided by the University of Tennessee can also serve as a very good practical reference.

SAS Global Certification Program

2015-06-15T19:30:00.000-07:00

It is a big data era. People who have data analysis and modeling skills can find plenty of job opportunities. The SAS Global Certification Program provides a bridge into this profession by certifying those skills.

Based on my own experience, preparing for a certification exam helps you study the software and theories specifically and thoroughly within an organized time frame.

Currently I hold the following certificates.

- SAS Certified Base Programmer for SAS 9 (08/2014)
- SAS Certified Advanced Programmer for SAS 9 (12/2014)
- SAS Certified Statistical Business Analyst Using SAS 9: Regression and Modeling (05/2015)
- SAS Certified Predictive Modeler Using Enterprise Miner 7 (06/2015)

I mainly used the SAS online tutor and course notes. If you need any materials for study purposes only, please leave a comment.

An Introductory Example to Lagrangian Relaxation

2015-06-09T19:34:00.001-07:00

The example provided here means that Lagrangian relaxation (LR) is no longer a mystery to me. Each step becomes very clear. I have read a lot of materials on LR, and this is one of the best for beginners who want to learn how LR works.

An implementation of LR with C++ in ILOG CPLEX can be found here.

That makes me think that there are probably no truly unsolvable problems. It is just that we haven't found the right way to approach them.