Lili's Sharings : 2016

Friday, July 8, 2016

My Learning Notes on Artificial Neural Network

This post summarizes some key ideas for understanding the intuition, theory, and implementation of artificial neural networks, based on my own learning experience. My approach is to ask questions and then explore the answers. It turns out that almost every question about artificial neural networks deserves much more effort to dig into.

Note that this post DOES NOT aim to introduce the artificial neural network algorithm systematically. Many excellent, comprehensive tutorials on deep learning and neural networks can be found online, and some are listed in the post. Feel free to correct me if anything in this post is wrong. After all, I am still learning.

In March 2016, AlphaGo beat a human champion at the ancient Chinese board game "Go". A key to AlphaGo's victory is the deep learning algorithm built into it. Besides that, deep learning's achievements in many applications, such as computer vision, speech recognition, and natural language processing, have made it very attractive. Deep learning is derived from artificial neural networks, with some additional techniques included. So to learn deep learning, it seems reasonable to learn artificial neural networks first.

As I make efforts to grasp the intuitions/theories and implementation of artificial neural network, I have found that the artificial neural network will make sense if we can understand the following key ideas.

Activation Function
Cost Function
Gradient Descent Algorithm
Backpropagation Algorithm
Artificial Neural Network Architecture

Activation Function

The ultimate goal of an artificial neural network, or of any supervised machine learning algorithm, is to find a functional relationship that maps the inputs to the outputs as accurately as possible. An advantage of artificial neural networks over many other machine learning algorithms is that they can approximate a very wide range of relationships, especially nonlinear ones. Here is a visual proof that neural nets can compute any function. The more accurately an algorithm models the functional relationship between inputs and outputs, the more accurately the resulting model can predict unknown outputs from known inputs.

What gives an artificial neural network this ability is the nonlinear activation function. Commonly used activation functions are the log-sigmoid, tan-sigmoid, and softmax functions.

To further explain why the activation function makes the neural network capable of representing any kind of relationship, let's take a look at the visualization of sigmoid function.

source: http://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html

As shown in the above figure, roughly speaking, we can notice the following properties of the sigmoid function σ(z) = 1 / (1 + e^(-z)).

For -1 <= z <= 1, the function maps a linear relationship.
For -5 < z < -1 and 1 < z < 5, the function maps a nonlinear relationship.
For z <= -5 and z >= 5, the function maps a constant relationship.
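The three regions above can be checked numerically. The following is a minimal sketch, assuming the function in the figure is the standard log-sigmoid; the sample points are my own choice.

```python
import math

def sigmoid(z):
    """Log-sigmoid (logistic) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Sample points chosen to cover the flat tails, the curved parts,
# and the roughly linear middle of the curve.
for z in (-10, -5, -1, 0, 1, 5, 10):
    print("z = %3d  sigmoid(z) = %.4f" % (z, sigmoid(z)))
```

Near z = 0 the values change almost linearly with z, while beyond about 5 in either direction they are practically constant (close to 0 or close to 1).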

Back to the artificial neural network, it is made up of neurons and links between neurons. On each neuron, there are two steps to be done.

Sum up the weighted inputs.
Calculate the activation function with the sum of the weighted inputs if the threshold value is reached.

In some tutorials, these two steps are not talked about explicitly. But I feel it is easy to understand what a neuron does in two steps.
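The two steps can be written as a few lines of Python. This is only a sketch: it assumes a log-sigmoid activation and uses a bias term in place of an explicit threshold, and the input and weight values are made up for illustration.

```python
import math

def neuron_output(inputs, weights, bias):
    # Step 1: sum up the weighted inputs (the bias shifts the threshold).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function to that sum (log-sigmoid here).
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.0))
```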

For more details about the activation function, refer to artificial neural networks/activation function in wikibooks.

At this point, it is natural to ask “Why do we use the above functions as activation functions in a neural network?” We already have some idea of how they can map different kinds of relationships. Another benefit is that they constrain the output to a bounded range: from 0 to 1 for the log-sigmoid and softmax functions, and from -1 to 1 for the tan-sigmoid function. Why do we want to constrain the output? Based on the section “Sigmoid Neurons” in Chapter 1, Using neural nets to recognize handwritten digits, of Michael Nielsen’s book Neural Networks and Deep Learning, one reason is that small changes in the weights and biases then cause only small changes in the output. We can also notice that an output between 0 and 1 is easy to interpret as a probability.

Cost Function

In the parametric models, such as regression and neural network, when we use different parameters in the model, we will get different predictions/outputs. We want to find a set of best parameters which minimize the difference between actual values and predictions.

How do we measure the difference between actual values and prediction (i.e. error)?

In general, there are two ways.

1. Sum of squared difference between predicted output and actual value of each observation in the training set.
2. Sum of negative log-likelihood between predicted output and actual value of each observation in the training set.

The function that describes the squared error or negative log-likelihood is called the cost function. And we want to minimize it.
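Both cost functions are short to write down. The sketch below assumes binary labels (0 or 1) and predicted probabilities strictly between 0 and 1 for the negative log-likelihood; the sample values are made up.

```python
import math

def sum_squared_error(actual, predicted):
    # Sum of squared differences over all observations.
    return sum((y - p) ** 2 for y, p in zip(actual, predicted))

def negative_log_likelihood(actual, predicted):
    # Binary case: labels are 0 or 1, predictions are probabilities in (0, 1).
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(actual, predicted))

actual = [1, 0, 1]
good = [0.9, 0.1, 0.8]   # predictions close to the actual values
bad = [0.4, 0.6, 0.3]    # predictions far from the actual values

for name, predicted in (("good", good), ("bad", bad)):
    print(name,
          sum_squared_error(actual, predicted),
          negative_log_likelihood(actual, predicted))
```

With either cost function, the predictions that are closer to the actual values get the lower cost, which is what makes the cost a sensible thing to minimize.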

Gradient Descent Algorithm

At this point, we want to minimize a cost function that is complicated and, in a situation like the artificial neural network, has no closed-form solution.

How can we perform the optimization process?

The gradient descent algorithm can make the magic! It is an optimization technique used to reach a local minimum of a complex function gradually. Here is a beautiful and commonly used analogy about what the gradient descent algorithm does. A complex function can be visualized as adjacent mountains and valleys. There are tops and bottoms. Imagine we are currently standing at some point of a valley and want to climb down to the bottom. But the fog is very heavy and our sight is limited. We cannot find a whole path directly down to the bottom and walk down along that path. What we can only do is to choose the next step that can bring us down a little bit. There are many directions we can walk our next step toward. But the direction that brings us down the most is favored. Gradient descent algorithm is to find that direction and then put us down by one step.

Mathematically speaking, the gradient descent algorithm finds that direction by calculating the partial derivative of the function. What the function means here is the cost function. This brings another advantage of commonly used activation functions mentioned earlier – their partial derivatives are easy to compute.

There are two questions related to the gradient descent algorithm.

What step size should we choose? If our step size is too small, it takes long to reach the bottom. But if our step size is too big, we may step over the bottom and miss it forever.
What initial position (i.e. weights, biases) should we choose? We must start from somewhere. A set of parameters of weights and biases in the artificial neural network gives an initial position.

The rough answer I have for these two questions at the moment is to do some experiments. I guess that as our experience grows, we can make better judgments on these two choices more easily. Moreover, more principled approaches may be found in the literature.
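A small experiment of this kind is sketched below. It assumes a toy one-parameter cost, (w - 3)^2, instead of a real neural network cost, and the step sizes and starting point are arbitrary choices for illustration.

```python
def cost(w):
    # Toy cost function with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the toy cost with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(start, step_size, n_steps):
    w = start
    for _ in range(n_steps):
        # Move against the gradient, i.e. in the downhill direction.
        w = w - step_size * gradient(w)
    return w

for step_size in (0.01, 0.1, 1.1):
    w = gradient_descent(start=0.0, step_size=step_size, n_steps=50)
    print("step size %.2f -> w = %.4f, cost = %.4f" % (step_size, w, cost(w)))
```

With the same number of steps, the smallest step size has not yet reached the bottom, the middle one lands very close to w = 3, and the largest one overshoots further on every step and moves away from the minimum.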

There is one question related to the cost function.

Why do we favor the quadratic format of the difference between predicted output and actual value as the cost function?

The first reason must be that it helps measure the prediction accuracy. As for other reasons, when Michael Nielsen discusses why we introduce the quadratic cost in his book, he mentions that the quadratic function is smooth, which makes it easy to figure out how to make small changes to the weights and biases in order to improve the output.

Backpropagation Algorithm

The backpropagation algorithm is a set of rules to update the weights and biases of an artificial neural network by partitioning the prediction error into all neurons with the aid of gradient descent on each neuron.

Chapter 7, Neural Networks, in the book Discovering Knowledge in Data: An Introduction to Data Mining works through a simple example of how to perform backpropagation in a neural network by hand.
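To make the idea concrete, here is a minimal sketch on the smallest network I could think of: one input, one sigmoid hidden neuron, and one sigmoid output neuron, with the quadratic cost 0.5 * (output - y)^2. The network shape, parameter values, and function names are my own assumptions, not taken from the book. The gradients from backpropagation are compared with numerical gradients as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)      # hidden neuron
    out = sigmoid(w2 * h + b2)    # output neuron
    return h, out

def cost(params, x, y):
    _, out = forward(params, x)
    return 0.5 * (out - y) ** 2

def backprop(params, x, y):
    """Return the gradients of the cost for [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    h, out = forward(params, x)
    # Error at the output neuron (chain rule through the sigmoid).
    delta_out = (out - y) * out * (1.0 - out)
    # Error passed back to the hidden neuron through the weight w2.
    delta_h = delta_out * w2 * h * (1.0 - h)
    return [delta_h * x, delta_h, delta_out * h, delta_out]

def numerical_gradient(params, x, y, eps=1e-6):
    # Central differences, used only to check the backprop result.
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(up, x, y) - cost(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]   # illustrative starting weights and biases
x, y = 1.5, 1.0                  # one made-up training observation
print(backprop(params, x, y))
print(numerical_gradient(params, x, y))
```

A gradient descent update would then subtract the step size times each of these gradients from the corresponding weight or bias.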

Artificial Neural Network Architecture

In the construction process of an artificial neural network, there are two hard questions related to hidden layers, compared to the input layer and output layer. Note that the input layer can be decided based on the feature engineering and the output layer is pretty straightforward.

How many hidden layers should we include in order to achieve the best?
How many neurons should we include in each hidden layer?

More sigmoid hidden layers and neurons add capacity to the neural network but also tend to cause overfitting.

Many tutorials mention that some heuristics help. I guess it really takes lots of experiments and domain knowledge to make a quite good choice.

Tuesday, May 17, 2016

A Guide to Database Service Trove for OpenStack Liberty on Ubuntu 14.04

Let's start with some brief introduction of four components of Trove and their functions in the whole OpenStack environment.

We can imagine the whole of OpenStack as a construction company that helps people build houses. Virtual machine servers built by OpenStack are analogous to houses built by this construction company. The components of OpenStack are like departments of the company, which provide different services to make the company run well. The Keystone department is responsible for verifying customers' identities. The Glance department is responsible for providing the blueprint of the house. The Nova department is responsible for the construction work. The Cinder department is responsible for providing bricks.

In this construction company, we are going to establish a department called Trove. It is responsible for installing a piece of intelligent equipment in the house. As in any department, people take on different responsibilities. In the Trove department, there are three full-time guys, plus a technician who is hired for each job.

Trove-api: He acts like the department front desk. All requests from customers come to him first. He is responsible for handling and dispatching requests. For example, suppose a customer requests the installation of intelligent equipment but doesn't have a house to put it in, and permits the Trove department to do whatever is needed to install it. Trove-api will tell the manager, Trove-taskmanager, about the request.

Trove-taskmanager: He acts like the department manager. He has the right to ask other departments (Nova, Cinder, Glance, etc.) to help him with the work of building the house, because an appropriate house has to be built before the intelligent equipment can be installed. So Trove-taskmanager will ask the other departments to help build the house first. Once the house is ready, Trove-taskmanager will get the message. He will then hire a new technician called Trove-guestagent and send him to the house to install the intelligent equipment.

Trove-conductor: He acts like the department secretary. The technician Trove-guestagent reports the installation progress to him, such as what errors he encounters and whether the installation has completed successfully.

Trove-guestagent: He acts like the technician guy who is responsible for installing the intelligent equipment. He sends the installation progress message to the secretary Trove-conductor.

These four guys talk to each other by cell phone, which is the role RabbitMQ plays in OpenStack.

The deployment below assumes that an OpenStack environment has been set up and runs correctly with the services Keystone, Horizon, Glance, Nova, Cinder, and Swift. Note that Swift is optional for a Trove deployment unless the backup operation is needed.

Note: Replace capital words like TROVE_DBPASS, NETWORK_LABEL, RABBIT_USERNAME, RABBIT_PASS, TROVE_PASS with appropriate values.

Prerequisites

Create the database for trove.

$ mysql -u root -p
> CREATE DATABASE trove;
> GRANT ALL PRIVILEGES ON trove.* TO 'trove'@'localhost' IDENTIFIED BY 'TROVE_DBPASS';
> GRANT ALL PRIVILEGES ON trove.* TO 'trove'@'%' IDENTIFIED BY 'TROVE_DBPASS';
> FLUSH PRIVILEGES;

Source the admin credentials of OpenStack environment.

$ source admin-openrc.sh

Create the service credentials for trove.

$ openstack user create --domain default --password-prompt trove
$ openstack role add --project service --user trove admin
$ openstack service create --name trove --description "OpenStack Database Service" database

Create the database service endpoints.

$ openstack endpoint create --region RegionOne database public http://controller:8779/v1.0/%\(tenant_id\)s
$ openstack endpoint create --region RegionOne database internal http://controller:8779/v1.0/%\(tenant_id\)s
$ openstack endpoint create --region RegionOne database admin http://controller:8779/v1.0/%\(tenant_id\)s

Install and Configure Components

Install the packages.

# apt-get install python-trove python-troveclient trove-common trove-api trove-taskmanager trove-conductor

Edit the /etc/trove/trove.conf file.

[database]

connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove

[DEFAULT]

verbose = True
debug = True
rpc_backend = rabbit
auth_strategy = keystone
trove_auth_url = http://controller:5000/v2.0
nova_compute_url = http://controller:8774/v2
cinder_url = http://controller:8776/v2
neutron_url = http://controller:9696/
add_addresses = True
network_label_regex = ^NETWORK_LABEL$
log_dir = /var/log/trove/
log_file = trove-api.log

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

[keystone_authtoken]

auth_uri = http://controller:5000
auth_url = http://controller:35357
auth_plugin = password
project_domain_id = default
user_domain_id = default
project_name = service
username = trove
password = TROVE_PASS

Edit the /etc/trove/trove-taskmanager.conf file.

[database]

connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove

[DEFAULT]

verbose = True
debug = True
rpc_backend = rabbit
trove_auth_url = http://controller:5000/v2.0
nova_compute_url = http://controller:8774/v2
cinder_url = http://controller:8776/v2
neutron_url = http://controller:9696/
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
nova_proxy_admin_tenant_id = *************
taskmanager_manager = trove.taskmanager.manager.Manager
add_addresses = True
network_label_regex = ^NETWORK_LABEL$
log_dir = /var/log/trove/
log_file = trove-taskmanager.log
guest_config = /etc/trove/trove-guestagent.conf
guest_info = guest_info.conf
injected_config_location = /etc/trove/conf.d
cloudinit_location = /etc/trove/cloudinit

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

Edit the /etc/trove/trove-conductor.conf file.

[database]

connection = mysql+pymysql://trove:TROVE_DBPASS@controller/trove

[DEFAULT]

verbose = True
debug = True
trove_auth_url = http://controller:5000/v2.0
rpc_backend = rabbit
log_dir = /var/log/trove/
log_file = trove-conductor.log

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

Edit the /etc/trove/trove-guestagent.conf file.

[DEFAULT]

verbose = True
debug = True
trove_auth_url = http://controller:5000/v2.0
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
nova_proxy_admin_tenant_id = *************
rpc_backend = rabbit
log_dir = /var/log/trove/
log_file = trove-guestagent.log
datastore_registry_ext = vertica:trove.guestagent.datastore.experimental.vertica.manager.Manager
(Based on the notes above datastore_registry_ext in the configuration file, add datastore you want.)

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = RABBIT_USERNAME
rabbit_password = RABBIT_PASS

Edit the /etc/init/trove-taskmanager.conf file.

--exec /usr/bin/trove-taskmanager -- --config-file=/etc/trove/trove-taskmanager.conf ${DAEMON_ARGS}

Edit the /etc/init/trove-conductor.conf file.

--exec /usr/bin/trove-conductor -- --config-file=/etc/trove/trove-conductor.conf ${DAEMON_ARGS}

Populate the trove service database.

# trove-manage db_sync

Finalize Installation

Restart the database services.

# service trove-api restart
# service trove-taskmanager restart
# service trove-conductor restart

Remove the SQLite database file.

# rm -f /var/lib/trove/trove.sqlite

Build database guest image for OpenStack Trove.

Reference:
Building a guest database image for OpenStack Trove
Build Trove guest image manually in OpenStack

Configure the database guest.

Edit the /etc/trove/trove-guestagent.conf file.

[DEFAULT]

verbose = True
debug = True
trove_auth_url = http://controller:5000/v2.0
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin
nova_proxy_admin_tenant_id = *************
rpc_backend = rabbit
log_dir = /var/log/trove/
log_file = trove-guestagent.log
datastore_registry_ext = vertica:trove.guestagent.datastore.experimental.vertica.manager.Manager
(Based on the notes above datastore_registry_ext in the configuration file, add datastore you want.)

[oslo_messaging_rabbit]

rabbit_host = controller
rabbit_userid = openstack
rabbit_password = RABBIT_PASS

Edit the /etc/init/trove-guestagent.conf file.

--exec /usr/bin/trove-guestagent -- --config-file=/etc/trove/conf.d/guest_info.conf --config-file=/etc/trove/trove-guestagent.conf ${DAEMON_ARGS}

Debug Common Problems.

Read logs and find out specific information. The first place to check is trove-taskmanager.log on the controller node and then trove-guestagent.log on the guest.

No corresponding Nova instances are launched.

Check the configuration values for nova_proxy_admin_user, nova_proxy_admin_pass, nova_proxy_admin_tenant_name, nova_proxy_admin_tenant_id.
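A small script can do this check. The sketch below assumes Python 3 and that the options live in the [DEFAULT] section, as in the configuration files shown earlier; the function name and the sample text are made up for illustration.

```python
import configparser

REQUIRED_OPTIONS = (
    "nova_proxy_admin_user",
    "nova_proxy_admin_pass",
    "nova_proxy_admin_tenant_name",
    "nova_proxy_admin_tenant_id",
)

def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required [DEFAULT] options that are absent or empty."""
    # Interpolation is disabled so that '%' in values is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    defaults = parser.defaults()
    return [name for name in required if not defaults.get(name, "").strip()]

# Made-up sample in which the tenant id has been forgotten.
SAMPLE = """\
[DEFAULT]
nova_proxy_admin_user = admin
nova_proxy_admin_pass = admin
nova_proxy_admin_tenant_name = admin

[oslo_messaging_rabbit]
rabbit_host = controller
"""

print(missing_options(SAMPLE))
```

For a real check, the text would be read from /etc/trove/trove-taskmanager.conf or /etc/trove/trove-guestagent.conf instead of the sample string.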

Trove taskmanager, trove conductor, and trove guestagent cannot talk to each other via RPC.

Check their logs to see whether there is a "connected to AMQP" message. If there is no such message, check the configuration values of rabbit_host, rabbit_userid, rabbit_password, and rpc_backend.

The guest_info.conf cannot be injected to the guest.

1. Install the following packages on compute nodes

# apt-get install libguestfs-tools python-libguestfs

# sudo update-guestfs-appliance

# sudo usermod -a -G kvm yourlogin

# chmod 0644 /boot/vmlinuz*

2. Add the following configuration options in /etc/nova/nova-compute.conf on compute nodes.

[DEFAULT]

compute_driver=libvirt.LibvirtDriver

rootwrap_config = /etc/nova/rootwrap.conf

[libvirt]

virt_type=qemu

inject_partition = -1

3. Restart nova-compute service on compute nodes

# service nova-compute restart

Trove guestagent cannot get guest id from guest_info.conf and Trove instance is stuck at 'BUILD' status forever, but guest info is already injected to the guest.

Check the /etc/init/trove-guestagent.conf file on the guest and make sure /etc/trove/conf.d/guest_info.conf is passed as one of the configuration files (via --config-file).

Sunday, April 24, 2016

MySQL Router: A High Availability Solution for MySQL

Nowadays more and more companies are starting to use cloud computing systems and services. Cloud service users want their service level guaranteed. Take a database cloud service as an example: users expect the database on the cloud to be available whenever they use it. But something can always make the database unavailable, such as a network failure or an unexpected database shutdown. So a big problem for cloud service providers is how to guarantee the service level declared in the service level agreement. High availability solutions address that problem. "High availability" is a fundamental and important concept in cloud computing. It means the service or server on the cloud is up at a high rate, such as 99.99% of the time in a given year.

In this post, I would like to introduce one high availability solution for MySQL -- MySQL Router. First, I will describe what it is. And then I will show how to use it.

What is MySQL Router?

MySQL Router is middleware that sits between a MySQL client and MySQL servers and redirects a query from the client to one of the servers. Usually we connect a MySQL client directly to a MySQL server. So why do we want this middleware? The reason is that, with MySQL Router, we can set up more than one backend MySQL server. When one server is down, MySQL Router can automatically redirect the query to another available server, which helps guarantee the service level.

How to use MySQL Router?

1. Set up MySQL backend servers.

What I did is to launch two virtual machines with Ubuntu Linux system and install a MySQL server on each virtual machine. Another setup scenario is introduced in MySQL Router tutorial. In real practice, the MySQL backend servers are set in the replication topology.

Command of installing MySQL server on Ubuntu:
shell> sudo apt-get install mysql-server

2. Set up MySQL Router.

-- Download the MySQL APT repository.
shell> wget http://dev.mysql.com/get/mysql-apt-config_0.7.2-1_all.deb

-- Install the MySQL APT repository.
shell> sudo dpkg -i mysql-apt-config_0.7.2-1_all.deb

-- Update the APT repository.
shell> sudo apt-get update

-- Install MySQL Router.
shell> sudo apt-get install mysql-router

-- Install MySQL Utilities.
shell> sudo apt-get install mysql-utilities

3. Configure MySQL Router.

The configuration is under /etc/mysqlrouter/mysqlrouter.ini.

The configuration instructions can be found here. The most important part is the option 'destinations' in the [routing] section. Its format is host_ip:port. For example, if there are two MySQL servers on the host 192.168.25.200/201 respectively, the 'destinations' can be set as follows.

destinations = 192.168.25.200:3306, 192.168.25.201:3306

4. Start MySQL Router.

shell> sudo mysqlrouter -c /etc/mysqlrouter/mysqlrouter.ini

5. Set up MySQL client to connect to MySQL Router.

A python script will be used here as MySQL client. Based on the documentation, the client must be executed on the same machine where the MySQL Router is running.

(About the 'cnx' statement:
-- Change the user and password in the 'cnx' statement to your own.
-- By default, the host and port of MySQL Router are 'localhost' and 7001, unless they are configured by the option 'bind_address' in mysqlrouter.ini.)

shell> python
>>> import mysql.connector
>>> cnx = mysql.connector.connect(host='localhost', port=7001, user='root', password='secret')
>>> cur = cnx.cursor()
>>> cur.execute("SHOW DATABASES")
>>> print(cur.fetchall())

6. Play with MySQL Router.

-- Create different databases on each MySQL server under the same user specified in the 'cnx' statement.
-- Execute the code in Step 5. And check the output.
-- Stop a MySQL server, re-execute code from 'cnx', and examine the difference.

More details can be found in the part of Testing the Router.

Saturday, April 16, 2016

Google Charts: a great way to draw interactive and realtime charts on JSON data requested from REST API

Data visualization is helpful not only for data analysts but also for system administrators. Data analysts use data visualization to untangle the relationships between features, while system administrators use it to monitor the status of the system and its users. Many plotting packages have been developed to visualize data, such as ggplot2 in R and matplotlib in Python, and they are very powerful for making elegant charts. But the charts made by those packages are STATIC. In some cases, we want charts that respond to users' interaction or refresh automatically to follow the latest data in the data set. This is where Google Charts comes in. Another benefit of Google Charts is that the charts can be showcased on the web! Moreover, Google Charts makes it easy to use AJAX to load JSON data retrieved from a REST API into the chart data table.

In this post, I will give a complete example of how to use Google Charts to draw a realtime chart that reads data from a REST API and automatically refreshes itself. Using Google Charts requires familiarity with HTML, JavaScript, and CSS. I think the Google Charts guide makes the technique very approachable for new learners. For people who find the guide hard to follow, some tutorial resources are provided at the end of the post to help catch up on the required basic concepts.

The code of the example can be found in my GitHub. My workflow for this program is as follows.

Prepare a REST API which provides JSON data to serve as chart data for visualization. A public GitHub API mentioned in this post is used. It contains some data about issues related to ggplot2 repository on GitHub. We can create our own REST API, as mentioned in my last post.
Use AJAX to read JSON data from REST API to a JavaScript variable, which will be transformed to the format of Google Chart data table.
Set up Google Chart refresh effect by using the function setRefreshInterval().
Set up Google Chart option animation.
Draw the chart.
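The transformation step in this workflow is done in JavaScript in my example. The sketch below uses Python only to show the target shape of a Google Charts data table (a "cols" list plus a "rows" list of cells). The choice of the "number" and "comments" fields and the sample records are assumptions made for illustration, not taken from the real API response.

```python
import json

def issues_to_datatable(issues):
    """Convert a list of issue records into the Google Charts
    DataTable literal format: {"cols": [...], "rows": [...]}."""
    cols = [
        {"id": "number", "label": "Issue number", "type": "string"},
        {"id": "comments", "label": "Comments", "type": "number"},
    ]
    rows = [
        {"c": [{"v": str(issue["number"])}, {"v": issue["comments"]}]}
        for issue in issues
    ]
    return {"cols": cols, "rows": rows}

# Made-up records standing in for the JSON returned by the REST API.
sample_issues = [
    {"number": 101, "comments": 4},
    {"number": 102, "comments": 0},
]

print(json.dumps(issues_to_datatable(sample_issues), indent=2))
```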

For people who haven't had experiences with HTML, CSS, JavaScript, and AJAX, I would recommend the following tutorial resources. Three weeks ago, I only knew a little bit about HTML and absolutely nothing about CSS, JavaScript, AJAX, or the concept of front end development. I referred to the following tutorials, grasped all of them in three weeks, and finished my task of user interface development from work.

tutorialspoint (English)
Front-End Web Developer Nanodegree courses on Udacity (English)
Python Web Development on maiziedu (Chinese)

For people who want to learn more about REST API development, Python full-stack web development, and HTTP, the above resources will also be helpful.

Last but not least, I would like to share a little about my motivation for learning web development. After reading many posts and studying successful business applications, my feeling is that machine learning and web development go hand in hand. Many machine learning systems are deployed in web applications, and a complete web solution may contain machine learning algorithms that solve a specific problem. I guess that makes sense if we just think about what Amazon does.

Sunday, March 27, 2016

RESTful web service and OpenStack dashboard

In the past three weeks, I have worked on the task of developing a RESTful web service and integrating it to OpenStack dashboard. In this post, I will share some knowledge about them and also the tools (i.e. Flask framework, Flask RESTful framework) that help me finish this task.

Nowadays many people benefit from the convenience of cloud computing services on OpenStack, Amazon Web Services, and many other cloud providers. Some may wonder how I get the service like a virtual machine by just clicking several buttons in the browser. This relies on the technique of RESTful web service.

In a RESTful web service, everything can be categorized as either a resource or an operation (method) on a resource. A resource is identified by a URI. There are four common kinds of operations on a resource, namely GET, POST, PUT, and DELETE. An operation is carried out by an HTTP request. The response to an HTTP request is serialized data in a format such as JSON or XML, which makes information exchange between machines efficient.
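The resource-and-operation idea can be sketched without any web framework. The toy class below is not Flask or Flask RESTful and does not speak real HTTP. It only maps a method plus a URI to an action on an in-memory resource and returns a status code with a JSON body; the "/items" URI and all names are hypothetical.

```python
import json

class ResourceStore(object):
    """Toy dispatcher: (method, URI) -> operation on an in-memory resource."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def handle(self, method, uri, body=None):
        parts = uri.strip("/").split("/")
        if parts[0] != "items" or len(parts) > 2:
            return 404, json.dumps({"error": "not found"})
        if len(parts) == 1:
            # Operations on the collection resource /items
            if method == "GET":
                return 200, json.dumps(self._items, sort_keys=True)
            if method == "POST":
                item_id = str(self._next_id)
                self._next_id += 1
                self._items[item_id] = body
                return 201, json.dumps({"id": item_id})
            return 405, json.dumps({"error": "method not allowed"})
        # Operations on a single resource /items/<id>
        item_id = parts[1]
        if item_id not in self._items:
            return 404, json.dumps({"error": "not found"})
        if method == "GET":
            return 200, json.dumps(self._items[item_id])
        if method == "PUT":
            self._items[item_id] = body
            return 200, json.dumps(body)
        if method == "DELETE":
            del self._items[item_id]
            return 204, ""
        return 405, json.dumps({"error": "method not allowed"})

store = ResourceStore()
print(store.handle("POST", "/items", {"name": "vm1"}))
print(store.handle("GET", "/items/1"))
```

A real framework adds the HTTP layer, routing, and serialization around the same idea.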

Many online tutorials cover RESTful web services in more detail. When I started to study this concept, I read lots of posts, and I found the following mapping helpful. For people who have object-oriented (OO) programming experience, the resource-oriented architecture of a RESTful web service may be understood through the OO paradigm. In OO, there are only objects and methods on objects, where a method on an object may be invoked by another object. But note that the working mechanisms of RESTful web services and OO are very different. The former works through HTTP requests, while the latter is just a programming paradigm.

Because web application and service development involves a lot of repetitive work, many frameworks have been created to do that work automatically, so that developers can focus on their own parts. The frameworks Flask and Flask RESTful have helped me develop the RESTful web service. I chose these two frameworks because, after exploring other Python web frameworks like Django and Django REST, I found them fairly easy to learn and use and suitable for my case. Another difference between Flask and Django is that Flask lets the developer choose their own object relational mapper (ORM), while Django provides a default ORM. At the moment, I feel more comfortable using SQLAlchemy as the ORM. The documentation of both frameworks provides example project code to help get started, so example code will not be provided here. Note that, as mentioned in the Flask RESTful documentation, the Python package requests is an easy way to test whether a RESTful web service or its URIs work as expected.

After RESTful web service is developed and ready for use, a user-friendly front-end dashboard will also be desired not only by users but also by administrators. Because the RESTful web service in my case is related to OpenStack, the OpenStack dashboard is customized to host it, considering the fact that system administrators and users always want to do everything in just one dashboard. To figure out how to integrate an external web service to OpenStack dashboard, I have mainly referred to two posts from Keith and OpenStack tutorial respectively, which answer most of my questions.

For people who want to develop their own front-end dashboard, the knowledge of HTML, CSS, and JavaScript will be required. HTML is in charge of the content of a webpage. CSS is in charge of the style of a webpage. And JavaScript makes a webpage more interactive. I don't have many experiences to share on this aspect. But I believe as long as we know what we want to achieve, we will find the bridge to it.

Saturday, February 27, 2016

Automating system administrative tasks of Vertica cluster on OpenStack with Python

In this post, I would like to share some experience on how to automate system administrative tasks with Python. This is the first time I have done such tasks. One of the big lessons I have learned is how to break a complex task down into small sequential tasks.

Based on the project need, recently I have been working on some system administrative tasks. And one task is to automate the following process with a Python program.

Create a Vertica cluster on OpenStack Trove.
Migrate certain users and their tables to this newly created cluster from the cluster they currently sit on. This task can be broken down into the following sequential tasks.

Connect to the current Vertica cluster and export objects of those users into a sql file on the machine where the python program is executed.
Transfer the sql file to the target Vertica cluster.
Connect to the target Vertica cluster with the dbadmin credentials to create users and grant database privilege to them.
Connect to the target Vertica cluster with each user's credentials to create their objects (schemas and tables) from sql file.
Copy table data to the target Vertica cluster from the current Vertica cluster.

The reason why we are interested in launching a Vertica cluster is that Vertica is a massively parallel processing database (MPPDB). The MPPDB-as-a-service on the cloud would be very attractive for companies who perform analytical tasks on huge amounts of data.

The following Python packages have been used to implement the above tasks.

troveclient: Call Trove Python API.
python-vertica: Interact with Vertica server on the cluster.
subprocess: Execute a command line on the local machine.
paramiko: Execute a command line on the remote machine (i.e. virtual machines that host Vertica server on OpenStack)

Let's see how things work out with examples.

1. As mentioned in this article, OpenStack is a popular open source cloud operating system for deploying infrastructure as a service, and cloud-based tasks can be automated by working with the OpenStack Python APIs. The author provides examples of the Keystone API, the Nova API, and so on, but doesn't cover the Trove API.

An example of the Trove API can be found in the screenshot below.

2. Most database management systems adopt the client-server paradigm, which means we can interact with the server through a client. The Python client package for Vertica is python_vertica.

The screenshot below provides an example of how to use python_vertica to export objects.
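As a rough text version of the export step, here is a minimal sketch. It is not the code from the screenshot. It assumes `conn` is any DB-API style connection to Vertica (for instance one returned by the Python client), and that Vertica's EXPORT_OBJECTS function returns the DDL script as the query result when its destination argument is an empty string. The function name `export_objects_to_file` and the scope pattern are my own.

```python
import re

# Conservative pattern for "schema" or "schema.table" scopes (an assumption).
_SCOPE_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def export_objects_to_file(conn, scope, out_path):
    """Write the DDL that EXPORT_OBJECTS returns for `scope` to out_path."""
    if not _SCOPE_RE.match(scope):
        raise ValueError(f"unexpected scope: {scope!r}")
    cur = conn.cursor()
    # Assumption: with an empty destination the script comes back as the result.
    cur.execute(f"SELECT EXPORT_OBJECTS('', '{scope}', false)")
    rows = cur.fetchall()
    script = "\n".join(str(row[0]) for row in rows)
    with open(out_path, "w") as handle:
        handle.write(script)
    return out_path
```

The scope is validated before it is placed into the SQL text, because it is interpolated into the statement rather than passed as a bound parameter.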

3. As mentioned in the section above, one small task is to transfer the SQL file from the local machine to a remote machine (the target Vertica cluster). subprocess.check_call() is used for this.
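A minimal sketch of such a transfer is shown below. The post only says that subprocess.check_call() is used, so the choice of `scp` as the command is my assumption, as are the function names, the host, and the paths.

```python
import subprocess


def build_scp_command(local_path, user, host, remote_path, key_file=None):
    """Return the scp argument list for copying local_path to user@host:remote_path."""
    command = ["scp"]
    if key_file:
        command += ["-i", key_file]  # authenticate with a private key file
    command += [local_path, f"{user}@{host}:{remote_path}"]
    return command


def transfer_file(local_path, user, host, remote_path, key_file=None):
    """Copy the file; check_call raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(
        build_scp_command(local_path, user, host, remote_path, key_file)
    )
```

Passing the command as a list instead of a single string avoids going through a shell, so file names with spaces do not need extra quoting.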

4. Another small task is to execute a command line that creates a user's objects from the SQL file on the target Vertica cluster. So how can we execute a command line on a remote machine?

Below is the code screenshot that uses paramiko to do that. Note that in this case, SSH login to the remote machine is set up via a key pair file, so the parameter "KEY_FILE_PATH_LOCAL" is needed. If SSH login is set up via password, refer to the paramiko documentation and make the corresponding changes.
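In text form, the remote execution could look roughly like the sketch below. This is not the code from the screenshot. The function names are hypothetical, the use of `vsql` with the `-U`, `-w`, and `-f` options to replay the SQL file is my assumption about the remote command, and `key_file_path_local` plays the role of KEY_FILE_PATH_LOCAL above.

```python
import shlex


def build_vsql_command(user, password, sql_file):
    """Build the vsql command line that replays sql_file as the given user."""
    parts = ["vsql", "-U", user, "-w", password, "-f", sql_file]
    return " ".join(shlex.quote(part) for part in parts)


def run_remote(host, ssh_user, key_file_path_local, command):
    """Run command on host over SSH with key-pair login.

    Returns (exit_code, stdout_text, stderr_text).
    """
    import paramiko  # imported here so the command builder works without paramiko

    client = paramiko.SSHClient()
    # Accepting unknown host keys keeps the sketch short but is unsafe in production.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=ssh_user, key_filename=key_file_path_local)
    try:
        _, stdout, stderr = client.exec_command(command)
        out = stdout.read().decode()
        err = stderr.read().decode()
        exit_code = stdout.channel.recv_exit_status()  # waits for the command to end
        return exit_code, out, err
    finally:
        client.close()
```

Checking the exit code matters here, because exec_command itself does not raise an error when the remote command fails.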

This task requires knowledge of the operating system, the database system (i.e. Vertica), and Python. I am glad that I made it, since growth always starts with small steps. There is still some room to optimize the code, which is what I will work on next.

Sunday, February 21, 2016

An Initial Exploration of Electricity Price Forecasting

One month ago, I decided to explore the problem of electricity price forecasting to gain some knowledge of the electricity market and sharpen my skills in data management and modeling. This post describes what I have done and achieved in this project, covering problem definition, literature review, data collection, data exploration, feature creation, model implementation, model evaluation, and result discussion. Feature engineering and Spark are the two skills I aim to gain and improve in this project.

The price of a commodity in a market influences the behavior of all market participants, from suppliers to consumers. So knowledge about the future price plays a decisive role in selling and buying decisions, for suppliers to make profits and for consumers to save costs.

Electricity is the commodity in the electricity market. In a deregulated electricity market, generating companies (GENCOs) submit production bids one day ahead. When GENCOs decide on their bids, neither the electricity load nor the price for the coming day is known, so those decisions rely on forecasts of electricity load and price. Electricity load forecasting has reached an advanced stage in both industry and academia, with low enough prediction error, while electricity price forecasting is not as mature in terms of tools and algorithms. That is because the components of the electricity price are more complicated than those of the electricity load.

1. Literature Review

"If I have seen further, it is by standing on the shoulders of giants." -- Newton

This project mainly refers to the review paper (Electricity Price Forecasting in Deregulated Markets: A Review and Evaluation), which summarizes both price-influencing factors and models.

2. Exploratory Data Analysis

The locational based marginal price (LBMP) in the day-ahead market, provided by the New York Independent System Operator (New York ISO), is used. Because of limited computational resources, this project only forecasts the price of the "WEST" zone for the coming day.

"Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." -- Pedro Domingos in "A Few Useful Things to Know about Machine Learning"

Based on scatter plots and correlation coefficients, the following variables are used as model inputs.

- Marginal cost losses

- The square of marginal cost losses

- Marginal cost congestion

- The square of marginal cost congestion

- The average price in the past 1 day

- The average price in the past 7 days

- The average price in the past 28 days

- The average price in the past 364 days

- The standard deviation of price in the past 1 day

- The standard deviation of price in the past 7 days

- The standard deviation of price in the past 28 days

- The standard deviation of price in the past 364 days

- The ratio of average price in the past 1 day over average price in the past 7 days

- The ratio of average price in the past 7 days over average price in the past 364 days

- The ratio of average price in the past 28 days over average price in the past 364 days

- The ratio of standard deviation in the past 1 day over standard deviation in the past 7 days

- The ratio of standard deviation in the past 7 days over standard deviation in the past 364 days

- The ratio of standard deviation in the past 28 days over standard deviation in the past 364 days

- The year
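The averages, standard deviations, and ratios listed above can be sketched as follows. This is a simplified illustration in plain Python rather than the project's Spark code. It assumes one price per hour in a list, uses the population standard deviation (the post does not say which one was used), and the names `window_stats`, `history_features`, and `ratio_pairs` are hypothetical.

```python
from statistics import mean, pstdev

HOURS_PER_DAY = 24  # assumes one price per hour


def window_stats(hourly_prices, t, days):
    """Mean and standard deviation of the `days` days ending just before index t."""
    start = t - days * HOURS_PER_DAY
    if start < 0:
        raise ValueError("not enough history for this window")
    window = hourly_prices[start:t]
    return mean(window), pstdev(window)


def _safe_ratio(numerator, denominator):
    """Ratio that yields NaN instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def history_features(hourly_prices, t, windows=(1, 7, 28, 364),
                     ratio_pairs=((1, 7), (7, 364), (28, 364))):
    """Window averages, standard deviations, and short-over-long ratios for hour t."""
    feats = {}
    for days in windows:
        avg, std = window_stats(hourly_prices, t, days)
        feats[f"avg_{days}d"] = avg
        feats[f"std_{days}d"] = std
    for short, long_ in ratio_pairs:
        feats[f"avg_ratio_{short}d_{long_}d"] = _safe_ratio(
            feats[f"avg_{short}d"], feats[f"avg_{long_}d"])
        feats[f"std_ratio_{short}d_{long_}d"] = _safe_ratio(
            feats[f"std_{short}d"], feats[f"std_{long_}d"])
    return feats
```

Each window ends just before hour t, so a feature for a given hour never uses the price it is meant to predict.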

3. Model Development

Linear regression is used as the initial model for exploration.

4. Result Presentation and Analysis

The mean absolute percentage error (MAPE) is used as the performance metric. The MAPE is currently around 53%, which is high. The solution can be improved in the following respects.

Create more informative input variables, such as electricity load. The electricity load data from the New York ISO is provided at 5-minute intervals. The data has been retrieved but is still being processed, for example by imputing missing values and aggregating to the hourly level.
Use other models, such as a neural network.
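For reference, the MAPE mentioned above is the average of |actual - predicted| / |actual| over all forecast hours, expressed in percent. Below is a minimal sketch. The function name `mape` is mine, and raising an error on zero actual values is one possible choice, not necessarily what the project code does.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

Because each error is divided by the actual value, hours with prices close to zero can dominate the metric, which is worth keeping in mind when reading a MAPE for price data.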

The code developed for this project can be found here. The computation was run on Spark, and special attention was paid to feature engineering. There is still a lot to do in this project to improve forecasting accuracy. I will try to continue this topic if I have enough time and energy.

Wednesday, February 3, 2016

Multi-tenant Management for Database Service on the Cloud Computing Platform

In this post, I discuss the concept of multi-tenancy for database services on the cloud computing platform, as well as its benefits, technical challenges, and solution architecture. A car rental example is given to help readers understand it in a more general way. I wrote this post because I am currently working on a project that implements a parallel database service on the cloud computing platform at an extremely low cost through careful multi-tenancy management. And I have realized that what I have learned from this project is not only knowledge of database and big data systems but also the application of management science.

Multi-tenancy is one of the key problems faced by cloud service providers. For a database service on the cloud computing platform, multi-tenancy means that more than one tenant can be deployed on a single database server. If it is managed well, a lot of cost and energy can be saved. But we cannot deploy tenants together at random, because we have to meet a more important objective than cost saving, which is to satisfy the service level every tenant requests. A good solution architecture must be able to handle that challenge.

To help people who are unfamiliar with multi-tenancy or database services on the cloud understand the concept and its benefits, consider the following real-world example. The cloud computing platform can be thought of as a car rental company, and the tenants on the platform can be thought of as the company's customers. The only difference between them is the service: the cloud computing platform provides computing services such as a database service, while a car rental company rents out cars. Let's imagine that this car rental company has two customers. If the rental schedules of these two customers overlap, the company has to assign one car to each customer, so two cars have to be available to serve them. But if their rental schedules do not overlap, the company can assign the same car to both, and only one car needs to be available. The rental company has to buy enough cars to satisfy all customers. If customers with non-overlapping rental schedules are assigned to a single car, the total number of cars the company needs to buy is reduced. This is a common strategy for a rental company to save money, so it can also be a very good strategy for the cloud computing platform.
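The car example can be turned into a small calculation: if the rental periods are known, the number of cars the company must own equals the largest number of rentals that overlap at any moment. The sketch below is my own illustration of that idea. The function name `min_cars_needed` is hypothetical, and each rental is assumed to be a (start, end) pair where the car is free again at `end`.

```python
def min_cars_needed(rentals):
    """Largest number of rentals that are active at the same time."""
    events = []
    for start, end in rentals:
        events.append((start, 1))   # a car is taken
        events.append((end, -1))    # a car is returned
    # Tuples sort by time first; at equal times the return (-1) comes before
    # the pickup (+1), so a car returned at 12 can be reused at 12.
    events.sort()
    in_use = 0
    peak = 0
    for _, delta in events:
        in_use += delta
        peak = max(peak, in_use)
    return peak
```

For example, `min_cars_needed([(9, 12), (10, 14)])` gives 2 because the two rentals overlap, while `min_cars_needed([(9, 12), (13, 15)])` gives 1.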

But this strategy is challenging on the cloud computing platform, because it is not scheduled in advance when a tenant will use the service. And a tenant's service level agreement may not be satisfied in every time period if it is deployed on a database server together with other tenants. If we don't know when a tenant uses the service, how can we know which tenants should be put together? So in order to realize multi-tenancy, the first step is to roughly know or predict tenants' behavior, such as when they use the service, based on their records.

In 2013, Prof. Lo's group came up with a system architecture as a solution for multi-tenancy of parallel database as a service on the cloud computing platform, which can be found in the paper Parallel Analytics as A Service. The solution can be understood through the following simplified car rental scenario.

Let's assume that the car rental company serves customers who rent cars frequently and whose schedules are not known in advance. When a customer comes in, there must be a car available for him or her. As discussed earlier, we first want to know roughly when a customer rents a car, based on his or her records, in order to find which customers can be assigned to the same car in different time periods, which in turn reduces the total number of cars the company needs to own. But how? The solution in the paper suggests that when a new customer comes in, the company assigns a car to that customer, and only that customer can use the car for a certain period, such as one month. During this month, whenever the customer wants to use the car, it is always available, and the time periods when the customer uses it are recorded. As time goes on, we collect all customers' records. Based on those records, we can predict when each customer rents a car and then, through some analysis, decide which customers can share the same car without schedule conflicts.
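To make the last step concrete, here is a toy sketch of grouping customers by their recorded usage. It is a simple first-fit heuristic that I use only for illustration, not the analysis from the paper. The function name `assign_shared_cars` and the representation of records as sets of time slots (for example hour-of-week numbers) are assumptions.

```python
def assign_shared_cars(usage):
    """Group customers so that those sharing a car never used it in the same slot.

    usage maps a customer name to the set of time slots in which that customer
    was observed using a car during the monitoring period.
    """
    cars = []  # each entry: {"customers": [...], "busy": set of occupied slots}
    for customer, slots in usage.items():
        for car in cars:
            if car["busy"].isdisjoint(slots):
                # No recorded conflict with anyone already on this car.
                car["customers"].append(customer)
                car["busy"] |= slots
                break
        else:
            # Every existing car conflicts, so this customer gets a new one.
            cars.append({"customers": [customer], "busy": set(slots)})
    return [car["customers"] for car in cars]
```

With records such as `{"A": {1, 2}, "B": {2, 3}, "C": {5, 6}}`, customers A and C end up sharing one car while B gets another, so two cars serve three customers. Past records only predict future behavior, so a real system also needs a way to react when a prediction turns out to be wrong.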

Back on the cloud computing platform, the general logic of the solution is the same. The proposed system architecture has four components: Tenant Activity Monitor, Deployment Advisor, Deployment Master, and Query Router. I am mainly in charge of implementing the first three components, which are almost done, and my teammate has finished implementing the Query Router. Hopefully this will benefit more people as an open source project pretty soon. Compared with the car rental service, the database service on the cloud computing platform involves many more technical details. Take this as an example: if a customer of the car rental company is assigned to a different car, he or she just goes to get it. But on the cloud computing platform, if a tenant is assigned to another database server, all of its data has to be migrated to that server. But this is not the main point of this post.

Considering that the active tenant ratio is very low (e.g. 10% in IBM's database-as-a-service), multi-tenancy of parallel database service on the cloud computing platform is very attractive. For any kind of service provider, such as a cloud service provider or a car rental company, resources can be utilized more efficiently and costs can be reduced considerably through careful management with analytical methods. In July 2015, I took the job of implementing parallel database service on OpenStack as a research associate with Prof. Lo because of his group's reputation in database and big data systems, which I wanted to learn more about in order to realize my career goal. Now, besides knowledge of database and big data systems, this project also provides a vision of how resource management is applied, which helps me further understand the spirit of management science.