Saturday, February 27, 2016

Automating system administrative tasks of Vertica cluster on OpenStack with Python

In this post, I would like to share some experience with automating system administration tasks using Python. This is the first time I have done such tasks. One of the big lessons I have learned is how to break a complex task down into a sequence of small tasks. 

Driven by a project need, I have recently been working on some system administration tasks, one of which is to automate the following process with a Python program. 

  • Create a Vertica cluster on OpenStack Trove. 
  • Migrate certain users and their tables from the cluster they currently sit on to this newly created cluster. This task can be broken down into a sequence of smaller tasks. 

The reason we are interested in launching a Vertica cluster is that Vertica is a massively parallel processing database (MPPDB). MPPDB-as-a-service on the cloud would be very attractive for companies that perform analytical tasks on huge amounts of data. 

The following Python packages have been used to implement the above tasks. 

  • troveclient: Call the Trove Python API. 
  • vertica-python: Interact with the Vertica server on the cluster. 
  • subprocess: Execute a command line on the local machine. 
  • paramiko: Execute a command line on a remote machine (i.e., the virtual machines that host the Vertica server on OpenStack). 
Let's see how things work out with examples. 


1. As mentioned in this article, OpenStack is a popular open-source cloud operating system for deploying infrastructure as a service, and cloud-based tasks can be automated by working with the OpenStack Python APIs. The author provides examples of the Keystone API, the Nova API, and so on, but does not cover the Trove API. 

An example of calling the Trove API to create a Vertica cluster is shown below. 
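
This is a minimal sketch: the credentials, auth URL, datastore name, flavor, and volume size below are all assumptions and will differ for each OpenStack deployment.

```python
from troveclient.v1 import client

# All credentials and endpoints below are placeholders for illustration.
trove = client.Client(username="admin",
                      password="secret",
                      project_id="demo",
                      auth_url="http://controller:5000/v2.0")

# Look up one of the available versions of the Vertica datastore (name assumed).
version = trove.datastore_versions.list("vertica")[0]

# Request a 3-node Vertica cluster; flavor and volume size are assumptions.
instances = [{"flavorRef": "m1.large", "volume": {"size": 100}} for _ in range(3)]
cluster = trove.clusters.create("vertica-cluster-01",
                                "vertica",
                                version.name,
                                instances=instances)
print(cluster.id)
```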


2. Like most database management systems, Vertica adopts the client-server paradigm, which means we can interact with the server through a client. The Python client package for Vertica is vertica-python (imported as vertica_python). 

The sketch below shows how to use vertica_python to export a user's objects. 
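
It is a minimal sketch, assuming the user's tables live in a schema named alice_schema; the connection parameters and the file path are placeholders. Vertica's EXPORT_OBJECTS meta-function writes the DDL of the given scope to a SQL file on the node that executes the query.

```python
import vertica_python

# Connection parameters are placeholders; point them at the source cluster.
conn_info = {"host": "10.0.0.5",
             "port": 5433,
             "user": "dbadmin",
             "password": "secret",
             "database": "vdb"}

conn = vertica_python.connect(**conn_info)
try:
    cur = conn.cursor()
    # Export the DDL of the schema owned by the user being migrated.
    cur.execute("SELECT EXPORT_OBJECTS('/tmp/alice_objects.sql', 'alice_schema')")
    print(cur.fetchall())
finally:
    conn.close()
```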

3. As mentioned in the section above, one small task is to transfer the SQL file from the local machine to a remote machine (a node of the target Vertica cluster). subprocess.check_call() is used for this. 
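
A minimal sketch with subprocess.check_call and scp is below; the paths and host are placeholders, and the key file matches the one used for SSH login in the next step.

```python
import subprocess

# Paths and host below are placeholders for illustration.
LOCAL_SQL_FILE = "/tmp/alice_objects.sql"
KEY_FILE_PATH_LOCAL = "/home/me/.ssh/trove_key.pem"
REMOTE = "dbadmin@192.168.1.20"  # a node of the target Vertica cluster

# Copy the exported SQL file to the target node; check_call raises
# CalledProcessError if scp exits with a non-zero status.
subprocess.check_call([
    "scp", "-i", KEY_FILE_PATH_LOCAL,
    LOCAL_SQL_FILE,
    "{0}:/tmp/alice_objects.sql".format(REMOTE),
])
```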

4. Another small task is to execute a command line that creates a user's objects from the SQL file on the target Vertica cluster. So how can we execute a command line on a remote machine?

Below is a sketch of how to use paramiko to do that. Note that in this case, SSH login to the remote machine is set up via a key pair file, so the parameter "KEY_FILE_PATH_LOCAL" is needed. If SSH login is set up via a password, refer to the paramiko documentation and make the corresponding changes. 
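
In this minimal sketch, the host, user, key file, vsql path, and database credentials are placeholders.

```python
import paramiko

# Host, user, and file paths below are placeholders for illustration.
REMOTE_HOST = "192.168.1.20"
REMOTE_USER = "dbadmin"
KEY_FILE_PATH_LOCAL = "/home/me/.ssh/trove_key.pem"
REMOTE_SQL_FILE = "/tmp/alice_objects.sql"

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(REMOTE_HOST, username=REMOTE_USER, key_filename=KEY_FILE_PATH_LOCAL)

# Replay the exported DDL on the target cluster with vsql (path and password assumed).
command = "/opt/vertica/bin/vsql -U dbadmin -w secret -f {0}".format(REMOTE_SQL_FILE)
stdin, stdout, stderr = ssh.exec_command(command)
print(stdout.read().decode())
print(stderr.read().decode())
ssh.close()
```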

This task requires knowledge of the operating system, the database system (i.e., Vertica), and Python. I am glad that I made it work, since growth always starts with small steps. There is still room to optimize the code, which is what I will work on next. 

Sunday, February 21, 2016

An Initial Exploration of Electricity Price Forecasting

One month ago, I decided to do some exploration of the electricity price forecasting problem to gain some knowledge of the electricity market and to sharpen my skills in data management and modeling. This post describes what I have done and achieved in this project, from problem definition and literature review through data collection, data exploration, feature creation, model implementation, and model evaluation to result discussion. Feature engineering and Spark are two skills I aimed to gain and improve in this project.  

The price of a commodity in a market influences the behavior of all market participants, from suppliers to consumers. So knowledge about the future price plays a decisive role in selling and buying decisions, helping suppliers make profits and consumers save costs.

Electricity is the commodity in the electricity market. In a deregulated electricity market, generating companies (GENCOs) submit production bids one day ahead. When GENCOs decide on their bids, neither the electricity load nor the price for the coming day is known, so those decisions rely on forecasts of both. Electricity load forecasting has reached an advanced stage in both industry and academia, with sufficiently low prediction error, while electricity price forecasting is not as mature in terms of tools and algorithms, because the components of the electricity price are more complicated than those of the load. 

1. Literature Review

"If I have been able to see further, it was only because I stood on the shoulder of giants." -- Newton

The review paper Electricity Price Forecasting in Deregulated Markets: A Review and Evaluation was the main reference. It summarizes both price-influencing factors and forecasting models. 

2. Exploratory Data Analysis

The locational based marginal price (LBMP) in the day-ahead market provided by the New York Independent System Operator (NYISO) is used. Because of computational resource limitations, this project only forecasts the price of the "WEST" zone for the coming day.

"Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."   -- Pedro Domingos in "A Few Things to Know about Machine Learning"

Based on scatter plots and correlation coefficients, the following variables are used as model inputs (a sketch of how the rolling features could be computed follows the list). 

- Marginal cost losses
- The square of marginal cost losses
- Marginal cost congestion
- The square of marginal cost congestion
- The average price in the past 1 day
- The average price in the past 7 days
- The average price in the past 28 days
- The average price in the past 364 days
- The standard deviation of price in the past 1 day
- The standard deviation of price in the past 7 days
- The standard deviation of price in the past 28 days
- The standard deviation of price in the past 364 days
- The ratio of average price in the past 1 day over average price in the past 7 days
- The ratio of average price in the past 7 days over average price in the past 364 days
- The ratio of average price in the past 28 days over average price in the past 364 days
- The ratio of  standard deviation in the past 1 day over standard deviation in the past 7 days
- The ratio of standard deviation in the past 7 days over standard deviation in the past 364 days
- The ratio of standard deviation in the past 28 days over standard deviation in the past 364 days
- The year 
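
As a minimal sketch of how these variables could be derived, assume an hourly WEST-zone LBMP series stored in a CSV with columns timestamp, price, marginal_cost_losses, and marginal_cost_congestion; the file name and column names are assumptions.

```python
import pandas as pd

# File and column names below are assumptions for illustration.
df = pd.read_csv("west_zone_lbmp.csv", parse_dates=["timestamp"], index_col="timestamp")

# Squares of the marginal cost components.
df["mcl_sq"] = df["marginal_cost_losses"] ** 2
df["mcc_sq"] = df["marginal_cost_congestion"] ** 2

# Rolling statistics over the past 1, 7, 28, and 364 days (24 hourly points per
# day), lagged by one day so the coming day only sees past prices.
past = df["price"].shift(24)
for days in (1, 7, 28, 364):
    window = 24 * days
    df["avg_{0}d".format(days)] = past.rolling(window).mean()
    df["std_{0}d".format(days)] = past.rolling(window).std()

# Ratio features between the rolling statistics.
for num, den in [(1, 7), (7, 364), (28, 364)]:
    df["avg_{0}d_over_{1}d".format(num, den)] = (
        df["avg_{0}d".format(num)] / df["avg_{0}d".format(den)]
    )
    df["std_{0}d_over_{1}d".format(num, den)] = (
        df["std_{0}d".format(num)] / df["std_{0}d".format(den)]
    )

df["year"] = df.index.year
```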

3. Model Development

Linear regression is used as the initial model for this exploration. 
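
As a minimal sketch, continuing with the feature DataFrame built above: the original computation was tried on Spark, but scikit-learn keeps the example short here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Drop rows with incomplete rolling features and keep a chronological split so
# the test period follows the training period.
data = df.dropna()
feature_cols = [c for c in data.columns if c != "price"]
split = int(len(data) * 0.8)

X_train, y_train = data[feature_cols].iloc[:split], data["price"].iloc[:split]
X_test, y_test = data[feature_cols].iloc[split:], data["price"].iloc[split:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Mean absolute percentage error (MAPE), the metric reported in the next section.
mape = np.mean(np.abs((y_test - pred) / y_test)) * 100
print("MAPE: {0:.1f}%".format(mape))
```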

4. Result Presentation and Analysis 

The mean absolute percentage error (MAPE) is used as the performance metric. The MAPE is currently around 53%, which is high. The solution can be improved in the following respects.
  • Create more effective input variables, such as electricity load. The electricity load data from the New York ISO is provided at 5-minute intervals. Those data have been retrieved but are still being processed, for example by imputing missing values and aggregating to the hour level. 
  • Use other models, such as neural networks. 
The code developed for this project can be found here. The computation was tried on a Spark system, and special attention was paid to feature engineering. There is still a lot to do in this project to improve the forecasting accuracy. I will try to continue this topic if I get enough time and energy. 

Wednesday, February 3, 2016

Multi-tenant Management for Database Service on the Cloud Computing Platform

In this post, the concept of multi-tenancy of database service on a cloud computing platform is discussed, along with its benefits, technical challenges, and a solution architecture. A car rental example is given to help people understand it in a more general way. I wrote this post because I am currently working on a project to implement parallel database service on a cloud computing platform at an extremely low cost through careful multi-tenancy management, and I have realized that what I have learned from this project is not only knowledge of databases and big data systems but also an application of management science. 

Multi-tenancy is one of the key problems faced by cloud service providers. For database service on a cloud computing platform, multi-tenancy means that more than one tenant can be deployed on a single database server. If it is managed well, a lot of cost and energy can be saved. But we cannot randomly deploy tenants together, because we have to meet a more important objective than cost saving: satisfying the service level every tenant requests. A good solution architecture must be able to handle that challenge.

To help people who are unfamiliar with the concept of multi-tenancy or database service on the cloud understand what it is, along with its benefits, consider the following real-world example. The cloud computing platform can be thought of as a car rental company, and the tenants on the cloud computing platform can be thought of as the company's customers. The only difference between them is the service: the cloud computing platform provides computing services such as database service, while a car rental company provides a car rental service. Let's imagine that the company has two customers. If the rental schedules of these two customers overlap, the company has to assign one car to each customer, so two cars have to be available to serve them. But if the rental schedules of these two customers do not overlap, the company can assign the same car to both of them, so only one car needs to be available. The rental company has to buy enough cars to satisfy all of its customers; if customers with non-overlapping rental schedules are assigned to a single car, the total number of cars the company needs to buy is reduced. This is a common strategy for a rental company to save money, so it can also be a very good strategy for the cloud computing platform.

But this strategy is challenging on a cloud computing platform, because when a tenant uses the service is not scheduled in advance, and a tenant's service level agreement may not be satisfied in every time period if it is deployed on a database server with other tenants. If we don't know when a tenant uses the service, how can we know which tenants should be put together? So in order to realize multi-tenancy, the first step is to roughly know or predict tenants' behavior, such as when they use the service, based on their usage records.

In 2013, Prof. Lo's group came up with a system architecture as a solution for multi-tenancy of parallel-database-as-a-service on the cloud computing platform, which can be found in the paper Parallel Analytics as a Service. The solution can be understood through the following simplified car rental scenario.

Let's assume that the car rental company serves customers who rent cars frequently and whose schedules are not known in advance. When a customer comes in, there must be a car available for him or her. As discussed earlier, we first want to know roughly when a customer rents a car, based on his or her records, in order to find which customers can be assigned the same car in different time periods, which in turn reduces the total number of cars the company needs to own. But how? The solution in the paper suggests that when a new customer comes in, the company assigns a car to that customer, and only that customer can use that car for a certain period, say one month. So during this month, whenever the customer wants to use the car, a car is always available, and the time periods when the customer uses the car are recorded. As time goes on, we collect all customers' records, and based on those records we can predict when each customer rents a car and then decide, through some analysis, which customers can share the same car without schedule conflicts.

Back to the cloud computing platform, the general logic of the solution is the same. The proposed system architecture has four components: Tenant Activity Monitor, Deployment Advisor, Deployment Master, and Query Router. I am mainly in charge of implementing the first three components, which are almost done, and my teammate has finished the implementation of the Query Router. Hopefully this will benefit more people as an open source project pretty soon. Compared with the car rental service, there are many more technical details involved in database service on the cloud computing platform. Take this as an example: if a customer of the car rental company is assigned a different car, he or she just goes to get it, but on the cloud computing platform, if a tenant is assigned to another database server, all of its data has to be migrated there. But this is not the main point of this post.

Considering the fact that the active tenant ratio is very low (e.g., 10% in IBM's database-as-a-service), multi-tenancy of parallel database service on the cloud is very attractive. For any kind of service provider, whether a cloud service provider or a car rental company, resources can be utilized more efficiently and costs can be reduced a lot through careful management with analytical methods. In July 2015, I took the job of implementing parallel database service on OpenStack as a research associate with Prof. Lo because of his group's reputation in databases and big data systems, which I wanted to learn more about in order to realize my career goal. Now, besides knowledge of databases and big data systems, this project also provides a vision of the application of resource management, which helps me further understand the spirit of management science.