In the past three weeks, I have worked on developing a RESTful web service and integrating it into the OpenStack dashboard. In this post, I will share what I learned along the way, including the tools (the Flask and Flask-RESTful frameworks) that helped me finish this task.
Nowadays many people benefit from the convenience of cloud computing services on OpenStack, Amazon Web Services, and many other cloud providers. Some may wonder how they get a service like a virtual machine just by clicking a few buttons in the browser. This relies on RESTful web services.
In a RESTful web service, everything can be categorized as either a resource or an operation on a resource. A resource is represented by a URI, and there are four kinds of operations on a resource, namely GET, POST, PUT, and DELETE. The operations are carried out through HTTP requests. In a RESTful web service, the response to an HTTP request is serialized data in JSON or XML format, which makes the transfer of information between machines efficient.
To learn more about RESTful web services, many online tutorials can be consulted. When I started to study this concept, I read lots of posts, and I found the following mapping helpful. For people with object-oriented (OO) programming experience, the resource-oriented architecture of a RESTful web service may be understood in the OO paradigm: in OO, there are only objects and methods on objects, where a method on one object may be carried out by another object. But note that the working mechanisms of RESTful web services and OO are very different. The former operates through HTTP requests, while the latter is just a programming paradigm.
Because web application and service development involves much repetitive work, many frameworks have been created to do that repetitive work automatically, so that developers can focus on their own logic. The frameworks Flask and Flask-RESTful helped me develop the RESTful web service. I chose these two frameworks because, after exploring other Python web frameworks like Django and Django REST, I found them fairly easy to learn and use, and suitable for my case. Another difference between Flask and Django is that Flask lets the developer choose their own object relational mapper (ORM), while Django provides a default ORM. At the moment, I feel more comfortable using SQLAlchemy as the ORM in development. Their documentation provides example projects to help get started. Note that, as mentioned in the Flask-RESTful documentation, the Python package requests offers an easy way to test whether the RESTful web service and its URIs work as expected.
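To give a feel for how resources and HTTP methods fit together, here is a minimal Flask sketch (not my project's actual code; the "machine" resource and its fields are made up for illustration):

```python
# Minimal Flask sketch of a RESTful resource (hypothetical "machine" resource).
# Each URI identifies a resource; the HTTP verb selects the operation on it.
from flask import Flask, jsonify, request

app = Flask(__name__)
machines = {1: {"name": "vm-1", "status": "running"}}  # toy in-memory store

@app.route("/machines/<int:mid>", methods=["GET"])
def get_machine(mid):
    # GET retrieves the JSON representation of one resource
    return jsonify(machines[mid])

@app.route("/machines", methods=["POST"])
def create_machine():
    # POST creates a new resource from the JSON body of the request
    mid = max(machines) + 1
    machines[mid] = request.get_json()
    return jsonify({"id": mid}), 201
```

The same pattern extends to PUT (update) and DELETE (remove) handlers on the `/machines/<int:mid>` URI.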
After the RESTful web service is developed and ready for use, a user-friendly front-end dashboard is also desired, not only by users but also by administrators. Because the RESTful web service in my case is related to OpenStack, I customized the OpenStack dashboard to host it, considering that system administrators and users always want to do everything in a single dashboard. To figure out how to integrate an external web service into the OpenStack dashboard, I mainly referred to two posts, from Keith and the OpenStack tutorial respectively, which answered most of my questions.
For people who want to develop their own front-end dashboard, knowledge of HTML, CSS, and JavaScript is required. HTML is in charge of the content of a webpage, CSS is in charge of its style, and JavaScript makes it more interactive. I don't have much experience to share on this front, but I believe that as long as we know what we want to achieve, we will find the bridge to it.
Hi everyone. This blog records my "a-ha!" moments along the way of solving problems in data science, operations research, programming, databases, full stack web development, and cloud computing systems and services (particularly database services). I hope you will find some insights here.
Sunday, March 27, 2016
Saturday, February 27, 2016
Automating system administrative tasks of Vertica cluster on OpenStack with Python
In this post, I would like to share some experiences on automating system administration tasks with Python. This is the first time I have done such tasks. One of the big lessons I have learned is how to break a complex task down into a sequence of smaller tasks.
Driven by project needs, I have recently been working on some system administration tasks. One of them is to automate the following process with a Python program.
- Create a Vertica cluster on OpenStack Trove.
- Migrate certain users and their tables from the cluster they currently sit on to this newly created cluster. This task can be broken down into the following sequence of subtasks.
- Connect to the current Vertica cluster and export the objects of those users into an SQL file on the machine where the Python program is executed.
- Transfer the SQL file to the target Vertica cluster.
- Connect to the target Vertica cluster with the dbadmin credentials to create the users and grant them database privileges.
- Connect to the target Vertica cluster with each user's credentials to create their objects (schemas and tables) from the SQL file.
- Copy table data from the current Vertica cluster to the target Vertica cluster.
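To make the breakdown concrete, the whole process can be sketched as a small pipeline of stub functions (all names here are hypothetical, not my actual code); each stub stands for one step implemented with the packages discussed below:

```python
# Hypothetical sketch of the migration pipeline; each step is stubbed to
# record its action so the overall control flow is visible.
steps_run = []

def export_objects(source, users):
    steps_run.append(("export", source, tuple(users)))   # dump DDL locally
    return "objects.sql"

def transfer_file(sql_file, target):
    steps_run.append(("transfer", sql_file, target))     # scp to the target

def create_users_and_grant(target, users):
    steps_run.append(("create_users", target))           # as dbadmin

def create_objects(target, user, sql_file):
    steps_run.append(("create_objects", user))           # per-user DDL replay

def copy_table_data(source, target, users):
    steps_run.append(("copy_data", source, target))      # COPY the table data

def migrate_users(users, source, target):
    sql_file = export_objects(source, users)
    transfer_file(sql_file, target)
    create_users_and_grant(target, users)
    for user in users:
        create_objects(target, user, sql_file)
    copy_table_data(source, target, users)

migrate_users(["alice", "bob"], "cluster-a", "cluster-b")
```

Breaking the task into such single-purpose steps made each one easy to test and retry on its own.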
The reason we are interested in launching a Vertica cluster is that Vertica is a massively parallel processing database (MPPDB). MPPDB-as-a-service on the cloud would be very attractive to companies that perform analytical tasks on huge amounts of data.
The following Python packages have been used to implement the above tasks.
- troveclient: Call the Trove Python API.
- vertica-python: Interact with the Vertica server on the cluster.
- subprocess: Execute a command on the local machine.
- paramiko: Execute a command on a remote machine (i.e. the virtual machines that host the Vertica server on OpenStack).
Let's see how things work out with examples.
1. As mentioned in this article, OpenStack is a popular open source cloud operating system for deploying infrastructure as a service, and cloud-based tasks can be automated by working with the OpenStack Python APIs. The author provides examples of the Keystone API, Nova API, and so on, but doesn't cover the Trove API.
An example of the Trove API can be found in the screenshot below.
2. Most database management systems adopt the client-server paradigm, which means we can interact with the server through a client. The Python client package for Vertica is vertica-python (imported as vertica_python).
The screenshot below provides an example of how to use vertica_python to export objects.
3. As mentioned in the section above, one small task is to transfer the SQL file from the local machine to a remote machine (the target Vertica cluster). subprocess.check_call() is used for this.
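As a hedged sketch of this step (the host, user, and paths here are made up, not my actual code):

```python
# Sketch of the file-transfer step. The argv list is built in a separate
# helper so it can be inspected without actually running scp.
import subprocess

def scp_command(local_path, user, host, remote_path):
    """Build the scp argv for copying a local file to a remote machine."""
    return ["scp", local_path, "{}@{}:{}".format(user, host, remote_path)]

def transfer_sql_file(local_path, user, host, remote_path):
    # check_call raises subprocess.CalledProcessError on a non-zero exit code,
    # so a failed transfer stops the pipeline instead of failing silently.
    subprocess.check_call(scp_command(local_path, user, host, remote_path))
```

For example, `transfer_sql_file("objects.sql", "dbadmin", "10.0.0.5", "/tmp/objects.sql")` would run `scp objects.sql dbadmin@10.0.0.5:/tmp/objects.sql`.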
4. Another small task is to execute a command on the target Vertica cluster to create a user's objects from the SQL file. So how can we execute a command on a remote machine?
Below is the code screenshot showing how to use paramiko to do that. Note that in this case, SSH login to the remote machine is set up via a key pair file, so the parameter "KEY_FILE_PATH_LOCAL" is needed. If SSH login is set up via password instead, refer to the paramiko documentation and make the corresponding changes.
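As a rough sketch of what that code does (the command string, user name, and paths here are hypothetical; key-pair SSH login is assumed, matching the KEY_FILE_PATH_LOCAL parameter mentioned above):

```python
def vsql_create_objects_cmd(user, password, sql_path):
    """Remote command to replay a user's DDL from the SQL file (hypothetical flags/paths)."""
    return "vsql -U {} -w {} -f {}".format(user, password, sql_path)

def run_remote(host, key_file_path_local, command):
    # Sketch of remote execution with paramiko, assuming key-pair SSH login.
    import paramiko  # imported here so the helper above has no hard dependency
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="dbadmin", key_filename=key_file_path_local)
    stdin, stdout, stderr = client.exec_command(command)
    output = stdout.read().decode()
    client.close()
    return output
```

With password login instead, `client.connect(host, username=..., password=...)` replaces the `key_filename` argument.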
This task requires knowledge of the operating system, the database system (i.e. Vertica), and Python. I am glad that I made it, since growth always starts with small steps. There is still room to optimize the code, which is what I will work on next.
Sunday, February 21, 2016
An Initial Exploration of Electricity Price Forecasting
One month ago, I decided to do some exploration of the electricity price forecasting problem, to gain some knowledge of the electricity market and to sharpen my skills in data management and modeling. This post describes what I have done and achieved in this project, from problem definition, literature review, data collection, data exploration, and feature creation through model implementation, model evaluation, and result discussion. Feature engineering and Spark are the two skills I aimed to gain and improve in this project.
The price of a commodity in a market influences the behavior of all market participants, from suppliers to consumers. So knowledge about future prices plays a decisive role in selling and buying decisions, helping suppliers make profits and consumers save costs.
Electricity is the commodity in the electricity market. In deregulated electricity markets, generating companies (GENCOs) submit production bids one day ahead. When GENCOs decide on the bids, both the electricity load and the price for the coming day are unknown, so those decisions rely on forecasts of electricity load and price. Electricity load forecasting has reached an advanced stage in both industry and academia, with low enough prediction error, while electricity price forecasting is not as mature in terms of tools and algorithms. That is because the components of the electricity price are more complicated than those of the load.
1. Literature Review
"If I have been able to see further, it was only because I stood on the shoulder of giants." -- Newton
I mainly referred to the review paper "Electricity Price Forecasting in Deregulated Markets: A Review and Evaluation", which summarizes both price-influencing factors and forecasting models.
2. Exploratory Data Analysis
The location-based marginal price (LBMP) in the day-ahead market, provided by the New York Independent System Operator (NYISO), is used. Because of computational resource limitations, this project only forecasts the price of the "WEST" zone for the coming day.
"Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." -- Pedro Domingos in "A Few Things to Know about Machine Learning"
"Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." -- Pedro Domingos in "A Few Things to Know about Machine Learning"
Based on the scatter plots and correlation coefficients, the following variables are used as model inputs.
- Marginal cost losses
- The square of marginal cost losses
- Marginal cost congestion
- The square of marginal cost congestion
- The average price in the past 1 day
- The average price in the past 7 days
- The average price in the past 28 days
- The average price in the past 364 days
- The standard deviation of price in the past 1 day
- The standard deviation of price in the past 7 days
- The standard deviation of price in the past 28 days
- The standard deviation of price in the past 364 days
- The ratio of average price in the past 1 day over average price in the past 7 days
- The ratio of average price in the past 7 days over average price in the past 364 days
- The ratio of average price in the past 28 days over average price in the past 364 days
- The ratio of standard deviation in the past 1 day over standard deviation in the past 7 days
- The ratio of standard deviation in the past 7 days over standard deviation in the past 364 days
- The ratio of standard deviation in the past 28 days over standard deviation in the past 364 days
- The year
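As an illustration, a few of the rolling features above can be built with pandas like this (toy numbers; the real series is the hourly LBMP and the windows are measured in days):

```python
import pandas as pd

# Sketch of building a few of the rolling features listed above
# (toy price series; in the project the windows span days of hourly data).
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0, 12.0, 15.0, 16.0])

feats = pd.DataFrame({"price": prices})
feats["avg_7"] = prices.rolling(7).mean().shift(1)  # average over the past 7 points
feats["std_7"] = prices.rolling(7).std().shift(1)   # std dev over the past 7 points
feats["avg_1"] = prices.shift(1)                    # "past 1" is just the lag
feats["ratio_1_over_7"] = feats["avg_1"] / feats["avg_7"]
```

The `.shift(1)` keeps each feature strictly in the past, so no information from the day being predicted leaks into its own inputs.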
3. Model Development
Linear regression is used as the initial model for exploration.
4. Result Presentation and Analysis
The mean absolute percentage error (MAPE) is used as the performance measure. The MAPE is currently around 53%, which is high. The solution can be improved in the following respects.
- Create more effective input variables, such as electricity load. The electricity load data from the New York ISO is provided at 5-minute intervals. That data has been retrieved but is still being processed, e.g. imputing missing values and aggregating it to the hour level.
- Use other models, such as neural networks.
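For reference, the MAPE used above can be computed as follows (a small sketch, not the project code):

```python
# Sketch of the MAPE calculation used as the performance measure above.
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

mape([100.0, 200.0], [110.0, 180.0])  # each error is 10%, so MAPE is about 10.0
```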
The code developed for this project can be found here. The computation was tried on a Spark system, and special attention was paid to feature engineering in this project. There is still a lot to do to improve the forecasting accuracy. I will try to continue this topic if I get enough time and energy.
Wednesday, February 3, 2016
Multi-tenant Management for Database Service on the Cloud Computing Platform
In this post, the concept of multi-tenancy for database services on a cloud computing platform is discussed, along with its benefits, technical challenges, and a solution architecture. A car rental example is given to help explain it in more general terms. The reason I wrote this post is that I am currently working on a project to implement a parallel database service on a cloud computing platform at an extremely low cost through careful multi-tenancy management. And I realize that what I have learned from this project is not only knowledge of database and big data systems but also the application of management science.
Multi-tenancy is one of the key problems faced by cloud service providers. For database services on a cloud computing platform, multi-tenancy means that more than one tenant can be deployed on a single database server. If it is managed well, a lot of cost and energy can be saved. But we cannot deploy tenants together at random, because we have to meet an objective more important than cost saving: satisfying the service level that every tenant requests. A good solution architecture must be able to handle that challenge.
To help people unfamiliar with multi-tenancy or database services on the cloud understand what it is and what its benefits are, consider the following real-world example. The cloud computing platform can be thought of as a car rental company, and the tenants on the platform as the company's customers. The only difference between them is the service: the cloud computing platform provides computing services such as database services, while a car rental company provides a car rental service. Let's imagine the company has two customers. If their rental schedules overlap, the company has to assign one car to each customer, so two cars must be available to serve them. But if their rental schedules are different, the company can assign the same car to both, and only one car needs to be available. The rental company has to buy enough cars to satisfy all customers, so if customers with different rental schedules are assigned to a single car, the total number of cars the company needs to buy is reduced. This is a common money-saving strategy for a rental company, and it can also be a very good strategy for a cloud computing platform.
But this strategy is challenging on a cloud computing platform, because when a tenant uses the service is not scheduled in advance, and a tenant's service level agreement may not be satisfied in every time period if it is deployed on a database server with other tenants. If we don't know when a tenant uses the service, how can we know which tenants should be put together? So in order to realize multi-tenancy, the first step is to roughly know or predict tenants' behavior, such as when they use the service, based on their records.
Prof. Lo's group came up with a system architecture as a solution for multi-tenancy of parallel database as a service on a cloud computing platform in 2013, which can be found in the paper Parallel Analytics as A Service. The solution can be understood in the following simplified car rental scenario.
Let's assume the car rental company serves customers who rent cars frequently and whose schedules are not known in advance. When a customer comes in, there must be a car available for him or her. As discussed earlier, we first want to roughly know when a customer rents a car, based on his or her records, in order to find which customers can be assigned to the same car in different time periods, which in turn reduces the total number of cars the company needs to own. But how? The solution in the paper suggests that when a new customer comes in, the company assigns a car to that customer, and only that customer can use that car for a certain period, say one month. So during this month, whenever the customer wants to use the car, a car is always available, and the time periods when the customer uses the car are recorded. As time goes on, we collect all customers' records, and based on them we can predict when a customer rents a car and then decide, through some analysis, which customers can share the same car without schedule conflicts.
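The co-location decision in this analogy boils down to checking whether two customers' recorded usage periods ever overlap. A toy sketch (the usage records here are made up):

```python
# Toy sketch of the sharing decision: two tenants/customers can share one
# server/car only if their recorded usage periods never overlap.
def overlaps(period_a, period_b):
    """True if two (start, end) usage periods overlap."""
    return period_a[0] < period_b[1] and period_b[0] < period_a[1]

def can_share(usage_a, usage_b):
    """True if no period of one user overlaps any period of the other."""
    return not any(overlaps(a, b) for a in usage_a for b in usage_b)

# Hypothetical usage records as (start_hour, end_hour) pairs
alice = [(9, 12), (14, 16)]
bob = [(12, 14), (18, 20)]
can_share(alice, bob)  # True: their periods never overlap
```

The real system of course works with predicted usage rather than exact schedules, and must also respect each tenant's service level agreement.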
Back in the cloud computing case, the general logic of the solution is the same. In the proposed system architecture, there are four components: the Tenant Activity Monitor, Deployment Advisor, Deployment Master, and Query Router. I am mainly in charge of implementing the first three components, which are almost done, and my teammate has finished the implementation of the Query Router. Hopefully this will benefit more people as an open source project soon. Compared with the car rental service, many more technical details are involved in a database service on a cloud computing platform. Take this as an example: if a customer of the car rental company is assigned a different car, he or she just goes to get it; but on the cloud computing platform, if a tenant is assigned to another database server, all of its data has to be migrated there. But this is not the main point of this post.
Considering that the active tenant ratio is very low (e.g. 10% in IBM's database-as-a-service), multi-tenancy for parallel database services on the cloud is very attractive. For any kind of service provider, whether a cloud service provider or a car rental company, resources can be utilized more efficiently and costs can be reduced considerably through careful management with analytical methods. In July 2015, I took the job of implementing a parallel database service on OpenStack as a research associate with Prof. Lo because of his group's reputation in database and big data systems, which I wanted to learn more about in order to realize my career goal. Now, besides knowledge of database and big data systems, this project also provides a vision of the application of resource management, which helps me further understand the spirit of management science.
Sunday, December 27, 2015
The First Python Project in Data Science: Stock Price Prediction
In this post, I will explain what I have done in my first Python data science project, stock price prediction, combined with the code. I started to learn how to use Python for data analysis in my after-work hours at the beginning of December, so this post will also serve as a summary of what I have learned in the past three weeks. I hope it gives readers some insight into the whole process of getting a data science project done with Python from beginning to end. I want to thank Vik from Dataquest for the comments he provided, from which I have really benefited.
My code for this project can be found here on GitHub. It can be customized by changing the settings in settings.py and automatically performs the data acquisition and prediction tasks in this project. It can also be easily modified to meet more users' needs.
1. Get and store data.
There are basically three kinds of data sources.
- Files, such as Excel, CSV, and text files.
- Databases, such as SQLite, MySQL, and MongoDB.
- The web.
The historical data used in this project is web data from Yahoo Finance Historical Prices. There are basically two ways to retrieve web data.
- Web API, such as Twitter API, Facebook API, and OpenNotify API.
- Web Scraping.
I chose the web scraping method to retrieve the historical data, since I had just learned it and wanted to practice. In the code, the program data_acquisition.py scrapes the historical data and stores it in either a CSV file or a MySQL database. The function get_query_input() gets the date range of the historical data that the user wants to retrieve (e.g. from 2015-01-01 to 2015-12-25), forms the URL with parameters extracted from the user input, and returns that URL. With that URL, we can then get the data by sending an HTTP request and scraping the response, which is what the function scrape_page_data(pageUrl) does. aggregate_data_to_mysql(total_table, pageUrl, output_table, database_connection) stores the scraped data in a MySQL database, while aggregate_data_to_csv(total_table, pageUrl, output_file) stores it in a CSV file. Notice that these two functions are recursive. That is because all the requested historical data may not be on the same webpage, and if not, we need to go to the next page's URL and keep scraping. So the question is how to scrape the data on a webpage, go to the next page, and continue this process until all the requested historical data is retrieved.
To scrape data from a webpage, we first take a look at its HTML source code. In Safari, right-click the webpage and select "Show Page Source". This requires a little knowledge of HTML, but it can be picked up quickly. By investigating the HTML source code of the stock price page, we find that the data we want is in the table identified by "yfnc_datamodoutline1". CSS selectors make it easy to select the rows in that table. And the next page's URL can be obtained from the href attribute of the <a rel="next"> element, where the <a> tag indicates a link in HTML.
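As a sketch of this idea, here is how BeautifulSoup's CSS selectors pick out the table rows and the next-page link from a stripped-down stand-in for such a page (the real page's markup is more complex, and the values below are made up):

```python
from bs4 import BeautifulSoup

# Stripped-down stand-in for the historical-prices page markup; shows
# selecting table rows via a CSS selector and grabbing the next-page link.
html = """
<table class="yfnc_datamodoutline1">
  <tr><td>2015-12-24</td><td>108.00</td></tr>
  <tr><td>2015-12-23</td><td>107.50</td></tr>
</table>
<a rel="next" href="/q/hp?s=AAPL&z=66&y=66">Next</a>
"""
soup = BeautifulSoup(html, "html.parser")
rows = [[td.get_text() for td in tr.select("td")]
        for tr in soup.select("table.yfnc_datamodoutline1 tr")]
next_url = soup.find("a", rel="next")["href"]
```

In the real scraper, `next_url` is what the recursive call follows until no next-page link is found.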
By the time I wrote this post, I had found the Yahoo! Finance APIs. The next time I need the data, I will play with them to see how they make things easier. I guess when working with web data, the first thing to check is whether the website provides an API.
2. Explore data.
The purpose of data exploration may fall into one of the categories below.
- Missing values and outliers
- The pattern of the target variable
- Potential predictors
- The relationship between the target variable and potential predictors
- The distribution of variables
Data visualization and statistics (e.g. correlation) are good ways to explore data. One specific way of checking for missing values can be found here. It turns out there are no missing values in the historical data. For time series data, we set the date as the index and sort the data in ascending order, which can be done with the functions pandas.DataFrame.set_index() and pandas.DataFrame.sort_index().
The objective of this project is to predict the stock price. Take the "Close" price as an example. To predict it, we are interested in its own pattern and its relationship with other factors. One question is what factors may influence our target variable (the "Close" price). We can get some insight into those driving factors by studying the problem more deeply and doing some research on the relevant literature and available models. Frankly speaking, I knew almost nothing about the stock market before I started this project, so I took the following indicators of the "Close" price from the project information I received from Dataquest.
- The average price from the past five days.
- The average price for the past month.
- The average price for the past year.
- The ratio between the average price for the past five days, and the average price for the past year.
- The standard deviation of the price over the past five days.
- The standard deviation of the price over the past year.
- The ratio between the standard deviation for the past five days, and the standard deviation for the past year.
- The average volume over the past five days.
- The average volume over the past year.
- The ratio between the average volume for the past five days, and the average volume for the past year.
- The standard deviation of the average volume over the past five days.
- The standard deviation of the average volume over the past year.
- The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
- The year component of the date.
In pandas, there are functions to compute moving (rolling) statistics, such as rolling_mean and rolling_std. But we need to shift the column of rolling statistics forward by one day, for the following reason: we want the average price over the past five days for 2015-12-12 to be the mean of the prices from 2015-12-07 to 2015-12-11, but the value returned by rolling_mean for 2015-12-12 is the mean of the prices from 2015-12-08 to 2015-12-12. So the five-day average for 2015-12-12 is actually the value returned by rolling_mean for 2015-12-11.
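The shift-by-one logic can be illustrated with a toy series. (Note that rolling_mean/rolling_std are the old pandas API; current pandas spells this Series.rolling(...).mean().)

```python
import pandas as pd

# Toy illustration of the shift-by-one alignment described above.
idx = pd.date_range("2015-12-07", periods=6)          # 2015-12-07 .. 2015-12-12
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

raw = close.rolling(5).mean()   # value on 12-12 covers 12-08..12-12 (includes "today")
day_5 = raw.shift(1)            # value on 12-12 now covers 12-07..12-11 (past only)
```

Without the shift, the feature for a given day would include that day's own price, leaking the value we are trying to predict.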
3. Clean data.
When computing some of the indicators, a year of historical data is required, so those indicators are missing for the earliest year, indicated as NaN in Python. We simply remove the rows with NaN values.
This step and the following steps are done in the program prediction.py. First, the user's prediction requests are retrieved by get_prediction_req(data_storage_method). Then the historical data is read and loaded into a DataFrame, either from the MySQL database with read_data_from_mysql(database_connection, historical_data_table) or from the CSV file with read_data_from_csv(historical_data_file).
4. Build the predictive model.
Some common predictive methods are regression, classification, decision trees, neural networks, and support vector machines. For this project, multiple linear regression and random forest are chosen as the initial methods; their interfaces are provided in the sklearn package.
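A sketch of fitting the two chosen models with sklearn, using synthetic stand-in data rather than the project's actual indicators:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic features/target standing in for the indicators above.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)  # e.g. price ratios, volume ratios, year
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Both estimators expose the same fit/predict interface.
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print(linear.predict(X_test).shape)  # (50,)
```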
5. Validate and select the model.
To validate the model, two things need to be done.
- Split the historical data into train set and test set.
- Choose an error performance measurement and calculate it.
The function predict(df, prediction_start_date, prediction_end_date, predict_tommorrow) performs the predictive tasks and calculates the error measure, after computing the indicators and cleaning the data.
The big question here is how to split the historical data into train and test sets for prediction. Suppose the user wants to predict the price from 2015-01-03 to 2015-12-27, and the earliest date in the historical data is 1950-01-03. The train and test sets are split using the backtesting technique. To predict the price for 2015-01-03, the train set is the historical data from 1950-01-03 to 2015-01-02, and the test set is 2015-01-03. To predict the price for 2015-01-04, the train set is the data from 1950-01-03 to 2015-01-03, and the test set is 2015-01-04. This process repeats until we have predicted values for every day from 2015-01-03 to 2015-12-27. In the code, this part is done by looping over the index set of the prediction period.
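The backtesting loop described above might be sketched like this; the column names and the noiseless toy data are assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Toy data: one feature, one target, daily index.
idx = pd.date_range("2015-01-01", periods=30)
df = pd.DataFrame({"feature": np.arange(30.0)}, index=idx)
df["price"] = 2.0 * df["feature"] + 1.0

features, target = ["feature"], "price"
prediction_dates = idx[20:]  # pretend these are the user's dates

predictions = []
for day in prediction_dates:
    # Train on everything strictly before the prediction date ...
    train = df[df.index < day]
    model = LinearRegression().fit(train[features], train[target])
    # ... and predict the single held-out day.
    predictions.append(model.predict(df.loc[[day], features])[0])

mae = mean_absolute_error(df.loc[prediction_dates, target], predictions)
print(round(mae, 6))  # ~0.0 on this noiseless toy data
```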
The mean absolute error (MAE) is picked as the error performance measure. The model with the smaller MAE will be chosen to predict the price for tomorrow. The MAEs and the predicted value for tomorrow are written to the file "predicted_value_for_tommorrow". The actual and predicted values for the test data are stored in either a MySQL database or a CSV file by the function write_prediction_to_mysql(df_prediction, database_connection, predicted_output_table) or write_prediction_to_csv(df_prediction, predicted_output_file), for further analysis that may give insights into where the model performs poorly and how to improve it.
6. Present the results.
Besides predicting the value for tomorrow, a simple experiment was done using the full years 2014 and 2015 as test sets, respectively. The MAE results are as follows.
| Model | 2014 | 2015 |
| Regression | 15.62 | 19.98 |
| Random Forest | 13.99 | 19.08 |
We can have two simple findings.
- The model performs differently in different years; it performs better for the stock market in 2014.
- Random forest seems to perform better than multiple linear regression based on their MAE.
The time series plots for 2014 and 2015 with actual and predicted values can be found below. As shown, the overall trend is captured well. One potential improvement for the regression model is that its predicted peaks lag slightly behind the actual ones. One potential improvement for the random forest model is that its predicted values are generally lower than the actual values.
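Plots like these can be produced along the following lines, assuming matplotlib is available; the series here are synthetic stand-ins, not the project's actual predictions:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in series for actual vs. predicted prices over a test year.
idx = pd.date_range("2015-01-03", periods=100)
actual = pd.Series(np.linspace(2000, 2100, 100), index=idx)
predicted = actual + np.random.RandomState(0).normal(scale=10, size=100)

fig, ax = plt.subplots()
ax.plot(idx, actual, label="actual")
ax.plot(idx, predicted, label="predicted")
ax.set_title("Actual vs. predicted price (2015)")
ax.legend()
fig.savefig("prediction_2015.png")
```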
I will tweak the algorithms further and present more findings later. After all, the main initial purpose of this project is to get myself familiar with the full lifecycle of a data science project and the Python packages/functions that are frequently used in one.
Friday, December 25, 2015
A Simple but Complete Guide for OpenStack Trove
This post illustrates the framework and some details that help learners get started with OpenStack and its database service Trove. I started to study OpenStack and its Trove project five months ago from scratch. If I can do it, you can also do it. I appreciate the guidance I have received in every way, especially my teammates whom I have worked closely with every weekday.
Trove is the database-as-a-service project on OpenStack, which helps users save the cost of database infrastructure and eliminates administrative tasks such as deployment, configuration, and backups. To understand its benefit, imagine that you need a database server for your business. Traditionally, you have to set up the infrastructure first, install the database packages, configure them, and maintain them, which costs a lot of money and time. With OpenStack Trove, you just open an account, execute some commands or click some tabs to launch a database server, and pay only for what you use.
Before we start this exciting journey, I would like to highlight the following documentation and reference.
- OpenStack Documentation, where you can find installation guide, admin user guide, end user guide, and command line reference. That will cover almost all tasks performed on OpenStack and its service components. Reading the documentation carefully can always help us avoid some naive mistakes and save us much time.
- OpenStack Trove, where you can find comprehensive details about Trove.
The following steps can help you deploy OpenStack Trove and then operate it from scratch.
1. Set up an OpenStack environment and add the identity service (Keystone), image service (Glance), compute service (Nova), dashboard (Horizon), networking service (Neutron or Nova-network), and block storage service (Cinder).
This can be done by following OpenStack Installation Guide, which can be found in OpenStack documentation. Be sure to choose the right version of the installation guide for your operating system (Ubuntu 14.04, Red Hat Enterprise Linux 7, CentOS 7, openSUSE 13.2, and SUSE Linux Enterprise Server 12) and the version of OpenStack (Juno, Kilo, and Liberty) you want to deploy.
2. Add database as service (Trove) on OpenStack.
This part is not covered in the official installation documentation. For Ubuntu users, follow the steps and commands provided in my earlier post. For users of other Linux distributions, it should still work to follow the steps in that post and adjust the commands accordingly.
3. Obtain or build a Trove guest image.
This is the first step to using Trove. There are currently three ways.
- Download a pre-built guest image from here. This method is DevStack based.
- Build a guest image using the OpenStack Trove tools (Disk Image Builder, redstack). I tried these two tools on DevStack and on my own OpenStack deployment, respectively. I got a working image on DevStack but failed to get one on my OpenStack. By a working image, I mean one that can be used to launch a Trove instance that reaches active status. Back when I tried, there was probably a bug related to their DevStack dependency; I am not sure whether they work well now. After struggling with the tools for a month, I came up with a customized way.
- Build a guest image in a customized way. The steps I used can be found in this post; I got a working image by performing them. That post also provides some insight into how the image works.
For more details about the first two ways, see this post.
4. Add the Trove guest image to the datastore.
The image we get from the above step is just a QCOW2 file. To tell Trove where it is and let Trove use it, we must add it to the Trove datastore by performing steps 2-6 in the database service chapter of the OpenStack Administration Guide.
5. Launch a Trove instance.
This can be done either through dashboard or Trove command line client. For the latter, refer to "trove create" command in OpenStack Command Line Reference.
6. Debug.
There are some common errors that OpenStack Trove users can encounter. To name a few:
- Status goes to "Error" shortly after the creation.
- Status is stuck at "Build" and goes to "ERROR" after reaching the timeout value.
- No host is assigned.
To figure out the reason, the starting point should always be the Trove logs: trove-api.log, trove-taskmanager.log, and trove-conductor.log, which by default are in the directory /var/log/trove on the Trove controller node, and also trove-guestagent.log, which by default is in the directory /var/log/trove/ on the guest instance.
With the insights the log files provide, Google and Ask OpenStack are good places to help pin down the errors.
Good luck and enjoy!
Sunday, December 20, 2015
Web Scraping and Data Analysis with Python
Recently I have been working on a project about predicting stock prices. Thanks to the guidance received from dataquest, I was excited to grasp how to use web scraping to collect data and how to use various Python packages to analyze data and build models. I wish I had known about these techniques back in 2013, when I was involved in a project that required me to collect and analyze a large amount of web data. If I had known they existed, I would definitely have learned them; programming the data collection and analysis tasks in Python would have made my work more efficient and accurate, since mistakes would have been less likely and easier to track down.
In the past one and a half months, I have read through the following three books, and I have found them very helpful in empowering me with techniques for automating data-related tasks in Python.
Just take a look at these books and take away what you need. You are welcome to check my code on GitHub for the stock price prediction project. The program produces correct results, but I am still working on improving it and keeping it updated.