Sunday, December 27, 2015

The First Python Project in Data Science: Stock Price Prediction

In this post, I explain what I did in my first Python data science project, stock price prediction, and walk through the code. I started learning how to use Python for data analysis in my after-work hours at the beginning of December, so this post also serves as a summary of what I have learned in the past three weeks. I hope it also gives readers some insight into how to complete a data science project in Python from beginning to end. I want to thank Vik from dataquest for his comments, which I have really benefited from.

My code for this project can be found here on GitHub. It can be customized by changing the settings in settings.py, and it automatically handles the data acquisition and prediction tasks in this project. It can also be easily modified to suit other users' needs.

1. Get and store data.

There are basically three kinds of data sources.
  • Files, such as Excel, CSV, and text.
  • Databases, such as SQLite, MySQL, and MongoDB.
  • Web.
The historical data used in this project is web data from Yahoo Finance Historical Prices. There are basically two ways to retrieve web data. 
  • Web API, such as Twitter API, Facebook API, and OpenNotify API. 
  • Web Scraping. 
I chose web scraping to retrieve the historical data, since I had just learned it and wanted to practice. In the code, the program data_acquisition.py scrapes the historical data and stores it in either a CSV file or a MySQL database. The function get_query_input() gets the date range of historical data that the user wants to retrieve (e.g. from 2015-01-01 to 2015-12-25), forms the URL with parameters extracted from the user's input, and returns that URL. With that URL, we can get the data by sending an HTTP request and scraping the response, which is what the function scrape_page_data(pageUrl) does. aggregate_data_to_mysql(total_table, pageUrl, output_table, database_connection) stores the scraped data in a MySQL database, while aggregate_data_to_csv(total_table, pageUrl, output_file) stores it in a CSV file.

Notice that these two functions are recursive. That is because the requested historical data may not all be on the same web page; if it is not, we need to follow the next-page URL and keep scraping. So the question is how to scrape the data on one page, then move to the next page, and continue until all the requested historical data has been retrieved.
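The recursive idea can be sketched roughly as follows. This is a simplified illustration, not the code from the repository: collect_all_rows and fake_fetch are hypothetical names, and the "pages" are a small in-memory dictionary with made-up prices standing in for real HTTP requests.

```python
def collect_all_rows(page_url, fetch_page, collected=None):
    """Recursively gather rows from page_url and every page linked after it.

    fetch_page(url) must return (rows_on_that_page, next_page_url_or_None).
    """
    if collected is None:
        collected = []
    rows, next_url = fetch_page(page_url)
    collected.extend(rows)
    if next_url is None:
        # Base case: no "next" link, so everything requested has been read.
        return collected
    # Recursive case: keep scraping from the next page.
    return collect_all_rows(next_url, fetch_page, collected)


# Made-up three-page "site" used instead of real web requests.
FAKE_PAGES = {
    "page1": ([("2015-12-24", 103.0), ("2015-12-23", 102.0)], "page2"),
    "page2": ([("2015-12-22", 101.0), ("2015-12-21", 100.0)], "page3"),
    "page3": ([("2015-12-18", 99.0)], None),
}


def fake_fetch(url):
    return FAKE_PAGES[url]


all_rows = collect_all_rows("page1", fake_fetch)
for day, close in all_rows:
    print(day, close)
```

Passing the fetch function in as an argument keeps the recursion separate from the scraping itself, so the same loop could write to a CSV file or a database.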

To scrape data from a web page, we first take a look at its HTML source. In Safari, right-click the page and select "Show Page Source". This requires a little knowledge of HTML, but it can be picked up quickly. By inspecting the HTML source of the stock price page, we find that the data we want is in the table named "yfnc_datamodoutline1". CSS selectors make it easy to select the rows in that table. The next-page URL can be obtained from the href attribute of <a rel="next" href=, where the <a tag indicates a link in HTML.
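Here is a minimal sketch of pulling the table rows and the next-page link out of a page. It uses Python's built-in html.parser instead of CSS selectors so that it runs without extra packages. It assumes that "yfnc_datamodoutline1" is the table's class attribute, and the HTML snippet, the prices, and the href are made up for illustration.

```python
from html.parser import HTMLParser


class PriceTableParser(HTMLParser):
    """Collect <td> text from one table and the href of <a rel="next">."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.in_table = False
        self.in_cell = False
        self.current_row = []
        self.rows = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and self.table_class in (attrs.get("class") or "").split():
            self.in_table = True
        elif tag == "tr" and self.in_table:
            self.current_row = []
        elif tag == "td" and self.in_table:
            self.in_cell = True
            self.current_row.append("")
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

    def handle_data(self, data):
        if self.in_cell:
            self.current_row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.in_table and self.current_row:
            # Header rows only contain <th> cells, so they stay empty and are skipped.
            self.rows.append(self.current_row)
            self.current_row = []
        elif tag == "table":
            self.in_table = False


# Made-up page fragment; a real page would come from an HTTP response.
SAMPLE_HTML = """
<table class="yfnc_datamodoutline1">
<tr><th>Date</th><th>Close</th></tr>
<tr><td>Dec 24, 2015</td><td>101.50</td></tr>
<tr><td>Dec 23, 2015</td><td>100.25</td></tr>
</table>
<a rel="next" href="/history?page=2">Next</a>
"""

parser = PriceTableParser("yfnc_datamodoutline1")
parser.feed(SAMPLE_HTML)
print(parser.rows)
print(parser.next_url)
```

This sketch does not handle tables nested inside the target table, which a real page may contain.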

While writing this post, I discovered the Yahoo! Finance APIs. Next time I need this data, I will try them to see whether they make things easier. I guess that when working with web data, the first thing to check is whether the website provides an API.

2. Explore data.

The purpose of data exploration may fall in one of the below categories. 
  • Missing values and outliers
  • The pattern of the target variable
  • Potential predictors
  • The relationship between the target variable and potential predictors
  • The distribution of variables
Data visualization and statistics (e.g. correlation) are good ways to explore data. One specific way of checking for missing values can be found here. It turns out there are no missing values in the historical data. For time series data, we set the date as the index and sort the data in ascending order, which can be done through the functions pandas.DataFrame.set_index() and pandas.DataFrame.sort_index().

The objective of this project is to predict the stock price. Take the "Close" price as an example. To predict it, we are interested in its own pattern and in its relationship with other factors. One question is which factors may influence our target variable (the "Close" price). We can get some insight into those driving factors by studying the problem more deeply and researching the relevant literature and existing models. Frankly speaking, I knew almost nothing about the stock market before I started working on this project, so I took the indicators of the "Close" price from the project information I received from dataquest.

  • The average price from the past five days.
  • The average price for the past month.
  • The average price for the past year.
  • The ratio between the average price for the past five days, and the average price for the past year.
  • The standard deviation of the price over the past five days.
  • The standard deviation of the price over the past year.
  • The ratio between the standard deviation for the past five days, and the standard deviation for the past year.
  • The average volume over the past five days.
  • The average volume over the past year.
  • The ratio between the average volume for the past five days, and the average volume for the past year.
  • The standard deviation of the average volume over the past five days.
  • The standard deviation of the average volume over the past year.
  • The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
  • The year component of the date.

In pandas, there are functions to compute moving (rolling) statistics, such as rolling_mean and rolling_std. But we need to shift the column of rolling statistics forward by one day, for the following reason. We want the average price from the past five days for 2015-12-12 to be the mean of the prices from 2015-12-07 to 2015-12-11, but the value rolling_mean returns for 2015-12-12 is the mean of the prices from 2015-12-08 to 2015-12-12. So the average price from the past five days for 2015-12-12 is actually the value rolling_mean returns for 2015-12-11.
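The effect of the shift can be shown with a small pure-Python sketch. The function past_mean is a hypothetical helper and the prices are made up. In pandas, the same result should come from chaining a five-day rolling mean with shift(1); in newer pandas versions the rolling mean is written as Series.rolling(5).mean() rather than rolling_mean.

```python
from statistics import mean


def past_mean(values, window):
    """Mean of the `window` values strictly before each position.

    Positions without enough history get None (the NaN rows in pandas).
    """
    result = []
    for i in range(len(values)):
        if i < window:
            result.append(None)
        else:
            # values[i - window:i] excludes values[i] itself, which is
            # exactly what shifting the rolling mean by one day achieves.
            result.append(mean(values[i - window:i]))
    return result


closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]  # made-up prices
avg_5 = past_mean(closes, 5)
print(avg_5)  # [None, None, None, None, None, 12.0, 13.0]
```

The first five entries have no complete history, which is why rows like these are dropped in the cleaning step.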

3. Clean data. 

When computing some indicators, a year of historical data is required. So those indicators for the earliest year will be missing, indicated as NaN in Python. We will just remove rows with NaN values.

This step and following steps are done in the program prediction.py. First, users' prediction requests will be retrieved by get_prediction_req(data_storage_method). Then historical data will be read and loaded to a DataFrame either from mysql database or csv file with the function read_data_from_mysql(database_connection, historical_data_table) or read_data_from_csv(historical_data_file).

4. Build the predictive model. 

Some common predictive methods are regression, decision trees, neural networks, and support vector machines. For this project, multiple linear regression and random forest are chosen as the initial methods; both are available in the sklearn package.

5. Validate and select the model. 

To validate the model, two things need to be done. 
  • Split the historical data into train set and test set. 
  • Choose an error performance measurement and calculate it. 
The function is predict(df, prediction_start_date, prediction_end_date, predict_tommorrow), which performs predictive tasks and calculates the error measurement after calculating the indicators and cleaning the data. 

The big question here is how to split the historical data into a train set and a test set for prediction. For example, suppose the user wants to predict the price from 2015-01-03 to 2015-12-27. The train and test sets are split using the backtesting technique. Let's say the earliest date in the historical data is 1950-01-03. To predict the price for 2015-01-03, the train set is the historical data from 1950-01-03 to 2015-01-02, and the test set is 2015-01-03. To predict the price for 2015-01-04, the train set is the historical data from 1950-01-03 to 2015-01-03, and the test set is 2015-01-04. We repeat this process until we have predicted values for every date from 2015-01-03 to 2015-12-27. In the code, this is done by looping over the index set of the prediction period.
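A rough sketch of this backtesting loop, under simplifying assumptions: walk_forward and five_day_average are hypothetical names, the prices are made up, and a plain five-day average stands in for the regression and random forest models so the example runs without sklearn.

```python
from statistics import mean


def walk_forward(history, start_date, end_date, fit_predict):
    """Predict each day in [start_date, end_date] from all earlier days.

    history is a list of (date, price) sorted by date in ascending order;
    dates are ISO strings, so plain string comparison orders them correctly.
    """
    results = []
    for i, (day, actual) in enumerate(history):
        if not (start_date <= day <= end_date):
            continue
        train = history[:i]  # everything strictly before the test day
        results.append((day, actual, fit_predict(train)))
    return results


def five_day_average(train):
    # Stand-in model: mean of the last five training prices.
    # It needs at least one training row before the first test day.
    return mean(price for _, price in train[-5:])


history = [
    ("2015-01-01", 10.0), ("2015-01-02", 11.0), ("2015-01-03", 12.0),
    ("2015-01-04", 13.0), ("2015-01-05", 14.0), ("2015-01-06", 15.0),
    ("2015-01-07", 16.0),
]
predictions = walk_forward(history, "2015-01-06", "2015-01-07", five_day_average)
for day, actual, predicted in predictions:
    print(day, actual, predicted)
```

The train set grows by one day on every iteration, so no prediction ever sees the price it is trying to predict.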

The mean absolute error (MAE) is used as the error performance measure. The model with the smaller MAE is chosen to predict the price for tomorrow. The MAEs and the predicted value for tomorrow are written to the file "predicted_value_for_tommorrow". The actual and predicted values for the test data are stored in either a MySQL database or a CSV file by the function write_prediction_to_mysql(df_prediction, database_connection, predicted_output_table) or write_prediction_to_csv(df_prediction, predicted_output_file) for further analysis, which may give some insight into where the model doesn't perform well and how to improve it.
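For reference, the MAE is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand and picks the model with the smaller error. The numbers are made up and the dictionary keys are illustrative labels only; sklearn also provides this measure as sklearn.metrics.mean_absolute_error.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all test points."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 102.0, 101.0, 105.0]  # made-up test prices
model_predictions = {
    "regression": [101.0, 101.0, 103.0, 104.0],
    "random_forest": [100.5, 102.5, 100.0, 104.0],
}

maes = {name: mean_absolute_error(actual, predicted)
        for name, predicted in model_predictions.items()}
best_model = min(maes, key=maes.get)  # smaller MAE wins

print(maes)        # {'regression': 1.25, 'random_forest': 0.75}
print(best_model)  # random_forest
```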

6. Present the results. 

Besides predicting the value for tomorrow, I ran a simple experiment using the full-year data for 2014 and for 2015 as the test set, respectively. The MAE results are as follows.


                 2014     2015
Regression       15.62    19.98
Random Forest    13.99    19.08

We can have two simple findings. 

- The models perform differently in different years. The MAE shows that both models perform better for the stock market in 2014.
- Random forest seems to perform better than multiple linear regression, based on their MAEs.

The time series plots for 2014 and 2015 with actual and predicted values can be found below. As shown, the trend is captured well on the whole. One potential improvement for the regression model is that its predicted peaks lag slightly behind the actual ones. One potential improvement for the random forest model is that its predicted values are generally lower than the actual values.

I will tweak the algorithms further and then present more findings. After all, the main purpose of this project was to familiarize myself with the full lifecycle of a data science project and with the Python packages and functions that are frequently used in one.


Friday, December 25, 2015

A Simple but Complete Guide for OpenStack Trove

This post lays out the framework and some details that help learners get started with OpenStack and its database service, Trove. I started studying OpenStack and its Trove project from scratch five months ago. If I can do it, you can do it too. I appreciate all the guidance I have received, especially from my teammates, with whom I have worked closely every weekday.

Trove is a database-as-a-service project on OpenStack, which can help users save on database infrastructure costs and eliminate administrative tasks such as deployment, configuration, and backups. To understand its benefit, imagine that you need a database server for your business. Traditionally, you have to set up the infrastructure first, install the database package, configure it, and maintain it, which costs a lot of money and time. With OpenStack Trove, all you do is open an account, execute some commands or click some tabs to launch a database server, and pay only for what you use.

Before we start this exciting journey, I would like to highlight the following documentation and reference. 
  • OpenStack Documentation, where you can find installation guide, admin user guide, end user guide, and command line reference. That will cover almost all tasks performed on OpenStack and its service components. Reading the documentation carefully can always help us avoid some naive mistakes and save us much time.
  • OpenStack Trove, where you can find comprehensive details about Trove. 
The following steps can help you deploy OpenStack Trove and then operate it from scratch. 

1. Set up an OpenStack environment and add the identity service (Keystone), image service (Glance), compute service (Nova), dashboard (Horizon), networking service (Neutron or Nova-network), and block storage service (Cinder). 

This can be done by following the OpenStack Installation Guide, which can be found in the OpenStack documentation. Be sure to choose the right version of the installation guide for your operating system (Ubuntu 14.04, Red Hat Enterprise Linux 7, CentOS 7, openSUSE 13.2, or SUSE Linux Enterprise Server 12) and for the version of OpenStack (Juno, Kilo, or Liberty) you want to deploy.

2. Add database as a service (Trove) on OpenStack.

This part is not covered in the official installation documentation. Ubuntu users can follow the steps and commands provided in my earlier post. For users of other Linux operating systems, I think it is still a good approach to follow the steps in that post and change the commands accordingly.

3. Obtain or build a Trove guest image. 

This is the first step in using Trove. There are currently three ways.

  • Download a pre-built guest image from here. This method is DevStack based.
  • Build a guest image using the OpenStack Trove tools (Disk Image Builder, redstack). I tried these two tools on DevStack and on my own OpenStack deployment. With them, I got a working image on DevStack but failed to get one on my OpenStack deployment. At the time I tried, there was probably a bug related to their DevStack dependency. By a working image, I mean one that can be used to launch a Trove instance successfully with active status. I am not sure whether the tools work well now. After struggling with them for a month, I came up with a customized approach.
  • Build a guest image using a customized approach. The approach I used can be found in this post. I got a working image by performing those steps. That post also provides some insight into how the image works.
For more details about the first two ways, see this post.

4. Add the Trove guest image to the datastore. 

The image we get from the step above is just a QCOW2 file. To tell Trove where it is and let Trove use it, we must add it to the Trove datastore by performing steps 2-6 in the database service chapter of the OpenStack Administration Guide.

5. Launch a Trove instance. 

This can be done either through the dashboard or through the Trove command line client. For the latter, refer to the "trove create" command in the OpenStack Command Line Reference.

6. Debug. 

There are some common errors that OpenStack Trove users can encounter. To name a few:
  • Status goes to "ERROR" shortly after creation.
  • Status is stuck at "BUILD" and goes to "ERROR" after reaching the timeout value.
  • No host is assigned.
To figure out the cause, the starting point should always be the Trove logs: trove-api.log, trove-taskmanager.log, and trove-conductor.log, which by default are in the directory /var/log/trove on the Trove controller node, and trove-guestagent.log, which by default is in the directory /var/log/trove/ on the guest.

With the insights the log files provide, Google and Ask OpenStack are good places to help pin down the errors.

Good luck and enjoy! 


Sunday, December 20, 2015

Web Scraping and Data Analysis with Python

Recently I have been working on a project about predicting stock prices. Thanks to the guidance I received from dataquest, I was excited to learn how to use web scraping to collect data and how to use various Python packages to analyze data and build models. I wish I had known about these techniques back in 2013, when I was involved in a project that required me to collect and analyze a large amount of web data. If I had known they existed, I would definitely have learned them. They would have made my work more efficient and accurate: programming the data collection and analysis tasks in Python makes mistakes less likely and makes it easier to track down anything that goes wrong.

Over the past month and a half, I have read through the following three books. I found them very helpful for learning techniques to automate data-related tasks in Python.


Take a look at these books and take away what you need. You are also welcome to check my code on GitHub for the stock price prediction project. The program produces the right results, but I am still working on improving it and keeping it updated.