Sunday, December 27, 2015

The First Python Project in Data Science: Stock Price Prediction

In this post, I will explain what I did in my first Python project in data science - stock price prediction - along with the code. I started to learn how to use Python for data analysis work in my after-work hours at the beginning of December, so this post will also serve as a summary of what I have learned in the past three weeks. I hope it also gives readers some insight into the whole process of getting a data science project done with Python from beginning to end. I want to thank Vik from dataquest for his comments, from which I have really benefited.

My code for this project can be found here on GitHub. It can be customized by changing the settings in settings.py, and it automatically handles the data acquisition and prediction tasks in this project. It can also be easily modified to suit more users' needs.

1. Get and store data.

There are basically three kinds of data sources.
  • Files, such as Excel, CSV, and text files.
  • Databases, such as SQLite, MySQL, and MongoDB.
  • The web.
The historical data used in this project is web data from Yahoo Finance Historical Prices. There are basically two ways to retrieve web data. 
  • Web APIs, such as the Twitter API, Facebook API, and OpenNotify API.
  • Web Scraping. 
I chose the web scraping method to retrieve the historical data, since I had just learned it and wanted to practice. In the code, the program data_acquisition.py scrapes the historical data and stores it in either a CSV file or a MySQL database. The function get_query_input() gets the date range of the historical data the user wants to retrieve (e.g., from 2015-01-01 to 2015-12-25), forms the URL with parameters extracted from the user's input, and returns that URL. With that URL, we can then get the data by sending an HTTP request and scraping the response, which is what the function scrape_page_data(pageUrl) does. aggregate_data_to_mysql(total_table, pageUrl, output_table, database_connection) stores the scraped data in the MySQL database, while aggregate_data_to_csv(total_table, pageUrl, output_file) stores it in a CSV file. Notice that these two functions are recursive. That is because all of the requested historical data may not be on the same web page; if not, we need to go to the next page URL and keep scraping. So the question is how we can scrape the data on one web page, then move to the next page, and continue until all of the requested historical data has been retrieved. A sketch of how the pieces fit together follows the next paragraph.

To scrape data from a web page, we first take a look at its HTML source code. In Safari, right-click the web page and select "Show Page Source". This requires a little knowledge of HTML, but it can be picked up quickly. By investigating the HTML source of the stock price page, we can find that the data we want is in the table named "yfnc_datamodoutline1". CSS selectors make it easy to select the rows in that table. The next page URL can be obtained from the href attribute of the <a rel="next"> element, where the <a> tag indicates a link in HTML.
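
Below is a minimal sketch of how these pieces might fit together with requests and BeautifulSoup. The function names follow the ones mentioned above, but the bodies are illustrative only, and the CSS selectors are assumptions that should be checked against the actual page source; the real implementation is in data_acquisition.py on GitHub.

    import csv

    import requests
    from bs4 import BeautifulSoup

    def scrape_page_data(page_url):
        """Return (rows, next_page_url) scraped from one page of historical prices."""
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, "html.parser")
        # Selector assumed from the 2015 page source; adjust it if the table is
        # identified by an id rather than a class.
        rows = [[cell.get_text(strip=True) for cell in tr.find_all("td")]
                for tr in soup.select("table.yfnc_datamodoutline1 tr")]
        next_link = soup.select_one('a[rel="next"]')
        next_url = next_link["href"] if next_link else None
        return rows, next_url

    def aggregate_data_to_csv(total_table, page_url, output_file):
        """Recursively collect rows page by page, then write them all to a CSV file."""
        rows, next_url = scrape_page_data(page_url)
        total_table.extend(rows)
        if next_url is not None:
            # Part of the requested history is on the next page; keep scraping.
            aggregate_data_to_csv(total_table, next_url, output_file)
        else:
            with open(output_file, "w", newline="") as f:
                csv.writer(f).writerows(total_table)

    # Usage (with a URL built by get_query_input()):
    # aggregate_data_to_csv([], url, "historical_data.csv")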

By the time I wrote this post, I had found the Yahoo! Finance APIs. Next time I need the data, I will play with them to see how they make things easier. I guess when working with web data, the first thing to check is whether the website provides an API.

2. Explore data.

The purpose of data exploration usually falls into one of the categories below.
  • Missing values and outliers
  • The pattern of the target variable
  • Potential predictors
  • The relationship between the target variable and potential predictors
  • The distribution of variables
Data visualization and statistics (e.g., correlation) are good ways to explore data. One specific way of checking for missing values can be found here. It turns out there are no missing values in the historical data. For time series data, we set the date as the index and sort the data in ascending order, which can be done with the functions pandas.DataFrame.set_index() and pandas.DataFrame.sort_index().
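
For example, a quick check in pandas might look like the sketch below; the file name is hypothetical, and the column names are assumed to match the headers of the Yahoo table.

    import pandas as pd

    # Hypothetical file name; the actual path is set in settings.py.
    df = pd.read_csv("historical_data.csv", parse_dates=["Date"])

    # Count missing values per column (all zeros for this data set).
    print(df.isnull().sum())

    # Index by date and sort in ascending order for time series work.
    df = df.set_index("Date").sort_index()
    print(df.head())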

The objective of this project is to predict the stock price. Take the "Close" price as an example. To predict it, we are interested in its own pattern and in its relationship with other factors. One question is what factors may influence our target variable (the "Close" price). We can get some insight into those driving factors by studying the problem more deeply and researching the relevant literature and available models. Frankly speaking, I knew almost nothing about the stock market before I started to work on this project, so I took the following indicators of the "Close" price from the project information received from dataquest.

  • The average price from the past five days.
  • The average price for the past month.
  • The average price for the past year.
  • The ratio between the average price for the past five days, and the average price for the past year.
  • The standard deviation of the price over the past five days.
  • The standard deviation of the price over the past year.
  • The ratio between the standard deviation for the past five days, and the standard deviation for the past year.
  • The average volume over the past five days.
  • The average volume over the past year.
  • The ratio between the average volume for the past five days, and the average volume for the past year.
  • The standard deviation of the average volume over the past five days.
  • The standard deviation of the average volume over the past year.
  • The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
  • The year component of the date.

In pandas, there are functions to compute moving (rolling) statistics, such as rolling_mean and rolling_std. But we need to shift the rolling-statistic column forward by one day. The reason is as follows: we want the five-day average price for 2015-12-12 to be the mean of the prices from 2015-12-07 to 2015-12-11, but the value rolling_mean returns for 2015-12-12 is the mean of the prices from 2015-12-08 to 2015-12-12. So the five-day average price for 2015-12-12 is actually the value rolling_mean returns for 2015-12-11.
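
Here is a minimal sketch of the five-day case on a toy price series, using the newer DataFrame.rolling interface (which replaced the standalone rolling_mean and rolling_std functions in later pandas versions):

    import pandas as pd

    # Toy close prices indexed by date; the real project uses the scraped data.
    dates = pd.date_range("2015-12-01", periods=12, freq="D")
    df = pd.DataFrame({"Close": [10.0, 10.5, 11.0, 10.8, 11.2, 11.5,
                                 11.3, 11.8, 12.0, 11.9, 12.3, 12.5]}, index=dates)

    # Rolling statistics include the current day in the window...
    rolling_mean_5 = df["Close"].rolling(window=5).mean()
    rolling_std_5 = df["Close"].rolling(window=5).std()

    # ...so shift forward by one day to get "the past five days, excluding today".
    df["avg_5"] = rolling_mean_5.shift(1)
    df["std_5"] = rolling_std_5.shift(1)
    print(df.tail())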

3. Clean data. 

Computing some of the indicators requires a year of historical data, so those indicators are missing for the earliest year, showing up as NaN in pandas. We simply remove the rows with NaN values.
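
As a tiny stand-alone illustration (the column names are made up), rows from the earliest year show NaN for the year-long indicators and can be removed with dropna():

    import numpy as np
    import pandas as pd

    # Stand-in frame: the first rows lack the year-long indicator.
    df = pd.DataFrame({"Close":   [10.0, 10.5, 11.0],
                       "avg_365": [np.nan, np.nan, 10.5]})

    # Drop rows that contain any NaN value before modeling.
    df = df.dropna(axis=0)
    print(df)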

This step and the following steps are done in the program prediction.py. First, the user's prediction request is retrieved by get_prediction_req(data_storage_method). Then the historical data is read and loaded into a DataFrame, either from the MySQL database with read_data_from_mysql(database_connection, historical_data_table) or from the CSV file with read_data_from_csv(historical_data_file).

4. Build the predictive model. 

Some common predictive methods are regression, classification, decision trees, neural networks, and support vector machines. For this project, multiple linear regression and random forest are chosen as the initial methods; both have interfaces in the scikit-learn (sklearn) package.
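
Here is a minimal sketch of fitting both models with scikit-learn; the indicator columns and values are made-up stand-ins for the features listed above:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor

    # Made-up indicator table standing in for the real features.
    df = pd.DataFrame({"avg_5": [10.2, 10.4, 10.6, 10.8, 11.0, 11.2],
                       "std_5": [0.20, 0.30, 0.25, 0.20, 0.30, 0.35],
                       "Close": [10.5, 10.7, 10.9, 11.1, 11.3, 11.5]})
    features = ["avg_5", "std_5"]
    train, test = df.iloc[:-1], df.iloc[-1:]

    lr = LinearRegression().fit(train[features], train["Close"])
    rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(train[features], train["Close"])

    print(lr.predict(test[features]), rf.predict(test[features]))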

5. Validate and select the model. 

To validate the model, two things need to be done. 
  • Split the historical data into train set and test set. 
  • Choose an error performance measurement and calculate it. 
In the code, the function predict(df, prediction_start_date, prediction_end_date, predict_tommorrow) performs the prediction tasks and calculates the error measurement after computing the indicators and cleaning the data.

The big question here is how to split the historical data into train and test sets for prediction. Suppose the user wants to predict the price from 2015-01-03 to 2015-12-27. The train and test sets are split using the backtesting technique. Say the earliest date in the historical data is 1950-01-03. To predict the price for 2015-01-03, the train set is the historical data from 1950-01-03 to 2015-01-02, and the test set is 2015-01-03. To predict the price for 2015-01-04, the train set is the historical data from 1950-01-03 to 2015-01-03, and the test set is 2015-01-04. This process continues until we have predicted values for every date from 2015-01-03 to 2015-12-27. In the code, this is done by looping over the index of the prediction period.
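
Below is a rough sketch of such a backtesting loop, assuming a date-indexed DataFrame whose indicator columns have already been computed. The variable and column names are illustrative, not the exact code of predict(); the same loop works for the random forest model.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def backtest(df, features, start_date, end_date):
        """Walk forward one day at a time: train on all prior history, predict that day."""
        df = df.dropna()                      # rows from the earliest year lack indicators
        prediction_dates = df.loc[start_date:end_date].index
        predictions = []
        for day in prediction_dates:
            train = df.loc[:day].iloc[:-1]    # everything strictly before the prediction day
            test = df.loc[[day]]              # the single day being predicted
            model = LinearRegression().fit(train[features], train["Close"])
            predictions.append(model.predict(test[features])[0])
        return pd.Series(predictions, index=prediction_dates)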

The mean absolute error (MAE) is picked as the error performance measure. The model with the smaller MAE is chosen to predict the price for tomorrow. The MAEs and the predicted value for tomorrow are written to the file "predicted_value_for_tommorrow". The actual and predicted values for the test data are stored in either the MySQL database or a CSV file by write_prediction_to_mysql(df_prediction, database_connection, predicted_output_table) or write_prediction_to_csv(df_prediction, predicted_output_file) for further analysis, which may give some insight into where the model does not perform well and how to improve it.
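
Computing and comparing the MAEs is straightforward with sklearn.metrics; the numbers below are made up, and in practice the inputs come from the backtesting loop sketched above.

    from sklearn.metrics import mean_absolute_error

    actual = [110.4, 111.2, 109.8]
    lr_pred = [111.0, 110.5, 110.2]   # regression predictions (made up)
    rf_pred = [110.6, 111.0, 109.5]   # random forest predictions (made up)

    lr_mae = mean_absolute_error(actual, lr_pred)
    rf_mae = mean_absolute_error(actual, rf_pred)
    print("regression MAE:", lr_mae, "random forest MAE:", rf_mae)
    print("model for tomorrow:", "random forest" if rf_mae < lr_mae else "regression")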

6. Present the results. 

Besides predicting the value for tomorrow, a simple experiment is done by using the whole year of data in 2014 and in 2015 as the test set, respectively. The MAE results are as follows.


MAE            2014    2015
Regression     15.62   19.98
Random Forest  13.99   19.08

There are two simple findings.

- The model performs differently in different years; the table shows that both models perform better for the 2014 stock market.
- Random forest seems to perform better than multiple linear regression based on their MAE. 

The time series plots for 2014 and 2015 with actual and predicted values can be found below. As shown, the trend is captured well on the whole. One potential improvement point for the regression model is that its peaks lag slightly behind; one for the random forest model is that its predicted values are generally smaller than the actual values.
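
Such plots can be produced quickly with matplotlib; the DataFrame below is a made-up stand-in for the stored prediction output.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Made-up actual/predicted values standing in for the stored prediction table.
    dates = pd.date_range("2015-12-21", periods=5, freq="D")
    df_prediction = pd.DataFrame({"actual":    [110.4, 111.2, 109.8, 110.9, 112.0],
                                  "predicted": [111.0, 110.5, 110.2, 111.3, 111.1]},
                                 index=dates)

    ax = df_prediction.plot(figsize=(10, 5))
    ax.set_xlabel("Date")
    ax.set_ylabel("Close price")
    ax.set_title("Actual vs. predicted close price")
    plt.show()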

I will tweak the algorithms more later and then present more findings. After all, the main initial purpose of this project was to get myself familiar with the full lifecycle of a data science project and with the Python packages and functions that are frequently used in one.


Friday, December 25, 2015

A Simple but Complete Guide for OpenStack Trove

This post illustrates the framework and some details that help learners get started with OpenStack and its database service, Trove. I started to study OpenStack and its Trove project five months ago from scratch. If I can do it, you can do it too. I appreciate the guidance I have received in every way, especially from my teammates, with whom I have worked closely every weekday.

Trove is the database-as-a-service project on OpenStack, which helps users save on database infrastructure costs and eliminates administrative tasks such as deployment, configuration, and backups. To understand its benefit, imagine that you need a database server for your business. Traditionally, you have to set up the infrastructure first, install the database package, configure it, and maintain it, which costs a lot of money and time. With OpenStack Trove, all you do is open an account, execute a few commands or click a few tabs to launch a database server, and pay only for what you use.

Before we start this exciting journey, I would like to highlight the following documentation and reference. 
  • OpenStack Documentation, where you can find the installation guide, admin user guide, end user guide, and command line reference. These cover almost all tasks performed on OpenStack and its service components. Reading the documentation carefully can always help us avoid naive mistakes and save a lot of time.
  • OpenStack Trove, where you can find comprehensive details about Trove. 
The following steps can help you deploy OpenStack Trove and then operate it from scratch. 

1. Set up an OpenStack environment and add the identity service (Keystone), image service (Glance), compute service (Nova), dashboard (Horizon), networking service (Neutron or Nova-network), and block storage service (Cinder).

This can be done by following the OpenStack Installation Guide, which can be found in the OpenStack documentation. Be sure to choose the right version of the installation guide for your operating system (Ubuntu 14.04, Red Hat Enterprise Linux 7, CentOS 7, openSUSE 13.2, or SUSE Linux Enterprise Server 12) and the version of OpenStack (Juno, Kilo, or Liberty) you want to deploy.

2. Add Database as a Service (Trove) to OpenStack.

This part is not covered in the official installation documentation. For Ubuntu users, follow the steps and commands provided in my earlier post. For users of other Linux operating systems, I guess it is still a good idea to follow the steps in that post and change the commands accordingly.

3. Obtain or build a Trove guest image. 

This is the first step in using Trove. There are currently three ways.

  • Download a pre-built guest image from here. This method is DevStack based.
  • Build a guest image using the OpenStack Trove tools (Disk Image Builder, redstack). I tried these two tools on DevStack and on my OpenStack, respectively. Using them, I got a working image on DevStack but failed to get one on my OpenStack. Back when I tried, there was probably a bug related to their DevStack dependency. By a working image, I mean an image that can be used to launch a Trove instance that reaches active status. I am not sure whether the tools work well now. After struggling with them for a month, I came up with a customized way.
  • Build a guest image in a customized way. The approach I used can be found in this post. I got a working image by performing those steps. That post also provides some insight into how the image works.
For more details about the first two ways, see this post.

4. Add the Trove guest image to the datastore. 

The image we get from the above is just a QCOW2 file. To tell Trove where it is and let Trove use it, we must add it to a Trove datastore by performing steps 2-6 in the database service chapter of the OpenStack Administration Guide.

5. Launch a Trove instance. 

This can be done either through the dashboard or through the Trove command line client. For the latter, refer to the "trove create" command in the OpenStack Command Line Reference.

6. Debug. 

There are some common errors that OpenStack Trove users may encounter. To name a few:
  • Status goes to "Error" shortly after the creation. 
  • Status is stuck at "Build" and goes to "ERROR" after reaching the timeout value.
  • No host is assigned.
To figure out the reason, the starting point should always be the Trove logs: trove-api.log, trove-taskmanager.log, and trove-conductor.log, which by default are in the directory /var/log/trove on the Trove controller node, and trove-guestagent.log, which by default is in /var/log/trove/ on the guest.

With the insights the log files provide, Google and Ask OpenStack are good places to help pinpoint the errors.

Good luck and enjoy! 


Sunday, December 20, 2015

Web Scraping and Data Analysis with Python

Recently I have been working on a project about predicting stock prices. Thanks to the guidance received from dataquest, I was excited to grasp how to use web scraping to collect data and how to use various Python packages to analyze data and build models. I wish I had known about these techniques back in 2013, when I was involved in a project that required me to collect and analyze a large amount of web data. If I had known they existed, I would definitely have learned them. That would have made my work more efficient and accurate, since programming the data collection and analysis tasks in Python makes mistakes less likely and easier to track down when something goes wrong.

In the past month and a half, I have read through the following three books. I have found them very helpful in empowering me with techniques for automating data-related tasks in Python.


Just take a look at these books and take away what you need. You are also welcome to check my code on GitHub for the stock price prediction project. The program produces correct results, but I am still working on improving it and keeping it updated.

Thursday, November 19, 2015

Build Trove Guest Image Manually in OpenStack

I came up with this solution for manually building an image for OpenStack Trove based on my knowledge and intuition, after struggling with Disk Image Builder in my OpenStack environment.

The first step in using Trove is to obtain or build a Trove guest image. Compared to a regular OpenStack virtual machine image used by Nova, a Trove image has two extra essential components: the Trove guest agent (or the ability to obtain the Trove guest agent) and a database server package.

If you already have an OpenStack environment set up, you can build a Trove guest image by following the steps below.

1. Launch a Nova instance with a flavor that satisfies the system requirements (CPU, RAM, swap disk, etc.) of the database server you want to install. Perform steps 2-6 on this Nova instance.

2.  Install and configure cloud-init on the Nova instance launched in Step 1.

# apt-get install cloud-init cloud-utils cloud-initramfs-growroot cloud-initramfs-rescuevol
# echo 'manage_etc_hosts: True' > /etc/cloud/cloud.cfg.d/10_etc_hosts.cfg

3. Set up basic environment for trove-guestagent on the Nova instance launched in Step 1.

# apt-get update
# apt-get install ubuntu-cloud-keyring
# echo "deb http://ubuntu-cloud.archive.canonical.com/ubuntu" \
    "trusty-updates/kilo main" > /etc/apt/sources.list.d/cloudarchive-kilo.list

(Note: change "kilo" to the version of your OpenStack.)

# apt-get update && apt-get dist-upgrade

4. Install Trove and Trove-guestagent on the Nova instance launched in Step 1.

# apt-get install python-trove python-troveclient trove-common trove-guestagent

5. Configure Trove-guestagent on the Nova instance launched in Step 1.

Edit /etc/trove/trove-guestagent.conf.

  [DEFAULT]
    rabbit_host = controller
    rabbit_userid = RABBIT_USER
    rabbit_password =  RABBIT_PASS
    nova_proxy_admin_user = admin
    nova_proxy_admin_pass = admin
    nova_proxy_admin_tenant_name = admin
    trove_auth_url = http://controller:35357/v2.0
    log_file = trove-guestagent.log

(Change RABBIT_USER and RABBIT_PASS.)
Refer to Step 1 in the database chapter of OpenStack Administration Guide.

6. Upload the database server package to the Nova instance launched in Step 1.

Note: After uploading the database server package, you can choose whether to install it manually. If you don't install it, trove-guestagent should install it when a Trove instance boots. But different database servers have different scenarios. Refer to the source code for the database you want to use under /trove/guestagent/datastore to find the default path that the database server package should be uploaded to. For example, if you use Vertica, the default upload path is the root directory '/', based on the value of the option "INSTALL_VERTICA" in the source file /trove/guestagent/datastore/experimental/vertica/system.py. The community edition of Vertica is free and can be downloaded here.

7. In the OpenStack dashboard, shut off this Nova instance and take a snapshot of it.

Note: In OpenStack, a snapshot is an image.

8. Through the trove client, add the snapshot to the datastore.

Refer to Step 2-6 in the database chapter of OpenStack Administration Guide.

9. Restart all Trove services on the controller node.

# service trove-api restart
# service trove-taskmanager restart
# service trove-conductor restart

10. Launch a Trove instance.

Refer to OpenStack Command Line Reference.

Tuesday, October 27, 2015

Add Database Service (Trove) for OpenStack - Kilo - Ubuntu


Note: Change the values in CAPITALS and italics, such as TROVE_DBPASS, TROVE_PASS, RABBIT_USER, RABBIT_PASS, and NETWORK_LABEL. Also change controller to the IP address of the OpenStack controller node, if necessary.

1. Prepare trove database

     $ mysql -u root -p
    mysql> CREATE DATABASE trove;
    mysql> GRANT ALL PRIVILEGES ON trove.* TO trove@'localhost' IDENTIFIED BY 'TROVE_DBPASS';
    mysql> GRANT ALL PRIVILEGES ON trove.* TO trove@'%' IDENTIFIED BY 'TROVE_DBPASS';
    mysql> FLUSH PRIVILEGES;

2. Install required Trove components

    # apt-get install python-trove python-troveclient trove-common trove-api trove-taskmanager trove-conductor

3. Prepare OpenStack

    $ source ~/admin-openrc.sh
    $ keystone user-create --name trove --pass TROVE_PASS
    $ keystone user-role-add --user trove --tenant service --role admin

4. Add the following configuration options to [filter:authtoken] section in /etc/trove/api-paste.ini

    [filter:authtoken]
 
    auth_uri = http://controller:5000
    auth_url = http://controller:35357
    auth_plugin = password
    project_domain_id = default
    user_domain_id = default
    project_name = service
    username = trove
    password = TROVE_PASS

5. Edit the following configuration options in the [DEFAULT] section of each of the following files

/etc/trove/trove.conf
/etc/trove/trove-taskmanager.conf
/etc/trove/trove-conductor.conf

    [DEFAULT]
    log_dir = /var/log/trove
    trove_auth_url = http://controller:5000/v2.0
    nova_compute_url = http://controller:8774/v2
    cinder_url = http://controller:8776/v2
    swift_url = http://controller:8080/v1/AUTH_
    notifier_queue_hostname = controller
    control_exchange = trove
    rabbit_host = controller
    rabbit_userid = RABBIT_USER
    rabbit_password = RABBIT_PASS
    rabbit_virtual_host= /
    rpc_backend = trove.openstack.common.rpc.impl_kombu
    sql_connection = mysql://trove:TROVE_DBPASS@controller/trove

(Note: comment out old configuration options if any exist, such as trove_auth_url and connection.)

6. Edit the following configuration options in [DEFAULT] section in /etc/trove/trove.conf

    [DEFAULT]
    add_addresses = True
    network_label_regex = ^NETWORK_LABEL$

(Note: replace NETWORK_LABEL with the label of the network you want to connect to. Find it with "nova net-list" or "neutron net-list".)

7. Edit the following configuration options in [DEFAULT] section in /etc/trove/trove-taskmanager.conf

    [DEFAULT]
    nova_proxy_admin_user = admin
    nova_proxy_admin_pass = admin
    nova_proxy_admin_tenant_name = admin                    
    taskmanager_manager = trove.taskmanager.manager.Manager
    log_file = trove-taskmanager.log

8. Initialize database

    # trove-manage db_sync

9. Edit /etc/init/trove-conductor.conf so that its exec option matches the following

    --exec /usr/bin/trove-conductor -- --config-file=/etc/trove/trove-conductor.conf ${DAEMON_ARGS}

10. Edit /etc/init/trove-taskmanager.conf so that its exec option matches the following

    --exec /usr/bin/trove-taskmanager -- --config-file=/etc/trove/trove-taskmanager.conf ${DAEMON_ARGS}

11. Configure the Trove Endpoint in Keystone

    $ keystone service-create --name trove --type database --description "OpenStack Database Service"
    $ keystone endpoint-create \
    --service-id $(keystone service-list | awk '/ trove / {print $2}') \
    --publicurl http://controller:8779/v1.0/%\(tenant_id\)s \
    --internalurl http://controller:8779/v1.0/%\(tenant_id\)s \
    --adminurl http://controller:8779/v1.0/%\(tenant_id\)s \
    --region regionOne

12. Restart the Trove Services

    $ sudo service trove-api restart
    $ sudo service trove-taskmanager restart
    $ sudo service trove-conductor restart

Thursday, July 9, 2015

Introduction to MySQL and SQL for Absolute Beginners or as Quick Refresher

Database management systems (DBMS) have become ever more important in the data era. MySQL is an open-source relational database management system (RDBMS). It can be downloaded freely here. Instructions for installation, testing, and simple operations are provided here. More SQL commands can be found on the SQLCourse website, which is also an interactive online SQL training platform.

Monday, July 6, 2015

An Introductory Book to Business Intelligence

Business intelligence (BI) has been a hot concept, and I had been wondering what exactly it is. Now I have a satisfactory answer from the book Business Intelligence: Making Decisions through Data Analytics. This book gives a thorough and systematic introduction to BI and its tools. Basically, BI and its tools are derived from the following four areas: statistics and econometrics, operations research, artificial intelligence, and database technologies. The book successfully connects what I have already grasped to BI.

I guess it would be a good idea to start to explore BI from this book.

Thursday, June 25, 2015

A Tutorial Website for Simple Learning - Tutorialspoint

The tutorials library provided by Tutorialspoint covers a broad range of topics in computer-related technology and targets the main points. Self-learners may find the topic they want to learn more accessible and grasp the key points in a short time. It also provides an online terminal and IDEs for practice.

The Hadoop tutorial quickly led me to understand the function, structure, and operation of Hadoop as a solution for big data.

Sunday, June 21, 2015

An Introductory Book to Data Mining

Discovering Knowledge in Data: An Introduction to Data Mining is a pretty interesting and straightforward book for beginners in data mining. I read five chapters of it just this afternoon. It is that engaging.

The course materials (i.e., data sets, homework, and the project) of the data mining course provided by the University of Tennessee can also serve as a very good practical reference.

Monday, June 15, 2015

SAS Global Certification Program

It is the big data era. People with data analysis and modeling skills can find plenty of job opportunities. The SAS Global Certification Program provides a bridge into this professional area by getting candidates certified.

Based on my own experience, preparing for a certification exam helps you study the software and the theory specifically and thoroughly within an organized time frame.

Currently I hold the following certificates.

- SAS Certified Base Programmer for SAS 9 (08/2014)
- SAS Certified Advanced Programmer for SAS 9 (12/2014)
- SAS Certified Statistical Business Analyst Using SAS 9: Regression and Modeling (05/2015)
- SAS Certified Predictive Modeler Using Enterprise Miner 7 (06/2015)

I mainly used the SAS online tutor and course notes. If you need any study materials for study purpose only, please leave a comment.

Tuesday, June 9, 2015

An Introductory Example to Lagrangian Relaxation

The example provided here means that Lagrangian relaxation (LR) is no longer a mystery to me; each step becomes very clear. I have read a lot of material on LR, and this is one of the best resources for beginners to learn how LR works.

An implementation of LR with C++ in ILOG CPLEX can be found here.

That makes me think there are probably no truly unsolvable problems; it is just that we haven't found the right way to approach them.