In this post, I will explain what I have done in my first Python project in data science - stock price prediction, combined with the code. I started to learn how to use Python to perform data analytical works during my after-working hours at the beginning of December. So this post will also serve as a summary of what I have learned in the past three weeks. Hope this post can also give readers some insights on the whole process of how to get a data science project done with Python from beginning to end. I want to thank Vik from dataquest for comments he has provided, which I have really benefited from.
My code for this project can be found here on GitHub. It can be customized by changing settings in settings.py and automatically process data acquisition and prediction tasks in this project. It can also be easily modified to adapt more users' needs.
1. Get and store data.
There are basically three data resources.
- Files, such as EXCEL, CSV, and TEXT.
- Database, such as SQLite, MySQL, and MongoDB.
- Web.
The historical data used in this project is web data from Yahoo Finance Historical Prices. There are basically two ways to retrieve web data.
- Web API, such as Twitter API, Facebook API, and OpenNotify API.
- Web Scraping.
I chose the web scraping method to retrieve historical data, since I just learned it and wanted to practice it. In the code, the program data_acquisition.py scrapes historical data and stores them in either csv file or mysql database. The function get_query_input() will get the date range of historical data that the user wants to retrieve (i.e. from 2015-01-01 to 2015-12-25), form the url with parameters extracted from user inputs, and return that url. With that url, we then can get data by sending HTTP request to it and scraping it, which is what the function scrape_page_data(pageUrl) does. aggregate_data_to_mysql(total_table, pageUrl, output_table, database_connection) will store the scraped data in mysql database, while aggregate_data_to_csv(total_table, pageUrl, output_file) will store the scraped data in a csv file. Notice that these two functions are recursive functions. That is because all historical data requested may not be in the same webpage, and if not, we need to go to the next page url and keep scraping data. So the question is how we can scrape data in a webpage and then go to next page and continue this process until all historical data requested is retrieved.
To scrape data in a webpage, we first take a look at its HTML source code. In Safari, right click the webpage and select "show page source". This may require a little bit knowledge about HTML, but it can be quickly caught up. By investigating the HTML source code of stock price page, we can find that the data we want to retrieve is in the table named "yfnc_datamodoutline1". CSS selectors make it easy for us to select the rows in the table. And the next page url can be obtained from the attribute of <a rel="next" href=, where tag <a indicates a link in HTML.
By the time when I write this post, I find Yahoo! Finance APIs. Next time I need it, I will play with that to see how it would make things easier. I guess when working with data, the first thing to check is whether that website provides APIs.
2. Explore data.
The purpose of data exploration may fall in one of the below categories.
- Missing values and outliers
- The pattern of the target variable
- Potential predictors
- The relationship between the target variable and potential predictors
- The distribution of variables
Data visualization and statistics (i.e. correlation) are good ways to explore data. One specific way for checking missing values can be found here. It turns out there is no missing value in the historical data.For time series data, we set the date as the index and sort the data in ascending order, which can be done though the function pandas.DataFrame.set_index() and pandas.DataFrame.sort_index().
The objective of this project is to predict stock price. Take "Close" price as the example. To predict it, we will be interested in its own pattern and the relationship between it and other factors. One question is what factors may influence our target variable ("Close" price). We can get some insights on those driving factors by studying the problem deeper and doing some research on relative literature and available modelings. Frankly speaking, I almost knew nothing about the stock market, before I started to work on this project. So I got indicators of "Close" price from the project information received from dataquest.
- The average price from the past five days.
- The average price for the past month.
- The average price for the past year.
- The ratio between the average price for the past five days, and the average price for the past year.
- The standard deviation of the price over the past five days.
- The standard deviation of the price over the past year.
- The ratio between the standard deviation for the past five days, and the standard deviation for the past year.
- The average volume over the past five days.
- The average volume over the past year.
- The ratio between the average volume for the past five days, and the average volume for the past year.
- The standard deviation of the average volume over the past five days.
- The standard deviation of the average volume over the past year.
- The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
- The year component of the date.
In Pandas, there are functions to compute moving (rolling) statistics, such as rolling_mean and rolling_std. But we need to shift the column of that rolling statistics forward by one day. The reason is as follows. We want the average price from the past five days for 2015-12-12 to be the mean of prices from 2015-12-07 to 2015-12-11, but the returned value of the rolling_mean function for 2015-12-12 is the mean of prices from 2015-12-08 to 2015-12-12. So the average price from the past five days for 2015-12-12 is actually the returned value of the function rolling_mean for 2015-12-11.
3. Clean data.
When computing some indicators, a year of historical data is required. So those indicators for the earliest year will be missing, indicated as NaN in Python. We will just remove rows with NaN values.
This step and following steps are done in the program prediction.py. First, users' prediction requests will be retrieved by get_prediction_req(data_storage_method). Then historical data will be read and loaded to a DataFrame either from mysql database or csv file with the function read_data_from_mysql(database_connection, historical_data_table) or read_data_from_csv(historical_data_file).
This step and following steps are done in the program prediction.py. First, users' prediction requests will be retrieved by get_prediction_req(data_storage_method). Then historical data will be read and loaded to a DataFrame either from mysql database or csv file with the function read_data_from_mysql(database_connection, historical_data_table) or read_data_from_csv(historical_data_file).
4. Build the predictive model.
Some common predictive methods we use are regression, classification, decision tree, neutral network, and supporting vector machine. For this project, multiple linear regression and random forest will be chosen as initial methods, whose interfaces are provided in sklearn package.
5. Validate and select the model.
To validate the model, two things need to be done.
- Split the historical data into train set and test set.
- Choose an error performance measurement and calculate it.
The function is predict(df, prediction_start_date, prediction_end_date, predict_tommorrow), which performs predictive tasks and calculates the error measurement after calculating the indicators and cleaning the data.
The big question here is how to split the historical data into train set and test for prediction. For example, the user wants to predict the price from 2015-01-03 to 2015-12-27. The train and test data set will be split based on the backtesting technique. Let's say the earliest date in historical data is 1950-01-03. For example, to predict the price for 2015-01-03, the train set will be historical data from 1950-01-03 to 2015-01-02, and the test set will be 2015-01-03. And to predict the price for 2015-01-04, the train set will be historical data from 1950-01-03 to 2015-01-03, and the test set will be 2015-01-04. Keep this process until we get all predicted values from 2015-01-03 to 2015-12-27. In the code, this part is done by looping over the index set of the prediction period.
The mean absolute error (MAE) is picked as the error performance measure. The model with a smaller MAE will chosen to predict the price for tomorrow. The MAEs and predicted value for tomorrow will be written to file "predicted_value_for_tommorrow". And the actual and predicted value for test data will be stored in either a mysql database or csv file by function write_prediction_to_mysql(df_prediction, database_connection, predicted_output_table) or write_prediction_to_csv(df_prediction, predicted_output_file) for further analysis, which may give some insights on where the model doesn't perform well and how to improve it.
6. Present the results.
Beside predicting the value for tomorrow, a simple experiment is done by using the whole year data in 2014 and 2015 as test respectively. The MAE result is as follows.
2014 | 2015 | |
Regression | 15.62 | 19.98 |
Random Forest | 13.99 | 19.08 |
We can have two simple findings.
- The model performs differently in different years. shows that the model performs better for the stock market in the year 2014.
- Random forest seems to perform better than multiple linear regression based on their MAE.
The time series plots for 2014 and 2015 with actual and predicted values can be found below. As shown, the trend is caught well in the whole. One potential improvement point for regression model is that the peak falls behind slightly. And one potential improvement point for random forest model is that its predicted value is smaller than actual value in general.
I will tweak algorithms more later and then present more findings. After all, the initial main purpose of this project is to get myself familiar with the full lifecycle of a data science project and Python packages/functions that are frequently used in a data science project.
- The model performs differently in different years. shows that the model performs better for the stock market in the year 2014.
- Random forest seems to perform better than multiple linear regression based on their MAE.
The time series plots for 2014 and 2015 with actual and predicted values can be found below. As shown, the trend is caught well in the whole. One potential improvement point for regression model is that the peak falls behind slightly. And one potential improvement point for random forest model is that its predicted value is smaller than actual value in general.
I will tweak algorithms more later and then present more findings. After all, the initial main purpose of this project is to get myself familiar with the full lifecycle of a data science project and Python packages/functions that are frequently used in a data science project.
Hello Lili,
ReplyDeleteThe Article on The First Python Project in Data Science is amazing.It give detail information about it. Thanks for Sharing the information about it. data science consulting