
Project ECON 427, Instructor: Mohammad Reza Rajati

1. Predicting Stock Price Movements

The goal of this project is to predict stock prices by applying machine learning techniques

to data from StockTwits, a social media platform for investors. We extract

features from textual data, and formulate price prediction as both a regression and a

classification problem. We demonstrate the results and analyze them.

(a) Make yourself familiar with the StockTwits platform: https://stocktwits.com/.

(b) The goal is to perform analysis on the component stocks of the Dow Jones Industrial

Average. Data were collected for the period December 2013 to December

2016, totaling 756 trading days. Two main datasets are used:

i. StockTwits Data: The data were collected and downloaded in raw JSON

format, totaling over 540,000 messages. Sentiment polarity was also extracted

from user-generated “bullish”/ “bearish” tags.

A. For each stock on each day, calculate the difference between the number of
bullish and bearish tags and divide it by the total number of tagged messages
to find a daily polarity. Then calculate a moving average of this ratio and
call it s_t. Use a 3-point moving average initially, but you can try changing
the window size of the moving average to see if you get better results when
training models.

B. Calculate the number of messages for each stock in each day, which we

call message volume.

C. Calculate the percentage 1-day message volume change, which is the difference
between today’s message volume and yesterday’s message volume divided by
yesterday’s message volume, and call it mv_{1,t}.

D. Calculate today’s message volume divided by the average message volume
in the previous 10 days and call it mv_{10,t}.
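The four message features above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names and toy counts are hypothetical, the moving average is a trailing one, a day with no tagged messages is given polarity 0, and mv_{1,t} and mv_{10,t} are left as None until enough history exists.

```python
from collections import deque

def moving_average(values, window=3):
    """Trailing moving average; early days average over the days seen so far."""
    out, buf = [], deque(maxlen=window)
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

def message_features(bullish, bearish, volume, window=3, lookback=10):
    """Return (s, mv1, mv10) lists for one stock, one entry per trading day."""
    polarity = [
        (b - r) / (b + r) if (b + r) > 0 else 0.0  # assumption: 0 when no tags
        for b, r in zip(bullish, bearish)
    ]
    s = moving_average(polarity, window)
    mv1, mv10 = [], []
    for t, v in enumerate(volume):
        prev = volume[t - 1] if t >= 1 else None
        # 1-day percentage change; undefined on day 0 or after a zero-volume day
        mv1.append((v - prev) / prev if prev else None)
        hist = volume[max(0, t - lookback):t]
        full = len(hist) == lookback and sum(hist) > 0
        # today's volume relative to the mean of the previous `lookback` days
        mv10.append(v / (sum(hist) / lookback) if full else None)
    return s, mv1, mv10

# Toy counts for one hypothetical stock over 12 days.
bullish = [3, 1, 4, 2, 5, 3, 2, 4, 1, 3, 6, 2]
bearish = [1, 1, 0, 2, 1, 3, 2, 0, 1, 1, 2, 2]
volume = [10, 20, 10, 10, 10, 10, 10, 10, 10, 10, 22, 11]
s, mv1, mv10 = message_features(bullish, bearish, volume)
```

Changing `window` is the knob the handout suggests experimenting with when training models.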

ii. Price Data: Daily split-adjusted stock price data were collected via the
Yahoo Finance API. You may focus only on the closing price data for the
purposes of this project, but you are welcome to test your algorithms on
other prices in the data set as well.

iii. Prediction Target: We focus on the forward T-day return, calculated as the
percentage change between today’s trading price and the price T days ahead, i.e.:

    r_t(T) = \frac{p_{t+T} - p_t}{p_t}

where p_{t+T} is the price at time period t + T, i.e. T days ahead. Calculate
r_t(3) and r_t(5) from the data for each company. Later, we will try to predict
them using various techniques.
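As a small illustration of the prediction target, the sketch below computes forward returns from a list of closing prices, assuming the return is the usual percentage change (p_{t+T} - p_t) / p_t. The function name and the prices are made up; days whose horizon falls outside the data are marked None.

```python
def forward_returns(prices, horizon):
    """r_t(T) = (p_{t+T} - p_t) / p_t; None where day t+T is outside the data."""
    n = len(prices)
    return [
        (prices[t + horizon] - prices[t]) / prices[t] if t + horizon < n else None
        for t in range(n)
    ]

closes = [100.0, 102.0, 101.0, 104.0, 103.0, 110.0]  # made-up closing prices
r3 = forward_returns(closes, 3)  # r_t(3)
r5 = forward_returns(closes, 5)  # r_t(5)
```

The last T days of any price series have no target and would need to be dropped before training.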

(c) Pre-Processing and Exploratory data analysis:


i. There is an exceedingly large number of posts about AAPL. You can remove

AAPL from your analysis if the computational burden is too much for your

computer.

ii. Look up what stop words are and remove them from the data.

iii. Remove company names from the data.

iv. Remove posts mentioning/tagging multiple stocks (e.g. “$AAPL $FB $GOOG”).

v. Aggregate posts by date. For each date in the period December 2013 to
December 2016, you should have a set of tweets for each company on that date.

vi. Use 70% of the data for training and 30% for testing. Remember not to select

training and test data randomly. Use the first 70% of the days for training

and the last 30% for testing (January 2016 to December 2016). Explain why

this is a correct way of splitting the data.
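A chronological split can be sketched as follows. The helper name is hypothetical and the days are represented by simple integer indices; the point is only that the data are ordered by date and cut once, so every training day precedes every test day.

```python
def chronological_split(days, train_fraction=0.7):
    """Split an ordered sequence of trading days without shuffling."""
    ordered = sorted(days)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

days = list(range(1, 757))  # 756 trading days, indexed 1..756
train_days, test_days = chronological_split(days)
```

The same cut point should be applied to the features and to the targets so that the two stay aligned.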

(d) Bag of Words Features

i. Calculate the frequencies of the words in the data.

ii. Only keep words that occurred at least 25 times in the dataset. This should

give you more than 6800 words.

iii. For each of the words in 1(d)ii, calculate the TF-IDF metric with Laplace

smoothing. Those metrics are used as features in your classification models.
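The bag-of-words steps can be sketched with the standard library alone. The handout does not say which smoothing variant is intended; the sketch assumes add-one smoothing of the document frequencies, idf = ln((1 + N) / (1 + df)) + 1, which is the form scikit-learn uses when `smooth_idf=True`. Function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(docs, min_count=25):
    """Keep words whose total count across all documents is at least min_count."""
    totals = Counter(word for doc in docs for word in doc)
    return sorted(w for w, c in totals.items() if c >= min_count)

def tfidf(docs, vocab):
    """TF-IDF rows with add-one smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    # document frequency: number of documents containing each word
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency within this document
        rows.append([counts[w] * idf[w] for w in vocab])
    return rows

# Each document is the token list for one (stock, day) pair; toy data only.
docs = [["beat", "earnings", "beat"], ["miss", "earnings"], ["beat", "rally"]]
vocab = build_vocabulary(docs, min_count=2)  # 25 in the project, 2 for the toy data
features = tfidf(docs, vocab)
```

To avoid leaking test information, the vocabulary and IDF weights would normally be computed on the training days only and then reused for the test days.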

(e) Chi-Squared Statistics

i. Since the number of features is very large, we use a preliminary feature
selection method that detects dependence between each feature and the class
label. Use the chi-squared test to select the 1000 features with the highest
chi-squared scores.
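One way to picture the selection step is the classical chi-squared statistic on a 2x2 table of word presence against the binary label. This is a sketch under that assumption; library implementations (for example scikit-learn's `chi2`) score non-negative feature values somewhat differently, and the function names here are hypothetical.

```python
def chi2_score(present, labels):
    """Chi-squared statistic of a 2x2 table: word present/absent vs label 1/0."""
    n = len(labels)
    observed = [[0, 0], [0, 0]]  # observed[present][label]
    for x, y in zip(present, labels):
        observed[int(bool(x))][int(y)] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:  # skip empty rows/columns to avoid division by zero
                score += (observed[i][j] - expected) ** 2 / expected
    return score

def top_k_features(columns, labels, k=1000):
    """Indices of the k feature columns with the highest chi-squared score."""
    scores = [chi2_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

labels = [1, 1, 1, 0, 0, 0]
columns = [
    [1, 1, 1, 0, 0, 0],  # perfectly aligned with the label
    [1, 0, 1, 0, 1, 0],  # weakly related
    [1, 1, 1, 1, 1, 1],  # constant, carries no information
]
selected = top_k_features(columns, labels, k=2)
```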

(f) Classification

i. Explain how prediction of r_t(T) can be converted into a binary classification

problem and convert the responses to binary labels.

ii. Naïve Bayes Binary Classifier

A. Train a Naïve Bayes classifier using bag of words features.

B. Report train and test accuracy for this model.

C. Build a confusion matrix for both training and test data.

D. Report AUC, precision, recall, and F1-scores for both training and testing

data.
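The reporting items above recur for every classifier, so a small helper is convenient. The sketch below assumes binary labels with 1 meaning "price goes up", and computes AUC as the fraction of (positive, negative) pairs ranked correctly by a score, counting ties as one half. All names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_report(y_true, y_pred):
    """Accuracy, precision, recall and F1; ratios default to 0 if undefined."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.3, 0.7, 0.1]  # predicted P(label = 1)
report = classification_report(y_true, y_pred)
area = auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sanity check but slow for the full data set.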

iii. Logistic Regression

A. Apply Recursive Feature elimination on the chi-squared features to train

a Logistic Regression model for binary classification.

B. Train an L1-penalized Logistic Regression using the chi-squared features
as well as s_t, mv_{1,t}, and mv_{10,t}. Use 5-fold cross validation to find
the best hyper-parameter.

C. Report train and test accuracy for both models.

D. Build a confusion matrix for both training and test data for both models.


E. Report AUC, precision, recall, and F1-scores for both training and testing

data.

iv. Random Forests and Extra Trees

A. Use as many of the 1000 chi-squared features as you can (at least the top
20) along with s_t, mv_{1,t}, and mv_{10,t} to train a random forest model for
binary classification.

B. Repeat 1(f)ivA using Extra Trees.

C. Report train and test accuracy for both models.

D. Build a confusion matrix for both training and test data for both models.

E. Report AUC, precision, recall, and F1-scores for both training and testing

data.

v. Support Vector Machines

A. Train an L1-penalized SVM using the chi-squared features as well as
s_t, mv_{1,t}, and mv_{10,t}. Use 5-fold cross validation to find the best
hyperparameter.

B. Report train and test accuracy for both models.

C. Build a confusion matrix for both training and test data for both models.

D. Report AUC, precision, recall, and F1-scores for both training and testing

data.

(g) Regression

i. KNN Regression

A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to perform

KNN regression on the data. Use 5-fold cross validation to determine the

value of k ∈ {5, 6, . . . , 30}. You are welcome to test the effect of larger

k’s.

B. Map any predicted \hat{r}_t(T) whose absolute value is bigger than a
reasonable threshold into a positive or negative signal, and treat the
remaining predictions as a no-action signal. The suggested threshold is 0.5%,
but you are welcome to try other thresholds as well. Obviously, if the
threshold is set to 0%, there is no no-action signal.

C. Report train and test accuracy for both models.

D. Build a confusion matrix for both training and test data for both models.

E. Report AUC, precision, recall, and F1-scores for both training and testing

data.

F. Note: If you have a no-action signal, the cases that are detected as no
action should not be considered when evaluating classification metrics.
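The mapping from a predicted return to a trade signal, and the exclusion of no-action days from the metrics, can be sketched as below. Function names and numbers are hypothetical; the sketch also assumes that an actual return of exactly zero counts as a downward move.

```python
def to_signal(predicted_return, threshold=0.005):
    """+1 (long), -1 (short), or 0 (no action) from a predicted T-day return."""
    if predicted_return > threshold:
        return 1
    if predicted_return < -threshold:
        return -1
    return 0

def accuracy_excluding_no_action(predicted, actual, threshold=0.005):
    """Directional accuracy over the days on which a signal was issued."""
    pairs = [(to_signal(p, threshold), a) for p, a in zip(predicted, actual)]
    acted = [(s, a) for s, a in pairs if s != 0]  # drop no-action days
    if not acted:
        return None
    hits = sum(1 for s, a in acted if (a > 0) == (s > 0))
    return hits / len(acted)

predicted = [0.012, -0.002, -0.009, 0.004, 0.020]  # made-up model outputs
actual = [0.015, 0.010, 0.003, -0.001, 0.008]      # made-up realized returns
signals = [to_signal(p) for p in predicted]
acc = accuracy_excluding_no_action(predicted, actual)
```

A higher threshold trades fewer days but issues signals only where the model predicts a larger move, so both the accuracy and the number of acted-on days are worth reporting.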

ii. Support Vector Regression [1]

A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to train a

Support Vector regression model on the data. Use L2 regularization. Use

5-fold cross validation to determine the hyperparameters of the algorithm.

[1] https://medium.com/coinmonks/support-vector-regression-or-svr-8eb3acf6d0ff


B. Map any predicted \hat{r}_t(T) whose absolute value is bigger than a
reasonable threshold into a positive or negative signal, and treat the
remaining predictions as a no-action signal. The suggested threshold is 0.5%,
but you are welcome to try other thresholds as well. Obviously, if the
threshold is set to 0%, there is no no-action signal.

C. Report train and test accuracy for both models.

D. Build a confusion matrix for both training and test data for both models.

E. Report AUC, precision, recall, and F1-scores for both training and testing

data.

F. Note: If you have a no-action signal, the cases that are detected as no
action should not be considered when evaluating classification metrics.

iii. Random Forest and Extra Tree Regression

A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to train a
Random Forest regression model and an Extra Trees regression model on the data.

B. Map any predicted \hat{r}_t(T) whose absolute value is bigger than a
reasonable threshold into a positive or negative signal, and treat the
remaining predictions as a no-action signal. The suggested threshold is 0.5%,
but you are welcome to try other thresholds as well. Obviously, if the
threshold is set to 0%, there is no no-action signal.

C. Report train and test accuracy for both models.

D. Build a confusion matrix for both training and test data for both models.

E. Report AUC, precision, recall, and F1-scores for both training and testing

data.

F. Note: If you have a no-action signal, the cases that are detected as no
action should not be considered when evaluating classification metrics.

(h) Improving The Models: Use any method you know, including ensemble methods,

to yield the best classifier and the best regression model you can. You may

want to reduce the number of features you use using recursive feature elimination.

In that case, use recursive feature elimination inside your cross validation loops.

You are free to use any technique, for example a Recurrent Neural Network or

XGBoost.

(i) Explain why even a test accuracy slightly above 50% is not bad for this problem,

although it is a binary classification problem. Make every effort to have a test

accuracy of at least 60%.

(j) Make a table of the test accuracies for each stock, and identify the best three and

the worst three accuracies. Comment on your results.

2. Trading Scenario

(a) Your capital at the beginning of each day is C_t and at the end of each day
is E_t. Assume that you are considering days {1, 2, ..., τ} in your test set,
where τ is the number of your test days. You make long/short decisions on days
{1, 2, ..., τ - T}. Because you have to wait T days to see the effect of your
decisions on your capital, you calculate your capital at the end of days
{1 + T, 2 + T, ..., τ}. Repeat all of the following steps for both T = 3 and
T = 5.

(b) Start with an initial capital of C_1 = C_2 = ... = C_{1+T} = E_1 = E_2 =
... = E_T = $90,000. Only 1/3 of your total money at the end of the previous
day should be invested at the beginning of each day. Thus, if C'_t is the
amount you invest in stocks on day t, you would initially invest
C'_1 = C'_2 = ... = C'_{1+T} = E_1/3 = E_2/3 = ... = E_T/3 = $30,000. C'_t
first departs from $30,000 on day t = 2 + T: the effect of your decisions on
day 1 changes your capital at the end of day 1 + T (which is E_{1+T}), and 1/3
of E_{1+T} is the capital available for investment on day t = 2 + T, i.e.
C'_{2+T} = E_{1+T}/3.

(c) Invest equal amounts of money in each company. Therefore, if you are
considering, say, M = 25 companies, invest I_t/M on day t in each company,
where I_t is the total amount invested on day t. This means you initially
invest I_t/M = $30,000/25 = $1200 in company m (if it makes your calculations
simpler, you can consider fractional shares, but small remainders do not seem
to significantly affect the results). If the price of each share of company m
on day t is p_t, this means you invest in (I_t/M)/p_t shares of company m on
day t.
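The allocation arithmetic in (c) can be checked with a few lines. The tickers and prices are hypothetical, and fractional shares are allowed, as the handout permits.

```python
def shares_per_company(invested_today, n_companies, prices):
    """Split the day's investment equally and convert it to (fractional) shares."""
    per_company = invested_today / n_companies
    return {name: per_company / price for name, price in prices.items()}

# $30,000 spread over M = 25 companies gives $1200 per company.
prices = {"AAA": 120.0, "BBB": 48.0}  # hypothetical tickers and prices
shares = shares_per_company(30_000.0, 25, prices)
```

Only two of the 25 companies are listed here to keep the example short; the divisor stays at M = 25 because the budget is split across all companies.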

(d) Start making decisions on the first day in your test set. Trading is done
using long/short signals. If your predicted trade signal for company m is
positive on day t (i.e. if you predict that its price will go up by day t + T),
long its shares, i.e. calculate your return for the share of company m on day
t using the following formula:

    \text{return}_t = \frac{p_{t+T} - p_t}{p_t}

On the other hand, if your predicted trade signal for a company is negative on
day t (i.e. if you predict that its price will go down by day t + T), short its
shares, i.e. calculate your return for the share of company m on day t using
the following formula:

    \text{return}_t = \frac{p_t - p_{t+T}}{p_t}

If you predicted no action for a stock using the regression methods, obviously
no shares of that stock are traded on day t.
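A minimal sketch of the long and short returns, assuming the usual convention that a short position earns the negative of the long return over the same T days and that a no-action signal earns nothing. The function name and the prices are hypothetical, and commissions are ignored here.

```python
def position_return(price_now, price_later, signal):
    """Return of a position opened at price_now and closed at price_later.

    signal: +1 for long, -1 for short, 0 for no action.
    """
    return signal * (price_later - price_now) / price_now

long_ret = position_return(100.0, 105.0, 1)    # price rose, long gains
short_ret = position_return(100.0, 105.0, -1)  # price rose, short loses
flat_ret = position_return(100.0, 105.0, 0)    # no action, no gain or loss
```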

(e) The effects of decisions made on day t - T on your capital are revealed when
you observe the prices on day t. The total gains and losses on day t resulting
from long/short decisions on day t - T are calculated as:

where q_{t-T} is the number of shares of company m that were traded on day
t - T, and the commission for each trade is considered to be $0.0075, unless
\hat{r}_{t-T}(T) was predicted to be 0 (no action, by a regression model), in
which case q_{t-T} = 0. Thus, your capital at the end of day t is:

Obviously, C_{t+1} = E_t for t ∈ {T, ..., τ - 1}, but we introduced both C_t
and E_t for clarity of the above descriptions.

(f) Plot C_t over the test period for each of your prediction algorithms on the same

graph and compare them. Which method makes you richer at the end of the

test period? You can include any custom-made algorithm you created to improve

the results in this comparison and argue that it works better than the standard

algorithms offered in the description of the project.

(g) Comparison with Oracle trading Dow Jones Industrial Average (DIA):

repeat the above scenario for the Dow Jones industrial average (DIA) and an

omniscient trader (Oracle), i.e. instead of predicting the movements using any

of your algorithms, use the true movements. In other words, if the actual T-day

ahead return is positive in a day, long the stock, and if it is negative, short the

stock. Compare all of your algorithms with the performance of Oracle on DIA on

the same plot and draw conclusions.
