Date: 2019-06-07 11:31

You will use a subset (see below) of the dataset in the file “HousePrices.txt”, which consists of 11 columns with measurements for each of 585 Belgian municipalities. The response variable is the median price of a regular house in the municipality (in thousands of euros).

Region (x1): The administrative region: Flanders, Walloon, Brussels-capital.

Province (x2): The name of the province (there are officially 10 provinces in Belgium), plus the Brussels-capital region, which is here treated as a separate province. Hence this variable has 11 categories.

Municipality: The name of the municipality (this identifies the different observations and is provided just for the curious ones).

PriceHouse (y): Median price of a regular house in the municipality (in thousands of euros).

Shops (x3): The number of officially registered shops in the municipality exceeding a certain number of square meters.

Bankruptcies (x4): Number of bankruptcies in the municipality in one year; this includes all types of enterprises (from one-person companies to big firms).

MeanIncome (x5): The average of the taxable incomes of all tax forms of the municipality (in thousands of euros).

TaxForms (x6): The number of tax declarations for the municipality that were submitted to the tax office.

HotelRestaurant (x7): The number of hotels and restaurants (added together) in the municipality.

Industries (x8): Number of industrial firms in the municipality.

HealthSocial (x9): The number of health care and social service facilities in the municipality.

Each of you will study a subset of these data, and use the following code to get your sub-dataset.

Note that the provided code serves only as a hint; you will need to make changes to it.

Constructing your own dataset:

code = 753031
fulldata = read.csv("HousePrices.txt", sep = " ", header = TRUE)

# sum of the decimal digits of x
digitsum = function(x) sum(floor(x/10^(0:(nchar(x)-1)))%%10)

set.seed(code)
mysum = digitsum(code)

if((mysum %% 2) == 0) { # number is even
  rownumbers = sample(1:327, 150, replace = FALSE)
} else { # number is odd
  rownumbers = sample(309:585, 150, replace = FALSE)
}

mydata = fulldata[rownumbers,]

This way you have taken a sample of 150 municipalities, either from the Flanders region plus the Brussels-capital area, or from the Walloon region plus the Brussels-capital area. Now, based on your own sub-dataset, answer the following questions one by one.
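As a quick check of the construction above, the following sketch recomputes the digit sum for the example code 753031 and shows which row range is sampled. It needs no data file. The helper `digits_of` is hypothetical and is introduced only to cross-check `digitsum`.

```r
# Digit sum exactly as defined in the assignment code.
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)

# Hypothetical helper (not in the assignment): the same digits via string
# splitting, used only to cross-check digitsum().
digits_of <- function(x) as.integer(strsplit(as.character(x), "")[[1]])

code <- 753031
mysum <- digitsum(code)   # 7 + 5 + 3 + 0 + 3 + 1 = 19, an odd number

set.seed(code)
if ((mysum %% 2) == 0) {
  rownumbers <- sample(1:327, 150, replace = FALSE)
} else {
  rownumbers <- sample(309:585, 150, replace = FALSE)
}

range(rownumbers)         # all sampled rows lie within 309:585 for this code
```

For this example code the digit sum is odd, so the second branch is taken and the sample is drawn from rows 309 to 585.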

Questions to be answered:

1) Q1: Use semiparametric flexible modelling to construct a model for the median house price.

Use AIC as a method to select a final model and report on which (type of) models were

included in the search. Only for the components of the selected model that are modeled in a

nonlinear way, provide graphs. The models in this question should not treat covariates as

random effects. Give the model that you have selected in correct notation. It is alright to use

a general notation (e.g. f(x2)) for a smooth function, but you have to state which (spline)

functions you have used, and how the smoothing parameter was selected. If you want to

use the function gam from library(mgcv), the provided AIC value is compatible with

parametric AIC values when using the default option for setting the smoothing parameter.

Notes for Q1:

a) Explore all variables of “mydata”, and clearly state the distribution and the link function used in each model.

b) Clearly state how many knots you choose and why, and explain in detail how you choose the smoothing parameter.

c) Treat all variables as fixed.
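The AIC comparison described in question 1 could be sketched as follows. This is only an illustration on simulated data, not a solution: the data frame `sim`, its values, the Gaussian family with identity link, the cubic regression spline basis (`bs = "cr"`) and the basis dimension `k = 10` are all assumptions made for the example. The smoothing parameters are left to the default method of `gam`.

```r
library(mgcv)  # gam() with penalized regression splines

set.seed(1)
n <- 150
# Simulated stand-in for mydata (column names follow the assignment, values are made up).
sim <- data.frame(
  MeanIncome = runif(n, 20, 40),
  TaxForms   = runif(n, 1, 30)
)
sim$PriceHouse <- 150 + 4 * sim$MeanIncome + 30 * sin(sim$TaxForms / 5) +
  rnorm(n, sd = 10)

# Candidate models: Gaussian response, identity link (an assumption for this sketch).
fits <- list(
  linear     = gam(PriceHouse ~ MeanIncome + TaxForms, data = sim),
  smooth_tax = gam(PriceHouse ~ MeanIncome + s(TaxForms, bs = "cr", k = 10),
                   data = sim),
  smooth_all = gam(PriceHouse ~ s(MeanIncome, bs = "cr", k = 10) +
                     s(TaxForms, bs = "cr", k = 10), data = sim)
)

aic  <- sapply(fits, AIC)          # one AIC value per candidate model
best <- names(which.min(aic))      # name of the model with the smallest AIC
print(round(aic, 1))

# Plot only the smooth (nonlinear) components of the selected model, if it has any.
if (length(fits[[best]]$smooth) > 0) plot(fits[[best]], pages = 1)
```

In a real analysis the list of candidates would cover all covariates of interest, and the report would name every model type included in the search.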

2) For this question you use the response and only the covariates x6 (number of tax forms) and

x9 (number of health care and social service facilities). State the null hypothesis of a

parametric additive model for the median house price with quadratic effects for both

covariates. Test this hypothesis using an order selection test against a nonparametric

alternative hypothesis, report the hypotheses, the construction of the test statistic, its value,

as well as the corresponding p-value and draw the correct conclusion.

Notes for Q2:

a) Test whether an additive model with quadratic effects in the two covariates (x6 and x9) is adequate.

b) Clearly describe how to carry out a proper test, including all steps of the hypothesis test and how they lead to the conclusion.
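One possible way to set up an order selection test for question 2 is sketched below on simulated data. Several choices here are assumptions rather than requirements of the assignment: the nested sequence of alternatives (orthogonal polynomial terms of degree 3 and higher, added one at a time and alternating between x6 and x9), the maximum degree `K = 6`, the Gaussian likelihood, and the use of the limiting distribution P(T <= t) = exp(-sum_j P(chisq_j > j t) / j) for the p-value. Your course notes may prescribe a different sequence or basis.

```r
# p-value from the limiting distribution of the order selection statistic;
# the infinite series is truncated at jmax terms.
os_pvalue <- function(t, jmax = 200) {
  j <- 1:jmax
  1 - exp(-sum(pchisq(j * t, df = j, lower.tail = FALSE) / j))
}

set.seed(2)
n  <- 150
x6 <- runif(n)                       # simulated stand-in for TaxForms
x9 <- runif(n)                       # simulated stand-in for HealthSocial
y  <- 1 + 2 * x6 - x6^2 + x9 + rnorm(n, sd = 0.5)   # generated under the null

K  <- 6                              # highest polynomial degree considered (assumption)
P6 <- poly(x6, K)                    # orthogonal polynomial basis in x6
P9 <- poly(x9, K)                    # orthogonal polynomial basis in x9
X0 <- cbind(1, P6[, 1:2], P9[, 1:2]) # null model: quadratic in both covariates

# Extra terms in the order x6^3, x9^3, x6^4, x9^4, ...
extra <- do.call(cbind, lapply(3:K, function(d) cbind(P6[, d], P9[, d])))

rss  <- function(X) sum(lm.fit(X, y)$residuals^2)
rss0 <- rss(X0)

# Gaussian likelihood ratio 2(L_m - L_0) = n * log(RSS_0 / RSS_m), divided by m.
ratios <- sapply(seq_len(ncol(extra)), function(m) {
  n * log(rss0 / rss(cbind(X0, extra[, 1:m, drop = FALSE]))) / m
})

T_os    <- max(ratios)               # order selection test statistic
p_value <- os_pvalue(T_os)
```

As a sanity check, `os_pvalue(4.18)` is close to 0.05, which matches the commonly quoted 5% critical value of the order selection test.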

3) In this question a parametric (generalized) linear mixed effect model should be constructed.

(i) Make a graphical presentation that supports why you suggest a certain mixed effect structure

using x2 Province as the grouping variable. Construct the plot illustrating whether there is an

effect of Province when regressing y on x6 the number of tax forms. For the plot you may

ignore all other covariates.

(ii) Construct a parametric (generalized) linear mixed effect model using your suggestion from (i).

You leave out variable x1 for this part, other covariates may be included in the model in a

parametric way. Your model should include x2 and x6, the inclusion of other covariates in

your model may be based on your answer of question 1, no fixed effect model selection

should be done for this question. Provide the model using correct notation, and give a

summary of the output. Briefly discuss whether the output supports your suggestion from (i).

Note: library(hglm) contains both hglm and hglm2, which may be used for fitting; glmmPQL (from library(MASS)) is also a possibility. If one of these functions gives problems for your dataset, try one of the other ones.

Notes for Q3:

a) Of questions 1-5, only question 3 takes random effects into consideration.
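The two parts of question 3 could be approached along the following lines. This is a minimal sketch on simulated data, assuming a Gaussian family with identity link and a random intercept per province; the province labels, the values in `sim`, and the choice of `glmmPQL` over `hglm` are assumptions made only for illustration.

```r
library(MASS)   # glmmPQL()
library(nlme)   # fixef() and ranef() for the fitted object

set.seed(3)
provinces <- paste0("P", 1:6)                     # made-up province labels
sim <- data.frame(Province = factor(rep(provinces, each = 25)))
sim$TaxForms <- runif(nrow(sim), 1, 30)           # made-up values
prov_shift <- rnorm(6, sd = 30)                   # province-specific intercept shifts
sim$PriceHouse <- 200 + 2 * sim$TaxForms +
  prov_shift[as.integer(sim$Province)] + rnorm(nrow(sim), sd = 10)

# (i) Scatter plot with one least-squares line per province.
plot(PriceHouse ~ TaxForms, data = sim,
     col = as.integer(sim$Province), pch = 16)
for (p in levels(sim$Province)) {
  d <- sim[sim$Province == p, ]
  abline(lm(PriceHouse ~ TaxForms, data = d),
         col = which(levels(sim$Province) == p))
}

# (ii) Random intercept per province; Gaussian family, identity link (assumption).
fit <- glmmPQL(PriceHouse ~ TaxForms, random = ~ 1 | Province,
               family = gaussian, data = sim, verbose = FALSE)
summary(fit)
```

Roughly parallel lines in the plot point towards a random intercept, while clearly different slopes would suggest adding a random slope, for example `random = ~ TaxForms | Province`.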

4) In this question you start from a large parametric model (no random effects, no interactions) and perform a focused search over all sub-models of the large model, for two focuses:

(i) the median price of a regular house for one municipality of your choice from your

dataset where there is a low (though not the lowest) number of industrial firms,

(ii) the median price of a regular house for one municipality of your choice from your

dataset where there is a large (though not the largest) number of hotels and restaurants.

Write the selected model for each focus using correct notation and provide the

estimated values of the focuses for both cases. Briefly discuss.

Notes for Q4:

a) Look through the 150 rows of your dataset and pick one municipality with a low number of industrial firms and another with a large number of hotels and restaurants. Then search for the best models for estimating the median house price in those two municipalities.

b) Use correct notation and clearly state the distribution, the link function, and the coefficients.
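Choosing the two focus municipalities can be done by inspection, or with a few lines of code such as the sketch below. It uses a made-up data frame `sim` in place of `mydata` and picks the municipality just above the minimum number of industrial firms and the one just below the maximum number of hotels and restaurants. Any municipality with a low, respectively large, value would be an equally valid choice; this rule is only one option.

```r
set.seed(4)
# Made-up stand-in for mydata with only the columns needed here.
sim <- data.frame(
  Municipality    = paste0("Town", 1:150),
  Industries      = rpois(150, 20),
  HotelRestaurant = rpois(150, 40)
)

# Low, but not the lowest, number of industrial firms:
# drop all rows at the minimum, then take the smallest remaining value.
cand_low <- sim[sim$Industries > min(sim$Industries), ]
low_ind  <- cand_low[which.min(cand_low$Industries), ]

# Large, but not the largest, number of hotels and restaurants:
# drop all rows at the maximum, then take the largest remaining value.
cand_high <- sim[sim$HotelRestaurant < max(sim$HotelRestaurant), ]
high_hot  <- cand_high[which.max(cand_high$HotelRestaurant), ]

print(low_ind)
print(high_hot)
```

Excluding the rows at the extreme value first avoids accidentally selecting a municipality that ties with the minimum or maximum.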

5) In this question you use the same large parametric model (no random effects, no interactions)

as you started with in question 4.

(i) Construct a table containing the vector of estimated coefficients of the regression model

using four methods:

(a) maximum likelihood estimation in the large model

(b) Ridge regression

(c) Lasso estimation

(d) An elastic net estimator, different from the ridge and lasso one.

For (b), (c) and (d) you use the software’s default value for the penalty parameter λ.

(ii) Using the four estimation methods from (i), give in a table the predictions for the median

price of a regular house for the same two municipalities as in question 4. Briefly discuss.
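The two tables of question 5 could be assembled roughly as follows with glmnet. This is a sketch on simulated data, not a solution. The data frame `sim`, the Gaussian model, the elastic net mixing value `alpha = 0.5`, and the rows `c(1, 2)` are assumptions. In particular, taking λ from `cv.glmnet` at `"lambda.1se"` is only one reading of "the software’s default value"; check which value your course intends.

```r
library(glmnet)  # penalized regression: ridge, lasso, elastic net

set.seed(5)
n <- 150
# Made-up stand-in for mydata; only numeric covariates for brevity.
sim <- data.frame(Shops = rpois(n, 30), MeanIncome = runif(n, 20, 40),
                  TaxForms = runif(n, 1, 30))
sim$PriceHouse <- 100 + 0.5 * sim$Shops + 4 * sim$MeanIncome + rnorm(n, sd = 10)

X <- model.matrix(PriceHouse ~ ., data = sim)[, -1]  # drop the intercept column
y <- sim$PriceHouse

ml <- lm(PriceHouse ~ ., data = sim)                 # (a) maximum likelihood (Gaussian)

# (b)-(d): alpha = 0 is ridge, alpha = 1 is lasso, 0 < alpha < 1 is elastic net.
alphas <- c(ridge = 0, lasso = 1, enet = 0.5)
cvfits <- lapply(alphas, function(a) cv.glmnet(X, y, alpha = a))

# (i) Table of estimated coefficients, one column per method.
coef_table <- cbind(
  ML = coef(ml),
  sapply(cvfits, function(f) as.matrix(coef(f, s = "lambda.1se"))[, 1])
)

# (ii) Table of predictions for two rows (placeholders for the municipalities of question 4).
rows <- c(1, 2)
pred_table <- cbind(
  ML = predict(ml, newdata = sim[rows, ]),
  sapply(cvfits, function(f) {
    as.numeric(predict(f, newx = X[rows, , drop = FALSE], s = "lambda.1se"))
  })
)

print(round(coef_table, 3))
print(round(pred_table, 1))
```

With factor covariates such as Province, `model.matrix` expands them into dummy columns, so the coefficient table gets one row per dummy variable.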

Note: If you would like to use a function other than glmnet for penalized estimation, here is an

alternative with a few more options. Since the syntax is quite a bit different, you might want to

adjust the lines below to your setting, if you want to use this.

library(h2o)
h2o.init()

mydat2 = as.h2o(mydata)
mydat2$Region <- as.factor(mydat2$Region)
mydat2$Province <- as.factor(mydat2$Province)

y = "PriceHouse"
X = c("Province", "Shops") # add here the variables that you wish to put in X.

alpha0 <- h2o.glm(family = "something", link = "something", x = X, y = y, alpha = 0,
                  lambda_search = TRUE, training_frame = mydat2, nfolds = 0)

# indicate the same rows as in question 4:
Xeval = as.h2o(as.data.frame(mydat2[c(1,2),]))
h2o.predict(alpha0, newdata = Xeval)

