PREDICTING HOUSE PRICES USING REGRESSION
One of the classic playground datasets for data scientists is the House Prices dataset. I will use it to demonstrate how to complete a data science project involving regression.
HOW TO TELL WHAT ML TECHNIQUE TO USE
Before taking up any DS project, one should always try to get an understanding of what exactly the output of the project should be. You can do that in a couple of ways:
1. Check the sample submission files
2. Determine the target data
Doing this helps you determine whether you will employ the services of a Supervised or Unsupervised ML technique. If there is a target data (like in this case) then you will probably use Supervised ML techniques like Regression or Classification. Our target data is the SalePrice of the house.
The SalePrice is a continuous variable, thus we need to use Regression. Classification would be appropriate if the target were ordinal, nominal, or dichotomous.
So it’s settled, we will employ supervised regression ML techniques.
OUR DEPENDENCIES
I have a habit of wanting all my import statements at the top. It doesn’t necessarily mean that you need to know everything you need before you start (I sure as hell don’t).
For the handling of DataFrames and Series:
import pandas as pd
For the arrays:
import numpy as np
To enable us to replace Categorical variables with numerical variables for our ML model:
from sklearn.preprocessing import LabelEncoder
For feature scaling:
from sklearn import preprocessing
To create our ML Regression model:
from sklearn.ensemble import RandomForestRegressor
To evaluate our model (note that accuracy_score is meant for classification; since this is a regression problem we will use an error metric instead):
from sklearn.metrics import mean_squared_error
LOADING OUR DATA
The following three lines of code will load the data into variables as pandas DataFrames:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")
Note that any transformation done to the training data must also be applied to the test data, because once the model has been trained it expects input with the same columns and encodings.
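One convenient way to keep train and test in sync is to put all cleaning steps into a single function applied to both. A minimal sketch (the `preprocess` name and the single fill step shown are placeholders for whichever steps we settle on below):

```python
import pandas as pd
import numpy as np

def preprocess(df):
    # Every cleaning step lives here, so train and test
    # are guaranteed to go through identical transformations.
    df = df.copy()
    df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
    return df
```

Then `train = preprocess(train)` and `test = preprocess(test)` keep both frames consistent.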
Let’s have a look-see at our data then:
train.head(20)
test.head(20)
sample_submission.head(20)
This will display the first twenty rows in the data under their respective column heads. We see here that train.csv contains 81 columns whereas test.csv contains 80 columns. The SalePrice column has been excluded since it is what we need to predict. The sample_submission.csv contains two columns that are the Id and SalePrice.
Data Exploration: Dealing with null values
Let us check for null values:
train.isnull().any()
Id False
MSSubClass False
MSZoning False
LotFrontage True
LotArea False
…
MoSold False
YrSold False
SaleType False
SaleCondition False
SalePrice False
Length: 81, dtype: bool
This returns, for each column, True or False depending on whether that column contains null values.
To get the total number of null values per column:
train.isnull().sum().head(50)
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
Street 0
Alley 1369
LotShape 0
LandContour 0
Utilities 0
LotConfig 0
LandSlope 0
Neighborhood 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 0
Exterior2nd 0
MasVnrType 8
MasVnrArea 8
ExterQual 0
ExterCond 0
Foundation 0
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinSF1 0
BsmtFinType2 38
BsmtFinSF2 0
BsmtUnfSF 0
TotalBsmtSF 0
Heating 0
HeatingQC 0
CentralAir 0
Electrical 1
1stFlrSF 0
2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
BsmtFullBath 0
BsmtHalfBath 0
FullBath 0
train.isnull().sum().tail(30)
HalfBath 0
BedroomAbvGr 0
KitchenAbvGr 0
KitchenQual 0
TotRmsAbvGrd 0
Functional 0
Fireplaces 0
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageCars 0
GarageArea 0
GarageQual 81
GarageCond 81
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
3SsnPorch 0
ScreenPorch 0
PoolArea 0
PoolQC 1453
Fence 1179
MiscFeature 1406
MiscVal 0
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
We can see there are a number of columns with an alarmingly large number of null values, given that the training set contains 1460 rows in total. Let’s go to data_description.txt to figure out what these columns represent and how to deal with the null values.
Alley: Type of alley access to the property. Gravel, Paved or No Alley access.
1369/1460 null values.
PoolQC: Pool quality which could be
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
1453/1460 null values
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
1179/1460 null values
MiscFeature: Miscellaneous features not covered in other columns
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
1406/1460 null values.
Due to the overwhelmingly large number of null values in these columns, we cannot impute them: doing so would skew our results far too much. It makes more sense to drop these columns entirely.
train.drop(["Alley", "PoolQC", "Fence", "MiscFeature"], axis=1, inplace=True)
Now let’s continue
LotFrontage: Linear feet of street connected to the property
259 null values.
This column contains continuous values hence we should fill it with the median or mean value.
median = train['LotFrontage'].median()
train['LotFrontage'].fillna(median, inplace=True)
BsmtQual: Evaluates the height of the basement
This column contains a discrete variable, hence we can impute using the modal value. Note that mode() returns a Series, so we take its first element:
mode = train['BsmtQual'].mode()[0]
train['BsmtQual'].fillna(mode, inplace=True)
BsmtCond: Evaluates the general condition of the basement
Requires the same operations as the BsmtQual
BsmtExposure: Refers to walkout or garden level walls
Requires the same operations as BsmtQual
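Since BsmtQual, BsmtCond, and BsmtExposure all get the same treatment, the fills can be wrapped in one helper. A sketch (the `fill_with_mode` name is my own):

```python
import pandas as pd

def fill_with_mode(df, cols):
    # Impute each listed categorical column with its most frequent value.
    for col in cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    return df
```

Applied to our data: `train = fill_with_mode(train, ["BsmtQual", "BsmtCond", "BsmtExposure"])`.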
GarageYrBlt: Year garage was built
It would make sense to set this to the same date the house was built.
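A sketch of that fill, using fillna with the YearBuilt column as the fallback (the `fill_garage_year` helper is my own naming):

```python
import pandas as pd

def fill_garage_year(df):
    # Assume a missing GarageYrBlt means the garage, if any,
    # dates from the year the house itself was built.
    df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
    return df
```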
FEATURE TRANSFORMATION
We can replace the YrBuilt and YrRemodAdd columns with values that represent the actual time since.
trainYrsSinceBuild = []
trainYrsSinceRenov = []
testYrsSinceBuild = []
testYrsSinceRenov = []
for i in train['YearBuilt']:
    trainYrsSinceBuild.append(2020 - i)
for i in train['YearRemodAdd']:
    trainYrsSinceRenov.append(2020 - i)
train['YrsSinceBuild'] = trainYrsSinceBuild
train['YrsSinceRenov'] = trainYrsSinceRenov
train.drop(['YearBuilt', 'YearRemodAdd'], axis=1, inplace=True)
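Because pandas subtraction is vectorized, the same transformation can be written without explicit loops. A sketch (the `add_age_features` name and the 2020 reference year are assumptions mirroring the loops above):

```python
import pandas as pd

def add_age_features(df, reference_year=2020):
    # Vectorized equivalent of the append loops: subtract whole
    # columns at once, then drop the original year columns.
    df = df.copy()
    df["YrsSinceBuild"] = reference_year - df["YearBuilt"]
    df["YrsSinceRenov"] = reference_year - df["YearRemodAdd"]
    return df.drop(["YearBuilt", "YearRemodAdd"], axis=1)
```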
Our ML model can only accept numerical values, so we need to encode the non-numerical ones. This can be accomplished with label encoding.
First, we need to obtain a subset of the training dataset containing only object datatype values.
train_categorical_cols = train.select_dtypes('object')
Then we use LabelEncoder class to transform the values.
lblEncoder = LabelEncoder()
train_categorical_cols = train_categorical_cols.apply(lblEncoder.fit_transform)
We can then recombine the encoded columns with the rest of the training dataset by selecting the columns that aren't strings and joining them with train_categorical_cols, which solely contained object-type columns.
trainMinusStrings = train.select_dtypes(exclude='object')
train = pd.concat([train_categorical_cols, trainMinusStrings], axis=1)
Correlation Coefficient.
We can now look at how each column correlates with the target variable. Let us create a new dataset containing only the columns with a high positive or high negative correlation coefficient, and train a separate model on those columns to see how it affects performance.
To extract these column heads (here new_train is the preprocessed training set):
prices = train['SalePrice']
cols_with_high_corr = []
for i in new_train.columns:
    corr = new_train[i].corr(prices)
    if corr >= 0.50 or corr <= -0.50:
        print(i)
        print(corr)
        cols_with_high_corr.append(i)
print(cols_with_high_corr)

high_corr_train = pd.DataFrame()
for i in cols_with_high_corr:
    high_corr_train[i] = new_train[i]
Feature Scaling
We need to avoid letting features with a higher variance dominate the prediction of the SalePrice, so we scale the data such that every column lies on the same scale. In our case we will scale each column to the [0, 1] range.
First let us drop the ID column because it should have no bearing on what the SalePrice should be.
train.drop('Id', axis=1, inplace=True)
Then let us set aside the SalePrice column and drop it from the features:
saleprice = train['SalePrice']
train.drop('SalePrice', axis=1, inplace=True)
Now let us scale the data:
DF_train = preprocessing.minmax_scale(train)
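minmax_scale maps each column linearly onto [0, 1] via (x - min) / (max - min). A quick check on toy data shows both columns end up on the same scale regardless of their original ranges:

```python
import numpy as np
from sklearn import preprocessing

# Two columns with very different ranges.
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# After scaling, each column runs from 0 to 1.
scaled = preprocessing.minmax_scale(data)
```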
Training our model
model = RandomForestRegressor()
model.fit(DF_train, saleprice)
predictions = model.predict(test)
predictions_df = pd.DataFrame()
predictions_df['Id'] = test['Id']
predictions_df['SalePrice'] = predictions
Remember that test must go through the same cleaning, encoding, and scaling steps as train before calling predict.
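Finally, the predictions can be written to a CSV in the same two-column format as sample_submission.csv. A sketch with dummy values standing in for the real predictions_df:

```python
import pandas as pd

# Toy stand-in for the real predictions_df built above.
predictions_df = pd.DataFrame({"Id": [1461, 1462],
                               "SalePrice": [169000.0, 187500.0]})

# index=False keeps pandas from writing an extra unnamed index column.
predictions_df.to_csv("submission.csv", index=False)
```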