PREDICTING HOUSE PRICES USING REGRESSION
One of the classic playground datasets for data scientists is the House Prices dataset. I will use it to demonstrate how to complete a data science project involving regression.
HOW TO TELL WHAT ML TECHNIQUE TO USE
Before taking up any DS project, one should always try to get an understanding of what exactly the output of the project should be. You can do that in a couple of ways:
1. Check the sample submission files
2. Determine the target data
Doing this helps you determine whether you will employ the services of a Supervised or Unsupervised ML technique. If there is a target data (like in this case) then you will probably use Supervised ML techniques like Regression or Classification. Our target data is the SalePrice of the house.
The SalePrice is a continuous variable, thus we need to use Regression. Classification would be appropriate if the target were ordinal, nominal, or dichotomous.
So it’s settled, we will employ supervised regression ML techniques.
OUR DEPENDENCIES
I have a habit of wanting all my import statements at the top. It doesn’t necessarily mean that you need to know everything you need before you start (I sure as hell don’t).
For the handling of DataFrames and Series:
import pandas as pd
For the arrays:
import numpy as np
To enable us to replace Categorical variables with numerical variables for our ML model:
from sklearn.preprocessing import LabelEncoder
For feature scaling:
from sklearn import preprocessing
To create our ML Regression model:
from sklearn.ensemble import RandomForestRegressor
To evaluate our model (note that accuracy_score is meant for classification; since this is a regression problem we will use an error metric instead):
from sklearn.metrics import mean_squared_error
LOADING OUR DATA
The following three lines of code will load the data into variables as pandas DataFrames:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")
Note that any transformation done to the training data must also be applied to the test data, because once the model has been trained it expects input with the same columns and encodings.
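One convenient way to keep train and test in sync is to put all cleaning steps into a single function applied to both. A minimal sketch (the `preprocess` name and the single fill step shown are placeholders for whichever steps we settle on below):

```python
import pandas as pd
import numpy as np

def preprocess(df):
    # Every cleaning step lives here, so train and test
    # are guaranteed to go through identical transformations.
    df = df.copy()
    df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
    return df
```

Then `train = preprocess(train)` and `test = preprocess(test)` keep both frames consistent.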
Let’s have a look-see at our data then:
train.head(20)
test.head(20)
sample_submission.head(20)
This will display the first twenty rows in the data under their respective column heads. We see here that train.csv contains 81 columns whereas test.csv contains 80 columns. The SalePrice column has been excluded since it is what we need to predict. The sample_submission.csv contains two columns that are the Id and SalePrice.
Data Exploration: Dealing with null values
Let us check for null values:
train.isnull().any()
Id False
MSSubClass False
MSZoning False
LotFrontage True
LotArea False
…
MoSold False
YrSold False
SaleType False
SaleCondition False
SalePrice False
Length: 81, dtype: bool
This returns, for each column, True or False depending on whether that column contains null values.
To get the total number of null values per column:
train.isnull().sum().head(50)
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
Street 0
Alley 1369
LotShape 0
LandContour 0
Utilities 0
LotConfig 0
LandSlope 0
Neighborhood 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 0
Exterior2nd 0
MasVnrType 8
MasVnrArea 8
ExterQual 0
ExterCond 0
Foundation 0
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinSF1 0
BsmtFinType2 38
BsmtFinSF2 0
BsmtUnfSF 0
TotalBsmtSF 0
Heating 0
HeatingQC 0
CentralAir 0
Electrical 1
1stFlrSF 0
2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
BsmtFullBath 0
BsmtHalfBath 0
FullBath 0
train.isnull().sum().tail(30)
HalfBath 0
BedroomAbvGr 0
KitchenAbvGr 0
KitchenQual 0
TotRmsAbvGrd 0
Functional 0
Fireplaces 0
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageCars 0
GarageArea 0
GarageQual 81
GarageCond 81
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
3SsnPorch 0
ScreenPorch 0
PoolArea 0
PoolQC 1453
Fence 1179
MiscFeature 1406
MiscVal 0
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
We can see there are a number of columns with an alarmingly large number of null values, given that the training set contains 1460 rows in total. Let’s go to data_description.txt to figure out what these columns represent and how to deal with the null values.
Alley: Type of alley access to the property. Gravel, Paved or No Alley access.
1369/1460 null values.
PoolQC: Pool quality which could be
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
1453/1460 null values
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
1179/1460 null values
MiscFeature: Miscellaneous features not covered in other columns
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
1406/1460 null values.
Due to the overwhelmingly large number of null values in these columns, we cannot impute them: doing so would skew our results far too much. It makes more sense to drop these columns entirely.
train.drop(["Alley", "PoolQC", "Fence", "MiscFeature"], axis=1, inplace=True)
Now let’s continue
LotFrontage: Linear feet of street connected to the property
259 null values.
This column contains continuous values hence we should fill it with the median or mean value.
median = train['LotFrontage'].median()
train['LotFrontage'].fillna(median, inplace=True)
BsmtQual: Evaluates the height of the basement
This column contains a discrete variable, hence we can impute using the modal value. Note that mode() returns a Series, so we take its first element:
mode = train['BsmtQual'].mode()[0]
train['BsmtQual'].fillna(mode, inplace=True)
BsmtCond: Evaluates the general condition of the basement
Requires the same operations as the BsmtQual
BsmtExposure: Refers to walkout or garden level walls
Requires the same operations as BsmtQual
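Since BsmtQual, BsmtCond, and BsmtExposure all get the same treatment, the fills can be wrapped in one helper. A sketch (the `fill_with_mode` name is my own):

```python
import pandas as pd

def fill_with_mode(df, cols):
    # Impute each listed categorical column with its most frequent value.
    for col in cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    return df
```

Applied to our data: `train = fill_with_mode(train, ["BsmtQual", "BsmtCond", "BsmtExposure"])`.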
GarageYrBlt: Year garage was built
It would make sense to set this to the same date the house was built.
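A sketch of that fill, using fillna with the YearBuilt column as the fallback (the `fill_garage_year` helper is my own naming):

```python
import pandas as pd

def fill_garage_year(df):
    # Assume a missing GarageYrBlt means the garage, if any,
    # dates from the year the house itself was built.
    df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
    return df
```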
FEATURE TRANSFORMATION
We can replace the YrBuilt and YrRemodAdd columns with values that represent the actual time since.
trainYrsSinceBuild = []
trainYrsSinceRenov = []
testYrsSinceBuild = []
testYrsSinceRenov = []
for i in train['YearBuilt']:
    trainYrsSinceBuild.append(2020 - i)
for i in train['YearRemodAdd']:
    trainYrsSinceRenov.append(2020 - i)
train['YrsSinceBuild'] = trainYrsSinceBuild
train['YrsSinceRenov'] = trainYrsSinceRenov
train.drop(['YearBuilt', 'YearRemodAdd'], axis=1, inplace=True)
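Because pandas subtraction is vectorized, the same transformation can be written without explicit loops. A sketch (the `add_age_features` name and the 2020 reference year are assumptions mirroring the loops above):

```python
import pandas as pd

def add_age_features(df, reference_year=2020):
    # Vectorized equivalent of the append loops: subtract whole
    # columns at once, then drop the original year columns.
    df = df.copy()
    df["YrsSinceBuild"] = reference_year - df["YearBuilt"]
    df["YrsSinceRenov"] = reference_year - df["YearRemodAdd"]
    return df.drop(["YearBuilt", "YearRemodAdd"], axis=1)
```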
Our ML model can only accept numerical values, so we need to encode the non-numerical ones. This can be accomplished with label encoding.
First, we need to obtain a subset of the training dataset containing only object datatype values.
train_categorical_cols = train.select_dtypes('object')
Then we use LabelEncoder class to transform the values.
lblEncoder = LabelEncoder()
train_categorical_cols = train_categorical_cols.apply(lblEncoder.fit_transform)
We can then recombine the encoded columns with the rest of the training dataset by selecting the columns that aren't strings and joining them with train_categorical_cols, which solely contained object-type columns.
trainMinusStrings = train.select_dtypes(exclude='object')
train = pd.concat([train_categorical_cols, trainMinusStrings], axis=1)
Correlation Coefficient.
We can now look at how each column correlates with the target variable. Let us create a new dataset containing only the columns with a high positive or high negative correlation coefficient, and train a separate model on those columns to see how it affects performance.
To extract these column heads (here new_train is the preprocessed training set):
prices = train['SalePrice']
cols_with_high_corr = []
for i in new_train.columns:
    corr = new_train[i].corr(prices)
    if corr >= 0.50 or corr <= -0.50:
        print(i)
        print(corr)
        cols_with_high_corr.append(i)
print(cols_with_high_corr)

high_corr_train = pd.DataFrame()
for i in cols_with_high_corr:
    high_corr_train[i] = new_train[i]
Feature Scaling
We need to avoid letting features with a higher variance dominate the prediction of the SalePrice, so we scale the data such that every column lies on the same scale. In our case we will scale each column to the [0, 1] range.
First let us drop the ID column because it should have no bearing on what the SalePrice should be.
train.drop('Id', axis=1, inplace=True)
Then let us set aside the SalePrice column and drop it from the features:
saleprice = train['SalePrice']
train.drop('SalePrice', axis=1, inplace=True)
Now let us scale the data:
DF_train = preprocessing.minmax_scale(train)
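minmax_scale maps each column linearly onto [0, 1] via (x - min) / (max - min). A quick check on toy data shows both columns end up on the same scale regardless of their original ranges:

```python
import numpy as np
from sklearn import preprocessing

# Two columns with very different ranges.
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# After scaling, each column runs from 0 to 1.
scaled = preprocessing.minmax_scale(data)
```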
Training our model
model = RandomForestRegressor()
model.fit(DF_train, saleprice)
predictions = model.predict(test)
predictions_df = pd.DataFrame()
predictions_df['Id'] = test['Id']
predictions_df['SalePrice'] = predictions
Remember that test must go through the same cleaning, encoding, and scaling steps as train before calling predict.
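Finally, the predictions can be written to a CSV in the same two-column format as sample_submission.csv. A sketch with dummy values standing in for the real predictions_df:

```python
import pandas as pd

# Toy stand-in for the real predictions_df built above.
predictions_df = pd.DataFrame({"Id": [1461, 1462],
                               "SalePrice": [169000.0, 187500.0]})

# index=False keeps pandas from writing an extra unnamed index column.
predictions_df.to_csv("submission.csv", index=False)
```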