CLASSIFYING TUMORS OF THE BREAST
For this we will use the Breast Cancer Wisconsin Dataset.
The aim here is to classify tumors of the breast as either ‘Malignant’ or ‘Benign’.
Firstly, I feel it is important to decide whether we need a Supervised or Unsupervised Machine Learning technique. Supervised ML techniques are used when we feed the algorithm the target data (usually labelled y_train), whereas in Unsupervised ML we do not give the algorithm a target; instead, it forms associations of its own and classifies the data using those associations.
The aim is clearly to classify the Cancer as Malignant or Benign so we need to use Supervised Learning techniques.
So now let’s get to it.
Let us import our modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
The role of numpy, pandas and matplotlib will be apparent within the next sections of the project.
Let us look at the scikit-learn modules used within our project.
accuracy_score:
Used to determine the accuracy of the model by comparing the predictions the model made against the actual values of the target data in our dataset.
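As a quick illustration with toy values (not from our dataset), accuracy_score simply reports the fraction of predictions that match the true labels:

```python
from sklearn.metrics import accuracy_score

# Toy labels for illustration only
y_true = [1, 2, 1, 1, 2]
y_pred = [1, 2, 2, 1, 2]

# 4 of the 5 predictions match the true labels
print(accuracy_score(y_true, y_pred))  # 0.8
```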
StandardScaler:
An important aspect of Data Science is preparing your dataset. One of the activities involved in this is Feature Scaling. Feature Scaling is about normalizing the data, which essentially means we will be centering each feature around zero with comparable spread.
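A minimal sketch of what StandardScaler does, using a toy feature column (not from our dataset): it subtracts each feature's mean and divides by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A toy feature column with values far from zero
x = np.array([[10.0], [20.0], [30.0]])

scaled = StandardScaler().fit_transform(x)
print(scaled.ravel())  # roughly [-1.22, 0., 1.22]

# Equivalent by hand: subtract the mean, divide by the std
manual = (x - x.mean()) / x.std()
print(np.allclose(scaled, manual))  # True
```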
RandomForestClassifier:
Here is the main part: the classification algorithm that forms the basis of our model.
train_test_split:
We have all our data in one csv file. This needs to be split into train and test data. For those who may not know, we do not use our training data as our testing data, for the obvious reason that the model will have already seen that data during training; testing on it could therefore report a misleadingly high (even 100 percent) accuracy.
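As a toy illustration (values made up for the demo, not from our dataset), train_test_split divides the samples into the two sets according to test_size:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with one feature each
X_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

print(len(X_tr), len(X_te))  # 7 3
```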
Let us now load our dataset, shall we?
df = pd.read_csv("data.csv")
Next we should display the dataframe to get a look-see at what we are working with.
df.head()
We need to determine what columns we are dealing with.
columns = df.columns
print(columns)
count = 0
for column in columns:
    count += 1
print(count)
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')
33
Forgive my use of this noobish counting method. We can see here that our dataset consists of 33 columns. The columns themselves are displayed above.
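A more direct way to get the same count, shown here on a toy DataFrame standing in for ours:

```python
import pandas as pd

# Toy frame standing in for our dataset
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

print(len(toy.columns))  # 3
print(toy.shape)         # (2, 3) -> (rows, columns)
```

On our actual DataFrame, len(df.columns) would give 33 directly.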
How about the data types contained in the columns…
df.dtypes
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Unnamed: 32                float64
dtype: object
All the columns contain floating point values except id, which contains integers, and diagnosis, which contains strings.
The id column has no bearing on our classification, so we need to drop it.
df = df.drop('id', axis=1)
Let us check how many unique values are in the diagnosis column so that we can decide how to code them.
in: print(df['diagnosis'].unique())
out: ['M' 'B']
We will code M for Malignant as 1 and B for Benign as 2.
Since it is a dichotomous variable column, we could code them manually using the following code:
df.replace('M', 1, inplace=True)
df.replace('B', 2, inplace=True)
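An equivalent and slightly more explicit approach is a single mapping, shown here on a toy column standing in for diagnosis:

```python
import pandas as pd

# Toy diagnosis column for illustration
s = pd.Series(['M', 'B', 'B', 'M'])

# Map both codes in one pass
coded = s.map({'M': 1, 'B': 2})
print(coded.tolist())  # [1, 2, 2, 1]
```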
MISSING VALUES
Missing values can come about for a couple of reasons, like users forgetting to fill in a field, data loss during transfer, or programming errors. There are different types of missing data, but we shall not delve into all that.
We deal with missing values either by entering the median or mean value into that feature, or by simply ignoring the whole instance if the number of missing values is small.
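Both strategies can be sketched on a toy column with one missing value (illustrative, not from our dataset):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
s = pd.Series([1.0, 2.0, np.nan, 4.0])

# Option 1: fill with the median of the observed values
filled = s.fillna(s.median())
print(filled.tolist())   # [1.0, 2.0, 2.0, 4.0]

# Option 2: drop the instances that contain missing values
dropped = s.dropna()
print(dropped.tolist())  # [1.0, 2.0, 4.0]
```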
Let us check the number of missing values for each feature.
in: print(df.isnull().sum())
out:
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
Unnamed: 32              569
dtype: int64
From this we can ascertain that there are no missing values in any of the columns except the last column, which consists exclusively of null values. Let us drop this column.
df.rename(columns={'Unnamed: 32': 'Last Column'}, inplace=True)
df.drop('Last Column', axis=1, inplace=True)
Our features dataset will be called X, whereas our target data is y.
in: X = df.drop(['diagnosis'], axis=1)
in: y = df['diagnosis']
FEATURE SCALING
We previously discussed what feature scaling is all about. The point of bringing all the variables closer to zero is to bring the variances of the features closer together. If one feature were to have a much greater variance, the ML model would be led to believe that this feature has a greater effect on determining the target.
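The effect on variance can be seen on two toy features with very different scales (made-up values, not from our dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

print(X_demo.var(axis=0))  # very different variances per feature

scaled = StandardScaler().fit_transform(X_demo)
print(scaled.var(axis=0))  # both features now have variance 1.0
```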
For our example we use StandardScaler.
in: X = StandardScaler().fit_transform(X)
SPLITTING THE DATA
At this stage we will split the data into training and test data for the reasons stated previously.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
ADDING OUR CLASSIFICATION ALGORITHM TO THE MIX.
Declare a variable called model and assign it the classification algorithm:
model = RandomForestClassifier()
Fit the data into our model using the .fit() method:
model.fit(X_train, y_train)
Make some predictions using the .predict() method
pred = model.predict(X_test)
Now we can check the accuracy of our model. Note that accuracy_score expects the true labels first, then the predictions.
in: accuracy = accuracy_score(y_test, pred)
in: print(accuracy)
out : 0.9429824561403509
In my first attempt I made a couple of mistakes, like failing to standardize the data and leaving the id column in. This affected the accuracy of my model. I think it helps to receive insights from a Data Scientist who is not very experienced, because I explain things in a way that is easily understandable.