CLASSIFYING TUMORS OF THE BREAST

Thomas Tsuma
Jan 29, 2020

For this we will use the Breast Cancer Wisconsin Dataset.

The aim here is to classify tumors of the breast as either ‘Malignant’ or ‘Benign’.

First, I feel it is important to decide whether we need a supervised or an unsupervised machine learning technique. Supervised techniques are used when we feed the algorithm the target data (usually labelled y_train), whereas in unsupervised learning we do not give the algorithm a target at all; instead, we let it form associations of its own and group the data using those associations.

Since the aim is clearly to classify each tumor as malignant or benign, we need a supervised learning technique.

So now let’s get to it.

Let us import our modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

The roles of numpy, pandas and matplotlib will become apparent in the next sections of the project.

Let us look at the scikit-learn modules used in our project.

accuracy_score:

Used to determine the accuracy of the model by comparing the predictions it makes against the actual values of the target data in our dataset.

StandardScaler:

An important aspect of data science is preparing your dataset, and one of the activities involved in this is feature scaling. Feature scaling normalizes the data; with StandardScaler in particular, each feature is rescaled so that it is centred on zero with unit variance.
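To make that concrete, here is a minimal sketch of what standardization does to a single feature; the numbers are made up purely for illustration:

import numpy as np

# A made-up feature column, purely for illustration
x = np.array([10.0, 12.0, 14.0, 20.0])

# Standardization: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # approximately 0.0 and 1.0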

RandomForestClassifier:

Here is the main part: the classification algorithm that forms the basis of our model.

train_test_split:

We have all our data in one CSV file, and it needs to be split into training and test data. For those who may not know, we do not use our training data as our testing data, for the obvious reason that the model has already seen that data during training and could therefore record a misleadingly high accuracy.
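As a quick illustration of the split (the arrays here are made up, not from our dataset), splitting five samples with test_size = 0.4 sets two of them aside for testing:

import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(10).reshape(5, 2)   # 5 made-up samples with 2 features each
y_demo = np.array([0, 1, 0, 1, 0])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.4, random_state = 0)
print(X_tr.shape, X_te.shape)          # (3, 2) (2, 2)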

Let us now load our dataset, shall we?

df = pd.read_csv("data.csv")

Next we should display the dataframe to get a look-see at what we are working with.

df.head()

We need to determine what columns we are dealing with.

columns = df.columns
print(columns)

count = 0
for column in columns:
    count += 1
print(count)

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

33

Forgive my use of this noobish counting method. We can see here that our dataset consists of 33 columns, which are listed above.
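For what it is worth, a simpler one-liner gives the same number:

print(len(df.columns))   # 33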

How about the data types contained in the columns…

df.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Unnamed: 32                float64

All the columns hold floating-point values except id, which holds integers, and diagnosis, which contains strings.

The id column has no bearing on our classification, so we need to drop it.

df = df.drop('id', axis = 1)

Let us check how many unique values are in the diagnosis column so that we can decide how to encode them.

in: _ = df['diagnosis']
in: print(_.unique())
out: ['M', 'B']

We will code M for Malignant as 1 and B for Benign as 2.

Since it is a dichotomous variable column, we could encode the values manually using the following code:

df.replace('M', 1, inplace = True)
df.replace('B', 2, inplace = True)
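An equivalent, slightly more explicit way to do the same encoding (an alternative to the replace() calls above, not an extra step) is to map the column directly:

# Alternative to the two replace() calls: map M to 1 and B to 2 in one go
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 2})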

MISSING VALUES

Missing values can come about for a couple of reasons, such as users forgetting to fill in a field, data loss during transfer, or programming errors. There are different types of missing data, but we shall not delve into all that.

We deal with missing values either by entering the median or mean value of that feature or by simply dropping the whole instance if the number of missing values is small.
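As a sketch of those two strategies (the column name some_feature is a placeholder, not a real column in this dataset):

# Strategy 1: fill the missing values of a feature with its median (or mean)
df['some_feature'] = df['some_feature'].fillna(df['some_feature'].median())

# Strategy 2: drop the rows that contain missing values, if there are only a few
df = df.dropna(axis = 0)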

Let us check the number of missing values for each feature.

in: print(df.isnull().sum())

out:
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
Unnamed: 32                569

From this we can ascertain that there are no missing values in any of the columns except the last one, which exclusively contains null values. Let us drop this column.

df.rename(columns = {'Unnamed: 32': 'Last Column'}, inplace = True)
df = df.drop('Last Column', axis = 1)
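As an aside, pandas can drop any column that is entirely null without renaming it first; this one-liner would have the same effect:

# Alternative: drop every column whose values are all NaN
df = df.dropna(axis = 1, how = 'all')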

Our features dataset will be called X, whereas our target data is y.

in: X = df.drop(['diagnosis'], axis = 1)
in: y = df['diagnosis']

FEATURE SCALING

We previously discussed what feature scaling is all about. The point of bringing all the features onto a comparable scale is to ensure that their variances are brought closer together. If one feature had a much greater variance than the others, the model could be led to believe that this feature has a greater effect on the target than it actually does.

For our example we use StandardScaler.

in: X = StandardScaler().fit_transform(X)
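If you want to confirm what fit_transform() did, each feature should now be centred on zero with a standard deviation of one:

# After scaling, every column of X should have mean ~0 and standard deviation ~1
print(X.mean(axis = 0).round(2))
print(X.std(axis = 0).round(2))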

SPLITTING THE DATA

At this stage we will split the data into training and test data for the reasons stated previously.

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.4, random_state = 0)

ADDING OUR CLASSIFICATION ALGORITHM TO THE MIX.

Declare a variable called model and assign it the classification algorithm:

model = RandomForestClassifier()
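The classifier is used here with its default settings. If you want more control, RandomForestClassifier also accepts hyperparameters such as n_estimators and random_state; the values below are illustrative rather than tuned:

# Illustrative, untuned settings: 100 trees and a fixed seed for reproducibility
model = RandomForestClassifier(n_estimators = 100, random_state = 0)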

Fit the data to our model using the .fit() method:

model.fit(X_train,y_train)

Make some predictions using the .predict() method

pred = model.predict(X_test)

Now we can check the accuracy of our model

in: accuracy = accuracy_score(y_test, pred)

in: print(accuracy)

out : 0.9429824561403509
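Accuracy is just one number; for a fuller picture, scikit-learn's confusion_matrix and classification_report (not part of the imports above) break the results down per class:

from sklearn.metrics import confusion_matrix, classification_report

# Rows of the confusion matrix are the true classes, columns the predicted classes
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))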

On my first attempt I made a couple of mistakes, like failing to standardize the data and leaving the id column in, and this affected the accuracy of my model. I think it helps to receive insights from a data scientist who is not very experienced, because I explain things in a way that is easy to understand.


Thomas Tsuma

I am a Machine Learning and AI Engineer who enjoys writing about topics within the space of AI