An Outline of the Modules in Data Science

Thomas Tsuma
5 min read · Jan 27, 2020

Anyone who has ever had a conversation about programming with me knows how much I love to spam the phrase, “You have to be ready not to know anything”. Looking back, I can still remember being a total greenhorn (forgive my use of this cliché) on matters concerning DS. I am by no measure an expert on the matter, but I wouldn’t count myself as a slouch either.

A lot of experts recommend a top-down approach when trying to learn Data Science and Machine Learning, and I feel the same way. Let me paint you a brief picture of what you are going to do in any DS or ML class.

Import your modules.

This probably seems really obvious to anyone who uses Python. After doing multiple projects and binge-ing countless YouTube tutorials, these lines almost become a song to you.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.(insert sub-module) import …

These are absolutely essential before starting your project. However, you may need a lot more than these if you are a Pythonista such as myself.

So what are these modules and what are they for? I will take a unique (and frowned-upon) approach to describing what they are and what they do.

1. Pandas

I can feel the experts sucking their teeth in pre-discontent already. Pandas is among the most-used libraries/frameworks in the world, and although that has no bearing on it, I still feel it should be the first module taught in any Data Science class or publication.

What comes after the import statements is usually adding the dataset to be worked on to your current project. For the newbies: the datasets used in DS are usually downloaded externally as .csv or .xlsx files. One needs to use the pandas functions pd.read_csv("file location of the .csv file") or pd.read_excel("file location of the .xlsx file") to read the dataset. The dataset is read in as a DataFrame, hence it is common practice to save it in a variable called df. It’ll look a little like this:

import statements…

df = pd.read_csv("C:/Users/UserName/Dataset.csv")

or

df = pd.read_excel("C:/Users/UserName/Dataset.xlsx")

Next is to see what your dataset contains. For this we use the .head() or .tail() functions of pandas to display the first five or last five rows, respectively, of the dataset:

df.head()

Data exploration tools in pandas such as df.corr(), pandas.plotting.scatter_matrix(), df.hist() and df.plot.bar() must always be at the tip of your tongue as well.
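Here is a minimal sketch putting those exploration tools together; the DataFrame and its column names are made up purely for illustration:

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Tiny made-up dataset; in practice df comes from pd.read_csv()
df = pd.DataFrame({'Passengers': [10, 25, 40, 55],
                   'Delay': [2, 5, 7, 11]})

print(df.corr())    # pairwise correlation matrix of the numeric columns
scatter_matrix(df)  # grid of pairwise scatterplots
df.hist()           # one histogram per numeric column
df['Passengers'].plot.bar()  # bar chart of a single column
plt.show()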

2. NumPy

This is usually taught first and don’t get me wrong, there are solid reasons why it should be. A majority of DS and ML involves working with arrays, and it is important to learn how to work with NumPy arrays. We create a NumPy array by using the np.array() function.

Take the following example:

stations = np.array(df['Stations'])

np.unique(stations)

Now, the first line is done so as to load the column with the header ‘Stations’ into a NumPy array called ‘stations’.

The juicy part is the second line. The np.unique() function is used in scenarios where a column contains nominal, dichotomous or ordinal values. It is pretty useless when the column contains continuous variables.

Don’t fret. I’m going to tell you exactly what the previously mentioned value types are, with a short example after the list.

1. Nominal Variable-Columns

These are columns that contain a limited set of values that can be represented by numbers. Note, however, that the numbers representing them do not imply a higher or lower value. For example, take a column containing any mix of the values “up”, “down”, “left” and “right”. The unique() function would display a list with exactly one of each of these values.

2. Dichotomous Variable-Columns

These are columns that contain only two types of values, which can be coded as any two numbers. For example, when diagnosing breast cancer patients, the target data (usually called y_train) contains the type of diagnosis (benign or malignant). One could choose to code ‘Benign’ as ‘1’ and ‘Malignant’ as ‘2’.

3. Ordinal Variable-Columns

While these may look very similar to nominal columns, the difference is that the order of the codes is meaningful: a change in the code also represents a higher or lower value.
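As promised, here is a small sketch of np.unique() on a nominal column; the values are made up:

import numpy as np

directions = np.array(['up', 'down', 'left', 'up', 'right', 'down'])
print(np.unique(directions))  # -> ['down' 'left' 'right' 'up'], one of each value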

NumPy arrays provide an ease of manipulation and a swiftness of computation that normal lists just cannot match.

Another useful NumPy function is np.arange(), where we pass the start value, end value and step value as parameters.
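For instance (the values here are arbitrary):

import numpy as np

print(np.arange(0, 10, 2))  # start 0, stop 10 (exclusive), step 2 -> [0 2 4 6 8]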

The NumPy shape attribute shows the number of rows and columns in an array, and the reshape() function changes the shape of the array, for example turning a single row into a single column and vice versa. For this we use a combination of 1 and -1 as parameters, with -1 telling NumPy to infer that dimension.
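A quick sketch of shape and reshape (the array is arbitrary):

import numpy as np

a = np.arange(6)             # [0 1 2 3 4 5]
print(a.shape)               # (6,)
column = a.reshape(-1, 1)    # -1 lets NumPy infer the number of rows
print(column.shape)          # (6, 1)
row = column.reshape(1, -1)  # back to a single row
print(row.shape)             # (1, 6)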

3. Matplotlib

An integral part of DS or ML is being able to visualize your dataset and solutions. As a matter of fact, data visualization is regarded by some as the most important skill. This is because one needs to be able to explain what they are doing, and graphs, charts and scatterplots do exactly that for you.

The simplest function is the .plot() function, which takes two arrays of continuous values as its parameters. It draws a line graph with the first array as the x-axis and the second array as the y-axis.

One can change the x-axis label, y-axis label and title using the .xlabel(), .ylabel() and .title() functions, respectively.

A scatterplot is drawn using the .scatter() function.

Bar graphs are drawn using the .bar() function.
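A minimal sketch putting all of these together; the data is made up:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10, 1)
y = x ** 2

plt.plot(x, y)  # line graph: first array on the x-axis, second on the y-axis
plt.xlabel('x values')
plt.ylabel('y values')
plt.title('A simple line graph')
plt.show()

plt.scatter(x, y)  # the same data as a scatterplot
plt.show()

plt.bar(x, y)      # and as a bar graph
plt.show()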

Matplotlib is even used to display images (via plt.imshow()), in case you are a TensorFlow image classification man like myself.

4. Scikit-learn

Finally, sklearn, the ‘beef’ of ML. Where do I even start? I suppose I should just talk about Supervised versus Unsupervised Learning. In Supervised learning, the Machine Learning model you choose is handed the features of the dataset along with the target data to be used for training. In Unsupervised learning, however, the ML model is fed the features of the dataset but the target data is excluded. This forces the model to form associations of its own.

There are quite a number of models contained in sklearn, but since this is the heart of ML I won’t get deep into any single one and will instead classify them, because I think it’s a better way to introduce them.

In my personal opinion, it’s absolutely paramount to understand clustering, classification and regression models and their points of application.

Clustering algorithms are applied in Unsupervised learning to group the instances into a finite number of classes based on their features or patterns. In Python, one would employ the services of KMeans clustering or MeanShift, the difference between the two being that in KMeans you specify the number of clusters.
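A minimal KMeans sketch; the points are made up and form two obvious groups:

import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2)  # with KMeans you choose the number of clusters
kmeans.fit(X)
print(kmeans.labels_)          # the cluster each point was assigned to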

Classification algorithms are used in Supervised learning where, as previously mentioned, the labels have been predefined (usually as a variable y). RandomForestClassifier is probably your easiest bet for a good classification algorithm in Python. Its anatomy is quite complex, but it boils down to aggregating the votes of a series of Decision Trees.
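A short sketch using the breast cancer diagnosis example from earlier, which ships with sklearn:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # features and benign/malignant labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100)  # an aggregate of decision trees
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out data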

Regression algorithms are used in situations whereby the target dataset contains a set of continuous variables. An example is a dataset that requires us to determine the house prices in different neighborhoods. Knowledge of Linear Regression is an obvious prerequisite to this. Off the top of my head, I would recommend the RandomForestRegressor as your go-to for regression problems.
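And the same pattern for regression, using the California housing prices dataset (sklearn downloads it on first use):

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)  # features and house prices
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=100)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 score on the held-out data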

Feel free to criticize this work. It is my first, and I am sure it has as many errors as some of the code in your programs.
