Machine Learning

Sam Kamperis

JupyterLab

Relevant code and data will be provided in a JupyterLab instance available at:

https://tinyurl.com/asdanpython

Agenda

  1. Introduction to Machine Learning
  2. Python & DataFrames
  3. Supervised Machine Learning practice
  4. Visualisation

Part 1: Introduction to Machine Learning

Why Machine Learning?

  • Machine learning allows us to make data-driven decisions and predictions.
  • It can handle large and complex datasets that traditional methods struggle with.
  • Machine learning models can improve over time with more data.
  • It is widely used in various industries such as healthcare, finance, and technology.

Real-World Applications

  • Healthcare: Predicting disease outbreaks, personalized treatment plans.
  • Finance: Fraud detection, stock market prediction.
  • Technology: Image and speech recognition, recommendation systems.
  • Retail: Customer segmentation, inventory management.

Benefits of Machine Learning

  • Efficiency: Automates repetitive tasks and processes large amounts of data quickly.
  • Accuracy: Can give more accurate predictions than hand-written rules when enough representative data is available.
  • Scalability: Can be applied to various domains and scaled to handle increasing data volumes.
  • Adaptability: Models can adapt to new data and improve over time.

Types of Machine Learning

Supervised vs Unsupervised

  • Supervised: data comes with labels; tasks include classification and regression.
    • Example: predicting survival on Titanic, using k-NN or logistic regression.
  • Unsupervised: no labels are provided; we try to find structure such as clusters or reduce dimensions.
    • Example: k-means clustering groups similar points, PCA reduces dimensionality (we’ll see more of both later).

scikit-learn

We will be using models from the scikit-learn package in this lecture.

There is extensive documentation with examples available at:

https://scikit-learn.org/stable/supervised_learning.html

Difference Between Classification and Regression

Classification

  • Predicts categorical labels.
  • Example: Predicting if an email is spam or not spam.

Regression

  • Predicts continuous values.
  • Example: Predicting the price of a house.

Naming Conventions

  • Feature = predictor variable = independent variable
  • Target variable = dependent variable = response variable

Introduction to k-NN

  • k-Nearest Neighbors (k-NN) is a classification method.
  • Suitable for binary or multi-class classification.
  • Classifies a new data point based on the majority class of its k nearest neighbors.

source: https://mlarchive.com/
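
The majority-vote rule described above can be sketched in a few lines of NumPy. This is an illustrative from-scratch version, not how scikit-learn implements k-NN; the function name `knn_predict` and the toy points are assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from new_point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): three 'red' points and three 'blue' points
X_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y_toy = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])

label = knn_predict(X_toy, y_toy, np.array([8, 8]), k=3)
print(label)
```

Choosing an odd k for a two-class problem avoids tied votes.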

Part 2: Python & DataFrames

Python recap

# simple Python examples
x = 10
y = [1, 2, 3]
print("Hello from Python!", x, y)
Hello from Python! 10 [1, 2, 3]

Terminology - Dataframe

This is a commonly used data structure in data science. It is a tabular data structure with some constraints that make it simpler to work with.

Every row is an observation in the data.

Every column is a variable in the data.

A variable will have a specific data type e.g. number, text, date.

We will make frequent use of pandas dataframes.
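
As a small illustration of rows as observations and columns as typed variables, the sketch below builds a three-row DataFrame and inspects it. The column names and values are made up for this example.

```python
import pandas as pd

# Hypothetical observations: each row is one person, each column one variable
people = pd.DataFrame({
    'name': ['Ada', 'Ben', 'Cy'],
    'age': [34, 27, 51],
    'joined': pd.to_datetime(['2021-01-15', '2022-06-30', '2023-03-01']),
})

print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # one data type per column
```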

Creating Data Points in Python

import pandas as pd
# Create a DataFrame with six varied data points
data = {
    'X1': [1, 3, 5, 7, 9, 11],
    'X2': [2, 4, 6, 8, 10, 12],
    'y': ['red', 'red', 'red', 'blue', 'blue', 'blue']
}
df = pd.DataFrame(data)

# Map 'red' to 0 and 'blue' to 1
color_map = {'red': 0, 'blue': 1}
df['y'] = df['y'].map(color_map)
X1 X2 y
1 2 0
3 4 0
5 6 0
7 8 1
9 10 1
11 12 1

Exercise: DataFrame practice

Create a DataFrame with 10 random integers between 1 and 100, compute the mean of the column, and add a new Boolean column indicating whether each value is above the mean.

Solution

   val  above_mean
0    3       False
1   58        True
2   34       False
3   75        True
4   39       False
5   34       False
6   30       False
7   17       False
8   74        True
9   58        True
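
One possible way to produce a table like the one above is sketched here. The seed and the names `rng`, `df_ex` and `mean_val` are assumptions; because the original seed is not known, the values will differ from those shown.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(0)

# 10 random integers from 1 to 100 inclusive (the upper bound of integers() is exclusive)
df_ex = pd.DataFrame({'val': rng.integers(1, 101, size=10)})

# Compare every value with the column mean to get a Boolean column
mean_val = df_ex['val'].mean()
df_ex['above_mean'] = df_ex['val'] > mean_val

print(mean_val)
print(df_ex)
```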

Binary Classification Example: Titanic Dataset

In this example, we will use the Titanic dataset to predict whether a passenger survived or not.

The target variable is Survived. It is a binary value, so the model will predict 0 or 1.

import pandas as pd
titanic = pd.read_csv("data/titanic.csv")
titanic.head()
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

head() and tail() functions

We can view the first few rows of a pandas DataFrame with the head() function.

Similarly, we can view the last few rows with tail().

titanic.tail()
     PassengerId  Survived  Pclass  ...   Fare Cabin  Embarked
886          887         0       2  ...  13.00   NaN         S
887          888         1       1  ...  30.00   B42         S
888          889         0       3  ...  23.45   NaN         S
889          890         1       1  ...  30.00  C148         C
890          891         0       3  ...   7.75   NaN         Q

[5 rows x 12 columns]

describe() function

We can get summary statistics of each column with the describe() function.

titanic.describe()
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]

Exercise: explore the Titanic data

Load the Titanic dataset (as above) and use head(), tail() and describe() to inspect it. Try selecting only the Age and Fare columns.

Solution

   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
     PassengerId  Survived  Pclass  ...   Fare Cabin  Embarked
886          887         0       2  ...  13.00   NaN         S
887          888         1       1  ...  30.00   B42         S
888          889         0       3  ...  23.45   NaN         S
889          890         1       1  ...  30.00  C148         C
890          891         0       3  ...   7.75   NaN         Q

[5 rows x 12 columns]
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
    Age     Fare
0  22.0   7.2500
1  38.0  71.2833
2  26.0   7.9250
3  35.0  53.1000
4  35.0   8.0500
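
The column selection in the last part of the exercise can be sketched as follows. To keep the example self-contained it uses a small stand-in DataFrame (named `sample` here, an assumption) built from the first five rows shown above rather than reading data/titanic.csv.

```python
import pandas as pd

# Stand-in for the Titanic data: first five rows of three columns, copied from the output above
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# A list of column names inside the brackets returns a DataFrame with just those columns
age_fare = sample[['Age', 'Fare']]
print(age_fare.head())

# A single column name returns a Series instead
ages = sample['Age']
print(type(ages).__name__)
```

With the full dataset loaded as titanic, the same selection would be titanic[['Age', 'Fare']].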

pandas cheat sheets

A PDF of the pandas cheat sheet can be downloaded from the pandas documentation website at:

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

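Mathematics of \(\mathbb{R}^n\) Vectors

  • A vector in \(\mathbb{R}^n\) is an ordered list of \(n\) real numbers; an observation with \(n\) numeric features can be treated as one such vector.
  • Vector operations, such as measuring the distance between two vectors, are what allow methods like k-NN to compare observations.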

Euclidean Distance

To calculate the Euclidean distance between two points \((x_1, y_1)\) and \((x_2, y_2)\), we use the formula:

\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]

For the points \((2, 3)\) and \((8, 5)\), the Euclidean distance is calculated as follows:

\[ d = \sqrt{(8 - 2)^2 + (5 - 3)^2} = \sqrt{6^2 + 2^2} = \sqrt{36 + 4} = \sqrt{40} = 2\sqrt{10} \]

Example in \(\mathbb{R}^3\)

To calculate the Euclidean distance between two points in \(\mathbb{R}^3\), we use the formula:

\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \]

For the points \((1, 2, 3)\) and \((4, 5, 6)\), the Euclidean distance is calculated as follows:

\[ d = \sqrt{(4 - 1)^2 + (5 - 2)^2 + (6 - 3)^2} = \sqrt{3^2 + 3^2 + 3^2} = \sqrt{9 + 9 + 9} = \sqrt{27} = 3\sqrt{3} \]

General Formula in \(\mathbb{R}^n\)

Let

\[ \mathbf{X}_1 = (x_{1_1}, x_{2_1}, x_{3_1}, \ldots, x_{n_1}) \]

and

\[ \mathbf{X}_2 = (x_{1_2}, x_{2_2}, x_{3_2}, \ldots, x_{n_2}) \]

be two points in \(\mathbb{R}^n\). The general formula for the Euclidean distance between them is:

\[ d = \sqrt{\sum_{i=1}^{n} (x_{i_2} - x_{i_1})^2} \]

where \(x_{i_1}\) and \(x_{i_2}\) are the coordinates of the two points in the \(i\)-th dimension.
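
The general formula can be checked numerically against the two worked examples. This is a minimal sketch; the helper name `euclidean_distance` is an assumption for illustration.

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square the coordinate differences, sum them, take the square root
    return np.sqrt(np.sum((q - p) ** 2))

d2 = euclidean_distance([2, 3], [8, 5])        # sqrt(40)
d3 = euclidean_distance([1, 2, 3], [4, 5, 6])  # sqrt(27)
print(d2, d3)

# NumPy's built-in vector norm gives the same result
d2_norm = np.linalg.norm(np.array([8, 5]) - np.array([2, 3]))
print(d2_norm)
```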

Part 3: Supervised Machine Learning

In this section we look at practical examples of supervised learning using scikit-learn. We’ll revisit the Titanic dataset and build models, then explore k-NN, logistic regression and the Iris dataset.

Titanic dataset – preparing the data

# Import the module
from sklearn.model_selection import train_test_split

# Reload and prepare the Titanic dataset
import pandas as pd
titanic = pd.read_csv("data/titanic.csv")
X = titanic.drop("Survived", axis=1).values
y = titanic["Survived"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"train target proportion: {y_train.sum()/len(y_train):.3f}")
train target proportion: 0.383
print(f"test  target proportion: {y_test.sum()/len(y_test):.3f}")
test  target proportion: 0.385

Exercise: change split ratio

Re-split the data using a 30% test set and print the proportions again.

Solution

0.38362760834670945
0.3843283582089552
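
A sketch of the 30% split is shown below. So that it runs without the data file, it uses a stand-in target with the same class balance as the Survived column (891 rows with mean 0.383838, which is 342 ones); the names `X_demo` and `y_demo` and the placeholder feature column are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in target with the same class balance as Titanic's Survived column;
# the single feature column is only a placeholder
y_demo = np.array([1] * 342 + [0] * 549)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify keeps the proportion of 1s almost identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(y_tr.mean())  # proportion of 1s in the training set
print(y_te.mean())  # proportion of 1s in the test set
```

With the real data, passing X and y from the earlier cell in place of X_demo and y_demo gives the same kind of split.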

Explanation of Variables

  • X1 and X2 are the independent variables in the small DataFrame df created in Part 2.
  • y is the target variable, which is binary (0 for red, 1 for blue).

Plotting Data Points using Matplotlib

import matplotlib.pyplot as plt

# Plot the data points
colors = {0: 'r', 1: 'b'}
plt.scatter(df['X1'], df['X2'], c=df['y'].map(colors))
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('Data Points')
plt.show()

scikit-learn Syntax

from sklearn.module import Model  # placeholders: substitute a real module and estimator class

# Create an instance of the model
model = Model()

# Fit the model to the data
model.fit(X, y)

# Make predictions on new data
predictions = model.predict(X_new)

# Print the predictions
print(predictions)

scikit-learn Syntax with Train Test Split

from sklearn.module import Model  # placeholders: substitute a real module and estimator class
from sklearn.model_selection import train_test_split

# Create an instance of the model
model = Model()

# Split the dataset into training and testing sets
# (prop is a placeholder for the test fraction, e.g. 0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=prop)

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Print the accuracy
print(model.score(X_test, y_test))

k-NN Example

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Prepare the data
X = df[['X1', 'X2']].values
y = df['y'].values

# Create and train the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
KNeighborsClassifier(n_neighbors=3)
# Predict the class of a new point
new_point = np.array([[8, 8]])
predicted_class = knn.predict(new_point)
predicted_color = 'red' if predicted_class[0] == 0 else 'blue'  # predict() returns an array

print(f'The new point {new_point} is classified as {predicted_color}.')

k-NN Example Output

KNeighborsClassifier(n_neighbors=3)
The new point [[8 8]] is classified as blue.

Exercise: tweak k-NN

Change the number of neighbours to 5 and try classifying point [2,2]. What colour do you get?

Solution

KNeighborsClassifier()
[0]
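
Code along the following lines would give this result. It rebuilds the six points from the earlier DataFrame so that it runs on its own; the names `df_knn`, `knn5` and `pred` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Same six points as in the earlier DataFrame (0 = red, 1 = blue)
df_knn = pd.DataFrame({
    'X1': [1, 3, 5, 7, 9, 11],
    'X2': [2, 4, 6, 8, 10, 12],
    'y': [0, 0, 0, 1, 1, 1],
})

# Five neighbours instead of three
knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(df_knn[['X1', 'X2']].values, df_knn['y'].values)

pred = knn5.predict(np.array([[2, 2]]))
print(pred)  # [0], i.e. red
```

The five nearest neighbours of [2, 2] are the three red points and two of the blue points, so red wins the vote.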

k-NN Example with Iris Data (Two Variables)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt

# Load the iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Use only the first two features
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Plot all points
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.title('All Points')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

# Plot train and test data separately
plt.subplot(1, 2, 2)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis', edgecolor='k', s=50, label='Train')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='viridis', edgecolor='k', s=50, label='Test', marker='x')
plt.title('Train and Test Data')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()
plt.show()
# Create and train the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=3)
# Predict the class of the test set
y_pred = knn.predict(X_test)

# Count the number of correct predictions
correct_predictions = np.sum(y_pred == y_test)
total_predictions = len(y_test)

print(f'Correct predictions: {correct_predictions} out of {total_predictions}')

k-NN Example with Iris Data (Two Variables) Output

KNeighborsClassifier(n_neighbors=3)
Correct predictions: 34 out of 45

k-NN Example with Iris Data (All Variables)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Load the iris dataset
iris = load_iris()
X = iris.data  # Use all four features
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=3)
# Predict the class of the test set
y_pred = knn.predict(X_test)

# Count the number of correct predictions
correct_predictions = np.sum(y_pred == y_test)
total_predictions = len(y_test)

print(f'Correct predictions: {correct_predictions} out of {total_predictions}')

Exercise: iris all-variables

Change n_neighbors to 5 and re-run the classifier. Did accuracy improve?

Solution

KNeighborsClassifier()
Correct predictions: 36 out of 45

Exercise: iris two-variable k-NN

Try using n_neighbors=5 and report how many correct predictions you get.

Solution

KNeighborsClassifier()
Correct predictions: 36 out of 45

k-NN Example with Iris Data (All Variables) Output

KNeighborsClassifier(n_neighbors=3)
Correct predictions: 45 out of 45
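
Counting correct predictions by hand is equivalent to the classifier's score method, which returns mean accuracy. The sketch below compares the two on the same split as the all-variables example; the variable names `manual_accuracy` and `score_accuracy` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Manual accuracy: fraction of test points predicted correctly
manual_accuracy = np.mean(knn.predict(X_test) == y_test)

# score() returns the mean accuracy for classifiers
score_accuracy = knn.score(X_test, y_test)

print(manual_accuracy, score_accuracy)
```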

Logistic Regression

  • Logistic regression is a classification algorithm.
  • It is used for binary classification.
  • The logistic function (S-curve) maps any real number to a value between 0 and 1, which can be read as a probability.

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Logistic Regression S-Curve

import numpy as np
import matplotlib.pyplot as plt

# Logistic function
def logistic(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 100)
y = logistic(x)

plt.plot(x, y)
plt.title('Logistic Regression S-Curve')
plt.xlabel('x')
plt.ylabel('logistic(x)')
plt.show()

Logistic Regression Example

from sklearn.linear_model import LogisticRegression
import numpy as np
# Prepare the data
X = df[['X1', 'X2']].values
y = df['y'].values

# Create and train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X, y)

# Predict the class of a new point
new_point = np.array([[6, 7]])
predicted_class = log_reg.predict(new_point)
predicted_prob = log_reg.predict_proba(new_point)

predicted_color = 'red' if predicted_class[0] == 0 else 'blue'  # predict() returns an array
print(f'The new point {new_point} is classified as {predicted_color} with probability {predicted_prob}.')

Logistic Regression Example Output

LogisticRegression()
The new point [[6 7]] is classified as red with probability [[0.50000894 0.49999106]].

Exercise: logistic prediction

Try predicting a different point such as [2, 9] and print the class and probability.

Solution

[0] [[0.77441402 0.22558598]]
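
A sketch of code that would produce output of this form is given below. It rebuilds the six points as arrays so that it runs on its own; the names `X_lr`, `y_lr`, `log_reg_ex` and `point` are assumptions, and the exact probabilities may vary slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same six points as the earlier DataFrame (0 = red, 1 = blue)
X_lr = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y_lr = np.array([0, 0, 0, 1, 1, 1])

log_reg_ex = LogisticRegression()
log_reg_ex.fit(X_lr, y_lr)

point = np.array([[2, 9]])
pred_class = log_reg_ex.predict(point)
pred_prob = log_reg_ex.predict_proba(point)  # columns follow log_reg_ex.classes_

print(pred_class, pred_prob)
```

The first column of predict_proba is the probability of class 0 (red) and the second that of class 1 (blue); the two always sum to 1.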

Dimensionality Reduction

Introduction to Dimensionality Reduction

  • Dimensionality reduction is a technique used to reduce the number of features in a dataset.
  • It helps in visualizing high-dimensional data and reducing computational complexity.
  • Common dimensionality reduction techniques include PCA, t-SNE, and LDA.

Principal Component Analysis (PCA)

  • PCA is a linear dimensionality reduction technique that projects the data onto a lower-dimensional space.
  • It identifies the directions (principal components) that maximize the variance in the data.
  • PCA is widely used for data visualization and noise reduction.

PCA Example

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Generate some sample data
X = np.random.rand(100, 5)

# Create and train the PCA model
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the PCA-transformed data
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA Example')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
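
Uniform random data has no dominant direction, so the plot above mostly shows a shapeless cloud. As a complementary sketch on structured data, the example below projects the four Iris features onto two components and reports how much of the variance each one retains. The variable names are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data  # 150 samples, 4 features

# Project the four features onto the two directions of greatest variance
pca_iris = PCA(n_components=2)
X_iris_2d = pca_iris.fit_transform(X_iris)

print(X_iris_2d.shape)                     # (150, 2)
print(pca_iris.explained_variance_ratio_)  # share of variance per component
```

The components are ordered by explained variance, so the first entry of explained_variance_ratio_ is always the largest.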

scikit-learn Iris Example

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Part 4: Visualisation

Plotting from NumPy arrays

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
plt.plot(x, y, label='sine')
plt.scatter(x[::10], y[::10], c='red')
plt.title('Line and scatter from NumPy arrays')
plt.legend()
plt.show()

Exercise

Using the x and y arrays above, plot a cosine curve on the same axes in green.

Solution
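
One possible solution is sketched here. It redefines x and y so that it runs on its own, and stores the cosine values in a new array named `y_cos` (an assumed name), since y already holds the sine values.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
y_cos = np.cos(x)  # cosine needs its own array; y holds the sine values

# Two plot() calls before show() draw both curves on the same axes
plt.plot(x, y, label='sine')
plt.plot(x, y_cos, color='green', label='cosine')
plt.title('Sine and cosine')
plt.legend()
plt.show()
```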

Plotting from pandas DataFrame

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df3 = pd.DataFrame({
    'a': np.random.randn(50).cumsum(),
    'b': np.random.randn(50).cumsum()
})
df3.plot(title='Line chart from DataFrame', figsize=(6,4))
plt.show()

Exercise

Create a bar chart showing the counts of a categorical column in a DataFrame.

Solution
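
One possible solution is sketched here, using value_counts() to count each category before plotting. The DataFrame `df_cat` and its 'colour' column are made up for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical column; any column of labels would work the same way
df_cat = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green', 'blue', 'red']})

# Count the occurrences of each category, then draw one bar per category
counts = df_cat['colour'].value_counts()
counts.plot(kind='bar', title='Counts per colour')
plt.ylabel('count')
plt.show()
```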