Hello from Python! 10 [1, 2, 3]
Relevant code and data will be provided in a JupyterLab instance available at:
We will be using models from the scikit-learn package in this lecture.
There is extensive documentation with examples available at:
https://scikit-learn.org/stable/supervised_learning.html
source: https://mlarchive.com/
This is a commonly used data structure in data science. It is a tabular data structure with some constraints that make it simpler to work with.
Every row is an observation in the data.
Every column is a variable in the data.
A variable will have a specific data type e.g. number, text, date.
We will make frequent use of pandas dataframes.
import pandas as pd
# Create a DataFrame with six varied data points
data = {
    'X1': [1, 3, 5, 7, 9, 11],
    'X2': [2, 4, 6, 8, 10, 12],
    'y': ['red', 'red', 'red', 'blue', 'blue', 'blue']
}
df = pd.DataFrame(data)
# Map 'red' to 0 and 'blue' to 1
color_map = {'red': 0, 'blue': 1}
df['y'] = df['y'].map(color_map)

| X1 | X2 | y |
|---|---|---|
| 1 | 2 | 0 |
| 3 | 4 | 0 |
| 5 | 6 | 0 |
| 7 | 8 | 1 |
| 9 | 10 | 1 |
| 11 | 12 | 1 |
Create a DataFrame with 10 random integers between 1 and 100, compute the mean of the column, and add a new Boolean column indicating whether each value is above the mean.
val above_mean
0 3 False
1 58 True
2 34 False
3 75 True
4 39 False
5 34 False
6 30 False
7 17 False
8 74 True
9 58 True
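One way to approach this exercise is sketched below. The column names `val` and `above_mean` follow the output shown above. The seeded generator and the variable name `df_rand` are assumptions added for repeatability, so the printed values will differ from those listed.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded random generator so the example is repeatable
rng = np.random.default_rng(0)

# Ten random integers between 1 and 100 (the upper bound 101 is exclusive)
df_rand = pd.DataFrame({"val": rng.integers(1, 101, size=10)})

# Mean of the column
mean_val = df_rand["val"].mean()

# Boolean column: True where the value is strictly above the mean
df_rand["above_mean"] = df_rand["val"] > mean_val

print(df_rand)
```

Comparing a column with a single number produces a Boolean column of the same length, so no explicit loop is needed.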
In this example, we will use the Titanic dataset to predict whether a passenger survived or not.
The target variable is Survived. It is a binary value, so the model will predict 0 or 1.
head() and tail() functions
We can view the first few rows of a pandas dataframe with the head() function.
Similarly, we can view the end of a dataframe with tail().
describe() function
We can get summary statistics of each numeric column with the describe() function.
PassengerId Survived Pclass ... SibSp Parch Fare
count 891.000000 891.000000 891.000000 ... 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 ... 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 ... 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 ... 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 ... 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 ... 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 ... 8.000000 6.000000 512.329200
[8 rows x 7 columns]
Load the Titanic dataset (as above) and use head(), tail() and describe() to inspect it. Try selecting only the Age and Fare columns.
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
PassengerId Survived Pclass ... Fare Cabin Embarked
886 887 0 2 ... 13.00 NaN S
887 888 1 1 ... 30.00 B42 S
888 889 0 3 ... 23.45 NaN S
889 890 1 1 ... 30.00 C148 C
890 891 0 3 ... 7.75 NaN Q
[5 rows x 12 columns]
PassengerId Survived Pclass ... SibSp Parch Fare
count 891.000000 891.000000 891.000000 ... 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 ... 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 ... 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 ... 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 ... 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 ... 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 ... 8.000000 6.000000 512.329200
[8 rows x 7 columns]
Age Fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500
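The calls behind the output above might look like the following sketch. To keep it self-contained, it does not read the lecture's CSV file. Instead it builds a small stand-in DataFrame (the name `titanic_sample` is hypothetical) from a few columns of the first five rows printed above.

```python
import pandas as pd

# Stand-in for the full dataset: the first five rows printed above
titanic_sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.2500, 71.2833, 7.9250, 53.1000, 8.0500],
})

print(titanic_sample.head(3))     # first three rows
print(titanic_sample.tail(2))     # last two rows
print(titanic_sample.describe())  # summary statistics for numeric columns

# Select several columns by passing a list of column names
age_fare = titanic_sample[["Age", "Fare"]]
print(age_fare)
```

Passing a list of names inside the square brackets returns a DataFrame, whereas a single name such as `titanic_sample["Age"]` returns a Series.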
You can download a pdf of the pandas cheat sheet from their docs website at:
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
To calculate the Euclidean distance between two points \((x_1, y_1)\) and \((x_2, y_2)\), we use the formula:
\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]
For the points \((2, 3)\) and \((8, 5)\), the Euclidean distance is calculated as follows:
\[ d = \sqrt{(8 - 2)^2 + (5 - 3)^2} = \sqrt{6^2 + 2^2} = \sqrt{36 + 4} = \sqrt{40} = 2\sqrt{10} \]
To calculate the Euclidean distance between two points in \(\mathbb{R}^3\), we use the formula:
\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \]
For the points \((1, 2, 3)\) and \((4, 5, 6)\), the Euclidean distance is calculated as follows:
\[ d = \sqrt{(4 - 1)^2 + (5 - 2)^2 + (6 - 3)^2} = \sqrt{3^2 + 3^2 + 3^2} = \sqrt{9 + 9 + 9} = \sqrt{27} = 3\sqrt{3} \]
The general formula for calculating the Euclidean distance between two points \(\mathbf{X}_1\) and \(\mathbf{X}_2\) in \(\mathbb{R}^n\),
where
\[ \mathbf{X}_1 = (x_{1_1}, x_{2_1}, x_{3_1}, \ldots, x_{n_1}) \]
and
\[ \mathbf{X}_2 = (x_{1_2}, x_{2_2}, x_{3_2}, \ldots, x_{n_2}) \]
is
\[ d = \sqrt{\sum_{i=1}^{n} (x_{i_2} - x_{i_1})^2} \]
where \(x_{i_1}\) and \(x_{i_2}\) are the coordinates of the two points in the \(i\)-th dimension.
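The formulas above can be checked numerically with a short sketch. The function name `euclidean_distance` is a hypothetical helper introduced for illustration. It works for any number of dimensions and is applied here to the two worked examples.

```python
import math

def euclidean_distance(p1, p2):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p1) != len(p2):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p1, p2)))

print(euclidean_distance((2, 3), (8, 5)))        # equals sqrt(40)
print(euclidean_distance((1, 2, 3), (4, 5, 6)))  # equals sqrt(27)
```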
In this section we look at practical examples of supervised learning using scikit-learn. We’ll revisit the Titanic dataset and build models, then explore k-NN, logistic regression and the Iris dataset.
# Import the module
from sklearn.model_selection import train_test_split
# Reload and prepare the Titanic dataset
import pandas as pd
titanic = pd.read_csv("data/titanic.csv")
X = titanic.drop("Survived", axis=1).values
y = titanic["Survived"].values
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"train target proportion: {y_train.sum()/len(y_train):.3f}")
print(f"test target proportion: {y_test.sum()/len(y_test):.3f}")

train target proportion: 0.383
test target proportion: 0.385
Re-split the data using a 30% test set and print the proportions again.
0.38362760834670945
0.3843283582089552
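A sketch of the re-split is shown below. Because the CSV file is not bundled with these notes, it uses a synthetic stand-in for the target (an assumption): 891 labels of which 342 are 1, which matches the mean of 0.383838 reported by describe(). The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the Survived column (342 ones, 549 zeros)
y_demo = np.array([1] * 342 + [0] * 549)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# 30% test set, stratified on the target so both parts keep a similar class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_prop = y_tr.sum() / len(y_tr)
test_prop = y_te.sum() / len(y_te)
print(train_prop)
print(test_prop)
```

With `stratify` set, both proportions stay close to the overall survival rate. Without it, they can drift apart depending on the random seed.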
# Generic template: `module`, `Model` and `prop` are placeholders
from sklearn.module import Model
from sklearn.model_selection import train_test_split
# Create an instance of the model
model = Model()
# Split the dataset into training and testing sets (prop is the test fraction, e.g. 0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=prop)
# Fit the model to the data
model.fit(X_train, y_train)
# Make predictions on new data
predictions = model.predict(X_test)
# Print the accuracy
print(model.score(X_test, y_test))

KNeighborsClassifier(n_neighbors=3)
The new point [[8 8]] is classified as blue.
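The code that produced this result is not shown in these notes. A minimal sketch that follows the general pattern above and reuses the six-point red/blue data might look like this. The names `toy`, `X_toy`, `y_toy` and `knn_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the small example DataFrame (red -> 0, blue -> 1)
toy = pd.DataFrame({
    "X1": [1, 3, 5, 7, 9, 11],
    "X2": [2, 4, 6, 8, 10, 12],
    "y": [0, 0, 0, 1, 1, 1],
})

X_toy = toy[["X1", "X2"]].values
y_toy = toy["y"].values

# Fit a k-NN classifier that votes among the 3 nearest neighbours
knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# Classify a new point and translate the label back to a colour
new_point = np.array([[8, 8]])
predicted_class = knn_toy.predict(new_point)[0]
predicted_color = "red" if predicted_class == 0 else "blue"
print(f"The new point {new_point} is classified as {predicted_color}.")
```

The three points nearest to (8, 8) are (7, 8), (9, 10) and (5, 6). Two of them are blue, so the majority vote is blue.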
Change the number of neighbours to 5 and try classifying point [2,2]. What colour do you get?
KNeighborsClassifier()
[0]
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt
# Load the iris dataset
iris = load_iris()
X = iris.data[:, :2] # Use only the first two features
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Plot all points
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.title('All Points')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
# Plot train and test data separately
plt.subplot(1, 2, 2)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis', edgecolor='k', s=50, label='Train')
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='viridis', edgecolor='k', s=50, label='Test', marker='x')
plt.title('Train and Test Data')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()
plt.show()

KNeighborsClassifier(n_neighbors=3)
Correct predictions: 34 out of 45
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Load the iris dataset
iris = load_iris()
X = iris.data # Use all four features
y = iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the k-NN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)
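The notes report results as "Correct predictions: … out of 45", but the counting step itself is not shown. One possible way to compute it is sketched below, repeating the same split and classifier so that the block runs on its own. The variable names `y_pred` and `n_correct` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data, split and classifier as in the cell above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict the test labels and count how many match the true labels
y_pred = knn.predict(X_test)
n_correct = int((y_pred == y_test).sum())
print(f"Correct predictions: {n_correct} out of {len(y_test)}")
```

Dividing `n_correct` by `len(y_test)` gives the same accuracy value that `knn.score(X_test, y_test)` returns.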
Change n_neighbors to 5 and re-run the classifier. Did accuracy improve?
KNeighborsClassifier()
Correct predictions: 36 out of 45
Try using n_neighbors=5 and report how many correct predictions you get.
KNeighborsClassifier()
Correct predictions: 36 out of 45
KNeighborsClassifier(n_neighbors=3)
Correct predictions: 45 out of 45
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
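The expression above is the logistic (sigmoid) function, which maps any real number to a value between 0 and 1. A small sketch, using a hypothetical helper named `sigmoid`, shows how it behaves for negative, zero and positive inputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1
for x in (-4, 0, 4):
    print(x, round(sigmoid(x), 4))
```

An input of 0 gives exactly 0.5, which is the usual threshold between the two classes in logistic regression.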
from sklearn.linear_model import LogisticRegression
import numpy as np
# Prepare the data
X = df[['X1', 'X2']].values
y = df['y'].values
# Create and train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X, y)
# Predict the class of a new point
new_point = np.array([[6, 7]])
predicted_class = log_reg.predict(new_point)
predicted_prob = log_reg.predict_proba(new_point)
predicted_color = 'red' if predicted_class[0] == 0 else 'blue'
print(f'The new point {new_point} is classified as {predicted_color} with probability {predicted_prob}.')

LogisticRegression()
The new point [[6 7]] is classified as red with probability [[0.50000894 0.49999106]].
Try predicting a different point such as [2, 9] and print the class and probability.
[0] [[0.77441402 0.22558598]]
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# Generate some sample data
X = np.random.rand(100, 5)
# Create and train the PCA model
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Plot the PCA-transformed data
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA Example')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

Using the x and y arrays above, plot a cosine curve on the same axes in green.


Create a bar chart showing the counts of a categorical column in a DataFrame.
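A possible approach is sketched below. It counts the categories with `value_counts()` and plots the result with the pandas plotting interface. The data are the Embarked values from the ten Titanic rows printed earlier by head() and tail(). The DataFrame name `df_cat` and the styling choices are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Embarked values from the ten rows shown by head() and tail()
df_cat = pd.DataFrame(
    {"Embarked": ["S", "C", "S", "S", "S", "S", "S", "S", "C", "Q"]}
)

# Count how often each category occurs
counts = df_cat["Embarked"].value_counts()

# Draw the counts as a bar chart
ax = counts.plot(kind="bar", color="steelblue")
ax.set_xlabel("Embarked")
ax.set_ylabel("Count")
ax.set_title("Passengers by port of embarkation (sample rows)")
plt.tight_layout()
plt.show()
```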


ASDAN Maths