Lecture notes – Mandatory assignment

Titanic

In the film about the Titanic disaster, survival follows the pattern one might well expect: women in first class survive, and men in third class die. How well can we use what we know about a passenger to predict whether that person will survive the Titanic disaster? And can we use such a model to understand something about how a person's different attributes, such as sex, age and ticket price, affect survival in a shipwreck?
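
A first look at the data can be had directly in Python. The sketch below loads the Titanic passenger list from seaborn's built-in datasets; this is just one possible source, and the assignment may well provide its own data file.

import seaborn as sns

# One possible source of the Titanic passenger data (the assignment may provide its own file)
titanic = sns.load_dataset('titanic')
print(titanic[['survived', 'pclass', 'sex', 'age', 'fare']].head())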

But first

  • Heights and lengths of various animals
Animal Length Height
33 Tiger 2.780537 1.272670
73 Giraffe 3.041766 5.536510
93 Zebra 2.211075 1.434069
3 Lion 2.516610 1.180352
61 Elephant 3.680609 3.180050
184 Ostrich 1.919341 2.362251
146 Koala 0.974074 0.791800
92 Zebra 2.132858 1.329540

We have created a dataset with length and height data for various animals. It can be downloaded here.

Plot

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the dataset from the CSV file
animal_data = pd.read_csv('data/animal_data.csv')
plt.figure(figsize=(8, 5))
plt.plot(animal_data["Length"], animal_data["Height"], "o")

With the animal names

# Make a scatter plot grouped by animal
plt.figure(figsize=(8, 5))
for animal in animal_data['Animal'].unique():
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal)

# Styling
plt.title('Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

Single-variable linear regression

from sklearn.linear_model import LinearRegression

# Prepare the data for linear regression
X = animal_data[['Length']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(animal_data['Length'], animal_data['Height'], label='Actual Data')
plt.plot(animal_data['Length'], predicted_heights, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend()
plt.show()

\[H(L) = \beta_0 + \beta_1 L\]

Symbol Description
H Height
L Length
\(\beta_i\) Regression coefficients
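
The fitted scikit-learn model stores these coefficients directly, so we can read off \(\beta_0\) and \(\beta_1\) from the model object fitted above:

# beta_0 is the intercept, beta_1 is the slope for Length
print(f'beta_0 (intercept): {model.intercept_:.3f}')
print(f'beta_1 (Length):    {model.coef_[0]:.3f}')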

How can we do better?

  • Suggestions?

What about encoding the animals numerically?

from sklearn.linear_model import LinearRegression

# Encode the animal names as numbers
animal_data['Animal_Code'] = animal_data['Animal'].astype('category').cat.codes
# Display a random selection of 10 rows from the dataset
display(animal_data.sample(10))
Animal Length Height Animal_Code
53 Elephant 3.348638 3.010803 0
131 Panda 1.441929 0.863909 6
155 Penguin 0.850698 0.606088 7
13 Lion 2.526350 1.286599 4
47 Tiger 2.658747 0.933085 8
45 Tiger 2.640487 1.112690 8
60 Elephant 3.551559 3.113933 0
150 Penguin 0.518143 0.754484 7
129 Panda 1.366025 0.921771 6
128 Panda 1.356603 0.839018 6
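
To see which number each animal was given, we can list the category mapping behind cat.codes:

# Mapping from numeric code to animal name
categories = animal_data['Animal'].astype('category').cat.categories
print(dict(enumerate(categories)))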

Illustration of the dataset with numeric codes

# Plot the animal data with animal codes
plt.figure(figsize=(10, 6))
colors = matplotlib.colormaps['tab10'].resampled(len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))
    
    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    plt.text(avg_length, avg_height + 0.3, f'{i}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Animal Length vs Height with Animal Codes')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

Linear regression with numeric codes

# Prepare the data for linear regression
X = animal_data[['Length', 'Animal_Code']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(10, 6))
colors = matplotlib.colormaps['tab10'].resampled(len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))
    
    # Predict heights for the subset
    subset_X = subset[['Length', 'Animal_Code']]
    subset_predicted_heights = model.predict(subset_X)
    
    # Plot the regression line for the subset
    plt.plot(subset['Length'], subset_predicted_heights, color=colors(i))

    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    plt.text(avg_length, avg_height + 0.3, f'{i}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Two Variable Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

  • How did this actually go? Look closely.

If we look at the numbers, we see that all the regression lines are ordered by the code. In other words, how the animals happen to be numbered affects the model, which is a bit odd: the code is just an arbitrary label, not a quantity.
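
A quick way to see the problem, as a sketch on the animal data from above: give the animals a different but equally arbitrary numbering and refit. The data is unchanged, yet the quality of the fit will in general change, because the model treats the code as a quantity.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Work on a copy so animal_data itself is left untouched
demo = animal_data[['Length', 'Height']].copy()
codes = animal_data['Animal'].astype('category').cat.codes.to_numpy()
demo['code_original'] = codes

# Relabel the animals with an arbitrary permutation of the same numbers
rng = np.random.default_rng(0)
demo['code_shuffled'] = rng.permutation(int(codes.max()) + 1)[codes]

# Same data, different arbitrary labels, different fit
for col in ['code_original', 'code_shuffled']:
    m = LinearRegression().fit(demo[['Length', col]], demo['Height'])
    r2 = r2_score(demo['Height'], m.predict(demo[['Length', col]]))
    print(f'{col}: R^2 = {r2:.3f}')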

Equation

\[H(L, N) = \beta_0 + \beta_1 L + \beta_2 N\]

Symbol Description
H Height
L Length
N Numeric code for the animal
\(\beta_i\) Regression coefficients

How can we do better?

  • One-hot encoding

With one-hot encoding

We can create one-hot encoded data with pandas.get_dummies(...)

transformed_data = pd.get_dummies(animal_data, columns=['Animal'], drop_first=False)
display(transformed_data.sample(10))
Length Height Animal_Code Animal_Elephant Animal_Giraffe Animal_Kangaroo Animal_Koala Animal_Lion Animal_Ostrich Animal_Panda Animal_Penguin Animal_Tiger Animal_Zebra
117 1.759347 1.959681 2 0 0 1 0 0 0 0 0 0 0
49 2.842813 1.258549 8 0 0 0 0 0 0 0 0 1 0
78 3.089452 5.566428 1 0 1 0 0 0 0 0 0 0 0
156 0.786210 0.576376 7 0 0 0 0 0 0 0 1 0 0
12 2.641885 1.271773 4 0 0 0 0 1 0 0 0 0 0
17 2.536413 0.944489 4 0 0 0 0 1 0 0 0 0 0
170 2.066710 2.473048 5 0 0 0 0 0 1 0 0 0 0
46 2.670109 1.070402 8 0 0 0 0 0 0 0 0 1 0
45 2.640487 1.112690 8 0 0 0 0 0 0 0 0 1 0
106 1.669596 1.872797 2 0 0 1 0 0 0 0 0 0 0
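
The same encoding can also be done with scikit-learn's OneHotEncoder, which is convenient when the encoding should live inside a model pipeline. A minimal sketch, using the animal_data frame from above:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode 'Animal' and pass 'Length' through unchanged
encoder = ColumnTransformer(
    [('animal', OneHotEncoder(), ['Animal'])],
    remainder='passthrough',
)
X_encoded = encoder.fit_transform(animal_data[['Animal', 'Length']])

# Depending on settings the result may be a (sparse) matrix rather than a DataFrame
print(encoder.get_feature_names_out())
print(X_encoded.shape)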

Regression model with one-hot encoding

X = transformed_data.drop(columns=['Height', 'Animal_Code'])
y = transformed_data['Height']

model = LinearRegression()
model.fit(X, y)

# Use the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
colors = matplotlib.colormaps['tab10'].resampled(len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i), edgecolor=colors(i), facecolors='none')
    
    # Predict heights for each animal
    subset_X = transformed_data[transformed_data[f'Animal_{animal}'] == 1].drop(columns=['Height', 'Animal_Code'])
    subset_predicted_heights = model.predict(subset_X)
    
    # Plot the regression line for each individual animal
    plt.plot(subset['Length'], subset_predicted_heights, color=np.array(colors(i))*0.9, linewidth=5)

# Styling
plt.title('Linear Regression with One-Hot Encoding: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

Equation

\[H(L, \text{animal}) = \beta_0 + \beta_1 L + \sum_{i \in \{\text{Lion}, \text{Tiger}, \ldots\}} \beta_i \, [\text{is this an } i\text{?}]\]
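
Each one-hot column gets its own coefficient. Pairing model.coef_ with the column names of X, both defined in the code above, makes the sum in the equation concrete:

# Intercept plus one coefficient per column (Length and one indicator per animal)
print(f'intercept: {model.intercept_:.3f}')
for name, beta in zip(X.columns, model.coef_):
    print(f'{name:20s} {beta: .3f}')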

Let us look at this in a slightly smaller dataset

import pandas as pd
import seaborn as sns
health = sns.load_dataset('healthexp')
display(health.sample(10))
Year Country Spending_USD Life_Expectancy
131 1997 Germany 2496.201 77.3
223 2012 Great Britain 3614.131 81.0
25 1976 Japan 303.725 74.8
156 2001 France 2875.294 79.3
55 1983 Great Britain 501.924 74.3
234 2014 France 4626.679 82.8
79 1988 Canada 1461.300 76.8
49 1982 Germany 1044.528 73.5
24 1976 Germany 591.098 71.8
216 2011 France 4161.698 82.3

Here we use seaborn only to load a dataset. Seaborn also offers some nice options for statistical visualization, for those who might be interested.
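
For instance, a single call gives a colour-coded scatter plot of this dataset (purely illustrative):

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of spending vs life expectancy, coloured by country
sns.scatterplot(data=health, x='Spending_USD', y='Life_Expectancy', hue='Country')
plt.show()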

In-class exercise

Note
  1. One-hot encode the healthexp dataset
  2. Make a training/validation split of the dataset
  3. Train a linear regression model to predict life expectancy, with spending as the explanatory variable
  4. Include country as an explanatory variable in the model
  5. Compare the accuracy of the models
import pandas as pd
import seaborn as sns
health = sns.load_dataset('healthexp')
health_onehot = pd.get_dummies(health, columns=['Country'])
display(health_onehot.sample(10))
Year Spending_USD Life_Expectancy Country_Canada Country_France Country_Germany Country_Great Britain Country_Japan Country_USA
113 1994 2188.676 76.5 0 0 1 0 0 0
138 1998 2321.931 78.8 0 1 0 0 0 0
218 2011 3740.756 82.7 0 0 0 0 1 0
245 2016 5669.064 81.0 0 0 1 0 0 0
201 2008 7385.026 78.1 0 0 0 0 0 1
97 1991 842.797 75.9 0 0 0 1 0 0
250 2017 5150.470 81.9 1 0 0 0 0 0
1 1970 192.143 72.2 0 1 0 0 0 0
151 2000 1897.202 77.9 0 0 0 1 0 0
256 2018 5308.356 82.0 1 0 0 0 0 0
for i, frame in health.groupby("Country"):
    plt.scatter(frame["Spending_USD"], frame["Life_Expectancy"], marker="o", label=i)
plt.xlabel("Expenditure (USD)")
plt.ylabel("Life expectancy")
plt.legend()

Start of a solution

health_onehot = pd.get_dummies(health, columns=['Country'], drop_first=False)
display(health.sample(10))
Year Country Spending_USD Life_Expectancy
106 1993 Canada 1930.889 77.8
65 1985 France 1001.145 75.4
28 1977 Japan 340.628 75.3
186 2006 France 3444.855 81.0
193 2007 Great Britain 3021.671 79.7
231 2013 USA 8519.620 78.8
228 2013 France 4544.964 82.3
1 1970 France 192.143 72.2
24 1976 Germany 591.098 71.8
214 2011 Canada 4228.962 81.4

A simple regression model

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array(health_onehot["Spending_USD"]).reshape(-1,1)
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual Data')
plt.plot(X_test, predicted_life_expectancy, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Spending vs Life Expectancy')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend()
plt.show()
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 7.846016617615249
R^2 Score: 0.3573359515082699

A strikingly "good" model: what has happened here?

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = health_onehot.drop(columns=['Life_Expectancy'])
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)


# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
colors = matplotlib.colormaps['tab20'].resampled(len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, color=colors(i), edgecolor=colors(i), facecolors='none')
    
    # Predict life expectancy for the subset
    subset_X = health_onehot[health_onehot[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)
    
    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i))*0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding: Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 0.13772868450150377
R^2 Score: 0.9887186991451874
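
One way to investigate is to check which columns actually ended up in X and what weight each of them was given; note that Year is among the explanatory variables here. A quick look at the model fitted above:

# List the explanatory variables and their fitted coefficients
for name, beta in zip(X.columns, model.coef_):
    print(f'{name:25s} {beta: .4f}')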

Without year as an explanatory variable

# Drop the 'Year' column from the dataset
health_onehot_no_year = health_onehot.drop(columns=['Year'])

# Prepare the data for linear regression
X = health_onehot_no_year.drop(columns=['Life_Expectancy'])
y = health_onehot_no_year['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
colors = matplotlib.colormaps['tab20'].resampled(len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, color=colors(i), edgecolor=colors(i), facecolors='none')
    
    # Predict life expectancy for the subset
    subset_X = health_onehot_no_year[health_onehot_no_year[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)
    
    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i))*0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding (No Year): Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 2.3013732097838697
R^2 Score: 0.8114954509821513
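
To round off item 5 of the exercise, the two models can be compared directly on the same split. A sketch, reusing the health_onehot_no_year frame built above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_all = health_onehot_no_year.drop(columns=['Life_Expectancy'])
y_all = health_onehot_no_year['Life_Expectancy']
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Spending only vs. spending + country dummies, evaluated on the same test set
feature_sets = {
    'Spending only': ['Spending_USD'],
    'Spending + country': list(X_all.columns),
}
for name, cols in feature_sets.items():
    m = LinearRegression().fit(X_train[cols], y_train)
    pred = m.predict(X_test[cols])
    print(f'{name:20s} MSE = {mean_squared_error(y_test, pred):.2f}  R^2 = {r2_score(y_test, pred):.3f}')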