Lecture notes – Mandatory assignment

Titanic

In the film about the Titanic disaster, survival follows the pattern one might well expect: women in first class survive, and men in third class die. How well can we use what we know about a passenger to predict whether that person will survive the Titanic disaster? And can we use such a model to understand how a person's different attributes, such as sex, age and ticket price, affect survival in a shipwreck?
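
As a quick illustration of this pattern, we can look at survival rates grouped by sex and passenger class; this is only a sketch, assuming seaborn's built-in 'titanic' example dataset (which is not necessarily identical to the file used in the assignment).

import seaborn as sns

# Load seaborn's example Titanic dataset (assumption: similar to the assignment data)
titanic = sns.load_dataset('titanic')

# 'survived' is 0/1, so the group mean is the survival rate per (class, sex) group
print(titanic.groupby(['class', 'sex'])['survived'].mean())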

But before that

  • Height and length of different animals
Animal Length Height
112 Kangaroo 1.568641 1.843209
66 Giraffe 3.132013 5.483761
6 Lion 2.680548 1.454220
14 Lion 2.315484 1.167234
31 Tiger 2.779660 1.028793
139 Koala 0.661594 0.636471
10 Lion 2.308350 0.938822
84 Zebra 2.120716 1.324181

We have created a dataset with length and height data for various animals. It can be downloaded here.

Plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset from the CSV file
animal_data = pd.read_csv('data/animal_data.csv')

# Simple scatter plot of length vs. height
plt.figure(figsize=(8, 5))
plt.plot(animal_data["Length"], animal_data["Height"], "o")

With the animal names

# Create a scatter plot grouped by animal
plt.figure(figsize=(8, 5))
for animal in animal_data['Animal'].unique():
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal)

# Styling
plt.title('Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

Single-variable linear regression

from sklearn.linear_model import LinearRegression

# Prepare the data for linear regression
X = animal_data[['Length']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(animal_data['Length'], animal_data['Height'], label='Actual Data')
plt.plot(animal_data['Length'], predicted_heights, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend()
plt.show()

\[H(L) = \beta_0 + \beta_1 L\]

Symbol Description
H Height
L Length
\(\beta_i\) Regression coefficients
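
The fitted \(\beta_0\) and \(\beta_1\) can be read directly off the model object; a minimal sketch, assuming the model fitted in the block above is still in scope:

# beta_0 (intercept) and beta_1 (slope for Length) of the single-variable model
print(f'beta_0 (intercept): {model.intercept_:.3f}')
print(f'beta_1 (Length):    {model.coef_[0]:.3f}')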

How can we do better?

  • suggestions?

What about encoding the animals numerically?

from sklearn.linear_model import LinearRegression

# Encode the animal names as numbers
animal_data['Animal_Code'] = animal_data['Animal'].astype('category').cat.codes
# Display a random selection of 10 rows from the dataset
display(animal_data.sample(10))
Animal Length Height Animal_Code
100 Kangaroo 1.513549 1.607864 2
76 Giraffe 3.097624 5.520185 1
93 Zebra 2.206255 1.359918 9
168 Penguin 0.527009 0.605689 7
188 Ostrich 1.946773 2.588437 5
177 Ostrich 2.177602 2.610298 5
36 Tiger 2.932903 1.347101 8
91 Zebra 2.109530 1.232679 9
63 Elephant 3.629189 3.132388 0
15 Lion 2.493148 1.107511 4

Illustration of the dataset with numeric codes

# Plot the animal data with animal codes
plt.figure(figsize=(10, 6))
colors = plt.get_cmap('tab10', len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))
    
    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    plt.text(avg_length, avg_height + 0.3, f'{i}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Animal Length vs Height with Animal Codes')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

Linear regression with numeric codes

# Prepare the data for linear regression
X = animal_data[['Length', 'Animal_Code']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(10, 6))
colors = plt.get_cmap('tab10', len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))
    
    # Predict heights for the subset
    subset_X = subset[['Length', 'Animal_Code']]
    subset_predicted_heights = model.predict(subset_X)
    
    # Plot the regression line for the subset
    plt.plot(subset['Length'], subset_predicted_heights, color=colors(i))

    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    plt.text(avg_length, avg_height + 0.3, f'{i}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Two Variable Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

  • How did this actually go? Look closely.

If we look at the numbers (the codes in the plot), we see that the regression lines are ordered by the numeric code. The arbitrary ordering of the animals therefore affects the model, which is rather odd.

Equation

\[H(L, N) = \beta_0 + \beta_1 L + \beta_2 N\]

Symbol Description
H Height
L Length
N Numeric code for the animal
\(\beta_i\) Regression coefficients
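
Since \(\beta_2\) is a single number, each step of +1 in the animal code shifts the whole regression line by the same fixed amount, whatever the animal. We can check the fitted values directly; a small sketch, assuming the two-variable model from the block above is still in scope:

# The single Animal_Code coefficient forces equal spacing between consecutive codes
print(f'beta_0 (intercept):   {model.intercept_:.3f}')
print(f'beta_1 (Length):      {model.coef_[0]:.3f}')
print(f'beta_2 (Animal_Code): {model.coef_[1]:.3f}')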

How can we do better?

  • One-hot encoding

With one-hot encoding

We can create one-hot encoded data with pandas.get_dummies(...)

# (sklearn's OneHotEncoder and ColumnTransformer are an alternative, but here we use pandas.get_dummies)

transformed_data = pd.get_dummies(animal_data, columns=['Animal'], drop_first=False)
display(transformed_data.sample(10))
Length Height Animal_Code Animal_Elephant Animal_Giraffe Animal_Kangaroo Animal_Koala Animal_Lion Animal_Ostrich Animal_Panda Animal_Penguin Animal_Tiger Animal_Zebra
38 2.621304 1.237276 8 False False False False False False False False True False
72 3.160007 5.491863 1 False True False False False False False False False False
106 1.639356 1.798524 2 False False True False False False False False False False
28 2.637927 0.974152 8 False False False False False False False False True False
64 3.554803 2.947094 0 True False False False False False False False False False
81 2.306072 1.407082 9 False False False False False False False False False True
122 1.499807 0.751083 6 False False False False False False True False False False
99 2.278110 1.288724 9 False False False False False False False False False True
151 0.816539 0.732680 7 False False False False False False False True False False
115 1.766059 1.929994 2 False False True False False False False False False False

Regression model with one-hot encoding

X = transformed_data.drop(columns=['Height', 'Animal_Code'])
y = transformed_data['Height']

model = LinearRegression()
model.fit(X, y)

# Use the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
colors = plt.get_cmap('tab10', len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i), edgecolor=colors(i), facecolors='none')
    
    # Predict heights for each animal
    subset_X = transformed_data[transformed_data[f'Animal_{animal}'] == 1].drop(columns=['Height', 'Animal_Code'])
    subset_predicted_heights = model.predict(subset_X)
    
    # Plot the regression line for each individual animal
    plt.plot(subset['Length'], subset_predicted_heights, color=np.array(colors(i))*0.9, linewidth=5)

# Styling
plt.title('Linear Regression with One-Hot Encoding: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()

Equation

\[H(L, \text{Animal}) = \beta_0 + \beta_1 L + \sum_{i \in \{\text{Lion}, \text{Tiger}, \ldots\}} \beta_i \, [\text{the animal is an } i]\]
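
To connect the equation to the fitted model, we can read off the intercept and coefficients and rebuild one prediction by hand; a minimal sketch, assuming model, X and transformed_data from the block above are still in scope:

# One coefficient for Length plus one per animal indicator column
coeffs = pd.Series(model.coef_, index=X.columns)
print(f'Intercept (beta_0): {model.intercept_:.3f}')
print(coeffs)

# Rebuild the prediction for the first row by hand and compare with model.predict
row = X.iloc[[0]]
manual = model.intercept_ + (coeffs * row.iloc[0].astype(float)).sum()
print(manual, model.predict(row)[0])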

Let us look at this in a slightly smaller dataset

import pandas as pd
import seaborn as sns
health = sns.load_dataset('healthexp')
display(health.sample(10))
Year Country Spending_USD Life_Expectancy
186 2006 France 3444.855 81.0
192 2007 France 3588.227 81.2
43 1981 Canada 898.807 75.5
150 2000 France 2687.530 79.2
265 2019 Great Britain 4385.463 81.4
200 2008 Japan 2799.198 82.7
111 1993 USA 3286.558 75.5
268 2020 Canada 5828.324 81.7
52 1982 USA 1329.669 74.5
89 1990 Canada 1699.774 77.3

Here we use seaborn only to load a dataset. Seaborn also offers some options for nice statistical visualization, for those who might be interested in that.
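
As a small taste of that, a sketch using seaborn's own plotting (illustrative only, not part of the exercise): sns.lmplot fits and draws a separate least-squares line per country.

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plus one regression line per country, handled entirely by seaborn
sns.lmplot(data=health, x='Spending_USD', y='Life_Expectancy', hue='Country', height=5, aspect=1.5)
plt.show()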

In-class exercise

Note
  1. One-hot encode the healthexp dataset
  2. Do a training/validation split of the dataset
  3. Train a linear regression model to predict life expectancy, with spending as the explanatory variable
  4. Include country as an explanatory variable in the model
  5. Compare the accuracy of the models
import pandas as pd
import seaborn as sns
health = sns.load_dataset('healthexp')
health_onehot = pd.get_dummies(health, columns=['Country'])
display(health_onehot.sample(10))
Year Spending_USD Life_Expectancy Country_Canada Country_France Country_Germany Country_Great Britain Country_Japan Country_USA
218 2011 3740.756 82.7 False False False False True False
221 2012 4745.546 80.6 False False True False False False
185 2006 3567.061 79.8 False False True False False False
202 2009 3945.873 80.9 True False False False False False
46 1981 603.965 76.5 False False False False True False
1 1970 192.143 72.2 False True False False False False
226 2013 4428.753 81.7 True False False False False False
75 1987 1480.096 75.6 False False True False False False
47 1981 1191.537 74.1 False False False False False True
73 1986 1847.773 74.7 False False False False False True
for i, frame in health.groupby("Country"):
    plt.scatter(frame["Spending_USD"], frame["Life_Expectancy"], marker="o", label=i)
plt.xlabel("Expenditure (USD)")
plt.ylabel("Life expectancy")
plt.legend()

Start of a solution

health_onehot = pd.get_dummies(health, columns=['Country'], drop_first=False)
display(health.sample(10))
Year Country Spending_USD Life_Expectancy
27 1977 Germany 647.352 72.5
255 2017 USA 10046.472 78.6
57 1983 USA 1451.945 74.6
234 2014 France 4626.679 82.8
122 1995 Japan 1413.445 79.6
121 1995 Great Britain 1094.034 76.7
95 1991 Canada 1805.209 77.6
114 1994 France 1817.042 78.0
105 1992 USA 3100.343 75.7
99 1991 USA 2901.589 75.5

Simple regression model

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array(health_onehot["Spending_USD"]).reshape(-1,1)
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual Data')
plt.plot(X_test, predicted_life_expectancy, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Spending vs Life Expectancy')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend()
plt.show()
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 7.846016617615249
R^2 Score: 0.3573359515082699

A strikingly “good” model. What has happened here?

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = health_onehot.drop(columns=['Life_Expectancy'])
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)


# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
colors = plt.get_cmap('tab20', len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, color=colors(i), edgecolor=colors(i), facecolors='none')
    
    # Predict life expectancy for the subset
    subset_X = health_onehot[health_onehot[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)
    
    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i))*0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding: Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 0.13772868450148823
R^2 Score: 0.9887186991451887
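
A plausible explanation is the Year column: over this period, calendar year and life expectancy rise together, so the model can lean heavily on Year rather than on spending. A small sketch to inspect this, assuming model and X from the block above are still in scope:

# Coefficient per column: note the weight given to Year
print(pd.Series(model.coef_, index=X.columns))

# Year by itself is strongly correlated with life expectancy
print(health['Year'].corr(health['Life_Expectancy']))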

Without year as an explanatory variable

# Drop the 'Year' column from the dataset
health_onehot_no_year = health_onehot.drop(columns=['Year'])

# Prepare the data for linear regression
X = health_onehot_no_year.drop(columns=['Life_Expectancy'])
y = health_onehot_no_year['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
colors = plt.get_cmap('tab20', len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, color=colors(i), edgecolor=colors(i), facecolors='none')
    
    # Predict life expectancy for the subset
    subset_X = health_onehot_no_year[health_onehot_no_year[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)
    
    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i))*0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding (No Year): Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')

Mean Squared Error: 2.301373209783868
R^2 Score: 0.8114954509821515