Index | Animal | Length | Height |
---|---|---|---|
33 | Tiger | 2.780537 | 1.272670 |
73 | Giraffe | 3.041766 | 5.536510 |
93 | Zebra | 2.211075 | 1.434069 |
3 | Lion | 2.516610 | 1.180352 |
61 | Elephant | 3.680609 | 3.180050 |
184 | Ostrich | 1.919341 | 2.362251 |
146 | Koala | 0.974074 | 0.791800 |
92 | Zebra | 2.132858 | 1.329540 |
Lecture notes: Compulsory assignment
Titanic
In the film about the sinking of the Titanic, survival follows the pattern one might expect: women in first class survive, and men in third class die. How well can we use what we know about a passenger to predict whether that person will survive the sinking of the Titanic? And can we use such a model to understand how a person's characteristics, such as sex, age and ticket price, affect survival in a shipwreck?
But first
- Height and length of various animals
We have made a dataset with length and height data for various animals. It can be downloaded here.
Plot
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset from the CSV file
animal_data = pd.read_csv('data/animal_data.csv')

plt.figure(figsize=(8, 5))
plt.plot(animal_data["Length"], animal_data["Height"], "o")
Single-variable linear regression
from sklearn.linear_model import LinearRegression

# Prepare the data for linear regression
X = animal_data[['Length']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(animal_data['Length'], animal_data['Height'], label='Actual Data')
plt.plot(animal_data['Length'], predicted_heights, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend()
plt.show()
\[H(L) = \beta_0 + \beta_1 L\]
Symbol | Description |
---|---|
H | Height |
L | Length |
\(\beta_i\) | Regression coefficients |
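To connect the equation to the fitted model, the sketch below fits a single-variable regression on a small synthetic dataset and reads the two coefficients from scikit-learn's `intercept_` and `coef_` attributes. The numbers are made up for illustration and are not taken from animal_data.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical lengths (m); heights follow H = 0.2 + 0.5 * L exactly
lengths = np.array([[1.0], [2.0], [3.0], [4.0]])
heights = 0.2 + 0.5 * lengths.ravel()

model = LinearRegression().fit(lengths, heights)

beta_0 = model.intercept_   # estimate of the intercept
beta_1 = model.coef_[0]     # estimate of the slope for L

# Evaluate H(L) = beta_0 + beta_1 * L by hand and compare with predict()
L_new = 2.5
manual = beta_0 + beta_1 * L_new
from_model = model.predict(np.array([[L_new]]))[0]
print(beta_0, beta_1, manual, from_model)
```

Because the synthetic heights lie exactly on a line, the fit should recover the intercept and slope used to generate them, and the hand-evaluated value should agree with `predict`.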
How can we do better?
- suggestions?
What about encoding the animals numerically?
from sklearn.linear_model import LinearRegression

# Encode the animal names as numbers
animal_data['Animal_Code'] = animal_data['Animal'].astype('category').cat.codes

# Display a random selection of 10 rows from the dataset
display(animal_data.sample(10))
Index | Animal | Length | Height | Animal_Code |
---|---|---|---|---|
53 | Elephant | 3.348638 | 3.010803 | 0 |
131 | Panda | 1.441929 | 0.863909 | 6 |
155 | Penguin | 0.850698 | 0.606088 | 7 |
13 | Lion | 2.526350 | 1.286599 | 4 |
47 | Tiger | 2.658747 | 0.933085 | 8 |
45 | Tiger | 2.640487 | 1.112690 | 8 |
60 | Elephant | 3.551559 | 3.113933 | 0 |
150 | Penguin | 0.518143 | 0.754484 | 7 |
129 | Panda | 1.366025 | 0.921771 | 6 |
128 | Panda | 1.356603 | 0.839018 | 6 |
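The codes in the table come from pandas' categorical type, which sorts string categories alphabetically by default. The sketch below, using a made-up list of names rather than the real file, shows one way to recover the mapping from code to animal.

```python
import pandas as pd

# Hypothetical animal names, in arbitrary order
animals = pd.Series(['Tiger', 'Lion', 'Elephant', 'Lion', 'Panda'])

as_category = animals.astype('category')
codes = as_category.cat.codes

# Categories are sorted, so the codes follow alphabetical order
code_to_animal = dict(enumerate(as_category.cat.categories))
print(code_to_animal)
print(codes.tolist())
```

The numbering therefore reflects spelling, not any property of the animals themselves.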
Illustration of the dataset with numeric codes
# Plot the animal data with animal codes
plt.figure(figsize=(10, 6))
# plt.get_cmap replaces the deprecated plt.cm.get_cmap
colors = plt.get_cmap('tab10', len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))

    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    # Use the actual Animal_Code; the loop index i follows order of appearance instead
    code = subset['Animal_Code'].iloc[0]
    plt.text(avg_length, avg_height + 0.3, f'{code}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Animal Length vs Height with Animal Codes')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()
Linear regression with numeric codes
# Prepare the data for linear regression
X = animal_data[['Length', 'Animal_Code']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(10, 6))
# plt.get_cmap replaces the deprecated plt.cm.get_cmap
colors = plt.get_cmap('tab10', len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))

    # Predict heights for the subset
    subset_X = subset[['Length', 'Animal_Code']]
    subset_predicted_heights = model.predict(subset_X)

    # Plot the regression line for the subset
    plt.plot(subset['Length'], subset_predicted_heights, color=colors(i))

    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    # Use the actual Animal_Code; the loop index i follows order of appearance instead
    code = subset['Animal_Code'].iloc[0]
    plt.text(avg_length, avg_height + 0.3, f'{code}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Two Variable Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()
- How did this actually go? Look closely.
Looking at the numbers, we see that all the regression lines are ordered by their code. In other words, the order in which the animals happen to be numbered affects the model, which is a bit odd.
Equation
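The dependence on the numbering can be checked directly. The sketch below is a minimal illustration on hypothetical data (one made-up individual per animal, not the real dataset): the same lengths and heights are fitted twice, once with an alphabetical numbering and once with a numbering that happens to follow height.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one individual per animal
animal = np.array(['Koala', 'Lion', 'Elephant', 'Giraffe'])
length = np.array([1.0, 2.5, 3.5, 3.0])
height = np.array([0.8, 1.2, 3.2, 5.5])

def r2_for_coding(coding):
    """Fit H = b0 + b1*L + b2*N for a given name -> number mapping and return R^2."""
    N = np.array([coding[a] for a in animal])
    X = np.column_stack([length, N])
    return LinearRegression().fit(X, height).score(X, height)

# Two equally arbitrary numberings of the same animals
r2_alphabetical = r2_for_coding({'Elephant': 0, 'Giraffe': 1, 'Koala': 2, 'Lion': 3})
r2_by_height = r2_for_coding({'Koala': 0, 'Lion': 1, 'Elephant': 2, 'Giraffe': 3})
print(r2_alphabetical, r2_by_height)
```

The data are identical in both fits; only the labels change, yet the two fits differ in quality. A numeric code forces the animals onto an evenly spaced scale that the model treats as meaningful.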
\[H(L, N) = \beta_0 + \beta_1 L + \beta_2 N\]
Symbol | Description |
---|---|
H | Height |
L | Length |
N | Numeric code for the animal |
\(\beta_i\) | Regression coefficients |
How can we do better?
- One-hot encoding
With one-hot encoding
We can create one-hot encoded data with pandas.get_dummies(...)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# dtype=int gives 0/1 columns as in the table below (newer pandas defaults to booleans)
transformed_data = pd.get_dummies(animal_data, columns=['Animal'], drop_first=False, dtype=int)
display(transformed_data.sample(10))
Index | Length | Height | Animal_Code | Animal_Elephant | Animal_Giraffe | Animal_Kangaroo | Animal_Koala | Animal_Lion | Animal_Ostrich | Animal_Panda | Animal_Penguin | Animal_Tiger | Animal_Zebra |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
117 | 1.759347 | 1.959681 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
49 | 2.842813 | 1.258549 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
78 | 3.089452 | 5.566428 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
156 | 0.786210 | 0.576376 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
12 | 2.641885 | 1.271773 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
17 | 2.536413 | 0.944489 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
170 | 2.066710 | 2.473048 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
46 | 2.670109 | 1.070402 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
45 | 2.640487 | 1.112690 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
106 | 1.669596 | 1.872797 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
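The same encoding can also be done inside scikit-learn with `OneHotEncoder` and `ColumnTransformer`. The sketch below is one possible arrangement, not necessarily what was intended in the lecture; the miniature DataFrame `toy` and its values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the animal data (values made up for illustration)
toy = pd.DataFrame({
    'Animal': ['Lion', 'Lion', 'Koala', 'Koala', 'Giraffe', 'Giraffe'],
    'Length': [2.5, 2.7, 0.9, 1.0, 3.0, 3.1],
    'Height': [1.2, 1.3, 0.8, 0.85, 5.5, 5.6],
})
features = toy[['Animal', 'Length']]

# One-hot encode 'Animal' and pass 'Length' through unchanged
preprocess = ColumnTransformer(
    transformers=[('animal', OneHotEncoder(handle_unknown='ignore'), ['Animal'])],
    remainder='passthrough',
)

# Encoding and regression combined in one estimator
pipeline = make_pipeline(preprocess, LinearRegression())
pipeline.fit(features, toy['Height'])
predictions = pipeline.predict(features)

# Inspect the encoded matrix: three one-hot columns plus Length
encoded = preprocess.fit_transform(features)
print(encoded.shape)
```

An advantage of the pipeline form is that the encoder is fitted together with the model, so new data passes through the same encoding before prediction.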
Regression model with one-hot encoding
import numpy as np

X = transformed_data.drop(columns=['Height', 'Animal_Code'])
y = transformed_data['Height']

model = LinearRegression()
model.fit(X, y)

# Use the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
# plt.get_cmap replaces the deprecated plt.cm.get_cmap
colors = plt.get_cmap('tab10', len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i), edgecolor=colors(i), facecolors='none')

    # Predict heights for each animal
    subset_X = transformed_data[transformed_data[f'Animal_{animal}'] == 1].drop(columns=['Height', 'Animal_Code'])
    subset_predicted_heights = model.predict(subset_X)

    # Plot the regression line for each animal
    plt.plot(subset['Length'], subset_predicted_heights, color=np.array(colors(i))*0.9, linewidth=5)

# Decoration
plt.title('Linear Regression with One-Hot Encoding: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()
Equation
\[H(L, A) = \beta_0 + \beta_1 L + \sum_{i \in \{\mathrm{Lion}, \mathrm{Tiger}, \ldots\}} \beta_i \, [A = i]\]
Here A is the animal, and the bracket [A = i] equals 1 if the animal is an i and 0 otherwise.
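One way to read this equation is that all animals share the same slope for length, while each animal gets its own offset through its indicator term. The sketch below checks this on hypothetical noise-free data generated as H = 0.5 L + offset; the offsets and lengths are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: H = 0.5 * L + offset for each animal
offsets = {'Koala': 0.3, 'Lion': 0.0, 'Giraffe': 4.0}
toy = pd.DataFrame({
    'Animal': ['Koala', 'Koala', 'Lion', 'Lion', 'Giraffe', 'Giraffe'],
    'Length': [0.9, 1.1, 2.4, 2.6, 3.0, 3.2],
})
toy['Height'] = 0.5 * toy['Length'] + toy['Animal'].map(offsets)

# One indicator column per animal, plus Length
X = pd.get_dummies(toy[['Length', 'Animal']], columns=['Animal'], dtype=float)
model = LinearRegression().fit(X, toy['Height'])

coef = dict(zip(X.columns, model.coef_))
slope = coef['Length']
# beta_0 + beta_i: the effective intercept for each animal
intercepts = {a: model.intercept_ + coef[f'Animal_{a}'] for a in offsets}
print(slope, intercepts)
```

With an indicator for every animal in addition to the intercept, the individual coefficients are not uniquely determined, but the sum of the intercept and an animal's coefficient is. That sum is what the sketch compares with the offsets used to generate the data.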
Let us look at this in a slightly smaller dataset
import pandas as pd
import seaborn as sns

health = sns.load_dataset('healthexp')
display(health.sample(10))
Index | Year | Country | Spending_USD | Life_Expectancy |
---|---|---|---|---|
131 | 1997 | Germany | 2496.201 | 77.3 |
223 | 2012 | Great Britain | 3614.131 | 81.0 |
25 | 1976 | Japan | 303.725 | 74.8 |
156 | 2001 | France | 2875.294 | 79.3 |
55 | 1983 | Great Britain | 501.924 | 74.3 |
234 | 2014 | France | 4626.679 | 82.8 |
79 | 1988 | Canada | 1461.300 | 76.8 |
49 | 1982 | Germany | 1044.528 | 73.5 |
24 | 1976 | Germany | 591.098 | 71.8 |
216 | 2011 | France | 4161.698 | 82.3 |
Here we use seaborn only to load a dataset. Seaborn also offers some nice options for statistical visualization, for those who are interested.
In-class exercise
- One-hot encode the healthexp dataset
- Split the dataset into training and validation sets
- Train a linear regression model to predict life expectancy, with spending as the explanatory variable
- Include country as an explanatory variable in the model
- Compare the accuracy of the models
import pandas as pd
import seaborn as sns

health = sns.load_dataset('healthexp')
# dtype=int gives 0/1 columns as in the table below (newer pandas defaults to booleans)
health_onehot = pd.get_dummies(health, columns=['Country'], dtype=int)
display(health_onehot.sample(10))
Index | Year | Spending_USD | Life_Expectancy | Country_Canada | Country_France | Country_Germany | Country_Great Britain | Country_Japan | Country_USA |
---|---|---|---|---|---|---|---|---|---|
113 | 1994 | 2188.676 | 76.5 | 0 | 0 | 1 | 0 | 0 | 0 |
138 | 1998 | 2321.931 | 78.8 | 0 | 1 | 0 | 0 | 0 | 0 |
218 | 2011 | 3740.756 | 82.7 | 0 | 0 | 0 | 0 | 1 | 0 |
245 | 2016 | 5669.064 | 81.0 | 0 | 0 | 1 | 0 | 0 | 0 |
201 | 2008 | 7385.026 | 78.1 | 0 | 0 | 0 | 0 | 0 | 1 |
97 | 1991 | 842.797 | 75.9 | 0 | 0 | 0 | 1 | 0 | 0 |
250 | 2017 | 5150.470 | 81.9 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 1970 | 192.143 | 72.2 | 0 | 1 | 0 | 0 | 0 | 0 |
151 | 2000 | 1897.202 | 77.9 | 0 | 0 | 0 | 1 | 0 | 0 |
256 | 2018 | 5308.356 | 82.0 | 1 | 0 | 0 | 0 | 0 | 0 |
for i, frame in health.groupby("Country"):
    plt.scatter(frame["Spending_USD"], frame["Life_Expectancy"], marker="o", label=i)
plt.xlabel("Expenditure (USD)")
plt.ylabel("Life expectancy")
plt.legend()
Start of solution
health_onehot = pd.get_dummies(health, columns=['Country'], drop_first=False, dtype=int)
display(health.sample(10))
Index | Year | Country | Spending_USD | Life_Expectancy |
---|---|---|---|---|
106 | 1993 | Canada | 1930.889 | 77.8 |
65 | 1985 | France | 1001.145 | 75.4 |
28 | 1977 | Japan | 340.628 | 75.3 |
186 | 2006 | France | 3444.855 | 81.0 |
193 | 2007 | Great Britain | 3021.671 | 79.7 |
231 | 2013 | USA | 8519.620 | 78.8 |
228 | 2013 | France | 4544.964 | 82.3 |
1 | 1970 | France | 192.143 | 72.2 |
24 | 1976 | Germany | 591.098 | 71.8 |
214 | 2011 | Canada | 4228.962 | 81.4 |
Simple regression model
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = np.array(health_onehot["Spending_USD"]).reshape(-1, 1)
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual Data')
plt.plot(X_test, predicted_life_expectancy, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Spending vs Life Expectancy')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend()
plt.show()

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 7.846016617615249
R^2 Score: 0.3573359515082699
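An R^2 of about 0.36 means that spending alone accounts for roughly a third of the variance in life expectancy on the test set. To make the two metrics concrete, the sketch below computes them by hand on a few hypothetical values (not from the healthexp data) and compares the results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values (not from the healthexp data)
y_true = np.array([72.0, 75.0, 78.0, 81.0])
y_pred = np.array([73.0, 74.0, 79.0, 80.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

MSE is in squared units of the target (years squared here), while R^2 is unitless and compares the model with always predicting the mean.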
A suspiciously “good” model: what has happened here?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = health_onehot.drop(columns=['Life_Expectancy'])
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
# plt.get_cmap replaces the deprecated plt.cm.get_cmap
colors = plt.get_cmap('tab20', len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, color=colors(i), edgecolor=colors(i), facecolors='none')

    # Predict life expectancy for the subset
    subset_X = health_onehot[health_onehot[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)

    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i))*0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding: Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 0.13772868450150377
R^2 Score: 0.9887186991451874
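A useful first check when a model looks too good is to list exactly which columns it was given. Dropping only the target leaves every other column in X. The sketch below shows this on a tiny hypothetical frame that has the same column names as healthexp; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows with the same column names as the healthexp dataset
toy = pd.DataFrame({
    'Year': [1970, 1970, 2018, 2018],
    'Country': ['France', 'Japan', 'France', 'Japan'],
    'Spending_USD': [200.0, 150.0, 5000.0, 4500.0],
    'Life_Expectancy': [72.0, 72.5, 83.0, 84.0],
})

toy_onehot = pd.get_dummies(toy, columns=['Country'])
X = toy_onehot.drop(columns=['Life_Expectancy'])

# Every column left in X is used as an explanatory variable
feature_names = list(X.columns)
print(feature_names)
```

Here the year ends up among the explanatory variables. In the sampled rows shown earlier, both spending and life expectancy rise over the decades, so the year plausibly carries much of the signal.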
Without year as an explanatory variable
# Drop the 'Year' column from the dataset
health_onehot_no_year = health_onehot.drop(columns=['Year'])

# Prepare the data for linear regression
X = health_onehot_no_year.drop(columns=['Life_Expectancy'])
y = health_onehot_no_year['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
# plt.get_cmap replaces the deprecated plt.cm.get_cmap
colors = plt.get_cmap('tab20', len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, color=colors(i), edgecolor=colors(i), facecolors='none')

    # Predict life expectancy for the subset
    subset_X = health_onehot_no_year[health_onehot_no_year[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)

    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i))*0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding (No Year): Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 2.3013732097838697
R^2 Score: 0.8114954509821513