| | Animal | Length | Height |
|---|---|---|---|
| 93 | Zebra | 2.499837 | 1.621510 |
| 17 | Lion | 2.632459 | 1.147693 |
| 167 | Penguin | 0.798167 | 0.585082 |
| 60 | Elephant | 3.388946 | 2.996978 |
| 148 | Koala | 0.769779 | 0.730835 |
| 127 | Panda | 1.438418 | 0.790432 |
| 45 | Tiger | 2.919400 | 1.186447 |
| 16 | Lion | 2.502121 | 1.233868 |
Lecture notes – Mandatory assignment
Titanic
In the film about the Titanic disaster, survival follows exactly the pattern one might expect: women in first class survive, and men in third class die. But how well can we use what we know about a passenger to predict whether that person will survive the Titanic disaster? And can we use such a model to understand something about how a person's different attributes, such as sex, age, and ticket price, affect survival in a shipwreck?
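As a small preview of the kind of model we are after, here is a minimal sketch: a logistic regression fitted on a tiny invented passenger table (the numbers below are made up for illustration; the real Titanic data comes later in the assignment):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A tiny invented passenger table in the spirit of the Titanic data
passengers = pd.DataFrame({
    'female':   [1, 1, 0, 0, 0, 1, 0, 1],
    'pclass':   [1, 1, 3, 3, 2, 3, 1, 2],
    'survived': [1, 1, 0, 0, 0, 1, 0, 1],
})

clf = LogisticRegression()
clf.fit(passengers[['female', 'pclass']], passengers['survived'])

# Predicted survival probability: woman in first class vs man in third class
woman_first = clf.predict_proba(pd.DataFrame({'female': [1], 'pclass': [1]}))[0, 1]
man_third = clf.predict_proba(pd.DataFrame({'female': [0], 'pclass': [3]}))[0, 1]
print(woman_first > man_third)
```

Even on this toy table, the model assigns the first-class woman a clearly higher survival probability than the third-class man, which is the sort of pattern we will try to quantify properly later.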
But first
- Height and length of different animals
We have created a dataset with length and height data for different animals. It can be downloaded here.
Plot

import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset from the CSV file
animal_data = pd.read_csv('data/animal_data.csv')

plt.figure(figsize=(8, 5))
plt.plot(animal_data["Length"], animal_data["Height"], "o")
Single-variable linear regression

from sklearn.linear_model import LinearRegression

# Prepare the data for linear regression
X = animal_data[['Length']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(animal_data['Length'], animal_data['Height'], label='Actual Data')
plt.plot(animal_data['Length'], predicted_heights, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend()
plt.show()
\[H(L) = \beta_0 + \beta_1 L\]
| Symbol | Description |
|---|---|
| H | Height |
| L | Length |
| \(\beta_i\) | Regression coefficients |
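The fitted coefficients can be read directly off the model object: `intercept_` holds \(\beta_0\) and `coef_` holds \(\beta_1\). A small sketch on synthetic data generated from a known line (invented values, not the animal dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic length/height data drawn from a known line: H = 0.5 + 0.4 * L
rng = np.random.default_rng(0)
lengths = rng.uniform(0.5, 3.5, size=50).reshape(-1, 1)
heights = 0.5 + 0.4 * lengths.ravel() + rng.normal(0, 0.01, size=50)

model = LinearRegression().fit(lengths, heights)

beta0 = model.intercept_  # estimate of beta_0, close to 0.5
beta1 = model.coef_[0]    # estimate of beta_1, close to 0.4
```

Because we know the true line here, we can check that the estimates land where they should; with real data these attributes are how you inspect what the model learned.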
How can we do better?
- Suggestions?
What about encoding the animals numerically?
from sklearn.linear_model import LinearRegression

# Encode the animal names as numbers
animal_data['Animal_Code'] = animal_data['Animal'].astype('category').cat.codes

# Display a random selection of 10 rows from the dataset
display(animal_data.sample(10))
| | Animal | Length | Height | Animal_Code |
|---|---|---|---|---|
| 40 | Tiger | 2.949765 | 1.097031 | 8 |
| 147 | Koala | 0.659256 | 0.801045 | 3 |
| 127 | Panda | 1.438418 | 0.790432 | 6 |
| 108 | Kangaroo | 1.500805 | 1.836083 | 2 |
| 152 | Penguin | 0.507209 | 0.551672 | 7 |
| 157 | Penguin | 0.778098 | 0.645235 | 7 |
| 171 | Ostrich | 2.051805 | 2.451676 | 5 |
| 174 | Ostrich | 1.964196 | 2.526428 | 5 |
| 43 | Tiger | 2.739377 | 1.144657 | 8 |
| 122 | Panda | 1.321700 | 0.626105 | 6 |
Illustration of the dataset with numeric codes

import matplotlib as mpl

# Plot the animal data with animal codes
plt.figure(figsize=(10, 6))
colors = mpl.colormaps['tab10'].resampled(len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))

    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    plt.text(avg_length, avg_height + 0.3, f'{i}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Animal Length vs Height with Animal Codes')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()
Linear regression with numeric codes

import matplotlib as mpl

# Prepare the data for linear regression
X = animal_data[['Length', 'Animal_Code']]
y = animal_data['Height']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the heights using the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(10, 6))
colors = mpl.colormaps['tab10'].resampled(len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, color=colors(i))

    # Predict heights for the subset
    subset_X = subset[['Length', 'Animal_Code']]
    subset_predicted_heights = model.predict(subset_X)

    # Plot the regression line for the subset
    plt.plot(subset['Length'], subset_predicted_heights, color=colors(i))

    # Print the animal code above the average length and height
    avg_length = subset['Length'].mean()
    avg_height = subset['Height'].mean()
    plt.text(avg_length, avg_height + 0.3, f'{i}', fontsize=20, color=colors(i))

# Add labels and title
plt.title('Two Variable Linear Regression: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()
- How did this actually go? Look closely.
If we look at the numbers, we see that the regression lines come out ordered by code. How the animals happen to be numbered therefore affects the model, which is a bit strange.
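That the arbitrary ordering matters can be demonstrated directly: give the same animals two different code assignments, and the quality of the fit changes. A sketch with invented base heights (hypothetical numbers, not the course dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
animals = ['Lion', 'Penguin', 'Zebra']
base_height = {'Lion': 1.2, 'Penguin': 0.6, 'Zebra': 1.6}  # invented base heights

df = pd.DataFrame({'Animal': rng.choice(animals, size=90)})
df['Length'] = rng.uniform(1, 3, size=90)
df['Height'] = df['Animal'].map(base_height) + 0.05 * df['Length']

def r2_with_codes(codes):
    """Fit Height ~ Length + Code for a given animal-to-number assignment."""
    X = pd.DataFrame({'Length': df['Length'], 'Code': df['Animal'].map(codes)})
    model = LinearRegression().fit(X, df['Height'])
    return r2_score(df['Height'], model.predict(X))

# Same animals, same data, two different numberings
r2_a = r2_with_codes({'Lion': 0, 'Penguin': 1, 'Zebra': 2})
r2_b = r2_with_codes({'Lion': 1, 'Penguin': 0, 'Zebra': 2})
print(r2_a, r2_b)  # the fit quality depends on the arbitrary ordering
```

The second numbering happens to sort the animals by height, so the line through the codes fits well; the first does not. A model whose quality depends on an arbitrary labeling choice is clearly not what we want.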
Equation
\[H(L, N) = \beta_0 + \beta_1 L + \beta_2 N\]
| Symbol | Description |
|---|---|
| H | Height |
| L | Length |
| N | Numeric code for the animal |
| \(\beta_i\) | Regression coefficients |
How can we do better?
- One-hot encoding
With one-hot encoding
We can create one-hot-encoded data with pandas.get_dummies(...)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode the animal names with pandas
transformed_data = pd.get_dummies(animal_data, columns=['Animal'], drop_first=False)
display(transformed_data.sample(10))
| | Length | Height | Animal_Code | Animal_Elephant | Animal_Giraffe | Animal_Kangaroo | Animal_Koala | Animal_Lion | Animal_Ostrich | Animal_Panda | Animal_Penguin | Animal_Tiger | Animal_Zebra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 2.666154 | 1.282923 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 169 | 0.787750 | 0.620942 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 39 | 2.623555 | 1.121049 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 133 | 1.436339 | 0.714177 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 78 | 3.074540 | 5.400244 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 175 | 2.289047 | 2.511948 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 29 | 2.901872 | 1.066009 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 106 | 1.724647 | 1.953400 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 117 | 1.616279 | 1.795778 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 176 | 1.932531 | 2.407404 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
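The two scikit-learn imports above (`OneHotEncoder` and `ColumnTransformer`) are scikit-learn's own route to the same encoding, which is convenient inside pipelines. A minimal sketch on a small stand-in frame (invented values):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# A small stand-in for animal_data (invented values)
df = pd.DataFrame({'Animal': ['Lion', 'Zebra', 'Lion'], 'Length': [2.5, 2.4, 2.6]})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['Animal'])],
    remainder='passthrough',  # keep the remaining numeric columns as-is
)
encoded = ct.fit_transform(df)
print(encoded.shape)  # 3 rows; 2 dummy columns (Lion, Zebra) plus Length
```

`pd.get_dummies` is the quickest option for a one-off notebook; the `ColumnTransformer` route pays off when the same preprocessing must be reapplied to new data at prediction time.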
Regression model with one-hot encoding

import numpy as np
import matplotlib as mpl

X = transformed_data.drop(columns=['Height', 'Animal_Code'])
y = transformed_data['Height']

model = LinearRegression()
model.fit(X, y)

# Use the model
predicted_heights = model.predict(X)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
colors = mpl.colormaps['tab10'].resampled(len(animal_data['Animal'].unique()))

for i, animal in enumerate(animal_data['Animal'].unique()):
    subset = animal_data[animal_data['Animal'] == animal]
    plt.scatter(subset['Length'], subset['Height'], label=animal, edgecolors=colors(i), facecolors='none')

    # Predict heights for each animal
    subset_X = transformed_data[transformed_data[f'Animal_{animal}'] == 1].drop(columns=['Height', 'Animal_Code'])
    subset_predicted_heights = model.predict(subset_X)

    # Plot the regression line for each individual animal
    plt.plot(subset['Length'], subset_predicted_heights, color=np.array(colors(i)) * 0.9, linewidth=5)

# Decoration
plt.title('Linear Regression with One-Hot Encoding: Animal Length vs Height')
plt.xlabel('Length (m)')
plt.ylabel('Height (m)')
plt.legend(title='Animal')
plt.show()
Equation
\[H(L, \text{animal}) = \beta_0 + \beta_1 L + \sum_{i \in \{\text{Lion}, \text{Tiger}, \ldots\}} \beta_i \, [\text{is this an }i\text{?}]\]
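One payoff of this formulation is that each \(\beta_i\) can be read out per animal by pairing the column names with `model.coef_`. A sketch on synthetic data (invented base heights; note that with all dummy columns kept, the individual dummy coefficients are only determined up to a shared constant, but their differences are meaningful):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: invented base heights plus a shared length effect of 0.1
rng = np.random.default_rng(2)
base = {'Lion': 1.2, 'Penguin': 0.6}
df = pd.DataFrame({'Animal': rng.choice(list(base), size=80)})
df['Length'] = rng.uniform(1, 3, size=80)
df['Height'] = df['Animal'].map(base) + 0.1 * df['Length']

encoded = pd.get_dummies(df, columns=['Animal'])
X = encoded.drop(columns=['Height'])
model = LinearRegression().fit(X, encoded['Height'])

# Pair each column with its fitted coefficient
coefs = dict(zip(X.columns, model.coef_))
print(coefs['Length'])  # recovers the shared length effect, about 0.1
```

The difference `coefs['Animal_Lion'] - coefs['Animal_Penguin']` recovers the height gap between the two animals, which is exactly the kind of interpretable quantity the equation above promises.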
Let us look at this in a slightly smaller dataset
import pandas as pd
import seaborn as sns

health = sns.load_dataset('healthexp')
display(health.sample(10))
| | Year | Country | Spending_USD | Life_Expectancy |
|---|---|---|---|---|
| 126 | 1996 | France | 2169.451 | 78.2 |
| 222 | 2012 | France | 4299.434 | 82.1 |
| 40 | 1980 | Great Britain | 385.099 | 73.2 |
| 154 | 2001 | Canada | 2624.293 | 79.3 |
| 272 | 2020 | Japan | 4665.641 | 84.7 |
| 84 | 1989 | Canada | 1579.543 | 77.1 |
| 250 | 2017 | Canada | 5150.470 | 81.9 |
| 201 | 2008 | USA | 7385.026 | 78.1 |
| 48 | 1982 | Canada | 996.086 | 75.6 |
| 110 | 1993 | Japan | 1332.213 | 79.4 |
Here we use seaborn only to load a dataset. Seaborn also offers convenient, good-looking statistical visualizations, for those who may be interested.
In-class exercise
- One-hot encode the healthexp dataset
- Make a training/validation split of the dataset
- Train a linear regression model to predict life expectancy, with spending as the explanatory variable
- Include country as an explanatory variable in the model
- Compare the accuracy of the models
import pandas as pd
import seaborn as sns

health = sns.load_dataset('healthexp')
health_onehot = pd.get_dummies(health, columns=['Country'])
display(health_onehot.sample(10))
| | Year | Spending_USD | Life_Expectancy | Country_Canada | Country_France | Country_Germany | Country_Great Britain | Country_Japan | Country_USA |
|---|---|---|---|---|---|---|---|---|---|
| 223 | 2012 | 3614.131 | 81.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 171 | 2003 | 5726.538 | 77.1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 211 | 2010 | 3441.710 | 80.6 | 0 | 0 | 0 | 1 | 0 | 0 |
| 93 | 1990 | 1088.959 | 78.9 | 0 | 0 | 0 | 0 | 1 | 0 |
| 131 | 1997 | 2496.201 | 77.3 | 0 | 0 | 1 | 0 | 0 | 0 |
| 249 | 2016 | 9717.649 | 78.7 | 0 | 0 | 0 | 0 | 0 | 1 |
| 193 | 2007 | 3021.671 | 79.7 | 0 | 0 | 0 | 1 | 0 | 0 |
| 209 | 2010 | 4423.070 | 80.5 | 0 | 0 | 1 | 0 | 0 | 0 |
| 149 | 2000 | 2895.533 | 78.2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 163 | 2002 | 2287.476 | 78.3 | 0 | 0 | 0 | 1 | 0 | 0 |
for i, frame in health.groupby("Country"):
    plt.scatter(frame["Spending_USD"], frame["Life_Expectancy"], marker="o", label=i)

plt.xlabel("Expenditure (USD)")
plt.ylabel("Life expectancy")
plt.legend()
Start of a solution

health_onehot = pd.get_dummies(health, columns=['Country'], drop_first=False)
display(health.sample(10))
| | Year | Country | Spending_USD | Life_Expectancy |
|---|---|---|---|---|
| 74 | 1987 | Canada | 1357.453 | 76.7 |
| 57 | 1983 | USA | 1451.945 | 74.6 |
| 159 | 2001 | USA | 4888.518 | 76.9 |
| 153 | 2000 | USA | 4536.561 | 76.7 |
| 152 | 2000 | Japan | 1847.786 | 81.2 |
| 52 | 1982 | USA | 1329.669 | 74.5 |
| 137 | 1998 | Germany | 2566.003 | 77.7 |
| 22 | 1975 | USA | 560.750 | 72.7 |
| 113 | 1994 | Germany | 2188.676 | 76.5 |
| 104 | 1992 | Japan | 1253.415 | 79.2 |
Simple regression model

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array(health_onehot["Spending_USD"]).reshape(-1, 1)
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual Data')
plt.plot(X_test, predicted_life_expectancy, color='red', label='Linear Regression Model')

# Add labels and title
plt.title('Linear Regression: Spending vs Life Expectancy')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend()
plt.show()

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 7.846016617615249
R^2 Score: 0.3573359515082699
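Is an R² of about 0.36 good or bad? A useful reference point is a baseline that always predicts the mean of the training targets, which lands near R² = 0 on held-out data; anything clearly above that is signal. A sketch with scikit-learn's DummyRegressor on synthetic data (invented numbers, not the healthexp dataset):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic spending/life-expectancy-like data (invented numbers)
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = 70 + 0.8 * X.ravel() + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

r2_baseline = r2_score(y_test, baseline.predict(X_test))
r2_model = r2_score(y_test, model.predict(X_test))
print(r2_baseline, r2_model)  # baseline lands near 0; the real model beats it
```

Against that yardstick, 0.36 is a real but modest improvement over knowing nothing about the country.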
A strikingly "good" model: what has happened here?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib as mpl

X = health_onehot.drop(columns=['Life_Expectancy'])
y = health_onehot['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
colors = mpl.colormaps['tab20'].resampled(len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, edgecolors=colors(i), facecolors='none')

    # Predict life expectancy for the subset
    subset_X = health_onehot[health_onehot[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)

    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i)) * 0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding: Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 0.13772868450150377
R^2 Score: 0.9887186991451874
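One reason for the jump to R² ≈ 0.99 deserves attention: Year climbs steadily through the period, and life expectancy climbs with it, so Year alone is already an excellent predictor regardless of any story about spending. A sketch of that effect with a synthetic time trend (invented numbers):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Life expectancy rising steadily over time (invented trend and noise)
rng = np.random.default_rng(4)
years = np.arange(1970, 2021).reshape(-1, 1)
life_expectancy = 70 + 0.2 * (years.ravel() - 1970) + rng.normal(0, 0.3, size=years.size)

# Year alone already explains almost all the variation
model = LinearRegression().fit(years, life_expectancy)
r2 = r2_score(life_expectancy, model.predict(years))
print(round(r2, 3))
```

This is why the next step drops Year: we want to see how much spending and country explain on their own.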
Without year as an explanatory variable

import numpy as np
import matplotlib as mpl

# Drop the 'Year' column from the dataset
health_onehot_no_year = health_onehot.drop(columns=['Year'])

# Prepare the data for linear regression
X = health_onehot_no_year.drop(columns=['Life_Expectancy'])
y = health_onehot_no_year['Life_Expectancy']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the life expectancy using the model on the test data
predicted_life_expectancy = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predicted_life_expectancy)
r2 = r2_score(y_test, predicted_life_expectancy)

# Plot the original data and the linear regression model predictions per country
plt.figure(figsize=(8, 5))
colors = mpl.colormaps['tab20'].resampled(len(health['Country'].unique()))

for i, country in enumerate(health['Country'].unique()):
    subset = health[health['Country'] == country]
    plt.scatter(subset['Spending_USD'], subset['Life_Expectancy'], label=country, edgecolors=colors(i), facecolors='none')

    # Predict life expectancy for the subset
    subset_X = health_onehot_no_year[health_onehot_no_year[f'Country_{country}'] == 1].drop(columns=['Life_Expectancy'])
    subset_predicted_life_expectancy = model.predict(subset_X)

    # Plot the regression line for the subset
    plt.plot(subset['Spending_USD'], subset_predicted_life_expectancy, color=np.array(colors(i)) * 0.9, linewidth=2)

# Add labels and title
plt.title('Linear Regression with One-Hot Encoding (No Year): Spending vs Life Expectancy by Country')
plt.xlabel('Spending (USD)')
plt.ylabel('Life Expectancy (years)')
plt.legend(title='Country', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 2.3013732097838697
R^2 Score: 0.8114954509821513