This notebook is the basis for both the programs and the write-up. Add your contributions directly to it:
Send the ipynb file before February 20 to: bertrand.le_saux@onera.fr
Mention the students' names in both the email subject and the file name.
This lab deals with the analysis of statistics extracted from the Game of Thrones book series. More precisely, the goal is to predict whether characters survive or die, based on the data made available.
The data comes from the fan site: https://got.show/machine-learning-algorithm-predicts-death-game-of-thrones
The GoT data file, slightly modified for this lab, is available here: character-predictions-new.csv
The main steps are:
- load the data and explore it with survival histograms,
- train and evaluate a first decision tree, then tune its hyper-parameters,
- train random forests and inspect feature importances,
- train boosting ensembles (AdaBoost, gradient boosting) and compare.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
character_predictions = pd.read_csv('character-predictions-new.csv')
df = character_predictions
# Example function plotting a survival histogram for a given column
def plot(cat):
    df.groupby(cat).isAlive.mean().plot(kind='bar')
    plt.ylabel('Percent Alive')
    plt.ylim([0.0, 1.0])
    plt.show()
df.keys()
df
plot('isPopular')
# Derived binary feature: is the character over 30 years old?
df['is_over'] = df['age'].apply(lambda x: 1 if x > 30 else 0)
plot('is_over')
df.drop('is_over', axis=1, inplace=True)
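The same binarisation idea generalises to several age bands; a minimal sketch using pandas' pd.cut (the bin edges and the 'age_band' column name are illustrative choices, not part of the assignment):

# Illustrative: discretize age into bands with pd.cut, plot survival per band,
# then drop the temporary column (bin edges chosen arbitrarily).
df['age_band'] = pd.cut(df['age'], bins=[0, 15, 30, 50, 120],
                        labels=['child', 'young', 'adult', 'old'])
plot('age_band')
df.drop('age_band', axis=1, inplace=True)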
plot("culture")
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
## the 'age' column has NaN values, which scikit-learn does not accept
## (sklearn.preprocessing.Imputer was removed; SimpleImputer replaces it)
from sklearn.impute import SimpleImputer
## replace NaN by the average age
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
df["age"] = imp.fit_transform(df[["age"]]).ravel()
print(df['age'].mean())
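Mean imputation is simple but sensitive to outliers; the median is a one-line alternative. A sketch reusing the SimpleImputer import above (the median imputer is kept commented out since it would replace the mean imputer, not run after it):

# Sanity check: which columns still contain missing values after imputation?
na_counts = df.isna().sum()
print(na_counts[na_counts > 0])
# Alternative strategy (illustrative, would replace the mean imputer above):
# imp = SimpleImputer(missing_values=np.nan, strategy="median")
# df["age"] = imp.fit_transform(df[["age"]]).ravel()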
#df
# prepare training set and corresponding labels
feature_cols = ['male', 'book1', 'book2', 'book3', 'book4', 'book5', 'isMarried', 'isNoble',
                'popularity', 'name_in_house', 'boolDeadRelations', 'age', 'numDeadRelations']
X = df[feature_cols]
y = df.isAlive
indices = np.arange( len(y) )
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(X, y, indices, random_state=0)
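By default train_test_split shuffles but does not preserve the alive/dead ratio across the two sets; if the classes turn out to be imbalanced, stratifying on y is a safer choice. A sketch (kept commented out so the split used in the rest of the notebook is unchanged):

# Check the class balance of the labels...
print(y.value_counts(normalize=True))
# ...and, if needed, stratify the split so both sets keep that ratio:
# X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
#     X, y, indices, random_state=0, stratify=y)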
from sklearn.tree import DecisionTreeClassifier
# train a decision tree
# default parameters (Gini coefficient)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train,y_train)
# test the classifier
y_pred=clf.predict(X_test)
# compute classification accuracy
print(clf.score(X_test, y_test))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print(clf.classes_)
print(cnf_matrix)
print(classification_report(y_test, y_pred))
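To inspect what the tree actually learned, scikit-learn can print its decision rules as text; a minimal sketch with sklearn.tree.export_text, truncated to the first three levels for readability:

from sklearn.tree import export_text
# Print the top of the learned tree as nested if/else rules.
print(export_text(clf, feature_names=feature_cols, max_depth=3))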
def next_popular_dead(df, y_pred, y_test, idx_test):
    # print the names of the characters predicted dead;
    # idx_test maps test-set positions back to rows of df
    badguys = df['name'].values
    pop = df['isPopular'].values
    true_dead = np.where((y_pred == 0) & (y_pred == y_test))[0]
    false_dead = np.where((y_pred == 0) & (y_pred != y_test))[0]
    #print(badguys[idx_test[true_dead]])
    #print(badguys[idx_test[false_dead]])
    # predicted dead but still alive: potential popular deaths
    # in seasons 6 and 7... SPOILER!
    false_dead_badguys = badguys[idx_test[false_dead]]
    return false_dead_badguys[pop[idx_test[false_dead]] == 1]
print('next popular dead: ')
print(next_popular_dead(df, y_pred, y_test, idx_test))
# train a decision tree
# hand-tuned hyper-parameters: entropy criterion, deeper tree allowed,
# and a larger minimum fraction of samples required to split a node
clf = DecisionTreeClassifier(random_state=0, criterion='entropy', max_depth=12, min_samples_split=0.35)
clf.fit(X_train,y_train)
# test the classifier
y_pred=clf.predict(X_test)
# compute classification accuracy
print(clf.score(X_test, y_test))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print(clf.classes_)
print(cnf_matrix)
print(classification_report(y_test, y_pred))
print('next popular dead: ')
print(next_popular_dead(df, y_pred, y_test, idx_test))
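The hyper-parameters above were picked by hand; a grid search with cross-validation is the systematic way to choose them. A sketch with GridSearchCV over a small, illustrative grid (the candidate values are assumptions, not the ones the assignment asks for):

from sklearn.model_selection import GridSearchCV
# Small illustrative grid; 5-fold cross-validation on the training set.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 8, 12, None],
    'min_samples_split': [2, 0.1, 0.35],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)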
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestClassifier(max_depth=3, random_state=0)
rf.fit(X_train, y_train)
# All possible parameters:
#RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
# max_depth=2, max_features='auto', max_leaf_nodes=None,
# min_impurity_decrease=0.0, min_impurity_split=None,
# min_samples_leaf=1, min_samples_split=2,
# min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
# oob_score=False, random_state=0, verbose=0, warm_start=False)
# sort features by importance and print them (most important first)
idx_imp = np.argsort(rf.feature_importances_, axis=None)
#print(rf.feature_importances_[idx_imp[::-1]])
#print(idx_imp[::-1])
for i in idx_imp[::-1]:
    print(feature_cols[i])
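The same ranking reads more easily as a bar chart; a short sketch reusing idx_imp from above:

# Bar chart of feature importances, most important first.
order = idx_imp[::-1]
plt.bar(range(len(order)), rf.feature_importances_[order])
plt.xticks(range(len(order)), [feature_cols[i] for i in order], rotation=90)
plt.ylabel('Importance')
plt.show()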
# test the classifier
y_pred=rf.predict(X_test)
# compute classification accuracy
print(rf.score(X_test, y_test))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print(rf.classes_)
print(cnf_matrix)
print(classification_report(y_test, y_pred))
print('next popular dead: ')
print(next_popular_dead(df, y_pred, y_test, idx_test))
# a larger forest: more trees, deeper, with out-of-bag scoring enabled
rf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                            n_estimators=130, oob_score=True, max_depth=12, random_state=0)
rf.fit(X_train, y_train)
# test the classifier
y_pred=rf.predict(X_test)
# compute classification accuracy
print(rf.score(X_test, y_test))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print(rf.classes_)
print(cnf_matrix)
print(classification_report(y_test, y_pred))
print('next popular dead: ')
print(next_popular_dead(df, y_pred, y_test, idx_test))
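Since oob_score=True was passed above, the forest also computed an out-of-bag accuracy estimate during training, which comes for free without touching the test set:

# Out-of-bag accuracy estimate (available because oob_score=True).
print('OOB score:', rf.oob_score_)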
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
adaboo = AdaBoostClassifier(random_state=0, algorithm='SAMME')
# all parameters (not that many!)
#AdaBoostClassifier(base_estimator=None, n_estimators=50,
#                   learning_rate=1.0, algorithm='SAMME.R', random_state=None)
adaboo.fit(X_train, y_train)
# test the classifier
y_pred=adaboo.predict(X_test)
# compute classification accuracy
print(adaboo.score(X_test, y_test))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print(adaboo.classes_)
print(cnf_matrix)
print(classification_report(y_test, y_pred))
print('next popular dead: ')
print(next_popular_dead(df, y_pred, y_test, idx_test))
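Boosting adds weak learners one at a time, so we can watch the test accuracy evolve round by round; a sketch using AdaBoost's staged_score:

# Test accuracy after each boosting round (one value per weak learner added).
staged = list(adaboo.staged_score(X_test, y_test))
plt.plot(range(1, len(staged) + 1), staged)
plt.xlabel('number of weak learners')
plt.ylabel('test accuracy')
plt.show()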
graboo = GradientBoostingClassifier(random_state=0, max_depth=5, subsample=0.5)
# all parameters
#GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100,
#                           subsample=1.0, criterion='friedman_mse', min_samples_split=2,
#                           min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
#                           min_impurity_decrease=0.0, min_impurity_split=None, init=None,
#                           random_state=None, max_features=None, verbose=0,
#                           max_leaf_nodes=None, warm_start=False, presort='auto')
graboo.fit(X_train, y_train)
# test the classifier
y_pred=graboo.predict(X_test)
# compute classification accuracy
print(graboo.score(X_test, y_test))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print(graboo.classes_)
print(cnf_matrix)
print(classification_report(y_test, y_pred))
print('next popular dead: ')
print(next_popular_dead(df, y_pred, y_test, idx_test))
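All the predictions above are hard 0/1 decisions; ranking test characters by their predicted probability of death gives a more graded "who dies next" list. A sketch using predict_proba on the gradient-boosting model (the top-10 cut-off is arbitrary):

# Probability of class 0 ('dead') for each test character, most at risk first.
proba_dead = graboo.predict_proba(X_test)[:, 0]
order = np.argsort(proba_dead)[::-1]
names = df['name'].values[idx_test[order]]
popular = df['isPopular'].values[idx_test[order]] == 1
print(names[popular][:10])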