Background

Source: XGBoost in Python from Start to Finish

Introduction

importing data from a file
missing data
XGBoost data format
building a model
optimizing a model

Confusion Matrix

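rows: predicted result
columns: actual result
outputs: TP/TN/FP/FN (true positives, true negatives, false positives, false negatives)
note: scikit-learn's confusion_matrix uses the opposite layout (rows = actual labels, columns = predicted labels)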
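
To make the four outputs concrete, here is a minimal sketch in plain Python that counts them by hand. The function name confusion_counts and the toy labels are made up for illustration; 1 stands for the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP/TN/FP/FN for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Toy labels, invented for this example
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```

The four counts always add up to the number of samples, which is a quick sanity check when reading a confusion matrix.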

# install the required modules (the PyPI package for sklearn is named scikit-learn)
!pip install pandas numpy scikit-learn xgboost fsspec matplotlib ipympl

import pandas as pd
import numpy as np
import xgboost as xgboost

from sklearn.model_selection import train_test_split, GridSearchCV
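# plot_confusion_matrix exists only in scikit-learn < 1.2; newer versions provide ConfusionMatrixDisplay instead
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer, confusion_matrix, plot_confusion_matrix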

Import the data

# download the data
!wget 'https://raw.githubusercontent.com/bhargitay/telco_customer_churn/master/data/WA_Fn-UseC_-Telco-Customer-Churn.csv'

# read the CSV with pandas (adjust the path to wherever the file was saved)
df = pd.read_csv('G://temp//WA_Fn-UseC_-Telco-Customer-Churn.csv')

# show the first rows of the dataframe
df.head()

# drop the unwanted column
# axis=1 -> drop a column (not a row); inplace=True -> modify df itself
df.drop(['customerID'], axis=1, inplace=True)

# list the distinct values of a column
df['gender'].unique()

array(['Female', 'Male'], dtype=object)

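# replace spaces with underscores in column names and values
# (assign the result back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas)
df['MultipleLines'] = df['MultipleLines'].replace(' ', '_', regex=True)
df.columns = df.columns.str.replace(' ', '_')
df.head()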

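# replace every space in the data with an underscore
# NOTE: run this only after the blank TotalCharges values (' ') have been handled in the
# "Missing data" step; otherwise they become '_' and the == ' ' lookup no longer finds them
df.replace(' ', '_', regex=True, inplace=True)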
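
A small sketch of why the order of these steps matters. The two-row dataframe is made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy data, invented for this example; " " marks a missing TotalCharges value
toy = pd.DataFrame({
    "PaymentMethod": ["Mailed check", "Electronic check"],
    "TotalCharges": ["29.85", " "],
})

# Replace every space inside every string value with an underscore
replaced = toy.replace(" ", "_", regex=True)

# Count how many blank markers can still be found before and after
blank_before = int((toy["TotalCharges"] == " ").sum())
blank_after = int((replaced["TotalCharges"] == " ").sum())
print(replaced)
print(blank_before, blank_after)
```

After the replacement the blank marker has turned into "_", so a later search for " " comes back empty. Handling the blanks first avoids this.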

Missing data

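Missing data means fields such as NA or blank strings. Following the source tutorial, missing values are set to 0 here. Note that XGBoost's own default marker for missing values is NaN (the missing parameter), so 0 is treated as missing only if the model is told so.

To look for missing data, start by inspecting the dtypes attribute. A column that should be numeric but shows up as object contains at least one non-numeric entry: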

df.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

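For every column of type object, check whether its values make sense.

Once a suspicious column and its problematic values have been identified (here TotalCharges, which should be numeric), locate those values: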

# find location
df.loc[(df['TotalCharges'] == ' ')]

# set to 0
df.loc[(df['TotalCharges'] == ' '), 'TotalCharges'] = 0
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])
df.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
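
As an alternative to searching for one known bad value, pd.to_numeric with errors="coerce" turns anything unparseable into NaN, which makes the bad rows easy to list. This is a sketch on made-up data, not the step used above; the toy values are assumptions.

```python
import pandas as pd

# Toy column, invented for this example; two entries are blank strings
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", " "]})

# Unparseable entries become NaN instead of raising an error
as_number = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Row labels of the entries that could not be parsed
bad_rows = toy.index[as_number.isna()].tolist()

# Follow the convention of these notes and set the missing values to 0
toy["TotalCharges"] = as_number.fillna(0)
print(bad_rows)
print(toy.dtypes)
```

This approach also catches bad values that were not anticipated, at the cost of silently coercing them, so it is worth inspecting bad_rows before filling.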

XGBoost data format

First, separate the features from the target value to predict:

X = df.drop('Churn', axis=1).copy()
X.head()

y = df['Churn'].copy()
y.head()

0     No
1     No
2    Yes
3     No
4    Yes
Name: Churn, dtype: object

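Next, convert the categorical features to one-hot encoding. One-hot encoding keeps every category value at the same distance from every other, so the model does not treat some categories as closer together than others (as it would if they were coded as 1, 2, 3, ...). Use get_dummies to do the conversion.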

y_encoded = pd.get_dummies(y)['Yes']

X_encoded = pd.get_dummies(X, columns=['gender', 'Partner', 'Dependents', 'PhoneService',
                                       'MultipleLines', 'InternetService', 'OnlineSecurity',
                                       'OnlineBackup', 'DeviceProtection', 'TechSupport',
                                       'StreamingTV', 'StreamingMovies', 'Contract',
                                       'PaperlessBilling', 'PaymentMethod'])

X_encoded.head()

y_encoded.unique()

array([0, 1], dtype=uint8)

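Tip: why 0 works well as the default value with one-hot encoding: a row with no value gets 0 in every one-hot position of that feature, so the 0 does not interfere with the categories that have actual values.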
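
The equal-distance property of one-hot encoding can be checked on a toy column. This is a minimal sketch; the three-row dataframe is made up, and dtype=int is passed only to get 0/1 integers regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Toy column with three categories, invented for this example
toy = pd.DataFrame({"Contract": ["Month-to-month", "One_year", "Two_year"]})
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded)

# Euclidean distance between every pair of one-hot rows
vectors = encoded.to_numpy()
distances = [
    float(np.linalg.norm(vectors[i] - vectors[j]))
    for i in range(len(vectors))
    for j in range(i + 1, len(vectors))
]
print(distances)
```

Every pair of categories ends up the same distance apart (the square root of 2), whereas integer codes 0, 1, 2 would place the first and last category twice as far apart as neighbours.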

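Build the XGBoost model

First, split the data into a training set and a test set. The split should be stratified on the target, so that the class proportions in each part stay as close as possible to those in the full data: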

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, random_state=42, stratify=y_encoded)

print(len(X_train), len(X_test))

5282 1761
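
A short sketch of what stratify does, on synthetic data. The array sizes and the 20% positive rate are assumptions chosen so the numbers come out round. With the default test_size of 0.25, the split sizes follow the same 75/25 pattern as the 5282/1761 above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, invented for this example: 1000 rows, 20% positives
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 200 + [0] * 800)

# stratify=y keeps the share of positives the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

print(len(X_tr), len(X_te))
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without stratify, the share of positives in the test set would vary with the random seed, which matters most when one class is rare.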

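Next, build the XGBoost model.

Early stopping: if 10 consecutive trees bring no improvement on the evaluation set, stop training.

# Note: XGBoost 2.0+ expects early_stopping_rounds and eval_metric in the
# XGBClassifier constructor rather than in fit(); the call below matches the older API
clf_xgb = xgboost.XGBClassifier(objective='binary:logistic', missing=None, seed=42)
clf_xgb.fit(X_train, y_train, verbose=True, early_stopping_rounds=10,
            eval_metric='aucpr', eval_set=[(X_test, y_test)])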

C:\Users\henryalps\Anaconda3\envs\xeus-python\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)

[0]	validation_0-aucpr:0.63160
[1]	validation_0-aucpr:0.64538
[2]	validation_0-aucpr:0.64402
[3]	validation_0-aucpr:0.64429
[4]	validation_0-aucpr:0.63866
[5]	validation_0-aucpr:0.63419
[6]	validation_0-aucpr:0.63912
[7]	validation_0-aucpr:0.63787
[8]	validation_0-aucpr:0.63779
[9]	validation_0-aucpr:0.63894
[10]	validation_0-aucpr:0.63680

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)
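
The early-stopping rule can be sketched in plain Python using the aucpr values from the log above. The function early_stop is a hypothetical helper written for this note, not XGBoost's implementation, and the exact iteration at which XGBoost stops may differ by one.

```python
def early_stop(scores, patience):
    """Return (best_iteration, stop_iteration) for a metric where higher is better.

    stop_iteration is None if the patience never ran out.
    """
    best_iter, best_score = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score > best_score:
            best_iter, best_score = i, score   # new best, reset the wait
        elif i - best_iter >= patience:
            return best_iter, i                # waited long enough, stop
    return best_iter, None

# aucpr values copied from the training log
aucpr = [0.63160, 0.64538, 0.64402, 0.64429, 0.63866, 0.63419,
         0.63912, 0.63787, 0.63779, 0.63894, 0.63680]

result = early_stop(aucpr, patience=3)
print(result)
```

With a patience of 3 the sketch reports iteration 1 as the best and stops at iteration 4. The best score in the log is indeed reached at iteration 1, and nothing after it improves on it.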

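Finally, plot the confusion matrix:

%matplotlib inline
# plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions use
# ConfusionMatrixDisplay.from_estimator(clf_xgb, X_test, y_test, ...) instead
plot_confusion_matrix(clf_xgb, X_test, y_test, values_format='d', display_labels=['Did not leave','left'])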

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x27d32087f70>

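Optimize the XGBoost model

The model does not classify the "left" class well, yet customers who leave matter more, so the model needs to be tuned. This can be done by combining cross-validation with grid search.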

# GridSearch
param_grid = {
    'max_depth': [3,4,5],
    'learning_rate': [0.1,0.01,0.05],
    'gamma': [0, 0.25, 1.0],
    'reg_lambda': [0, 1.0, 10.0],
    'scale_pos_weight': [1,3,5]
}

# subsample sets the fraction of the training rows sampled for each tree
# colsample_bytree sets the fraction of the features sampled for each tree
# cv sets the number of cross-validation folds
optimal_params = GridSearchCV(estimator=xgboost.XGBClassifier(objective='binary:logistic',
                                                             seed=42,
                                                             subsample=0.9,
                                                             colsample_bytree=0.5),
                             param_grid=param_grid,
                             scoring='roc_auc',
                             cv=3)
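
Before running the search it helps to know how much work it implies. The following sketch counts the candidate settings in the grid above using only the standard library; cv_folds mirrors the cv=3 argument.

```python
from itertools import product

param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'gamma': [0, 0.25, 1.0],
    'reg_lambda': [0, 1.0, 10.0],
    'scale_pos_weight': [1, 3, 5],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate model
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

Five parameters with three values each give 243 candidates and 729 model fits, plus one final refit on the full training data if GridSearchCV's default refit=True is kept. Narrowing the grid in stages is a common way to keep this manageable.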

Output of X_encoded.head():
	tenure	MonthlyCharges	TotalCharges	gender_Female	gender_Male	Partner_No	Partner_Yes	Dependents_No	...	Contract_Month-to-month	Contract_One_year	PaperlessBilling_No	PaperlessBilling_Yes	PaymentMethod_Bank_transfer_(automatic)	PaymentMethod_Electronic_check	PaymentMethod_Mailed_check
0	1	29.85	29.85	1	0	0	1	1	...	1	0	0	1	0	1	0
1	34	56.95	1889.50	0	1	1	0	1	...	0	1	1	0	0	0	1
2	2	53.85	108.15	0	1	1	0	1	...	1	0	0	1	0	0	1
3	45	42.30	1840.75	0	1	1	0	1	...	0	1	1	0	1	0	0
4	2	70.70	151.65	1	0	1	0	1	...	1	0	0	1	0	1	0

Output of the first df.head(), right after reading the CSV:
	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	...	Yes	No	No	No	One year	No	Mailed check	56.95	1889.5	No
2	3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	...	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	...	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

Output of df.head() after replacing spaces in MultipleLines:
	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	1	No	No_phone_service	DSL	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	...	Yes	No	No	No	One year	No	Mailed check	56.95	1889.5	No
2	3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	...	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	No	No	45	No	No_phone_service	DSL	Yes	...	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

Output of df.loc[(df['TotalCharges'] == ' ')], the rows with a blank TotalCharges:
	gender	Partner	Dependents	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	Churn
488	Female	Yes	Yes	No	No_phone_service	DSL	Yes	No	Yes	Yes	Yes	No	Two year	Yes	Bank transfer (automatic)	52.55	No
753	Male	No	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.25	No
936	Female	Yes	Yes	Yes	No	DSL	Yes	Yes	Yes	No	Yes	Yes	Two year	No	Mailed check	80.85	No
1082	Male	Yes	Yes	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.75	No
1340	Female	Yes	Yes	No	No_phone_service	DSL	Yes	Yes	Yes	Yes	Yes	No	Two year	No	Credit card (automatic)	56.05	No
3331	Male	Yes	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	19.85	No
3826	Male	Yes	Yes	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.35	No
4380	Female	Yes	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.00	No
5218	Male	Yes	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	One year	Yes	Mailed check	19.70	No
6670	Female	Yes	Yes	Yes	Yes	DSL	No	Yes	Yes	Yes	Yes	No	Two year	No	Mailed check	73.35	No
6754	Male	No	Yes	Yes	Yes	DSL	Yes	Yes	No	Yes	No	No	Two year	Yes	Bank transfer (automatic)	61.90	No