XGBoost Learning Notes
Notes taken while following the StatQuest tutorial: https://www.youtube.com/watch?v=GrJP9FLV3FE
- Background
- Introduction
- Confusion Matrix
- Import the data
- Missing data
- XGBoost data format
- Build the XGBoost model
- Optimize the XGBoost model
# install needed modules (the PyPI package name for sklearn is scikit-learn)
!pip install pandas numpy scikit-learn xgboost fsspec matplotlib ipympl
import pandas as pd
import numpy as np
import xgboost
from sklearn.model_selection import train_test_split, GridSearchCV
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer, confusion_matrix, ConfusionMatrixDisplay
# download the data
!wget 'https://raw.githubusercontent.com/bhargitay/telco_customer_churn/master/data/WA_Fn-UseC_-Telco-Customer-Churn.csv'
# read the downloaded file with pandas (wget saves it to the current directory)
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
# show the dataframe
df.head()
# drop the ID column, which carries no predictive signal
# axis=1 -> operate on columns; inplace=True -> update the dataframe in place
df.drop(['customerID'], axis=1, inplace=True)
# inspect the unique values of a column to spot bad entries (repeat for each object column)
df['gender'].unique()
# replace spaces with underscores in values and in column names
df['MultipleLines'] = df['MultipleLines'].replace(' ', '_', regex=True)
df.columns = df.columns.str.replace(' ', '_')
df.head()
# replace every remaining space in the data with an underscore
df.replace(' ', '_', regex=True, inplace=True)
df.dtypes
For every column of dtype object, go through the values item by item and check that they make sense. Once a problematic column and its suspect values have been identified, locate where those values occur.
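A quick way to run this check over every column at once, as a minimal sketch using pandas' select_dtypes:
# print the unique values of every object-typed column
# so that suspicious entries such as blanks stand out
for col in df.select_dtypes(include='object').columns:
    print(col, df[col].unique())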
# locate the offending rows
# (note: the global replace above already turned the blank ' ' entries into '_')
df.loc[df['TotalCharges'] == '_']
# set the blank entries to 0
df.loc[df['TotalCharges'] == '_', 'TotalCharges'] = 0
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])
df.dtypes
# features: every column except the target
X = df.drop('Churn', axis=1).copy()
X.head()
# target: whether the customer churned
y = df['Churn'].copy()
y.head()
Next, convert the categorical features to one-hot encoding. The reason for one-hot encoding is that it keeps all category values equidistant from one another, which stops the model from treating arbitrarily ordered labels as if some categories were closer together than others. Use get_dummies to perform the conversion.
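As a toy illustration (the demo frame here is hypothetical, not part of the churn data), get_dummies expands one categorical column into one 0/1 column per category, so every pair of categories ends up equally far apart:
# hypothetical example: three contract types become three 0/1 columns
demo = pd.DataFrame({'Contract': ['Month-to-month', 'One_year', 'Two_year']})
print(pd.get_dummies(demo, columns=['Contract']))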
# encode the target as 0/1 by keeping the 'Yes' dummy column
y_encoded = pd.get_dummies(y)['Yes']
# one-hot encode all categorical feature columns
X_encoded = pd.get_dummies(X, columns=['gender','Partner','Dependents','PhoneService','MultipleLines','InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract','PaperlessBilling','PaymentMethod'])
X_encoded.head()
y_encoded.unique()
Tip: an advantage of using 0 as the default value under one-hot encoding is that a default maps to an all-zero encoding, so it does not interfere with the entries that carry an actual value.
# stratify keeps the churn ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, random_state=42, stratify=y_encoded)
print(len(X_train), len(X_test))
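A quick sanity check (a sketch) that stratification worked: the churn rate should be nearly identical in both splits.
# with stratify=y_encoded, the positive-class proportion is preserved
print(y_train.sum() / len(y_train), y_test.sum() / len(y_test))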
Next, build the XGBoost model.
- early stopping: if 10 consecutive trees bring no improvement on the evaluation metric, stop training
clf_xgb = xgboost.XGBClassifier(objective='binary:logistic',
                                missing=np.nan,  # marker for missing values (the default; older examples pass missing=None)
                                seed=42)
clf_xgb.fit(X_train, y_train,
            verbose=True,
            early_stopping_rounds=10,  # stop after 10 rounds without improvement
            eval_metric='aucpr',
            eval_set=[(X_test, y_test)])
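Note that in XGBoost 2.0 and later, eval_metric and early_stopping_rounds are no longer accepted by fit() and move to the constructor instead; a sketch of the equivalent call, assuming a 2.x installation:
# XGBoost >= 2.0: early stopping is configured on the estimator itself
clf_xgb = xgboost.XGBClassifier(objective='binary:logistic',
                                seed=42,
                                eval_metric='aucpr',
                                early_stopping_rounds=10)
clf_xgb.fit(X_train, y_train, verbose=True, eval_set=[(X_test, y_test)])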
Finally, plot the confusion matrix:
%matplotlib inline
# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead
ConfusionMatrixDisplay.from_estimator(clf_xgb, X_test, y_test, values_format='d', display_labels=['Did not leave', 'Left'])
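To read the same numbers without a plot, a sketch using the confusion_matrix function imported earlier:
# raw counts; rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, clf_xgb.predict(X_test)))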
The model does not classify the customers who left very well, yet those are exactly the customers that matter most, so the model needs tuning. The tuning combines cross-validation with a grid search over the hyperparameters.
# grid of candidate hyperparameter values for the search
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'gamma': [0, 0.25, 1.0],
    'reg_lambda': [0, 1.0, 10.0],
    'scale_pos_weight': [1, 3, 5]
}
# subsample: the fraction of the training rows each tree is built on
# colsample_bytree: the fraction of the feature columns each tree may use
# cv: the number of cross-validation folds
optimal_params = GridSearchCV(estimator=xgboost.XGBClassifier(objective='binary:logistic',
                                                              seed=42,
                                                              subsample=0.9,
                                                              colsample_bytree=0.5),
                              param_grid=param_grid,
                              scoring='roc_auc',
                              cv=3)
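The snippet above only constructs the search. To actually run it, call fit on the training data and inspect the winning combination; a sketch (note that 3^5 = 243 parameter combinations times 3 folds means 729 model fits, so this is slow):
# run the grid search and report the best hyperparameters
optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
print(optimal_params.best_score_)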