How to win a data science competition
Course overview
What you will learn
- how to preprocess the data
- extract features
- how to set up the validation correctly
- optimize the given metric
- A truly unique opportunity to see the detailed explanations of the winning solutions.
Course schedule
week 2 basic pipeline
- EDA
- Validation
- Data leaks
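week 3 improve model
- Metrics (evaluation criteria)
- Mean encoding
If a feature is categorical and takes very many distinct values (high cardinality), mean encoding is an efficient way to encode it. In practice, this kind of feature engineering can greatly improve model performance.
week 4 improve model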
- Advanced features
- Hyperparameter optimization
- Ensembles
Competition Mechanics
- Data
- Model
- produce the best prediction
- reproducible
- Submission
- Evaluation
- scoring formula
- Test set
- Leaderboard
Typical workflow:
1. analyze the data
2. fit a model
3. submit
4. check the public score
5. go back to step 1
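Why take part in competitions
- great opportunity for learning
- gaining a view of the whole field
- joining the community
- earning money
- this should not be the primary goal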
How competitions differ from real-world problems
Real-world problems
- understand the business
- formalize the problem
- collect the data
- clean the data
- build models
- evaluate the models
- deploy the models
Competitions
- clean the data
- build models
Recap of main ML algorithms
Linear model
- Drawback:
- many cases cannot be separated by a single straight line
Tree-based
The basic principle is divide and conquer.
First split the data with one rule, then split each part again with another rule.
- very effective for tabular data
Drawback:
- linear dependencies are hard to capture, because that takes too many splits
K-NN
K-nearest neighbors
Points that are close to each other tend to have similar labels.
Neural Networks
Black box
Notes
- No single algorithm is better than all the others.
- A single simple algorithm is not enough to win a competition.
Conclusion:
[Figure: summary slide, image not available in the source]
Exploratory data analysis
Exploratory Data Analysis: what and why?
EDA helps to
- understand the data better
- build intuition about the data
- generate hypotheses
- find underlying patterns
Understand the data
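- what each column represents
- whether the values make sense
- check for anomalies in the data
- anomalous rows do not have to be deleted; adding a column that flags them and letting the model learn from it works well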
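The flagging idea in the last point can be sketched as follows. This is a minimal illustration, not the course's code: the column name `age` and the valid range 0 to 120 are assumptions made up for the example.

```python
import pandas as pd

# Toy data; the column name "age" and the valid range 0-120 are assumptions.
df = pd.DataFrame({"age": [25, 31, -1, 47, 999]})

# Keep the suspicious rows, but mark them in an extra indicator column
# so that the model can learn how to treat them.
df["age_is_anomaly"] = ((df["age"] < 0) | (df["age"] > 120)).astype(int)

print(df)
```

The original values are left untouched; only the indicator column is added.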
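Exploring anonymized data
- anonymized features are encoded or obfuscated, but they keep the properties of the original data; for example, a linear relationship stays linear
- a linear transformation can be reverse-engineered with a few tricks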
Visualization data
- Histograms
- Caveats
- watch the binning: a value that looks like 0 may not actually be 0
- never draw a conclusion from a single plot
- Spotting problems
- possibly the missing values were filled with the mean
- Creative plots with plt.plot
- horizontal lines indicate many exactly identical values
- color the points by class
- plot the outliers
- Summary statistics
- describe
- Scatter plots
- How to use them
- plot one feature against another
- for a regression problem, point size can represent the target value
- can be used to check whether the test and training data follow the same distribution
- Another application
- How to use it
- for a tree-based model, create a new feature: the difference or ratio between X1 and X2
- create a new feature: which triangle a new data point falls into
- Correlation matrix
- shows how many meaningful feature combinations the data contains
- the matshow function draws this plot
- then run k-means on the matrix and reorder the features accordingly
- Result
Sorting the features by their mean value can suggest new features.
Calling the methods above
- histogram
plt.hist(x)
- plot
plt.plot(x, '.')
- statistics
df.describe()
- relationships between features
plt.scatter(x1, x2)
pd.plotting.scatter_matrix(df)
df.corr()
plt.matshow(df.corr())
df.mean().sort_values().plot(style=',')
Note on axis: axis=0 works down the rows and gives one result per column; axis=1 works across the columns and gives one result per row.
Remove columns that contain only one value
nunique counts the distinct values in each column
feats_counts = train.nunique(dropna = False)
feats_counts.sort_values()[:10]
constant_features = feats_counts.loc[feats_counts==1].index.tolist()
print (constant_features)
traintest.drop(constant_features,axis = 1,inplace=True)
Remove duplicated columns
Columns whose values are exactly identical
# maps each duplicated column to the first column it copies
dup_cols = {}
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1
traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)
Check how many distinct values each column has
nunique = train.nunique(dropna=False)
plt.figure(figsize=(14,6))
_ = plt.hist(nunique.astype(float)/train.shape[0], bins=100)
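Effect on the model
If the relationship between two features is linear,
- a neural network or a linear regression will pick it up,
- but a tree-based model will struggle to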
Validation and overfitting
Predicting unseen data
How competitions are set up
A different notion of overfitting in competitions
The test data in a competition may be of lower quality, so
overfitting in a competition means that the model performs worse on the test data than on the validation data
Validation strategies
- holdout
- K-fold
- Leave-one-out
Remember to stratify the splits when validating.
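The splitting schemes listed above map onto scikit-learn helpers roughly as follows. This is a small sketch on made-up data; the imbalanced toy target and the fold counts are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # imbalanced toy target (assumption)

# Holdout: a single split; each sample is used for training or validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# K-fold: every sample lands in the validation part exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_val_sizes = [len(val_idx) for _, val_idx in kf.split(X)]

# Stratified K-fold: each fold keeps the class ratio of the full data.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

print(kfold_val_sizes, positives_per_fold)
```

Leave-one-out is the special case of K-fold where the number of folds equals the number of samples; scikit-learn provides it as `LeaveOneOut`.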
Time-based validation
Id-based validation
- when validating, split the training users the same way, so that validation users do not also appear in the training part
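A possible way to reproduce time-based and id-based splits with scikit-learn is sketched below. The `user_id` array is hypothetical, and the rows are assumed to be already sorted by time.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

n = 12
X = np.arange(n).reshape(-1, 1)        # rows assumed to be sorted by time
user_id = np.repeat([0, 1, 2, 3], 3)   # hypothetical user ids, 3 rows each

# Time-based: validation rows always come after the training rows.
tss = TimeSeriesSplit(n_splits=3)
time_ok = all(tr.max() < val.min() for tr, val in tss.split(X))

# Id-based: a user never appears in both the training and validation part.
gkf = GroupKFold(n_splits=4)
group_ok = all(
    set(user_id[tr]).isdisjoint(user_id[val])
    for tr, val in gkf.split(X, groups=user_id)
)

print(time_ok, group_ok)
```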
Validation problems
- Validation Stage
- Submission stage
Data leakage
Leaderboard probing
Predicting from categorical values
A particular category value indicates that Y takes a specific value
Using one value to infer another
A remarkably sophisticated formula
Metric
Absolute error metrics
- MSE
- RMSE
- R-squared
- MAE
- gradient direction of MAE
- MAE vs MSE
Weighted (relative error) metrics
- MSPE
- MAPE
- RMSLE
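Since the formula images are missing, the standard definitions of these metrics can be written out in NumPy. This is a sketch using the usual textbook forms (MSPE and MAPE expressed in percent); the function names and the toy arrays are assumptions for illustration.

```python
import numpy as np

def mse(y, p):
    return np.mean((y - p) ** 2)

def rmse(y, p):
    return np.sqrt(mse(y, p))

def r2(y, p):
    # 1 minus the ratio of model error to the error of a constant-mean baseline
    return 1.0 - np.sum((y - p) ** 2) / np.sum((y - np.mean(y)) ** 2)

def mae(y, p):
    return np.mean(np.abs(y - p))

def mspe(y, p):
    # squared error relative to the target value, in percent
    return 100.0 * np.mean(((y - p) / y) ** 2)

def mape(y, p):
    # absolute error relative to the target value, in percent
    return 100.0 * np.mean(np.abs((y - p) / y))

def rmsle(y, p):
    # RMSE computed in log(1 + x) space
    return np.sqrt(np.mean((np.log1p(p) - np.log1p(y)) ** 2))

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.0, 3.0, 2.0])
for fn in (mse, rmse, r2, mae, mspe, mape, rmsle):
    print(fn.__name__, fn(y_true, y_pred))
```

MSPE and MAPE divide by the target, so they are undefined when a target value is 0.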
Common optimization approaches
Loss vs metric
- The target metric is what we want to optimize
- The optimization loss is what the model actually optimizes
How to optimize a metric
- MSE and logloss can usually be used directly as the model's loss function
- MSPE, MAPE and RMSLE cannot
- for example, MSPE cannot be used directly in XGBoost
- A hand-written XGBoost loss function
- Early stopping
Regression metrics optimization
- Libraries that support MSE as the loss
- Libraries that support MAE as the loss
- MAE is also known as L1
- MSPE / MAPE
Add weights to the samples
- RMSLE
Transform the target values of the dataset
and do so before training
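The RMSLE trick can be sketched as: transform the target with log(1 + y), train with plain MSE, then transform predictions back. The example below uses `LinearRegression` only as a stand-in for any MSE-trained model, and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.exp(0.5 * X[:, 0] + rng.normal(scale=0.1, size=200))  # positive target

# Train on the transformed target z = log(y + 1) with the ordinary MSE loss.
z = np.log1p(y)
model = LinearRegression().fit(X, z)

# Transform the predictions back to the original scale.
pred = np.expm1(model.predict(X))

# RMSLE on the original scale equals RMSE in the transformed space.
rmsle = np.sqrt(np.mean((np.log1p(pred) - np.log1p(y)) ** 2))
mse_in_log_space = np.mean((model.predict(X) - z) ** 2)
print(rmsle ** 2, mse_in_log_space)
```

The two printed numbers agree, which is why minimizing MSE on the transformed target optimizes RMSLE.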
Mean encodings
- give a categorical feature a meaningful numeric encoding
- very useful for LightGBM
Add similar encodings
Mean encoding example
Regularization: ways to avoid overfitting
Using K-fold
Compute the encoding inside a K-fold loop, to check that it does not hold only locally
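K-fold regularization of a mean encoding might look like the sketch below. The column names `city` and `target`, the toy values, and the choice of 5 unshuffled folds are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": list("AABBBCCCCA"),
    "target": [1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
})
global_mean = df["target"].mean()
df["city_mean_enc"] = np.nan

kf = KFold(n_splits=5, shuffle=False)
for tr_idx, val_idx in kf.split(df):
    # Category means are estimated on the training part of the fold only ...
    fold_means = df.iloc[tr_idx].groupby("city")["target"].mean()
    # ... and mapped onto the held-out part, so no row sees its own target.
    df.loc[df.index[val_idx], "city_mean_enc"] = (
        df["city"].iloc[val_idx].map(fold_means).to_numpy()
    )

# Categories missing from a training part fall back to the global mean.
df["city_mean_enc"] = df["city_mean_enc"].fillna(global_mean)
print(df)
```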
Smoothing
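A common form of smoothing blends the category mean with the global mean, weighted by the category size: (mean * n + global_mean * alpha) / (n + alpha). The sketch below assumes this form; `alpha = 2.0` and the column names are made-up values, and in practice smoothing is usually combined with a cross-validation loop rather than used alone.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B"],
    "target": [1, 1, 0, 1, 1],
})
alpha = 2.0  # smoothing strength; an assumed value, to be tuned on validation
global_mean = df["target"].mean()

# Per-category mean and row count.
stats = df.groupby("city")["target"].agg(["mean", "count"])

# Rare categories are pulled towards the global mean, frequent ones barely move.
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (
    stats["count"] + alpha
)
df["city_smooth_enc"] = df["city"].map(smoothed)
print(smoothed)
```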
Noise
Adding noise degrades the quality of the encoding on the training data; this could be used in my own project
Expanding mean
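An expanding mean encodes each row using only the targets of the rows that come before it. A minimal pandas sketch, with assumed column names and toy values:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", "A", "B"],
    "target": [1, 0, 0, 1, 1],
})
global_mean = df["target"].mean()

grouped = df.groupby("city")["target"]
cumsum_before = grouped.cumsum() - df["target"]  # sum of earlier targets
count_before = grouped.cumcount()                # number of earlier rows

# The first row of each category has no history (0 / 0 gives NaN),
# so it falls back to the global mean.
df["city_exp_enc"] = (cumsum_before / count_before).fillna(global_mean)
print(df)
```

The result depends on the row order, so the data is usually shuffled or sorted by time first.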
Extensions and generalizations
For regression, the following statistics can be extracted
- median,
- percentiles,
- std,
- features describing the distribution
All of them need regularization
Extract statistics over groups of the data
- for example, aggregate by time,
- or aggregate over the previous few days
Aggregate using relationships between data
- How to choose what to group on
- if a feature appears in many subtrees, it is a good candidate for grouping
- for example, feature 1 and feature 2 appear in many subtrees
features = df.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-10:]  # top 10 features
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
- How to merge data effectively
Order of tuning the model
Mean encoding
Hyperparameter optimization
Find the most influential parameters
Look up the documentation online
- the documentation says which parameters to tune first
- and what each parameter means
Libraries that can search for the best parameters automatically
An example of automated parameter tuning
How parameters affect the model
- underfitting
cannot learn the training set
- good fit
- overfitting
Tuning tree-based models
Model | Where |
---|---|
GBDT | XGBoost, LightGBM |
RandomForest, ExtraTrees | scikit-learn |
others | RGF |
XGBoost / LightGBM
- max_depth
- deeper trees help the model build useful feature interactions
- a depth of about 7 is usually enough
- subsample / bagging_fraction
- fraction of rows sampled for each tree; a smaller fraction makes overfitting less likely
- colsample_bytree / colsample_bylevel / feature_fraction
- min_child_weight / min_data_in_leaf
- the most important
- eta / learning_rate, num_round / num_iterations: learning rate and number of boosting rounds
- early stopping can be used: stop training once the validation loss starts to rise
RandomForest / ExtraTrees
- n_estimators (the more the better)
- number of decision trees
- start with about 10, increase gradually, and watch the effect on the metric
- max_depth
- maximum depth of each tree
- max_features
- how many features are considered at each split
- min_samples_leaf
- n_jobs
- number of parallel jobs
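The advice on n_estimators (start small, add trees, watch the metric) can be tried with scikit-learn's `warm_start` option, which keeps the trees already built. The dataset, the tree counts and the use of MSE as the metric are assumptions for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# warm_start=True reuses the existing trees and only fits the new ones.
rf = RandomForestRegressor(
    n_estimators=10, warm_start=True, n_jobs=-1, random_state=0
)
scores = {}
for n in [10, 20, 40, 80]:
    rf.set_params(n_estimators=n)
    rf.fit(X_tr, y_tr)
    scores[n] = mean_squared_error(y_val, rf.predict(X_val))

print(scores)
```

Once the validation score stops improving, adding more trees only costs training time.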
Neural networks
- PyTorch and Keras
- number of neurons per layer
- number of layers
- batch size
- number of samples used in each training step
- learning rate
- must be chosen appropriately
Linear models
SVM needs almost no parameter tuning
- Regularization
- L1
- L2
- L1 can be used for feature selection
- particularly useful when GBDT or neural network training takes a long time
At submission time, combining the same model trained with different parameters is very effective
Statistics and distance based features
Groupby features
- per user, compute the lowest and the highest price
- per page, compute the position of the lowest-priced item
- come up with as many features as possible
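The two groupby features described above could be built in pandas along these lines. The column names `user_id`, `page_id`, `ad_price` and `ad_position` are hypothetical, as are the toy values.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "page_id": [10, 10, 11, 20, 20],
    "ad_price": [5.0, 3.0, 8.0, 2.0, 4.0],
    "ad_position": [1, 2, 1, 1, 2],
})

# Per-user minimum and maximum price, broadcast back to every row.
df["user_min_price"] = df.groupby("user_id")["ad_price"].transform("min")
df["user_max_price"] = df.groupby("user_id")["ad_price"].transform("max")

# Per-page position of the cheapest ad.
cheapest_rows = df.groupby("page_id")["ad_price"].idxmin()
min_price_pos = df.loc[cheapest_rows.to_numpy()].set_index("page_id")["ad_position"]
df["page_min_price_position"] = df["page_id"].map(min_price_pos)

print(df)
```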
Neighbors
Applied to my own project:
- max and min of IT energy within a given neighborhood
- max and min of humidity within a given neighborhood
Matrix Factorization / dimensionality reduction
- Concrete methods for dimensionality reduction
- PCA can help turn categorical features into real-valued ones
Building feature interactions
- Option 1: concatenate first, then one-hot encode
- Option 2: one-hot encode first, then combine
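Both options can be sketched with pandas. The feature names `pclass` and `sex` and their values are assumptions used only to make the example concrete.

```python
import pandas as pd

df = pd.DataFrame({
    "pclass": [3, 1, 3, 1],
    "sex": ["male", "female", "female", "male"],
})

# Option 1: concatenate the two categories into one, then one-hot encode it.
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]
onehot_combined = pd.get_dummies(df["pclass_sex"])

# Option 2: one-hot encode each feature, then multiply the columns pairwise.
oh_pclass = pd.get_dummies(df["pclass"], prefix="pclass").astype(int)
oh_sex = pd.get_dummies(df["sex"], prefix="sex").astype(int)
pairwise = pd.DataFrame({
    f"{a}*{b}": oh_pclass[a] * oh_sex[b]
    for a in oh_pclass.columns
    for b in oh_sex.columns
})

print(onehot_combined)
print(pairwise)
```

Both routes end with one indicator column per combination of the two categories.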
A numeric example
Common ways to combine numeric features
- multiplication
- addition
- subtraction
- division
This approach works very well for tree-based algorithms
Procedure
I do not fully understand this yet; worth trying
t-SNE
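- can be used for visualization
- the result can be used as a feature (similar to the output of a classifier)
- the perplexity parameter matters a lot
- interpret the results with care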
Ensembling
What is ensembling
- combining different machine learning models to get a better prediction
Average
- simply average two models that perform well in different regions
Weighted Average
- give different models different weights
Conditional averaging
- use model 1 under some conditions and model 2 under others
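The three averaging schemes can be compared on a toy case where each model is accurate on only one half of the input range. The data, the threshold of 50 and the weights 0.7 / 0.3 are made-up values for illustration.

```python
import numpy as np

x = np.array([10, 30, 50, 70, 90])               # a single input feature
y_true = np.array([10, 30, 50, 70, 90])
pred_1 = np.where(x < 50, y_true, y_true + 20)   # model 1: accurate when x < 50
pred_2 = np.where(x >= 50, y_true, y_true - 20)  # model 2: accurate when x >= 50

simple_avg = (pred_1 + pred_2) / 2
weighted_avg = 0.7 * pred_1 + 0.3 * pred_2       # weights are assumed values
conditional = np.where(x < 50, pred_1, pred_2)   # pick the better model per region

for name, p in [("simple", simple_avg), ("weighted", weighted_avg),
                ("conditional", conditional)]:
    print(name, np.mean((y_true - p) ** 2))
```

In this constructed case the conditional combination recovers the target exactly, while plain averaging only halves each model's error.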
Bagging
Average many slightly different versions of the same model to make the prediction
Example: random forest
Why use bagging
- Errors due to bias (underfitting)
- Errors due to variance (overfitting)
Important bagging parameters
- seed
- how different the models are from each other
- row sampling
- shuffling
- column sampling
- model-specific parameters
- number of models
- parallelism
A hand-written bagging example
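Since the original code image is missing, here is one way a hand-written bagging loop could look. It is a sketch under assumptions (decision trees as base model, 10 bags, bootstrap row sampling), not a reconstruction of the course's code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

n_bags = 10      # number of models to average (assumed value)
base_seed = 1    # every bag gets its own seed
bagged = np.zeros(X_test.shape[0])

for i in range(n_bags):
    rng = np.random.RandomState(base_seed + i)
    # Row sampling with replacement (bootstrap).
    idx = rng.randint(0, X_train.shape[0], size=X_train.shape[0])
    model = DecisionTreeRegressor(random_state=base_seed + i)
    model.fit(X_train[idx], y_train[idx])
    bagged += model.predict(X_test)

bagged /= n_bags  # average the predictions of all bags
print(bagged[:5])
```

The models are independent of each other, so the loop can also be run in parallel.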
Boosting
What is boosting
- a way of combining weighted models sequentially: each model is built taking into account how the previous models performed
Main types of boosting
Weight based
How it works:
The errors of the current predictions are passed on for the next model to learn
Important parameters
Residual based
The most important family of models; used in almost every competition
Steps:
- fit a first model
- compute the error it leaves
- fit the next model to predict that error
- the final prediction is the sum of all models
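The residual-based steps above can be written as a short loop. This is a bare-bones sketch for squared error with shallow decision trees; the learning rate, tree depth, number of rounds and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5   # shrinkage (eta); assumed value
n_rounds = 20
prediction = np.zeros_like(y)
errors = []

for _ in range(n_rounds):
    residual = y - prediction                      # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                          # next model predicts the error
    prediction += learning_rate * tree.predict(X)  # running sum of all models
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])
```

Libraries such as XGBoost and LightGBM generalize this idea to other losses and add regularization on top.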
Well-known residual-based boosting implementations
- XGBoost
- LightGBM
- H2O
- CatBoost
- its advantage is that little time is needed for tuning
- scikit-learn
- any scikit-learn model can serve as the base model
Stacking
How it works
Different models perform differently in different regions of the data. A meta model is trained on the predictions of the base models and learns how much weight to give each of them.
An example
Caveats
- for time series problems, the split must not be random
- the base models should be as diverse as possible
- diversity comes from
- different algorithms
- different features
- the meta model can be kept as simple as possible
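A simple holdout variant of stacking might be sketched like this: the base models learn on one half of the training data, and the meta model learns on their predictions for the other half. The choice of `Ridge` and `RandomForestRegressor` as base models and `LinearRegression` as meta model is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
# Split the training data in two: base models use part A, the meta model part B.
X_a, X_b, y_a, y_b = train_test_split(
    X_trainval, y_trainval, test_size=0.5, random_state=0
)

base_models = [Ridge(alpha=1.0),
               RandomForestRegressor(n_estimators=50, random_state=0)]
for m in base_models:
    m.fit(X_a, y_a)

# Predictions of the base models become the features of the meta model.
meta_features_b = np.column_stack([m.predict(X_b) for m in base_models])
meta_features_test = np.column_stack([m.predict(X_test) for m in base_models])

meta_model = LinearRegression().fit(meta_features_b, y_b)
stacked_pred = meta_model.predict(meta_features_test)
print(meta_model.coef_)
```

For time series data the two random splits would have to be replaced by splits in time order, in line with the first caveat.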
StackNet
Unlike plain stacking, the meta model is a neural network
Real Example
Stacking Example