Coursera, How to win a competition: course notes


How to win a data science competition

Course overview

What you will learn

how to preprocess the data
extract features
how to set up the validation correctly
optimize the given metric
A truly unique opportunity to see the detailed explanations of the winning solutions.

Course schedule

week 2 basic pipeline

EDA
Validation
Data leaks

week 3 improve model

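Metrics (evaluation criteria)
Mean encoding

When a feature is categorical and has a very large number of possible values (high cardinality), mean encoding is an efficient way to encode it. In practice, this kind of feature engineering can substantially improve model performance.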
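
A minimal sketch of the basic idea, assuming a toy DataFrame with a hypothetical categorical column `city` and a binary `target` (neither name comes from the course). Each category is replaced by the mean target observed for it. This plain version is computed on the whole training set, so it is prone to overfitting and normally needs regularization.

```python
import pandas as pd

# Hypothetical toy data: "city" is a categorical feature, "target" is a binary label.
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1,   0,   1,   1,   0,   1],
})

# Mean encoding: replace each category by the mean target observed for that category.
city_means = df.groupby("city")["target"].mean()
df["city_mean_enc"] = df["city"].map(city_means)

print(df)
```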

week 4 improve model

Advanced features
Hyperparameter optimization
Ensembles

Competition Mechanics

Data
Model
- produce the best prediction
- reproducible
Submission
Evaluation
- metric formula
- test set
Leaderboard

Typical workflow:

1. analyze data
2. fit model
3. submit
4. see public score
5. go back to step 1

Why take part in competitions

great opportunity for learning
seeing the big picture
joining a community
prize money
- should not be the primary goal

Differences between competitions and real-world problems

Real-world problems

understand the business
formalize the problem
collect data
clean the data
build models
evaluate models
deploy models

Competitions

clean the data
build models

Recap of main ML algorithms

Linear model

Drawbacks:
- many cases cannot be separated by a single straight line

Tree-based

The basic principle is divide and conquer: apply one split to partition the data, then apply another split within each part.

Very effective on tabular data.

Drawbacks:

It is hard to capture linear dependencies, because that takes too many splits.

K-NN

K-nearest neighbors

Points that are close to each other tend to have similar labels.

Neural Networks

Black box

Notes

No single algorithm is better than all the others.
A single simple algorithm alone will not win a competition.

Conclusion:

Exploratory data analysis

Exploratory Data Analysis: what and why?

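EDA helps you

understand the data better
build intuition about the data
generate hypotheses
find underlying patterns

Understand the data

what each column represents
whether the values make sense
check the data for anomalies
- anomalous rows do not have to be dropped; adding a flag column that marks them and letting the model learn from it works well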
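
A small sketch of the flag-column idea. The column name `age` and the validity range are assumptions made up for this toy example; the point is only that the suspicious rows are kept and marked.

```python
import pandas as pd

# Hypothetical data with two clearly impossible ages (999 and -1).
df = pd.DataFrame({"age": [23, 35, 41, 999, 29, -1]})

# Assumed rule for this toy example: ages outside [0, 120] are anomalous.
# The rows are kept; a binary flag lets the model learn from the anomaly itself.
df["age_is_anomaly"] = (~df["age"].between(0, 120)).astype(int)

print(df)
```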

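Exploring anonymized data

Anonymized features are encoded (obfuscated), but they keep the properties of the original data; for example, a linear relationship is still a linear relationship.
Some tricks can be used to decode linearly transformed features.

Data visualization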

Histograms
- Caveats
  - pay attention to the binning: values that look like 0 may not be exactly 0
  - never draw a conclusion from a single plot
- Spotting problems
  - a spike at a single value may mean that missing values were filled with the mean
Index-versus-value plots (a more creative way to look at the data)
- horizontal lines mean that many rows share exactly the same value
- color the points by class label
- plot the outliers
Statistics
- describe
Scatter plots
- How to use them
  - plot one feature against another
- for a regression problem, point size can encode the target value
- they can be used to check whether the train and test data follow the same distribution
- Another application
  - for tree-based models, create a new feature: the difference or ratio between X1 and X2
- create a new feature: which triangle a new point falls into
Correlation matrix
Shows how many meaningful feature combinations there are; useful for anonymized features.

The matshow function draws this plot.
- then run k-means on the matrix and reorder the features by cluster
Result

Plotting the mean of each feature in sorted order can suggest new features.

How to call the methods above

Histogram

plt.hist(x)

Index-versus-value plot

plt.plot(x, '.')

Statistics

df.describe()

Relationships between features

plt.scatter(x1, x2)

pd.plotting.scatter_matrix(df)

df.corr()

plt.matshow(df.corr())

df.mean().sort_values().plot(style=',')

Note on axis: axis=0 runs an operation down the rows (one result per column); axis=1 runs it across the columns (one result per row).
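
To make the `axis` argument concrete, here is a tiny sketch on a made-up two-column DataFrame (the column names `a` and `b` are placeholders).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: aggregate down the rows, giving one value per column.
col_means = df.mean(axis=0)

# axis=1: aggregate across the columns, giving one value per row.
row_sums = df.sum(axis=1)

# For drop, axis=1 means "the labels refer to columns".
without_b = df.drop("b", axis=1)

print(col_means, row_sums, without_b, sep="\n")
```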

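Dropping columns with a single value

nunique counts the number of distinct values in each column

# Count distinct values per column, treating NaN as a value of its own
feats_counts = train.nunique(dropna=False)
feats_counts.sort_values()[:10]
# Columns with exactly one distinct value are constant and carry no information
constant_features = feats_counts.loc[feats_counts == 1].index.tolist()
print(constant_features)

# traintest is assumed to be train and test concatenated
traintest.drop(constant_features, axis=1, inplace=True)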

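Dropping duplicated columns

Columns whose values are exactly identical

import numpy as np
from tqdm.notebook import tqdm

# train_enc is assumed to be a label-encoded copy of train
dup_cols = {}  # maps each duplicated column to the first column it copies

for i, c1 in enumerate(tqdm(train_enc.columns)):
    for c2 in train_enc.columns[i + 1:]:
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

traintest.drop(list(dup_cols.keys()), axis=1, inplace=True)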

Checking how many distinct values each column has

nunique = train.nunique(dropna=False)
plt.figure(figsize=(14, 6))
# Histogram of the fraction of distinct values per column
_ = plt.hist(nunique.astype(float) / train.shape[0], bins=100)

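Effect on the model

If the relationship between two features is linear, a neural network or a linear regression will pick it up, but a tree-based model will not.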

Validation and overfitting

Predicting unseen data

How competitions are set up

A different definition of overfitting in competitions

The test data in a competition may be of lower quality, so

in a competition, overfitting means that the model performs worse on the test data than on the validation data

Validation strategies

Holdout

Split the data once into a training part and a validation part.

K-fold

Split the data into K folds, train K times, each time validating on a different fold, and average the scores.

Leave-one-out

K-fold where K equals the number of samples.

Use stratification when splitting

Stratification keeps the target distribution similar in every fold.

Time-based validation

Train on earlier data and validate on later data.

ID-based validation

All rows that share an ID go to the same side of the split.

When building the validation split, make sure the users in train and validation are separated in the same way as between train and test.
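
The strategies above map onto scikit-learn splitters roughly as follows. This is a sketch on synthetic data; the arrays `X`, `y` and the grouping column `user_id` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit, train_test_split,
)

X = np.arange(24).reshape(12, 2)      # 12 toy samples, assumed to be in time order
y = np.array([0, 1] * 6)              # balanced binary target
user_id = np.repeat([0, 1, 2, 3], 3)  # hypothetical ID column, 3 rows per user

# Holdout: one split into train and validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# K-fold: every sample is used for validation exactly once.
kfold_splits = list(KFold(n_splits=3, shuffle=True, random_state=0).split(X))

# Stratified K-fold: class proportions are preserved in each fold.
strat_splits = list(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y)
)

# Time-based: validation indices always come after the training indices.
time_splits = list(TimeSeriesSplit(n_splits=3).split(X))

# ID-based: a given user never appears on both sides of a split.
group_splits = list(GroupKFold(n_splits=4).split(X, y, groups=user_id))
```

Which splitter to use should mirror how the organizers split train and test.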

Validation problems

Validation Stage
Submission stage


Data leakage

Leaderboard probing

Predicting from categorical values

A particular category value can reveal that the target takes one specific value.

One value can be used to infer another.

Note: the formula used here is remarkably sophisticated.

Metric

Unweighted (absolute) error metrics

MSE
RMSE

R-squared

Update direction (gradient) of MAE

MAE vs MSE

Weighted (relative) error metrics

MSPE
MAPE

RMSLE
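
Since the formula images did not survive, here is a NumPy sketch of the metrics listed above using their standard definitions. The function names are my own; MSPE and MAPE are expressed in percent and assume that no target value is zero.

```python
import numpy as np

def mse(y, p):
    return np.mean((y - p) ** 2)

def rmse(y, p):
    return np.sqrt(mse(y, p))

def mae(y, p):
    return np.mean(np.abs(y - p))

def r2(y, p):
    # 1 minus MSE of the model divided by MSE of the constant mean prediction.
    return 1.0 - mse(y, p) / np.mean((y - y.mean()) ** 2)

def mspe(y, p):
    # Mean squared percentage error: errors are scaled by the target value.
    return 100.0 * np.mean(((y - p) / y) ** 2)

def mape(y, p):
    # Mean absolute percentage error.
    return 100.0 * np.mean(np.abs((y - p) / y))

def rmsle(y, p):
    # RMSE computed in log(1 + x) space; needs y > -1 and p > -1.
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(p)) ** 2))

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.0, 2.0, 2.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```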

Common optimization approaches

Loss versus metric

The target metric is what we want to optimize.
The optimization loss is what the model actually optimizes.

Approaches to metric optimization

MSE and logloss can almost always be used directly as the model's loss function
but MSPE, MAPE and RMSLE cannot
- for example, MSPE cannot be used directly in XGBoost
A hand-written XGBoost loss function
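
The original screenshot is lost, so the following is only a sketch of what a hand-written objective computes, using logloss as the example. XGBoost custom objectives return the per-row first and second derivatives of the loss with respect to the raw prediction; here that math is written in plain NumPy and checked numerically. The function names are my own, and the wrapper that would read labels from the training matrix is left out.

```python
import numpy as np

def logloss_grad_hess(raw_preds, labels):
    """Gradient and hessian of logloss w.r.t. the raw (pre-sigmoid) score."""
    probs = 1.0 / (1.0 + np.exp(-raw_preds))
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess

def logloss_per_row(raw_preds, labels):
    probs = 1.0 / (1.0 + np.exp(-raw_preds))
    return -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))

raw = np.array([-1.0, 0.0, 2.0])
lab = np.array([0.0, 1.0, 1.0])
grad, hess = logloss_grad_hess(raw, lab)

# Central finite differences as a sanity check of the analytic gradient.
eps = 1e-5
num_grad = (logloss_per_row(raw + eps, lab) - logloss_per_row(raw - eps, lab)) / (2 * eps)
print(grad, num_grad)
```

The same pattern applies to any twice-differentiable loss: derive the gradient and hessian, then verify them numerically before plugging them into a library.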

Early stopping
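
A library-independent sketch of the early-stopping idea: optimize any loss the model supports, monitor the target metric on a validation set, and stop when it has not improved for a while. The helper `best_iteration` and its `patience` argument are hypothetical names for this illustration; real libraries ship their own implementations.

```python
def best_iteration(val_scores, patience=3):
    """Return the index of the best (lowest) validation score seen before stopping."""
    best_iter, best_score = 0, float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop training here.
            break
    return best_iter

# Validation metric per boosting round (made-up numbers).
scores = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6]
print(best_iteration(scores, patience=3))
```

In this toy run the improvement at the last round is never reached, which is the usual trade-off of a small patience value.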

Optimizing regression metrics

Libraries that support MSE as a loss

Libraries that support MAE as a loss
- MAE is also known as L1 loss

MSPE / MAPE

Weight the samples
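
A sketch of the sample-weight trick, assuming the usual weights: proportional to 1/y² for MSPE and 1/|y| for MAPE, normalized to sum to one. The toy check below uses the simplest possible model, a constant prediction, for which the weighted-MSE optimum is the weighted mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 10.0])

# Weights that turn weighted MSE into MSPE (up to a constant factor).
w_mspe = 1.0 / y ** 2
w_mspe /= w_mspe.sum()

# Weights that turn weighted MAE into MAPE (up to a constant factor).
w_mape = 1.0 / np.abs(y)
w_mape /= w_mape.sum()

def mspe(y_true, pred):
    return 100.0 * np.mean(((y_true - pred) / y_true) ** 2)

# Best constant under weighted MSE is the weighted mean of the targets.
best_const = np.sum(w_mspe * y)
print(best_const, mspe(y, best_const), mspe(y, y.mean()))
```

With a real model, the same weights would be passed through the `sample_weight` argument that many estimators accept in `fit`.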

RMSLE

The target values of the dataset have to be transformed

Transform them beforehand
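
A sketch of the target transformation, assuming the usual recipe: train on log(1 + y) with an MSE loss, then invert the predictions with exp(x) - 1. A least-squares line fitted with NumPy stands in for the model, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=200)
y = np.exp(1.5 * x + 0.1 * rng.normal(size=200))  # positive, skewed target

def rmsle(y_true, pred):
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(pred)) ** 2))

# 1. Transform the target beforehand.
z = np.log1p(y)

# 2. Fit any MSE-minimizing model on the transformed target (here: a line).
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)

# 3. Predict in log space and transform back.
pred = np.expm1(A @ coef)

print(rmsle(y, pred))
```

Minimizing MSE in the transformed space is exactly minimizing RMSLE in the original space, which is why no custom loss is needed.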

Mean encodings

Add meaningful information to the feature set

LightGBM is very useful here

Add similar encoded features

Mean encoding example

Regularization to avoid overfitting

K-fold

Compute the encoding inside K-fold splits, so that each row is encoded from data that excludes its own target; this checks that the feature is not only valid locally.
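
A sketch of K-fold regularized mean encoding on toy data (the columns `city` and `target` are placeholders). For every fold, category means are estimated on the other folds only; categories unseen in those folds fall back to the global mean.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city":   list("AABBBCAABC"),
    "target": [1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
})
global_mean = df["target"].mean()
df["city_enc"] = np.nan

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Category means are estimated without the rows being encoded.
    fold_means = df.iloc[fit_idx].groupby("city")["target"].mean()
    encoded = df["city"].iloc[enc_idx].map(fold_means)
    df.loc[df.index[enc_idx], "city_enc"] = encoded.values

# Categories missing from the fitting folds get the global mean.
df["city_enc"] = df["city_enc"].fillna(global_mean)
print(df)
```

Four or five folds are a common choice; the test set is then encoded with means computed on the full training set.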

Smoothing
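
A sketch of smoothing, assuming the common formula (mean × n + global_mean × alpha) / (n + alpha), where n is the category size and `alpha` is a regularization strength chosen by validation. Rare categories are pulled towards the global mean.

```python
import pandas as pd

df = pd.DataFrame({"city": list("AAAAB"), "target": [1, 1, 1, 0, 1]})

alpha = 2.0                        # assumed smoothing strength for this example
global_mean = df["target"].mean()  # 0.8

stats = df.groupby("city")["target"].agg(["mean", "count"])
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)

df["city_enc_smooth"] = df["city"].map(smoothed)
print(smoothed)
```

With alpha = 0 this reduces to the plain category mean; as alpha grows, every category converges to the global mean.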


Noise

Adding noise deliberately degrades the quality of the encoding on the train data; I could use this in my own project

Expanding mean
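
A sketch of the expanding mean on toy data, assuming the rows are already in the order to be used (for example, sorted by time). Each row is encoded with the mean target of the earlier rows of the same category, so its own target never leaks into its encoding.

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "A", "B", "A", "B"],
    "target": [1,   0,   1,   1,   0],
})
global_mean = df["target"].mean()  # 0.6

# Sum and count of the target over previous rows of the same category.
prev_sum = df.groupby("city")["target"].cumsum() - df["target"]
prev_cnt = df.groupby("city").cumcount()

# First occurrence of a category has no history (0 / 0 gives NaN): use the global mean.
df["city_enc_exp"] = (prev_sum / prev_cnt).fillna(global_mean)
print(df)
```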

Extensions and generalizations

For regression targets, the following statistics can be extracted

median,
percentiles,
std,
distribution-based features

All of them need regularization

Extracting features by groups

for example, aggregate by time,
or aggregate over the previous few days

Aggregating over relationships between data points

How to choose which features to encode together
- if a feature appears in many subtrees of the fitted model, it is worth encoding
- for example, when feature 1 and feature 2 appear together in many subtrees

import numpy as np
import matplotlib.pyplot as plt

# model is assumed to be a fitted tree-based model and df its training features
features = df.columns
importances = model.feature_importances_
indices = np.argsort(importances)[-10:]  # indices of the 10 most important features
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

How to combine data effectively

Order of tuning the models

Mean encoding

Hyperparameter optimization

Find the most influential parameters

Look up the documentation online

The documentation states which parameters to tune first
and what each parameter means

Libraries that can search for the best parameters automatically

An example of automated parameter tuning

How parameters affect the model

underfitting

cannot even learn the training set

good fit
overfitting

Tuning tree parameters

Model	Where
GBDT	XGBoost, LightGBM
RandomForest, ExtraTrees	scikit-learn
others	RGF

XGBoost / LightGBM

max_depth
- deeper trees help the model build useful feature interactions
- around 7 is usually a good starting value
subsample / bagging_fraction
- fraction of rows sampled for each tree; using fewer rows makes overfitting less likely
colsample_bytree / colsample_bylevel / feature_fraction
min_child_weight / min_data_in_leaf
- the most important
eta / learning_rate, num_round / num_iterations: learning rate and number of boosting rounds
- early stopping can be used: stop training once the validation loss starts to rise
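
For reference, the parameter pairs above written out as two configuration dictionaries. The values are illustrative starting points that I picked for this sketch, not recommendations from the course. The number of rounds (num_round / num_iterations) is usually passed to the training call instead.

```python
# XGBoost name -> LightGBM name for the parameters discussed above.
name_map = {
    "max_depth": "max_depth",
    "subsample": "bagging_fraction",
    "colsample_bytree": "feature_fraction",
    "min_child_weight": "min_data_in_leaf",
    "eta": "learning_rate",
}

xgb_params = {
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 1,
    "eta": 0.1,
}

lgb_params = {
    "max_depth": 7,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,       # LightGBM only applies bagging when this is non-zero
    "feature_fraction": 0.8,
    "min_data_in_leaf": 20,
    "learning_rate": 0.1,
}
```

Note that min_child_weight and min_data_in_leaf play the same role but are measured differently (sum of hessians versus number of rows), so their values are not interchangeable.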

RandomForest / ExtraTrees

n_estimators (the more the better)
- number of trees
- start at around 10, increase gradually, and watch the effect on the metric
max_depth
- maximum depth of each tree
max_features
- how many features are considered at each split
min_samples_leaf
n_jobs
- number of parallel jobs to run
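
A sketch of the "start small and add trees" procedure using scikit-learn's `warm_start` option, which keeps the already fitted trees and only trains the new ones. The dataset is synthetic and the grid of tree counts is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(n_estimators=10, warm_start=True, n_jobs=-1, random_state=0)

val_mse = {}
for n in [10, 20, 40, 80]:
    rf.set_params(n_estimators=n)
    rf.fit(X_tr, y_tr)  # with warm_start=True only the additional trees are fitted
    val_mse[n] = mean_squared_error(y_val, rf.predict(X_val))

print(val_mse)
```

Once the validation score stops changing noticeably, adding more trees mostly costs time.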

Neural networks

PyTorch and Keras
number of neurons per layer
number of layers
batch size
- the number of samples processed in each training step
learning rate
- has to be chosen appropriately: neither too high nor too low

Linear models

SVMs need almost no parameter tuning

Regularization
- L1
- L2
L1 can be used for feature selection

Very useful when GBDT or neural network training takes a long time

At submission time, combining the same model trained with different parameters can be very effective

Statistics and distance based features

Groupby features

For each user, compute the lowest and the highest price
For each page, compute the position of the lowest-priced item
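
The two groupby features above could be computed along these lines. The table and its column names (`user_id`, `page_id`, `position`, `price`) are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "page_id":  [10, 10, 11, 10, 10],
    "position": [1, 2, 1, 1, 2],
    "price":    [5.0, 3.0, 8.0, 4.0, 6.0],
})

# Per-user statistics, broadcast back to every row of that user.
df["user_min_price"] = df.groupby("user_id")["price"].transform("min")
df["user_max_price"] = df.groupby("user_id")["price"].transform("max")

# Per-page: position of the row with the lowest price.
cheapest_rows = df.groupby("page_id")["price"].idxmin()
min_pos = (
    df.loc[cheapest_rows, ["page_id", "position"]]
    .rename(columns={"position": "page_min_price_position"})
)
df = df.merge(min_pos, on="page_id", how="left")
print(df)
```

`transform` keeps the original row count, which makes it convenient for adding group statistics as new columns.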

Come up with as many features as possible

Neighbors

Applied to my own project, this would be

the max and min of IT energy within a given range
the max and min of humidity within a given range

Matrix Factorization / dimensionality reduction

Concrete methods for dimensionality reduction

PCA can help turn categorical features into real-valued ones
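
One way to read the last remark, as a sketch: one-hot encode the categorical column and project it onto its first principal components, which yields a few dense real-valued features. PCA is done here with a plain SVD of the centered matrix; the `color` column is a made-up example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red", "red"]})

# One-hot matrix of shape (n_rows, n_categories).
onehot = pd.get_dummies(df["color"]).to_numpy(dtype=float)

# PCA via SVD of the column-centered matrix.
centered = onehot - onehot.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 2  # number of components to keep (assumed for this example)
components = centered @ Vt[:k].T  # shape (n_rows, k), real-valued features
print(components)
```

With many sparse categorical or text columns, a truncated SVD on the sparse matrix is the usual substitute, since centering would destroy sparsity.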

Building feature combinations (interactions)

First approach: concatenate the features first, then one-hot encode

Second approach: one-hot encode first, then combine the columns

A numerical example

Common ways to combine numerical features

multiplication
addition
subtraction
division

This approach works very well for tree-based models
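
A sketch of both kinds of interaction on a toy table. The column names (`pclass`, `sex`, `f1`, `f2`) are placeholders chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "pclass": [3, 1, 3, 2],
    "sex":    ["male", "female", "female", "male"],
    "f1":     [2.0, 4.0, 6.0, 8.0],
    "f2":     [1.0, 2.0, 3.0, 4.0],
})

# Categorical interaction: concatenate the values, then one-hot encode the result.
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]
onehot = pd.get_dummies(df["pclass_sex"])

# Numerical interactions: the four basic arithmetic combinations.
df["f1_times_f2"] = df["f1"] * df["f2"]
df["f1_plus_f2"] = df["f1"] + df["f2"]
df["f1_minus_f2"] = df["f1"] - df["f2"]
df["f1_div_f2"] = df["f1"] / df["f2"]

print(onehot.columns.tolist())
print(df[["f1_times_f2", "f1_div_f2"]])
```

Trees cannot easily represent products or ratios with axis-aligned splits, which is why handing them such features explicitly tends to help.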

Practical steps

I do not fully understand this part yet; worth trying out

t-SNE

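can be used for visualization
the result can be used as a feature (similar to a classifier output)
the perplexity parameter matters a lot
interpret the results with care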

Ensembling

What is ensembling

combining different machine learning models to get a better prediction

Average

Simply average two models that perform differently

Weighted Average

Give different models different weights

Conditional averaging

Use model 1 under some conditions and model 2 under others
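
The three averaging schemes in a few lines of NumPy. The predictions are made up: `pred_1` is accurate for small values of a hypothetical feature `x`, `pred_2` for large ones, and the weights and threshold are arbitrary.

```python
import numpy as np

x      = np.array([1, 2, 3, 10, 11, 12])              # feature used for the condition
y_true = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
pred_1 = np.array([1.0, 2.0, 3.0, 7.0, 7.0, 7.0])     # good when x is small
pred_2 = np.array([5.0, 5.0, 5.0, 10.0, 11.0, 12.0])  # good when x is large

# Simple average: both models count equally.
simple_avg = (pred_1 + pred_2) / 2

# Weighted average: weights would normally be tuned on validation data.
weighted_avg = 0.3 * pred_1 + 0.7 * pred_2

# Conditional average: pick the model depending on a condition on the input.
conditional = np.where(x < 5, pred_1, pred_2)

for name, p in [("simple", simple_avg), ("weighted", weighted_avg), ("conditional", conditional)]:
    print(name, np.mean((y_true - p) ** 2))
```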

Bagging

Average many slightly different versions of the same model to make the prediction

example: random forest

Why use bagging

Errors due to bias (underfitting)
Errors due to variance (overfitting)

Important bagging parameters

seed
- controls how different the models are from each other
- row sampling
- random shuffling
- column sampling
- model-specific parameters
- number of models
- running them in parallel

A hand-written bagging example
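
The original code screenshot is missing, so this is my own minimal sketch of hand-written bagging: the same model type is trained several times on bootstrap samples with different seeds, and the predictions are averaged. The data are synthetic and a decision tree stands in for the base model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

n_bags = 10
bagged_pred = np.zeros(len(X_test))

for seed in range(n_bags):
    rng = np.random.default_rng(seed)
    # Row sampling with replacement (bootstrap) makes each model slightly different.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    model = DecisionTreeRegressor(random_state=seed)
    model.fit(X_train[rows], y_train[rows])
    bagged_pred += model.predict(X_test)

# Average the predictions of all bags.
bagged_pred /= n_bags
print(bagged_pred[:5])
```

Averaging mainly reduces the variance part of the error, which is why high-variance models such as deep trees benefit most.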

Boosting

What is boosting

A way of weighting and chaining models: each model is trained taking into account how the previous models performed

Main types of boosting

Weight based

How it works:

Samples that the current model predicts badly are passed on, with higher weight, for the next model to learn

Important parameters

Residual based

The most important family of models; used in almost every competition

Steps:

fit a first model
compute its error (the residual)
fit the next model to predict that error
the final prediction is the sum of all models' predictions
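
The steps above can be sketched by hand with shallow trees on synthetic data. The learning rate, tree depth and number of rounds are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.5
n_rounds = 20

prediction = np.zeros_like(y)
models, train_mse = [], []

for _ in range(n_rounds):
    # The error left by the models so far becomes the next model's target.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # The ensemble prediction is the (scaled) sum of all trees' predictions.
    prediction += learning_rate * tree.predict(X)
    models.append(tree)
    train_mse.append(np.mean((y - prediction) ** 2))

print(train_mse[0], train_mse[-1])
```

The training error keeps shrinking with every round, which is exactly why the number of rounds has to be chosen on validation data.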


Well-known residual-based boosting implementations

XGBoost
LightGBM
H2O
CatBoost
- advantage: does not need much time spent on tuning
scikit-learn
- any scikit-learn estimator can be used as the base model

Stacking

Principle

Different models perform differently in different regions of the data; a meta-model is trained on the base models' predictions to learn which model to trust and how to weight them.

An example

Points to note

for time-series problems, the split must not be random
the base models should be as diverse as possible
diversity comes from
- different algorithms
- different features
the meta-model can be as simple as possible
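
A sketch of stacking with a simple holdout scheme on synthetic data: the base models are fitted on part A, their predictions on part B become the training features of the meta-model, and part C plays the role of the test set. The choice of base models and split sizes is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_a, y_a = X[:150], y[:150]        # part A: train the base models
X_b, y_b = X[150:250], y[150:250]  # part B: train the meta-model
X_c = X[250:]                      # part C: stands in for the test set

# Diverse base models: different algorithms.
base_models = [
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]
for model in base_models:
    model.fit(X_a, y_a)

# Base-model predictions are the meta-model's features.
meta_train = np.column_stack([m.predict(X_b) for m in base_models])
meta_test = np.column_stack([m.predict(X_c) for m in base_models])

# A deliberately simple meta-model.
meta_model = LinearRegression().fit(meta_train, y_b)
stacked_pred = meta_model.predict(meta_test)
print(meta_model.coef_)
```

For time-series data, parts A, B and C would have to follow each other in time instead of being arbitrary slices.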

StackNet

Unlike plain stacking, the models are arranged in several levels, like the layers of a neural network

Real Example

Stacking Example


