Poisson Distribution and Poisson Regression: Keyboard Sale Rate Prediction by Poisson Regression on eBay

As online marketplaces replace traditional malls, more and more people are becoming interested in selling online.

This article aims to give online sellers some insight into which characteristics of a product posting might increase sales. The data are the search results for 'keyboard' on eBay, scraped with BeautifulSoup.

The raw data is messy. It contains many duplicate postings because eBay lets sellers automatically relist an item that doesn't sell. It also needs a lot of cleaning: stripping out less meaningful strings, converting data types, removing sparse columns, and so on.

After the initial cleaning, 5,211 observations remained, which were then split again in a 7:3 ratio. More work is still needed: imputing missing values, checking for multicollinearity, engineering features, and so on.
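The article does not show how the split was made, so the following is only a minimal sketch of a shuffled 7:3 split by row index. The helper name `split_indices` and the seed are assumptions for illustration.

```python
import numpy as np

def split_indices(n_rows, train_fraction=0.7, seed=0):
    """Shuffle the row indices and cut them into two disjoint parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(round(n_rows * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]

# 5,211 is the number of cleaned observations reported above.
train_idx, holdout_idx = split_indices(5211)
print(len(train_idx), len(holdout_idx))
```

Splitting before imputation and feature engineering keeps information from the held-out rows out of the training steps.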

Let's check which variables have missing values.

# Count the missing values in each column of the training features
np.sum(pd.isna(x_train), axis=0)

price                  7
rating              3418
num_ratings            0
watcher                0
shipping               2
free_return            0
open_box               2
pre_owned              2
refurbished            2
benefits_charity       0
price_present          0
rating_present         0
shipping_present       0
status_present         0
dtype: int64

There are 3,418 missing ratings (92%). Imputing them with the mean or median would underestimate the variance of the ratings, which is not ideal. Instead, we use MICE (Multiple Imputation by Chained Equations), which uses regression on the other features to predict each missing value. For more detail, see the "MICE steps" section at this link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
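As a rough illustration of chained-equation imputation, the sketch below uses scikit-learn's `IterativeImputer` on synthetic data. This is an assumption: the article does not say which MICE implementation was used, and the column values here are made up. Note that `IterativeImputer` returns a single imputed dataset; full MICE would repeat this with different seeds and pool the results.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for 'price', 'num_ratings' and 'rating'
price = rng.uniform(5, 100, n)
num_ratings = rng.poisson(3, n).astype(float)
rating = 4.0 + 0.1 * num_ratings + rng.normal(0, 0.2, n)

X = np.column_stack([price, num_ratings, rating])
X_missing = X.copy()
X_missing[rng.random(n) < 0.5, 2] = np.nan  # hide about half of the ratings

# sample_posterior=True draws imputations from the predictive distribution
# instead of always filling in the conditional mean.
imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())
```

Drawing from the predictive distribution is what keeps the imputed ratings from collapsing toward a single value, which is the concern with mean or median imputation.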

Now let's check the distribution of the target in the training set.

Figure: Distribution of the target

It is extremely right-skewed, which may violate the normality assumption of linear regression models. We may need to transform the target, or even use Poisson regression, since the Poisson distribution is right-skewed when its mean is close to zero. A log transformation can reduce the skewness.
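To see how a log transformation reduces right skew, here is a small sketch on synthetic counts. The data are simulated as a stand-in for the 'sold' target, not the article's data, and `log1p` (log of 1 + x) is an assumed choice so that zero sales stay defined.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

rng = np.random.default_rng(0)
# Synthetic stand-in for 'sold': many zeros and a long right tail
sold = rng.negative_binomial(n=0.2, p=0.01, size=5000)

raw_skew = skewness(sold)
log_skew = skewness(np.log1p(sold))
print(raw_skew, log_skew)
```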

In regression there are more assumptions to check: linearity between each feature and the (transformed) target, interaction effects, and constant variance of the residuals.

Of course, the assumptions will never be met perfectly, but they should at least be checked if we want to reduce bias in the estimated coefficients.

Figure: Linearizing the relationship between Sale Volume (target) and Price (feature)

The plot above shows that after transforming the price variable to 1/sqrt(price), its relationship with the target is closer to linear.
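The transformation can be sketched as a one-line helper. The `+ 1` offset is taken from the final model formula later in the article, where price enters as 1/√(Price + 1); the function name `inverse_sqrt` is an assumption.

```python
import numpy as np

def inverse_sqrt(values):
    """Map a non-negative cost x to 1 / sqrt(x + 1); the +1 keeps x = 0 finite."""
    return 1.0 / np.sqrt(np.asarray(values, dtype=float) + 1.0)

print(inverse_sqrt([0, 3, 99]))
```

The transform compresses large prices, so a $10 change matters more for a cheap keyboard than for an expensive one.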

Figure: Interaction plot between 'watcher' (number of views) and 'free return' (binary variable: whether the product can be returned for free)

Including interaction terms also relaxes the strict assumption that each feature affects the target in the same way per unit increase, regardless of the other features. The plot above is one example: the interaction term 'watcher*free_return' should be included in the model because the number of views ('watcher') has less impact on sale volume ('sold') when there is a free-return policy.
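The effect of an interaction term can be sketched with ordinary least squares on synthetic data. The slopes of 2.0 and 0.5 below are invented for illustration; only the column names follow the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
watcher = rng.uniform(0, 50, n)
free_return = rng.integers(0, 2, n).astype(float)

# Synthetic target: 'watcher' has slope 2.0 without free return, 0.5 with it
sold = 2.0 * watcher - 1.5 * watcher * free_return + rng.normal(0, 1, n)

# Design matrix: intercept, main effects, and the product (interaction) column
X = np.column_stack([np.ones(n), watcher, free_return, watcher * free_return])
coef, *_ = np.linalg.lstsq(X, sold, rcond=None)

slope_without = coef[1]          # effect of 'watcher' when free_return = 0
slope_with = coef[1] + coef[3]   # effect of 'watcher' when free_return = 1
print(slope_without, slope_with)
```

Without the product column, the model would be forced to estimate one shared slope for both groups.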

Was feature engineering helpful overall? Yes! As one metric, R-squared rose from 0.262 to 0.399, about 1.5 times its original value. Feature engineering helps fit the data better, especially when you don't have enough features. In this dataset, buyers' reviews are missing, and they may be one of the most important features for predicting sale rate.

Before diving into modeling, there's one more important step: checking for outliers.

Figure: Cook's distance

An observation's Cook's distance combines its residual with its leverage, that is, its distance from the centroid of the feature space. In a nutshell, it measures how unusual the observation is in terms of both X (features) and y (target). Comparing Cook's distance against an F-distribution, a value of about 0.8 (the 40th percentile of F) means that removing this single observation moves the estimated coefficients to the edge of the 40% confidence region, a dramatic change for omitting just one observation. It turns out that this outlier was a keyboard cover, not an actual keyboard, which seems a legitimate reason to remove it from the data. Removing it also helps with the constant-variance assumption.
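For readers who want to compute it, here is a minimal NumPy sketch of Cook's distance for an ordinary least-squares fit, using the standard formula D_i = e_i² / (p·s²) · h_ii / (1 − h_ii)². The data and the planted outlier are synthetic, and the helper name is an assumption; the article's own code may use a library routine instead.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance per observation; X must already contain an intercept column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    hat = X @ np.linalg.pinv(X.T @ X) @ X.T   # hat (projection) matrix
    leverage = np.diag(hat)                   # h_ii
    residuals = y - hat @ y                   # e_i
    mse = residuals @ residuals / (n - p)     # s^2
    return residuals**2 / (p * mse) * leverage / (1 - leverage) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 50)
x[0], y[0] = 30.0, 0.0  # planted point: far from the centroid and off the line

X = np.column_stack([np.ones_like(x), x])
distances = cooks_distance(X, y)
print(int(np.argmax(distances)))
```

A point needs both a large residual and high leverage to score highly, which is why the planted observation stands out.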

Figure: Distribution of fitted values of linear regression vs. Poisson regression

Linear regression and Poisson regression were both fit to the data. Linear regression seems to approximate the target distribution better.

On the training data, MAE (Mean Absolute Error) is 43.8 for linear regression and 60.5 for Poisson regression.

On the hold-out validation set, both models turn out to be overfit, but linear regression did much better than Poisson regression in terms of MAE. In fact, Poisson regression did worse than simply predicting the sample mean.

Linear regression MAE: 62.9, R2 score: 0.03
Poisson regression MAE: 119.3
MAE with sample mean(training) is 95.4
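The sample-mean baseline in the last line above can be computed as in the sketch below: predict the training mean for every validation row and take the MAE. The toy arrays are invented for illustration.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Toy sale volumes, not the article's data
y_train = np.array([0, 0, 0, 1, 2, 5, 40, 200], dtype=float)
y_val = np.array([0, 0, 3, 10, 150], dtype=float)

# Baseline: every validation item is predicted to sell the training mean
baseline_prediction = np.full_like(y_val, y_train.mean())
baseline_mae = mean_absolute_error(y_val, baseline_prediction)
print(baseline_mae)
```

A model whose validation MAE is above this baseline, as Poisson regression's is here, is doing worse than a constant guess.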

Regularization seems necessary for these models. The 'statsmodels' package was used for Poisson regression, and it offers only Elastic-Net regularization. 'sklearn' offers Lasso and Ridge. These regularizers give different results.

Figure: MAE against regularization weight for linear regression (left) and Poisson regression (right)

With linear regression, regularization does not change the MAE. With Poisson regression, there is a clear dip: MAE fell by almost half, from 119.3 to 62.1. Given this improvement, Poisson regression is chosen as the final model.
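The search over regularization weights can be sketched as a simple loop that refits the model for each weight and keeps the one with the lowest validation MAE. Note the substitution: this sketch uses scikit-learn's `PoissonRegressor`, which applies an L2 penalty only, rather than the statsmodels Elastic-Net fit used in the article. The data and the grid of weights are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.poisson(np.exp(0.3 + 0.5 * X[:, 0]))  # synthetic counts

X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

alphas = [0.0, 0.01, 0.1, 1.0, 10.0]  # candidate regularization weights
val_mae = []
for alpha in alphas:
    model = PoissonRegressor(alpha=alpha, max_iter=300).fit(X_train, y_train)
    val_mae.append(float(np.mean(np.abs(y_val - model.predict(X_val)))))

best_alpha = alphas[int(np.argmin(val_mae))]
print(best_alpha, val_mae)
```

An L2 penalty shrinks coefficients but does not set them exactly to zero, so it would not reproduce the sparse weights that the Elastic-Net fit gives.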

[('price', 0.7030020418009802),
 ('rating', 0.0),
 ('num_ratings', 0.0),
 ('watcher', 0.0),
 ('shipping', 0.47952629436260674),
 ('free_return', 0.5762104970838818),
 ('open_box', 0.0),
 ('pre_owned', 0.0),
 ('refurbished', 0.0),
 ('benefits_charity', 0.0),
 ('price_present', 0.0),
 ('rating_present', 0.30835693807441716),
 ('shipping_present', 0.0),
 ('status_present', 0.0),
 ('watcher * free_return', 0.0),
 ('watcher * refurbished', 0.0),
 ('shipping * benefits_charity', 0.0),
 ('price * shipping', 0.0),
 ('num_ratings * shipping', 0.0)]

With Elastic-Net regularization, 15 of the 19 features are zeroed out (which is, personally, a little depressing after all the feature engineering and interaction analysis).

Now, on the test dataset, the final results are as follows.

NMAE: 0.579
MAE for the model: 42.67
MAE with the sample mean of train+val: 73.7

Figure: Distribution of Poisson-model-fitted values

The Normalized Mean Absolute Error is 58%, which means 42% less error than predicting the sample mean. After regularization, the fitted values look much more like the target distribution.
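The reported figures are consistent with NMAE being the model's MAE divided by the MAE of the sample-mean baseline. That definition is inferred from the numbers rather than stated in the article; under that assumption the arithmetic is:

```python
model_mae = 42.67     # MAE of the final model on the test set
baseline_mae = 73.7   # MAE when predicting the train+val sample mean

nmae = model_mae / baseline_mae
error_reduction = 1 - nmae
print(round(nmae, 3), round(error_reduction, 2))
```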

Figure: Interaction between price and shipping cost

There's a weak but interesting interaction between price and shipping cost. Overall, price has a negative linear relationship with sale volume, as expected, but when shipping costs more than 10 dollars, people are more sensitive to price. This makes sense: a keyboard is generally a cheap product, so people are reluctant to buy one when delivery is too expensive.

The model may be better after all this effort, but the actual sale volume is still far more skewed, with 70% of the items not sold at all. This extreme skewness is at odds with the usual Poisson regression assumption that the mean and the variance are equal.

In the future, the negative binomial distribution may be worth considering: it has a dispersion parameter k in var(Y) = μ + μ^2/k, and adjusting k lets the variance differ from the mean for different feature values.
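A short sketch of the variance formula shows how k controls overdispersion: a small k inflates the variance well beyond the mean, while a large k brings it back toward the Poisson case where the variance equals μ. The values of μ and k are arbitrary examples.

```python
def negative_binomial_variance(mu, k):
    """var(Y) = mu + mu^2 / k for a negative binomial with mean mu and dispersion k."""
    return mu + mu**2 / k

mu = 10.0
overdispersed = negative_binomial_variance(mu, k=0.5)    # variance far above the mean
near_poisson = negative_binomial_variance(mu, k=100.0)   # variance close to the mean
print(overdispersed, near_poisson)
```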

Lastly, let's interpret the model in a meaningful way.

E[Sale Volume] = λ = C * exp{𝛽1/√(Price + 1) + 𝛽2/√(Shipping Cost + 1) + 𝛽3*Free return + 𝛽4*Rating present}, where Sale Volume ~ Poisson(λ)

𝐶 ≈ 0.239, 𝛽1 ≈ 9.731, 𝛽2 ≈ 1.413, 𝛽3 ≈ 1.307, 𝛽4 ≈ 1.167

𝛽1 ≈ 9.731 means that raising the price of your keyboard from $29.00 to $39.00 reduces sale volume by 21.2% on average:

1 - exp{𝛽1(1/√(39+1) - 1/√(29+1))} ≈ 0.212.

𝛽2 ≈ 1.413 means that raising the shipping cost from free to just $1.00 reduces sale volume by 33.9% on average:

1 - exp{𝛽2(1/√(1+1) - 1/√(0+1))} ≈ 0.339.

𝛽3 ≈ 1.307 means that offering free returns multiplies sale volume by about 3.7 on average: 𝑒^𝛽3 ≈ 3.70.

𝛽4 ≈ 1.167 means that having at least one rated review multiplies sale volume by about 3.2 on average: 𝑒^𝛽4 ≈ 3.21.
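The four interpretations above can be reproduced from the reported coefficients with a few lines of arithmetic. The helper name `relative_change` is an assumption; the coefficients and the 1/√(x + 1) transform are those of the model formula.

```python
import math

BETA_PRICE = 9.731
BETA_SHIPPING = 1.413
BETA_FREE_RETURN = 1.307
BETA_RATING_PRESENT = 1.167

def relative_change(beta, old, new):
    """Fractional change in expected sales when a cost moves from old to new."""
    shift = 1 / math.sqrt(new + 1) - 1 / math.sqrt(old + 1)
    return math.exp(beta * shift) - 1

price_effect = relative_change(BETA_PRICE, 29, 39)       # price from $29 to $39
shipping_effect = relative_change(BETA_SHIPPING, 0, 1)   # shipping from free to $1
free_return_multiplier = math.exp(BETA_FREE_RETURN)      # binary feature switched on
rating_multiplier = math.exp(BETA_RATING_PRESENT)        # binary feature switched on

print(round(price_effect, 3), round(shipping_effect, 3))
print(round(free_return_multiplier, 2), round(rating_multiplier, 2))
```

Because the model is multiplicative, each effect is a ratio of expected sales, so the constant C and the other features cancel out.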

Based on these inferences, I can make a few suggestions to keyboard sellers:

Be cautious about raising the price of a keyboard priced under $50, since doing so can reduce the sale rate by at least 10% on average.
Buyers are sensitive to shipping costs above $10: with shipping that expensive, raising the price causes a large drop in the sale rate.
Free returns are highly encouraged!
Reward buyers for writing product reviews!

There are also some caveats to this model that should not be ignored.

These suggestions come without confidence intervals for the weights, so the impacts of price, shipping cost, free return, and rating described above are only estimated averages, and the actual impact may vary a lot.
The weights are likely biased, since strong predictors such as the content of buyers' reviews and the sellers' ratings are missing.

In the future, one could reduce the variance of the weights through dimensionality reduction such as PCA, to address the multicollinearity that was not treated in this project.
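As a rough sketch of that idea, PCA replaces correlated features with uncorrelated components. The example below implements it with a singular value decomposition on two synthetic, strongly correlated columns; the relationship between 'price' and 'shipping' here is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
price = rng.uniform(5, 100, 300)
shipping = 0.1 * price + rng.normal(0, 1, 300)  # deliberately correlated with price

X = np.column_stack([price, shipping])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize before PCA

# Rows of 'components' are the principal directions
_, singular_values, components = np.linalg.svd(X_std, full_matrices=False)
scores = X_std @ components.T                   # data expressed in the new axes
explained = singular_values**2 / np.sum(singular_values**2)

print(np.corrcoef(X_std.T)[0, 1], np.corrcoef(scores.T)[0, 1])
print(explained)
```

The trade-off is interpretability: weights on components no longer map to a single listing attribute such as price.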

P.S. If you'd like to see the Python code, it's available at https://github/jung-akim/keyboard


Translated from: https://medium/analytics-vidhya/keyboard-sale-rate-prediction-by-poisson-regression-on-ebay-64247d29215d
