Poisson Distribution and Poisson Regression

As online marketplaces replace traditional malls, more and more people are interested in becoming online sellers.

The purpose of this article is to give some insights to online sellers who want to find the characteristics of product postings that might increase sales. The data used for this project are the results of searching for 'keyboard' on eBay, scraped using BeautifulSoup.

The raw data is messy and contains many duplicate postings, because eBay lets sellers automatically re-list an item if it doesn't sell. There is also a lot of cleaning to do, such as stripping out less meaningful strings, converting data types, and removing sparse columns.

After the initial cleaning, 5,211 observations were left, which were then split again in a 7:3 ratio. There is still more engineering to do, such as imputing missing values, checking multicollinearity, and feature engineering.

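As a rough illustration of a 7:3 split, the sketch below shuffles row indices with NumPy and cuts them at 70%. The seed and the variable names are assumptions for illustration and are not taken from the original project.

```python
import numpy as np

n_obs = 5211                      # observations left after cleaning (from the article)
rng = np.random.default_rng(42)   # arbitrary seed, only for reproducibility
shuffled = rng.permutation(n_obs)

n_train = int(0.7 * n_obs)        # 70% of the rows go to the first part
train_idx, holdout_idx = shuffled[:n_train], shuffled[n_train:]
print(len(train_idx), len(holdout_idx))
```

Splitting on shuffled indices rather than on the rows themselves makes it easy to apply the same partition to both features and target.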

Let’s check which variables have missing values.

np.sum(pd.isna(x_train), axis=0)

price                  7
rating              3418
num_ratings            0
watcher                0
shipping               2
free_return            0
open_box               2
pre_owned              2
refurbished            2
benefits_charity       0
price_present          0
rating_present         0
shipping_present       0
status_present         0
dtype: int64

Of the ratings, 3,418 (92%) are missing. Imputing with the mean or median would underestimate the variance of ratings, which may not be ideal. Here, we use MICE (Multiple Imputation by Chained Equations), which uses regression to predict each missing value from the other features. For more details, see the "MICE steps" section at this link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/

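The core idea of chained equations can be sketched in a few lines of NumPy. This is a simplified, deterministic version written for illustration: it regresses each incomplete column on the others and cycles until the fills settle, whereas full MICE also adds random draws and produces several imputed datasets. The function name and toy data are assumptions, not the code used in the project.

```python
import numpy as np

def chained_regression_impute(X, n_iter=10):
    """Fill NaNs by repeatedly regressing each incomplete column on the others."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Start from column means so every regression has complete inputs.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any() or miss_j.all():
                continue  # nothing to impute, or nothing to learn from
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Least-squares fit on rows where column j was observed.
            coef, *_ = np.linalg.lstsq(A[~miss_j], filled[~miss_j, j], rcond=None)
            filled[miss_j, j] = A[miss_j] @ coef
    return filled

# Toy data: the second column is an exact linear function of the first.
rng = np.random.default_rng(0)
x0, x2 = rng.normal(size=50), rng.normal(size=50)
truth = np.column_stack([x0, 2 * x0 + 1, x2])
X = truth.copy()
X[::5, 1] = np.nan            # hide every fifth value of the second column
imputed = chained_regression_impute(X)
```

Unlike mean imputation, the filled values here follow the relationship with the other features, so they do not collapse onto a single number.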

Now let’s check the target’s distribution in the training set.

Distribution of target

The target is extremely skewed to the right, which suggests that the normality assumption of linear regression (strictly, an assumption about the errors) may not hold. We may need to consider transforming the target, or even Poisson regression, since the Poisson distribution is skewed to the right when its mean is close to zero. A log transformation can reduce the skewness.

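To make the effect of the log transformation concrete, here is a small sketch that computes sample skewness before and after applying log(1 + y). The toy counts are made up to mimic a target where most items sell nothing; they are not the eBay data.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

# Hypothetical sale counts: 70% zeros and a long right tail.
sold = np.array([0] * 70 + [1] * 15 + [5] * 10 + [50] * 4 + [500])

raw_skew = skewness(sold)
log_skew = skewness(np.log1p(sold))   # log1p handles the zeros safely
print(raw_skew, log_skew)
```

The transformed target is still right-skewed because of the mass at zero, which is one reason a count model is also worth considering.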

In regression, there are more assumptions to check: linearity between each feature and the (transformed) target, interaction effects, and constant variance of residuals.

Of course, the assumptions are not going to be met perfectly, but they should at least be checked if we want to reduce bias of the estimated coefficients in the model.

Linearizing the relationship between Sale Volume (target) and Price (feature)

The above plot shows that after transforming the price variable to 1/√(price + 1), its relationship with the target is closer to linear.

Interaction plot between ‘watcher’ (number of views) and ‘free return’ (binary variable: whether the product can be returned for free)

Including interaction terms also relaxes the strict assumption that each feature affects the target in the same way per unit increase. The above plot is one example showing that the interaction term ‘watcher*free_return’ should be included in the model, since the number of views (‘watcher’) has less impact on the sale volume (‘sold’) when there is a free-return policy.

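The two kinds of engineered features discussed here, the inverse square-root transform of price and the ‘watcher*free_return’ interaction, can be built as follows. The arrays are invented values for illustration; only the column names and the form of the transform come from the article.

```python
import numpy as np

# Hypothetical listings (values are made up).
price = np.array([9.0, 29.0, 39.0, 99.0])
watcher = np.array([0, 3, 10, 25])
free_return = np.array([0, 1, 0, 1])

# Transform used to linearize price against the target.
inv_sqrt_price = 1.0 / np.sqrt(price + 1.0)

# Interaction term: lets the effect of 'watcher' differ when returns are free.
watcher_x_free_return = watcher * free_return

X = np.column_stack([inv_sqrt_price, watcher, free_return, watcher_x_free_return])
print(X)
```

With the interaction column in the design matrix, the model fits one slope for ‘watcher’ without free returns and a separate adjustment to that slope with free returns.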

Was feature engineering helpful overall? Yes! As one metric, R-squared increased roughly 1.5-fold, from 0.262 to 0.399. Feature engineering helps fit the data better, especially when you don’t have enough features to fit the model. In this dataset, buyers’ “reviews” are missing, and they may be one of the most important features in predicting sale rate.

Before diving into modeling, there’s one more important step: checking for outliers.

Cook’s distance

An observation’s Cook’s distance combines the size of its residual with its leverage, that is, its distance from the centroid of the feature space. In a nutshell, it measures how unusual the observation is in terms of both X (features) and y (target). Treating Cook’s distance as approximately F-distributed, a value of about 0.8 (the 40th percentile of F) means that removing this observation moves the estimated coefficients to the edge of the 40% confidence region, which is a dramatic change for omitting a single observation. It turns out this outlier was a keyboard cover rather than an actual keyboard, which seems a legitimate reason to remove it from the data. Removing it also helps with the constant-variance assumption.

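For readers who want to see the computation, below is a minimal NumPy sketch of Cook’s distance for ordinary least squares, using the standard formula D_i = e_i² / (p·s²) · h_ii / (1 − h_ii)², where e_i is the residual and h_ii the leverage. The toy data, with one deliberately odd point, is an assumption for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row of an OLS fit (X must include the intercept)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    residuals = y - X @ (XtX_inv @ X.T @ y)
    s2 = residuals @ residuals / (n - p)          # residual variance estimate
    return residuals**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

# Ten points on the line y = 2x + 1, plus one point far away in both x and y.
x = np.append(np.arange(10.0), 20.0)
y = np.append(2.0 * np.arange(10.0) + 1.0, 0.0)
X = np.column_stack([np.ones_like(x), x])

d = cooks_distance(X, y)
print(int(np.argmax(d)), d.max())
```

The point that is extreme in both feature space and target dominates the ranking, which matches the intuition in the paragraph above.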

Distribution of fitted values of linear regression vs. Poisson regression

Linear Regression and Poisson Regression were fit to the data. Linear regression seems to estimate the target distribution better.

MAE (Mean Absolute Error) is 43.8 for Linear Regression and 60.5 for Poisson Regression.

On the hold-out validation set, both models turn out to be overfit, but Linear Regression did much better than Poisson Regression in terms of MAE. In fact, Poisson Regression did worse than simply predicting the sample mean.

Linear regression MAE: 62.9, R2 score: 0.03
Poisson regression MAE: 119.3
MAE with sample mean(training) is 95.4

Regularization seems to be necessary for these models. The ‘statsmodels’ package was used for Poisson regression, and it offers only Elastic-Net regularization; ‘sklearn’ offers Lasso and Ridge. Trying these regularizers gives different results.

MAE against regularization weight for Linear Regression (left) and Poisson Regression (right)

With linear regression, there is no change in MAE after regularization. With Poisson regression, there is a clear dip in MAE, which roughly halved from 119.3 to 62.1. Given this impressive improvement after regularization, Poisson regression is chosen as the final model.

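To show how an Elastic-Net penalty can zero out weights in a Poisson model, here is a from-scratch sketch using proximal gradient descent in NumPy. It is not the statsmodels implementation used in the article; the function names, learning rate, penalty values, and simulated data are all assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink toward zero and clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_poisson_elastic_net(X, y, alpha, l1_ratio=0.5, lr=0.01, n_iter=3000):
    """Poisson regression with log link; alpha scales the combined L1/L2 penalty."""
    n, p = X.shape
    b0 = np.log(y.mean() + 1e-8)      # intercept starts at the log of the mean count
    w = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(b0 + X @ w)       # predicted mean counts
        grad_w = X.T @ (mu - y) / n + alpha * (1 - l1_ratio) * w
        b0 -= lr * np.mean(mu - y)    # the intercept is left unpenalized
        w = soft_threshold(w - lr * grad_w, lr * alpha * l1_ratio)
    return b0, w

# Simulated counts where only the first of three features matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 0]))

_, w_free = fit_poisson_elastic_net(X, y, alpha=0.0)      # no penalty
_, w_strong = fit_poisson_elastic_net(X, y, alpha=100.0)  # very heavy penalty
print(w_free, w_strong)
```

Sweeping alpha between these two extremes and tracking validation MAE is one way to produce a curve like the one in the plot above.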

[('price', 0.7030020418009802),
('rating', 0.0),
('num_ratings', 0.0),
('watcher', 0.0),
('shipping', 0.47952629436260674),
('free_return', 0.5762104970838818),
('open_box', 0.0),
('pre_owned', 0.0),
('refurbished', 0.0),
('benefits_charity', 0.0),
('price_present', 0.0),
('rating_present', 0.30835693807441716),
('shipping_present', 0.0),
('status_present', 0.0),
('watcher * free_return', 0.0),
('watcher * refurbished', 0.0),
('shipping * benefits_charity', 0.0),
('price * shipping', 0.0),
('num_ratings * shipping', 0.0)]

With Elastic-Net regularization, 15 out of 19 features are zeroed out (which is, personally, a little depressing after all the feature engineering and interaction analysis).

Now, on the test dataset, the final results are as follows.

NMAE: 0.579
MAE for the model: 42.67
MAE with the sample mean of train+val: 73.7

Distribution of Poisson-model-fitted values

The normalized mean absolute error (NMAE) is 58%, which means 42% less error than predicting the sample mean. The fitted values look much more like the target distribution after regularization.

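The NMAE reported above is consistent with dividing the model’s MAE by the MAE of the sample-mean baseline. The short sketch below reproduces that ratio from the two reported numbers; the helper names are assumptions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def normalized_mae(model_mae, baseline_mae):
    """MAE of the model relative to the MAE of a baseline predictor."""
    return model_mae / baseline_mae

# Reported test-set figures: model MAE and sample-mean baseline MAE.
nmae = normalized_mae(42.67, 73.7)
print(round(nmae, 3))
```

A value below 1 means the model beats the baseline; the gap to 1 is the share of error removed.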

Interaction between price and shipping cost

There is a weak but interesting interaction between price and shipping cost. Overall, price has a negative linear relationship with sale volume, as expected, but when shipping costs more than 10 dollars, people are more sensitive to the price. This makes sense: a keyboard is generally a cheap product, so people are reluctant to buy one when delivery is too expensive.

The model may be better after all these efforts, but the actual sale volume is still much more skewed than the fitted values, with 70% of the items not sold at all. This extreme skewness is at odds with the usual Poisson regression assumption that the mean and the variance are equal.

Perhaps, in the future, the negative binomial distribution is worth considering, since its variance, Var(Y) = μ + μ²/k, includes a dispersion parameter k that can be adjusted to control the variance for different feature values.

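A quick way to see why the Poisson assumption is strained is to compare the sample variance with the sample mean, and then ask which k would make the negative binomial variance match. The sketch below does this on made-up counts with 70% zeros; the data and the moment-based estimate of k are illustrative assumptions, not results from the project.

```python
import numpy as np

def negative_binomial_variance(mu, k):
    """Var(Y) = mu + mu^2 / k; approaches the Poisson variance mu as k grows."""
    return mu + mu**2 / k

# Hypothetical sale counts with 70% zeros and a long right tail.
sold = np.array([0] * 70 + [1] * 15 + [5] * 10 + [50] * 4 + [500])

mean, var = sold.mean(), sold.var()
dispersion_ratio = var / mean          # close to 1 under a Poisson model

# Method-of-moments k: solve var = mean + mean^2 / k for k.
k_hat = mean**2 / (var - mean)
print(dispersion_ratio, k_hat)
```

A ratio far above 1 signals overdispersion, and a small k means the quadratic term in the variance is doing most of the work.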

Lastly, let’s interpret this model in a meaningful way.

Sale Volume = C * exp{𝛽1/√(Price+1) + 𝛽2/√(Shipping Cost + 1) + 𝛽3*Free return + 𝛽4*Rating present} + ε where Sale Volume ~ Poisson(λ)

𝐶 ≈ 0.239, 𝛽1 ≈ 9.731, 𝛽2≈ 1.413, 𝛽3 ≈ 1.307, 𝛽4 ≈ 1.167

𝛽1 ≈ 9.731 means that increasing the price of your keyboard from $29.00 to $39.00 reduces sale volume by 21.2% on average.

1-exp{𝛽1(1/√(39+1)-1/√(29+1))}≈0.212.

𝛽2 ≈ 1.413 means that increasing the shipping cost from free to just $1.00 reduces sale volume by 33.9% on average: 1-exp{𝛽2(1/√(1+1)-1/√(0+1))}≈0.339.

𝛽3 ≈ 1.307 means that offering free returns multiplies sale volume by about 3.7 on average. 𝑒^𝛽3 ≈ 3.70.

𝛽4 ≈ 1.167 means that having at least one review with a rating multiplies sale volume by about 3.2 on average. 𝑒^𝛽4 ≈ 3.21.

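The percentages and multipliers above follow directly from the fitted equation, and the arithmetic can be checked with a few lines of Python. The coefficient values are the ones reported in the article; the helper function name is an assumption.

```python
import math

beta1, beta2, beta3, beta4 = 9.731, 1.413, 1.307, 1.167

def relative_change(beta, old, new):
    """Ratio of expected sales when a cost moves from `old` to `new` dollars,
    for a term of the form beta / sqrt(cost + 1) inside the exponent."""
    return math.exp(beta * (1 / math.sqrt(new + 1) - 1 / math.sqrt(old + 1)))

price_drop = 1 - relative_change(beta1, 29, 39)      # price $29 -> $39
shipping_drop = 1 - relative_change(beta2, 0, 1)     # shipping free -> $1
free_return_multiplier = math.exp(beta3)             # binary feature switched on
rating_multiplier = math.exp(beta4)                  # binary feature switched on

print(price_drop, shipping_drop, free_return_multiplier, rating_multiplier)
```

Because the cost terms enter through 1/√(cost + 1), the same dollar increase matters more at low prices than at high ones, so these percentages apply only to the specific ranges shown.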

With these inferences made, I could make a few suggestions to keyboard sellers:

  1. Be cautious about raising a price that is under $50, since doing so can reduce the sale rate by at least 10% on average.
  2. Buyers are sensitive to shipping costs above $10; at that level, raising the price leads to a large drop in the sale rate.
  3. Free returns are highly encouraged!
  4. Reward buyers for writing product reviews!

There are also some caveats in this model that should not be ignored.

  1. The weights behind these suggestions come without confidence intervals, so the impacts of price, shipping cost, free return, and rating stated above are only estimated averages, and the actual impact may vary a lot.
  2. The weights are likely biased, since strong predictors such as the content of buyers’ reviews and seller ratings are missing.

In the future, one could reduce the variance of the weights through dimensionality reduction, such as PCA, to address the multicollinearity that wasn’t treated in this project.

P.S. If you’d like to see the Python code, it’s available at https://github/jung-akim/keyboard

Translated from: https://medium/analytics-vidhya/keyboard-sale-rate-prediction-by-poisson-regression-on-ebay-64247d29215d
