Logistic Regression: Degrees of Freedom

Back in middle and high school you likely learned to calculate the mean and standard deviation of a dataset. And your teacher probably told you that there are two kinds of standard deviation: population and sample. The formulas for the two are just small variations on one another:

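$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2} \qquad\qquad s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$

Different Formulas for the Standard Deviation (population on the left, sample on the right)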

where μ is the population mean and x-bar is the sample mean. Typically, one just learns the formulas and is told when to use them. If you ask why, the answer is something vague like “one degree of freedom was used up when estimating the sample mean,” with no real definition of what a “degree of freedom” is.
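
As a small illustration of the two formulas, here is a minimal Python sketch. The data values and the function names `population_std` and `sample_std` are made up for this example, and the population version treats the data itself as the whole population, so that μ is simply the mean of the data.

```python
import numpy as np

def population_std(x):
    """Standard deviation with divisor N (data treated as the full population)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()  # here the population mean is the mean of the data itself
    return np.sqrt(np.sum((x - mu) ** 2) / x.size)

def sample_std(x):
    """Standard deviation with divisor N - 1 (data treated as a sample)."""
    x = np.asarray(x, dtype=float)
    x_bar = x.mean()  # sample mean, estimated from the same data
    return np.sqrt(np.sum((x - x_bar) ** 2) / (x.size - 1))

# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop = population_std(data)
samp = sample_std(data)

# NumPy exposes the same choice through the ddof argument of np.std:
# ddof=0 divides by N, ddof=1 divides by N - 1.
pop_np = np.std(data, ddof=0)
samp_np = np.std(data, ddof=1)

print(pop, samp)
```

The only difference between the two functions is the divisor, which is exactly the "one degree of freedom" the vague explanation refers to.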

Degrees of freedom also show up in several other places in statistics, for example when doing t-tests, F-tests, and χ² tests, and more generally when studying regression problems. Depending on the circumstance, degrees of freedom can mean subtly different things (the Wikipedia article lists at least 9 closely related definitions by my count¹).

In this article, we’ll focus on the meaning of degrees of freedom in a regression context. Specifically, we’ll use the sense in which “degrees of freedom” is the “effective number of parameters” of a model. We’ll see how to compute the degrees of freedom for the standard deviation problem above, as well as for linear regression, ridge regression, and k-nearest neighbors regression. Along the way, we’ll also briefly discuss the relation to statistical inference (like a t-test) and model selection (how to compare two different models using their effective degrees of freedom).

Degrees of Freedom

In the regression context we have N samples, each with a real-valued outcome y. For each sample, we have a vector of covariates x, usually taken to include a constant; in other words, the first entry of the x-vector is 1 for every sample. We have some sort of model or procedure (which could be parametric or non-parametric) that is fit to the data (or otherwise uses the data) to predict what the value of y should be for a given x-vector (which may or may not be out-of-sample).

The result is the predicted value, y-hat, for each of the N samples. We’ll define the degrees of freedom, which we denote as ν (nu):

[Equation: Definition of the Degrees of Freedom]
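
For reference, one standard way to write such a definition is the covariance form below. This is an assumption about the intended formula, not a reproduction of it, and σ² (the noise variance of the outcomes) is a symbol introduced only for this sketch:

```latex
\nu \;=\; \frac{1}{\sigma^{2}} \sum_{i=1}^{N} \operatorname{Cov}\!\left(\hat{y}_i,\; y_i\right)
```

If the predictions are linear in the outcomes, so that ŷ = Hy for an N×N matrix H that does not depend on y, and the noise in the outcomes is uncorrelated with common variance σ², then Cov(ŷ_i, y_i) = H_ii σ² and the expression reduces to ν = tr(H).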

And we’ll interpret the degrees of freedom as the “effective number of parameters” of the model. Now let’s see some examples.
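
To make the “effective number of parameters” reading concrete, here is a minimal Python sketch. It assumes that the degrees of freedom of a fit whose predictions are linear in the outcomes, ŷ = Hy, can be computed as the trace of H (a common convention, taken as an assumption here). The synthetic data and the helper names `hat_matrix_ols` and `hat_matrix_ridge` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: N samples, a constant column plus two random covariates.
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])

def hat_matrix_ols(X):
    """H = X (X^T X)^{-1} X^T, so that y_hat = H @ y for ordinary least squares."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def hat_matrix_ridge(X, lam):
    """H = X (X^T X + lam * I)^{-1} X^T for ridge regression.

    For simplicity this sketch penalizes every coefficient, including the constant.
    """
    n_cols = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(n_cols), X.T)

# Assumed reading of the definition: degrees of freedom = trace of H.
nu_ols = np.trace(hat_matrix_ols(X))
nu_ridge = np.trace(hat_matrix_ridge(X, lam=10.0))

print(nu_ols, nu_ridge)
```

Under this assumption, the unpenalized fit with three columns has three degrees of freedom (up to floating-point rounding), while the ridge penalty shrinks the count below three even though the model still has three coefficients.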

The Mean and Standard Deviation

Let’s return to the school-age problem we started with. Computing the mean of a sample is just making the prediction that every data point has value equal to the mean (after all, that’s the best guess you can make under the circumstances). In other words:

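$$\hat{y}_i = \bar{y} = \frac{1}{N}\sum_{j=1}^{N} y_j \qquad \text{for every } i = 1, \dots, N$$

Estimating the Mean as a Prediction Problem (writing the data values as y, to match the regression notation)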
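
As a quick numerical check of this prediction view, the sketch below fits least squares on a single constant column and compares the fitted values with the plain mean. The data values are made up, and reading the trace of the hat matrix H as the degrees of freedom is an assumption of this example.

```python
import numpy as np

# Hypothetical outcomes, used only for illustration.
y = np.array([3.0, 5.0, 4.0, 8.0])
N = y.size

# Design matrix with one covariate: the constant 1 for every sample.
X = np.ones((N, 1))

# Ordinary least squares fit on the constant column.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta  # the same prediction for every sample

# Hat matrix H = X (X^T X)^{-1} X^T; with a constant column every entry is 1/N.
H = X @ np.linalg.solve(X.T @ X, X.T)

print(y_hat, y.mean(), np.trace(H))
```

Every fitted value equals the mean of y, and the trace of H is N · (1/N) = 1, which matches the idea that estimating the mean uses up a single parameter.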

Note that estimating the mean is equivalent to running a linear regression with only one covariate, a constant: x = [1]. Hopefully this makes it clear why we can re-cast the problem as a
