Want to Master Your Data? Here’s Why You Should Care About Metadata

Whether you are a data scientist, a data engineer, or simply someone enthusiastic about data, understanding your data is something you don’t want to overlook. We usually think of data as numbers, text, or images, but data is more than that.

We should consider data an independent entity. Data can introduce itself, tell stories, and reveal trends. To reach those outcomes, you must first understand your data: not only how it was formed and where it came from, but also how it will change over time and how usable it is. Some of this information is what we call metadata.

Why is metadata so important? And why must we master metadata before we master data? Today I’ll show you how we can leverage metadata in our data business.

What is metadata, exactly?

According to Wikipedia, metadata is “data that provides information about other data”. It’s “data about data”. That sounds straightforward, doesn’t it? All data contains information about a specific thing. For metadata, that specific thing is other data.

However, the definition of metadata itself varies. It can be the name of a dataset, its creation information, or the statistical distribution of its data points. It can be anything related to the properties of the data. Ideally, then, every dataset would carry its own metadata, but in practice that is not always the case.
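
To make those examples concrete, here is a minimal sketch of a metadata record that holds a dataset’s name, its creation information, and a summary of its statistical distribution. The `DatasetMetadata` class, the `describe` helper, and the sample values are all hypothetical, chosen only for illustration.

```python
import statistics
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    """Hypothetical container for the kinds of metadata mentioned above."""
    name: str             # the name of the dataset
    created_at: datetime  # creation information
    row_count: int        # how many data points the dataset holds
    mean: float           # centre of the statistical distribution
    stdev: float          # spread of the statistical distribution


def describe(name: str, values: list[float]) -> DatasetMetadata:
    """Build a metadata record for a list of numeric data points."""
    return DatasetMetadata(
        name=name,
        created_at=datetime.now(timezone.utc),
        row_count=len(values),
        mean=statistics.mean(values),
        stdev=statistics.pstdev(values),
    )


# Made-up data points, used only to show the record being filled in.
meta = describe("daily_temperatures", [18.0, 21.0, 19.0, 22.0])
print(meta)
```

None of these fields is the data itself. Each one tells us something about the data, which is what the definition asks for.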

Data without metadata is always incomplete.

Types of metadata. Credit to the author.

We use data in the hope of extracting useful insights, and that requires comprehending the data. Metadata helps us assert data integrity, verify the source of truth, and maintain stable data quality.

An example of an email’s metadata. Credit to the author.
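
Since the original figure is not reproduced here, the following sketch shows the same idea with Python’s standard `email` package. The message is made up for illustration. The headers play the role of metadata, and the body is the data.

```python
from email import message_from_string

# A made-up message, used only for illustration.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "Date: Mon, 06 Jul 2020 09:30:00 +0000\n"
    "\n"
    "Hi Bob, the report is attached.\n"
)

msg = message_from_string(raw)

# The headers are metadata: they describe the message without being its content.
email_metadata = dict(msg.items())

# The body is the data itself.
email_body = msg.get_payload()

print(email_metadata)
print(email_body)
```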

However, some data users overlook the effect of metadata. They view it as mere labels that bring limited value to the table. Next, we’ll see how metadata relates to another critical aspect of data: data quality.

Data quality

Again, Wikipedia says: “Data quality refers to the state of qualitative or quantitative pieces of information.” In general, data is said to have high quality when “it fits the intended use case regardless of data users”.

Data is a valuable source of information, but nobody wants to use a piece of crap. The more you want to extract from data, the more data quality matters. In the world of Big Data, it also becomes a bottleneck.

Photo by Markus Winkler on Unsplash

As data grows, so does metadata, and we are not used to handling large amounts of it. Metadata needs special treatment because it is at once data and not data: it is not an independent piece of information but an attachment to our data. We can extend that attachment into an assessment of data quality.

In a shared effort to cultivate high data quality in Big Data pipelines, tech companies are paying a lot of attention to this fairly new subject. From anomaly detection to automatic alerting systems, the aim is to limit the impact of erroneous data as much as possible. We can’t do this without data comprehension or, more precisely, without metadata.

Data quality shows in many aspects, most often in the correctness of values. Imagine you plot a histogram of university students’ grades for a semester. The histogram is a statistical representation of those values, and it describes your data: it becomes metadata. From it you can read the distribution of the grades and then conclude whether the data fits your use case.
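
The grades example can be sketched in a few lines. The grades, the 0-100 scale, and the bin width of 10 are assumptions made for illustration, and the `histogram` helper is hypothetical rather than part of any library.

```python
from collections import Counter

BIN_WIDTH = 10

# Made-up semester grades on a 0-100 scale, for illustration only.
grades = [55, 62, 67, 71, 73, 74, 78, 81, 85, 88, 92, 97]


def histogram(values, bin_width=BIN_WIDTH):
    """Count how many values fall into each bin, keyed by the bin's lower bound."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


grade_histogram = histogram(grades)

# Print a small text chart: one row per bin, one '#' per grade in that bin.
for lower, count in grade_histogram.items():
    print(f"{lower:3d}-{lower + BIN_WIDTH - 1:3d} | {'#' * count}")
```

The resulting dictionary is small compared with the raw grades, yet it already shows where most students sit and whether anything looks off.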

Using Histograms to Understand Your Data

There are many questions to ask about data values beforehand. Are those values stable over time? Are there any outliers? If so, what should we do with them? By answering these questions, we extract insights, not information-wise but data-wise. We create metadata, useful metadata. That is only a first step in asserting data quality via metadata. The next section takes a closer look at how we can leverage the metadata we generate.
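
One common way to answer the outlier question is Tukey’s interquartile-range rule. The article does not prescribe a method, so treat the sketch below as one possible choice. The `find_outliers` helper, the factor `k=1.5`, and the readings are assumptions for illustration.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = find_outliers(readings)
print(outliers)
```

The list of flagged values is itself metadata. What to do with them (drop, cap, or investigate) remains a decision for the data user.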

How to leverage metadata

Some people might be overwhelmed by the various statistical representations we can extract from a dataset. Others might ignore that additional information, thinking it is useless. It’s true that we don’t need to draw a histogram every time we work with data, but it helps. To leverage insightful metadata, data users must first answer three important questions:

  • What: What do you want to verify about the quality of your data? Some data requires strict stability, while other data needs attention to whether it is correct. For each kind of data, we adapt the information we extract as metadata: statistical distribution, trends over time, discrepancies, and so on. This is what we call the metadata strategy. Storage and human resources are limited when working with both data and metadata, so we must think carefully about where to focus.

  • How: How do we measure data quality? These actions follow the metadata strategy. We could choose to measure the whole database, some tables, or a specific set of columns, and track figures such as the total number of values, the maximum and minimum length of a string, or the proportion of missing data. What we decide to measure depends on how we use the data to produce outcomes.

  • When: Data changes over time, and when we extract insights via metadata, we are tracking those transitions. When do we track the metadata? Every day? Every hour? Every quarter? It depends on how much granularity is sufficient to address data quality. We adapt our measurements to how quickly the data can change. For example, stock market data needs to be tracked every minute or even every second. Weather data changes every hour, while aerospace data can take months or years to shift.
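
As a sketch of the “How” question, the helper below profiles a single column using the measures listed there: the total number of values, the minimum and maximum string length, and the proportion of missing data. The `profile_column` function, the column contents, and the choice to treat `None` and empty strings as missing are assumptions for illustration.

```python
def profile_column(values):
    """Summarise one column: value count, string lengths, share of missing data."""
    # Assumption: None and the empty string both count as missing.
    present = [v for v in values if v is not None and v != ""]
    lengths = [len(str(v)) for v in present]
    total = len(values)
    return {
        "total": total,
        "missing_ratio": (total - len(present)) / total if total else 0.0,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
    }


# Made-up column of city names with two missing entries.
cities = ["Paris", "Hanoi", None, "Rio", "", "Amsterdam"]
profile = profile_column(cities)
print(profile)
```

Running such a profile on a schedule that matches how fast the data changes (the “When” question) gives a series of snapshots that can be compared over time.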

Stock market data needs to be tracked every single minute. Photo by Markus Spiske on Unsplash

Metadata has a long history, but we have only recently discovered its contribution to data management, and especially to data quality. Metadata itself can’t change the outcomes of data, but it adds a security and management layer between our raw data and its usage. You might even be using metadata to discover your data without realizing it.

Data quality might be insignificant when your data is small, but it becomes critical when you work with larger amounts. Metadata helps us keep track of that growth and make sure the data evolves as it should. By failing to leverage metadata, we fail to understand our data.
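
Keeping track of that growth can be as simple as comparing two metadata snapshots. The sketch below is one way to do it. The `check_growth` helper, the metric names, and the 50% tolerance are hypothetical choices, not a recommended standard.

```python
def check_growth(previous, current, max_change=0.5):
    """Flag metrics whose relative change between two snapshots exceeds max_change."""
    alerts = []
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None:
            alerts.append(f"{metric}: missing from current snapshot")
        elif old != 0 and abs(new - old) / abs(old) > max_change:
            alerts.append(f"{metric}: changed from {old} to {new}")
    return alerts


# Made-up snapshots taken one day apart.
yesterday = {"row_count": 1000, "missing_ratio": 0.02}
today = {"row_count": 1050, "missing_ratio": 0.10}

alerts = check_growth(yesterday, today)
print(alerts)
```

Here the row count grows modestly and passes, while the jump in missing values is flagged for review.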

What should I do with metadata?

If you wish to master your data, start treating metadata systematically. Based on the framework above, choose a data strategy that suits you. There’s nothing fancy about it yet: it starts with how you wish to use your data and how you control the quality of that usage. Everything starts with a goal.

One phase that often accompanies the ETL process is Exploratory Data Analysis (EDA). I find it an interesting way to learn more about the statistical aspects of your data, and it seems close to what we would like to know via metadata.

I always see my data scientist and data analyst friends start with EDA before doing anything else with their raw data. So I figured it must be an important step, and I wondered how it links to my metadata framework. The two turn out to have quite a lot in common.

First comes the purpose. The “exploratory” part of EDA coincides with the discovery objective of metadata. Second, both look at the statistical side of data to evaluate its future usage. All things considered, EDA is a must-have step because of its similarity to a metadata-based assessment of data quality.

You have the data strategy and the data evaluation; now it’s time to decide how to proceed with all that information. How the data will be used determines whether it counts as correct and trustworthy in the eyes of data quality control.

Key takeaways:
- Build your data strategy based on data usability
- Apply Exploratory Data Analysis (EDA) to evaluate the data
- Decide whether you have solid confidence in your data

Conclusion

I’ve shared some of my points of view on metadata. For me, it has as much value as the data itself. Those who take advantage of that value are the ones who understand their data. It’s easier to misuse something we don’t comprehend. Metadata gives us a clearer view of the data and, beyond that, of its quality, integrity, and usability.

My name’s Nam Nguyen, and I write (mostly) about Big Data. Enjoy your reading? Follow me on Medium and Twitter for more updates.

Translated from: https://towardsdatascience/want-to-master-your-data-heres-why-you-should-care-about-metadata-8fcd7754c3b8
