Presto clusters

Introduction

Data is the lifeblood of Grab and the insights we gain from it drive all the most critical business decisions made by Grabbers and our leaders every day.

Grab’s Data Engineering (DE) team is responsible for maintaining the data platform, which consists of data pipelines, job schedulers, and the query/computation engines that are the key components for generating insights from data. SQL is the core language for analytics at Grab. As of early 2020, our Presto platform serves about 200 user groups, adding up to 500 users who run 350,000 queries every day. These queries span 10,000 tables and process up to 1 PB of data daily.

In 2016, we started the DataGateway project to enable us to manage data access for the hundreds of Grabbers who needed access to Presto for their work. Since then, DataGateway has grown to become much more than just an access control mechanism for Presto. In this blog, we want to share what we’ve achieved since the initial launch of the project.

The problems we wanted to solve

As we were reviewing the key challenges around data access in Grab and assessing possible solutions, we came up with this prioritized list of user requirements we wanted to work on:

  • Use a single endpoint to serve everyone.
  • Manage user access to clusters, schemas, tables, and fields.
  • Provide a seamless user experience when Presto clusters are scaled up/down, in/out, or provisioned/decommissioned.
  • Capture an audit trail of user activities.

Grabbers have a critical need for interactive querying as well as for running extract, transform, load (ETL) jobs, so we evaluated several technologies to meet it. Presto was among them, and it was what we eventually chose, although it didn’t meet all of our requirements out of the box. To address these gaps, we came up with the idea of a security gateway for the Presto compute engine that could also act as a load balancer/proxy. This is how we ended up creating DataGateway.

DataGateway is a service that sits between clients and Presto clusters. It is essentially a smart HTTP proxy server that acts as an abstraction layer on top of the Presto clusters and handles the following actions:

  1. Parse incoming SQL statements to get requested schemas, tables, and fields.
  2. Manage user Access Control Lists (ACLs) to limit users’ data access by checking against the SQL parsing results.
  3. Manage users’ cluster access.
  4. Redirect users’ traffic to the authorized clusters.
  5. Show meaningful error messages to users whenever a query is rejected or exceptions from clusters are encountered.

Anatomy of DataGateway

The DataGateway’s key components are as follows:

  • API Service
  • SQL Parser
  • Auth Framework
  • Administration UI

We leveraged Kubernetes to run all these components as microservices.

API Service

This is the component that manages all user-facing and cluster-facing processes. We integrated this service with the Presto API, which means that, to a client, it looks the same as a Presto cluster. It accepts query requests from clients, gets the parsing result from the SQL Parser, and runs authorization through the Auth Framework.

If everything is good to go, the API Service forwards queries to the assigned clusters and continues the entire query process.
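
To make the "looks the same as a Presto cluster" point concrete: a Presto client submits a query by POSTing the SQL text to /v1/statement with an X-Presto-User header. The sketch below only builds such a request (it does not send it) and assumes two hypothetical hostnames that are not from the original post. It is an illustration of the client's view, not DataGateway's actual code.

```python
from typing import NamedTuple


class StatementRequest(NamedTuple):
    method: str
    url: str
    headers: dict
    body: bytes


def build_statement_request(base_url: str, user: str, sql: str) -> StatementRequest:
    """Build the HTTP request a Presto client sends to submit a query."""
    return StatementRequest(
        method="POST",
        url=base_url.rstrip("/") + "/v1/statement",
        headers={"X-Presto-User": user},
        body=sql.encode("utf-8"),
    )


# Hypothetical endpoints, for illustration only.
via_gateway = build_statement_request(
    "http://datagateway.example.internal:8080", "user_a", "select 1"
)
direct = build_statement_request(
    "http://presto-coordinator.example.internal:8080", "user_a", "select 1"
)
```

Only the host differs between the two requests, which is why a client that speaks the Presto protocol can be pointed at a gateway without other changes.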

Auth Framework

This handles both authentication and authorization requests. It stores the ACL of users and communicates with the API Service and the SQL Parser to run the entire authentication process. But why is it a microservice instead of a module in API Service, you ask? It’s because we keep evolving the security checks at Grab to ensure that everything is compliant with our security requirements, especially when dealing with data.

We wanted to make it flexible to fulfill ad-hoc requests from the security team without affecting the API Service. Furthermore, there are different authentication methods out there that we might need to deal with (OAuth2, SSO, you name it). The API Service supports multiple authentication frameworks that enable different authentication methods for different users.

SQL Parser

This is a SQL parsing engine that extracts schemas, tables, and fields from SQL statements. Since Presto SQL parsing works differently in each version, we compile multiple SQL Parsers, each matching the version of a Presto cluster we run. The SQL Parser becomes the single source of truth.
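
As a rough illustration of what "getting schemas and tables from a statement" means, here is a toy extractor. It is a regex stand-in, not Presto's parser: it only recognises schema.table names that directly follow FROM or JOIN, and it would misread three-part catalog.schema.table names, subqueries in strings, and so on. The function name is hypothetical.

```python
import re

# Matches `FROM schema.table` and `JOIN schema.table` (case-insensitive).
TABLE_REF = re.compile(
    r"\b(?:from|join)\s+([A-Za-z_]\w*)\.([A-Za-z_]\w*)", re.IGNORECASE
)


def referenced_tables(sql: str) -> list:
    """Return the sorted, de-duplicated (schema, table) pairs named in the SQL."""
    pairs = {
        (schema.lower(), table.lower())
        for schema, table in TABLE_REF.findall(sql)
    }
    return sorted(pairs)


print(referenced_tables("select * from locations.cities"))
# [('locations', 'cities')]
```

A real parser built from the same grammar as the target Presto version avoids these blind spots, which is the reason the post gives for compiling one parser per version.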

Admin UI

This is a UI for Presto administrators to manage clusters and user access, as well as to select an authentication framework, making it easier for the administrators to deal with the entire ecosystem.

How we deployed DataGateway using Kubernetes

In the past couple of years, we’ve had significant growth in workloads from analysts and data scientists. As we were very enthusiastic about Kubernetes, DataGateway was chosen as one of the earliest services for deployment in Kubernetes. DataGateway in Kubernetes is known to be highly available and fully scalable to handle traffic from users and systems.

We also tested the Horizontal Pod Autoscaler (HPA) feature of Kubernetes, which dynamically scales the number of pods in or out based on actual traffic and resource consumption.

Figure 2. DataGateway deployment using Kubernetes
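
For readers unfamiliar with how the Kubernetes autoscaler picks a pod count, its documented rule is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), kept within the configured minimum and maximum. The sketch below applies that rule; the numbers and default bounds are made up for illustration and are not Grab's settings, and the real controller also applies a tolerance band and stabilization window that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count from the HPA scaling rule, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))  # 6
# 6 pods averaging 20% CPU against a 60% target: scale in.
print(desired_replicas(6, 20, 60, min_replicas=2))  # 2
```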

Functionality of DataGateway

This section highlights some of the ways we use DataGateway to manage our Presto ecosystem efficiently.

Restrict users based on Schema/Table level access

In a setup where a Presto cluster is deployed on Amazon Elastic MapReduce (EMR) or Elastic Kubernetes Service (EKS), we configure an IAM role and attach it to the EMR or EKS nodes. The IAM role is set to limit access to S3 storage. However, IAM only provides bucket-level and file-level control; it doesn’t meet our requirement for schema, table, and column-level ACLs. This is where DataGateway proves useful.

One of the DataGateway services is an SQL Parser. As previously covered, this is a service that parses and digs out schemas and tables involved in a query. The API service receives the parsing result and checks against the ACL of users, and decides whether to allow or reject the query. This is a remarkable improvement in our security control since we now have another layer to restrict access, on top of the S3 storage. We’ve implemented an SQL-based access control down to table level.

As shown in Figure 3, user A is trying to run the SQL statement select * from locations.cities. The SQL Parser reads the statement and tells the API service that user A is trying to read data from the table cities in the schema locations. Then, the API service checks against the ACL of user A. The service finds that user A only has read access to the table countries in the schema locations. As a result, the API service denies this attempt because user A doesn’t have read access to the table cities in the schema locations.

Figure 3. An example of how to check user access to run SQL statements

The above flow shows an access denied result because the user doesn’t have the appropriate permissions.
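
The check in this example can be sketched in a few lines. This is a minimal illustration that assumes the ACL is a simple in-memory mapping from user to readable (schema, table) pairs; the data structure, function name, and error wording are all hypothetical and say nothing about how DataGateway stores or evaluates ACLs internally.

```python
# Hypothetical ACL: user -> set of (schema, table) pairs the user may read.
ACL = {"user_a": {("locations", "countries")}}


def authorize(user, tables):
    """Return (allowed, message) for a query touching the given tables."""
    granted = ACL.get(user, set())
    denied = [t for t in tables if t not in granted]
    if denied:
        names = ", ".join(f"{schema}.{table}" for schema, table in denied)
        return False, f"Access denied: {user} has no read access to {names}"
    return True, "OK"


# The parser reported that the query reads locations.cities.
print(authorize("user_a", [("locations", "cities")]))
# (False, 'Access denied: user_a has no read access to locations.cities')
```

Returning the offending table names in the message is one way to give users the kind of meaningful rejection error described earlier.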

Seamless User Experience during the EMR migration

We use AWS EMR to deploy Presto as an SQL query engine since deployment is really easy. However, without DataGateway, any EMR operations such as terminations, new cluster deployments, config changes, and version upgrades would require quite a bit of user involvement. We would sometimes need users to make changes on their side, for example, asking them to change their endpoints to connect to suitable clusters.

With DataGateway, an ACL exists for each user account. The ACL includes the list of EMR clusters that the user is allowed to access. As a Presto access management platform, DataGateway redirects user traffic to an appropriate cluster based on the ACL, like a proxy. Users always connect to the same endpoint we offer, which is DataGateway. To switch over from one cluster to another, we just need to edit the cluster ACL and everything is handled seamlessly.

Figure 4. Cluster switching using DataGateway

Figure 4 highlights the case when we’re switching EMR from one cluster to another. No changes are required from users.
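
A small sketch of why the switch is invisible to users: the gateway resolves the backend from the user's cluster ACL on every request, so editing the ACL is the only change needed. The cluster names, URLs, and the "first cluster in the list" selection rule below are assumptions made for illustration, not details from the post.

```python
# Hypothetical cluster registry and per-user cluster ACL.
CLUSTERS = {
    "emr-old": "http://emr-old.example.internal:8889",
    "emr-new": "http://emr-new.example.internal:8889",
}
CLUSTER_ACL = {"user_a": ["emr-old"]}


def route(user):
    """Return the backend URL for the first cluster in the user's ACL."""
    allowed = CLUSTER_ACL.get(user, [])
    if not allowed:
        raise PermissionError(f"{user} is not assigned to any cluster")
    return CLUSTERS[allowed[0]]


before = route("user_a")
CLUSTER_ACL["user_a"] = ["emr-new"]  # the only change made during a switch-over
after = route("user_a")
```

The user's endpoint and credentials stay the same across the two calls; only the backend chosen by the gateway differs.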

We migrated our entire Presto platform from one AWS EMR instance to another using the same methodology. The migrations were executed with little to no disruption for our users. We moved 40 clusters, serving hundreds of users who issue millions of queries daily, in a few phases over a couple of months.

In most cases, users didn’t have to make any changes on their end; they just continued using Presto as usual while we made the changes in the background.

Multi-Cloud Data Lake/Presto Cluster maintenance

Recently, we started to build and maintain data lakes not just in one cloud, but two — in AWS and Azure. Since most end-users are AWS-based, and each team has their own AWS sub-account to run their services and workloads, it would be a nightmare to bridge all the connections and access routes between these two clouds from end-to-end, sub-account by sub-account.

Here, the DataGateway plays the role of the multi-cloud gateway. Since all end-users’ AWS sub-accounts have peered to DataGateway’s network, everything becomes much easier to handle.

End-users retain the same Presto connection profile. The DE team then handles the connection setup from DataGateway to Azure, as well as the deployment of Presto clusters in Azure.

When all is set, end-users use the same DataGateway endpoint. We offer a feature called Cluster Switch that allows users to switch between an AWS Presto cluster and an Azure Presto cluster on the fly by filling in parameters on the connection string. This feature allows users to switch to their target Presto cluster without any endpoint changes. The switch takes effect as soon as they make the change. That means users can run different queries in different clusters based on their requirements.
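
The post does not say which connection-string parameter Cluster Switch reads, so the sketch below assumes a hypothetical query parameter named cluster, a hypothetical default of aws, and a made-up endpoint. It only shows the general idea of selecting a target cluster from the connection string while the endpoint stays fixed.

```python
from urllib.parse import parse_qs, urlsplit

DEFAULT_CLUSTER = "aws"            # assumed default, not from the post
KNOWN_CLUSTERS = {"aws", "azure"}  # assumed cluster labels


def target_cluster(connection_string):
    """Pick the cluster named by the (hypothetical) `cluster` parameter."""
    params = parse_qs(urlsplit(connection_string).query)
    cluster = params.get("cluster", [DEFAULT_CLUSTER])[0]
    if cluster not in KNOWN_CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster}")
    return cluster


endpoint = "presto://datagateway.example.internal:8080/hive"
print(target_cluster(endpoint))                     # aws
print(target_cluster(endpoint + "?cluster=azure"))  # azure
```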

This feature has helped the DE team maintain Presto clusters easily. We can spin up different Presto clusters for different teams, so that each team has its own query engine to run its queries with dedicated resources.

Figure 5. Sub-account connections and queries

Figure 5 shows an example of how sub-accounts connect to DataGateway and run queries on resources in different clouds and clusters.

Figure 6. Sample scenario without DataGateway

Figure 6 shows what would happen if DataGateway didn’t exist. Each of the accounts would have to maintain its own connections, Virtual Private Cloud (VPC) peering, and express link to connect to our Presto resources.

Summary

DataGateway plays a key role in Grab’s entire Presto ecosystem. It helps us manage user access and cluster selection on a single endpoint, ensuring that everyone runs their Presto queries in the same place. It also helps distribute workload to different types and versions of Presto clusters.

When we started to deploy DataGateway on Kubernetes, our vision for the Presto ecosystem changed dramatically, as it further motivated us to continuously improve. Since then, we’ve had new ideas on deployment methods/pipelines, microservice implementations, scaling strategy, and resource control. We even made use of Kubernetes to design an on-demand, container-based Presto cluster provisioning engine. We’ll share this in another engineering blog, so do stay tuned!

We also made crucial enhancements on data access control as we extended Presto’s access controls down to the schema/table-level.

In day-to-day operations, especially when we started to implement data lake in multiple clouds, DataGateway solved a lot of implementation issues. DataGateway made it simpler to switch a user’s Presto cluster from one cloud to another or allow a user to use a different Presto cluster using parameters. DataGateway allowed us to provide a seamless experience to our users.

Looking forward, we have more ideas for our Presto ecosystem, such as Spark DataGateway or AWS Athena integrations, to keep our data safe at all times and to provide our users with a smoother experience when dealing with data used for analysis or research.

Authored by Vinnson Lee on behalf of the Presto Development Team at Grab — Edwin Law, Qui Hieu Nguyen, Rahul Penti, Wenli Wan, Wang Hui and the Data Engineering Team.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving Southeast Asia forward, apply to join our team today.

Originally published at https://engineering.grab.

Translated from: https://medium/@grab/securing-and-managing-multi-cloud-presto-clusters-with-grabs-datagateway-dbccf9a80e33
