Hive in Practice: Video Website Data Analysis


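Requirements:

Compute the standard metrics of a video website, mainly a set of Top-N rankings:

Top 10 videos by view count
Top 10 video categories by heat (number of videos in the category)
The categories of the Top 20 most-viewed videos, and how many Top 20 videos each category contains
Category ranking of the videos related to the Top 50 most-viewed videos
Top 10 hottest videos in each category
Top 10 videos by traffic in each category
Top 10 users by number of uploads, and the Top 20 most-viewed videos among their uploads
Top 10 videos by view count in each category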

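Project

Table layout

Video table

Field	Note	Description
video id	unique video id	11-character string
uploader	video uploader	user name of the uploader, String
age	video age	whole number of days the video has been on the platform
category	video category	category assigned to the video at upload
length	video length	video length as an integer
views	view count	number of times the video was viewed
rate	video rating	maximum of 5 points
ratings	traffic	traffic of the video, integer
comments	comment count	integer number of comments on the video
related ids	related video ids	ids of related videos, at most 20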

User table

Field	Note	Type
uploader	uploader user name	string
videos	number of uploaded videos	int
friends	number of friends	int

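ETL of the source data

Inspecting the raw data shows that a video can belong to several categories, separated by an "&" with a space on each side, and that a video can also have several related videos, separated by "\t". To make these multi-valued fields easy to work with during analysis, the data is first cleaned and restructured: the categories stay separated by "&" but the surrounding spaces are removed, and the related video ids are joined with "&" as well.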

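ETLUtil.java utility class

package com.src.util;

public class ETLUtil {
	/**
	 * Cleans one raw tab-separated record.
	 * Returns null when the record has fewer than 9 fields.
	 */
	public static String oriString2ETLString(String ori){
		StringBuilder etlString = new StringBuilder();
		String[] splits = ori.split("\t");
		// The 9 fixed fields are required; related ids are optional.
		if(splits.length < 9) return null;
		// Remove the spaces around "&" in the category field.
		splits[3] = splits[3].replace(" ", "");
		for(int i = 0; i < splits.length; i++){
			if(i < 9){
				// Fixed fields stay tab-separated.
				if(i == splits.length - 1){
					etlString.append(splits[i]);
				}else{
					etlString.append(splits[i] + "\t");
				}
			}else{
				// Related video ids are joined with "&".
				if(i == splits.length - 1){
					etlString.append(splits[i]);
				}else{
					etlString.append(splits[i] + "&");
				}
			}
		}

		return etlString.toString();
	}
}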
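
To make the effect of the cleaning rule concrete, here is a minimal, self-contained sketch that applies the same rule to a single record. The class name EtlLineSketch and the sample record are made up for illustration and are not part of the original project; the method is a compact restatement of the rule, not the project's ETLUtil itself.

```java
import java.util.Arrays;

// Sketch only: restates the cleaning rule on one hypothetical record.
class EtlLineSketch {

    // Fields 0-8 stay tab-separated; fields 9 and up (related ids) are joined with "&".
    static String clean(String line) {
        String[] f = line.split("\t");
        if (f.length < 9) {
            return null; // incomplete record, dropped
        }
        f[3] = f[3].replace(" ", ""); // "People & Blogs" -> "People&Blogs"
        String head = String.join("\t", Arrays.copyOfRange(f, 0, 9));
        if (f.length == 9) {
            return head; // record without related videos
        }
        String related = String.join("&", Arrays.copyOfRange(f, 9, f.length));
        return head + "\t" + related;
    }

    public static void main(String[] args) {
        // Made-up sample record, not taken from the real data set.
        String raw = "vid00000001\tuserA\t700\tPeople & Blogs\t120\t1000\t4.5\t30\t12\trelA\trelB";
        // The category loses its spaces and the two related ids become "relA&relB".
        System.out.println(clean(raw));
    }
}
```

After cleaning, every record has exactly 10 tab-separated columns (or 9 when there are no related videos), which is what lets Hive read category and relatedId as arrays with "&" as the collection delimiter.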

VideoETLMapper.java

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.src.util.ETLUtil;

public class VideoETLMapper extends Mapper<Object, Text, NullWritable, Text>{
	Text text = new Text();
	
	@Override
	protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
		String etlString = ETLUtil.oriString2ETLString(value.toString());
		
		if(StringUtils.isBlank(etlString)) return;
		
		text.set(etlString);
		context.write(NullWritable.get(), text);
	}
}

VideoETLRunner.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class VideoETLRunner implements Tool {
	private Configuration conf = null;

	@Override
	public void setConf(Configuration conf) {
		this.conf = conf;
	}

	@Override
	public Configuration getConf() {
		return this.conf;
	}

	@Override
	public int run(String[] args) throws Exception {
		conf = this.getConf();
		conf.set("inpath", args[0]);
		conf.set("outpath", args[1]);

		Job job = Job.getInstance(conf);
		
		job.setJarByClass(VideoETLRunner.class);
		
		job.setMapperClass(VideoETLMapper.class);
		job.setMapOutputKeyClass(NullWritable.class);
		job.setMapOutputValueClass(Text.class);
		job.setNumReduceTasks(0);
		
		this.initJobInputPath(job);
		this.initJobOutputPath(job);
		
		return job.waitForCompletion(true) ? 0 : 1;
	}

	private void initJobOutputPath(Job job) throws IOException {
		Configuration conf = job.getConfiguration();
		String outPathString = conf.get("outpath");
		
		FileSystem fs = FileSystem.get(conf);
		
		Path outPath = new Path(outPathString);
		if(fs.exists(outPath)){
			fs.delete(outPath, true);
		}
		
		FileOutputFormat.setOutputPath(job, outPath);
		
	}

	private void initJobInputPath(Job job) throws IOException {
		Configuration conf = job.getConfiguration();
		String inPathString = conf.get("inpath");
		
		FileSystem fs = FileSystem.get(conf);
		
		Path inPath = new Path(inPathString);
		if(fs.exists(inPath)){
			FileInputFormat.addInputPath(job, inPath);
		}else{
			throw new RuntimeException("Input path does not exist in HDFS: " + inPathString);
		}
	}

	public static void main(String[] args) {
		try {
			int resultCode = ToolRunner.run(new VideoETLRunner(), args);
			if(resultCode == 0){
				System.out.println("Success!");
			}else{
				System.out.println("Fail!");
			}
			System.exit(resultCode);
		} catch (Exception e) {
			e.printStackTrace();
			System.exit(1);
		}
	}
}

Package the job as a jar and run the ETL

bin/yarn jar ~/softwares/jars/ETL-0.0.1-SNAPSHOT.jar com.src.etl.VideoETLRunner \
/keven/video/2008/0222 /keven/output/video/2008/0222

Creating the tables

Create the raw (text) tables:

video_ori

create table video_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited 
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

video_user_ori

create table video_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as textfile;

Create the ORC tables that the raw data will be inserted into

video_orc

create table video_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;

video_user_orc

create table video_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as orc;

Loading the ETL output

video_ori:

load data inpath "/video/output/video/2008/0222" into table video_ori;

video_user_ori:

load data inpath "/video/user/2008/0903" into table video_user_ori;

Inserting data into the ORC tables

video_orc:

insert into table video_orc select * from video_ori;

video_user_orc:

insert into table video_user_orc select * from video_user_ori;

Business analysis

Top 10 videos by view count

Approach: a global sort on the views column with order by is enough, limited to the first 10 rows.

select 
    videoId, 
    uploader, 
    age, 
    category, 
    length, 
    views, 
    rate, 
    ratings, 
    comments 
from 
    video_orc 
order by 
    views desc 
limit 
    10;

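Top 10 video categories by heat

Approach:

Heat here means the number of videos in a category; the goal is the 10 categories that contain the most videos.
Group by category and count the videoId values in each group.
In the current table one video maps to one or more categories, so before grouping by category the category array must be exploded into one row per category; only then can the rows be counted.
Finally sort by heat and show the first 10 rows.

select 
    category_name as category, 
    count(t1.videoId) as hot 
from (
    select 
        videoId,
        category_name 
    from 
        video_orc lateral view explode(category) t_category as category_name) t1 
group by 
    t1.category_name 
order by 
    hot desc 
limit 
    10;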
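
The explode-then-count logic can be modeled in plain Java on a handful of rows, which may help when reasoning about what lateral view explode produces. This is only an in-memory sketch: the class name CategoryHeatSketch, its methods, and the sample data are hypothetical, and ties are broken alphabetically here, which Hive does not guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: in-memory model of "explode category, group by, count, order, limit".
class CategoryHeatSketch {

    // Each inner list is the category array of one video.
    static List<Map.Entry<String, Integer>> topCategories(List<List<String>> categoriesPerVideo, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> categories : categoriesPerVideo) {
            for (String category : categories) {         // explode: one row per (video, category)
                counts.merge(category, 1, Integer::sum); // group by category and count
            }
        }
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> b.getValue() - a.getValue()); // order by hot desc
        return rows.subList(0, Math.min(n, rows.size())); // limit n
    }

    // Made-up sample: five videos, one of them in two categories.
    static List<List<String>> sample() {
        return Arrays.asList(
                Arrays.asList("Music"),
                Arrays.asList("Music", "Comedy"),
                Arrays.asList("Comedy"),
                Arrays.asList("Music"),
                Arrays.asList("Sports"));
    }

    public static void main(String[] args) {
        System.out.println(topCategories(sample(), 2)); // prints [Music=3, Comedy=2]
    }
}
```

Note that the video listed under both Music and Comedy is counted once in each category, exactly as it would be after the explode in the query above.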

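Categories of the 20 most-viewed videos, and how many Top 20 videos each category contains

Approach:

First select all columns of the 20 most-viewed videos, sorted in descending order.
Explode the category array of those 20 rows into one row per category.
Finally query each category name together with the number of Top 20 videos it contains.

select 
    category_name as category, 
    count(t2.videoId) as hot_with_views 
from (
    select 
        videoId, 
        category_name 
    from (
        select 
            * 
        from 
            video_orc 
        order by 
            views desc 
        limit 
            20) t1 lateral view explode(category) t_category as category_name) t2 
group by 
    category_name 
order by 
    hot_with_views desc;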

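Category ranking of the videos related to the Top 50 most-viewed videos

Approach:

Query all columns of the 50 most-viewed videos (which include each video's related videos) and call the result temporary table t1.

t1: the 50 most-viewed videos

select 
    * 
from 
    video_orc 
order by 
    views desc 
limit 
    50;

Explode the relatedId column of those 50 rows into one row per related video and call the result temporary table t2.

t2: the related video ids, one per row

select 
    explode(relatedId) as videoId 
from 
    t1;

Inner join the related video ids with the video_orc table (t4), then explode the category array (t5).

t4: two columns, the related video id found above and its category array. t5: t4 with category exploded into category_name.

select 
    videoId, 
    category_name 
from (
    select 
        distinct(t2.videoId), 
        t3.category 
    from 
        t2 
    inner join 
        video_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_category as category_name;

Group by video category, count the videos in each group, and rank the groups. Substituting the subqueries for t1, t2, and t4 gives the full query:

select 
    category_name as category, 
    count(t5.videoId) as hot 
from (
    select 
        videoId, 
        category_name 
    from (
        select 
            distinct(t2.videoId), 
            t3.category 
        from (
            select 
                explode(relatedId) as videoId 
            from (
                select 
                    * 
                from 
                    video_orc 
                order by 
                    views desc 
                limit 
                    50) t1) t2 
        inner join 
            video_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_category as category_name) t5
group by 
    category_name 
order by 
    hot desc;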

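Top 10 hottest videos in each category, using Music as an example

Approach:

To find the Top 10 hottest videos in the Music category, the Music rows have to be found first, which requires exploding category. It is therefore convenient to create a table that stores the data with the category exploded into a categoryId column.
Insert the exploded data into that table.
Rank the videos of the chosen category (Music) by heat, measured here by views.

Create the category table:

create table video_category(
    videoId string, 
    uploader string, 
    age int, 
    categoryId string, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int, 
    relatedId array<string>)
row format delimited 
fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;

Insert data into the category table:

insert into table video_category  
    select 
        videoId,
        uploader,
        age,
        categoryId,
        length,
        views,
        rate,
        ratings,
        comments,      
        relatedId 
    from 
        video_orc lateral view explode(category) t_category as categoryId;

Top 10 for the Music category (any other category works the same way):

select 
    videoId, 
    views
from 
    video_category 
where 
    categoryId = "Music" 
order by 
    views desc 
limit 
    10;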

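Top 10 videos by traffic in each category, using Music as an example

Approach:

Start from the exploded category table (video_category, one row per video and categoryId).
Sort by ratings.

select 
    videoId,
    views,
    ratings 
from 
    video_category 
where 
    categoryId = "Music" 
order by 
    ratings desc 
limit 
    10;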

Top 10 users by number of uploads, and the Top 20 most-viewed videos among their uploads

Approach:

First find the 10 users who have uploaded the most videos.

select 
    * 
from 
    video_user_orc 
order by 
    videos desc 
limit 
    10;

Join that result with the video_orc table on the uploader column, then sort the joined rows by views.

select 
    t2.videoId, 
    t2.views,
    t2.ratings,
    t1.videos,
    t1.friends 
from (
    select * from video_user_orc order by videos desc limit 10) t1 
join 
    video_orc t2
on 
    t1.uploader = t2.uploader 
order by 
    views desc 
limit 
    20;
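
One detail of this query is easy to miss: the final limit applies to the joined result as a whole, so it returns the 20 most-viewed videos across all ten uploaders, not 20 per uploader. The following in-memory sketch illustrates that behavior with smaller limits. The class TopUploaderVideosSketch, its nested types, and the sample rows are hypothetical, and the sketch assumes each uploader appears only once in the user table, so that the join behaves like a filter.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch only: models "top uploaders joined with their videos, ordered by views, limited".
class TopUploaderVideosSketch {

    static final class User {
        final String uploader; final int videos;
        User(String uploader, int videos) { this.uploader = uploader; this.videos = videos; }
    }

    static final class Video {
        final String videoId; final String uploader; final int views;
        Video(String videoId, String uploader, int views) {
            this.videoId = videoId; this.uploader = uploader; this.views = views;
        }
    }

    static List<String> topVideoIds(List<User> users, List<Video> videos, int topUsers, int topVideos) {
        // Subquery t1: the users with the most uploads.
        Set<String> top = users.stream()
                .sorted(Comparator.comparingInt((User u) -> u.videos).reversed())
                .limit(topUsers)
                .map(u -> u.uploader)
                .collect(Collectors.toSet());
        // Join on uploader, then one global sort and one global limit.
        return videos.stream()
                .filter(v -> top.contains(v.uploader))
                .sorted(Comparator.comparingInt((Video v) -> v.views).reversed())
                .limit(topVideos)
                .map(v -> v.videoId)
                .collect(Collectors.toList());
    }

    // Made-up sample rows.
    static List<User> sampleUsers() {
        return Arrays.asList(new User("a", 50), new User("b", 40), new User("c", 5));
    }

    static List<Video> sampleVideos() {
        return Arrays.asList(
                new Video("v1", "a", 100), new Video("v2", "a", 900), new Video("v3", "b", 500),
                new Video("v4", "c", 9999), new Video("v5", "b", 50));
    }

    public static void main(String[] args) {
        // v4 has the most views but its uploader is not a top-2 uploader, so it is excluded.
        System.out.println(topVideoIds(sampleUsers(), sampleVideos(), 2, 3)); // prints [v2, v3, v1]
    }
}
```

If a per-uploader Top 20 were wanted instead, a window function partitioned by uploader, like the one in the next section, would be the natural tool.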

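Top 10 videos by view count in each category

Approach:

Start from the exploded category table (video_category).
In a subquery, partition the rows by categoryId, sort each partition by views in descending order, and number the rows of each partition starting at 1; name that column rank.
From the temporary table produced by the subquery, keep the rows whose rank is less than or equal to 10.

select 
    t1.* 
from (
    select 
        videoId,
        categoryId,
        views,
        row_number() over(partition by categoryId order by views desc) rank 
    from 
        video_category) t1 
where 
    rank <= 10;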
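
What row_number() over(partition by ... order by ...) computes can be modeled in a few lines of plain Java: split the rows into partitions, sort each partition, and keep the rows whose position is at most N. The sketch below is illustrative only; the class TopPerCategorySketch, the Row type, and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: in-memory model of row_number() per category followed by "rank <= n".
class TopPerCategorySketch {

    static final class Row {
        final String videoId; final String categoryId; final int views;
        Row(String videoId, String categoryId, int views) {
            this.videoId = videoId; this.categoryId = categoryId; this.views = views;
        }
    }

    static Map<String, List<String>> topPerCategory(List<Row> rows, int n) {
        // partition by categoryId
        Map<String, List<Row>> partitions = new TreeMap<>();
        for (Row r : rows) {
            partitions.computeIfAbsent(r.categoryId, k -> new ArrayList<>()).add(r);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Row>> e : partitions.entrySet()) {
            List<Row> part = e.getValue();
            part.sort((a, b) -> Integer.compare(b.views, a.views)); // order by views desc
            List<String> ids = new ArrayList<>();
            // rowNumber plays the role of row_number(): 1, 2, 3, ... within the partition
            for (int rowNumber = 1; rowNumber <= part.size() && rowNumber <= n; rowNumber++) {
                ids.add(part.get(rowNumber - 1).videoId);
            }
            result.put(e.getKey(), ids);
        }
        return result;
    }

    // Made-up sample rows.
    static List<Row> sample() {
        return Arrays.asList(
                new Row("v1", "Music", 10), new Row("v2", "Music", 30), new Row("v3", "Music", 20),
                new Row("v4", "Comedy", 5), new Row("v5", "Comedy", 7));
    }

    public static void main(String[] args) {
        System.out.println(topPerCategory(sample(), 2)); // prints {Comedy=[v5, v4], Music=[v2, v3]}
    }
}
```

Because row_number() assigns distinct consecutive numbers even to rows with equal views, each category yields at most N rows; rank() or dense_rank() would behave differently on ties.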

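Copyright notice: this article, "Hive实战之视频网站数据分析" (Hive in Practice: Video Website Data Analysis), was contributed by a community member and reflects only the author's own views. To reprint it, contact the author and cite the source: https://m.elefans.com/dongtai/1727664819a1124543.html. This site only provides information storage, does not own the content, and bears no related legal liability. Content found to involve plagiarism, infringement, or violations of law will be removed once verified.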
