A Summary of Simple Hive Optimizations

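1. Select only the columns you need; avoid select * where possible.

2. If the table is partitioned, filter on the partition column whenever you can.

--List a table's partitions (the output shows the partition columns and their values)
>show partitions table_name;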
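Why the partition filter matters can be shown with a toy model. This is a minimal sketch in plain Python, not Hive code: the `partitions` dict and the `scan` helper are hypothetical names, and each dict entry stands in for the separate directory Hive keeps per partition.

```python
# Toy model of a partitioned table: one bucket of rows per partition value,
# standing in for Hive's one-directory-per-partition layout.
partitions = {
    "2017-12-12": [{"id": 1}, {"id": 2}],
    "2017-12-13": [{"id": 3}],
    "2017-12-14": [{"id": 4}, {"id": 5}, {"id": 6}],
}


def scan(day=None):
    """Return (rows, number of rows read) for an optional partition filter."""
    if day is not None:
        # Partition filter given: only the matching bucket is opened.
        buckets = [partitions.get(day, [])]
    else:
        # No partition filter: every bucket has to be read.
        buckets = list(partitions.values())
    rows = [row for bucket in buckets for row in bucket]
    return rows, len(rows)


rows_all, read_all = scan()
rows_day, read_day = scan("2017-12-13")
print(read_all, read_day)
```

With the filter, the amount of data read depends on the size of one partition rather than the whole table.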

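3. When joining, put filter conditions in a subquery or in the on clause.

--Filter on table_b in where: it is applied after the join, so rows of table_a without a match on that day are dropped as well
>select a.id from table_a a left outer join table_b b on a.id = b.id where b.day='2017-12-13';

--Faster: filter in on. table_b is filtered as part of the join and every row of table_a is kept, so the result can differ from the query above
>select a.id from table_a a left outer join table_b b on (a.id = b.id and b.day='2017-12-13');

--Or filter table_b in a subquery first (same result as the on version)
>select a.id from table_a a left outer join (select id from table_b where day='2017-12-13') b on a.id = b.id;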
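Before swapping one form for another, it is worth checking that they return the same rows, because with a left outer join they may not. The sketch below runs the three queries on a few made-up rows, using Python's built-in sqlite3 as a stand-in engine rather than Hive; the sample data and the `ids` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER);
    CREATE TABLE table_b (id INTEGER, day TEXT);
    INSERT INTO table_a VALUES (1), (2), (3);
    INSERT INTO table_b VALUES (1, '2017-12-13'), (2, '2017-12-12');
""")


def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Filter in WHERE: applied after the join, so unmatched rows of table_a vanish.
where_ids = ids("""
    SELECT a.id FROM table_a a
    LEFT OUTER JOIN table_b b ON a.id = b.id
    WHERE b.day = '2017-12-13'
    ORDER BY a.id
""")

# Filter in ON: only restricts which rows of table_b can match.
on_ids = ids("""
    SELECT a.id FROM table_a a
    LEFT OUTER JOIN table_b b ON (a.id = b.id AND b.day = '2017-12-13')
    ORDER BY a.id
""")

# Filter in a subquery: table_b is reduced before the join.
sub_ids = ids("""
    SELECT a.id FROM table_a a
    LEFT OUTER JOIN (SELECT id FROM table_b WHERE day = '2017-12-13') b
    ON a.id = b.id
    ORDER BY a.id
""")

print(where_ids, on_ids, sub_ids)
```

The where version keeps only the ids that have a row for that day, while the on and subquery versions keep every id of table_a. Pick the form whose result you actually want, then optimize.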

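4. When the data volume is large, use group by instead of count(distinct).

>select day,count(distinct id) as uv from table_a group by day;

--Can be rewritten as follows: deduplicate (day, id) with group by first, then count. This runs faster on large data, since the deduplication can be spread across more reducers
>select day,count(id) as uv from (select day,id from table_a group by day,id) a group by day;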
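That the two queries return the same numbers, including when id contains duplicates and NULLs, can be checked on a small sample. This is a sketch using Python's built-in sqlite3 as a stand-in for Hive; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (day TEXT, id INTEGER);
    INSERT INTO table_a VALUES
        ('2017-12-12', 1), ('2017-12-12', 1), ('2017-12-12', 2),
        ('2017-12-13', 1), ('2017-12-13', NULL),
        ('2017-12-13', 3), ('2017-12-13', 3);
""")

# Original form: distinct count per day.
distinct_uv = conn.execute("""
    SELECT day, COUNT(DISTINCT id) AS uv
    FROM table_a GROUP BY day ORDER BY day
""").fetchall()

# Rewritten form: deduplicate (day, id) first, then count the ids per day.
# COUNT(id) skips the NULL group, just as COUNT(DISTINCT id) ignores NULLs.
grouped_uv = conn.execute("""
    SELECT day, COUNT(id) AS uv
    FROM (SELECT day, id FROM table_a GROUP BY day, id) a
    GROUP BY day ORDER BY day
""").fetchall()

print(distinct_uv)
print(grouped_uv)
```

Note that the outer query has to use count(id), not count(*), or a NULL id would be counted as one extra visitor.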

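5. MapJoin

MapJoin is typically used when a small table is joined with a large one. Automatic conversion to MapJoin was introduced in Hive 0.7 and has been enabled by default since 0.11.

--Check the value of hive.auto.convert.join to confirm whether automatic conversion is on.
>set hive.auto.convert.join;
>hive.auto.convert.join=true

--By default the small table must not exceed 25 MB (25000000 bytes), otherwise the join is not converted automatically. Change hive.mapjoin.smalltable.filesize to adjust this threshold.
>set hive.mapjoin.smalltable.filesize = 50000000;
>Success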
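The idea behind a map-side join, and the reason for the size limit on the small table, can be sketched in a few lines. This is an illustration of the general technique in plain Python, not Hive's implementation; `map_join` and the sample tables are hypothetical.

```python
def map_join(big_rows, small_rows, key):
    """Inner-join big_rows with small_rows on `key` without a shuffle step."""
    # Build phase: the whole small table goes into an in-memory hash table.
    # This is why the small table has to fit in memory.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the big table once and look up each row's key.
    for big in big_rows:
        for small in lookup.get(big[key], []):
            yield {**small, **big}


small_table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
big_table = [
    {"id": 1, "pv": 10},
    {"id": 3, "pv": 7},
    {"id": 1, "pv": 4},
    {"id": 2, "pv": 1},
]

joined = list(map_join(big_table, small_table, "id"))
for row in joined:
    print(row)
```

Because the large table is only streamed and never sorted or redistributed by key, the join can finish in the map phase.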

6. Parallel jobs

Jobs that do not depend on one another can run in parallel, as in the following case.

select * from (
select count(*) from logs where log_date = 20130801 and item_id = 1
union all
select count(*) from logs where log_date = 20130802 and item_id = 2
union all
select count(*) from logs where log_date = 20130803 and item_id = 3
) t;

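--Parallel execution is enabled with hive.exec.parallel. By default at most 8 jobs run in parallel; this can be changed with hive.exec.parallel.thread.number, but avoid setting it too high, since that ties up too many resources.
>set hive.exec.parallel=true;
>Success
>set hive.exec.parallel.thread.number;
>hive.exec.parallel.thread.number=8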
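As a rough analogy, the three branches of the union all can be seen as independent tasks handed to a bounded worker pool. The sketch below is plain Python, not Hive; the `logs` rows and the `count_rows` helper are made up, and `max_workers` only plays the role of hive.exec.parallel.thread.number.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up sample rows standing in for the logs table.
logs = [
    {"log_date": 20130801, "item_id": 1},
    {"log_date": 20130801, "item_id": 1},
    {"log_date": 20130802, "item_id": 2},
    {"log_date": 20130803, "item_id": 9},
]

# One (log_date, item_id) pair per branch of the union all.
branches = [(20130801, 1), (20130802, 2), (20130803, 3)]


def count_rows(branch):
    """Count the rows matching one branch; it needs no other branch's result."""
    log_date, item_id = branch
    return sum(
        1 for r in logs
        if r["log_date"] == log_date and r["item_id"] == item_id
    )


# The pool size caps how many branches run at the same time.
with ThreadPoolExecutor(max_workers=8) as pool:
    counts = list(pool.map(count_rows, branches))

print(counts)
```

None of the branches reads another's output, which is what makes it safe to run them concurrently; a larger pool only helps as long as there are that many independent tasks and spare resources to run them.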