Clickhouse系列之整合Hive数据仓库示例详解

前言

什么是Hive？ Apache Hive 数据仓库软件便于使用SQL读取、写入和管理驻留在分布式存储中的大型数据集。结构可以投射到已存储的数据上。提供了一个命令行工具和JDBC驱动程序，用于将用户连接到Hive。

Hive引擎允许您对HDFS配置单元表执行SELECT查询。目前支持如下输入格式：

文本：仅支持简单标量列类型，二进制除外； ORC：支持除char以外的简单标量列类型；仅支持数组等复杂类型； parquet：支持所有简单的标量列类型；仅支持数组等复杂类型。

正文

创建Hive引擎表详细信息以及参数详解

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ ON CLUSTER cluster]

(

name1 [type1] [ALIAS expr1],

name2 [type2] [ALIAS expr2],

...

) ENGINE = Hive( 'thrift://host:port' , 'database' , 'table' );

PARTITION BY expr

表结构可以与原始配置单元表结构不同：

列名应该与原始配置单元表中的列名相同( 推荐列名相同处理 )，但您可以只使用其中的一些列，并且可以按任何顺序使用，也可以使用从其他列计算的一些别名列。列类型应与原始配置单元表中的列类型相同。按表达式划分应该与原始Hive表一致，按表达式划分中的列应该在表结构中。

引擎参数：

thrift://host:port-配置单元元存储地址 database—远程数据库名称。 table—远程表名称。

实战案例

为远程文件系统启用本地缓存。通过官方的基准测试表明，使用缓存的速度快了近两倍。在使用缓存之前，将其添加到config.xml

< local_cache_for_remote_fs >

< enable >true</ enable >

< root_dir >local_cache</ root_dir >

< limit_size >559096952</ limit_size >

< bytes_read_before_flush >1048576</ bytes_read_before_flush >

</ local_cache_for_remote_fs >

参数详解：

enable:ClickHouse将在启动后维护远程文件系统（HDFS）的本地缓存（如果为true）。 root_dir：必需。用于存储远程文件系统的本地缓存文件的根目录。 limit_size：必填。本地缓存文件的最大大小（字节）。 bytes_read_before_flush：从远程文件系统下载文件时，在刷新到本地文件系统之前控制字节数。默认值为1MB。

尽管ClickHouse在启用远程文件系统本地缓存的情况下启动时，我们仍然可以选择不使用其查询中设置为 use_local_cache_for_remote_fs=0 的缓存。u se_local_cache_for_remote_fs默认为false 。

ORC数据格式

Hive创建ORC数据格式表

								
									 CREATE   TABLE   `test`.`test_orc`(   

									 `f_tinyint` tinyint,   

									 `f_smallint`   smallint  ,   

									 `f_int`   int  ,   

									 `f_integer`   int  ,   

									 `f_bigint`   bigint  ,   

									 `f_float`   float  ,   

									 `f_double`   double  ,   

									 `f_decimal`   decimal  (10,0),   

									 `f_timestamp`   timestamp  ,   

									 `f_date`   date  ,   

									 `f_string` string,   

									 `f_varchar`   varchar  (100),   

									 `f_bool` boolean,   

									 `f_binary`   binary  ,   

									 `f_array_int` array<  int  >,   

									 `f_array_string` array<string>,   

									 `f_array_float` array<  float  >,   

									 `f_array_array_int` array<array<  int  >>,   

									 `f_array_array_string` array<array<string>>,   

									 `f_array_array_float` array<array<  float  >>)   

									 PARTITIONED   BY   (   

									 `  day  ` string)   

									 ROW FORMAT SERDE   

									 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 

									 STORED   AS   INPUTFORMAT   

									 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 

									 OUTPUTFORMAT   

									 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 

									 LOCATION   

									 'hdfs://testcluster/data/hive/test.db/test_orc'

insert into test.test_orc partition( day = '2021-09-18' ) select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp (), current_date (), 'hello world' , 'hello world' , 'hello world' , true , 'hello world' , array(1, 2, 3), array( 'hello world' , 'hello world' ), array( float (1.1), float (1.2)), array(array(1, 2), array(3, 4)), array(array( 'a' , 'b' ), array( 'c' , 'd' )), array(array( float (1.11), float (2.22)), array( float (3.33), float (4.44)));

Clickhouse创建Hive表引擎

								
									 CREATE   TABLE   test.test_orc 

									 ( 

									       `f_tinyint` Int8, 

									       `f_smallint` Int16, 

									       `f_int` Int32, 

									       `f_integer` Int32, 

									       `f_bigint` Int64, 

									       `f_float` Float32, 

									       `f_double` Float64, 

									       `f_decimal` Float64, 

									       `f_timestamp` DateTime, 

									       `f_date`   Date  , 

									       `f_string` String, 

									       `f_varchar` String, 

									       `f_bool` Bool, 

									       `f_binary` String, 

									       `f_array_int` Array(Int32), 

									       `f_array_string` Array(String), 

									       `f_array_float` Array(Float32), 

									       `f_array_array_int` Array(Array(Int32)), 

									       `f_array_array_string` Array(Array(String)), 

									       `f_array_array_float` Array(Array(Float32)), 

									       `  day  ` String 

									 ) 

									 ENGINE = Hive(  'thrift://202.168.117.26:9083'  ,   'test'  ,   'test_orc'  ) 

									 PARTITION   BY   day

通过Clickhouse查询Hive数据

1	SELECT * FROM test.test_orc settings input_format_orc_allow_missing_columns = 1\G

Parquet数据格式

Hive创建Parquet数据格式表

								
									 CREATE   TABLE   `test`.`test_parquet`(   

									 `f_tinyint` tinyint,   

									 `f_smallint`   smallint  ,   

									 `f_int`   int  ,   

									 `f_integer`   int  ,   

									 `f_bigint`   bigint  ,   

									 `f_float`   float  ,   

									 `f_double`   double  ,   

									 `f_decimal`   decimal  (10,0),   

									 `f_timestamp`   timestamp  ,   

									 `f_date`   date  ,   

									 `f_string` string,   

									 `f_varchar`   varchar  (100),   

									 `f_char`   char  (100),   

									 `f_bool` boolean,   

									 `f_binary`   binary  ,   

									 `f_array_int` array<  int  >,   

									 `f_array_string` array<string>,   

									 `f_array_float` array<  float  >,   

									 `f_array_array_int` array<array<  int  >>,   

									 `f_array_array_string` array<array<string>>,   

									 `f_array_array_float` array<array<  float  >>)   

									 PARTITIONED   BY   (   

									 `  day  ` string)   

									 ROW FORMAT SERDE   

									 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 

									 STORED   AS   INPUTFORMAT   

									 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 

									 OUTPUTFORMAT   

									 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 

									 LOCATION   

									 'hdfs://testcluster/data/hive/test.db/test_parquet'

insert into test.test_parquet partition( day = '2021-09-18' ) select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp (), current_date (), 'hello world' , 'hello world' , 'hello world' , true , 'hello world' , array(1, 2, 3), array( 'hello world' , 'hello world' ), array( float (1.1), float (1.2)), array(array(1, 2), array(3, 4)), array(array( 'a' , 'b' ), array( 'c' , 'd' )), array(array( float (1.11), float (2.22)), array( float (3.33), float (4.44)));

Clickhouse创建Hive表引擎

								
									 CREATE   TABLE   test.test_parquet   

									 (   

									 `f_tinyint` Int8,   

									 `f_smallint` Int16,   

									 `f_int` Int32,   

									 `f_integer` Int32,   

									 `f_bigint` Int64,   

									 `f_float` Float32,   

									 `f_double` Float64,   

									 `f_decimal` Float64,   

									 `f_timestamp` DateTime,   

									 `f_date`   Date  ,   

									 `f_string` String,   

									 `f_varchar` String,   

									 `f_char` String,   

									 `f_bool` Bool,   

									 `f_binary` String,   

									 `f_array_int` Array(Int32),   

									 `f_array_string` Array(String),   

									 `f_array_float` Array(Float32),   

									 `f_array_array_int` Array(Array(Int32)),   

									 `f_array_array_string` Array(Array(String)),   

									 `f_array_array_float` Array(Array(Float32)),   

									 `  day  ` String   

									 )   

									 ENGINE = Hive(  'thrift://localhost:9083'  ,   'test'  ,   'test_parquet'  )   

									 PARTITION   BY   day

通过Clickhouse查询Hive数据

1	SELECT * FROM test.test_parquet settings input_format_parquet_allow_missing_columns = 1\G

TextFile数据格式

Hive创建TextFile数据格式表

								
									 CREATE   TABLE   `test`.`test_text`(   

									 `f_tinyint` tinyint,   

									 `f_smallint`   smallint  ,   

									 `f_int`   int  ,   

									 `f_integer`   int  ,   

									 `f_bigint`   bigint  ,   

									 `f_float`   float  ,   

									 `f_double`   double  ,   

									 `f_decimal`   decimal  (10,0),   

									 `f_timestamp`   timestamp  ,   

									 `f_date`   date  ,   

									 `f_string` string,   

									 `f_varchar`   varchar  (100),   

									 `f_char`   char  (100),   

									 `f_bool` boolean,   

									 `f_binary`   binary  ,   

									 `f_array_int` array<  int  >,   

									 `f_array_string` array<string>,   

									 `f_array_float` array<  float  >,   

									 `f_array_array_int` array<array<  int  >>,   

									 `f_array_array_string` array<array<string>>,   

									 `f_array_array_float` array<array<  float  >>)   

									 PARTITIONED   BY   (   

									 `  day  ` string)   

									 ROW FORMAT SERDE   

									 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 

									 STORED   AS   INPUTFORMAT   

									 'org.apache.hadoop.mapred.TextInputFormat' 

									 OUTPUTFORMAT   

									 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 

									 LOCATION   

									 'hdfs://testcluster/data/hive/test.db/test_text'

insert into test.test_text partition( day = '2021-09-18' ) select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp (), current_date (), 'hello world' , 'hello world' , 'hello world' , true , 'hello world' , array(1, 2, 3), array( 'hello world' , 'hello world' ), array( float (1.1), float (1.2)), array(array(1, 2), array(3, 4)), array(array( 'a' , 'b' ), array( 'c' , 'd' )), array(array( float (1.11), float (2.22)), array( float (3.33), float (4.44)));

Clickhouse创建Hive表引擎

								
									 CREATE   TABLE   test.test_text   

									 (   

									 `f_tinyint` Int8,   

									 `f_smallint` Int16,   

									 `f_int` Int32,   

									 `f_integer` Int32,   

									 `f_bigint` Int64,   

									 `f_float` Float32,   

									 `f_double` Float64,   

									 `f_decimal` Float64,   

									 `f_timestamp` DateTime,   

									 `f_date`   Date  ,   

									 `f_string` String,   

									 `f_varchar` String,   

									 `f_char` String,   

									 `f_bool` Bool,   

									 `  day  ` String   

									 )   

									 ENGINE = Hive(  'thrift://localhost:9083'  ,   'test'  ,   'test_text'  )   

									 PARTITION   BY   day

通过Clickhouse查询Hive数据

1	SELECT * FROM test.test_text settings input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort' \G

总结

本节主要讲解了Clickhouse整合Hive数仓，利用了Hive引擎并通过thrift方式去连接，需要注意这种连接参数的设置以及代表意义。另外，这个过程我们需要注意的是，推荐开启缓存，这样查询速度会快很多。与此同时，也对Hive常用的三种数据类型ORC,Parquet,TextFile进行了一个实战案例操作，更多关于Clickhouse整合Hive数据仓库的资料请关注服务器之家其它相关文章！

原文链接：https://juejin.cn/post/7153507694140194823

声明：本文来自网络，不代表【好得很程序员自学网】立场，转载请注明出处：http://haodehen.cn/did231352

更新时间：2023-05-19 阅读：22次