
spark sql use hive



Query the data stored in Hive through Spark SQL; screenshots of each operation are required.

  1. Switch to the target database and pick the table to query;
  2. Run the following commands:
sunyonggang@gg01:~/spark-1.4.0-bin-hadoop2.6$ ./bin/spark-sql --driver-class-path /home/sunyonggang/apache-hive-0.13.1-bin/lib/mysql-connector-java-5.1.36-bin.jar
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
spark-sql> show databases;
16/05/07 10:55:13 INFO ParseDriver: Parsing command: show databases
16/05/07 10:55:13 INFO ParseDriver: Parse Completed
16/05/07 10:55:15 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:423, took 0.342186 s
16/05/07 10:55:15 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
default
sogou
Time taken: 1.951 seconds, Fetched 2 row(s)
16/05/07 10:55:15 INFO CliDriver: Time taken: 1.951 seconds, Fetched 2 row(s)
16/05/07 10:55:15 INFO StatsReportListener:     96 %    96 %    96 %    96 %    96 %    96 %    96 %    96 %    96 %
spark-sql> use sogou;
16/05/07 10:55:23 INFO ParseDriver: Parsing command: use sogou
16/05/07 10:55:23 INFO ParseDriver: Parse Completed
spark-sql> show tables;
16/05/07 10:55:28 INFO HiveMetaStore: 0: get_tables: db=sogou pat=.*
16/05/07 10:55:28 INFO audit: ugi=sunyonggang   ip=unknown-ip-addr  cmd=get_tables: db=sogou pat=.*
16/05/07 10:55:28 INFO SparkContext: Starting job: processCmd at CliDriver.java:423
16/05/07 10:55:28 INFO DAGScheduler: Got job 2 (processCmd at CliDriver.java:423) with 1 output partitions (allowLocal=false)
16/05/07 10:55:28 INFO DAGScheduler: Final stage: ResultStage 2(processCmd at CliDriver.java:423)
16/05/07 10:55:28 INFO DAGScheduler: Parents of final stage: List()
16/05/07 10:55:28 INFO DAGScheduler: Missing parents: List()
16/05/07 10:55:28 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[5] at processCmd at CliDriver.java:423), which has no missing parents
16/05/07 10:55:28 INFO MemoryStore: ensureFreeSpace(2816) called with curMem=4314, maxMem=280248975
16/05/07 10:55:28 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.8 KB, free 267.3 MB)
16/05/07 10:55:28 INFO MemoryStore: ensureFreeSpace(1685) called with curMem=7130, maxMem=280248975
16/05/07 10:55:28 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1685.0 B, free 267.3 MB)
16/05/07 10:55:28 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:48651 (size: 1685.0 B, free: 267.3 MB)
16/05/07 10:55:28 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
16/05/07 10:55:28 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[5] at processCmd at CliDriver.java:423)
16/05/07 10:55:28 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/05/07 10:55:28 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1902 bytes)
16/05/07 10:55:28 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
16/05/07 10:55:28 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1149 bytes result sent to driver
16/05/07 10:55:28 INFO DAGScheduler: ResultStage 2 (processCmd at CliDriver.java:423) finished in 0.038 s
16/05/07 10:55:28 INFO DAGScheduler: Job 2 finished: processCmd at CliDriver.java:423, took 0.062450 s
sg  false
sunyonggang false
userlog false
16/05/07 10:55:28 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@6f05894a
16/05/07 10:55:28 INFO StatsReportListener: task runtime:(count: 1, mean: 42.000000, stdev: 0.000000, max: 42.000000, min: 42.000000)
16/05/07 10:55:28 INFO StatsReportListener:     0%  5%  10% 25% 50% 75% 90% 95% 100%
16/05/07 10:55:28 INFO StatsReportListener:     42.0 ms 42.0 ms 42.0 ms 42.0 ms 42.0 ms 42.0 ms 42.0 ms 42.0 ms 42.0 ms
Time taken: 0.096 seconds, Fetched 3 row(s)

3. Write the SQL query

select address, count(*) as number from userlog group by address order by number desc;

吉首市 22435
龙山县 12445
永顺县 12146
凤凰县 10548
花垣县 10453
保靖县 7258
泸溪县 7102
古丈县 3721
16/05/07 11:00:40 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@8a8fbe2
16/05/07 11:00:40 INFO StatsReportListener: task runtime:(count: 9, mean: 25.666667, stdev: 9.153020, max: 42.000000, min: 13.000000)
16/05/07 11:00:40 INFO StatsReportListener:     0%  5%  10% 25% 50% 75% 90% 95% 100%
16/05/07 11:00:40 INFO StatsReportListener:     13.0 ms 13.0 ms 13.0 ms 20.0 ms 26.0 ms 30.0 ms 42.0 ms 42.0 ms 42.0 ms
16/05/07 11:00:40 INFO StatsReportListener: fetch wait time:(count: 9, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
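
The same query can also be issued programmatically instead of through the spark-sql shell. The sketch below uses the HiveContext API of Spark 1.4; the object name UserlogByAddress is illustrative rather than from the original post, and it assumes hive-site.xml is on the application classpath so the Hive metastore can be located.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object UserlogByAddress {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("UserlogByAddress"))
    // HiveContext reads hive-site.xml from the classpath to find the metastore (assumption)
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("use sogou")
    // Same aggregation as in the spark-sql shell session above
    val result = hiveContext.sql(
      "select address, count(*) as number from userlog group by address order by number desc")
    result.collect().foreach(println)

    sc.stop()
  }
}

Submitted with spark-submit (together with the same --driver-class-path for the MySQL connector jar), this should print the same address counts as the shell session.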

Write a simple program that queries the Spark SQL Thrift server over JDBC; document the steps with screenshots.

  1. Start the Thrift server
sunyonggang@gg01:~/spark-1.4.0-bin-hadoop2.6$ ./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.server2.thrift.bind.host=gg01 --master spark://gg01:7077 --driver-class-path /home/sunyonggang/apache-hive-0.13.1-bin/lib/mysql-connector-java-5.1.36-bin.jar --executor-memory 1g
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /home/sunyonggang/spark-1.4.0-bin-hadoop2.6/sbin/../logs/spark-sunyonggang-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-gg01.out
sunyonggang@gg01:~/spark-1.4.0-bin-hadoop2.6$ cd ~/apache-hive-0.13.1-bin/
sunyonggang@gg01:~/apache-hive-0.13.1-bin$ ./bin/beeline
Beeline version 0.13.1 by Apache Hive
beeline> !connect jdbc:hive2://gg01:10000
scan complete in 3ms
Connecting to jdbc:hive2://gg01:10000
Enter username for jdbc:hive2://gg01:10000: sunyonggang
Enter password for jdbc:hive2://gg01:10000: ***********
Connected to: Spark SQL (version 1.4.0)
Driver: Hive JDBC (version 0.13.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://gg01:10000> show databases;
+----------+
|  result  |
+----------+
| default  |
| sogou    |
+----------+
2 rows selected (1.852 seconds)
0: jdbc:hive2://gg01:10000>

2. Access the data: import all the jars under Hive's lib directory, including the mysql-connector-java jar (an sbt alternative is sketched below).
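
If the client is built with sbt instead of importing jars into the IDE by hand, a rough equivalent is to declare the JDBC driver as a dependency. This build snippet is an assumption, not from the original post; the versions simply match the Hive 0.13.1 and Hadoop 2.6 setup used here.

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.hive" % "hive-jdbc" % "0.13.1",
  "org.apache.hadoop" % "hadoop-common" % "2.6.0"
)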

package com.github.sunyonggang

import java.sql.DriverManager

/**
  * Created by sunyonggang on 16-5-7.
  */
class UseJdbcGetUserlog {

}
object UseJdbcGetUserlog {
  def main(args: Array[String]) {
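    // Register the Hive JDBC driver so DriverManager can resolve the jdbc:hive2:// URL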
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://gg01:10000/sogou", "sunyonggang", "**************")
    val preStat = conn.prepareStatement("select * from userlog limit 5")
    val rs = preStat.executeQuery()

    //desc userlog
    /**
      * hive> desc userlog;
      * OK
      * id                      int
      * username                varchar(100)
      * phone                   bigint
      * address                 varchar(25)
      * age                     int
      * gender                  varchar(5)
      * level                   varchar(25)
      * logindate               varchar(10)
      * Time taken: 0.376 seconds, Fetched: 8 row(s)
      */

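    // Print the five fetched rows, reading each column by the names shown in the schema above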
    while(rs.next()) {
      println("id: " + rs.getInt("id") + " username: " + rs.getString("username") + " phone: " + rs.getString("phone") +
        " address: " + rs.getString("address") + " age: " + rs.getInt("age")
      )
    }

    rs.close()
    preStat.close()
    conn.close()
  }
}

3. Check the results

/home/sunyonggang/jdk1.7.0_79/bin/java -Didea.launcher.port=7533 -Didea.launcher.bin.path=/home/sunyonggang/idea-IC-145.597.3/bin -Dfile.encoding=UTF-8 -classpath  com.intellij.rt.execution.application.AppMain com.github.sunyonggang.UseJdbcGetUserlog
log4j:WARN No appenders could be found for logger (org.apache.thrift.transport.TSaslTransport).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
id: 1 username: user1 phone: 15200776350 address: 凤凰县 age: 25
id: 2 username: user2 phone: 15200771700 address: 凤凰县 age: 41
id: 3 username: user3 phone: 13487837597 address: 吉首市 age: 23
id: 4 username: user4 phone: 15074344263 address: 吉首市 age: 21
id: 5 username: user5 phone: 15007439761 address: 吉首市 age: 20