For an overview of what SAP HANA Vora is, please check out:
- [SAP HANA Academy] Learn How to Install SAP HANA Vora on a Single Node
- SAP HANA Vora 1.0 – SAP Help Portal Page
With the introductions aside, Spark is fast becoming the de facto data processing engine for Hadoop; it's fast, flexible and operates in-memory (when the dataset fits). I like to think of HANA Vora as an add-on to Spark: it provides added business features as well as best-in-class integration with HANA databases.
Let's now dive right into the typical "hello world"-style example for Hadoop: the simple word count.
How often is "Watson" referred to directly in "The Adventures of Sherlock Holmes"?
Is the answer:
A) 42
B) 81
C) 136
D) The Sum of the Above
Note: The answer is at the bottom.
Spark and Vora support several languages, such as Scala, Python and Java. Since Scala is still slightly ahead in terms of popularity with Spark, I'll use that. For utilising Vora you can use the Spark shell or a notebook application such as Zeppelin. In this example I use Zeppelin, which is covered in the Vora installation steps as well as in the HANA Academy videos.
First, let's download a free copy of the book, strip out all special characters, collect the words, aggregate the results and finally store them as an in-memory resilient distributed dataset (RDD):
Scala: Process The File

```scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

// Download the external file (target path constructed from the HDFS namenode and directory)
val HDFS_NAMENODE = "107.20.0.138:8020"
val HDFS_DIR = "/user/vora"
val tmpFile = new File(s"""hdfs://${HDFS_NAMENODE}${HDFS_DIR}/SherlockHolmes.txt""")
FileUtils.copyURLToFile(new URL("https://ia600300.us.archive.org/10/items/TheAdventuresOfHolmesSherlock/DoyleArthurConan-AdventuresOfSherlockHolmesThe.txt"), tmpFile)

// Read the file, one line per element, into a Spark RDD[String]
val textFile = sc.parallelize(FileUtils.readLines(tmpFile).toArray.map(x => x.toString))

println("----------------------------------------")

// Print the first 2 lines of the file
textFile.take(2).foreach(println)

// Number of rows
println("Rows in File: " + textFile.count())

// Perform the full word count: strip out special characters, lower-case, split and reduce
val word_counts = textFile
  .flatMap(line => line.replaceAll("[^\\p{L}\\p{Nd}\\s]+", "").toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Put the results into a resilient distributed dataset (RDD) of case class instances
case class WordCount(word: String, wordcount: Long)
var wcRDD = word_counts.map(t => WordCount(t._1, t._2))

// First 10 rows of the word count
println("----------------------------------------")
println("First 10 Rows of Word Count:")
wcRDD.take(10).foreach(println)
```
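As an aside, if the book were already sitting in HDFS, the same RDD could be built directly with `sc.textFile` rather than staging a copy with FileUtils. A minimal sketch, assuming the same namenode and directory as above:

```scala
// Sketch only: build the RDD straight from HDFS. This assumes the file has
// already been copied into HDFS at the path below (e.g. with `hdfs dfs -put`).
val textFileHdfs = sc.textFile(s"hdfs://${HDFS_NAMENODE}${HDFS_DIR}/SherlockHolmes.txt")

// One element per line, exactly like the parallelize-based version
println("Rows in file (HDFS read): " + textFileHdfs.count())
```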
In Zeppelin it appears as follows:
The results of executing it are:
Next, let's use Vora to register the RDD as a temporary "in-memory" table and then run a SQL query to find how many times "Watson" appears in the book:
Scala: Use Vora to Register the RDD as a Temporary Table, then Query the Results

```scala
import org.apache.spark.sql._

// Create the SAP (Vora) SQL context
val sapSqlContext = new SapSQLContext(sc)

// Convert the WordCount RDD into a DataFrame
val wordCountDataFrame = sapSqlContext.createDataFrame(wcRDD)

// Register it as a temporary "in-memory" table
wordCountDataFrame.registerTempTable("wc")

// Query the table: how many times does "watson" appear?
val results = sapSqlContext.sql("SELECT word, wordcount FROM wc WHERE word = 'watson'")
  .map { case Row(word: String, wordcount: Long) => word + "\t" + wordcount }
  .collect()
```
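Since `createDataFrame` returns a standard Spark DataFrame, the same filter can also be expressed with the DataFrame API instead of SQL. A quick sketch for comparison (not part of the original example):

```scala
// Equivalent filter via the DataFrame API (sketch; same result as the SQL above)
val watsonDF = wordCountDataFrame.filter(wordCountDataFrame("word") === "watson")
watsonDF.show()
```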
Execute in Zeppelin:
Finally, let's use Zeppelin to visualise the results:
Visualise with Zeppelin

```scala
// Emit Zeppelin's %table display directive followed by tab-separated rows
println("%table Word\tCount\n" + results.mkString("\n"))
```
Execute in Zeppelin:
Note: Zeppelin's visualisation capabilities are better demonstrated with %vora SQL statements once the results have been stored to HDFS.
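For reference, one way to get the results into HDFS from this example is to save the word-count RDD as plain text; the output path below is an assumption for illustration only. A Vora table could then be defined over those files (using the com.sap.spark.vora data source described in the Vora documentation) and queried through the %vora interpreter.

```scala
// Sketch only: persist the word counts to HDFS as comma-separated text so a
// Vora table can later be created over the files. The output path is an
// assumption; adjust it to your cluster.
wcRDD.map(wc => s"${wc.word},${wc.wordcount}")
  .saveAsTextFile(s"hdfs://${HDFS_NAMENODE}${HDFS_DIR}/wordcounts")
```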
So the Answer is B) 81
Did you guess right?
At SAP TechEd 2015, SAP HANA Vora was also demonstrated processing a petabyte of data for Intel, so hopefully some more challenging Vora examples will follow on SCN from Mr Appleby and co soon.
But in the meantime, it's always fun to use a sledgehammer to crack a nut. I hope you enjoyed it.