Example VerdictDB Application with TPC-H¶
VerdictDB on MySQL¶
On this page, we demonstrate an example Java/Python application that uses the VerdictDB library to create a scrambled table in your database and then executes a query that lets VerdictDB take advantage of that scrambled table.
Here we assume:
- Your MySQL database is running
- TPC-H data has been loaded into your database following the instructions in the previous Setup TPC-H Data page.
Getting VerdictDB Example Application¶
Before we start, you need to obtain the example Java/Python application that we will refer to throughout this tutorial. To do so, make sure git is installed and run the following command in your working directory:
$ git clone git@github.com:verdictdb/verdictdb-tutorial.git
This will clone the current VerdictDB tutorial application into your_working_directory/verdictdb-tutorial.
Move into the directory with the example you will use: verdictdb_on_mysql for the Java application, or pyverdict_on_mysql for the Python application.
cd verdictdb_on_mysql
cd pyverdict_on_mysql
Compiling the Application¶
The tutorial Java application is set up with Apache Maven; building it produces a runnable jar file under your_working_directory/verdictdb-tutorial/target.
For the Python application, install the required packages with pip.
mvn package
pip install -r requirements.txt
Running the Application¶
We included a shell script run.sh that you can invoke to execute the example Java application. The script takes four arguments: hostname, port, name of database/schema, and command.
Creating a Scrambled Table¶
In this tutorial, we are going to create a scrambled table for the lineitem table. Assuming your MySQL database is running as set up on the previous page, run one of the following commands (Java or Python):
./run.sh localhost 3306 tpch1g create
python tutorial.py create
It prints something like this:
Creating a scrambled table for lineitem...
Scrambled table for lineitem has been created.
Time Taken = 67 s
This command executes the following SQL statement via VerdictDB:
CREATE SCRAMBLE tpch1g.lineitem_scramble FROM tpch1g.lineitem
Snippet of the corresponding source code (simplified for brevity):
...
Statement stmt = verdictConn.createStatement();
String createQuery = String.format(
    "CREATE SCRAMBLE %s.lineitem_scramble " +
    "FROM %s.lineitem",
    database, database);
System.out.println("Creating a scrambled table for lineitem...");
stmt.execute(createQuery);
stmt.close();
System.out.println("Scrambled table for lineitem has been created.");
...
...
print('Creating a scrambled table for lineitem...')
start = time.time()
verdict_conn.sql('CREATE SCRAMBLE tpch1g.lineitem_scramble FROM tpch1g.lineitem')
duration = time.time() - start
verdict_conn.close()
print('Scrambled table for lineitem has been created.')
print('Time Taken = {:f} s'.format(duration))
...
It is just like running a SQL statement via any JDBC connection.
Once the scrambled table has been created, VerdictDB automatically utilizes it for any query that involves the lineitem table.
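For context, verdictConn in the Java snippet above is an ordinary JDBC connection whose URL carries the verdict prefix. The sketch below shows one way such a connection might be opened and the scramble created; the host, port, user, empty password, and class name are placeholders chosen here only for illustration, and the real tutorial application does more (argument parsing, timing, error handling).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateScrambleSketch {
  public static void main(String[] args) throws Exception {
    // The "verdict:" prefix routes the JDBC connection through VerdictDB,
    // which forwards statements to the underlying MySQL instance.
    // localhost, 3306, "root", and "" are placeholder connection details.
    Connection verdictConn = DriverManager.getConnection(
        "jdbc:verdict:mysql://localhost:3306", "root", "");

    Statement stmt = verdictConn.createStatement();
    stmt.execute("CREATE SCRAMBLE tpch1g.lineitem_scramble FROM tpch1g.lineitem");
    stmt.close();
    verdictConn.close();
  }
}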
Running a Sample Query¶
The example application executes one very simple aggregation query with and without VerdictDB and compares their performance:
SELECT avg(l_extendedprice) FROM tpch1g.lineitem_scramble
To run this query with and without VerdictDB, simply run one of the following commands (Java or Python):
./run.sh localhost 3306 tpch1g run
python tutorial.py run
The result looks like this:
Without VerdictDB: average(l_extendedprice) = 38255.138485
Time Taken = 43 s
With VerdictDB: average(l_extendedprice) = 38219.8871659
Time Taken = 10 s
You can see that VerdictDB achieves more than a 4x speedup with a 99.9% accurate result, even on the relatively small dataset used in this tutorial (1 GB).
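For reference, the comparison performed by the example application can be reproduced directly over JDBC. The sketch below is a simplification using the same placeholder connection details as above: the plain MySQL connection computes the aggregate over every row, while the verdict-prefixed connection lets VerdictDB answer the same query approximately.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CompareSketch {
  public static void main(String[] args) throws Exception {
    String sql = "SELECT avg(l_extendedprice) FROM tpch1g.lineitem_scramble";

    // Plain MySQL connection: the aggregate is computed exactly over all rows.
    runAndTime("Without VerdictDB",
        DriverManager.getConnection("jdbc:mysql://localhost:3306", "root", ""), sql);

    // VerdictDB connection: the same query is answered approximately.
    runAndTime("With VerdictDB",
        DriverManager.getConnection("jdbc:verdict:mysql://localhost:3306", "root", ""), sql);
  }

  static void runAndTime(String label, Connection conn, String sql) throws Exception {
    long start = System.currentTimeMillis();
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(sql)) {
      rs.next();
      System.out.printf("%s: average(l_extendedprice) = %f, Time Taken = %.1f s%n",
          label, rs.getDouble(1), (System.currentTimeMillis() - start) / 1000.0);
    }
    conn.close();
  }
}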
VerdictDB on Apache Spark¶
We will write a simple example program in Scala, the standard programming language for Apache Spark. To compile and run the example program, sbt and Apache Spark (for spark-submit) must be installed.
Create an empty project¶
The following command creates a project that prints out "hello".
$ sbt new sbt/scala-seed.g8
A minimal Scala project.
name [Scala Seed Project]: hello-verdict
Template applied in ./hello-verdict
Move into the project directory: cd hello-verdict.
Remove the src/test directory, which we do not need: rm -rf src/test.
Configure build setting to use Spark and VerdictDB¶
Add the following line to build.sbt, under the existing import Dependencies._ line. As of the time of writing, the latest version of Apache Spark only supports Scala 2.11.
scalaVersion := "2.11.1"
Also, replace the existing dependency list with the following:
libraryDependencies ++= Seq(
  scalaTest % Test,
  "org.verdictdb" % "verdictdb-core" % "0.5.8",
  "org.apache.spark" %% "spark-core" % "2.3.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"
)
This dependency declaration lets the build tool (sbt in our case) download the relevant libraries automatically.
Write an example program¶
Edit src/main/scala/example/Hello.scala as follows:
package example

import org.apache.spark.sql.SparkSession
import org.verdictdb.VerdictContext
import org.verdictdb.connection.SparkConnection
import scala.util.Random

object Hello extends App {
  val spark = SparkSession
    .builder()
    .appName("VerdictDB basic example")
    .enableHiveSupport()
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._

  val verdict = VerdictContext.fromSparkSession(spark)

  // prepare data
  prepareData(spark, verdict)

  // run a query and print its result
  val rs = verdict.sql("select count(*) from myschema.sales")
  rs.printCsv()
  // only the following lines will be printed (the actual count value may vary)
  // c2
  // 950.0

  def prepareData(spark: SparkSession, verdict: VerdictContext): Unit = {
    // create a schema and a table
    spark.sql("DROP SCHEMA IF EXISTS myschema CASCADE")
    spark.sql("CREATE SCHEMA IF NOT EXISTS myschema")
    spark.sql("CREATE TABLE IF NOT EXISTS myschema.sales (product string, price double)")

    // insert 1000 rows
    val productList = List("milk", "egg", "juice")
    val rand = new Random()
    var query = "INSERT INTO myschema.sales VALUES"
    for (i <- 0 until 1000) {
      val randInt: Int = rand.nextInt(3)
      val product: String = productList(randInt)
      val price: Double = (randInt + 2) * 10 + rand.nextInt(10)
      if (i == 0) {
        query = query + f" ('$product', $price%.0f)"
      } else {
        query = query + f", ('$product', $price%.0f)"
      }
    }
    spark.sql(query)

    verdict.sql("BYPASS DROP TABLE IF EXISTS myschema.sales_scramble")
    verdict.sql("BYPASS DROP SCHEMA IF EXISTS verdictdbtemp CASCADE")
    verdict.sql("BYPASS DROP SCHEMA IF EXISTS verdictdbmeta CASCADE")
    verdict.sql("CREATE SCRAMBLE myschema.sales_scramble FROM myschema.sales BLOCKSIZE 100")
  }
}
Package and Submit¶
$ sbt assembly
$ spark-submit --class example.Hello target/scala-2.11/Hello-assembly-0.1.0-SNAPSHOT.jar
This example program is available in this public GitHub repository; see the directory verdictdb_on_spark.