This month’s dev blog comes from one of our Senior Software Engineers, Landon Robinson, our very own Hadoopster. Check out his site for more great Hadoop tutorials and tips.
I’ve been working with Hadoop, Map-Reduce, and other “scalable” frameworks for a little over 3 years now. One of the latest and greatest innovations in our open-source space has been Apache Spark — a parallel processing framework that’s built on the paradigm Map-Reduce introduced but packed with enhancements, improvements, optimizations, and features. You probably know about Spark, so I don’t need to give you the whole pitch.
You’re likely also aware of its main components:
- Spark Core: The parallel processing engine written in the Scala programming language
- Spark SQL: Allows you to programmatically use SQL in a Spark pipeline for data manipulation
- Spark MLlib: Machine-learning algorithms ported to Spark for easy use by devs
- Spark GraphX: A graph-processing library built on the Spark Core engine
- Spark Streaming: A framework for handling data that is live-streaming at high speed
Spark Streaming is what I’ve been working on lately. Specifically, I’ve been building apps in Scala that use Spark Streaming to read data from Kafka topics, do some on-the-fly manipulations and joins in memory, and write the newly augmented data to HDFS in “near real time.”
I’ve learned a few things along the way. Here are my tips:
1. Determine what version of Scala and Spark you’re using on your cluster, and use those during development.
In my workplace, we use Spark 2.1.1 with Scala 2.11.8. It’s critical to align versions of these appropriately or you’re likely to suffer errors when compiling and running your code.
What type of errors, you ask? They usually manifest at runtime as a “NoClassDefFoundError” or “NoSuchMethodError,” which means a class or method your code was compiled against isn’t found when it’s called, typically because it moved packages or changed its signature between releases. That’s why it’s important to make sure your Scala and Spark releases are in lockstep. Here is a good reference article on StackOverflow.
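As a rough illustration, here is a minimal sketch of pinning those versions in a build.sbt. It assumes sbt is the build tool (this post doesn’t say which one is used), and the project name, the list of modules, and the choice of the Kafka 0-10 integration are placeholders rather than a description of the actual project.

```scala
// build.sbt: a sketch, assuming sbt is the build tool (not stated in this post)
name := "streaming-app" // hypothetical project name

// Match the Scala version your cluster's Spark build was compiled against
scalaVersion := "2.11.8"

// Pin the Spark version in one place so every module stays in lockstep
val sparkVersion = "2.1.1"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.11) to the artifact name
  // "provided": the cluster supplies these jars at runtime
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"       % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // Assumption: the Kafka 0-10 integration; it is not part of the Spark
  // distribution, so it gets bundled with the application jar
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)
```

If you aren’t sure what your cluster runs, `spark-submit --version` prints both the Spark version and the Scala version it was built with.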
2. Leverage your IDE to help you understand Scala quirks.
Syntactically, Scala is a remarkably concise language, with — most of the time — high readability. This is thanks to its often high level of abstraction for complex functionality. While this results in brief code that accomplishes impressive tasks, it can be very difficult to comprehend at first — even frustrating.
You can remedy this by using the power of your modern Integrated Development Environment (IDE) — I prefer and highly recommend IntelliJ Community Edition — to learn what’s happening in the background. For example:
- Retype pieces of code you get from StackOverflow or a peer and observe the results in the dialog windows that appear. When you see syntax you don’t understand, type it slowly and watch as the dialogs showcase what objects you’re dealing with, what’s being returned in a method call, and what methods are available on certain objects.
- Use CMD+B on macOS (Ctrl+B on Windows/Linux), or Navigate > Declaration, to view the original declaration or source code of an object, class, or method. This is super helpful for understanding what API calls expect as input or seeing what Spark’s map function will return as output.
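One way to practice this is to write out the types the IDE shows you as explicit annotations, so the compiler confirms your reading. The sketch below uses plain Scala collections rather than Spark so it runs anywhere; the object name and the sample data are made up for illustration.

```scala
object TypeAnnotationSketch {
  // Made-up sample records in "id,action" form
  val events: Seq[String] = Seq("id1,click", "id2,view")

  // map keeps one output element per input element: each line becomes an Array
  val fields: Seq[Array[String]] = events.map(_.split(","))

  // flatMap flattens those arrays into one sequence of tokens
  val tokens: Seq[String] = events.flatMap(_.split(","))

  // Mapping to pairs and calling toMap yields a Map keyed by id
  val actionById: Map[String, String] = fields.map(f => f(0) -> f(1)).toMap

  def main(args: Array[String]): Unit = {
    println(s"tokens: ${tokens.mkString(", ")}")
    println(s"actionById: $actionById")
  }
}
```

If an annotation is wrong, the code won’t compile, and the error message tells you the type that was actually inferred.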
3. Don’t be afraid to use Java stuff.
Scala isn’t Java, but it does run on the JVM, which gives it access to Java functionality. In short, you can import and call Java classes and libraries directly from your Scala apps. Pretty neat, huh? Pretty confusing too.
But don’t be afraid to use it if you’re up against a wall. Java has had years to build up a deep backlog of solutions to just about any programming problem you might encounter.
One example is timestamp manipulation. In data engineering, we play with timestamps all the time: epochs, yyyy-mm-dd, hh:mm:ss, you name it. There are hundreds of ways to do this in Java, all very well documented on places like — you guessed it — StackOverflow.
If there’s a Java solution and it’s efficient enough for your program, don’t fear using it just because it isn’t “fully Scala.” You’ll be glad you used it when it just works.
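To make the timestamp case concrete, here is a minimal sketch that calls Java’s java.time classes (Java 8 and later) straight from Scala. The object name, the output pattern, and the choice of UTC are assumptions for illustration, not something prescribed by this post.

```scala
import java.time.{Instant, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object TimestampUtils {
  // Assumed format and zone. Note that in Java patterns MM is month and mm is minute.
  private val formatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC)

  // Epoch milliseconds -> formatted UTC string
  def epochMillisToString(epochMillis: Long): String =
    formatter.format(Instant.ofEpochMilli(epochMillis))

  // Formatted UTC string -> epoch milliseconds
  def stringToEpochMillis(timestamp: String): Long =
    LocalDateTime.parse(timestamp, formatter).toInstant(ZoneOffset.UTC).toEpochMilli

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    println(s"$now -> ${epochMillisToString(now)}")
  }
}
```

Nothing here is Scala-specific beyond the syntax; the Java classes are imported and used like any Scala class.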
4. Take the time to study and understand Spark Streaming.
Get a hot cup of coffee, get comfortable, and chew through the documentation on Spark Streaming. Get more than acquainted with the terms, get friendly. Diagram process flows if that helps you (it helps me). Learn what a DStream is.
Let the programming guide — and other helpful blogs from others who have suffered the hard, pioneering work — guide you to success and knowledge.
And when you’re done with that, rewrite some of the documentation of the components you use in your Spark applications. Do this not only so that your team and company can understand what you’ve written, but also so that you understand it better yourself. Teaching is the best way to deepen your own knowledge.
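If the term is new to you: a DStream is Spark Streaming’s abstraction for a continuous stream of data, represented as a sequence of RDDs, one per batch interval. The sketch below is a small word count that runs in local mode, with an in-memory queue of RDDs standing in for Kafka so it needs no external services. The object name, sample data, and batch interval are assumptions for illustration, and the Spark Streaming dependency must be on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

import scala.collection.mutable

object DStreamSketch {

  // The same transformations you would write on an RDD, applied to each batch.
  // Counts are per batch here, not cumulative across batches.
  def countWords(lines: DStream[String]): DStream[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-sketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

    // Stand-in for Kafka: each queued RDD becomes one batch of the stream
    val batches = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("spark streaming", "spark core")),
      ssc.sparkContext.parallelize(Seq("streaming data"))
    )

    // print() shows the first few elements of every batch on the driver
    countWords(ssc.queueStream(batches)).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Watching the per-batch output scroll by is a quick way to get a feel for how the batch interval slices a stream into RDDs.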
5. Monitor your app results and performance by building simple tables and charts in Microsoft Excel.
I know Excel can get tricky, but this was remarkably helpful to me in truly understanding what was happening under the hood with my first streaming apps. In one scenario, I had a streaming app that was reading two Kafka topics to get two different datasets: we’ll call them event type 1 and event type 2. The goal was to join them within the streaming app on a column they have in common, a key/id. By having my app print the counts at each step in my process, I was able to see how successful my “join rate” was over time (using stateful streaming) by simply using a few sum and division formulas that are then fed into simple line graphs.
Then, as I changed my code to debug and introduce improvements, I could clone this sheet with its simple formulas and line graphs, plug in new data, and see immediate changes visually.
It’s not as critical a suggestion as the first four, but I found it immensely valuable and the stakeholders of my application did as well. Visuals always help.
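The arithmetic behind such a sheet is simple enough to sketch in plain Scala. The example below assumes “join rate” means joined records divided by event type 1 records, which is one reasonable reading rather than a definition given in this post; the case class, field names, and sample counts are all made up. It prints CSV rows that can be pasted into a spreadsheet.

```scala
// Hypothetical per-batch counts, as a streaming app might print them
final case class StepCounts(batch: Int, type1Events: Long, type2Events: Long, joined: Long)

object JoinRateReport {

  // Assumed definition: share of type 1 events that found a match in this batch
  def joinRate(c: StepCounts): Double =
    if (c.type1Events == 0L) 0.0 else c.joined.toDouble / c.type1Events

  // The "sum and division" step: total joined over total type 1 events
  def cumulativeJoinRate(batches: Seq[StepCounts]): Double = {
    val totalType1 = batches.map(_.type1Events).sum
    if (totalType1 == 0L) 0.0 else batches.map(_.joined).sum.toDouble / totalType1
  }

  // One header line plus one CSV row per batch
  def toCsv(batches: Seq[StepCounts]): String = {
    val header = "batch,type1_events,type2_events,joined,join_rate"
    val rows = batches.map { c =>
      s"${c.batch},${c.type1Events},${c.type2Events},${c.joined},${joinRate(c)}"
    }
    (header +: rows).mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    // Made-up numbers, for illustration only
    val sample = Seq(StepCounts(1, 200, 180, 150), StepCounts(2, 100, 120, 75))
    println(toCsv(sample))
    println(s"cumulative join rate: ${cumulativeJoinRate(sample)}")
  }
}
```

Keeping the raw counts in the output, not just the rate, makes it easy to rebuild or adjust the formulas in the spreadsheet later.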
6. Get Spark running locally in your IDE.
For real. Don’t use your cluster for testing, or at least not for basic functional testing. Get Spark running in IntelliJ with this tutorial (“Setting up Spark 2.0 with IntelliJ”). The time you spend on this up front will save you hours and headaches in the future.
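Once the dependencies resolve in the IDE, a small smoke test is enough to confirm the setup. The sketch below is one possible pattern, not the tutorial’s: it falls back to a local master only when none has been supplied, so the same code can run from the IDE or through spark-submit. The object and app names are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object LocalSparkSketch {

  // Uses local[*] (all local cores) unless a master was already configured,
  // for example by spark-submit on the cluster
  def buildSession(appName: String): SparkSession = {
    val conf = new SparkConf()
      .setAppName(appName)
      .setIfMissing("spark.master", "local[*]")
    SparkSession.builder().config(conf).getOrCreate()
  }

  def main(args: Array[String]): Unit = {
    val spark = buildSession("local-smoke-test")
    try {
      // A tiny job: if this prints a number, Spark is working inside the IDE
      val total = spark.sparkContext.parallelize(1 to 100).map(_ * 2).reduce(_ + _)
      println(s"Sum of doubled values: $total")
    } finally {
      spark.stop()
    }
  }
}
```

If your build marks the Spark jars as provided, you may need to tell the IDE run configuration to include provided-scope dependencies, otherwise the classes won’t be found at run time.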
Landon Robinson, Senior Software Engineer