This blog post is an excerpt taken from the SpotX Big Data talk at BELFOSS 2018. The full talk can be viewed here: https://www.youtube.com/watch?v=OlmbuyCH4lQ
SpotX is a video ad serving platform based in Colorado, USA, and chances are you have interacted with our service without knowing it.
We work with publishers to provide targeted video advertisements to consumers, automating sales of video inventory that align with the brand integrity of our customers. The publishers who use this service have specific targeting needs and require highly granular data and detailed reports, delivered as quickly as possible, to measure the effectiveness of their content placement.
What does this mean in terms of data?
SpotX handles millions of HTTP requests per second and delivers a response (from receiving an ad call to serving an impression) within milliseconds. This is no easy feat, and providing this level of service requires a ton of processing power.
That’s what ‘big data’ means for us.
Clearly, this volume of data cannot be stored on a single machine. This is where a distributed approach to data storage comes in. Distributed data is data stored across multiple machines in various locations, linked by a network. It provides:
- Scalability — spreading the load across multiple machines. To handle more data, add more machines.
- Fault tolerance & high availability — if a server goes down due to a fault or an upgrade, another can take over, keeping the system running despite unexpected failures.
- Low latency — users can be served from a data center that is geographically close to them.
We have four data centers globally, handling tens of terabytes of data daily and currently storing multiple petabytes. We have thousands of servers running our auctions, Hadoop, Kafka and Druid, and SpotX continues to invest in big data technologies to bolster our core reporting and analytics capabilities at this scale.
A beginner’s experience of working with data technologies
I have been working at SpotX for over a year. In January, I had the opportunity to join the team focused on data.
Coming from a web development background, I didn’t really have to think much about infrastructure, networking or resource usage. I just wrote code and handed it to someone else to deal with.
Now I have to think about servers, cluster resources and workloads based on the time of day across the globe!
I also had to reset my development workflow expectations. For example, running Spark jobs, which we use to aggregate data among other things, is a lengthy process: you make changes, package up a jar, run it on a dev cluster and monitor task/job times.
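To give a flavour of what such a job looks like, here is a minimal sketch of a Spark aggregation in Scala. The input path, column names and output path are all hypothetical; a real job would read our actual log schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ImpressionCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("impression-counts").getOrCreate()

    // Read a day of ad-serving logs from HDFS (path and schema are made up).
    val events = spark.read.parquet("hdfs:///data/logs/2018/06/01")

    // Keep only impression events and count them per publisher --
    // the kind of roll-up a reporting job might produce.
    val counts = events
      .filter(col("event_type") === "impression")
      .groupBy("publisher_id")
      .count()
      .withColumnRenamed("count", "impressions")

    // Write the aggregate back to HDFS for downstream reporting.
    counts.write.mode("overwrite").parquet("hdfs:///data/reports/impressions/2018/06/01")
    spark.stop()
  }
}
```

In practice, a job like this gets packaged into a jar and submitted to the cluster with spark-submit, which is where the monitoring of task and job times comes in.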
Spark is just one of the new technologies I have had to learn. We also make use of a lot of other big open-source projects.
My beginner’s guide to Hadoop
Arguably one of the most well-known names in big data processing is Hadoop. Hadoop was formed in 2006, spun out of Nutch, a web crawler built under the Lucene project by Doug Cutting, who later brought Hadoop to Yahoo!.
Hadoop is an Apache software library and framework that allows for the distributed processing of large data sets across clusters of computers. It is a cheap, scalable way to store and process large quantities of data across multiple computers, usually at petabyte scale. Hadoop allows us to handle the massive amounts of data produced on a daily basis, providing storage, processing and speed.
Storage — The Hadoop Distributed Filesystem
The Hadoop Distributed Filesystem (HDFS) is a module shipped with Hadoop. At its heart, it is a clustered file system which holds all of the data yet presents itself as a single file system, hiding operational complexity from users. It does this using nodes, i.e. cheap, commodity computers.
Within HDFS, there are the DataNodes and the NameNode.
The NameNode is the master node and acts as the coordinator for all of the DataNodes. It is responsible for all the HDFS metadata, regulates access to files, and fulfills file system requests (e.g. keeping track of the number of copies of a file, changes to files, and so on). There is typically one active NameNode at a time, with another available as a backup if needed.
The DataNodes are responsible for serving read and write requests from the file system’s clients, and there can be as many of these as required.
Super simple HDFS diagram
You can query HDFS on the command line (prefacing commands with ‘hdfs dfs’), or you can query it in a SQL-like way using Hive. Hive was built by Facebook and allows the user to write queries instead of searching through the filesystem. A query is transformed into a job which grabs all the data you need from HDFS and presents it in an easy-to-read format. This makes HDFS a lot easier to use, especially for data scientists or non-coders.
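As a rough illustration, here is what each approach can look like (the paths, table and column names here are hypothetical):

```shell
# Browsing HDFS directly from the command line
hdfs dfs -ls /data/logs/2018/06/01
hdfs dfs -cat /data/logs/2018/06/01/part-00000 | head

# The same data queried through Hive in a SQL-like way
hive -e "SELECT publisher_id, COUNT(*) AS impressions
         FROM ad_events
         WHERE dt = '2018-06-01'
         GROUP BY publisher_id;"
```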
We have processes that run at frequent intervals, creating log files designed to be compressed and consumed by Hadoop.
These files are placed in Hadoop’s distributed filesystem, where they are replicated for fault tolerance and distributed processing. Once the data is in Hadoop, we can make it available for analytics and reporting!
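For instance, loading a compressed log file into HDFS can be as simple as the following (the file and directory names are made up for illustration):

```shell
# Upload a compressed log file into HDFS
hdfs dfs -put access_log.2018-06-01.gz /data/logs/2018/06/01/

# The second column of the listing shows each file's replication factor
hdfs dfs -ls /data/logs/2018/06/01/
```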
As well as the storage of data, it’s worth briefly mentioning how data is processed. MapReduce is Hadoop’s data processing model and allows for parallelism (running on multiple nodes at the same time). A MapReduce job usually runs in three main stages: map, shuffle and reduce.
The diagram below shows a simple word count example, which counts the number of occurrences of each word within a given input. Note the different stages:
- The ‘splitting’ stage processes one line at a time.
- ‘Mapping’ emits a key-value pair for each word, with the word as the key and 1 as the value.
- ‘Shuffling’ sorts the output of the mappers.
- The ‘reduce’ stage sums the values for each key, producing the occurrence count for each word.
The key takeaway of MapReduce is that the input is split up between all the available nodes, the processing is shared between them to spread the workload, and the output is gathered at the end of the task. This is what makes it possible to process huge amounts of data. A toy version of the word count dataflow is sketched below.
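This sketch models the same split/map/shuffle/reduce dataflow using plain Scala collections on a single machine; it is not the Hadoop API, just the shape of the computation:

```scala
object WordCountModel {
  def main(args: Array[String]): Unit = {
    // Input "file": each element stands in for one line handled by the
    // splitting stage.
    val lines = Seq("deer bear river", "car car river", "deer car bear")

    // Mapping: emit a (word, 1) pair for every word in every line.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split(" ").map(word => (word, 1)))

    // Shuffling: group the pairs by key so all of a word's 1s end up together.
    // Reducing: sum the 1s for each key to get the occurrence counts.
    val counts: Map[String, Int] =
      mapped.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2).sum }

    counts.foreach { case (word, n) => println(s"$word $n") }
  }
}
```

In real MapReduce, the grouping step happens across the network between nodes, which is why the shuffle is usually the expensive part of a job.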
As well as the core components of Hadoop, we also make use of some additional tools. I won’t go into detail here, but you can check out the links below if you are interested:
- Sqoop — a tool for transferring data between HDFS and relational databases like MySQL
- YARN — manages job scheduling and the efficient allocation of cluster resources
- ZooKeeper — provides coordination between the nodes in the cluster.
As different processing techniques have evolved over time to tackle our ever-increasing amount of data and the need to process it faster, we have employed other technologies such as Spark, Kafka, Airflow and Druid. These will be covered in future blog posts.
Jennifer Hanratty, Data Application Engineer