Top 5 challenges and recipes when starting an Apache Spark project

Apache Spark is a powerful distributed computation engine for big data analytics. Besides the core engine, it also provides libraries for streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX). Historically, Spark emerged as a successor to the Hadoop ecosystem with the following key advantages: Spark provides…
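To make the excerpt concrete, here is a minimal sketch of a Spark application using the DataFrame API. The file path and column name are hypothetical placeholders, and the local master is only for experimentation; on a real cluster the master is supplied by the launcher instead.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkIntro {
    public static void main(String[] args) {
        // Local session for experimentation; in production the master
        // is usually supplied by spark-submit rather than hard-coded.
        SparkSession spark = SparkSession.builder()
                .appName("spark-intro")
                .master("local[*]")
                .getOrCreate();

        // Read a CSV file into a DataFrame and run a trivial aggregation.
        // The input path and the "eventType" column are placeholders.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("data/events.csv");

        df.groupBy("eventType").count().show();

        spark.stop();
    }
}
```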

How to deploy a Spark job to a cluster

In the previous post we created a simple Spark job and executed it locally on a single machine. The main goal of the Spark framework, however, is to utilize the resources of a cluster of multiple servers and thereby increase data processing throughput. In real life the amount of data…
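The post itself walks through the deployment steps; as a rough illustration, a packaged job is typically handed to the cluster with Spark's spark-submit tool. The class name, jar path, standalone master URL, and resource settings below are placeholders, not values taken from the post.

```bash
# Package the application (e.g. with Maven), then submit it to the cluster.
# The class name, jar path, and master URL are placeholders.
spark-submit \
  --class com.example.MySparkJob \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  --total-executor-cores 8 \
  target/my-spark-job-1.0.jar
```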

How to write a Big Data application with Java and Spark

Spark is a modern Big Data framework for building highly scalable, feature-rich data transformation pipelines. Its main advantages are simplicity and high performance compared to its predecessor, Hadoop. You can write Spark applications in three main languages: Scala, Java, and Python. In this guide I will show you…
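As a taste of what such a guide covers, here is a minimal, self-contained word count written against Spark's Java RDD API. The input path is a placeholder, and the local master is only for a quick test run.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("word-count")
                .master("local[*]") // local master for a quick test run
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Split each line into words, map each word to (word, 1),
        // then sum the counts per word. The input path is a placeholder.
        JavaRDD<String> lines = sc.textFile("data/input.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Print a small sample of the results.
        counts.take(10).forEach(t -> System.out.println(t._1 + ": " + t._2));

        spark.stop();
    }
}
```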