<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Zentaly Blog]]></title><description><![CDATA[Ideas and solutions born inside our team]]></description><link>https://blog.zentaly.com/</link><image><url>https://blog.zentaly.com/favicon.png</url><title>Zentaly Blog</title><link>https://blog.zentaly.com/</link></image><generator>Ghost 4.43</generator><lastBuildDate>Fri, 10 Apr 2026 13:25:31 GMT</lastBuildDate><atom:link href="https://blog.zentaly.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Top 5 challenges and recipes when starting Apache Spark project]]></title><description><![CDATA[Zentaly team has years of experience building big data business solutions. In this article I would like to share Top 5 challenges and recipes when starting Apache Spark project.]]></description><link>https://blog.zentaly.com/top-5-challenges-and-recipes-when-starting-apache-spark-project/</link><guid isPermaLink="false">639ca9c06934cd06f238917e</guid><category><![CDATA[big data]]></category><category><![CDATA[spark]]></category><category><![CDATA[enterprise]]></category><category><![CDATA[dataanalytics]]></category><category><![CDATA[zentadata]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Fri, 16 Dec 2022 18:05:48 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/12/jakub-skafiriak-AljDaiCbCVY-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/12/jakub-skafiriak-AljDaiCbCVY-unsplash.jpg" alt="Top 5 challenges and recipes when starting Apache Spark project"><p>Apache Spark is one of the most advanced and powerful computation engines for big data analytics. Besides the core data engine, it also provides libraries for streaming (Spark Streaming), machine learning (MLlib) and graph processing (GraphX).</p>
<p>Historically, Spark emerged as a successor of the Hadoop ecosystem with the following key advantages:</p>
<ol>
<li>Spark provides a significant performance improvement by keeping data in memory, while Hadoop relies on slow disk access</li>
<li>Spark has rich programming APIs compared to Hadoop&#x2019;s restrictive map-reduce model</li>
<li>Spark is compatible with Hadoop, so it is possible to run Spark on top of an existing Hadoop ecosystem</li>
</ol>
<p>The Zentaly team has years of experience building Spark-based big data business solutions. In this article I would like to share our knowledge with everyone who is looking forward to starting a new Apache Spark project.</p>
<h1 id="challenges">Challenges</h1>
<p>The content is structured as a series of &#x201C;challenge and recipe&#x201D; pairs to make it easier to follow.</p>
<h2 id="challenge-1-do-we-really-need-spark">Challenge #1:  Do we really need Spark?</h2>
<p>As with most software project kick-offs, the business side might be curious why resources and money should be allocated to a particular project. While the benefits might look quite obvious to software engineers, it always takes extra work to translate them into KPIs/OKRs.</p>
<h5 id="recipe">Recipe</h5>
<p>Translate technical benefits into numeric values (KPIs/OKRs) so that business stakeholders can understand and measure their impact on the business.</p>
<h5 id="examples">Examples</h5>
<ol>
<li>You are going to decrease &#x201C;Shopping Cart Checkout&#x201D; processing time from 2 hours down to 10 minutes per user. In practice it means that users will get better response times and become less disappointed. From the business perspective it means that purchase returns will decrease by 5% and the number of support calls will decrease by 15%, which in total will increase revenue by 3% and decrease costs by 7% (specific numbers must be precisely measured).</li>
<li>You are going to extend ad campaign targeting dimensions to improve served ad quality. Logically it will provide more suitable ads for each particular user. From the business perspective it could be represented as a 5% increase in CTR and CPI rates.</li>
</ol>
<h2 id="challenge-2-we-can-do-it-ourself">Challenge #2:  We can do it ourselves</h2>
<p>&#x201C;As people are building cars and rockets, it&#x2019;s quite obvious that we can do it ourselves&#x201D;. That&#x2019;s exactly what some engineers might say when facing a new technology stack, especially one as complex as Spark.</p>
<h5 id="recipe">Recipe</h5>
<p>Introducing a new framework like Apache Spark, or any other technology, comes with a learning curve. Investing one&#x2019;s own time in learning is the key to building a deep understanding of the technology and improving the team&#x2019;s skills.</p>
<p>So if you are going to start working with Spark, you can find developer volunteers who want to attend specialised trainings, or organise internal knowledge-sharing sessions run by skilled employees.</p>
<p>The more traction you are able to build, the deeper the understanding and the stronger the skill set that will grow inside the development team.</p>
<h2 id="challenge-3-we-decided-to-start-using-spark-how-can-we-integrate-it-with-existing-infrastructure">Challenge #3: We decided to start using Spark, how can we integrate it with existing infrastructure?</h2>
<p>While creating Spark apps is relatively simple (<a href="https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark/">you can read about it in my previous article</a>), running Spark in production and integrating it with existing apps can be quite a challenging task. The reason for this is an old idea born in the Hadoop era: &#x201C;Moving code is cheaper than moving data&#x201D;. As a successor of Hadoop, Spark follows this pattern accordingly.</p>
<p>The &#x201C;moving code&#x201D; concept assumes that a big data application should be built as a standalone module and shipped to a runtime environment (usually a cluster) where it will be executed and managed by that environment in a very specific way. The good reason for this pattern to exist is the high level of complexity that such an environment must address, including the following:</p>
<ul>
<li>Automatic code distribution and execution on thousands of physical servers</li>
<li>Horizontal scalability</li>
<li>High availability</li>
</ul>
<p>As of today, the Spark execution environment, in a very minimalistic form, can be represented with the following diagram:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/12/big-data-layers.svg" class="kg-image" alt="Top 5 challenges and recipes when starting Apache Spark project" loading="lazy" width="1000" height="570"><figcaption>Spark runtime environment</figcaption></figure><!--kg-card-begin: markdown--><p>The variety of tools and platforms involved makes the entire system quite complex. What initially looked like a pretty simple Spark app ends up in real life as a big standalone ecosystem.</p>
<h5 id="recipe-1">Recipe #1</h5>
<p>If you are a big company with a lot of resources, most probably you will end up buying a rather expensive solution from a 3rd-party vendor (e.g. Cloudera or Databricks) or even building this ecosystem yourself.</p>
<p>Unfortunately, integrating the big data stack with existing services additionally requires building a middleware integration layer. Usually this middleware is implemented as one of:</p>
<ul>
<li>Service-to-service communication via REST or, more rarely, SOAP</li>
<li>Service communication via Enterprise Service Bus solutions (RabbitMQ, IBM WebSphere, Kafka, Amazon SQS, etc.)</li>
<li>Data sharing via a plain relational database or even files on distributed filesystem storage</li>
</ul>
<p>Overall this is the most popular approach nowadays, and its cost is the reason big data solutions are so rarely seen in mid-size companies.</p>
<h5 id="recipe-2">Recipe #2</h5>
<p>If your company is not too big and the amount of data you work with is limited, you have a decent option: using Spark as a library inside your existing microservice (this works for any JVM-based language and for Python). In this case you can just include the Spark library as a dependency and start writing big data code directly inside your existing app. It does have the following limitations, though:</p>
<ul>
<li>Data processing will be limited to a single machine where your service is running</li>
<li>Scalability is still possible, but only if you can partition your data into chunks by some criterion (preferably a business domain) and make each service instance responsible for a single chunk</li>
<li>You might need to solve Java dependency conflicts due to the large number of 3rd-party libraries Spark pulls onto the Java classpath (not a problem for PySpark)</li>
</ul>
<p>The combination of simplicity and pragmatism makes this approach quite suitable for some projects, but obviously not all.</p>
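<p>For reference, embedding Spark as a library in a Maven-based JVM service comes down to a single dependency (the artifact below is the standard Spark SQL module; the version is just an example, pick a current one):</p>

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.3.1</version>
</dependency>
```

<p>With this on the classpath, <code>SparkSession.builder().master("local[*]").getOrCreate()</code> gives you a fully functional single-machine Spark session inside the service process.</p>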
<h2 id="challenge-4-troubleshooting-spark-failures">Challenge #4: Troubleshooting Spark failures</h2>
<p>While looking deceptively simple, Spark apps usually process terabytes of data, so even a simple mistake can have a big impact. We can split all Spark failures into two main categories and review them separately.</p>
<h5 id="case-1-application-failures">Case #1: Application failures</h5>
<p>This category includes all types of errors caused by design or development mistakes. Most often they show up as OOM (Out of Memory) errors.</p>
<h5 id="recipe">Recipe</h5>
<p>Spark provides a nice Spark UI app which helps developers understand how data flows between the data processing steps (aka stages). With it you should be able to detect at which step you got the OOM and how much data was processed at that step.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/12/spark-ui.png" class="kg-image" alt="Top 5 challenges and recipes when starting Apache Spark project" loading="lazy" width="2000" height="1464" srcset="https://blog.zentaly.com/content/images/size/w600/2022/12/spark-ui.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/12/spark-ui.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/12/spark-ui.png 1600w, https://blog.zentaly.com/content/images/2022/12/spark-ui.png 2208w" sizes="(min-width: 1200px) 1200px"><figcaption>Spark UI</figcaption></figure><!--kg-card-begin: markdown--><p>By default Spark distributes the data load between cluster nodes quite effectively, so you should not hit a memory bottleneck on a single node. Unfortunately, at the Spark API level you can force operations that cause data to be aggregated on a single node, such as the following:</p>
<ul>
<li><code>dataframe.collect()</code> - loads all data into memory on the driver node</li>
<li><code>dataframe.groupBy()</code> - aggregates all data by key; if the key has low cardinality, the entire dataframe might end up in a single partition and exceed the worker&#x2019;s memory limit</li>
<li><code>dataframe.join()</code> (left/right/outer) - an invalid join key can cause significant data multiplication. While this is purely a business logic error, it surfaces as an OOM, which makes it hard to investigate</li>
</ul>
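<p>The <code>groupBy</code> skew problem can be illustrated without Spark at all. The sketch below is plain Java (illustrative only; <code>SkewDemo</code> and its methods are made up for this article, not Spark API). It applies the same hash-partitioning idea Spark uses by default and shows how a single dominant key sends the whole dataset into one partition:</p>

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

public class SkewDemo {
    // Same idea as Spark's default hash partitioner:
    // a record goes to partition hash(key) mod numPartitions.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(Objects.hashCode(key), numPartitions);
    }

    // Count how many records land in each partition.
    static Map<Integer, Long> distribution(List<String> keys, int numPartitions) {
        return keys.stream().collect(Collectors.groupingBy(
                k -> partitionFor(k, numPartitions),
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // 1,000,000 events, but the grouping key takes a single value:
        // every record hashes to the same partition, so one executor
        // holds the entire dataset and is the first to go OOM.
        List<String> skewedKeys = Collections.nCopies(1_000_000, "view");
        System.out.println(distribution(skewedKeys, 8)); // one bucket gets all records
    }
}
```

<p>Typical mitigations are choosing a more selective grouping key or "salting" the key with a random suffix before the wide aggregation.</p>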
<h5 id="case-2-environment-failures">Case #2: Environment failures</h5>
<p>By design, the Spark runtime environment is a cluster deployed across multiple physical nodes (servers). Each node runs multiple Java processes communicating with each other. The following process types form the cluster topology:</p>
<ul>
<li><strong>Cluster Manager</strong> - cluster-wide process managing the entire Spark cluster (aka Master)</li>
<li><strong>Worker</strong> - node-level process responsible for app execution, spawning multiple Executors</li>
<li><strong>Driver</strong> - application-level process responsible for converting your code into multiple tasks</li>
<li><strong>App Manager</strong> - application-level process responsible for resource negotiation with the Cluster Manager (might run in the same JVM as the Driver)</li>
<li><strong>Executor</strong> - application-level processes (multiple) responsible for task execution</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/12/spark-cluster-topology.svg" class="kg-image" alt="Top 5 challenges and recipes when starting Apache Spark project" loading="lazy" width="538" height="614"><figcaption>Spark Cluster Topology</figcaption></figure><!--kg-card-begin: markdown--><p>Each process is a standalone Java app with specific configuration settings, which must be tuned for optimal execution on the provided hardware resources.</p>
<p>Any misconfiguration of a particular process might make the cluster unhealthy while running big data loads. The full configuration list is out of scope for this article and is quite an involved topic in itself.</p>
<h5 id="recipe-1">Recipe #1</h5>
<p>Cluster configuration is highly challenging, so it is no surprise that specialised solutions have been created: commercial ones (Databricks, Cloudera) and open source (Apache Ambari).</p>
<h5 id="recipe-2">Recipe #2</h5>
<p>You can also decide to manage your cluster configuration yourself. In this case you should expect a really long journey before it becomes stable enough for production use. Depending on the team&#x2019;s expertise, it can take from one month to a year.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="challenge-5-data-security">Challenge #5: Data security</h2>
<p>The last but not least challenge is data security. With CCPA and GDPR regulations arriving in the enterprise world, data security is a must nowadays. Open research has shown that 49% of enterprises have faced data security issues in their big data projects.</p>
<p>Out of the box, Spark has no solution for securing sensitive user data.</p>
<h5 id="recipe">Recipe</h5>
<p>To support data security, it is required to take this challenge seriously right at project kick-off. The architecture design strategy should treat data security as a first-class citizen. Deferring data security to a later stage might require an entire system rewrite and waste a lot of resources.</p>
<p>The design approach should take into account multiple data security requirements and address them accordingly:</p>
<ul>
<li>Data removal - how will data be removed on a user&#x2019;s request?</li>
<li>Data cleanup - how will the system identify and remove sensitive user information?</li>
<li>Data anonymisation/pseudonymisation - how will users&#x2019; data be secured without losing the ability to analyse it?</li>
</ul>
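<p>As a minimal illustration of the last point, pseudonymisation can be as simple as replacing a direct identifier with a salted hash, so records remain joinable for analytics without storing the raw value. The sketch below is plain, hypothetical Java (not Spark or Zentadata API):</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Pseudonymizer {
    // Replace a direct identifier with a salted SHA-256 hash. The same
    // (salt, id) pair always yields the same token, so joins and
    // aggregations still work, but the raw id is not stored.
    static String pseudonymize(String userId, String secretSalt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(secretSalt.getBytes(StandardCharsets.UTF_8));
            md.update(userId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

<p>The salt must live in a secret store; note that proper anonymisation is a much broader topic (k-anonymity, differential privacy) than this simple token substitution.</p>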
<h1 id="summary">Summary</h1>
<p>We have reviewed the most critical challenges you might face when starting a new Apache Spark project, together with practical recipes proven to work by our long-term experience.</p>
<p>Nevertheless, we highly recommend <a href="https://zentadata.com">trying our Zentadata Platform for free</a>; it addresses all of these technical challenges and provides a lot of extra features, such as data security and the ability to run scalable big data jobs from within a standalone microservice.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How to work with Big Data from Java Spring applications]]></title><description><![CDATA[This article shows how to get around all tough stuff related to Big Data infrastructure, work with data fast and comfortably, without thinking about code deployment, keeping focused on business goals and getting things done as quickly as possible.]]></description><link>https://blog.zentaly.com/how-to-work-with-big-data-from-java-spring-applications/</link><guid isPermaLink="false">6370e52fd6c5d80707b24268</guid><category><![CDATA[java]]></category><category><![CDATA[spring]]></category><category><![CDATA[big data]]></category><category><![CDATA[enterprise]]></category><category><![CDATA[zentadata]]></category><category><![CDATA[dataanalytics]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Wed, 16 Nov 2022 12:32:24 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/11/spring-big-data.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/11/spring-big-data.jpg" alt="How to work with Big Data from Java Spring applications"><p>This article shows how to get around all the tough stuff related to Big Data infrastructure and how to work with data fast and comfortably, without thinking about code deployment, staying focused on business goals and getting things done as quickly as possible. The Zentadata Platform is the answer: it ships with an SDK that can simply be added to your Java application, so you can start developing Big Data applications right away.</p>
<p>For demo purposes we will build a pet Spring Boot microservice exposing a few REST endpoints, each of which triggers a different big data task on the Zentadata Cluster.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/spring-boot-diagram.svg" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="853" height="560"><figcaption>Java Spring microservice runs Big Data jobs with Zentadata SDK</figcaption></figure><!--kg-card-begin: markdown--><p>In our previous article related to Data Studio you can also read about how to work with multiple data sources and files, and how to join them together: <a href="https://blog.zentaly.com/how-to-use-zentadata-data-studio-to-work-with-data/">Data analytics for everyone with Zentadata Data Studio</a>.</p>
<p><strong>Note:</strong> to install local developer cluster on your machine please read our <a href="https://blog.zentaly.com/how-to-install-and-run-zentadata-developer-edition/">Quick start guide &#x1F680;</a> article.</p>
<h1 id="data-analytics-use-caseretailrocket-dataset">Data analytics use case - RetailRocket dataset</h1>
<p>We will start with some data first. I have found the public dataset <a href="https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset">https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset</a>, which looked interesting to me. It contains a reasonable amount of data, so you can play with it locally using the Zentadata Developer Edition. There is also information about the data structure and a description on the mentioned site, but for demo purposes and simplicity I will take the <code>events.csv</code> file and work with it.</p>
<p>Alright, so let&#x2019;s take a look at the <code>events.csv</code> file; I will do it from Data Studio.</p>
<pre><code>DataFrame events = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(&quot;file:/home/dmch/datasets/RetailRocket/events.csv&quot;);
 
events.limit(5).execute().show();

OUTPUT:
+-------------+---------+-----+------+-------------+
|timestamp    |visitorid|event|itemid|transactionid|
+-------------+---------+-----+------+-------------+
|1433221332117|257597   |view |355908|null         |
|1433224214164|992329   |view |248676|null         |
|1433221999827|111016   |view |318965|null         |
|1433221955914|483717   |view |253185|null         |
|1433221337106|951259   |view |367447|null         |
+-------------+---------+-----+------+-------------+
</code></pre>
<p>As you can see, and as per the description on the website, this is user behaviour data: events like views, add-to-carts and transactions represent interactions that were collected over a period of 4.5 months.</p>
<p>Now I need to build a Spring Boot microservice with 4 REST endpoints that allow me to query events and create different report types in my Postgres DB based on the data we have:</p>
<ul>
<li><code>/api/events/{limit}</code> - gets a given number of events and sends them back as a response</li>
<li><code>/api/events/user/{userId}</code> - gets all events of a specific visitor</li>
<li><code>/api/events/daily-report</code> - generates a daily report for a particular date</li>
<li><code>/api/events/sales-report</code> - generates a sales report for a time interval; for this one the data will be partitioned by date, which is close to a real-life scenario with a lot of data</li>
</ul>
<h1 id="create-spring-boot-project">Create Spring Boot project</h1>
<p>Let&#x2019;s go to <a href="https://start.spring.io/">https://start.spring.io/</a>, pick the dependencies we need and create a project. Here is my setup:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/spring-boot-start-page-1.png" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="1510" height="760" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/spring-boot-start-page-1.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/spring-boot-start-page-1.png 1000w, https://blog.zentaly.com/content/images/2022/11/spring-boot-start-page-1.png 1510w" sizes="(min-width: 1200px) 1200px"><figcaption>Spring Boot starter</figcaption></figure><!--kg-card-begin: markdown--><p>After the project is downloaded, let&#x2019;s add dependency to the Zentadata client library.</p>
<pre><code>&lt;repositories&gt;
    &lt;repository&gt;
        &lt;id&gt;Zentaly&lt;/id&gt;
        &lt;url&gt;https://libs.zentaly.com&lt;/url&gt;
    &lt;/repository&gt;
&lt;/repositories&gt;

&lt;dependency&gt;
    &lt;groupId&gt;com.zentadata&lt;/groupId&gt;
    &lt;artifactId&gt;client&lt;/artifactId&gt;
    &lt;version&gt;0.3&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>Now comes pretty familiar stuff for every Java developer who has ever worked with Spring Boot: let&#x2019;s create a Controller, a Service, etc. Here is my <code>ReportsController</code>:</p>
<pre><code>@RestController
@RequestMapping(&quot;/api&quot;)
public class ReportsController {
    private final ReportService reportService;

    public ReportsController(ReportService reportService) {
        this.reportService = reportService;
    }

    @GetMapping(&quot;/events/{limit}&quot;)
    public List&lt;Event&gt; getEvents(@PathVariable int limit) {
        return reportService.getEvents(limit);
    }

    @PostMapping(&quot;/events/user/{userId}&quot;)
    public List&lt;Event&gt; getEventsByUserId(@PathVariable Long userId) {
        return reportService.getEventsByUserId(userId);
    }

    @PostMapping(&quot;/events/daily-report&quot;)
    public void generateDailyReport(@RequestParam @DateTimeFormat(pattern = &quot;yyyy-MM-dd&quot;) Date date) {
        reportService.generateDailyReport(date);
    }

    @PostMapping(&quot;/events/sales-report&quot;)
    public void generateSalesReport(
            @RequestParam @DateTimeFormat(pattern = &quot;yyyy-MM-dd&quot;) Date fromDate,
            @RequestParam @DateTimeFormat(pattern = &quot;yyyy-MM-dd&quot;) Date toDate) {
        reportService.generateSalesReport(fromDate, toDate);
    }
}
</code></pre>
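<p>The <code>Event</code> payload class itself is not shown in the article; a plausible minimal shape matching the <code>events.csv</code> columns (the field names follow the CSV header, but the exact types are my assumption, not the article&#x2019;s actual code) could be:</p>

```java
// Hypothetical DTO mirroring the events.csv columns. In the real service
// the timestamp is serialized as an ISO-8601 date in JSON responses,
// which hints at a java.util.Date field there; epoch millis are used
// here to keep the sketch dependency-free.
public record Event(
        long timestamp,      // epoch milliseconds
        long visitorid,
        String event,        // "view" | "addtocart" | "transaction"
        long itemid,
        Long transactionid   // null for non-transaction events
) {}
```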
<p>Now we can do some real coding and implement each method of <code>ReportService</code> one by one.</p>
<h3 id="case-1get-first-n-events-from-csv-file">Case #1 - get first N events from CSV file</h3>
<p>We start with the simplest one - the <code>getEvents</code> method. We simply need to get a number of records (limited by a request parameter) from our events file and return them as a response. It&#x2019;s quite simple.<br>
First I need to specify the Zentadata Cluster connection properties (host, port, authentication details if needed, etc.).</p>
<pre><code>ZenSession zen = new ZenSession(&quot;localhost&quot;, 8090);
</code></pre>
<p>Next we define the DataFrame: where it should take the data from and how to transform it if needed.</p>
<pre><code>DataFrame events = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(&quot;file:/home/dmch/datasets/RetailRocket/events.csv&quot;);
</code></pre>
<p>And here is the implementation of the <code>getEvents</code> method, which simply collects the data and returns it as the server response.</p>
<pre><code>public List&lt;Event&gt; getEvents(int limit) {
    return events
            .limit(limit)
            .execute()
            .getPayload(Event.class);
}
</code></pre>
<p>So that&#x2019;s it, let&#x2019;s start our app, go to the terminal and check the result.</p>
<pre><code>curl -s -X GET &quot;http://localhost:9090/api/events/2&quot; | jsonpp
[
  {
    &quot;timestamp&quot;: &quot;2015-06-02T05:02:12.117+00:00&quot;,
    &quot;visitorid&quot;: 257597,
    &quot;event&quot;: &quot;view&quot;,
    &quot;itemid&quot;: 355908,
    &quot;transactionid&quot;: 0
  },
  {
    &quot;timestamp&quot;: &quot;2015-06-02T05:50:14.164+00:00&quot;,
    &quot;visitorid&quot;: 992329,
    &quot;event&quot;: &quot;view&quot;,
    &quot;itemid&quot;: 248676,
    &quot;transactionid&quot;: 0
  }
]
</code></pre>
<h3 id="case-2get-all-events-for-a-specific-visitor">Case #2 - get all events for a specific visitor</h3>
<p>Now we want to limit events to the ones that belong to a specific user.<br>
This can be done using the <code>where</code> operator (pretty similar to SQL).</p>
<pre><code>public List&lt;Event&gt; getEventsByUserId(Long userId) {
    return events
            .where(col(&quot;visitorid&quot;).equalTo(userId))
            .execute()
            .getPayload(Event.class);
}
</code></pre>
<p><strong>Note:</strong> please notice that we use the <code>equalTo</code> operator for comparison and not the standard <code>==</code> operator. In fact our Java application <strong>defines</strong> the data processing job with this DSL statement, while the actual heavy <strong>data processing</strong> runs on the Zentadata Cluster.</p>
<h3 id="case-3generate-events-daily-report">Case #3 - generate events daily report</h3>
<p>Let&#x2019;s do something different now, but first a couple of words about how the Zentadata Platform works. As you might assume, the volume of data can be huge, so you will not be able to download it into the Java heap and manipulate it there. You would simply run out of memory and your application would fail.</p>
<p>That&#x2019;s why all data processing is performed on the Zentadata Cluster. Our Java microservice can run on a tiny server with a small amount of CPU and RAM, yet process terabytes of data on the cluster.</p>
<p>In our case, generating a daily report might produce a huge amount of data. So instead of loading the data into the Java heap, I will store it in a relational database. Later that report can be used by a UI or other downstream consumers, which is a pretty standard approach.</p>
<p>Again, using the Zentadata SQL-like DSL, it is pretty simple and straightforward:</p>
<pre><code>public void generateDailyReport(Date date) {
    long from = date.getTime();
    long to = date.getTime() + 24 * 60 * 60 * 1000;
    DataFrame report = events.where(col(&quot;timestamp&quot;).geq(from)
            .and(col(&quot;timestamp&quot;).leq(to)));

    report.write(&quot;postgres&quot;)
            .option(SaveMode.KEY, SaveMode.OVERWRITE)
            .to(format(&quot;daily_report_%s&quot;, dateFormatForTableName.format(date)));
}
</code></pre>
<p>That&#x2019;s actually it: I have added one <code>where</code> clause, and as a second step we <code>write</code> the selected data to the Postgres DB, into the <code>daily_report_&lt;date&gt;</code> table. If that table does not exist it will be created automatically, and each re-run of this operation (triggering the REST endpoint in our case) will overwrite the data.</p>
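<p>A side note on the time window: <code>date.getTime() + 24 * 60 * 60 * 1000</code> works, but the same boundaries can be computed more explicitly with <code>java.time</code> (a plain-Java sketch; the dataset&#x2019;s <code>timestamp</code> column is epoch milliseconds, assumed UTC here):</p>

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DayWindow {
    // Epoch-millis boundaries [from, to) of one UTC calendar day.
    static long[] window(String isoDate) {
        long from = LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
        long to = from + Duration.ofDays(1).toMillis();
        return new long[] {from, to};
    }

    public static void main(String[] args) {
        long[] w = window("2015-06-02");
        // The first event shown earlier (timestamp 1433221332117)
        // falls inside this window.
        System.out.println(w[0] + " .. " + w[1]);
    }
}
```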
<p>Let&#x2019;s call the <code>/api/events/daily-report</code> endpoint and provide a report date. After successful execution we will find a new table, <code>daily_report_2015_06_02</code>, created in Postgres.</p>
<pre><code>curl -X POST &quot;http://localhost:9090/api/events/daily-report&quot; -d &quot;date=2015-06-02&quot;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-20.48.11.png" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="1876" height="1088" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/Screenshot-2022-11-13-at-20.48.11.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/Screenshot-2022-11-13-at-20.48.11.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/11/Screenshot-2022-11-13-at-20.48.11.png 1600w, https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-20.48.11.png 1876w" sizes="(min-width: 720px) 720px"><figcaption>Daily report table created in postgres</figcaption></figure><!--kg-card-begin: markdown--><h3 id="case-4generate-total-sales-report">Case #4 - generate total sales report</h3>
<p>Now let&#x2019;s take a look at a more realistic example of data processing. I did some preparation on the mentioned dataset and partitioned <code>events.csv</code> by date. Now I have a folder for each date, as on the screenshot below. As a result we have data for 139 days: around 3 million events and 1.5 million unique visitors.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-21.14.55.png" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="1760" height="1210" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/Screenshot-2022-11-13-at-21.14.55.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/Screenshot-2022-11-13-at-21.14.55.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/11/Screenshot-2022-11-13-at-21.14.55.png 1600w, https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-21.14.55.png 1760w" sizes="(min-width: 720px) 720px"><figcaption>Events partitioned by date</figcaption></figure><!--kg-card-begin: markdown--><p><strong>Note:</strong> In real life, partitioning is a very important technique for dealing with big data. Some of our clients generate terabytes of data every day in distributed filesystems like <code>HDFS</code> or <code>AWS S3</code>. Storing data in separate folders helps decrease the amount of data to be processed when we only need to analyse a limited time window (day/week/month). Otherwise, if the data is stored as a single folder, we have to do a full scan even when only one specific day should be analysed.</p>
<p>In our case we can mimic partitioning using the local filesystem, as the approach is the same as with enterprise-grade distributed data storage.</p>
<pre><code>public void generateSalesReport(Date fromDate, Date toDate) {
    new Thread(() -&gt; {
        long from = fromDate.getTime();
        long to = toDate.getTime();
        DataFrame report = zen
                .read(&quot;localfs&quot;)
                .format(DataFormat.CSV)
                .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
                .from(&quot;file:/home/dmch/datasets/RetailRocket/events/*/*.csv&quot;)
                .where(col(&quot;event&quot;).equalTo(&quot;transaction&quot;)
                        .and(col(&quot;timestamp&quot;).geq(from))
                        .and(col(&quot;timestamp&quot;).leq(to)))
                .select(col(&quot;timestamp&quot;),
                        col(&quot;visitorid&quot;),
                        col(&quot;itemid&quot;));
        report.write(&quot;postgres&quot;)
                .option(SaveMode.KEY, SaveMode.OVERWRITE)
                .to(format(&quot;sales_report_%s_%s&quot;,
                        dateFormatForTableName.format(from),
                        dateFormatForTableName.format(to)));
    }).start();
}
</code></pre>
<p>From the application developer&#x2019;s perspective, it does not matter how the data is partitioned or where it is stored: the code remains almost the same as long as the data has a similar format.</p>
<p>You only need to take the performance into consideration, which will depend on the partitioning approach and the amount of data. This query might take a while to execute, so we trigger the job in a separate thread and return the REST response right away, saying that the job has been submitted for execution. After our REST endpoint is triggered, the work is done by the cluster and we get the <code>sales_report_*</code> table created.</p>
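<p>A raw thread is fine for a demo, but a bounded executor is a safer way to implement this fire-and-forget pattern in a real service; a minimal plain-Java sketch (the executor setup and names here are our own illustration, not part of Zentadata):</p>

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReportTrigger {
    // Bounded pool instead of raw threads: many concurrent report requests
    // queue up rather than exhaust the machine. Daemon threads let the JVM
    // exit cleanly even if a job is still queued.
    static final ExecutorService jobs = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Returns immediately; the heavy query runs on the pool in the background.
    static String submitReport(Runnable reportJob) {
        jobs.submit(reportJob);
        return "Job submitted for execution";
    }

    public static void main(String[] args) {
        // the Zentadata query from the snippet above would be the Runnable here
        System.out.println(submitReport(() -> { /* long-running report */ }));
    }
}
```

<p>The REST handler then only has to call <code>submitReport</code> and echo the returned status to the client.</p>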
<p>Please note that you can point to data on the file system using glob patterns. That is exactly how we specified that we are going to read all CSV files inside all subfolders of the events folder.</p>
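<p>The same glob semantics can be tried out with the standard Java NIO <code>PathMatcher</code> (a small standalone illustration, independent of Zentadata): <code>*</code> matches within a single folder level, so <code>*/*.csv</code> picks up CSV files exactly one partition folder deep.</p>

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobDemo {
    public static void main(String[] args) {
        // "*/*.csv" matches a CSV file exactly one folder deep,
        // e.g. inside a date partition folder like 2015-06-01/
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:*/*.csv");
        System.out.println(matcher.matches(Path.of("2015-06-01/events.csv"))); // true
        System.out.println(matcher.matches(Path.of("events.csv")));            // false: not in a subfolder
        System.out.println(matcher.matches(Path.of("2015/06/events.csv")));    // false: two levels deep
    }
}
```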
<h1 id="summary">Summary</h1>
<p>We went through a few cases that should give you an understanding of how the platform lets you work with Big Data using the standard Java and Spring stack. What is more, there is no need to build any sort of pipeline to ship the code to the cluster: you just grab the client library, configure it once and work with your data right away.</p>
<p>Of course, there are many other business cases and scenarios that can be covered thanks to the flexibility of the platform design and its available features. We are going to keep writing about them, so you won&#x2019;t have any doubts about getting started with Zentadata.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Data analytics for everyone with Zentadata Data Studio]]></title><description><![CDATA[This article is addressed to business analysts, data scientists, quality engineers and developers; in other words, to people who work with data.]]></description><link>https://blog.zentaly.com/how-to-use-zentadata-data-studio-to-work-with-data/</link><guid isPermaLink="false">634c0dbb80279a54a6ba658c</guid><category><![CDATA[big data]]></category><category><![CDATA[zentadata]]></category><category><![CDATA[dataanalytics]]></category><dc:creator><![CDATA[Dmytro Chaplai]]></dc:creator><pubDate>Thu, 03 Nov 2022 14:34:00 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/stephen-dawson-qwtCeJ5cLYs-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/10/stephen-dawson-qwtCeJ5cLYs-unsplash.jpg" alt="Data analytics for everyone with Zentadata Data Studio"><p>This article is addressed to a wider audience of business analysts, data scientists, quality engineers and developers; in other words, to people who work with data, perform analysis, build reports and so on.</p>
<p>There are a lot of real business cases that can be easily solved with the Zentadata platform. Today we are going to take a closer look at the Data Studio application, which is shipped together with Zentadata. It is a simple, user-friendly desktop application that runs locally on all popular platforms: macOS, Windows and Linux.</p>
<p><strong>Note:</strong> to install Zentadata Data Studio on your local machine please read our <a href="https://blog.zentaly.com/how-to-install-and-run-zentadata-developer-edition/">Quick start guide &#x1F680;</a> article.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h1 id="data-analytics-use-casemovielens-dataset">Data analytics use case - MovieLens Dataset</h1>
<p>We are going to work with this dataset  <a href="https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset">https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset</a>, so you can download it and follow all the steps from this article.</p>
<p>First of all, let&#x2019;s take a closer look at the dataset. It consists of a few big CSV files, some of which contain more than 1M records, so it is not even possible to open them completely in MS Excel.<br>
The data is also spread across several files that are related to each other.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/11/movielens-dataset.png" class="kg-image" alt="Data analytics for everyone with Zentadata Data Studio" loading="lazy" width="1448" height="784" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/movielens-dataset.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/movielens-dataset.png 1000w, https://blog.zentaly.com/content/images/2022/11/movielens-dataset.png 1448w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><p>The Zentadata platform with Data Studio is a good set of tools that allows us to analyze those files very quickly, without any additional data preparation. You can also work with a number of different data formats, CSV being only one of them. What is more, the data can be spread across multiple files and folders, or partitioned by datetime. It does not really matter: for us it is represented as a single DataFrame object (a table in RDBMS terms).</p>
<p><strong>Note:</strong> as for dataset size limitations:</p>
<ul>
<li>If you run <code>Zentadata Developer Edition</code> you can process up to hundreds of GB of data, with performance depending on your machine&#x2019;s hardware</li>
<li>In the case of <code>Zentadata Enterprise</code> there is no limit at all: being infinitely scalable, it can process as much data as the business needs</li>
</ul>
<p>Alright, let&#x2019;s get started with a few test cases that will give you an understanding of the platform&#x2019;s flexibility and power.</p>
<h3 id="case-1join-2-csv-files-and-calculate-average-movie-rating">Case #1 - join 2 csv files and calculate average movie rating</h3>
<p>The dataset we have downloaded contains the files <code>movie.csv</code> and <code>rating.csv</code>. They are related by the movieId field, and each rating was set by a specific user. Now we want to calculate the average rating for each movie in the list.</p>
<p>First of all, let&apos;s open the movie and rating files and verify the data. For that we will add 2 DataFrames and specify the source of the data.</p>
<pre><code>String ROOT_PATH = &quot;file:/Users/alex/Downloads&quot;;

DataFrame movies = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(ROOT_PATH + &quot;/MovieLens/movie.csv&quot;);

DataFrame ratings = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(ROOT_PATH + &quot;/MovieLens/rating.csv&quot;);
</code></pre>
<p>Now let&#x2019;s take the first 5 records from each DataFrame and display them on screen:</p>
<pre><code>movies.limit(5).execute().show();
ratings.limit(5).execute().show();

OUTPUT:
+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
+-------+----------------------------------+-------------------------------------------+
+------+-------+------+-------------------+
|userId|movieId|rating|timestamp          |
+------+-------+------+-------------------+
|1     |2      |3.5   |2005-04-02 23:53:47|
|1     |29     |3.5   |2005-04-02 23:31:16|
|1     |32     |3.5   |2005-04-02 23:33:39|
|1     |47     |3.5   |2005-04-02 23:32:07|
|1     |50     |3.5   |2005-04-02 23:29:40|
+------+-------+------+-------------------+
</code></pre>
<p>Writing queries in the Zentadata query language is pretty straightforward and similar to writing standard SQL. First we specify how the data frames are joined, then what we want to get out of the dataset, and finally how the data is grouped.</p>
<pre><code>movies
   .join(ratings, ratings.col(&quot;movieId&quot;).equalTo(movies.col(&quot;movieId&quot;)))
   .select(
           movies.col(&quot;movieId&quot;),
           movies.col(&quot;title&quot;),
           ratings.col(&quot;rating&quot;)
       )
   .groupBy(list(col(&quot;movieId&quot;), col(&quot;title&quot;)), avg(col(&quot;rating&quot;)).as(&quot;average rating&quot;))
   .limit(5)
   .execute().show();
   
OUTPUT:
+-------+------------------------------------------------------+------------------+
|movieId|title                                                 |average rating    |
+-------+------------------------------------------------------+------------------+
|4027   |O Brother, Where Art Thou? (2000)                     |3.891130068348836 |
|7153   |Lord of the Rings: The Return of the King, The (2003) |4.14238211356367  |
|2951   |Fistful of Dollars, A (Per un pugno di dollari) (1964)|3.9353664087391897|
|4995   |Beautiful Mind, A (2001)                              |3.91974830149104  |
|4015   |Dude, Where&apos;s My Car? (2000)                          |2.5065868263473052|
+-------+------------------------------------------------------+------------------+
</code></pre>
<p>So, just a few lines and the task is completed. Please also note that the listed query limits the output to 5 records; you can change or remove this limit if needed.</p>
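<p>Conceptually, the join-and-aggregate above does the same work as this plain-Java sketch over two small in-memory datasets (a toy illustration with our own helper names, not the Zentadata API): join ratings to titles on movieId, then group by title and average.</p>

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AvgRating {

    // ratings are (movieId, rating) pairs; join them with the movie titles
    // and average per title, mirroring the join/groupBy/avg query above
    static Map<String, Double> averageByTitle(Map<Integer, String> titleById,
                                              List<Map.Entry<Integer, Double>> ratings) {
        return ratings.stream().collect(Collectors.groupingBy(
                e -> titleById.get(e.getKey()),              // join on movieId
                Collectors.averagingDouble(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<Integer, String> movies = Map.of(
                1, "Toy Story (1995)",
                2, "Jumanji (1995)");
        List<Map.Entry<Integer, Double>> ratings = List.of(
                Map.entry(2, 3.5), Map.entry(2, 4.5), Map.entry(1, 5.0));

        System.out.println(averageByTitle(movies, ratings));
    }
}
```

<p>The difference, of course, is that the platform runs the same logic on files far too large to hold in memory.</p>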
<h3 id="case-2sorting-and-and-where-clause">Case #2 - sorting and where clause</h3>
<p>Now we would like to find the most popular movies, those with an average rating higher than 4.</p>
<p>Let&#x2019;s create a new DataFrame and call it <code>mostPopularMovies</code>, then add 2 lines with the where and sort clauses.</p>
<pre><code>DataFrame mostPopularMovies = movies
   .join(ratings, ratings.col(&quot;movieId&quot;).equalTo(movies.col(&quot;movieId&quot;)))
   .select(
       movies.col(&quot;movieId&quot;),
       movies.col(&quot;title&quot;),
       ratings.col(&quot;rating&quot;)
   )
   .groupBy(list(col(&quot;movieId&quot;), col(&quot;title&quot;)), avg(col(&quot;rating&quot;)).as(&quot;average rating&quot;))
   .where(col(&quot;average rating&quot;).gt(lit(&quot;4&quot;)))
   .sort(col(&quot;average rating&quot;));
 
mostPopularMovies.limit(5).execute().show();

OUTPUT:
+-------+----------------------------------------------------------------+-----------------+
|movieId|title                                                           |average rating   |
+-------+----------------------------------------------------------------+-----------------+
|91529  |Dark Knight Rises, The (2012)                                   |4.00020964360587 |
|76093  |How to Train Your Dragon (2010)                                 |4.000420079815165|
|1952   |Midnight Cowboy (1969)                                          |4.000634719136782|
|7096   |Rivers and Tides (2001)                                         |4.001824817518248|
|4928   |That Obscure Object of Desire (Cet obscur objet du d&#xE9;sir) (1977)|4.003125         |
+-------+----------------------------------------------------------------+-----------------+
</code></pre>
<p>And that&#x2019;s it.</p>
<h3 id="case-3filtering-by-tags">Case #3 - filtering by tags</h3>
<p>The next step will be selecting movies by tags. These tags have been assigned to each movie by users, and they are stored in a separate file. So first of all, let&#x2019;s look into the <code>tag.csv</code> file and investigate its structure.</p>
<pre><code>DataFrame tags = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(ROOT_PATH + &quot;/MovieLens/tag.csv&quot;);
 
tags.limit(5).execute().show();

OUTPUT:
+------+-------+-------------+-------------------+
|userId|movieId|tag          |timestamp          |
+------+-------+-------------+-------------------+
|18    |4141   |Mark Waters  |2009-04-24 18:19:40|
|65    |208    |dark hero    |2013-05-10 01:41:18|
|65    |353    |dark hero    |2013-05-10 01:41:19|
|65    |521    |noir thriller|2013-05-10 01:39:43|
|65    |592    |dark hero    |2013-05-10 01:41:18|
+------+-------+-------------+-------------------+
</code></pre>
<p>It also references movies by id and contains a bunch of tags placed by users, so let&#x2019;s join it with the most popular movies and try to find movies whose tags contain <code>Comedy</code>.</p>
<pre><code>mostPopularMovies
.join(tags, tags.col(&quot;movieId&quot;).equalTo(mostPopularMovies.col(&quot;movieId&quot;)))
.select(
   mostPopularMovies.col(&quot;movieId&quot;),
   mostPopularMovies.col(&quot;title&quot;),
   mostPopularMovies.col(&quot;average rating&quot;),
   tags.col(&quot;tag&quot;)
 )
.where(col(&quot;tag&quot;).contains(lit(&quot;Comedy&quot;)))
.limit(5)
.execute().show();

OUTPUT:
+-------+--------------------------------------+------------------+----------------+
|movieId|title                                 |average rating    |tag             |
+-------+--------------------------------------+------------------+----------------+
|356    |Forrest Gump (1994)                   |4.029000181345584 |Classic Comedy  |
|356    |Forrest Gump (1994)                   |4.029000181345584 |Comedy          |
|1136   |Monty Python and the Holy Grail (1975)|4.174146075581396 |Classic Comedy  |
|910    |Some Like It Hot (1959)               |4.082677165354331 |Comedy          |
|951    |His Girl Friday (1940)                |4.1529984623270115|Screwball Comedy|
+-------+--------------------------------------+------------------+----------------+
</code></pre>
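<p>Note that the <code>contains</code> filter performs a substring match rather than an equality check, which is why tags like <code>Classic Comedy</code> and <code>Screwball Comedy</code> also appear in the output. A minimal plain-Java illustration of the same test:</p>

```java
public class TagFilter {
    public static void main(String[] args) {
        String[] tags = {"Classic Comedy", "Comedy", "dark hero", "Screwball Comedy"};
        for (String tag : tags) {
            if (tag.contains("Comedy")) {   // substring match, not equality
                System.out.println(tag);
            }
        }
        // prints: Classic Comedy, Comedy, Screwball Comedy - but not "dark hero"
    }
}
```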
<h3 id="case-4join-csv-files-with-database-table">Case #4 - join csv files with database table</h3>
<p>Let&#x2019;s do something more exciting now. We will create a report that filters movies by user details taken from a relational database.</p>
<p>You probably noticed that the provided dataset does not contain a user file, but the rating and tag files have userId fields. We can assume that we have a <code>users</code> table in our relational database, and it needs to be joined with the data from the CSV files in order to filter by users&apos; data. Well, it&#x2019;s a piece of cake for Zentadata.</p>
<p>For demonstration purposes, I have created a simple table in Postgres and added a couple of records.</p>
<pre><code>CREATE TABLE users
(
    id INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name VARCHAR,
    country  VARCHAR
);

INSERT INTO users (id, first_name, last_name, country)
VALUES (1, &apos;John&apos;, &apos;Dow&apos;, &apos;US&apos;),
       (2, &apos;Nuria&apos;, &apos;Fabricio&apos;, &apos;US&apos;),
       (3, &apos;Itzel&apos;, &apos;Langosh&apos;, &apos;US&apos;),
       (4, &apos;Lilliana&apos;, &apos;Larkin&apos;, &apos;PL&apos;),
       (5, &apos;Walker&apos;, &apos;Quigley&apos;, &apos;PL&apos;);

</code></pre>
<p>Now let&#x2019;s join that Postgres table with the CSV files and create a report that keeps only users from the <code>US</code>.</p>
<pre><code>DataFrame users = zen.read(&quot;postgres&quot;).from(&quot;users&quot;);
users.execute().show();

DataFrame report = movies
    .join(ratings, ratings.col(&quot;movieId&quot;).equalTo(movies.col(&quot;movieId&quot;)))
    .join(users, ratings.col(&quot;userId&quot;).equalTo(users.col(&quot;id&quot;)))
    .select(
        movies.col(&quot;movieId&quot;),
        movies.col(&quot;title&quot;),
        ratings.col(&quot;rating&quot;),
        users.col(&quot;id&quot;)
    )
    .where(col(&quot;country&quot;).equalTo(lit(&quot;US&quot;)))
    .groupBy(list(col(&quot;movieId&quot;), col(&quot;title&quot;)), avg(col(&quot;rating&quot;)).as(&quot;average rating&quot;));

report.limit(5).execute().show();

OUTPUT:
+--+----------+---------+-------+
|id|first_name|last_name|country|
+--+----------+---------+-------+
|1 |John      |Dow      |US     |
|2 |Nuria     |Fabricio |US     |
|3 |Itzel     |Langosh  |US     |
|4 |Lilliana  |Larkin   |PL     |
|5 |Walker    |Quigley  |PL     |
+--+----------+---------+-------+
+-------+------------------------------------------------------+--------------+
|movieId|title                                                 |average rating|
+-------+------------------------------------------------------+--------------+
|4027   |O Brother, Where Art Thou? (2000)                     |4.0           |
|7153   |Lord of the Rings: The Return of the King, The (2003) |5.0           |
|2951   |Fistful of Dollars, A (Per un pugno di dollari) (1964)|4.0           |
|337    |Whats Eating Gilbert Grape (1993)                     |3.25          |
|2797   |Big (1988)                                            |4.0           |
+-------+------------------------------------------------------+--------------+
</code></pre>
<p>As you can see, from our perspective there is no difference in the source of the data: for us it&apos;s a DataFrame that can be operated on in the same manner.</p>
<h3 id="case-5saving-report-to-the-relational-database">Case #5 - saving report to the relational database</h3>
<p>Now our boss wants the report to be a separate table in the database. That sounds like a huge amount of work, but with the help of the Zentadata platform you can do it in just a few minutes and, more importantly, in just a couple of lines of code.</p>
<pre><code>report.write(&quot;postgres&quot;)
   .option(SaveMode.KEY, SaveMode.OVERWRITE)
   .to(&quot;report&quot;);
</code></pre>
<p>And that&#x2019;s it. Let&#x2019;s check out Postgres: you will see that a new table <code>report</code> has been created. It has the same structure as our select statement, and it also contains all the data.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/data-studio-db-table.png" class="kg-image" alt="Data analytics for everyone with Zentadata Data Studio" loading="lazy" width="990" height="471" srcset="https://blog.zentaly.com/content/images/size/w600/2022/10/data-studio-db-table.png 600w, https://blog.zentaly.com/content/images/2022/10/data-studio-db-table.png 990w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><h1 id="summary">Summary</h1>
<p>We went through a few cases that should give you an understanding of the product and some ideas on how you can use it in your daily work. The Zentadata platform and the tools shipped with it, like Data Studio, can significantly improve your productivity and make your life easier.</p>
<p>In the next article we will discuss how to work with big data from Java Spring Boot applications.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Quick start guide - Zentadata Developer Edition]]></title><description><![CDATA[Zentadata Developer Edition is the simplest solution to start data analysis on your local machine right away. It is totally free and available for everyone.]]></description><link>https://blog.zentaly.com/how-to-install-and-run-zentadata-developer-edition/</link><guid isPermaLink="false">6326e56e80279a54a6ba6190</guid><category><![CDATA[big data]]></category><category><![CDATA[zentadata]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Wed, 26 Oct 2022 15:03:44 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/andy-hermawan-bVBvv5xlX3g-unsplash-1.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="overview">Overview</h1>
<img src="https://blog.zentaly.com/content/images/2022/10/andy-hermawan-bVBvv5xlX3g-unsplash-1.jpg" alt="Quick start guide - Zentadata Developer Edition"><p>Zentadata Developer Edition is the simplest solution to start data analysis on your local machine right away. It is totally free and available for everyone.</p>
<p>Zentadata Developer Edition consists of 2 modules:</p>
<ol>
<li><strong><code>Data Studio</code></strong> - a data analytics IDE where you actually work with data</li>
<li><strong><code>Developer Cluster</code></strong> - a data processing engine shipped as a Docker container</li>
</ol>
<h4 id="developer-edition-features">Developer Edition Features</h4>
<ul>
<li>Data formats: JSON, CSV, XML, Parquet</li>
<li>Data sources: Local File System, PostgresDB</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/quick-start-diagram.svg" class="kg-image" alt="Quick start guide - Zentadata Developer Edition" loading="lazy" width="493" height="493"></figure><!--kg-card-begin: markdown--><h1 id="installation">Installation</h1>
<h4 id="prerequisites">Prerequisites</h4>
<ul>
<li>Docker installed on your local machine</li>
<li>Minimum 1GB of RAM for Docker container</li>
<li>Get free Developer Edition license key at <a href="https://account.zentadata.com">https://account.zentadata.com</a></li>
</ul>
<h4 id="install-data-studio">Install Data Studio</h4>
<p>You can download and install Data Studio from this <a href="https://zentadata.com/get-started">link&#x1F4E6;</a>.</p>
<p>Data Studio connects to the Developer Cluster to execute user-defined data jobs. By default it is configured to connect to the local Developer Cluster at <code>http://localhost:8090</code>, which is good enough for our use case.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/Screenshot-2022-09-18-at-12.21.31.png" class="kg-image" alt="Quick start guide - Zentadata Developer Edition" loading="lazy" width="1762" height="928" srcset="https://blog.zentaly.com/content/images/size/w600/2022/10/Screenshot-2022-09-18-at-12.21.31.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/10/Screenshot-2022-09-18-at-12.21.31.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/10/Screenshot-2022-09-18-at-12.21.31.png 1600w, https://blog.zentaly.com/content/images/2022/10/Screenshot-2022-09-18-at-12.21.31.png 1762w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><h4 id="install-developer-cluster">Install Developer Cluster</h4>
<p>Download docker image and start container:</p>
<pre><code>docker pull zentadata/zentadata-dev:latest

docker run -di -p 8090:8090 --name zentadata-dev \
--mount type=bind,source=/Users/&lt;user_name&gt;/datasets,target=/datasets \
-e POSTGRES_URL=jdbc:postgresql://host.docker.internal:5432/postgres \
-e POSTGRES_USERNAME=postgres \
-e POSTGRES_PASSWORD=********* \
-e ZENTADATA_LICENSE_KEY=****** \
zentadata/zentadata-dev:latest
</code></pre>
<p><strong>Note:</strong> if you are running Docker under Linux, you might need to add one extra parameter: <code>--add-host=host.docker.internal:host-gateway</code>. Otherwise the container will not be able to resolve the address <code>host.docker.internal</code>.</p>
<p>This will start a docker container running the Developer Cluster, but most probably you will need to adjust the configuration for your needs. See the next chapter on how to configure each parameter.</p>
<h6 id="mount-local-folder-to-container-filesystem">Mount local folder to container filesystem</h6>
<p>Please notice how we use the <code>--mount</code> parameter. To process data files from your local file system (/Users/Alex/datasets), you need to mount it into the Docker container filesystem (/datasets) so it is available to the data engine.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/quick-start-mounts-1.svg" class="kg-image" alt="Quick start guide - Zentadata Developer Edition" loading="lazy" width="733" height="293"></figure><!--kg-card-begin: markdown--><h6 id="docker-container-configuration">Docker container configuration</h6>
<p>There are multiple parameters available to configure Developer Cluster running in docker container via environment variables.</p>
<p><strong>Note:</strong> if you want to connect to a PostgresDB running on localhost, you need to set the address to <code>host.docker.internal</code> - it is a Docker alias for connecting from within a container to the host&#x2019;s <code>localhost</code>.</p>
<table>
<thead>
<tr>
<th>Env variable</th>
<th>Default value</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>POSTGRES_URL</td>
<td>jdbc:postgresql://host.docker.internal:5432/postgres</td>
<td>PostgresDB connection string</td>
</tr>
<tr>
<td>POSTGRES_USERNAME</td>
<td>postgres</td>
<td>PostgresDB username</td>
</tr>
<tr>
<td>POSTGRES_PASSWORD</td>
<td>postgres</td>
<td>PostgresDB password</td>
</tr>
<tr>
<td>MAX_HEAP_SIZE</td>
<td>1g</td>
<td>Max memory allocated for Developer Cluster</td>
</tr>
<tr>
<td>ZENTADATA_LICENSE_KEY</td>
<td></td>
<td>Developer License Key you can obtain registering at <a href="https://account.zentadata.com">https://account.zentadata.com</a></td>
</tr>
</tbody>
</table>
<h1 id="simple-app">Simple app</h1>
<p>Now that we have everything in place, let&#x2019;s run Data Studio and execute some simple queries.</p>
<h4 id="read-postgresdb">Read PostgresDB</h4>
<p>In my local Postgres database I have a table <strong><code>users</code></strong> defined as follows:</p>
<pre><code>CREATE TABLE users
(
    id INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name VARCHAR,
    country  VARCHAR
);

INSERT INTO users (id, first_name, last_name, country)
VALUES (1, &apos;John&apos;, &apos;Dow&apos;, &apos;US&apos;),
       (2, &apos;Nuria&apos;, &apos;Fabricio&apos;, &apos;US&apos;),
       (3, &apos;Itzel&apos;, &apos;Langosh&apos;, &apos;US&apos;),
       (4, &apos;Lilliana&apos;, &apos;Larkin&apos;, &apos;PL&apos;),
       (5, &apos;Walker&apos;, &apos;Quigley&apos;, &apos;PL&apos;);
</code></pre>
<p>Let&#x2019;s copy-paste the following code into Data Studio and execute it (hotkey F9):</p>
<pre><code>zen
    .read(&quot;postgres&quot;)
    .from(&quot;users&quot;)
    .execute().show();

EXPECTED OUTPUT:
+--+----------+---------+-------+
|id|first_name|last_name|country|
+--+----------+---------+-------+
|1 |John      |Dow      |US     |
|2 |Nuria     |Fabricio |US     |
|3 |Itzel     |Langosh  |US     |
|4 |Lilliana  |Larkin   |PL     |
|5 |Walker    |Quigley  |PL     |
+--+----------+---------+-------+
</code></pre>
<h4 id="read-json-files">Read JSON files</h4>
<p>On my local filesystem I have a file <code>/Users/Alex/data-samples/orders.json</code> with the following content:</p>
<pre><code>[
  {
    &quot;order_id&quot;: &quot;1&quot;,
    &quot;date&quot;: &quot;2020101&quot;,
    &quot;items&quot;: [{
        &quot;name&quot;: &quot;ipad&quot;,
        &quot;price&quot;: 449.99
    }]
  },
  {
    &quot;order_id&quot;: &quot;2&quot;,
    &quot;date&quot;: &quot;2020101&quot;,
    &quot;items&quot;: [{
        &quot;name&quot;: &quot;imac 27&quot;,
        &quot;price&quot;: 1700
    }]
  }
]
</code></pre>
<p>Let&#x2019;s read this JSON file with Data Studio and print its content:</p>
<pre><code>zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.JSON)
    .option(JsonOpts.IS_MULTILINE, &quot;true&quot;)
    .from(&quot;file:/datasets/orders.json&quot;)
    .execute().show();  

EXPECTED OUTPUT:
+-------+------------------+--------+
|date   |items             |order_id|
+-------+------------------+--------+
|2020101|[[ipad,449.99]]   |1       |
|2020101|[[imac 27,1700.0]]|2       |
+-------+------------------+--------+

</code></pre>
<p><strong>Note:</strong> Please notice how we set the path to the file relative to the container&#x2019;s mounted volume: <code>&quot;file:/datasets/orders.json&quot;</code></p>
<h1 id="summary">Summary</h1>
<p>We have installed <strong>Zentadata Developer Edition</strong> and successfully executed simple queries.</p>
<p>Of course, the true data analytics power comes with more advanced queries, which we will show in upcoming blog posts.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How Data Driven Enterprise makes business effective?]]></title><description><![CDATA[This article explains the anatomy of the data driven approach and how it boosts enterprise performance so much.]]></description><link>https://blog.zentaly.com/how-data-driven-enterprise-makes-business-effective/</link><guid isPermaLink="false">634a72f580279a54a6ba641a</guid><category><![CDATA[big data]]></category><category><![CDATA[enterprise]]></category><category><![CDATA[zentadata]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Wed, 19 Oct 2022 05:57:13 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/bilge-tekin-GiATUqz4NYY-unsplash-3.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/10/bilge-tekin-GiATUqz4NYY-unsplash-3.jpg" alt="How Data Driven Enterprise makes business effective?"><p>There is a lot of information about the benefits of a Data Driven Enterprise, with the most prominent characteristics being the following:</p>
<ol>
<li>Leverage data to prove multiple theories and choose the best one</li>
<li>Continuously research business data to find new opportunities</li>
<li>Seamless integration of ML into business processes to gain extra revenue</li>
<li>Innovative data techniques resolve challenges in hours, days or weeks</li>
</ol>
<p>While for some people the mentioned benefits might look quite obvious, understanding how they are made possible can create an even stronger mind shift towards a data driven culture.</p>
<p>So in this article I would like to explain the anatomy of the data driven approach and how exactly it makes it possible to boost enterprise performance so much.</p>
<h1 id="enterprise-effectiveness-breakdown">Enterprise effectiveness breakdown</h1>
<p>Today any modern enterprise consists of multiple departments. Each department has a variety of business processes to follow according to its goals. From an organisational perspective, a department consists of multiple teams and employees.</p>
<p>To make the entire enterprise work more efficiently, organisations declare a tremendous number of rules and policies (aka business processes), which could be generalised as follows:</p>
<ol>
<li>Enterprise is effective when all of its departments are following their <strong>business processes</strong> properly</li>
<li>Enterprise is effective when departments are able to <strong>collaborate</strong> effectively</li>
<li>Enterprise is effective when it is able to deliver competitive <strong>product</strong> to the end consumer</li>
</ol>
<p><strong>Note:</strong> <em>in this schema product quality is a derivative of processes and collaboration.</em></p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/10/dde-effectiveness-stage-1-3.svg" class="kg-image" alt="How Data Driven Enterprise makes business effective?" loading="lazy" width="626" height="386"><figcaption>Enterprise effectiveness schema</figcaption></figure><!--kg-card-begin: markdown--><p>Based on this definition we can highlight 3 core KPIs:</p>
<ol>
<li>Effectiveness of business processes</li>
<li>Effectiveness of cross-team collaboration</li>
<li>Effectiveness of product delivery</li>
</ol>
<h1 id="enterprise-data-maturity-stages">Enterprise data maturity stages</h1>
<p>The maturity of data processing practices varies significantly from company to company. Some companies might not be aware of the value of data at all, while others may have immense experience growing a data culture among their employees.</p>
<p>Based on the level of penetration of data practices into organisational processes, I would like to highlight 3 well distinguished stages.</p>
<p><strong>Note:</strong> <em>please remember that there is no clear separation between these 3 stages. In fact, continuous improvement of data practices leads from one stage to another. The ultimate goal of a Data Driven Enterprise is not to apply strict rules, but to follow the right direction by practicing a data culture, where decisions and actions are based on real use cases for each specific enterprise.</em></p>
<p>So here are the 3 most prominent stages of data maturity practices within an organisation. We will review each stage separately and see how it affects the enterprise effectiveness KPIs.</p>
<ul>
<li><strong>Stage 1</strong> - Automation</li>
<li><strong>Stage 2</strong> - Data Awareness</li>
<li><strong>Stage 3</strong> - Data Driven</li>
</ul>
<h3 id="stage-1automation">Stage 1 - Automation</h3>
<p>This is the starting point, where a company makes its initial efforts towards automation and/or digitalisation of its business. In particular, it could address the following scenarios: standardising a specific business case, automating a manual workflow, decreasing the human factor, etc. What is important to understand here is that the goal of Automation is to address one specific business problem in the most straightforward way possible.</p>
<p>At this stage the organisation&apos;s departments do have data storages, but they take the form of data silos: each focuses on a single service and is strongly isolated from other systems, even within the scope of the parent department.</p>
<p>Typical achievements of Automation in the scope of the core KPIs:</p>
<table>
<thead>
<tr>
<th>Core KPI</th>
<th>Achievement</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Minimising the error rate and limiting the role of the human factor by automating business processes</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Decreasing collaboration gaps through well defined workflows</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Improving the predictability of the delivery process by mitigating the most common risks addressed by the previous 2 KPIs</td>
</tr>
</tbody>
</table>
<h3 id="stage-2data-awareness">Stage 2 - Data Awareness</h3>
<p>The next stage after Automation is Data Awareness. This stage is characterised by initial efforts to separate data from business processes into a clean and reusable form. It is often implemented as a data warehouse or some other kind of centralised data storage.</p>
<p>This stage helps to implement some quantitative improvements of business processes by reusing existing data sets. Having centralised data storages also helps to cut costs and maintenance efforts.</p>
<p><strong>Note:</strong> <em>at this point enterprise decision makers are responsible for defining business processes which, being executed properly by the IT department, should deliver business results. It&#x2019;s important to note that the first 2 stages, Automation and Data Awareness, are not intended to change existing organisational processes but rather to improve and optimise them.</em></p>
<p>Nevertheless, this stage is characterised by a clear understanding of data value and its ability to make an impact. It is the moment when the first data leaders arise with the idea of shortening the product quality improvement turnaround. There are first signs that significant product improvements could be made at the team level.</p>
<p>Data Awareness impact on the core KPIs:</p>
<table>
<thead>
<tr>
<th>Core KPI</th>
<th>Achievement</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Improved time to market</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Faster collaboration through data sharing</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Quantitative improvement based on the sum of the previous two KPIs</td>
</tr>
</tbody>
</table>
<h3 id="stage-3data-driven-enterprise">Stage 3 - Data Driven Enterprise</h3>
<p>This is the highest level of data practices in our list.</p>
<p>At this level, the amount of available information and org-wide data access make it possible for most employees to answer pretty much any question about their company&apos;s business. At the same time it is equally important that not only the infrastructure but also the people are mature enough to extract value from data in their daily duties.</p>
<p>The most successful employees emerge as data leaders inside cross-functional teams focusing on tremendous product improvement. And not only improvement - they also find space for new products which will be able to conquer the market.</p>
<p>While at the previous stages the organisation was focused on the quality of processes and collaboration, this stage puts people at the centre as the key factor in creating outstanding product value. At the same time, processes are intended to empower people and provide the right data context required for optimal decision making in every specific case.</p>
<p>So in this new organisation, business processes and collaboration are no longer the driving factor. They become just extra data context to support the main goal - enabling people to create value through product innovation.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/10/dde-effectiveness-stage-2-6.svg" class="kg-image" alt="How Data Driven Enterprise makes business effective?" loading="lazy" width="533" height="533"><figcaption>Stage 3 - Data Driven Enterprise</figcaption></figure><!--kg-card-begin: markdown--><p>Data Driven approach impact on the core KPIs:</p>
<table>
<thead>
<tr>
<th>Core KPI</th>
<th>Achievement</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Improved relevance by making correct decisions with the help of rich data context</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Department level collaboration is replaced by team level collaboration</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Cross-functional teams powered by org-wide shared data make tremendous impact directly to the product development life cycle</td>
</tr>
</tbody>
</table>
<h1 id="summary">Summary</h1>
<p>Let&apos;s summarise how a typical enterprise benefits from the data driven approach in a short, easy to remember table.</p>
<table>
<thead>
<tr>
<th></th>
<th>Automation</th>
<th>Data Awareness</th>
<th>Data Driven</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Improves quality</td>
<td>Improves time to market</td>
<td>Optimised to focus on real problems based on rich data context</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Decreases gaps</td>
<td>Faster collaboration through data sharing</td>
<td>Department level collaboration moved to team level</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Improves predictability, decreases human factor</td>
<td>Quantitative improvements by X%</td>
<td>Cross-functional teams of experts directly impact product development life cycle</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How to deploy Spark job to cluster]]></title><description><![CDATA[<p>In the <a href="https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark">previous post</a> we have created simple Spark job and executed it locally on a single machine. Nevertheless the main goal of Spark framework is to utilize cluster resources consisting of multiple servers and in this way increase data processing throughput. In the real life the amount of data</p>]]></description><link>https://blog.zentaly.com/how-to-deploy-spark-job-to-cluster/</link><guid isPermaLink="false">5f9db94d8ccd9c06dfc69fb7</guid><category><![CDATA[big data]]></category><category><![CDATA[spark]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Sun, 01 Nov 2020 10:39:36 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/mihail-tregubov-WYTMdCBrBok-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.zentaly.com/content/images/2022/10/mihail-tregubov-WYTMdCBrBok-unsplash.jpg" alt="How to deploy Spark job to cluster"><p>In the <a href="https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark">previous post</a> we created a simple Spark job and executed it locally on a single machine. Nevertheless, the main goal of the Spark framework is to utilize the resources of a cluster consisting of multiple servers and thereby increase data processing throughput. In real life the amount of data processed by a production grade cluster is measured at least in terabytes.</p><p>The big advantage of a Spark application is that it is ready for distributed cluster execution by design. 
Still, there are a few details which should be taken into account given the distributed nature of the application.</p><p>Let&apos;s revisit our application and highlight 3 main concerns which should be addressed to execute our job on a Spark cluster:</p><pre><code>import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import static org.apache.spark.sql.functions.*;

public class ClusterDeployment {
  public static void main(String[] args) {
    // TODO Concern 1: how to setup master?
    SparkConf conf = new SparkConf().setMaster(&quot;local[*]&quot;);
    SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();

    Dataset&lt;Row&gt; dataframe = spark.read()
        .option(&quot;multiline&quot;, &quot;true&quot;)
        // TODO Concern 2: how to read file?
        .json(&quot;invoice.json&quot;)
        .select(
            col(&quot;item&quot;),
            expr(&quot;price * quantity&quot;).as(&quot;item_total&quot;));

    // TODO Concern 3: what to do with output?
    dataframe.show();
  }
}</code></pre><h4 id="concern-1-setup-master-server">Concern 1 - Setup master server</h4><p>In our sample we set the master to <code>local[*]</code>, which means: run the job locally using all available CPUs. For cluster deployment the <code>setMaster</code> statement should be omitted, as hardware resources will be automatically managed by the cluster environment.</p><h4 id="concern-2-how-to-read-source-file">Concern 2 - How to read source file?</h4><p>Our naive test used the local file <code>invoice.json</code> as a datasource. But as you already know, a real big data application is expected to read files which could be too large to be stored on your local hard disk drive.</p><p>To store such big files you need a very special file system - <strong>HDFS</strong> (Hadoop Distributed File System). It is deployed on top of multiple servers and stores data in a distributed way. Alternatively you can use <strong>Amazon S3</strong> storage, which is very similar to HDFS but hosted in the cloud.</p><p>In our case what matters is to set the proper path to the file system where the actual data is stored, e.g.:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>File system</th>
<th>File path sample</th>
</tr>
</thead>
<tbody>
<tr>
<td>HDFS</td>
<td>hdfs://namenode:8020/user/data/invoice.json</td>
</tr>
<tr>
<td>Amazon S3</td>
<td>s3a://bucket-name/user/data/invoice.json</td>
</tr>
</tbody>
</table>
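<p>If you choose Amazon S3 as the storage, the cluster will also need credentials to access the bucket. As a sketch (the property names come from the Hadoop s3a connector; the values below are placeholders, not real keys), such credentials can be supplied as Spark configuration, e.g. in <code>spark-defaults.conf</code> or via <code>--conf</code> arguments to spark-submit:</p>

```
# sketch: forward S3 credentials to the s3a connector (placeholder values)
spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
```

<p>The <code>spark.hadoop.</code> prefix tells Spark to pass the remainder of the property name down to the underlying Hadoop configuration.</p>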
<!--kg-card-end: markdown--><p>A production grade Spark cluster supports HDFS and Amazon S3 by default, so setting the correct path is all you need.</p><h4 id="concern-3-what-to-do-with-output">Concern 3 - What to do with output?</h4><p>In our test app we used the <code>show()</code> method to print the first 20 rows of the dataframe to standard output. That&apos;s fine for development purposes, but for a real application we need something more practical.</p><p>There are 2 main options to consider: <code>collectAsList</code> the data or <code>write</code> it to external storage.</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Method</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>collectAsList</td>
<td>Returns a Java list that contains all rows in this Dataset. Running collect requires moving all the data into the application&apos;s driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.</td>
</tr>
<tr>
<td>write</td>
<td>Interface for saving the content of the non-streaming Dataset out into external storage.</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>One important aspect of <code>collectAsList</code> is that it returns all data to your local JVM instance. Since the dataframe might contain gigabytes of data, executing this operation could crash your application with an OutOfMemoryError. Still, this operation can be safe if you apply heavy filtering first and ensure that the underlying result set is small enough to be processed in a single JVM.</p><p>Nevertheless, most of the time you will want to write the results of a data transformation to persistent storage. For example, you can write the result set back to HDFS in JSON format as follows:</p><pre><code>dataframe.write()
  .json(&quot;hdfs:///user/data/invoice_report&quot;);</code></pre><p>So here is the final version of our application with all concerns adressed:</p><pre><code>import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import static org.apache.spark.sql.functions.*;

public class ClusterDeployment {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf();
    SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();

    Configuration hconf = spark.sparkContext().hadoopConfiguration();
    hconf.set(&quot;fs.defaultFS&quot;, &quot;hdfs://namenode:8020&quot;);
    hconf.set(&quot;dfs.client.use.datanode.hostname&quot;, &quot;true&quot;);

    Dataset&lt;Row&gt; dataframe = spark.read()
        .option(&quot;multiline&quot;, &quot;true&quot;)
        .json(&quot;hdfs:///user/data/invoice.json&quot;)
        .select(
            col(&quot;item&quot;),
            expr(&quot;price * quantity&quot;).as(&quot;item_total&quot;));

    dataframe.write().json(&quot;hdfs:///user/data/invoice_report&quot;);
  }
}
</code></pre><h3 id="deployment">Deployment</h3><p>To test cluster deployment, you will need a real Spark cluster up and running.</p><p>Once your application is ready, it&apos;s time to build a jar file and deploy it to the cluster:</p><pre><code>~/spark/bin/spark-submit \
--master spark://spark-master:6066 \
--deploy-mode client \
--class ClusterDeployment \
./java-spark-job-0.1-SNAPSHOT.jar</code></pre><p>Let&apos;s go through the most important parameters one by one:</p><ul><li><code>--master</code> is the master URL of the Spark cluster</li><li><code>--deploy-mode</code> defines where the Spark driver is hosted: <code>client</code> means the driver runs on the same machine where spark-submit is executed, <code>cluster</code> deploys the driver to one of the cluster workers</li><li><code>--class</code> is the entry point class of the Spark job</li><li>the last parameter should be the path to the application jar file</li></ul><h3 id="summary">Summary</h3><p>We have prepared our application to process data in a production environment and successfully deployed it to the Spark cluster.</p>]]></content:encoded></item><item><title><![CDATA[How to write Big Data application with Java and Spark]]></title><description><![CDATA[<p>Spark is modern Big Data framework to build highly scalable and feature rich data transformation pipelines.</p><p>Spark&apos;s main advantages are simplicity and high performance compared to its predecessor - Hadoop. 
You can write Spark applications in main 3 languages: Scala, Java and Python.</p><p>In this guide I will</p>]]></description><link>https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark/</link><guid isPermaLink="false">5f947b7d8ccd9c06dfc69ec7</guid><category><![CDATA[big data]]></category><category><![CDATA[java]]></category><category><![CDATA[spark]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Sun, 25 Oct 2020 10:09:40 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/apache-spark-with-java-3.svg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.zentaly.com/content/images/2022/10/apache-spark-with-java-3.svg" alt="How to write Big Data application with Java and Spark"><p>Spark is a modern Big Data framework for building highly scalable and feature-rich data transformation pipelines.</p><p>Spark&apos;s main advantages are simplicity and high performance compared to its predecessor - Hadoop. You can write Spark applications in 3 main languages: Scala, Java and Python.</p><p>In this guide I will show you how to write a simple Spark application in Java.</p><h3 id="writing-spark-application">Writing Spark application</h3><p>To create a Spark job, as a first step you will need to add the Spark library dependency to your Maven project:</p><pre><code>&lt;dependency&gt;
    &lt;groupId&gt;org.apache.spark&lt;/groupId&gt;
    &lt;artifactId&gt;spark-sql_2.11&lt;/artifactId&gt;
    &lt;version&gt;2.4.7&lt;/version&gt;
&lt;/dependency&gt;</code></pre><p>In this guide we will try to read and transform list of invoices provided in <strong>invoice.json</strong> file:</p><pre><code>[
  {
    &quot;item&quot;: &quot;iphone X&quot;,
    &quot;price&quot;: 1000.00,
    &quot;quantity&quot;: 2
  },
  {
    &quot;item&quot;: &quot;airpods&quot;,
    &quot;price&quot;: 150.00,
    &quot;quantity&quot;: 1
  },
  {
    &quot;item&quot;: &quot;macbook pro 13&quot;,
    &quot;price&quot;: 1500.00,
    &quot;quantity&quot;: 3
  }
]</code></pre><p>Now we are ready to start writing application</p><pre><code>import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.SparkConf;

public class SparkJob {
  public static void main(String[] args) {
    // setup spark session
    SparkConf conf = new SparkConf().setMaster(&quot;local[*]&quot;);
    SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();

    // read original json file
    Dataset&lt;Row&gt; dataframe = spark.read()
        .option(&quot;multiline&quot;, &quot;true&quot;)
        .json(&quot;invoice.json&quot;);
    dataframe.show();
  }
}</code></pre><p>Here we have created a SparkSession object, which is the entry point for building data jobs. Then we created a DataFrame, represented by the Java type <code>Dataset&lt;Row&gt;</code>.</p><h3 id="dataframe">DataFrame</h3><p>A DataFrame is a distributed data structure representing underlying data, similar to a table in a relational database. Executing <code>dataframe.show()</code> generates the following output:</p><pre><code>+--------------+------+--------+
|          item| price|quantity|
+--------------+------+--------+
|      iphone X|1000.0|       2|
|       airpods| 150.0|       1|
|macbook pro 13|1500.0|       3|
+--------------+------+--------+</code></pre><p>Beside data representation, DataFrame provides API methods to transform the underlying data, such as:</p><ul><li>select</li><li>where</li><li>sort</li><li>groupBy</li><li>etc.</li></ul><p>Now let&apos;s try to do some data transformations.</p><h4 id="task-1-filter-items-based-on-a-price">Task 1 - Filter items based on a price</h4><p>Imagine that we need to build an invoice report which includes expensive items only, with a price &gt;= $1000. It can be implemented as follows:</p><pre><code>// filter items with a price greater than or equal to 1000
Dataset&lt;Row&gt; expensiveItems = dataframe
    .filter(&quot;price &gt;= 1000&quot;);
expensiveItems.show();

+--------------+------+--------+
|          item| price|quantity|
+--------------+------+--------+
|      iphone X|1000.0|       2|
|macbook pro 13|1500.0|       3|
+--------------+------+--------+</code></pre><p>Alternatively this code could be improved by applying statically typed functions:</p><pre><code>import static org.apache.spark.sql.functions.*;

Dataset&lt;Row&gt; expensiveItems = dataframe
        .filter(col(&quot;price&quot;).geq(1000));</code></pre><p>Here we reference the <code>price</code> column and apply the function <code>geq</code> <em>(greater than or equal to an expression)</em> with the parameter 1000. It produces exactly the same result but leaves a little less room for a typo.</p><h4 id="task-2-calculate-total-price-for-each-item">Task 2 - Calculate total price for each item</h4><p>Having the item price and item quantity columns, we can calculate the total amount for each item as follows:</p><pre><code>// multiply item price by item quantity in invoice
Dataset&lt;Row&gt; sumPerRow = dataframe
    .select(
        col(&quot;*&quot;),
        expr(&quot;price * quantity&quot;).as(&quot;sum_per_row&quot;));
sumPerRow.show();
    
+--------------+------+--------+-----------+
|          item| price|quantity|sum_per_row|
+--------------+------+--------+-----------+
|      iphone X|1000.0|       2|     2000.0|
|       airpods| 150.0|       1|      150.0|
|macbook pro 13|1500.0|       3|     4500.0|
+--------------+------+--------+-----------+</code></pre><p>And statically typed alternative:</p><pre><code>Dataset&lt;Row&gt; sumPerRow = dataframe
    .select(
        col(&quot;*&quot;),
        col(&quot;price&quot;).multiply(col(&quot;quantity&quot;)).as(&quot;sum_per_row&quot;));</code></pre><h4 id="task-3-calculate-total-amount-for-all-items-in-invoice">Task 3 - Calculate total amount for all items in invoice</h4><p>As a last task, let&apos;s aggregate all rows and calculate the total invoice amount:</p><pre><code>// calculate total amount of all items
Dataset&lt;Row&gt; total = dataframe
    .select(sum(expr(&quot;price * quantity&quot;)).as(&quot;total&quot;));
total.show();
    
+------+
| total|
+------+
|6650.0|
+------+</code></pre><h3 id="summary">Summary</h3><p>We have written a Java application with the Spark library and created simple data transformation jobs. You can run the application locally and see the results in place.</p><p>The great power of Spark is that you can deploy exactly the same application to a cluster of multiple nodes and scale the performance of your application as much as needed to process an arbitrary amount of data.</p>]]></content:encoded></item><item><title><![CDATA[How to run Postgres in Docker]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/08/postgres_in_docker.png" alt loading="lazy"></p>
<p>If you need to run Postgres database for development needs you can just install it manually on local machine. But you should be aware that this procedure will require some level of knowledge about Postgres installation and maintaining procedures.</p>
<p>On the other hand there is much more simple way -</p>]]></description><link>https://blog.zentaly.com/how-to-run-postgres-in-docker/</link><guid isPermaLink="false">5c4e286598d81473d8c36fe5</guid><category><![CDATA[docker]]></category><category><![CDATA[postgres]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Thu, 18 Aug 2016 21:34:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/08/postgres_in_docker.png" alt loading="lazy"></p>
<p>If you need to run a Postgres database for development needs you can just install it manually on your local machine. But you should be aware that this procedure will require some level of knowledge about Postgres installation and maintenance procedures.</p>
<p>On the other hand there is a much simpler way - run the Postgres database inside a Docker container. Docker effectively encapsulates deployment, administration and configuration procedures. So if you want to deploy Postgres locally with minimum effort, Docker is the best choice. All you need to do is start a pre-built Docker container and you will have a Postgres database ready for your service.</p>
<p>Here is my GitHub repo for building a Docker container with an embedded Postgres database: <a href="https://github.com/alexdik/dockerized-postgres">https://github.com/alexdik/dockerized-postgres</a>.</p>
<p>If you don&apos;t have Docker yet, you can download and install it from official site: <a href="https://docs.docker.com/install/">https://docs.docker.com/install/</a></p>
<p>To build the Docker container you will need to run 3 simple commands in a terminal:</p>
<pre><code>git clone https://github.com/alexd84/dockerized-postgres.git
docker build -t postgres dockerized-postgres
docker run -di -p 5432:5432 postgres
</code></pre>
<p>Here you clone the project from GitHub, build the container and launch it.</p>
<p>So now you can verify the Postgres database instance running on your local machine:</p>
<pre><code>telnet 127.0.0.1 5432
Trying 127.0.0.1...
Connected to localhost.
Escape character is &apos;^]&apos;.
</code></pre>
<p>To authorise into the Postgres database, the following default credentials should be used: <strong>postgres</strong>/<strong>postgres</strong>.</p>
<h4 id="howisitworkingforthosewhoactuallycare">How does it work? (for those who actually care)</h4>
<p>The repository you just cloned contains 3 files:</p>
<ul>
<li>Dockerfile</li>
<li>entrypoint.sh</li>
<li>pg_hba.conf</li>
</ul>
<h6 id="dockerfile">Dockerfile</h6>
<p>This is the main file, which instructs Docker to create a new container based on an Ubuntu image, download the Postgres distribution, configure it and hand over to the entrypoint.sh script:</p>
<pre><code>FROM ubuntu:14.04

RUN apt-get update -y
RUN apt-get install -y wget

RUN sudo sh -c &apos;echo &quot;deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main 9.5&quot; &gt;&gt; /etc/apt/sources.list.d/pgdg.list&apos;
RUN wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | sudo apt-key add -

RUN apt-get update -y
RUN apt-get install postgresql-9.5 postgresql-contrib-9.5 -y

RUN mv /etc/postgresql/9.5/main/pg_hba.conf /etc/postgresql/9.5/main/pg_hba.conf.backup
COPY pg_hba.conf /etc/postgresql/9.5/main/pg_hba.conf
RUN echo &quot;listen_addresses = &apos;*&apos;&quot; &gt;&gt; /etc/postgresql/9.5/main/postgresql.conf

EXPOSE 5432

COPY entrypoint.sh /
ENTRYPOINT sudo /entrypoint.sh
</code></pre>
<h6 id="entrypointsh">entrypoint.sh</h6>
<p>Here we start the Postgres database and set the default login and password for access:</p>
<pre><code>sudo service postgresql start
sudo -u postgres psql -c &quot;ALTER USER postgres WITH PASSWORD &apos;postgres&apos;&quot;
tail
</code></pre>
<p>The last <strong>tail</strong> command is needed to ensure that the entrypoint.sh script never completes; this effectively makes the container run in the background, otherwise it would stop execution immediately after startup.</p>
<h6 id="pg_hbaconf">pg_hba.conf</h6>
<pre><code>local   all             postgres                                peer
host    all             all             127.0.0.1/32            md5
host    all             all             0.0.0.0/0               md5

</code></pre>
<p>This file is used exclusively to configure Postgres to accept connections from remote hosts. By default Postgres permits external connections only from localhost, so when running in Docker you can&apos;t access it from your host machine. Thus we add the pg_hba.conf configuration file to permit inbound connections from any machine.</p>
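<p>The <code>0.0.0.0/0</code> rule above is what opens the database to every host, which is convenient for development but very permissive. If you want something tighter, the same record format can be narrowed to a subnet; for example (the subnet below is an assumption - adjust it to your own network), a rule that only accepts password logins from a single local subnet would look like this:</p>

```
# hypothetical tighter rule: only hosts on 192.168.0.0/24 may connect
host    all             all             192.168.0.0/24          md5
```

<p>Postgres evaluates pg_hba.conf records top to bottom and uses the first one that matches, so the narrower rule should replace (not precede) the catch-all one.</p>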
<h4 id="summary">Summary</h4>
<p>So now you have a Postgres database running locally without deep knowledge of its internals. It is also useful to know that there are pre-built Docker containers for almost any kind of application you can imagine, like an email server or a Hadoop cluster. You can effectively rely on them to start complex applications in minutes and save time by avoiding manual installation and configuration.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Persisting 100k messages per second on single server in real-time]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/07/flume_logo-1.jpg" alt loading="lazy"><br>
Beside regular problem of Big Data analysis there is one more complex subtle task - persisting highly intensive data stream in a real-time. Image a scenario when your application cluster generates 100k business transactions per second, each one should be properly processed and written to data storage for further analysis</p>]]></description><link>https://blog.zentaly.com/persisting-100k-messages-per-second/</link><guid isPermaLink="false">5c4e286598d81473d8c36fe4</guid><category><![CDATA[big data]]></category><category><![CDATA[flume]]></category><category><![CDATA[high load]]></category><category><![CDATA[hdfs]]></category><category><![CDATA[streaming]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Thu, 21 Jul 2016 07:31:11 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/07/flume_logo-1.jpg" alt loading="lazy"><br>
Besides the regular problem of Big Data analysis there is one more complex and subtle task - persisting a highly intensive data stream in real time. Imagine a scenario where your application cluster generates 100k business transactions per second; each one should be properly processed and written to data storage for further analysis and business intelligence reporting. Every transaction is also critical, and you cannot lose any, as that would break data integrity constraints.</p>
<p>There are different message broker tools available to address this kind of scenario, like Apache Kafka, Amazon SQS or RabbitMQ. They do their job pretty well, but they tend to be more general purpose and, as such, slightly bloated and time consuming to deploy.</p>
<p>In this article I would like to show how you can solve the real-time data processing task simply and elegantly with Apache Flume.</p>
<h5 id="flumedesign">Flume design</h5>
<p>Apache Flume is a distributed, reliable and highly available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralised data store.</p>
<p>Flume is simple and straightforward, as the name implies. You can configure it with a .properties file and set up complex data flows in a distributed manner. For the sake of our task we will configure Flume to get the maximum real-time throughput out of it. We set it up to act as a message broker between a clustered Java application and Hadoop HDFS raw data storage.</p>
<p>Every Flume data flow should have at least 3 main components: Source, Channel and Sink.</p>
<ul>
<li>Source receives data from external clients</li>
<li>Channel aggregates data and passes it to Sink</li>
<li>Sink is responsible for writing data to external system.</li>
</ul>
<p>Source could be any of the following adapters: Avro, Thrift, JMS or Kafka. Sink supports the HDFS, Avro, Thrift, Hive, HBase, Elastic and Kafka formats.</p>
<p>For our scenario we will configure Flume Server to receive Avro messages from multiple clients and write them to the HDFS filesystem. See the diagram below.<br>
<img src="https://blog.zentaly.com/content/images/2016/07/Flume-diagram.png" alt loading="lazy"><br>
There are 2 main problems to address when trying to persist 100k mes/sec on a single server node:</p>
<ul>
<li>Network/disk throughput (writing to HDFS)</li>
<li>Handling data stream peaks resiliently</li>
</ul>
<h5 id="configureforhighthroughput">Configure for high throughput</h5>
<p>We are going to solve the throughput problem with batch processing, so Flume will aggregate a specific number of messages into a single batch when sending them to HDFS.</p>
<pre><code>agent.sinks.hdfs.hdfs.rollCount = 300000
agent.sinks.hdfs.hdfs.batchSize = 10000
</code></pre>
<p>Here rollCount defines the maximum number of messages which can be saved into a single file on HDFS, and batchSize controls the number of messages written to HDFS per single transaction. The more data we write per transaction to HDFS, the fewer transactions we need, which decreases disk and network load.</p>
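<p>To make the effect of batching concrete, here is a back-of-the-envelope sketch (the 100k messages per second figure is the sustained rate from this article; the class and method names are purely illustrative):</p>

```java
// Back-of-the-envelope check of the batching settings above (a sketch).
public class BatchMath {
    // HDFS transactions needed per second for a given input rate and batch size
    static long transactionsPerSecond(long messagesPerSecond, long batchSize) {
        return messagesPerSecond / batchSize;
    }

    // seconds until rollCount messages accumulate and a new HDFS file is rolled
    static long secondsPerFile(long rollCount, long messagesPerSecond) {
        return rollCount / messagesPerSecond;
    }

    public static void main(String[] args) {
        // batchSize = 10000 turns 100k msg/sec into just 10 HDFS writes per second
        System.out.println(transactionsPerSecond(100_000, 10_000)); // 10
        // rollCount = 300000 rolls a fresh HDFS file roughly every 3 seconds
        System.out.println(secondsPerFile(300_000, 100_000));       // 3
    }
}
```

<p>So at the article&apos;s sustained rate the sink performs 10 HDFS transactions per second instead of 100,000 and starts a new file about every 3 seconds.</p>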
<h5 id="managingdatastreampeaksresiliently">Managing data stream peaks resiliently</h5>
<p>Another problem is data flow peaks: say the throughput rises to 500k mes/sec for some short period of time. We need to tolerate such a scenario by allocating an intermediate store to avoid out of memory exceptions.</p>
<pre><code>agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000
agent.channels.c1.type = memory
</code></pre>
<p>This is achieved by configuring the Channel to store up to 1 million records received from the Source before they are written to the Sink. This capacity is enough to absorb the backlog that builds up when the Source receives records faster than the Sink can write, which makes the entire system behave resiliently under data stream peaks.</p>
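<p>A rough sketch of how long this buffer lasts during a spike, assuming the sink keeps draining at our steady-state 100k messages/sec (both rates are illustrative, taken from the scenario above):</p>

```python
# How many seconds of a 500k msg/sec spike the channel can absorb.
capacity = 1_000_000   # agent.channels.c1.capacity
peak_in = 500_000      # msg/sec arriving during the spike (assumed)
drain = 100_000        # msg/sec the HDFS sink sustains (assumed)
headroom_sec = capacity / (peak_in - drain)
print(headroom_sec)    # channel fills up after this many seconds
```

<p>With these numbers the channel fills at 400k records/sec net, giving about 2.5 seconds of headroom, so this setup tolerates short bursts, not sustained overload.</p>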
<p>The transactionCapacity parameter controls how many records are passed from the Channel to the Sink per transaction (it makes sense to keep it equal to agent.sinks.hdfs.hdfs.batchSize). A higher value means fewer IO operations are required to write the same amount of data.</p>
<p>If the Channel needs to be fault tolerant and preserve messages across system failures, there is an option to use persistent storage on disk by setting agent.channels.c1.type = file.</p>
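<p>A minimal sketch of such a file-backed channel (the directory paths are placeholders for your environment; checkpointDir and dataDirs are the file channel's storage locations):</p>
<pre><code>agent.channels.c1.type = file
agent.channels.c1.checkpointDir = /var/flume/checkpoint
agent.channels.c1.dataDirs = /var/flume/data
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000
</code></pre>
<p>Keep in mind that the file channel trades throughput for durability, since every transaction is persisted to disk before being acknowledged.</p>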
<p>Here is the complete listing of the Flume server configuration:</p>
<pre><code>agent.sources = avro
agent.channels = c1
agent.sinks = hdfs

agent.sources.avro.type = avro
agent.sources.avro.channels = c1
agent.sources.avro.bind = 0.0.0.0
agent.sources.avro.port = 44444
agent.sources.avro.threads = 4

agent.sinks.hdfs.type = hdfs
agent.sinks.hdfs.channel = c1
agent.sinks.hdfs.hdfs.path = hdfs://cdh-master:8020/logs/%{messageType}/%y-%m-%d/%H
agent.sinks.hdfs.hdfs.filePrefix = event
agent.sinks.hdfs.hdfs.rollCount = 300000
agent.sinks.hdfs.hdfs.batchSize = 10000
agent.sinks.hdfs.hdfs.rollInterval = 0
agent.sinks.hdfs.hdfs.rollSize = 0
agent.sinks.hdfs.hdfs.idleTimeout = 60
agent.sinks.hdfs.hdfs.timeZone = UTC

agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000
</code></pre>
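<p>Assuming this listing is saved as flume.conf (the filename is just an example), the agent can be started with the standard flume-ng launcher; note that the --name value must match the "agent" prefix used throughout the properties:</p>
<pre><code>flume-ng agent --conf ./conf --conf-file flume.conf --name agent
</code></pre>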
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How to run Docker]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/05/docker_toolbox-2.png" alt loading="lazy"></p>
<h4 id="thisapproachisdeprecatedasfornowyoucandownloadandinstalldockerfromofficialsite">This approach is deprecated as for now you can download and install docker from official site</h4>
<p><a href="https://docs.docker.com/install/">https://docs.docker.com/install/</a></p>
<p>There is conventional wisdom that Docker is a rather diverse and sophisticated tool, but recent updates have made it much simpler and more convenient to use.</p>
<p>So</p>]]></description><link>https://blog.zentaly.com/how-to-run-docker/</link><guid isPermaLink="false">5c4e286598d81473d8c36fe3</guid><category><![CDATA[docker]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Mon, 23 May 2016 14:40:36 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/05/docker_toolbox-2.png" alt loading="lazy"></p>
<h4 id="thisapproachisdeprecatedasfornowyoucandownloadandinstalldockerfromofficialsite">This approach is deprecated as for now you can download and install docker from official site</h4>
<p><a href="https://docs.docker.com/install/">https://docs.docker.com/install/</a></p>
<p>There is conventional wisdom that Docker is a rather diverse and sophisticated tool, but recent updates have made it much simpler and more convenient to use.</p>
<p>So I guess it is a good time to summarise the design and deployment details for everyone who is starting to work with Docker.</p>
<h6 id="dockerdesignatglance">Docker design at glance</h6>
<p>Docker provides the means to run your application inside an isolated Linux environment, aka a Docker container. A container is a lightweight simulation of a Linux OS, intended to host a single user application and isolate applications from each other.</p>
<p>This is a mental shift from running multiple applications on a Linux OS to running multiple containers hosted on a shared Linux kernel.</p>
<p>This provides multiple advantages:</p>
<ul>
<li>Declarative and reproducible application definition</li>
<li>Simplified and automated application deployment</li>
<li>Isolation of applications from each other</li>
</ul>
<h6 id="howtorundocker">How to run Docker</h6>
<p>There were troubled times when you had to set up Docker differently depending on your host OS. Since the release of Docker Toolbox this process has been standardised.</p>
<p>Let's go step by step through installing and running Docker on the macOS or Windows operating systems.</p>
<h6 id="step1installdocker">Step 1: install Docker</h6>
<p>You should download and install the following applications:</p>
<ol>
<li><a href="https://www.virtualbox.org">Install VirtualBox</a></li>
<li><a href="https://www.docker.com/products/docker-toolbox">Install Docker Toolbox</a></li>
</ol>
<h6 id="step2configuredocker">Step 2: configure Docker</h6>
<p>There is one more concept needed to run Docker: docker-machine. It is responsible for setting up and running Docker containers inside a virtual machine with Linux (since containers can only run on a Linux kernel).</p>
<p>To configure Docker you need to initialise a docker-machine with the following command:</p>
<pre><code>docker-machine create --driver virtualbox default
</code></pre>
<p>This will create a virtual machine with a tiny Linux kernel that will host the Docker containers.</p>
<p>After the docker machine is created, the last step is to set 3 environment variables:</p>
<ol>
<li>DOCKER_TLS_VERIFY</li>
<li>DOCKER_HOST</li>
<li>DOCKER_CERT_PATH</li>
</ol>
<p>To get the values for these variables, run the <code>docker-machine env</code> command. The output should look like the following:</p>
<pre><code>&gt;&gt;Alex$ docker-machine env
export DOCKER_TLS_VERIFY=&quot;1&quot;
export DOCKER_HOST=&quot;tcp://192.168.99.100:2376&quot;
export DOCKER_CERT_PATH=&quot;/Users/Alex/.docker/machine/machines/default&quot;
</code></pre>
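<p>Rather than copying each export line by hand, the same output can be evaluated directly in your shell, which is the usual docker-machine workflow:</p>
<pre><code>eval "$(docker-machine env default)"
</code></pre>
<p>This sets all three variables for the current shell session in one step; it has to be repeated in every new terminal.</p>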
<p>Done!</p>
<p>Now you have Docker up and running. To make sure everything is fine, run the<br>
<code>docker ps</code> command.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>