<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Zentaly Blog]]></title><description><![CDATA[Ideas and solutions born inside our team]]></description><link>https://blog.zentaly.com/</link><image><url>https://blog.zentaly.com/favicon.png</url><title>Zentaly Blog</title><link>https://blog.zentaly.com/</link></image><generator>Ghost 4.43</generator><lastBuildDate>Fri, 10 Apr 2026 13:25:31 GMT</lastBuildDate><atom:link href="https://blog.zentaly.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Top 5 challenges and recipes when starting Apache Spark project]]></title><description><![CDATA[Zentaly team has years of experience building big data business solutions. In this article I would like to share Top 5 challenges and recipes when starting Apache Spark project.]]></description><link>https://blog.zentaly.com/top-5-challenges-and-recipes-when-starting-apache-spark-project/</link><guid isPermaLink="false">639ca9c06934cd06f238917e</guid><category><![CDATA[big data]]></category><category><![CDATA[spark]]></category><category><![CDATA[enterprise]]></category><category><![CDATA[dataanalytics]]></category><category><![CDATA[zentadata]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Fri, 16 Dec 2022 18:05:48 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/12/jakub-skafiriak-AljDaiCbCVY-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/12/jakub-skafiriak-AljDaiCbCVY-unsplash.jpg" alt="Top 5 challenges and recipes when starting Apache Spark project"><p>Apache Spark is one of the most advanced and powerful computation engines for big data analytics. Besides the core data engine, it also provides libraries for streaming (Spark Streaming), machine learning (MLlib) and graph processing (GraphX).</p>
<p>Historically, Spark emerged as a successor of the Hadoop ecosystem with the following key advantages:</p>
<ol>
<li>Spark provides a significant performance improvement by keeping data in memory, while Hadoop relies on slow disk access</li>
<li>Spark has rich programming APIs compared to Hadoop&#x2019;s restrictive map-reduce model</li>
<li>Spark is compatible with Hadoop, so it is possible to run Spark on top of an existing Hadoop ecosystem</li>
</ol>
<p>The Zentaly team has years of experience building Spark-based big data business solutions. In this article I would like to share our knowledge with everyone who is looking forward to starting a new Apache Spark project.</p>
<h1 id="challenges">Challenges</h1>
<p>The content is structured as a series of &#x201C;challenge and recipe&#x201D; pairs to make it easier to follow.</p>
<h2 id="challenge-1-do-we-really-need-spark">Challenge #1:  Do we really need Spark?</h2>
<p>As with most software project kick-offs, the business side might be curious why resources and money should be allocated to a particular project. While the benefits might look quite obvious to software engineers, it always takes extra work to translate them into KPIs/OKRs.</p>
<h5 id="recipe">Recipe</h5>
<p>Translate technical benefits into numeric values (KPIs/OKRs) so that business stakeholders can understand and measure their impact on the business.</p>
<h5 id="examples">Examples</h5>
<ol>
<li>You are going to decrease &#x201C;Shopping Cart Checkout&#x201D; processing time from 2 hours down to 10 minutes per user. In practice it means that users will get better response times and become less disappointed. From the business perspective it means that purchase returns will decrease by 5% and the number of support calls will decrease by 15%, which in total will increase revenue by 3% and decrease costs by 7% (specific numbers must be precisely measured).</li>
<li>You are going to extend ad campaign targeting dimensions to improve served ad quality. Logically it will provide more suitable ads for each particular user. From the business perspective it could be represented as a 5% increase in CTR and CPI rates.</li>
</ol>
<h2 id="challenge-2-we-can-do-it-ourself">Challenge #2:  We can do it ourselves</h2>
<p>&#x201C;As people are building cars and rockets, it&#x2019;s quite obvious that we can do it ourselves&#x201D;. That&#x2019;s exactly what some engineers might say when facing a new technology stack, especially one as complex as Spark.</p>
<h5 id="recipe">Recipe</h5>
<p>Introducing a new framework like Apache Spark, or any other technology, comes with a learning curve. Investing one&#x2019;s own time in learning is the key to building a deep understanding of the technology and improving the team&#x2019;s skills.</p>
<p>So if you are going to start working with Spark, you can find developer volunteers who want to attend specialised trainings, or organise internal knowledge-sharing sessions run by skilled employees.</p>
<p>The more traction you are able to build, the deeper the understanding and the stronger the skill set that will grow inside the development team.</p>
<h2 id="challenge-3-we-decided-to-start-using-spark-how-can-we-integrate-it-with-existing-infrastructure">Challenge #3: We decided to start using Spark, how can we integrate it with existing infrastructure?</h2>
<p>While creating Spark apps is relatively simple (<a href="https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark/">you can read about it in my previous article</a>), running Spark in production and integrating it with existing apps can be quite a challenging task. The reason for this is an old idea born in the Hadoop era: &#x201C;Moving code is cheaper than moving data&#x201D;. As a successor of Hadoop, Spark follows this pattern accordingly.</p>
<p>The &#x201C;moving code&#x201D; concept assumes that a big data application should be built as a standalone module and shipped to a runtime environment (usually a cluster) where it will be executed and managed by that environment in a very specific way. The good reason for this pattern to exist is the high level of complexity that such an environment must address, including the following:</p>
<ul>
<li>Automatic code distribution and execution on thousands of physical servers</li>
<li>Horizontal scalability</li>
<li>High availability</li>
</ul>
<p>As of today, the Spark execution environment, in a very minimalistic form, can be represented with the following diagram:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/12/big-data-layers.svg" class="kg-image" alt="Top 5 challenges and recipes when starting Apache Spark project" loading="lazy" width="1000" height="570"><figcaption>Spark runtime environment</figcaption></figure><!--kg-card-begin: markdown--><p>The variety of tools and platforms involved makes the entire system quite complex. What initially looked like a pretty simple Spark app ends up in real life as a big standalone ecosystem.</p>
<h5 id="recipe-1">Recipe #1</h5>
<p>If you are a big company with a lot of resources, most probably you will end up buying a rather expensive solution from a 3rd-party vendor (e.g. Cloudera or Databricks) or even building this ecosystem yourself.</p>
<p>Unfortunately, integrating the big data stack with existing services additionally requires building a middleware integration layer. Usually this middleware is implemented as one of:</p>
<ul>
<li>Service-to-service communication via REST or, more rarely, SOAP</li>
<li>Service communication via Enterprise Service Bus solutions (RabbitMQ, IBM WebSphere, Kafka, Amazon SQS, etc.)</li>
<li>Data sharing via a plain relational database or even files on distributed filesystem storage</li>
</ul>
<p>Overall this is the most popular approach nowadays, and its cost is the reason big data solutions are so rarely seen in mid-size companies.</p>
<h5 id="recipe-2">Recipe #2</h5>
<p>If your company is not too big and the amount of data you work with is limited, you have a decent option: using Spark as a library inside your existing microservice (this works for any JVM-based language and for Python). In this case you can just include the Spark library as a dependency and start writing big data code directly inside your existing app. It does have the following limitations, though:</p>
<ul>
<li>Data processing will be limited to a single machine where your service is running</li>
<li>Scalability is still possible, but only if you can partition your data into chunks by some criterion (preferably a business domain) and make each service instance responsible for a single chunk</li>
<li>You might need to solve Java dependency conflicts due to the large number of 3rd-party libraries Spark pulls onto the Java classpath (not a problem for PySpark)</li>
</ul>
<p>The combination of simplicity and pragmatism makes this approach quite suitable for some projects, but obviously not all.</p>
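<p>For reference, embedding Spark as a library in a Maven-based JVM service comes down to a single dependency (the artifact below is the standard Spark SQL module; the version is just an example, pick a current one):</p>

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.3.1</version>
</dependency>
```

<p>With this on the classpath, <code>SparkSession.builder().master("local[*]").getOrCreate()</code> gives you a fully functional single-machine Spark session inside the service process.</p>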
<h2 id="challenge-4-troubleshooting-spark-failures">Challenge #4: Troubleshooting Spark failures</h2>
<p>While looking deceptively simple, Spark apps usually process terabytes of data, so even a simple mistake can have a big impact. We can split all Spark failures into two main categories and review them separately.</p>
<h5 id="case-1-application-failures">Case #1: Application failures</h5>
<p>This category includes all types of errors caused by design or development mistakes. Most often they show up as OOM (Out of Memory) errors.</p>
<h5 id="recipe">Recipe</h5>
<p>Spark provides a nice Spark UI app which helps developers understand how data flows between the data processing steps (aka stages). With it you should be able to detect at which step you got the OOM and how much data was processed at that step.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/12/spark-ui.png" class="kg-image" alt="Top 5 challenges and recipes when starting Apache Spark project" loading="lazy" width="2000" height="1464" srcset="https://blog.zentaly.com/content/images/size/w600/2022/12/spark-ui.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/12/spark-ui.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/12/spark-ui.png 1600w, https://blog.zentaly.com/content/images/2022/12/spark-ui.png 2208w" sizes="(min-width: 1200px) 1200px"><figcaption>Spark UI</figcaption></figure><!--kg-card-begin: markdown--><p>By default Spark distributes the data load between cluster nodes quite effectively, so you should not hit a memory bottleneck on a single node. Unfortunately, at the Spark API level you can force operations that cause data to be aggregated on a single node, such as the following:</p>
<ul>
<li><code>dataframe.collect()</code> - loads all data into memory on the driver node</li>
<li><code>dataframe.groupBy()</code> - aggregates all data by key; if the key has low cardinality, the entire dataframe might end up in a single partition and exceed the worker&#x2019;s memory limit</li>
<li><code>dataframe.join()</code> (left/right/outer) - an invalid join key can cause significant data multiplication. While this is purely a business logic error, it surfaces as an OOM, which makes it hard to investigate</li>
</ul>
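<p>The <code>groupBy</code> skew problem can be illustrated without Spark at all. The sketch below is plain Java (illustrative only; <code>SkewDemo</code> and its methods are made up for this article, not Spark API). It applies the same hash-partitioning idea Spark uses by default and shows how a single dominant key sends the whole dataset into one partition:</p>

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

public class SkewDemo {
    // Same idea as Spark's default hash partitioner:
    // a record goes to partition hash(key) mod numPartitions.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(Objects.hashCode(key), numPartitions);
    }

    // Count how many records land in each partition.
    static Map<Integer, Long> distribution(List<String> keys, int numPartitions) {
        return keys.stream().collect(Collectors.groupingBy(
                k -> partitionFor(k, numPartitions),
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // 1,000,000 events, but the grouping key takes a single value:
        // every record hashes to the same partition, so one executor
        // holds the entire dataset and is the first to go OOM.
        List<String> skewedKeys = Collections.nCopies(1_000_000, "view");
        System.out.println(distribution(skewedKeys, 8)); // one bucket gets all records
    }
}
```

<p>Typical mitigations are choosing a more selective grouping key or "salting" the key with a random suffix before the wide aggregation.</p>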
<h5 id="case-2-environment-failures">Case #2: Environment failures</h5>
<p>By design, the Spark runtime environment is a cluster deployed across multiple physical nodes (servers). Each node runs multiple Java processes communicating with each other. The following process types form the cluster topology:</p>
<ul>
<li><strong>Cluster Manager</strong> - cluster-wide process managing the entire Spark cluster (aka Master)</li>
<li><strong>Worker</strong> - node-level process responsible for app execution, spawning multiple Executors</li>
<li><strong>Driver</strong> - application-level process responsible for converting your code into multiple tasks</li>
<li><strong>App Manager</strong> - application-level process responsible for resource negotiation with the Cluster Manager (might run in the same JVM as the Driver)</li>
<li><strong>Executor</strong> - application-level processes (multiple) responsible for task execution</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/12/spark-cluster-topology.svg" class="kg-image" alt="Top 5 challenges and recipes when starting Apache Spark project" loading="lazy" width="538" height="614"><figcaption>Spark Cluster Topology</figcaption></figure><!--kg-card-begin: markdown--><p>Each process is a standalone Java app with specific configuration settings, which must be tuned for optimal execution on the provided hardware resources.</p>
<p>Any misconfiguration of a particular process might make the cluster unhealthy while running big data loads. The full configuration list is out of scope for this article and is quite an involved topic in itself.</p>
<h5 id="recipe-1">Recipe #1</h5>
<p>Cluster configuration is highly challenging, so it is no surprise that specialised solutions have been created: commercial ones (Databricks, Cloudera) and open source (Apache Ambari).</p>
<h5 id="recipe-2">Recipe #2</h5>
<p>You can also decide to manage your cluster configuration yourself. In this case you should expect a really long journey before it becomes stable enough for production use. Depending on the team&#x2019;s expertise, it can take from one month to a year.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="challenge-5-data-security">Challenge #5: Data security</h2>
<p>The last but not least challenge is data security. With CCPA and GDPR regulations arriving in the enterprise world, data security is a must nowadays. Open research has shown that 49% of enterprises have faced data security issues in their big data projects.</p>
<p>Out of the box, Spark has no solution for securing sensitive user data.</p>
<h5 id="recipe">Recipe</h5>
<p>To support data security, it is required to take this challenge seriously right at project kick-off. The architecture design strategy should treat data security as a first-class citizen. Deferring data security to a later stage might require an entire system rewrite and waste a lot of resources.</p>
<p>The design approach should take into account multiple data security requirements and address them accordingly:</p>
<ul>
<li>Data removal - how will data be removed on a user&#x2019;s request?</li>
<li>Data cleanup - how will the system identify and remove sensitive user information?</li>
<li>Data anonymisation/pseudonymisation - how will users&#x2019; data be secured without losing the ability to analyse it?</li>
</ul>
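<p>As a minimal illustration of the last point, pseudonymisation can be as simple as replacing a direct identifier with a salted hash, so records remain joinable for analytics without storing the raw value. The sketch below is plain, hypothetical Java (not Spark or Zentadata API):</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Pseudonymizer {
    // Replace a direct identifier with a salted SHA-256 hash. The same
    // (salt, id) pair always yields the same token, so joins and
    // aggregations still work, but the raw id is not stored.
    static String pseudonymize(String userId, String secretSalt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(secretSalt.getBytes(StandardCharsets.UTF_8));
            md.update(userId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

<p>The salt must live in a secret store; note that proper anonymisation is a much broader topic (k-anonymity, differential privacy) than this simple token substitution.</p>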
<h1 id="summary">Summary</h1>
<p>We have reviewed the most critical challenges you might face when starting a new Apache Spark project, together with practical recipes proven to work by our long-term experience.</p>
<p>Nevertheless, we highly recommend <a href="https://zentadata.com">trying our Zentadata Platform for free</a>; it addresses all of these technical challenges and provides a lot of extra features, such as data security and the ability to run scalable big data jobs from within a standalone microservice.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How to work with Big Data from Java Spring applications]]></title><description><![CDATA[This article shows how to get around all tough stuff related to Big Data infrastructure, work with data fast and comfortably, without thinking about code deployment, keeping focused on business goals and getting things done as quickly as possible.]]></description><link>https://blog.zentaly.com/how-to-work-with-big-data-from-java-spring-applications/</link><guid isPermaLink="false">6370e52fd6c5d80707b24268</guid><category><![CDATA[java]]></category><category><![CDATA[spring]]></category><category><![CDATA[big data]]></category><category><![CDATA[enterprise]]></category><category><![CDATA[zentadata]]></category><category><![CDATA[dataanalytics]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Wed, 16 Nov 2022 12:32:24 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/11/spring-big-data.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/11/spring-big-data.jpg" alt="How to work with Big Data from Java Spring applications"><p>This article shows how to get around all the tough stuff related to Big Data infrastructure and how to work with data fast and comfortably, without thinking about code deployment, staying focused on business goals and getting things done as quickly as possible. The Zentadata Platform is the answer: it ships with an SDK that can simply be added to your Java application, so you can start developing Big Data applications right away.</p>
<p>For demo purposes we will build a pet Spring Boot microservice exposing a few REST endpoints, each of which triggers a different big data task on the Zentadata Cluster.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/spring-boot-diagram.svg" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="853" height="560"><figcaption>Java Spring microservice runs Big Data jobs with Zentadata SDK</figcaption></figure><!--kg-card-begin: markdown--><p>In our previous article related to Data Studio you can also read about how to work with multiple data sources and files, and how to join them together: <a href="https://blog.zentaly.com/how-to-use-zentadata-data-studio-to-work-with-data/">Data analytics for everyone with Zentadata Data Studio</a>.</p>
<p><strong>Note:</strong> to install local developer cluster on your machine please read our <a href="https://blog.zentaly.com/how-to-install-and-run-zentadata-developer-edition/">Quick start guide &#x1F680;</a> article.</p>
<h1 id="data-analytics-use-caseretailrocket-dataset">Data analytics use case - RetailRocket dataset</h1>
<p>We will start with some data first. I have found the public dataset <a href="https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset">https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset</a>, which looked interesting to me. It contains a reasonable amount of data, so you can play with it locally using the Zentadata Developer Edition. There is also information about the data structure and a description on the mentioned site, but for demo purposes and simplicity I will take the <code>events.csv</code> file and work with it.</p>
<p>Alright, so let&#x2019;s take a look at the <code>events.csv</code> file; I will do it from Data Studio.</p>
<pre><code>DataFrame events = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(&quot;file:/home/dmch/datasets/RetailRocket/events.csv&quot;);
 
events.limit(5).execute().show();

OUTPUT:
+-------------+---------+-----+------+-------------+
|timestamp    |visitorid|event|itemid|transactionid|
+-------------+---------+-----+------+-------------+
|1433221332117|257597   |view |355908|null         |
|1433224214164|992329   |view |248676|null         |
|1433221999827|111016   |view |318965|null         |
|1433221955914|483717   |view |253185|null         |
|1433221337106|951259   |view |367447|null         |
+-------------+---------+-----+------+-------------+
</code></pre>
<p>As you can see, and as per the description on the website, this is user behaviour data: events like views, add-to-carts and transactions represent interactions that were collected over a period of 4.5 months.</p>
<p>Now I need to build a Spring Boot microservice with 4 REST endpoints that allow me to query events and create different report types in my Postgres DB based on the data we have:</p>
<ul>
<li><code>/api/events/{limit}</code> - gets a given number of events and sends them back as a response</li>
<li><code>/api/events/user/{userId}</code> - gets all events of a specific visitor</li>
<li><code>/api/events/daily-report</code> - generates a daily report for a particular date</li>
<li><code>/api/events/sales-report</code> - generates a sales report for a time interval; for this one the data will be partitioned by date, which is close to a real-life scenario with a lot of data</li>
</ul>
<h1 id="create-spring-boot-project">Create Spring Boot project</h1>
<p>Let&#x2019;s go to <a href="https://start.spring.io/">https://start.spring.io/</a>, pick the dependencies we need and create a project. Here is my setup:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/spring-boot-start-page-1.png" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="1510" height="760" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/spring-boot-start-page-1.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/spring-boot-start-page-1.png 1000w, https://blog.zentaly.com/content/images/2022/11/spring-boot-start-page-1.png 1510w" sizes="(min-width: 1200px) 1200px"><figcaption>Spring Boot starter</figcaption></figure><!--kg-card-begin: markdown--><p>After the project is downloaded, let&#x2019;s add dependency to the Zentadata client library.</p>
<pre><code>&lt;repositories&gt;
    &lt;repository&gt;
        &lt;id&gt;Zentaly&lt;/id&gt;
        &lt;url&gt;https://libs.zentaly.com&lt;/url&gt;
    &lt;/repository&gt;
&lt;/repositories&gt;

&lt;dependency&gt;
    &lt;groupId&gt;com.zentadata&lt;/groupId&gt;
    &lt;artifactId&gt;client&lt;/artifactId&gt;
    &lt;version&gt;0.3&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>Now comes pretty familiar stuff for every Java developer who has ever worked with Spring Boot: let&#x2019;s create a Controller, a Service, etc. Here is my <code>ReportsController</code>:</p>
<pre><code>@RestController
@RequestMapping(&quot;/api&quot;)
public class ReportsController {
    private final ReportService reportService;

    public ReportsController(ReportService reportService) {
        this.reportService = reportService;
    }

    @GetMapping(&quot;/events/{limit}&quot;)
    public List&lt;Event&gt; getEvents(@PathVariable int limit) {
        return reportService.getEvents(limit);
    }

    @PostMapping(&quot;/events/user/{userId}&quot;)
    public List&lt;Event&gt; getEventsByUserId(@PathVariable Long userId) {
        return reportService.getEventsByUserId(userId);
    }

    @PostMapping(&quot;/events/daily-report&quot;)
    public void generateDailyReport(@RequestParam @DateTimeFormat(pattern = &quot;yyyy-MM-dd&quot;) Date date) {
        reportService.generateDailyReport(date);
    }

    @PostMapping(&quot;/events/sales-report&quot;)
    public void generateSalesReport(
            @RequestParam @DateTimeFormat(pattern = &quot;yyyy-MM-dd&quot;) Date fromDate,
            @RequestParam @DateTimeFormat(pattern = &quot;yyyy-MM-dd&quot;) Date toDate) {
        reportService.generateSalesReport(fromDate, toDate);
    }
}
</code></pre>
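<p>The <code>Event</code> payload class itself is not shown in the article; a plausible minimal shape matching the <code>events.csv</code> columns (the field names follow the CSV header, but the exact types are my assumption, not the article&#x2019;s actual code) could be:</p>

```java
// Hypothetical DTO mirroring the events.csv columns. In the real service
// the timestamp is serialized as an ISO-8601 date in JSON responses,
// which hints at a java.util.Date field there; epoch millis are used
// here to keep the sketch dependency-free.
public record Event(
        long timestamp,      // epoch milliseconds
        long visitorid,
        String event,        // "view" | "addtocart" | "transaction"
        long itemid,
        Long transactionid   // null for non-transaction events
) {}
```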
<p>Now we can do some real coding and implement each method of <code>ReportService</code> one by one.</p>
<h3 id="case-1get-first-n-events-from-csv-file">Case #1 - get first N events from CSV file</h3>
<p>We start with the simplest one - the <code>getEvents</code> method. We simply need to get a number of records (limited by a request parameter) from our events file and return them as a response. It&#x2019;s quite simple.<br>
First I need to specify the Zentadata Cluster connection properties (host, port, authentication details if needed, etc.).</p>
<pre><code>ZenSession zen = new ZenSession(&quot;localhost&quot;, 8090);
</code></pre>
<p>Next we define the DataFrame: where it should take the data from and how to transform it if needed.</p>
<pre><code>DataFrame events = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(&quot;file:/home/dmch/datasets/RetailRocket/events.csv&quot;);
</code></pre>
<p>And here is the implementation of the <code>getEvents</code> method, which simply collects the data and returns it as the server response.</p>
<pre><code>public List&lt;Event&gt; getEvents(int limit) {
    return events
            .limit(limit)
            .execute()
            .getPayload(Event.class);
}
</code></pre>
<p>So that&#x2019;s it, let&#x2019;s start our app, go to the terminal and check the result.</p>
<pre><code>curl -s -X GET &quot;http://localhost:9090/api/events/2&quot; | jsonpp
[
  {
    &quot;timestamp&quot;: &quot;2015-06-02T05:02:12.117+00:00&quot;,
    &quot;visitorid&quot;: 257597,
    &quot;event&quot;: &quot;view&quot;,
    &quot;itemid&quot;: 355908,
    &quot;transactionid&quot;: 0
  },
  {
    &quot;timestamp&quot;: &quot;2015-06-02T05:50:14.164+00:00&quot;,
    &quot;visitorid&quot;: 992329,
    &quot;event&quot;: &quot;view&quot;,
    &quot;itemid&quot;: 248676,
    &quot;transactionid&quot;: 0
  }
]
</code></pre>
<h3 id="case-2get-all-events-for-a-specific-visitor">Case #2 - get all events for a specific visitor</h3>
<p>Now we want to limit events to the ones that belong to a specific user.<br>
This can be done using the <code>where</code> operator (pretty similar to SQL).</p>
<pre><code>public List&lt;Event&gt; getEventsByUserId(Long userId) {
    return events
            .where(col(&quot;visitorid&quot;).equalTo(userId))
            .execute()
            .getPayload(Event.class);
}
</code></pre>
<p><strong>Note:</strong> please notice that we use the <code>equalTo</code> operator for comparison and not the standard <code>==</code> operator. In fact our Java application <strong>defines</strong> the data processing job with this DSL statement, while the actual heavy <strong>data processing</strong> runs on the Zentadata Cluster.</p>
<h3 id="case-3generate-events-daily-report">Case #3 - generate events daily report</h3>
<p>Let&#x2019;s do something different now, but first a couple of words about how the Zentadata Platform works. As you might assume, the volume of data can be huge, so you will not be able to download it into the Java heap and manipulate it there. You would simply run out of memory and your application would fail.</p>
<p>That&#x2019;s why all data processing is performed on the Zentadata Cluster. Our Java microservice can run on a tiny server with a small amount of CPU and RAM, yet process terabytes of data on the cluster.</p>
<p>In our case, generating a daily report might produce a huge amount of data. So instead of loading the data into the Java heap, I will store it in a relational database. Later that report can be used by a UI or other downstream consumers, which is a pretty standard approach.</p>
<p>Again, using the Zentadata SQL-like DSL, it is pretty simple and straightforward:</p>
<pre><code>public void generateDailyReport(Date date) {
    long from = date.getTime();
    long to = date.getTime() + 24 * 60 * 60 * 1000;
    DataFrame report = events.where(col(&quot;timestamp&quot;).geq(from)
            .and(col(&quot;timestamp&quot;).leq(to)));

    report.write(&quot;postgres&quot;)
            .option(SaveMode.KEY, SaveMode.OVERWRITE)
            .to(format(&quot;daily_report_%s&quot;, dateFormatForTableName.format(date)));
}
</code></pre>
<p>That&#x2019;s actually it: I have added one <code>where</code> clause, and as a second step we <code>write</code> the selected data to the Postgres DB, into the <code>daily_report_&lt;date&gt;</code> table. If that table does not exist it will be created automatically, and each re-run of this operation (triggering the REST endpoint in our case) will overwrite the data.</p>
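<p>A side note on the time window: <code>date.getTime() + 24 * 60 * 60 * 1000</code> works, but the same boundaries can be computed more explicitly with <code>java.time</code> (a plain-Java sketch; the dataset&#x2019;s <code>timestamp</code> column is epoch milliseconds, assumed UTC here):</p>

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DayWindow {
    // Epoch-millis boundaries [from, to) of one UTC calendar day.
    static long[] window(String isoDate) {
        long from = LocalDate.parse(isoDate)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
        long to = from + Duration.ofDays(1).toMillis();
        return new long[] {from, to};
    }

    public static void main(String[] args) {
        long[] w = window("2015-06-02");
        // The first event shown earlier (timestamp 1433221332117)
        // falls inside this window.
        System.out.println(w[0] + " .. " + w[1]);
    }
}
```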
<p>Let&#x2019;s call the <code>/api/events/daily-report</code> endpoint and provide a report date. After successful execution we will find a new table, <code>daily_report_2015_06_02</code>, created in Postgres.</p>
<pre><code>curl -X POST &quot;http://localhost:9090/api/events/daily-report&quot; -d &quot;date=2015-06-02&quot;
</code></pre>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-20.48.11.png" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="1876" height="1088" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/Screenshot-2022-11-13-at-20.48.11.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/Screenshot-2022-11-13-at-20.48.11.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/11/Screenshot-2022-11-13-at-20.48.11.png 1600w, https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-20.48.11.png 1876w" sizes="(min-width: 720px) 720px"><figcaption>Daily report table created in postgres</figcaption></figure><!--kg-card-begin: markdown--><h3 id="case-4generate-total-sales-report">Case #4 - generate total sales report</h3>
<p>Now let&#x2019;s take a look at a more realistic example of data processing. I did some preparation on the mentioned dataset and partitioned <code>events.csv</code> by date. Now I have a folder for each date, as on the screenshot below. As a result we have data for 139 days: around 3 million events and 1.5 million unique visitors.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-21.14.55.png" class="kg-image" alt="How to work with Big Data from Java Spring applications" loading="lazy" width="1760" height="1210" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/Screenshot-2022-11-13-at-21.14.55.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/Screenshot-2022-11-13-at-21.14.55.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/11/Screenshot-2022-11-13-at-21.14.55.png 1600w, https://blog.zentaly.com/content/images/2022/11/Screenshot-2022-11-13-at-21.14.55.png 1760w" sizes="(min-width: 720px) 720px"><figcaption>Events partitioned by date</figcaption></figure><!--kg-card-begin: markdown--><p><strong>Note:</strong> In real life, partitioning is a very important technique for dealing with big data. Some of our clients generate terabytes of data every day in distributed filesystems like <code>HDFS</code> or <code>AWS S3</code>. Storing data in separate folders helps decrease the amount of data to be processed when we only need to analyse a limited time window (day/week/month). Otherwise, if the data is stored as a single folder, we have to do a full scan even when only one specific day should be analysed.</p>
<p>In our case we can mimic partitioning using the local filesystem, as the approach is the same as with enterprise-grade distributed data storage.</p>
<pre><code>public void generateSalesReport(Date fromDate, Date toDate) {
    new Thread(() -&gt; {
        long from = fromDate.getTime();
        long to = toDate.getTime();
        DataFrame report = zen
                .read(&quot;localfs&quot;)
                .format(DataFormat.CSV)
                .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
                .from(&quot;file:/home/dmch/datasets/RetailRocket/events/*/*.csv&quot;)
                .where(col(&quot;event&quot;).equalTo(&quot;transaction&quot;)
                        .and(col(&quot;timestamp&quot;).geq(from))
                        .and(col(&quot;timestamp&quot;).leq(to)))
                .select(col(&quot;timestamp&quot;),
                        col(&quot;visitorid&quot;),
                        col(&quot;itemid&quot;));
        report.write(&quot;postgres&quot;)
                .option(SaveMode.KEY, SaveMode.OVERWRITE)
                .to(format(&quot;sales_report_%s_%s&quot;,
                        dateFormatForTableName.format(from),
                        dateFormatForTableName.format(to)));
    }).start();
}
</code></pre>
<p>From the application developer&#x2019;s perspective, it does not matter how the data is partitioned or where it is stored: the code remains almost the same as long as the data has a similar format.</p>
<p>You only need to take the performance into consideration, which will depend on the partitioning approach and the amount of data. This query might take a while to execute, so we trigger the job in a separate thread and return the REST response right away, saying that the job has been submitted for execution. After our REST endpoint is triggered, the work is done by the cluster and we get the <code>sales_report_*</code> table created.</p>
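<p>A raw thread is fine for a demo, but a bounded executor is a safer way to implement this fire-and-forget pattern in a real service; a minimal plain-Java sketch (the executor setup and names here are our own illustration, not part of Zentadata):</p>

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReportTrigger {
    // Bounded pool instead of raw threads: many concurrent report requests
    // queue up rather than exhaust the machine. Daemon threads let the JVM
    // exit cleanly even if a job is still queued.
    static final ExecutorService jobs = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Returns immediately; the heavy query runs on the pool in the background.
    static String submitReport(Runnable reportJob) {
        jobs.submit(reportJob);
        return "Job submitted for execution";
    }

    public static void main(String[] args) {
        // the Zentadata query from the snippet above would be the Runnable here
        System.out.println(submitReport(() -> { /* long-running report */ }));
    }
}
```

<p>The REST handler then only has to call <code>submitReport</code> and echo the returned status to the client.</p>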
<p>Please note that you can point to data on the file system using glob patterns. That is exactly how we specified that we are going to read all CSV files inside all subfolders of the events folder.</p>
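<p>The same glob semantics can be tried out with the standard Java NIO <code>PathMatcher</code> (a small standalone illustration, independent of Zentadata): <code>*</code> matches within a single folder level, so <code>*/*.csv</code> picks up CSV files exactly one partition folder deep.</p>

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobDemo {
    public static void main(String[] args) {
        // "*/*.csv" matches a CSV file exactly one folder deep,
        // e.g. inside a date partition folder like 2015-06-01/
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:*/*.csv");
        System.out.println(matcher.matches(Path.of("2015-06-01/events.csv"))); // true
        System.out.println(matcher.matches(Path.of("events.csv")));            // false: not in a subfolder
        System.out.println(matcher.matches(Path.of("2015/06/events.csv")));    // false: two levels deep
    }
}
```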
<h1 id="summary">Summary</h1>
<p>We went through a few cases that should give you an understanding of how the platform lets you work with Big Data using the standard Java and Spring stack. What is more, there is no need to build any sort of pipeline to ship the code to the cluster: you just grab the client library, configure it once and work with your data right away.</p>
<p>Of course, there are many other business cases and scenarios that can be covered thanks to the flexibility of the platform design and its available features. We are going to keep writing about them, so you won&#x2019;t have any doubts about getting started with Zentadata.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Data analytics for everyone with Zentadata Data Studio]]></title><description><![CDATA[This article is addressed to business analysts, data scientists, quality engineers and developers; in other words, to people who work with data.]]></description><link>https://blog.zentaly.com/how-to-use-zentadata-data-studio-to-work-with-data/</link><guid isPermaLink="false">634c0dbb80279a54a6ba658c</guid><category><![CDATA[big data]]></category><category><![CDATA[zentadata]]></category><category><![CDATA[dataanalytics]]></category><dc:creator><![CDATA[Dmytro Chaplai]]></dc:creator><pubDate>Thu, 03 Nov 2022 14:34:00 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/stephen-dawson-qwtCeJ5cLYs-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/10/stephen-dawson-qwtCeJ5cLYs-unsplash.jpg" alt="Data analytics for everyone with Zentadata Data Studio"><p>This article is addressed to a wider audience of business analysts, data scientists, quality engineers and developers; in other words, to people who work with data, perform analysis, build reports and so on.</p>
<p>There are a lot of real business cases that can be easily solved with the Zentadata platform. Today we are going to take a closer look at the Data Studio application, which is shipped together with Zentadata. It is a simple, user-friendly desktop application that runs locally on all popular platforms: macOS, Windows and Linux.</p>
<p><strong>Note:</strong> to install Zentadata Data Studio on your local machine please read our <a href="https://blog.zentaly.com/how-to-install-and-run-zentadata-developer-edition/">Quick start guide &#x1F680;</a> article.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h1 id="data-analytics-use-casemovielens-dataset">Data analytics use case - MovieLens Dataset</h1>
<p>We are going to work with this dataset  <a href="https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset">https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset</a>, so you can download it and follow all the steps from this article.</p>
<p>First of all, let&#x2019;s take a closer look at the dataset. It consists of a few big CSV files, some of which contain more than 1M records, so it is not even possible to open them completely in MS Excel.<br>
The data is also spread across several files that are related to each other.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/11/movielens-dataset.png" class="kg-image" alt="Data analytics for everyone with Zentadata Data Studio" loading="lazy" width="1448" height="784" srcset="https://blog.zentaly.com/content/images/size/w600/2022/11/movielens-dataset.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/11/movielens-dataset.png 1000w, https://blog.zentaly.com/content/images/2022/11/movielens-dataset.png 1448w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><p>The Zentadata platform with Data Studio is a good set of tools that allows us to analyze those files very quickly, without any additional data preparation. You can also work with a number of different data formats, CSV being only one of them. What is more, the data can be spread across multiple files and folders, or partitioned by datetime. It does not really matter: for us it is represented as a single DataFrame object (a table in RDBMS terms).</p>
<p><strong>Note:</strong> as for dataset size limitations:</p>
<ul>
<li>If you run <code>Zentadata Developer Edition</code> you can process up to hundreds of GB of data, with performance depending on your machine&#x2019;s hardware</li>
<li>In the case of <code>Zentadata Enterprise</code> there is no limit at all: being infinitely scalable, it can process as much data as the business needs</li>
</ul>
<p>Alright, let&#x2019;s get started with a few test cases that will give you an understanding of the platform&#x2019;s flexibility and power.</p>
<h3 id="case-1join-2-csv-files-and-calculate-average-movie-rating">Case #1 - join 2 csv files and calculate average movie rating</h3>
<p>The dataset we have downloaded contains the files <code>movie.csv</code> and <code>rating.csv</code>. They are related by the movieId field, and each rating was set by a specific user. Now we want to calculate the average rating for each movie in the list.</p>
<p>First of all, let&apos;s open the movie and rating files and verify the data. For that we will add 2 DataFrames and specify the source of the data.</p>
<pre><code>String ROOT_PATH = &quot;file:/Users/alex/Downloads&quot;;

DataFrame movies = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(ROOT_PATH + &quot;/MovieLens/movie.csv&quot;);

DataFrame ratings = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(ROOT_PATH + &quot;/MovieLens/rating.csv&quot;);
</code></pre>
<p>Now let&#x2019;s take the first 5 records from each DataFrame and display them on screen:</p>
<pre><code>movies.limit(5).execute().show();
ratings.limit(5).execute().show();

OUTPUT:
+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
+-------+----------------------------------+-------------------------------------------+
+------+-------+------+-------------------+
|userId|movieId|rating|timestamp          |
+------+-------+------+-------------------+
|1     |2      |3.5   |2005-04-02 23:53:47|
|1     |29     |3.5   |2005-04-02 23:31:16|
|1     |32     |3.5   |2005-04-02 23:33:39|
|1     |47     |3.5   |2005-04-02 23:32:07|
|1     |50     |3.5   |2005-04-02 23:29:40|
+------+-------+------+-------------------+
</code></pre>
<p>Writing queries in the Zentadata query language is pretty straightforward and similar to writing standard SQL. First we specify how the data frames are joined, then what we want to get out of the dataset, and finally how the data is grouped.</p>
<pre><code>movies
   .join(ratings, ratings.col(&quot;movieId&quot;).equalTo(movies.col(&quot;movieId&quot;)))
   .select(
           movies.col(&quot;movieId&quot;),
           movies.col(&quot;title&quot;),
           ratings.col(&quot;rating&quot;)
       )
   .groupBy(list(col(&quot;movieId&quot;), col(&quot;title&quot;)), avg(col(&quot;rating&quot;)).as(&quot;average rating&quot;))
   .limit(5)
   .execute().show();
   
OUTPUT:
+-------+------------------------------------------------------+------------------+
|movieId|title                                                 |average rating    |
+-------+------------------------------------------------------+------------------+
|4027   |O Brother, Where Art Thou? (2000)                     |3.891130068348836 |
|7153   |Lord of the Rings: The Return of the King, The (2003) |4.14238211356367  |
|2951   |Fistful of Dollars, A (Per un pugno di dollari) (1964)|3.9353664087391897|
|4995   |Beautiful Mind, A (2001)                              |3.91974830149104  |
|4015   |Dude, Where&apos;s My Car? (2000)                          |2.5065868263473052|
+-------+------------------------------------------------------+------------------+
</code></pre>
<p>So, just a few lines and the task is completed. Please also note that the listed query limits the output to 5 records; you can change or remove this limit if needed.</p>
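<p>Conceptually, the join-and-aggregate above does the same work as this plain-Java sketch over two small in-memory datasets (a toy illustration with our own helper names, not the Zentadata API): join ratings to titles on movieId, then group by title and average.</p>

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AvgRating {

    // ratings are (movieId, rating) pairs; join them with the movie titles
    // and average per title, mirroring the join/groupBy/avg query above
    static Map<String, Double> averageByTitle(Map<Integer, String> titleById,
                                              List<Map.Entry<Integer, Double>> ratings) {
        return ratings.stream().collect(Collectors.groupingBy(
                e -> titleById.get(e.getKey()),              // join on movieId
                Collectors.averagingDouble(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<Integer, String> movies = Map.of(
                1, "Toy Story (1995)",
                2, "Jumanji (1995)");
        List<Map.Entry<Integer, Double>> ratings = List.of(
                Map.entry(2, 3.5), Map.entry(2, 4.5), Map.entry(1, 5.0));

        System.out.println(averageByTitle(movies, ratings));
    }
}
```

<p>The difference, of course, is that the platform runs the same logic on files far too large to hold in memory.</p>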
<h3 id="case-2sorting-and-and-where-clause">Case #2 - sorting and where clause</h3>
<p>Now we would like to find the most popular movies, those with an average rating higher than 4.</p>
<p>Let&#x2019;s create a new DataFrame and call it <code>mostPopularMovies</code>, then add 2 lines with the where and sort clauses.</p>
<pre><code>DataFrame mostPopularMovies = movies
   .join(ratings, ratings.col(&quot;movieId&quot;).equalTo(movies.col(&quot;movieId&quot;)))
   .select(
       movies.col(&quot;movieId&quot;),
       movies.col(&quot;title&quot;),
       ratings.col(&quot;rating&quot;)
   )
   .groupBy(list(col(&quot;movieId&quot;), col(&quot;title&quot;)), avg(col(&quot;rating&quot;)).as(&quot;average rating&quot;))
   .where(col(&quot;average rating&quot;).gt(lit(&quot;4&quot;)))
   .sort(col(&quot;average rating&quot;));
 
mostPopularMovies.limit(5).execute().show();

OUTPUT:
+-------+----------------------------------------------------------------+-----------------+
|movieId|title                                                           |average rating   |
+-------+----------------------------------------------------------------+-----------------+
|91529  |Dark Knight Rises, The (2012)                                   |4.00020964360587 |
|76093  |How to Train Your Dragon (2010)                                 |4.000420079815165|
|1952   |Midnight Cowboy (1969)                                          |4.000634719136782|
|7096   |Rivers and Tides (2001)                                         |4.001824817518248|
|4928   |That Obscure Object of Desire (Cet obscur objet du d&#xE9;sir) (1977)|4.003125         |
+-------+----------------------------------------------------------------+-----------------+
</code></pre>
<p>And that&#x2019;s it.</p>
<h3 id="case-3filtering-by-tags">Case #3 - filtering by tags</h3>
<p>The next step will be selecting movies by tags. These tags have been assigned to each movie by users, and they are stored in a separate file. So first of all, let&#x2019;s look into the <code>tag.csv</code> file and investigate its structure.</p>
<pre><code>DataFrame tags = zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.CSV)
    .option(CsvOpts.HAS_HEADER, &quot;true&quot;)
    .from(ROOT_PATH + &quot;/MovieLens/tag.csv&quot;);
 
tags.limit(5).execute().show();

OUTPUT:
+------+-------+-------------+-------------------+
|userId|movieId|tag          |timestamp          |
+------+-------+-------------+-------------------+
|18    |4141   |Mark Waters  |2009-04-24 18:19:40|
|65    |208    |dark hero    |2013-05-10 01:41:18|
|65    |353    |dark hero    |2013-05-10 01:41:19|
|65    |521    |noir thriller|2013-05-10 01:39:43|
|65    |592    |dark hero    |2013-05-10 01:41:18|
+------+-------+-------------+-------------------+
</code></pre>
<p>It also references movies by id and contains a bunch of tags placed by users, so let&#x2019;s join it with the most popular movies and try to find movies whose tags contain <code>Comedy</code>.</p>
<pre><code>mostPopularMovies
.join(tags, tags.col(&quot;movieId&quot;).equalTo(mostPopularMovies.col(&quot;movieId&quot;)))
.select(
   mostPopularMovies.col(&quot;movieId&quot;),
   mostPopularMovies.col(&quot;title&quot;),
   mostPopularMovies.col(&quot;average rating&quot;),
   tags.col(&quot;tag&quot;)
 )
.where(col(&quot;tag&quot;).contains(lit(&quot;Comedy&quot;)))
.limit(5)
.execute().show();

OUTPUT:
+-------+--------------------------------------+------------------+----------------+
|movieId|title                                 |average rating    |tag             |
+-------+--------------------------------------+------------------+----------------+
|356    |Forrest Gump (1994)                   |4.029000181345584 |Classic Comedy  |
|356    |Forrest Gump (1994)                   |4.029000181345584 |Comedy          |
|1136   |Monty Python and the Holy Grail (1975)|4.174146075581396 |Classic Comedy  |
|910    |Some Like It Hot (1959)               |4.082677165354331 |Comedy          |
|951    |His Girl Friday (1940)                |4.1529984623270115|Screwball Comedy|
+-------+--------------------------------------+------------------+----------------+
</code></pre>
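<p>Note that the <code>contains</code> filter performs a substring match rather than an equality check, which is why tags like <code>Classic Comedy</code> and <code>Screwball Comedy</code> also appear in the output. A minimal plain-Java illustration of the same test:</p>

```java
public class TagFilter {
    public static void main(String[] args) {
        String[] tags = {"Classic Comedy", "Comedy", "dark hero", "Screwball Comedy"};
        for (String tag : tags) {
            if (tag.contains("Comedy")) {   // substring match, not equality
                System.out.println(tag);
            }
        }
        // prints: Classic Comedy, Comedy, Screwball Comedy - but not "dark hero"
    }
}
```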
<h3 id="case-4join-csv-files-with-database-table">Case #4 - join csv files with database table</h3>
<p>Let&#x2019;s do something more exciting now. We will create a report that filters movies by user details taken from a relational database.</p>
<p>You probably noticed that the provided dataset does not contain a user file, but the rating and tag files have userId fields. We can assume that we have a <code>users</code> table in our relational database, and it needs to be joined with the data from the CSV files in order to filter by users&apos; data. Well, it&#x2019;s a piece of cake for Zentadata.</p>
<p>For demonstration purposes, I have created a simple table in Postgres and added a couple of records.</p>
<pre><code>CREATE TABLE users
(
    id INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name VARCHAR,
    country  VARCHAR
);

INSERT INTO users (id, first_name, last_name, country)
VALUES (1, &apos;John&apos;, &apos;Dow&apos;, &apos;US&apos;),
       (2, &apos;Nuria&apos;, &apos;Fabricio&apos;, &apos;US&apos;),
       (3, &apos;Itzel&apos;, &apos;Langosh&apos;, &apos;US&apos;),
       (4, &apos;Lilliana&apos;, &apos;Larkin&apos;, &apos;PL&apos;),
       (5, &apos;Walker&apos;, &apos;Quigley&apos;, &apos;PL&apos;);

</code></pre>
<p>Now let&#x2019;s join that Postgres table with the CSV files and create a report that keeps only users from the <code>US</code>.</p>
<pre><code>DataFrame users = zen.read(&quot;postgres&quot;).from(&quot;users&quot;);
users.execute().show();

DataFrame report = movies
    .join(ratings, ratings.col(&quot;movieId&quot;).equalTo(movies.col(&quot;movieId&quot;)))
    .join(users, ratings.col(&quot;userId&quot;).equalTo(users.col(&quot;id&quot;)))
    .select(
        movies.col(&quot;movieId&quot;),
        movies.col(&quot;title&quot;),
        ratings.col(&quot;rating&quot;),
        users.col(&quot;id&quot;)
    )
    .where(col(&quot;country&quot;).equalTo(lit(&quot;US&quot;)))
    .groupBy(list(col(&quot;movieId&quot;), col(&quot;title&quot;)), avg(col(&quot;rating&quot;)).as(&quot;average rating&quot;));

report.limit(5).execute().show();

OUTPUT:
+--+----------+---------+-------+
|id|first_name|last_name|country|
+--+----------+---------+-------+
|1 |John      |Dow      |US     |
|2 |Nuria     |Fabricio |US     |
|3 |Itzel     |Langosh  |US     |
|4 |Lilliana  |Larkin   |PL     |
|5 |Walker    |Quigley  |PL     |
+--+----------+---------+-------+
+-------+------------------------------------------------------+--------------+
|movieId|title                                                 |average rating|
+-------+------------------------------------------------------+--------------+
|4027   |O Brother, Where Art Thou? (2000)                     |4.0           |
|7153   |Lord of the Rings: The Return of the King, The (2003) |5.0           |
|2951   |Fistful of Dollars, A (Per un pugno di dollari) (1964)|4.0           |
|337    |Whats Eating Gilbert Grape (1993)                     |3.25          |
|2797   |Big (1988)                                            |4.0           |
+-------+------------------------------------------------------+--------------+
</code></pre>
<p>As you can see, from our perspective there is no difference in the source of the data: for us it&apos;s a DataFrame that can be operated on in the same manner.</p>
<h3 id="case-5saving-report-to-the-relational-database">Case #5 - saving report to the relational database</h3>
<p>Now our boss wants the report to be a separate table in the database. That sounds like a huge amount of work, but with the help of the Zentadata platform you can do it in just a few minutes and, more importantly, in just a couple of lines of code.</p>
<pre><code>report.write(&quot;postgres&quot;)
   .option(SaveMode.KEY, SaveMode.OVERWRITE)
   .to(&quot;report&quot;);
</code></pre>
<p>And that&#x2019;s it. Let&#x2019;s check out Postgres: you will see that a new table <code>report</code> has been created. It has the same structure as our select statement, and it also contains all the data.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/data-studio-db-table.png" class="kg-image" alt="Data analytics for everyone with Zentadata Data Studio" loading="lazy" width="990" height="471" srcset="https://blog.zentaly.com/content/images/size/w600/2022/10/data-studio-db-table.png 600w, https://blog.zentaly.com/content/images/2022/10/data-studio-db-table.png 990w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><h1 id="summary">Summary</h1>
<p>We went through a few cases that should give you an understanding of the product and some ideas on how you can use it in your daily work. The Zentadata platform and the tools shipped with it, like Data Studio, can significantly improve your productivity and make your life easier.</p>
<p>In the next article we will discuss how to work with big data from Java Spring Boot applications.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Quick start guide - Zentadata Developer Edition]]></title><description><![CDATA[Zentadata Developer Edition is the simplest solution to start data analysis on your local machine right away. It is totally free and available for everyone.]]></description><link>https://blog.zentaly.com/how-to-install-and-run-zentadata-developer-edition/</link><guid isPermaLink="false">6326e56e80279a54a6ba6190</guid><category><![CDATA[big data]]></category><category><![CDATA[zentadata]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Wed, 26 Oct 2022 15:03:44 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/andy-hermawan-bVBvv5xlX3g-unsplash-1.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="overview">Overview</h1>
<img src="https://blog.zentaly.com/content/images/2022/10/andy-hermawan-bVBvv5xlX3g-unsplash-1.jpg" alt="Quick start guide - Zentadata Developer Edition"><p>Zentadata Developer Edition is the simplest solution to start data analysis on your local machine right away. It is totally free and available for everyone.</p>
<p>Zentadata Developer Edition consists of 2 modules:</p>
<ol>
<li><strong><code>Data Studio</code></strong> - a data analytics IDE where you actually work with data</li>
<li><strong><code>Developer Cluster</code></strong> - a data processing engine shipped as a Docker container</li>
</ol>
<h4 id="developer-edition-features">Developer Edition Features</h4>
<ul>
<li>Data formats: JSON, CSV, XML, Parquet</li>
<li>Data sources: Local File System, PostgresDB</li>
</ul>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/quick-start-diagram.svg" class="kg-image" alt="Quick start guide - Zentadata Developer Edition" loading="lazy" width="493" height="493"></figure><!--kg-card-begin: markdown--><h1 id="installation">Installation</h1>
<h4 id="prerequisites">Prerequisites</h4>
<ul>
<li>Docker installed on your local machine</li>
<li>Minimum 1GB of RAM for Docker container</li>
<li>Get free Developer Edition license key at <a href="https://account.zentadata.com">https://account.zentadata.com</a></li>
</ul>
<h4 id="install-data-studio">Install Data Studio</h4>
<p>You can download and install Data Studio from this <a href="https://zentadata.com/get-started">link&#x1F4E6;</a>.</p>
<p>Data Studio connects to the Developer Cluster to execute user-defined data jobs. By default it is configured to connect to the local Developer Cluster at <code>http://localhost:8090</code>, which is good enough for our use case.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/Screenshot-2022-09-18-at-12.21.31.png" class="kg-image" alt="Quick start guide - Zentadata Developer Edition" loading="lazy" width="1762" height="928" srcset="https://blog.zentaly.com/content/images/size/w600/2022/10/Screenshot-2022-09-18-at-12.21.31.png 600w, https://blog.zentaly.com/content/images/size/w1000/2022/10/Screenshot-2022-09-18-at-12.21.31.png 1000w, https://blog.zentaly.com/content/images/size/w1600/2022/10/Screenshot-2022-09-18-at-12.21.31.png 1600w, https://blog.zentaly.com/content/images/2022/10/Screenshot-2022-09-18-at-12.21.31.png 1762w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><h4 id="install-developer-cluster">Install Developer Cluster</h4>
<p>Download docker image and start container:</p>
<pre><code>docker pull zentadata/zentadata-dev:latest

docker run -di -p 8090:8090 --name zentadata-dev \
--mount type=bind,source=/Users/&lt;user_name&gt;/datasets,target=/datasets \
-e POSTGRES_URL=jdbc:postgresql://host.docker.internal:5432/postgres \
-e POSTGRES_USERNAME=postgres \
-e POSTGRES_PASSWORD=********* \
-e ZENTADATA_LICENSE_KEY=****** \
zentadata/zentadata-dev:latest
</code></pre>
<p><strong>Note:</strong> if you are running Docker under Linux, you might need to add one extra parameter: <code>--add-host=host.docker.internal:host-gateway</code>. Otherwise the container will not be able to resolve the address <code>host.docker.internal</code>.</p>
<p>This will start a docker container running the Developer Cluster, but most probably you will need to adjust the configuration for your needs. See the next chapter on how to configure each parameter.</p>
<h6 id="mount-local-folder-to-container-filesystem">Mount local folder to container filesystem</h6>
<p>Please notice how we use the <code>--mount</code> parameter. To process data files from your local file system (/Users/Alex/datasets), you need to mount it into the Docker container filesystem (/datasets) so it is available to the data engine.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card"><img src="https://blog.zentaly.com/content/images/2022/10/quick-start-mounts-1.svg" class="kg-image" alt="Quick start guide - Zentadata Developer Edition" loading="lazy" width="733" height="293"></figure><!--kg-card-begin: markdown--><h6 id="docker-container-configuration">Docker container configuration</h6>
<p>There are multiple parameters available to configure Developer Cluster running in docker container via environment variables.</p>
<p><strong>Note:</strong> if you want to connect to a PostgresDB running on localhost, you need to set the address to <code>host.docker.internal</code> - it is a Docker alias for connecting from within a container to the host&#x2019;s <code>localhost</code>.</p>
<table>
<thead>
<tr>
<th>Env variable</th>
<th>Default value</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>POSTGRES_URL</td>
<td>jdbc:postgresql://host.docker.internal:5432/postgres</td>
<td>PostgresDB connection string</td>
</tr>
<tr>
<td>POSTGRES_USERNAME</td>
<td>postgres</td>
<td>PostgresDB username</td>
</tr>
<tr>
<td>POSTGRES_PASSWORD</td>
<td>postgres</td>
<td>PostgresDB password</td>
</tr>
<tr>
<td>MAX_HEAP_SIZE</td>
<td>1g</td>
<td>Max memory allocated for Developer Cluster</td>
</tr>
<tr>
<td>ZENTADATA_LICENSE_KEY</td>
<td></td>
<td>Developer License Key you can obtain registering at <a href="https://account.zentadata.com">https://account.zentadata.com</a></td>
</tr>
</tbody>
</table>
<h1 id="simple-app">Simple app</h1>
<p>Now that we have everything in place, let&#x2019;s run Data Studio and execute some simple queries.</p>
<h4 id="read-postgresdb">Read PostgresDB</h4>
<p>In my local Postgres database I have a table <strong><code>users</code></strong> defined as follows:</p>
<pre><code>CREATE TABLE users
(
    id INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name VARCHAR,
    country  VARCHAR
);

INSERT INTO users (id, first_name, last_name, country)
VALUES (1, &apos;John&apos;, &apos;Dow&apos;, &apos;US&apos;),
       (2, &apos;Nuria&apos;, &apos;Fabricio&apos;, &apos;US&apos;),
       (3, &apos;Itzel&apos;, &apos;Langosh&apos;, &apos;US&apos;),
       (4, &apos;Lilliana&apos;, &apos;Larkin&apos;, &apos;PL&apos;),
       (5, &apos;Walker&apos;, &apos;Quigley&apos;, &apos;PL&apos;);
</code></pre>
<p>Let&#x2019;s copy-paste the following code into Data Studio and execute it (hotkey F9):</p>
<pre><code>zen
    .read(&quot;postgres&quot;)
    .from(&quot;users&quot;)
    .execute().show();

EXPECTED OUTPUT:
+--+----------+---------+-------+
|id|first_name|last_name|country|
+--+----------+---------+-------+
|1 |John      |Dow      |US     |
|2 |Nuria     |Fabricio |US     |
|3 |Itzel     |Langosh  |US     |
|4 |Lilliana  |Larkin   |PL     |
|5 |Walker    |Quigley  |PL     |
+--+----------+---------+-------+
</code></pre>
<h4 id="read-json-files">Read JSON files</h4>
<p>On my local filesystem I have a file <code>/Users/Alex/data-samples/orders.json</code> with the following content:</p>
<pre><code>[
  {
    &quot;order_id&quot;: &quot;1&quot;,
    &quot;date&quot;: &quot;2020101&quot;,
    &quot;items&quot;: [{
        &quot;name&quot;: &quot;ipad&quot;,
        &quot;price&quot;: 449.99
    }]
  },
  {
    &quot;order_id&quot;: &quot;2&quot;,
    &quot;date&quot;: &quot;2020101&quot;,
    &quot;items&quot;: [{
        &quot;name&quot;: &quot;imac 27&quot;,
        &quot;price&quot;: 1700
    }]
  }
]
</code></pre>
<p>Let&#x2019;s read this JSON file with Data Studio and print its content:</p>
<pre><code>zen
    .read(&quot;localfs&quot;)
    .format(DataFormat.JSON)
    .option(JsonOpts.IS_MULTILINE, &quot;true&quot;)
    .from(&quot;file:/datasets/orders.json&quot;)
    .execute().show();  

EXPECTED OUTPUT:
+-------+------------------+--------+
|date   |items             |order_id|
+-------+------------------+--------+
|2020101|[[ipad,449.99]]   |1       |
|2020101|[[imac 27,1700.0]]|2       |
+-------+------------------+--------+

</code></pre>
<p><strong>Note:</strong> Please notice how we set the path to the file relative to the container&#x2019;s mounted volume: <code>&quot;file:/datasets/orders.json&quot;</code></p>
<h1 id="summary">Summary</h1>
<p>We have installed <strong>Zentadata Developer Edition</strong> and successfully executed simple queries.</p>
<p>Of course, the true data analytics power comes with more advanced queries, which we will show in upcoming blog posts.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How Data Driven Enterprise makes business effective?]]></title><description><![CDATA[This article explains the anatomy of the data driven approach and how it boosts enterprise performance so much.]]></description><link>https://blog.zentaly.com/how-data-driven-enterprise-makes-business-effective/</link><guid isPermaLink="false">634a72f580279a54a6ba641a</guid><category><![CDATA[big data]]></category><category><![CDATA[enterprise]]></category><category><![CDATA[zentadata]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Wed, 19 Oct 2022 05:57:13 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/bilge-tekin-GiATUqz4NYY-unsplash-3.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h1 id="intro">Intro</h1>
<img src="https://blog.zentaly.com/content/images/2022/10/bilge-tekin-GiATUqz4NYY-unsplash-3.jpg" alt="How Data Driven Enterprise makes business effective?"><p>There is a lot of information about the benefits of a Data Driven Enterprise, with the most prominent characteristics being the following:</p>
<ol>
<li>Leverage data to prove multiple theories and choose the best one</li>
<li>Continuously research business data to find new opportunities</li>
<li>Seamless integration of ML into business processes to gain extra revenue</li>
<li>Innovative data techniques resolve challenges in hours, days or weeks</li>
</ol>
<p>While for some people the mentioned benefits might look quite obvious, understanding how they are made possible can create an even stronger mind shift towards a data driven culture.</p>
<p>So in this article I would like to explain the anatomy of the data driven approach and how exactly it makes it possible to boost enterprise performance so much.</p>
<h1 id="enterprise-effectiveness-breakdown">Enterprise effectiveness breakdown</h1>
<p>Today any modern enterprise consists of multiple departments. Each department has a variety of business processes to follow according to its goals. From an organisational perspective, a department consists of multiple teams and employees.</p>
<p>To make the entire enterprise work more efficiently, organisations declare a tremendous number of rules and policies (aka business processes), which could be generalised as follows:</p>
<ol>
<li>Enterprise is effective when all of its departments are following their <strong>business processes</strong> properly</li>
<li>Enterprise is effective when departments are able to <strong>collaborate</strong> effectively</li>
<li>Enterprise is effective when it is able to deliver competitive <strong>product</strong> to the end consumer</li>
</ol>
<p><strong>Note:</strong> <em>in this schema product quality is a derivative of processes and collaboration.</em></p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/10/dde-effectiveness-stage-1-3.svg" class="kg-image" alt="How Data Driven Enterprise makes business effective?" loading="lazy" width="626" height="386"><figcaption>Enterprise effectiveness schema</figcaption></figure><!--kg-card-begin: markdown--><p>Based on this definition we can highlight 3 core KPIs:</p>
<ol>
<li>Effectiveness of business processes</li>
<li>Effectiveness of cross-team collaboration</li>
<li>Effectiveness of product delivery</li>
</ol>
<h1 id="enterprise-data-maturity-stages">Enterprise data maturity stages</h1>
<p>The maturity of data processing practices varies significantly from company to company. Some companies might not be aware of the value of data at all, while others may have immense experience growing a data culture among their employees.</p>
<p>Based on the level of penetration of data practices into organisational processes, I would like to highlight 3 well distinguished stages.</p>
<p><strong>Note:</strong> <em>please remember that there is no clear separation between these 3 stages. In fact, continuous improvement of data practices leads from one stage to another. The ultimate goal of a Data Driven Enterprise is not to apply strict rules, but to follow the right direction by practicing a data culture, where decisions and actions are based on real use cases for each specific enterprise.</em></p>
<p>So here are the 3 most prominent stages of data maturity practices within an organisation. We will review each stage separately and see how it affects the enterprise effectiveness KPIs.</p>
<ul>
<li><strong>Stage 1</strong> - Automation</li>
<li><strong>Stage 2</strong> - Data Awareness</li>
<li><strong>Stage 3</strong> - Data Driven</li>
</ul>
<h3 id="stage-1automation">Stage 1 - Automation</h3>
<p>This is the starting point, where a company makes its initial efforts towards automation and/or digitalisation of its business. In particular, it could address the following scenarios: standardising a specific business case, automating a manual workflow, decreasing the human factor, etc. What is important to understand here is that the goal of Automation is to address one specific business problem in the most straightforward way possible.</p>
<p>At this stage the organisation&apos;s departments do have data storages, but they take the form of data silos: each focuses on a single service and is strongly isolated from other systems, even within the scope of the parent department.</p>
<p>Typical achievements of Automation in the scope of the core KPIs:</p>
<table>
<thead>
<tr>
<th>Core KPI</th>
<th>Achievement</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Minimising the error rate and limiting the role of the human factor by automating business processes</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Decreasing collaboration gaps through well defined workflows</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Improving the predictability of the delivery process by mitigating the most common risks addressed by the previous 2 KPIs</td>
</tr>
</tbody>
</table>
<h3 id="stage-2data-awareness">Stage 2 - Data Awareness</h3>
<p>The next stage after Automation is Data Awareness. This stage is characterised by initial efforts to separate data from business processes into a clean and reusable form. It is often implemented as a data warehouse or some other kind of centralised data storage.</p>
<p>This stage helps to implement some quantitative improvements of business processes by reusing existing data sets. Having centralised data storages also helps to cut costs and maintenance efforts.</p>
<p><strong>Note:</strong> <em>at this point enterprise decision makers are responsible for defining business processes which, being executed properly by the IT department, should deliver business results. It&#x2019;s important to note that the first 2 stages, Automation and Data Awareness, are not intended to change existing organisational processes but rather to improve and optimise them.</em></p>
<p>Nevertheless, this stage is characterised by a clear understanding of data value and its ability to make an impact. It is the moment when the first data leaders arise with the idea of shortening the product quality improvement turnaround. There are first signs that significant product improvements could be made at the team level.</p>
<p>Data Awareness impact on the core KPIs:</p>
<table>
<thead>
<tr>
<th>Core KPI</th>
<th>Achievement</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Improved time to market</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Faster collaboration through data sharing</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Quantitative improvement based on the sum of the previous two KPIs</td>
</tr>
</tbody>
</table>
<h3 id="stage-3data-driven-enterprise">Stage 3 - Data Driven Enterprise</h3>
<p>This is the highest level of data practices in our list.</p>
<p>At this level, the amount of available information and org-wide data access make it possible for most employees to answer pretty much any question about their company&apos;s business. At the same time it is equally important that not only the infrastructure but also the people are mature enough to extract value from data in their daily duties.</p>
<p>The most successful employees emerge as data leaders inside cross-functional teams focusing on tremendous product improvement. And not only improvement - they also find space for new products which will be able to conquer the market.</p>
<p>While at the previous stages the organisation was focused on the quality of processes and collaboration, this stage puts people at the centre as the key factor in creating outstanding product value. At the same time, processes are intended to empower people and provide the right data context required for optimal decision making in every specific case.</p>
<p>So in this new organisation, business processes and collaboration are no longer the driving factor. They become just extra data context to support the main goal - enabling people to create value through product innovation.</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.zentaly.com/content/images/2022/10/dde-effectiveness-stage-2-6.svg" class="kg-image" alt="How Data Driven Enterprise makes business effective?" loading="lazy" width="533" height="533"><figcaption>Stage 3 - Data Driven Enterprise</figcaption></figure><!--kg-card-begin: markdown--><p>Data Driven approach impact on the core KPIs:</p>
<table>
<thead>
<tr>
<th>Core KPI</th>
<th>Achievement</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Improved relevance by making correct decisions with the help of rich data context</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Department level collaboration is replaced by team level collaboration</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Cross-functional teams powered by org-wide shared data make tremendous impact directly to the product development life cycle</td>
</tr>
</tbody>
</table>
<h1 id="summary">Summary</h1>
<p>Let&apos;s summarise how a typical enterprise benefits from the data driven approach in a short, easy to remember table.</p>
<table>
<thead>
<tr>
<th></th>
<th>Automation</th>
<th>Data Awareness</th>
<th>Data Driven</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Processes</strong></td>
<td>Improves quality</td>
<td>Improves time to market</td>
<td>Optimised to focus on real problems based on rich data context</td>
</tr>
<tr>
<td><strong>Collaboration</strong></td>
<td>Decreases gaps</td>
<td>Faster collaboration through data sharing</td>
<td>Department level collaboration moved to team level</td>
</tr>
<tr>
<td><strong>Product</strong></td>
<td>Improves predictability, decreases human factor</td>
<td>Quantitative improvements by X%</td>
<td>Cross-functional teams of experts directly impact product development life cycle</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How to deploy Spark job to cluster]]></title><description><![CDATA[<p>In the <a href="https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark">previous post</a> we have created simple Spark job and executed it locally on a single machine. Nevertheless the main goal of Spark framework is to utilize cluster resources consisting of multiple servers and in this way increase data processing throughput. In the real life the amount of data</p>]]></description><link>https://blog.zentaly.com/how-to-deploy-spark-job-to-cluster/</link><guid isPermaLink="false">5f9db94d8ccd9c06dfc69fb7</guid><category><![CDATA[big data]]></category><category><![CDATA[spark]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Sun, 01 Nov 2020 10:39:36 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/mihail-tregubov-WYTMdCBrBok-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.zentaly.com/content/images/2022/10/mihail-tregubov-WYTMdCBrBok-unsplash.jpg" alt="How to deploy Spark job to cluster"><p>In the <a href="https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark">previous post</a> we created a simple Spark job and executed it locally on a single machine. Nevertheless, the main goal of the Spark framework is to utilize the resources of a cluster consisting of multiple servers and thereby increase data processing throughput. In real life the amount of data processed by a production grade cluster is measured at least in terabytes.</p><p>The big advantage of a Spark application is that it is ready for distributed cluster execution by design. 
Still, there are a few details which should be taken into account given the distributed nature of the application.</p><p>Let&apos;s revisit our application and highlight 3 main concerns which should be addressed to execute our job on a Spark cluster:</p><pre><code>import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import static org.apache.spark.sql.functions.*;

public class ClusterDeployment {
  public static void main(String[] args) {
    // TODO Concern 1: how to setup master?
    SparkConf conf = new SparkConf().setMaster(&quot;local[*]&quot;);
    SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();

    Dataset&lt;Row&gt; dataframe = spark.read()
        .option(&quot;multiline&quot;, &quot;true&quot;)
        // TODO Concern 2: how to read file?
        .json(&quot;invoice.json&quot;)
        .select(
            col(&quot;item&quot;),
            expr(&quot;price * quantity&quot;).as(&quot;item_total&quot;));

    // TODO Concern 3: what to do with output?
    dataframe.show();
  }
}</code></pre><h4 id="concern-1-setup-master-server">Concern 1 - Setup master server</h4><p>In our sample we set the master to <code>local[*]</code>, which means: run the job locally using all available CPUs. For cluster deployment the <code>setMaster</code> statement should be omitted, as hardware resources will be automatically managed by the cluster environment.</p><h4 id="concern-2-how-to-read-source-file">Concern 2 - How to read source file?</h4><p>Our naive test used the local file <code>invoice.json</code> as a datasource. But as you already know, a real big data application is expected to read files which could be too large to be stored on your local hard disk drive.</p><p>To store such big files you need a very special file system - <strong>HDFS</strong> (Hadoop Distributed File System). It is deployed on top of multiple servers and stores data in a distributed way. Alternatively you can use <strong>Amazon S3</strong> storage, which is very similar to HDFS but hosted in the cloud.</p><p>In our case what matters is to set the proper path to the file system where the actual data is stored, e.g.:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>File system</th>
<th>File path sample</th>
</tr>
</thead>
<tbody>
<tr>
<td>HDFS</td>
<td>hdfs://namenode:8020/user/data/invoice.json</td>
</tr>
<tr>
<td>Amazon S3</td>
<td>s3a://bucket-name/user/data/invoice.json</td>
</tr>
</tbody>
</table>
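<p>If you choose Amazon S3 as the storage, the cluster will also need credentials to access the bucket. As a sketch (the property names come from the Hadoop s3a connector; the values below are placeholders, not real keys), such credentials can be supplied as Spark configuration, e.g. in <code>spark-defaults.conf</code> or via <code>--conf</code> arguments to spark-submit:</p>

```
# sketch: forward S3 credentials to the s3a connector (placeholder values)
spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
```

<p>The <code>spark.hadoop.</code> prefix tells Spark to pass the remainder of the property name down to the underlying Hadoop configuration.</p>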
<!--kg-card-end: markdown--><p>A production grade Spark cluster supports HDFS and Amazon S3 by default, so setting the correct path is all you need.</p><h4 id="concern-3-what-to-do-with-output">Concern 3 - What to do with output?</h4><p>In our test app we used the <code>show()</code> method to print the first 20 rows of the dataframe to standard output. That&apos;s fine for development purposes, but for a real application we need something more practical.</p><p>There are 2 main options to consider: <code>collectAsList</code> the data or <code>write</code> it to external storage.</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Method</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>collectAsList</td>
<td>Returns a Java list that contains all rows in this Dataset. Running collect requires moving all the data into the application&apos;s driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.</td>
</tr>
<tr>
<td>write</td>
<td>Interface for saving the content of the non-streaming Dataset out into external storage.</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>One important aspect of <code>collectAsList</code> is that it returns all data to your local JVM instance. Since the dataframe might contain gigabytes of data, executing this operation could crash your application with an OutOfMemoryError. Still, this operation can be safe if you apply heavy filtering first and ensure that the underlying result set is small enough to be processed in a single JVM.</p><p>Nevertheless, most of the time you will want to write the results of a data transformation to persistent storage. For example, you can write the result set back to HDFS in JSON format as follows:</p><pre><code>dataframe.write()
  .json(&quot;hdfs:///user/data/invoice_report&quot;);</code></pre><p>So here is the final version of our application with all concerns adressed:</p><pre><code>import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import static org.apache.spark.sql.functions.*;

public class ClusterDeployment {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf();
    SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();

    Configuration hconf = spark.sparkContext().hadoopConfiguration();
    hconf.set(&quot;fs.defaultFS&quot;, &quot;hdfs://namenode:8020&quot;);
    hconf.set(&quot;dfs.client.use.datanode.hostname&quot;, &quot;true&quot;);

    Dataset&lt;Row&gt; dataframe = spark.read()
        .option(&quot;multiline&quot;, &quot;true&quot;)
        .json(&quot;hdfs:///user/data/invoice.json&quot;)
        .select(
            col(&quot;item&quot;),
            expr(&quot;price * quantity&quot;).as(&quot;item_total&quot;));

    dataframe.write().json(&quot;hdfs:///user/data/invoice_report&quot;);
  }
}
</code></pre><h3 id="deployment">Deployment</h3><p>To test cluster deployment, you will need a real Spark cluster up and running.</p><p>Once your application is ready, it&apos;s time to build a jar file and deploy it to the cluster:</p><pre><code>~/spark/bin/spark-submit \
--master spark://spark-master:6066 \
--deploy-mode client \
--class ClusterDeployment \
./java-spark-job-0.1-SNAPSHOT.jar</code></pre><p>Let&apos;s go through the most important parameters one by one:</p><ul><li><code>--master</code> is the master URL of the Spark cluster</li><li><code>--deploy-mode</code> defines where the Spark driver is hosted: <code>client</code> means the driver runs on the same machine where spark-submit is executed, <code>cluster</code> deploys the driver to one of the cluster workers</li><li><code>--class</code> is the entry point class of the Spark job</li><li>the last parameter should be the path to the application jar file</li></ul><h3 id="summary">Summary</h3><p>We have prepared our application to process data in a production environment and successfully deployed it to the Spark cluster.</p>]]></content:encoded></item><item><title><![CDATA[How to write Big Data application with Java and Spark]]></title><description><![CDATA[<p>Spark is modern Big Data framework to build highly scalable and feature rich data transformation pipelines.</p><p>Spark&apos;s main advantages are simplicity and high performance compared to its predecessor - Hadoop. 
You can write Spark applications in main 3 languages: Scala, Java and Python.</p><p>In this guide I will</p>]]></description><link>https://blog.zentaly.com/how-to-write-big-data-application-with-java-and-spark/</link><guid isPermaLink="false">5f947b7d8ccd9c06dfc69ec7</guid><category><![CDATA[big data]]></category><category><![CDATA[java]]></category><category><![CDATA[spark]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Sun, 25 Oct 2020 10:09:40 GMT</pubDate><media:content url="https://blog.zentaly.com/content/images/2022/10/apache-spark-with-java-3.svg" medium="image"/><content:encoded><![CDATA[<img src="https://blog.zentaly.com/content/images/2022/10/apache-spark-with-java-3.svg" alt="How to write Big Data application with Java and Spark"><p>Spark is a modern Big Data framework for building highly scalable and feature-rich data transformation pipelines.</p><p>Spark&apos;s main advantages are simplicity and high performance compared to its predecessor - Hadoop. You can write Spark applications in 3 main languages: Scala, Java and Python.</p><p>In this guide I will show you how to write a simple Spark application in Java.</p><h3 id="writing-spark-application">Writing Spark application</h3><p>To create a Spark job, as a first step you will need to add the Spark library dependency to your Maven project:</p><pre><code>&lt;dependency&gt;
    &lt;groupId&gt;org.apache.spark&lt;/groupId&gt;
    &lt;artifactId&gt;spark-sql_2.11&lt;/artifactId&gt;
    &lt;version&gt;2.4.7&lt;/version&gt;
&lt;/dependency&gt;</code></pre><p>In this guide we will try to read and transform list of invoices provided in <strong>invoice.json</strong> file:</p><pre><code>[
  {
    &quot;item&quot;: &quot;iphone X&quot;,
    &quot;price&quot;: 1000.00,
    &quot;quantity&quot;: 2
  },
  {
    &quot;item&quot;: &quot;airpods&quot;,
    &quot;price&quot;: 150.00,
    &quot;quantity&quot;: 1
  },
  {
    &quot;item&quot;: &quot;macbook pro 13&quot;,
    &quot;price&quot;: 1500.00,
    &quot;quantity&quot;: 3
  }
]</code></pre><p>Now we are ready to start writing application</p><pre><code>import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.SparkConf;

public class SparkJob {
  public static void main(String[] args) {
    // setup spark session
    SparkConf conf = new SparkConf().setMaster(&quot;local[*]&quot;);
    SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();

    // read original json file
    Dataset&lt;Row&gt; dataframe = spark.read()
        .option(&quot;multiline&quot;, &quot;true&quot;)
        .json(&quot;invoice.json&quot;);
    dataframe.show();
  }
}</code></pre><p>Here we have created a SparkSession object, which is the entry point for building data jobs. Then we created a DataFrame, represented by the Java type <code>Dataset&lt;Row&gt;</code>.</p><h3 id="dataframe">DataFrame</h3><p>A DataFrame is a distributed data structure representing underlying data, similar to a table in a relational database. Executing <code>dataframe.show()</code> generates the following output:</p><pre><code>+--------------+------+--------+
|          item| price|quantity|
+--------------+------+--------+
|      iphone X|1000.0|       2|
|       airpods| 150.0|       1|
|macbook pro 13|1500.0|       3|
+--------------+------+--------+</code></pre><p>Beside data representation, DataFrame provides API methods to transform the underlying data, such as:</p><ul><li>select</li><li>where</li><li>sort</li><li>groupBy</li><li>etc.</li></ul><p>Now let&apos;s try to do some data transformations.</p><h4 id="task-1-filter-items-based-on-a-price">Task 1 - Filter items based on a price</h4><p>Imagine that we need to build an invoice report which includes expensive items only, with a price &gt;= $1000. It can be implemented as follows:</p><pre><code>// filter items with a price greater than or equal to 1000
Dataset&lt;Row&gt; expensiveItems = dataframe
    .filter(&quot;price &gt;= 1000&quot;);
expensiveItems.show();

+--------------+------+--------+
|          item| price|quantity|
+--------------+------+--------+
|      iphone X|1000.0|       2|
|macbook pro 13|1500.0|       3|
+--------------+------+--------+</code></pre><p>Alternatively this code could be improved by applying statically typed functions:</p><pre><code>import static org.apache.spark.sql.functions.*;

Dataset&lt;Row&gt; expensiveItems = dataframe
        .filter(col(&quot;price&quot;).geq(1000));</code></pre><p>Here we reference the <code>price</code> column and apply the function <code>geq</code> <em>(greater than or equal to an expression)</em> with the parameter 1000. It produces exactly the same result but leaves a little less room for a typo.</p><h4 id="task-2-calculate-total-price-for-each-item">Task 2 - Calculate total price for each item</h4><p>Having the item price and item quantity columns, we can calculate the total amount for each item as follows:</p><pre><code>// multiply item price by item quantity in invoice
Dataset&lt;Row&gt; sumPerRow = dataframe
    .select(
        col(&quot;*&quot;),
        expr(&quot;price * quantity&quot;).as(&quot;sum_per_row&quot;));
sumPerRow.show();
    
+--------------+------+--------+-----------+
|          item| price|quantity|sum_per_row|
+--------------+------+--------+-----------+
|      iphone X|1000.0|       2|     2000.0|
|       airpods| 150.0|       1|      150.0|
|macbook pro 13|1500.0|       3|     4500.0|
+--------------+------+--------+-----------+</code></pre><p>And statically typed alternative:</p><pre><code>Dataset&lt;Row&gt; sumPerRow = dataframe
    .select(
        col(&quot;*&quot;),
        col(&quot;price&quot;).multiply(col(&quot;quantity&quot;)).as(&quot;sum_per_row&quot;));</code></pre><h4 id="task-3-calculate-total-amount-for-all-items-in-invoice">Task 3 - Calculate total amount for all items in invoice</h4><p>As a last task, let&apos;s aggregate all rows and calculate the total invoice amount:</p><pre><code>// calculate total amount of all items
Dataset&lt;Row&gt; total = dataframe
    .select(sum(expr(&quot;price * quantity&quot;)).as(&quot;total&quot;));
total.show();
    
+------+
| total|
+------+
|6650.0|
+------+</code></pre><h3 id="summary">Summary</h3><p>We have written a Java application with the Spark library and created simple data transformation jobs. You can run the application locally and see the results in place.</p><p>The great power of Spark is that you can deploy exactly the same application to a cluster of multiple nodes and scale the performance of your application as much as needed to process an arbitrary amount of data.</p>]]></content:encoded></item><item><title><![CDATA[How to run Postgres in Docker]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/08/postgres_in_docker.png" alt loading="lazy"></p>
<p>If you need to run Postgres database for development needs you can just install it manually on local machine. But you should be aware that this procedure will require some level of knowledge about Postgres installation and maintaining procedures.</p>
<p>On the other hand there is much more simple way -</p>]]></description><link>https://blog.zentaly.com/how-to-run-postgres-in-docker/</link><guid isPermaLink="false">5c4e286598d81473d8c36fe5</guid><category><![CDATA[docker]]></category><category><![CDATA[postgres]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Thu, 18 Aug 2016 21:34:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/08/postgres_in_docker.png" alt loading="lazy"></p>
<p>If you need to run a Postgres database for development needs you can just install it manually on your local machine. But you should be aware that this procedure will require some level of knowledge about Postgres installation and maintenance procedures.</p>
<p>On the other hand there is a much simpler way - run the Postgres database inside a Docker container. Docker effectively encapsulates deployment, administration and configuration procedures. So if you want to deploy Postgres locally with minimum effort, Docker is the best choice. All you need to do is start a pre-built Docker container and you will have a Postgres database ready for your service.</p>
<p>Here is my GitHub repo for building a Docker container with an embedded Postgres database: <a href="https://github.com/alexdik/dockerized-postgres">https://github.com/alexdik/dockerized-postgres</a>.</p>
<p>If you don&apos;t have Docker yet, you can download and install it from official site: <a href="https://docs.docker.com/install/">https://docs.docker.com/install/</a></p>
<p>To build the Docker container you will need to run 3 simple commands in a terminal:</p>
<pre><code>git clone https://github.com/alexd84/dockerized-postgres.git
docker build -t postgres dockerized-postgres
docker run -di -p 5432:5432 postgres
</code></pre>
<p>Here you clone the project from GitHub, build the container and launch it.</p>
<p>So now you can verify the Postgres database instance running on your local machine:</p>
<pre><code>telnet 127.0.0.1 5432
Trying 127.0.0.1...
Connected to localhost.
Escape character is &apos;^]&apos;.
</code></pre>
<p>To authorise into the Postgres database, the following default credentials should be used: <strong>postgres</strong>/<strong>postgres</strong>.</p>
<h4 id="howisitworkingforthosewhoactuallycare">How does it work? (for those who actually care)</h4>
<p>The repository you just cloned contains 3 files:</p>
<ul>
<li>Dockerfile</li>
<li>entrypoint.sh</li>
<li>pg_hba.conf</li>
</ul>
<h6 id="dockerfile">Dockerfile</h6>
<p>This is the main file, which instructs Docker to create a new container based on an Ubuntu image, download the Postgres distribution, configure it and hand over to the entrypoint.sh script:</p>
<pre><code>FROM ubuntu:14.04

RUN apt-get update -y
RUN apt-get install -y wget

RUN sudo sh -c &apos;echo &quot;deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main 9.5&quot; &gt;&gt; /etc/apt/sources.list.d/pgdg.list&apos;
RUN wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | sudo apt-key add -

RUN apt-get update -y
RUN apt-get install postgresql-9.5 postgresql-contrib-9.5 -y

RUN mv /etc/postgresql/9.5/main/pg_hba.conf /etc/postgresql/9.5/main/pg_hba.conf.backup
COPY pg_hba.conf /etc/postgresql/9.5/main/pg_hba.conf
RUN echo &quot;listen_addresses = &apos;*&apos;&quot; &gt;&gt; /etc/postgresql/9.5/main/postgresql.conf

EXPOSE 5432

COPY entrypoint.sh /
ENTRYPOINT sudo /entrypoint.sh
</code></pre>
<h6 id="entrypointsh">entrypoint.sh</h6>
<p>Here we start the Postgres database and set the default login and password for access:</p>
<pre><code>sudo service postgresql start
sudo -u postgres psql -c &quot;ALTER USER postgres WITH PASSWORD &apos;postgres&apos;&quot;
tail
</code></pre>
<p>The last <strong>tail</strong> command is needed to ensure that the entrypoint.sh script never completes; this effectively makes the container run in the background, otherwise it would stop execution immediately after startup.</p>
<h6 id="pg_hbaconf">pg_hba.conf</h6>
<pre><code>local   all             postgres                                peer
host    all             all             127.0.0.1/32            md5
host    all             all             0.0.0.0/0               md5

</code></pre>
<p>This file is used exclusively to configure Postgres to accept connections from remote hosts. By default Postgres permits external connections only from localhost, so when running in Docker you can&apos;t access it from your host machine. Thus we add the pg_hba.conf configuration file to permit inbound connections from any machine.</p>
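<p>The <code>0.0.0.0/0</code> rule above is what opens the database to every host, which is convenient for development but very permissive. If you want something tighter, the same record format can be narrowed to a subnet; for example (the subnet below is an assumption - adjust it to your own network), a rule that only accepts password logins from a single local subnet would look like this:</p>

```
# hypothetical tighter rule: only hosts on 192.168.0.0/24 may connect
host    all             all             192.168.0.0/24          md5
```

<p>Postgres evaluates pg_hba.conf records top to bottom and uses the first one that matches, so the narrower rule should replace (not precede) the catch-all one.</p>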
<h4 id="summary">Summary</h4>
<p>So now you have a Postgres database running locally without deep knowledge of its internals. It is also useful to know that there are pre-built Docker containers for almost any kind of application you can imagine, like an email server or a Hadoop cluster. You can effectively rely on them to start complex applications in minutes and save time by avoiding manual installation and configuration.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Persisting 100k messages per second on single server in real-time]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/07/flume_logo-1.jpg" alt loading="lazy"><br>
Beside regular problem of Big Data analysis there is one more complex subtle task - persisting highly intensive data stream in a real-time. Image a scenario when your application cluster generates 100k business transactions per second, each one should be properly processed and written to data storage for further analysis</p>]]></description><link>https://blog.zentaly.com/persisting-100k-messages-per-second/</link><guid isPermaLink="false">5c4e286598d81473d8c36fe4</guid><category><![CDATA[big data]]></category><category><![CDATA[flume]]></category><category><![CDATA[high load]]></category><category><![CDATA[hdfs]]></category><category><![CDATA[streaming]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Thu, 21 Jul 2016 07:31:11 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/07/flume_logo-1.jpg" alt loading="lazy"><br>
Besides the regular problem of Big Data analysis there is one more complex and subtle task - persisting a highly intensive data stream in real time. Imagine a scenario where your application cluster generates 100k business transactions per second; each one should be properly processed and written to data storage for further analysis and business intelligence reporting. Every transaction is also critical, and you cannot lose any, as that would break data integrity constraints.</p>
<p>There are different message broker tools available to address this kind of scenario, like Apache Kafka, Amazon SQS or RabbitMQ. They do their job pretty well, but they tend to be more general purpose and, as such, slightly bloated and time consuming to deploy.</p>
<p>In this article I would like to show how you can solve the real-time data processing task simply and elegantly with Apache Flume.</p>
<h5 id="flumedesign">Flume design</h5>
<p>Apache Flume is a distributed, reliable and highly available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralised data store.</p>
<p>Flume is simple and straightforward, as the name implies. You can configure it with a .properties file and set up complex data flows in a distributed manner. For the sake of our task we will configure Flume to get the maximum real-time throughput out of it. We set it up to act as a message broker between a clustered Java application and Hadoop HDFS raw data storage.</p>
<p>Every Flume data flow should have at least 3 main components: Source, Channel and Sink.</p>
<ul>
<li>Source receives data from external clients</li>
<li>Channel aggregates data and passes it to Sink</li>
<li>Sink is responsible for writing data to external system.</li>
</ul>
<p>Source could be any of the following adapters: Avro, Thrift, JMS or Kafka. Sink supports the HDFS, Avro, Thrift, Hive, HBase, Elastic and Kafka formats.</p>
<p>For our scenario we will configure Flume Server to receive Avro messages from multiple clients and write them to the HDFS filesystem. See the diagram below.<br>
<img src="https://blog.zentaly.com/content/images/2016/07/Flume-diagram.png" alt loading="lazy"><br>
There are 2 main problems to address when trying to persist 100k mes/sec on a single server node:</p>
<ul>
<li>Network/disk throughput (writing to HDFS)</li>
<li>Handling data stream peaks resiliently</li>
</ul>
<h5 id="configureforhighthroughput">Configure for high throughput</h5>
<p>We are going to solve the throughput problem with batch processing, so Flume will aggregate a specific number of messages into a single batch when sending them to HDFS.</p>
<pre><code>agent.sinks.hdfs.hdfs.rollCount = 300000
agent.sinks.hdfs.hdfs.batchSize = 10000
</code></pre>
<p>Here rollCount defines the maximum number of messages which can be saved into a single file on HDFS, and batchSize controls the number of messages written to HDFS per single transaction. The more data we write per transaction to HDFS, the fewer transactions we need, which decreases disk and network load.</p>
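<p>To make the effect of batching concrete, here is a back-of-the-envelope sketch (the 100k messages per second figure is the sustained rate from this article; the class and method names are purely illustrative):</p>

```java
// Back-of-the-envelope check of the batching settings above (a sketch).
public class BatchMath {
    // HDFS transactions needed per second for a given input rate and batch size
    static long transactionsPerSecond(long messagesPerSecond, long batchSize) {
        return messagesPerSecond / batchSize;
    }

    // seconds until rollCount messages accumulate and a new HDFS file is rolled
    static long secondsPerFile(long rollCount, long messagesPerSecond) {
        return rollCount / messagesPerSecond;
    }

    public static void main(String[] args) {
        // batchSize = 10000 turns 100k msg/sec into just 10 HDFS writes per second
        System.out.println(transactionsPerSecond(100_000, 10_000)); // 10
        // rollCount = 300000 rolls a fresh HDFS file roughly every 3 seconds
        System.out.println(secondsPerFile(300_000, 100_000));       // 3
    }
}
```

<p>So at the article&apos;s sustained rate the sink performs 10 HDFS transactions per second instead of 100,000 and starts a new file about every 3 seconds.</p>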
<h5 id="managingdatastreampeaksresiliently">Managing data stream peaks resiliently</h5>
<p>Another problem is data flow peaks: say the throughput rises to 500k mes/sec for some short period of time. We need to tolerate such a scenario by allocating an intermediate store to avoid out of memory exceptions.</p>
<pre><code>agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000
agent.channels.c1.type = memory
</code></pre>
<p>This is achieved by configuring the Channel to store up to 1 million records received from the Source before they are written to the Sink. This capacity is enough to absorb the backlog that builds up when the Source receives records faster than the Sink can write, which makes the entire system behave resiliently under data stream peaks.</p>
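<p>A rough sketch of how long this buffer lasts during a spike, assuming the sink keeps draining at our steady-state 100k messages/sec (both rates are illustrative, taken from the scenario above):</p>

```python
# How many seconds of a 500k msg/sec spike the channel can absorb.
capacity = 1_000_000   # agent.channels.c1.capacity
peak_in = 500_000      # msg/sec arriving during the spike (assumed)
drain = 100_000        # msg/sec the HDFS sink sustains (assumed)
headroom_sec = capacity / (peak_in - drain)
print(headroom_sec)    # channel fills up after this many seconds
```

<p>With these numbers the channel fills at 400k records/sec net, giving about 2.5 seconds of headroom, so this setup tolerates short bursts, not sustained overload.</p>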
<p>The transactionCapacity parameter controls how many records are passed from the Channel to the Sink per transaction (it makes sense to keep it equal to agent.sinks.hdfs.hdfs.batchSize). A higher value means fewer IO operations are required to write the same amount of data.</p>
<p>If the Channel needs to be fault tolerant and preserve messages across system failures, there is an option to use persistent storage on disk by setting agent.channels.c1.type = file.</p>
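<p>A minimal sketch of such a file-backed channel (the directory paths are placeholders for your environment; checkpointDir and dataDirs are the file channel's storage locations):</p>
<pre><code>agent.channels.c1.type = file
agent.channels.c1.checkpointDir = /var/flume/checkpoint
agent.channels.c1.dataDirs = /var/flume/data
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000
</code></pre>
<p>Keep in mind that the file channel trades throughput for durability, since every transaction is persisted to disk before being acknowledged.</p>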
<p>Here is the complete listing of the Flume server configuration:</p>
<pre><code>agent.sources = avro
agent.channels = c1
agent.sinks = hdfs

agent.sources.avro.type = avro
agent.sources.avro.channels = c1
agent.sources.avro.bind = 0.0.0.0
agent.sources.avro.port = 44444
agent.sources.avro.threads = 4

agent.sinks.hdfs.type = hdfs
agent.sinks.hdfs.channel = c1
agent.sinks.hdfs.hdfs.path = hdfs://cdh-master:8020/logs/%{messageType}/%y-%m-%d/%H
agent.sinks.hdfs.hdfs.filePrefix = event
agent.sinks.hdfs.hdfs.rollCount = 300000
agent.sinks.hdfs.hdfs.batchSize = 10000
agent.sinks.hdfs.hdfs.rollInterval = 0
agent.sinks.hdfs.hdfs.rollSize = 0
agent.sinks.hdfs.hdfs.idleTimeout = 60
agent.sinks.hdfs.hdfs.timeZone = UTC

agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000
</code></pre>
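<p>Assuming this listing is saved as flume.conf (the filename is just an example), the agent can be started with the standard flume-ng launcher; note that the --name value must match the "agent" prefix used throughout the properties:</p>
<pre><code>flume-ng agent --conf ./conf --conf-file flume.conf --name agent
</code></pre>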
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[How to run Docker]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/05/docker_toolbox-2.png" alt loading="lazy"></p>
<h4 id="thisapproachisdeprecatedasfornowyoucandownloadandinstalldockerfromofficialsite">This approach is deprecated as for now you can download and install docker from official site</h4>
<p><a href="https://docs.docker.com/install/">https://docs.docker.com/install/</a></p>
<p>There is conventional wisdom that Docker is a rather diverse and sophisticated tool, but recent updates have made it much simpler and more convenient to use.</p>
<p>So</p>]]></description><link>https://blog.zentaly.com/how-to-run-docker/</link><guid isPermaLink="false">5c4e286598d81473d8c36fe3</guid><category><![CDATA[docker]]></category><dc:creator><![CDATA[Alex Dik]]></dc:creator><pubDate>Mon, 23 May 2016 14:40:36 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://blog.zentaly.com/content/images/2016/05/docker_toolbox-2.png" alt loading="lazy"></p>
<h4 id="thisapproachisdeprecatedasfornowyoucandownloadandinstalldockerfromofficialsite">This approach is deprecated as for now you can download and install docker from official site</h4>
<p><a href="https://docs.docker.com/install/">https://docs.docker.com/install/</a></p>
<p>There is conventional wisdom that Docker is a rather diverse and sophisticated tool, but recent updates have made it much simpler and more convenient to use.</p>
<p>So I guess it is a good time to summarise the design and deployment details for everyone who is starting to work with Docker.</p>
<h6 id="dockerdesignatglance">Docker design at glance</h6>
<p>Docker provides the means to run your application inside an isolated Linux environment, aka a Docker container. A container is a lightweight simulation of a Linux OS, intended to host a single user application and isolate applications from each other.</p>
<p>This is a mental shift from running multiple applications on a Linux OS to running multiple containers hosted on a shared Linux kernel.</p>
<p>This provides multiple advantages:</p>
<ul>
<li>Declarative and reproducible application definition</li>
<li>Simplified and automated application deployment</li>
<li>Isolation of applications from each other</li>
</ul>
<h6 id="howtorundocker">How to run Docker</h6>
<p>There were troubled times when you had to set up Docker differently depending on your host OS. Since the release of Docker Toolbox this process has been standardised.</p>
<p>Let's go step by step through installing and running Docker on the macOS or Windows operating systems.</p>
<h6 id="step1installdocker">Step 1: install Docker</h6>
<p>You should download and install the following applications:</p>
<ol>
<li><a href="https://www.virtualbox.org">Install VirtualBox</a></li>
<li><a href="https://www.docker.com/products/docker-toolbox">Install Docker Toolbox</a></li>
</ol>
<h6 id="step2configuredocker">Step 2: configure Docker</h6>
<p>There is one more concept needed to run Docker: docker-machine. It is responsible for setting up and running Docker containers inside a virtual machine with Linux (since containers can only run on a Linux kernel).</p>
<p>To configure Docker you need to initialise a docker-machine with the following command:</p>
<pre><code>docker-machine create --driver virtualbox default
</code></pre>
<p>This will create a virtual machine with a tiny Linux kernel that will host the Docker containers.</p>
<p>After the docker machine is created, the last step is to set 3 environment variables:</p>
<ol>
<li>DOCKER_TLS_VERIFY</li>
<li>DOCKER_HOST</li>
<li>DOCKER_CERT_PATH</li>
</ol>
<p>To get the values for these variables, run the <code>docker-machine env</code> command. The output should look like the following:</p>
<pre><code>&gt;&gt;Alex$ docker-machine env
export DOCKER_TLS_VERIFY=&quot;1&quot;
export DOCKER_HOST=&quot;tcp://192.168.99.100:2376&quot;
export DOCKER_CERT_PATH=&quot;/Users/Alex/.docker/machine/machines/default&quot;
</code></pre>
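<p>Rather than copying each export line by hand, the same output can be evaluated directly in your shell, which is the usual docker-machine workflow:</p>
<pre><code>eval "$(docker-machine env default)"
</code></pre>
<p>This sets all three variables for the current shell session in one step; it has to be repeated in every new terminal.</p>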
<p>Done!</p>
<p>Now you have Docker up and running. To make sure everything is fine, run the<br>
<code>docker ps</code> command.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>