Wait! Don’t write your microservice … yet

Knoldus Blogs

Day one, we were super excited to start a new project for a huge financial institution. They seemed to know the domain, and at Knoldus we understood the technology. The stakeholders were excited about the Reactive paradigm, and the client architects were all stoked about the way microservices would give them all the benefits that other companies had been harping about. We had proposed Lagom and it was taken well.

We had an overview of the domain, and we had a set of domain experts assigned to us who would be product owners on the different tracks we would work on. Great, it does seem like a perfect start. But when things are too good to be true … they probably are.

The lead architect (LA) asks me, “So how many logs would each microservice generate?” I am like … what? And he explains, “Since we are…

View original post 493 more words


Sparkling Water: H2O + Spark Introduction

What is H2O?

H2O is an open-source, in-memory, distributed machine learning and predictive analytics platform.

  1. H2O's core code is written in Java.
  2. A distributed key/value store is used to access and reference data, models, objects, etc., across all nodes and machines.
  3. The algorithms are implemented on top of H2O's distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading.
  4. H2O's REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP.
  5. The REST API is used by H2O's web interface (Flow UI), the R binding (H2O-R), and the Python binding (H2O-Python).


Note: Java is always required.

Languages like Scala and R are not required unless you want to use H2O from those environments.

A browser is required to run the H2O Flow UI.

Hadoop is not required to run H2O unless you want to deploy it on a Hadoop cluster.

Spark v1.4 or later is required only if you want to run Sparkling Water.

H2O Flow UI

By default, flows are saved to the h2oflows directory underneath your home directory.

  1. Download and extract the H2O package, then go to the H2O directory.
  2. Run the command java <JVM Options> -jar h2o.jar <H2O Options>, for example:

            java -jar h2o.jar

  3. Point your browser to http://localhost:54321 to open the Flow UI.


Note: H2O requires some space in the /tmp directory to launch. If you cannot launch H2O, try freeing up some space in the /tmp directory, then try launching H2O again.

Sparkling Water

Sparkling Water is designed to be executed as a regular Spark application. It provides a way to initialize H2O services on each node in the Spark cluster and access data stored in data structures of Spark and H2O.

As Sparkling Water is designed as a Spark application, it is launched inside a Spark executor, where H2O starts all of its services, including the distributed key/value store and the memory manager.

Once the H2O services are running, it is possible to create H2O data structures, call H2O algorithms, and transform data to and from RDDs.



  1. Use H2O algorithms in Spark workflows.
  2. Convert an H2O RDD into a Spark RDD and vice versa.
  3. Transform data between H2O and Spark data structures.
  4. Transparent execution of Sparkling Water on top of Spark.

Supported Data Sources:

  1. The standard Spark RDD API to load data from data sources and then transform it into an H2O Frame.
  2. The H2O RDD API to load data directly into an H2O Frame from:
  • Local files
  • HDFS files
  • S3 files

Supported Data Format:

  1. CSV
  2. SVMLight
  3. ARFF
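To make the above concrete, here is a minimal Scala sketch (against the Sparkling Water 2.0-era API) of the first approach: loading a CSV through the standard Spark API and then publishing it to H2O as an H2O Frame. The file path is hypothetical, and the snippet assumes the sparkling-water-core dependency is on the classpath; treat it as a sketch, not a definitive implementation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o._

object SparklingWaterSketch extends App {
  val spark = SparkSession.builder()
    .appName("sparkling-water-sketch")
    .master("local[*]")
    .getOrCreate()

  // Start H2O services inside the Spark executors
  val h2oContext = H2OContext.getOrCreate(spark.sparkContext)

  // Load data through the standard Spark API ...
  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/path/to/data.csv") // hypothetical path

  // ... and transform it into an H2O Frame
  val h2oFrame = h2oContext.asH2OFrame(df)
  println(h2oFrame.numRows())
}
```

Once the data is an H2O Frame, any H2O algorithm can consume it directly.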

Run Sparkling Water Application

We can run the application using:

  1. spark-submit, including the Sparkling Water library via the --jars or --packages option
  2. spark-shell, including the Sparkling Water library via the --jars or --packages option
  3. Sparkling Shell
  4. Sparkling Water Driver
  5. pySpark with pySparkling

Using Spark Submit  

The Spark option --packages points to the published Sparkling Water packages in the Maven repository. The benefit of using Spark packages is that you do not need to download the Sparkling Water fat JAR.

spark-submit --packages ai.h2o:sparkling-water-core_2.11:2.0.0,ai.h2o:sparkling-water-examples_2.11:2.0.0 --class RecommendationEngineDemo /home/mortred/workspace/RecommendationEngine/target/scala-2.11/recommendationengine_2.11-1.0.jar

Note: When you are using Spark packages, you do not need to download the Sparkling Water distribution; a Spark installation is sufficient!

In the next blog, I will come up with examples 🙂






Akka Event Bus

The architecture of Akka is based on the actor model, and the Akka Event Bus is based on a publish/subscribe model. An Event Bus is a mediator between publishers and subscribers that uses a classifier to classify events. Publishers publish events onto the Event Bus, and subscribers listening for certain types of events receive them.

Let's have a look at how the Akka Event Bus works 😉


The Event Bus has been generalized into a set of abstract data types and abstract methods. An Event Bus implementation must define these abstract data types. Here, type Event, type Classifier and type Subscriber are the abstract data types.

trait EventBus {
  type Event
  type Classifier
  type Subscriber

  def subscribe(subscriber: Subscriber, to: Classifier): Boolean

  def unsubscribe(subscriber: Subscriber, from: Classifier): Boolean
  def unsubscribe(subscriber: Subscriber): Unit

  def publish(event: Event): Unit
}

  • Event is the type of all events published on the Event Bus. It could be a case class, a case object or any other data type.
  • Classifier classifies the events coming into the Event Bus and selects the registered subscribers for specific events. It could be any data type, such as ActorRef, String or Boolean.
  • Subscriber is the type allowed to register on the Event Bus. Subscribers are basically actors listening for certain types of events.
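Before bringing Akka into it, the shape of this trait can be illustrated with a tiny self-contained sketch (plain Scala, no Akka; all names here are illustrative), where the classifier is a String topic and subscribers are plain callback functions:

```scala
import scala.collection.mutable

// Toy, non-Akka analogue of the abstract EventBus above:
// Event is a (topic, payload) pair, Classifier is the topic,
// and a Subscriber is just a callback function.
class StringTopicBus {
  type Event = (String, String)
  type Classifier = String
  type Subscriber = String => Unit

  private val subscribers =
    mutable.Map.empty[Classifier, mutable.Set[Subscriber]]

  def subscribe(subscriber: Subscriber, to: Classifier): Boolean =
    subscribers.getOrElseUpdate(to, mutable.Set.empty).add(subscriber)

  def unsubscribe(subscriber: Subscriber, from: Classifier): Boolean =
    subscribers.get(from).exists(_.remove(subscriber))

  def publish(event: Event): Unit = {
    val (topic, payload) = event
    subscribers.getOrElse(topic, mutable.Set.empty).foreach(_.apply(payload))
  }
}

object StringTopicBusDemo extends App {
  val bus = new StringTopicBus
  bus.subscribe(payload => println("got: " + payload), "colors")
  bus.publish(("colors", "Greetings")) // prints: got: Greetings
  bus.publish(("sizes", "ignored"))    // no subscriber for this topic
}
```

The real Akka traits below follow exactly this shape, with actors as subscribers and delivery via message sends instead of direct function calls.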

Notable points:

1) If the Classifier type is ActorRef, the ActorClassifier trait can be mixed into the Event Bus.

2) If the Classifier type is a function from Event to Boolean, the PredicateClassifier trait can be mixed into the Event Bus.

3) If the Subscriber type is ActorRef, the ActorEventBus trait can be mixed into the Event Bus.


In order to implement a valid Event Bus, you need to mix a classification trait into your Event Bus. The classifiers here are part of the Akka distribution. To find the right classifier, check out the existing implementations of classifiers on Git. There are several classifications implemented in the Akka system:

  1. Lookup Classification 
  2. SubChannel Classification
  3. Scanning Classification
  4. Managed Actor Classification 
  5. Actor Classification 

Actor Classification has been deprecated since Akka 2.4; we can use Managed Actor Classification instead.

Here, we will only talk about Lookup Classification.
Lookup Classification extracts an arbitrary classifier from every event coming into the Event Bus and defines the possible subscribers for each classifier.


Let's start with a simple tutorial.

Define a MessageEvent model which will be sent over a channel.

case class MessageEvent(channel: String, message: String)

Now, implement the EventBus trait with Lookup Classification.

import akka.event.{ActorEventBus, LookupClassification}

class LookUpBusImp extends ActorEventBus with LookupClassification {

  type Event = MessageEvent
  type Classifier = String

  // Initial size of the classifier-to-subscribers index
  protected def mapSize(): Int = 2

  // Extract the classifier (the channel name) from an incoming event
  protected def classify(event: Event): Classifier = event.channel

  // Deliver the event to a single subscriber
  protected def publish(event: Event, subscriber: Subscriber): Unit =
    subscriber ! event
}


Create a subscriber actor listening for MessageEvent-type events.

import akka.actor.Actor

class SubscriberActor extends Actor {
  override def receive: Receive = {
    case messageEvent: MessageEvent =>
      println("Channel name is >>>>>>" + messageEvent.channel)
  }
}

Subscribe and publish an event into the Event Bus.

import akka.actor.{ActorSystem, Props}

object LookUpBusApp extends App {
  val system = ActorSystem("mySystem")
  val subscriberActor = system.actorOf(Props[SubscriberActor], "subscriberActor")

  val lookUpBusEvent = new LookUpBusImp
  lookUpBusEvent.subscribe(subscriberActor, "colors")
  lookUpBusEvent.publish(MessageEvent("colors", "Greetings"))
}


You may check the complete code on GitHub.

And so, this is hopefully enough to whet your appetite for the Akka Event Bus 😉




Introduction to Spark 2.0

Rklick Solutions LLC

Overview of the Dataset, DataFrame and RDD APIs:

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

But due to issues with advanced optimization, development moved to the DataFrame API.

DataFrames brought custom memory management and runtime code generation, which greatly improved performance. So over the last year most of the improvements went into the DataFrame API.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
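As a quick sketch of the last of those sources (the column names and data here are made up), an existing RDD can be turned into a DataFrame with named columns like this:

```scala
import org.apache.spark.sql.SparkSession

object DataFrameSketch extends App {
  val spark = SparkSession.builder()
    .appName("dataframe-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // An existing RDD of tuples ...
  val rdd = spark.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))

  // ... becomes a DataFrame with named columns
  val df = rdd.toDF("id", "name")
  df.show()

  spark.stop()
}
```

This requires only the Spark dependency and runs in local mode.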

Though the DataFrame API solved many issues, it was not a…

View original post 742 more words

Building Analytics Engine Using Akka, Kafka & ElasticSearch

Knoldus Blogs

In this blog, I will share my experience of building a scalable, distributed and fault-tolerant analytics engine using Scala, Akka, Play, Kafka and ElasticSearch.

I would like to take you through the journey of building an analytics engine which was primarily used for text analysis. The inputs were structured, unstructured and semi-structured data, and we were doing a lot of data crunching using it. The analytics engine was accessible by the REST client and the web client (built in with the engine), as shown in the diagram below.

(Figure: First Architecture)

Here is a quick overview of the technology stack:

  1. Play Framework as the REST server & web application. (Play is an MVC framework based on a lightweight, stateless and web-friendly architecture.)
  2. Akka Cluster as the processing engine. (Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.)
  3. ClusterClient (a contributed module) for communication with the Akka cluster. It used to run…

View original post 996 more words

Hadoop on Multi Node Cluster

Rklick Solutions LLC

Step 1: Installing Java:

Java is the primary requirement for running Hadoop on a system, so make sure you have Java installed on your system using the following command:

$ java -version

If you don’t have Java installed on your system, use one of the following links to install it first.

Step 2: Creating a Hadoop User:

We recommend creating a normal (not root) account for Hadoop to work with. So create a system account using the following commands:

$ adduser hadoop
$ passwd hadoop

Step 3: Generate SSH Keys

After creating the account, it is also required to set up key-based SSH to its own account. To do this, execute the following commands.

[root@rklick01 ~]# su hadoop
[hadoop@rklick01 root]$ cd
[hadoop@rklick01 ~]$
[hadoop@rklick01 ~]$ ssh-keygen -t rsa

We will see logs like the following and can follow the instructions:
Generating public/private rsa key pair. Enter…

View original post 1,470 more words