Maruti Gollapudi's Blog: Hadoop

Monday, January 6, 2014

"Apache Flume: Distributed Log Collection for Hadoop" - Book review

We have been working with Apache Flume for quite some time now. We have used it to load data from social networks into MongoDB and also for log collection. Recently I read a book on Flume titled "Apache Flume: Distributed Log Collection for Hadoop".

This is good starter material for a serious Flume developer. The second chapter provides a good step-by-step guide to setting up Flume and getting it running. I liked the way the flow is presented and the mention of the important arguments that can be passed on the Flume command line. The book also has useful information on monitoring tools; the description is brief, but it is good to have an introduction to some of them. Overall, the book gives good detail, with examples, on Flume flows and architecture, including agents, channels, sinks, interceptors etc. It would have been more helpful if the last two chapters had been more elaborate.

Tuesday, July 9, 2013

Modes of Big Data Analysis

We can look at analysis in three modes, based on what triggers the analysis.

Offline/Batch Mode

Analytics are performed and the results are made available for applications to use.
Ex: Clinical Trials, Voice of Customer

Real Time – OnDemand

Analysis is done and the results are presented when requested.
Ex: Up-sell/Cross-sell

Real Time – Stream based

Monitor streaming data (Twitter messages, transaction logs, sensor data) and trigger analysis based on an event or on the data itself.
Ex: Monitor and analyze online transactions for fraud; monitor social media messages for serious incidents.
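
To make the difference between the three triggers concrete, here is a minimal single-process sketch in Python. Everything in it is an assumption made for illustration: the `(customer_id, amount)` transaction shape, the function names and the fixed threshold are hypothetical and do not come from any particular product or from the original presentation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event shape used only for this sketch: (customer_id, amount).
Transaction = Tuple[str, float]


def batch_totals(transactions: Iterable[Transaction]) -> Dict[str, float]:
    """Offline/batch: compute every result up front and store it for later use."""
    totals: Dict[str, float] = defaultdict(float)
    for customer, amount in transactions:
        totals[customer] += amount
    return dict(totals)


def on_demand_total(transactions: List[Transaction], customer: str) -> float:
    """Real time, on demand: nothing is computed until someone asks."""
    return sum(amount for c, amount in transactions if c == customer)


def stream_monitor(
    stream: Iterable[Transaction],
    threshold: float,
    on_alert: Callable[[Transaction], None],
) -> None:
    """Real time, stream based: inspect each event as it arrives and
    trigger the analysis (here, a simple alert) when the data calls for it."""
    for event in stream:
        if event[1] > threshold:
            on_alert(event)
```

In the batch case the application reads precomputed totals, in the on-demand case the request itself triggers the computation, and in the stream case the arriving data triggers it. A real fraud monitor would of course use something richer than a fixed threshold.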

And below are the implementation approaches:

Massively Parallel Processing (databases and programming) - e.g. Hadoop MapReduce.
Scalable Database - NoSQL databases, and databases able to store huge volumes of data (e.g. Oracle Exadata) and to perform operations on that data.
In-memory Analytics - an approach to querying data when it resides in random access memory (RAM), as opposed to querying data that is stored on physical disks.
Big Data Appliance - combined hardware and software products designed specifically for analytical processing.
Processing in Memory (PIM) - a chip architecture in which the processor is integrated into a memory chip to reduce latency.
In-Database Analytics - a technology that allows data processing to be conducted within the database by building analytic logic into the database itself.
Real-time Stream Processing and Complex Event Processing (CEP)

A combination of the above approaches is needed to implement analytic applications.
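
As an illustration of the MapReduce style mentioned in the list above, here is a small word-count sketch in plain Python. It only mimics the map, shuffle and reduce phases inside one process; it does not use the Hadoop API, and the function names are hypothetical ones chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce functions run in parallel on many nodes and the framework handles the shuffle; the sketch only shows how the work is divided.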

Almost two years ago, for a couple of months, I had my first stint with Big Data and Hadoop before moving on to Social Analytics. As I resumed my interest in Big Data, I was looking through my old work, and the points above are from one of my early presentations.
