David is Coding

Just another techie blog

Real-Time Twitter Analysis 4: Displaying the Results

In the previous post, we processed a stream of tweets in real-time with Spark Streaming in order to calculate some information such as top rankings and counters. Now it's time to display this data in a way that is easier for humans to consume. In this post, we'll create a simple web-based dashboard by using […]

Real-Time Twitter Analysis 3: Tweet Analysis on Spark

Real-Time Analysis on Spark

We already have a Twitter stream ingested into our cluster using Flume and Kafka, as described in my previous post. The next step is to process and analyze the tweets read from a Kafka topic with Apache Spark Streaming. Our goal here is to perform some calculations on top of the received tweets in order […]
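As a quick preview of what the full post walks through, here is a minimal Scala sketch of reading a Kafka topic with Spark Streaming's direct stream API. The broker address, topic name ("tweets"), group id and batch interval are placeholder assumptions for illustration, not necessarily the values used in the post.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TweetStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TweetStream")
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed 10s batches

    // Placeholder Kafka connection details; adjust broker and topic names
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "quickstart.cloudera:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "tweet-analysis",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("tweets")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    // Trivial example calculation: count tweets per batch
    stream.map(record => record.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```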

Real-Time Twitter Analysis 2: Twitter Stream with Flume

Ingesting Twitter Stream

We already discussed the architecture for this project in my previous post here. Now, it's time to roll up our sleeves and start working on it. The first step is to ingest the Twitter Stream into our cluster. For this task, we'll use Apache Flume and Apache Kafka, which in conjunction are also known as […]
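The ingestion itself is done by configuring a Flume agent rather than by writing code, but a small Scala Kafka consumer like the sketch below can be used to spot-check that events are actually landing in the topic. The broker address and topic name ("tweets") are placeholder assumptions.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object TopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Placeholder broker address for a Quickstart-style single-node setup
    props.put("bootstrap.servers", "quickstart.cloudera:9092")
    props.put("group.id", "topic-check")
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("tweets").asJava) // placeholder topic name

    // Print a handful of raw payloads and exit
    val records = consumer.poll(5000).asScala
    records.take(5).foreach(r => println(r.value()))
    consumer.close()
  }
}
```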

Real-Time Twitter Analysis 1: Introduction

After setting up Cloudera’s Quickstart VM, as described in my previous post, it’s time to get some hands-on experience with Data Engineering. For this purpose, I opted for performing a real-time sentiment analysis over Twitter. The idea is to put into play different tools and skills I picked up during the Big Data […]

Installing Spark 2 and Kafka on Cloudera’s Quickstart VM

As you probably know, to work with Big Data we need a cluster of several nodes. Unfortunately, most people don’t normally have access to one. If we want to learn how to use the technologies behind it, we need to make use of a VM with a pseudo-cluster assembled in it, and a set of […]

Impala: retrieve data from HDFS

Cloudera Impala is another tool that allows us to query data stored in Hadoop file systems with a language very similar to SQL. It is designed to return results with low latency, which makes it ideal for interactive queries. It is very similar to Hive since, in essence, they share the same purpose: retrieve […]
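The post itself works from the impala-shell, but as a rough Scala sketch, Impala can also be queried over JDBC. The snippet below assumes an unsecured Quickstart-style cluster reachable through the Hive JDBC driver on Impala's HiveServer2-compatible port (21050); the hostname, database and table name are placeholders, and the exact connection options depend on your cluster's security setup.

```scala
import java.sql.DriverManager

object ImpalaQuery {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver, which can also talk to Impala
    // on port 21050 when the cluster is unsecured (assumption).
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    val conn = DriverManager.getConnection(
      "jdbc:hive2://quickstart.cloudera:21050/default;auth=noSasl")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM tweets") // placeholder table

    while (rs.next()) {
      println(s"rows: ${rs.getLong(1)}")
    }

    rs.close()
    stmt.close()
    conn.close()
  }
}
```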

Query data stored in HDFS with Hive

Apache Hive is a tool for Hadoop systems that allows querying data stored in HDFS as if it were a SQL relational database. Hive is a high-level abstraction on top of MapReduce that allows us to generate jobs using statements in a language very similar to SQL, called HiveQL. Using Hive is much faster and […]
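The post itself runs HiveQL from the Hive shell; as a related Scala sketch, the same kind of query can be issued through Spark SQL with Hive support enabled, so it resolves tables registered in the Hive metastore. The table name "tweets" and the columns are placeholder assumptions.

```scala
import org.apache.spark.sql.SparkSession

object HiveQuery {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark SQL read tables defined in the Hive metastore
    val spark = SparkSession.builder()
      .appName("HiveQuery")
      .enableHiveSupport()
      .getOrCreate()

    // HiveQL-style query; table and column names are hypothetical
    spark.sql(
      "SELECT author, COUNT(*) AS n FROM tweets GROUP BY author ORDER BY n DESC LIMIT 10")
      .show()

    spark.stop()
  }
}
```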

Import and export data with Sqoop in HDFS

When working with Big Data in Hadoop environments, a very useful command-line tool is Apache Sqoop. It allows us to import data stored in relational databases into HDFS, as well as to export data from HDFS to relational databases. The name of the tool comes from SQL + Hadoop, Sqoop, and it is based […]