
Integrating Flume with Hadoop and Twitter

Follow the instructions:

2.1 Download Flume from an Apache mirror.

wget https://www.apache.org/dist/flume/stable/apache-flume-1.6.0-bin.tar.gz

2.2 Extract the Flume tarball.

tar -xvzf apache-flume-1.6.0-bin.tar.gz
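
Later steps assume a fixed Flume location; if you want FLUME_HOME to be /usr/local/apache-flume-1.6.0-bin as in step 2.8, one option is to move the extracted folder there (adjust if you keep Flume elsewhere, e.g. under /home/hadoop/work as in the flume-env.sh example below):

sudo mv apache-flume-1.6.0-bin /usr/local/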

2.3 Download the Flume Twitter source JAR (flume-sources-1.0-SNAPSHOT.jar) from this link:

https://drive.google.com/file/d/0B_t6uqPmWadsdWJNQ0NjaXBUYUk/view?usp=sharing

2.4 Copy the JAR into the Flume lib folder.

cp flume-sources-1.0-SNAPSHOT.jar $FLUME_HOME/lib
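
To confirm the JAR landed in the right place, you can list the lib folder (this assumes FLUME_HOME is already exported, as done in step 2.8):

ls $FLUME_HOME/lib | grep flume-sources
# expected: flume-sources-1.0-SNAPSHOT.jar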

2.5 Modify the flume-env.sh file: paste the given lines at the end of the file.

export JAVA_HOME=/usr/local/java/jdk1.8.0_144

FLUME_CLASSPATH="/home/hadoop/work/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"
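
If conf/flume-env.sh does not exist yet, the 1.6.0 distribution ships a template you can copy first (run from the Flume home directory):

cp conf/flume-env.sh.template conf/flume-env.sh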
2.6 Obtain your consumerKey, consumerSecret, accessToken, and accessTokenSecret
from https://apps.twitter.com/; these are available from your Twitter
developer account after creating a simple app. See here how to create a Twitter
app: https://www.youtube.com/watch?v=xqSp7060Gj0

2.7 Create a flume-twitter.conf file in the conf folder and paste the given lines.

Note: Replace the consumerKey, consumerSecret, accessToken, and accessTokenSecret placeholders with your own credentials.

TwitterAgent.sources = Twitter

TwitterAgent.channels = MemChannel

TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxxxxx

TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

TwitterAgent.sources.Twitter.accessToken = xxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

This is where you specify the keywords to be fetched from Twitter. Replace spark,
flink with your desired keywords.

TwitterAgent.sources.Twitter.keywords = spark, flink


TwitterAgent.sinks.HDFS.channel = MemChannel

TwitterAgent.sinks.HDFS.type = hdfs

TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/spark/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000

TwitterAgent.sinks.HDFS.hdfs.rollSize = 0

TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 10000
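
The HDFS sink will create the target path on first write, but pre-creating it is a quick way to verify that HDFS is reachable before starting the agent (assumes the Hadoop binaries are on your PATH):

hdfs dfs -mkdir -p /spark
hdfs dfs -ls /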

2.8 Configure ~/.bashrc: paste the given lines into your ~/.bashrc. Note that FLUME_HOME must match where you actually extracted Flume.

export FLUME_HOME=/usr/local/apache-flume-1.6.0-bin

export FLUME_CONF_DIR=$FLUME_HOME/conf
export FLUME_CLASSPATH=$FLUME_CONF_DIR

export PATH=$FLUME_HOME/bin:$PATH
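
Reload your shell configuration so the new variables take effect in the current session:

source ~/.bashrc
echo $FLUME_HOME   # should print the Flume installation path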

2.9 Change to the Flume lib directory.

cd $FLUME_HOME/lib

2.10 Rename these three files by changing their extension from .jar to .org; the
exact commands are sketched below. (This keeps the stock twitter4j JARs from
clashing with the twitter4j classes bundled in the custom source JAR.)

twitter4j-core-3.0.3.jar, twitter4j-media-support-3.0.3.jar, twitter4j-stream-3.0.3.jar

to

twitter4j-core-3.0.3.org, twitter4j-media-support-3.0.3.org, twitter4j-stream-3.0.3.org
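
A minimal way to do the renames, run from $FLUME_HOME/lib:

mv twitter4j-core-3.0.3.jar twitter4j-core-3.0.3.org
mv twitter4j-media-support-3.0.3.jar twitter4j-media-support-3.0.3.org
mv twitter4j-stream-3.0.3.jar twitter4j-stream-3.0.3.org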

2.11 Change back to the Flume home directory.

cd $FLUME_HOME
2.12 Run this command to start fetching Twitter data continuously.

bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume-twitter.conf -Dflume.root.logger=DEBUG,console
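
Once the agent is running, you can check from another terminal that events are landing in HDFS (FlumeData is the HDFS sink's default file prefix; exact file names will vary):

hdfs dfs -ls /spark
hdfs dfs -cat /spark/FlumeData.* | head -n 3   # raw tweet JSON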

To access the raw data retrieved from Twitter, open the Hadoop NameNode web UI at
http://localhost:50070/explorer.html#/

To stop retrieving tweets from Twitter, simply press CTRL+C.

Note: Make sure all the Hadoop daemons (NameNode, DataNodes, etc.) are running
before executing this command. If they are not, run these commands before step 2.12:

/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh
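
You can verify which daemons are up with jps; with HDFS and YARN running, the output should look roughly like this (PIDs will differ):

jps
# 2311 NameNode
# 2484 DataNode
# 2703 SecondaryNameNode
# 2891 ResourceManager
# 3064 NodeManager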

Note: To stop the Hadoop daemons (NameNode, DataNodes, etc.), execute these
commands in a terminal:

/usr/local/hadoop/sbin/stop-dfs.sh
/usr/local/hadoop/sbin/stop-yarn.sh

Don't forget to change the JAVA_HOME environment variable according to your
system.
flume-twitter.conf file configuration parameters

TwitterAgent.sinks.HDFS.channel = MemCh
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume/Twitter/day_key=%Y%m%d/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
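
Note that the %Y%m%d escapes in hdfs.path are resolved from each event's timestamp header. If the source does not attach one, the agent fails with a missing-timestamp error; a common workaround is to let the sink use the local clock instead (a standard HDFS sink property, shown here as an optional addition):

TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true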
capacity (default 100): the maximum number of events stored in the channel.

transactionCapacity (default 100): the maximum number of events the channel will
take from a source or give to a sink per transaction.

TwitterAgent.channels.MemCh.type = memory
TwitterAgent.channels.MemCh.capacity = 10000
TwitterAgent.channels.MemCh.transactionCapacity = 1000
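
These snippets show only the sink and channel properties; a runnable agent also needs the component declarations and the source-to-channel binding, along the lines of the full example in step 2.7 (adapted here, as an assumption, to the MemCh names used above):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemCh
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemCh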

We can execute the following command in the $FLUME_HOME directory to start
the ingestion:

./bin/flume-ng agent -f TwitterStream.properties --name TwitterAgent --conf $FLUME_HOME/conf -Dflume.root.logger=INFO,console

- conf-file: the Flume configuration file where we have configured the source, channel, sink, and the related properties.
- name: the name of the agent; in this case, TwitterAgent.
- conf: the configuration directory of Flume, generally $FLUME_HOME/conf.
- Dflume.root.logger=INFO,console: writes the logs to the console.
Flume Source, Sink and Channels examples
