Twitter Trends Analysis using Apache Spark (PySpark) on a local 2-node cluster.
Uses a socket stream that listens to a TCP server. The server connects to Twitter on the listener's behalf and forwards the tweets to the socket stream. The tweets are analysed in real time: given a trending term, the tweet stream is scanned and the number of occurrences of the term is counted for each minute.
- Jupyter notebook - twitter_feed_bda.ipynb
- Server broker - tweetread.py
- Scoured data - tweet_count.csv
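
The per-minute counting described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names count_occurrences and run_stream, the host, the port, and the local[2] master are all assumptions, and the server is assumed to send one tweet per line.

```python
from operator import add


def count_occurrences(text, term):
    """Count case-insensitive, non-overlapping occurrences of term in text."""
    if not term:
        return 0
    return text.lower().count(term.lower())


def run_stream(term, host="localhost", port=5555, batch_seconds=60):
    """Print how often `term` appears in each batch of the tweet stream.

    host and port are placeholders (assumptions); use whatever address
    tweetread.py actually binds to.
    """
    # PySpark is imported here so the counting helper works without it.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # "local[2]" is an assumption: one thread receives, one processes.
    sc = SparkContext("local[2]", "TwitterTrendCount")
    # A 60-second batch interval gives one count per minute.
    ssc = StreamingContext(sc, batch_seconds)

    # Each record is one line of text read from the TCP server.
    lines = ssc.socketTextStream(host, port)
    counts = lines.map(lambda line: count_occurrences(line, term)).reduce(add)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
```

Calling run_stream("python") from a notebook cell would block and print one count per batch until the stream is stopped. The counting helper is kept separate so it can be checked without a running Spark context or server.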
Configuring PySpark and IPython notebooks
- https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f
- https://medium.com/@GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b
The rest is self-explanatory.