Data Capture

PostgreSQL with Python

how-to-build-a-postgresql-database-to-store-tweets

Internal notes: tweetdata/python/tweepy/
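As a rough sketch of the PostgreSQL route, the snippet below flattens one tweet payload into a row and pairs it with a parameterized INSERT. The table name `tweets`, its columns, and the helper names `tweet_to_row` and `store_tweet` are assumptions made for illustration, and the `%s` placeholders assume a DB-API driver such as psycopg2. None of this is taken from the notes referenced above.

```python
from typing import Any, Dict, Tuple

# Hypothetical table layout; adjust to the real schema.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id   BIGINT PRIMARY KEY,
    user_name  TEXT,
    created_at TEXT,
    body       TEXT
)
"""

# Parameterized insert; duplicates of the same tweet id are skipped.
INSERT_SQL = (
    "INSERT INTO tweets (tweet_id, user_name, created_at, body) "
    "VALUES (%s, %s, %s, %s) ON CONFLICT (tweet_id) DO NOTHING"
)


def tweet_to_row(tweet: Dict[str, Any]) -> Tuple[int, str, str, str]:
    """Flatten a tweet dict (Twitter API v1.1 style fields) into a row tuple."""
    user = tweet.get("user") or {}
    return (
        int(tweet["id"]),
        user.get("screen_name", ""),
        tweet.get("created_at", ""),
        tweet.get("text", ""),
    )


def store_tweet(conn, tweet: Dict[str, Any]) -> None:
    """Insert one tweet through a DB-API connection that accepts %s placeholders."""
    cur = conn.cursor()
    try:
        cur.execute(INSERT_SQL, tweet_to_row(tweet))
        conn.commit()
    finally:
        cur.close()
```

Keeping the flattening step separate from the database call makes it possible to check the row shape without a running PostgreSQL server.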

Amazon Kinesis to Collect, Process and Analyze in Real-time

Amazon Kinesis (SQL Analytics) makes it easy to collect, process, and analyze real-time streaming data. It is an alternative to more complex offerings such as Apache Storm and Apache Spark Streaming.

Amazon Kinesis
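To make the "collect" step concrete, here is a minimal sketch of pushing one tweet into a Kinesis data stream from Python. It assumes the boto3 library and configured AWS credentials. The stream name passed in and the helper names `build_kinesis_record` and `send_tweet` are hypothetical. Using the tweet id as the partition key is one possible choice, not a requirement.

```python
import json
from typing import Any, Dict


def build_kinesis_record(stream_name: str, tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Build the keyword arguments for a Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis treats the payload as opaque bytes.
        "Data": json.dumps(tweet, separators=(",", ":")).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        "PartitionKey": str(tweet["id"]),
    }


def send_tweet(stream_name: str, tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Send one tweet to the stream; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("kinesis")
    return client.put_record(**build_kinesis_record(stream_name, tweet))
```

A call such as `send_tweet("tweets-stream", tweet)` would then write a single record, where `tweets-stream` is a placeholder for a stream that already exists.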

Spark Databricks with Kafka Data Capture
using Google Cloud Platform (GCP)

Spark Databricks Tweet Capture

Spark Databricks provides SQL access, an alternative to keeping a large amount of data in PostgreSQL. The Databricks platform lets you create a free Spark-Scala cluster.

Add a project in an approved Twitter developer account to get:

Four keys: Consumer API key, API secret key, Access token, Access token secret
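The four keys are secrets, so it is usually safer to read them from the environment than to paste them into code. The sketch below shows one way to do that. The environment variable names and the `load_twitter_keys` helper are hypothetical and chosen only for illustration.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names, one per key listed above.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_keys(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the four credentials, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise KeyError("Missing Twitter credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Failing at startup with the names of the missing variables tends to be easier to debug than an authentication error later in the stream.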

Use the free Community Edition of Databricks. Create a notebook set to Scala (e.g., Twitter Southeast).

Click: "Clusters > Create Cluster" and name it cluster1

Wait a couple of minutes for the cluster to appear in the Clusters list. Click the 3-dot menu and choose "Libraries > Install New".

Download spark-streaming-kafka-0-10_2.11 (version 2.4.3 was used here; get the latest): https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11/

Download spark-streaming-twitter_2.11 (version 1.6.3 was used here; get the latest): https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter_2.11/
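The two libraries above can also be written as Maven coordinates in the form group:artifact:version, which may be more convenient than uploading JAR files if the install dialog offers a Maven option. The small sketch below only formats those strings. The `MavenArtifact` type is a hypothetical helper, and the versions are the ones quoted above rather than the latest.

```python
from typing import NamedTuple


class MavenArtifact(NamedTuple):
    group: str
    artifact: str
    version: str

    def coordinate(self) -> str:
        """Return the group:artifact:version string."""
        return f"{self.group}:{self.artifact}:{self.version}"


# The _2.11 suffix is the Scala version the artifact was built for.
LIBRARIES = [
    MavenArtifact("org.apache.spark", "spark-streaming-kafka-0-10_2.11", "2.4.3"),
    MavenArtifact("org.apache.spark", "spark-streaming-twitter_2.11", "1.6.3"),
]

for lib in LIBRARIES:
    print(lib.coordinate())
```

Because the suffix names a Scala version, the cluster's Scala version should match it.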

AWS Managed Streaming for Apache Kafka

Amazon Managed Streaming for Apache Kafka (MSK) – Generally Available as of May 2019 (includes Apache ZooKeeper) https://aws.amazon.com/blogs/aws/amazon-managed-streaming-for-apache-kafka-msk-now-generally-available/

"Apache Kafka (Kafka) is an open-source platform that enables customers to capture streaming data like click stream events, transactions, IoT events, application and machine logs, and have applications that perform real-time analytics, run continuous transformations, and distribute this data to data lakes and databases in real time."

Next, create a cluster in Amazon Managed Streaming for Apache Kafka (MSK).

Yikes, MSK pricing says $0.21 per hour! But it's not clear whether a small dataset would cost less.
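To put the hourly figure in perspective, here is a back-of-the-envelope estimate. It assumes the $0.21 rate applies to a single broker that runs around the clock for a 30-day month, and it ignores storage, data transfer, and any other charges, so treat the result as a rough lower bound rather than a quote.

```python
HOURLY_RATE_USD = 0.21  # rate quoted above
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30     # assumption: a 30-day month


def monthly_cost(hourly_rate: float, brokers: int = 1) -> float:
    """Estimate the monthly cost of brokers that run continuously."""
    return round(hourly_rate * HOURS_PER_DAY * DAYS_PER_MONTH * brokers, 2)


print(monthly_cost(HOURLY_RATE_USD))
```

Under these assumptions one broker comes to about $151.20 per month (0.21 × 24 × 30), and the total scales linearly with the number of brokers.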
