A backup of the Postgres database prepared for this study resides in an Amazon RDS snapshot (RDS > Snapshots) within the LH account.
After capturing data from Twitter, the following steps pre-process the geotagged records for mobility studies, allowing fast searches and visual analytics.
Pre-processing occurs in PostgreSQL using an imported CSV file containing an id, latitude, longitude, and created date.
The following steps are preliminary to generating CSV output with 5-day norm-to-impact data, 6 displacement step tiers, and daily radius of gyration calculations for a 33-day span surrounding each storm.
To try: Using the Query Editor for Aurora Serverless
Generate the table with the script below, then import and merge multiple months and any missing days. AWS recommends managing a PostgreSQL database with the client app pgAdmin, which includes CSV import and export tools.
CREATE TABLE public.matthew_jacksonville (
    twitter_uid bigint NOT NULL,
    created TIMESTAMPTZ,
    longitude real,
    latitude real
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.matthew_jacksonville OWNER to dataportal;

-- Refresh your table list after creating new table
When importing a Twitter CSV file from a Northeastern University Mongo export, use the following script. The columns provided are: id, user_id, lon, lat, created_at, text. The "created" column is initially in text format to match the data source.
-- FOR DATA PROVIDED BY NORTHEASTERN UNIVERSITY
CREATE TABLE public.matthew_jacksonville (
    id bigint NOT NULL,
    twitter_uid bigint NOT NULL,
    longitude real NOT NULL,
    latitude real NOT NULL,
    createdText text NOT NULL,
    text text
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.matthew_jacksonville OWNER to dataportal;

For 2013 Northeastern data, run the following after adding the other indexes (omit the created index in step 5) and importing:

ALTER TABLE public.matthew_jacksonville ADD COLUMN created TIMESTAMPTZ;
UPDATE matthew_jacksonville SET created = to_timestamp(createdText::int);
CREATE INDEX matthew_jacksonville_created_index ON matthew_jacksonville (created);
ALTER TABLE public.matthew_jacksonville
    DROP COLUMN createdText,
    DROP COLUMN id,
    DROP COLUMN text;
Import one or more CSV files into the same table when the data has multiple source files.
Large CSV files need to be reduced to below 1 GB before importing into PostgreSQL with pgAdmin. The app CSVSplitter will split them into 5 million rows per file, yielding roughly 5 files per GB.
In pgAdmin, right-click the table and choose Import/Export. (Refresh tables if the table is not visible.)
- Works without setting Encoding.
- Set to Import. Set Header to Yes.
- The next two are not necessary unless importing descriptions:
  - Set the delimiter to comma.
  - Set the Escape character to " to avoid errors from Arabic. (This is only an issue when the CSV has unquoted text fields with the username or tweet text.)

The import will return the row count after the word COPY under More Details. Record the total count and distinct count on the storm's GitHub repo readme page. (Wait to remove duplicates until after adding the guid, as one step, so vacuum can be run just once.) Wait to insert the extra columns (step 4) until after the import to avoid the error "missing data for column". If storing altitude, use geometry PointZ rather than Point.
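If pgAdmin is unavailable, a client-side psql \copy can perform the same import. This is only a sketch: it assumes the CSV sits on the client machine, its columns match the order listed, and the file name shown is hypothetical.

-- Mirrors the pgAdmin dialog settings above (header row, comma delimiter, " as quote/escape)
\copy public.matthew_jacksonville (twitter_uid, created, longitude, latitude) FROM 'matthew_jacksonville_part1.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"')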
The raw data often contains posts that occur at the same time and location. Use the following to create a temp table containing the distinct rows, then copy them back to the original table name.
-- Make note of total vs unique rows
SELECT count(*) AS total,
    (SELECT count(*) FROM
        (SELECT DISTINCT twitter_uid, created, latitude, longitude
         FROM matthew_jacksonville) x
    ) AS distinctTotal
FROM matthew_jacksonville;

-- The following takes 3 minutes for 20 million rows

-- Step 1 - Move to a new table to make distinct
CREATE TABLE matthew_jacksonville_temp (LIKE matthew_jacksonville);

-- Step 2
INSERT INTO matthew_jacksonville_temp (twitter_uid, created, latitude, longitude)
SELECT DISTINCT twitter_uid, created, latitude, longitude
FROM matthew_jacksonville;

-- RUN "SET the_geom" here. (Difference of hours versus over 8 days for 20 million rows.)
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
ALTER TABLE matthew_jacksonville_temp ADD COLUMN the_geom geometry(Point,4326);
UPDATE matthew_jacksonville_temp
SET the_geom = ST_SetSrid(ST_MakePoint(longitude, latitude), 4326);

-- Step 3 - Replace the original table
DROP TABLE matthew_jacksonville;
CREATE TABLE matthew_jacksonville (LIKE matthew_jacksonville_temp);
INSERT INTO matthew_jacksonville (twitter_uid, created, latitude, longitude, the_geom)
SELECT DISTINCT twitter_uid, created, latitude, longitude, the_geom
FROM matthew_jacksonville_temp;
DROP TABLE matthew_jacksonville_temp;
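As an optional check after step 3 (a quick sketch, not part of the original workflow), the total and distinct counts should now match:

SELECT count(*) AS total,
       count(DISTINCT (twitter_uid, created, latitude, longitude)) AS distinct_total
FROM matthew_jacksonville;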
-- 3.5 minutes for 20 million rows
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
ALTER TABLE matthew_jacksonville
    ADD COLUMN localdate TIMESTAMPTZ,
    ADD COLUMN displacement real,
    ADD COLUMN displacement_hav real,
    ADD COLUMN prior_created TIMESTAMPTZ,
    ADD COLUMN prior_localdate TIMESTAMPTZ,
    ADD COLUMN prior_latitude real,
    ADD COLUMN prior_longitude real,
    -- ADD COLUMN the_geom geometry(Point,4326), -- Added in prior step
    ADD COLUMN prior_geom geometry(Point,4326),
    ADD COLUMN guid uuid DEFAULT uuid_generate_v4() NOT NULL unique,
    ADD COLUMN prior_guid uuid,
    ADD COLUMN datasource smallint;
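To confirm the new columns were added (an optional sketch, not required by the workflow):

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'matthew_jacksonville'
ORDER BY ordinal_position;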
You'll typically skip this step since merges occur before the extra columns are added. It is used occasionally to add missing data.
-- Simple additions to a file being merged into the main file that already has all columns
ALTER TABLE matthew_jacksonville_MONTH
    ADD COLUMN guid uuid DEFAULT uuid_generate_v4() NOT NULL unique;

INSERT INTO matthew_jacksonville (twitter_uid, created, longitude, latitude, guid)
SELECT twitter_uid, created, longitude, latitude, guid
FROM matthew_jacksonville_MONTH
WHERE created >= '2017-10-06 00:00';

-- If from an external source, include datasource.
INSERT INTO matthew_jacksonville (twitter_uid, created, longitude, latitude, guid, datasource)
SELECT twitter_uid, created1, longitude, latitude, guid, 1
FROM matthew_jacksonville_MONTH;

INSERT INTO matthew_jacksonville (twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource)
SELECT twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource
FROM ph2014
WHERE created BETWEEN SYMMETRIC '2014-06-20 04:00' AND '2017-08-05 04:00';
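After a merge, a quick sanity check (a sketch, assuming the datasource column distinguishes the sources as above) is to compare row counts per source:

SELECT datasource, count(*) AS rows_per_source
FROM matthew_jacksonville
GROUP BY datasource
ORDER BY datasource;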
Times in the "created" date field are GMT for both sources (NDL and Northeastern). Yan's daylight-saving times are off by one hour for the summer of 2018 (11 March 2018 2am to 4 November 2018 2am); she changed the time on her Twitter capture server.
-- RUN ONE OF THE FOLLOWING, OR TWO IF NOV OR MARCH.
-- CHOOSE DATABASE (created is in GMT and should never be changed. Change localdate.)
-- For USA, summer and fall daylight saving, subtract 4
UPDATE japan SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 9;
UPDATE hongkong SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;
UPDATE tw2016 SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;
UPDATE ja2014 SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;
UPDATE ph2014 SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;

-- Nov cutoff date here (run both)
UPDATE se2017 SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 4 WHERE created < '2017-11-5 02:00:00';
UPDATE se2017 SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 5 WHERE created >= '2017-11-5 02:00:00';

-- harveyEdgar 2017-05-19 03:18:20+00 to 2018-10-04 06:04:45+00 (10 minutes to run)
UPDATE harveyEdgar SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 5 WHERE created < '2017-11-5 02:00:00';
UPDATE harveyEdgar SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 6 WHERE created >= '2017-11-5 02:00:00';

UPDATE matthew_jacksonville SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 4 WHERE created < '2016-11-6 02:00:00';
UPDATE matthew_jacksonville SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 5 WHERE created >= '2016-11-6 02:00:00';

Standard time runs from Sunday 4 November 2018 02:00 local time to Sunday 10 March 2019 02:00 local time.

Subtract 4 hours when it is a US storm and the date falls within daylight saving time:
- 2013: 10 March 2013 2am to 3 November 2013 2am
- 2014: 9 March 2014 2am to 2 November 2014 2am
- 2015: 8 March 2015 2am to 1 November 2015 2am
- 2016: 13 March 2016 2am to 6 November 2016 2am
- 2017: 12 March 2017 2am to 5 November 2017 2am
- 2018: 11 March 2018 2am to 4 November 2018 2am

Otherwise subtract 5 hours (standard time in the winter). Past and future ranges: https://greenwichmeantime.com/time-zone/rules/usa/

The actual time zone is NOT saved with any of the Postgres date/time types. Each date file generated from the pickle files originates at midnight EST, so the GMT time differs from midnight by either 4 or 5 hours.
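As a possible alternative to the manual interval arithmetic above (a sketch only, not the method used for the published tables), PostgreSQL's time zone database can apply the US daylight-saving cutoffs automatically. Verify a sample of rows against the manual method before adopting it.

-- Eastern US storms; DST cutoffs come from the tz database
UPDATE matthew_jacksonville
SET localdate = created AT TIME ZONE 'America/New_York';

-- Houston tables would use 'America/Chicago'; Japan (no DST) would use:
UPDATE japan
SET localdate = created AT TIME ZONE 'Asia/Tokyo';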
For smaller datasets:
Discontinue using this first approach. Runtimes: 6.5 million rows ran over 3 days (canceled) when 3 indexes were already added. Next, try a table of over 1 million rows with no indexes.
24 minutes, now 31 (after adding guid and renaming id to twitter_uid) on se (201,309 rows, Irma). 1 minute for hongkong (Mangkhut). Took more than 12 hours on harveyEdgar, even though all prior queries were fast for harveyEdgar, which contains 1.2 million rows.

-- Table alias "b" is the next tweet; insert values from the individual's prior tweet row "a"
WITH subquery AS (
    SELECT b.guid, a.guid prior_guid, a.created, a.localdate,
           ST_DistanceSphere(a.the_geom, b.the_geom)/1000 displacement,
           a.the_geom, a.latitude, a.longitude
    FROM matthew_jacksonville a
    JOIN matthew_jacksonville b
        -- Select the user's next upcoming row
        ON a.twitter_uid = b.twitter_uid
        AND b.created > a.created
        AND b.guid = (SELECT c.guid FROM matthew_jacksonville c
                      WHERE c.twitter_uid = a.twitter_uid AND c.created > a.created
                      ORDER BY c.created LIMIT 1)
)
UPDATE matthew_jacksonville theData
SET displacement = subquery.displacement,
    prior_created = subquery.created,
    prior_localdate = subquery.localdate,
    prior_geom = subquery.the_geom,
    prior_latitude = subquery.latitude,
    prior_longitude = subquery.longitude,
    prior_guid = subquery.prior_guid,
    -- http://daynebatten.com/2015/09/latitude-longitude-distance-sql/
    -- 6371 = radius of the earth in kilometers
    displacement_hav = 2 * 6371 * asin(sqrt((sin(radians((theData.latitude - subquery.latitude) / 2))) ^ 2 + cos(radians(subquery.latitude)) * cos(radians(theData.latitude)) * (sin(radians((theData.longitude - subquery.longitude) / 2))) ^ 2))
FROM subquery
WHERE theData.guid = subquery.guid;
For large datasets exceeding 1 million rows, join an indexed table to a copy of the table where the updates will occur; this avoids slow simple UPDATE queries in PostgreSQL. (This reduces runtime from days to hours: 3 hours for 6.5 million rows, versus over 3 days otherwise.)

Learn more about dropping all of the indexes, triggers, and foreign keys while the update runs, then recreating them at the end.

If indexes remain in place, there is not much speed improvement from running VACUUM, ANALYZE, and REINDEX. (2.5 hours for 8.5 million rows without running them; still 3 hours for 6.5 million after running them with retained indexes.)
-- Add indexes on the three fields to be used by the join (twitter_uid, created, guid)
CREATE INDEX matthew_jacksonville_twitter_uid_index ON matthew_jacksonville (twitter_uid);
CREATE INDEX matthew_jacksonville_created_index ON matthew_jacksonville (created);
CREATE INDEX matthew_jacksonville_guid_index ON matthew_jacksonville (guid);

CREATE TABLE matthew_jacksonville_copy (LIKE matthew_jacksonville);

-- Note: this cannot be modified to SELECT DISTINCT since localdate is already populated. Distinct occurs previously.
INSERT INTO matthew_jacksonville_copy (twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource)
SELECT twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource
FROM matthew_jacksonville;

WITH subquery AS (
    SELECT b.guid, a.guid prior_guid, a.created, a.localdate,
           ST_DistanceSphere(a.the_geom, b.the_geom)/1000 displacement,
           a.the_geom, a.latitude, a.longitude
    FROM matthew_jacksonville a
    JOIN matthew_jacksonville b
        -- Select the user's next upcoming row
        ON a.twitter_uid = b.twitter_uid
        AND b.created > a.created
        AND b.guid = (SELECT c.guid FROM matthew_jacksonville c
                      WHERE c.twitter_uid = a.twitter_uid AND c.created > a.created
                      ORDER BY c.created LIMIT 1)
)
UPDATE matthew_jacksonville_copy theData
SET displacement = subquery.displacement,
    prior_created = subquery.created,
    prior_localdate = subquery.localdate,
    prior_geom = subquery.the_geom,
    prior_latitude = subquery.latitude,
    prior_longitude = subquery.longitude,
    prior_guid = subquery.prior_guid
    -- http://daynebatten.com/2015/09/latitude-longitude-distance-sql/
    -- 6371 = radius of the earth in kilometers
    -- displacement_hav = 2 * 6371 * asin(sqrt((sin(radians((theData.latitude - subquery.latitude) / 2))) ^ 2 + cos(radians(subquery.latitude)) * cos(radians(theData.latitude)) * (sin(radians((theData.longitude - subquery.longitude) / 2))) ^ 2))
FROM subquery
WHERE theData.guid = subquery.guid;

------------------
-- Skip the following. Only used for comparing the displacement_hav formula.
-- Populate displacement_hav. Also requires having two tables so indexes in one speed up finding the prior displacement.
CREATE TABLE matthew_jacksonville_copy (LIKE matthew_jacksonville);

INSERT INTO matthew_jacksonville_copy (twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource)
SELECT twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource
FROM matthew_jacksonville;

WITH subquery AS (
    SELECT b.guid, a.guid prior_guid, a.created, a.localdate, a.the_geom, a.latitude, a.longitude
    FROM matthew_jacksonville a
    JOIN matthew_jacksonville b
        -- Select the user's next upcoming row
        ON a.twitter_uid = b.twitter_uid
        AND b.created > a.created
        AND b.guid = (SELECT c.guid FROM matthew_jacksonville c
                      WHERE c.twitter_uid = a.twitter_uid AND c.created > a.created
                      ORDER BY c.created LIMIT 1)
)
UPDATE matthew_jacksonville_copy theData
SET displacement_hav = 2 * 6371 * asin(sqrt((sin(radians((theData.latitude - subquery.latitude) / 2))) ^ 2 + cos(radians(subquery.latitude)) * cos(radians(theData.latitude)) * (sin(radians((theData.longitude - subquery.longitude) / 2))) ^ 2))
FROM subquery
WHERE theData.guid = subquery.guid;

-- Run the following after the first query above.
-- If storage is exceeded, the process above may end without notification,
-- so we pause here. Otherwise check that the displacement field is populated to confirm the copy sequence completed.

-- Delete the initial table and rename the copy to replace it (compare the count on the copy first)
DROP TABLE matthew_jacksonville;
CREATE TABLE matthew_jacksonville (LIKE matthew_jacksonville_copy);
INSERT INTO matthew_jacksonville (twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource)
SELECT DISTINCT twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource
FROM matthew_jacksonville_copy;
DROP TABLE matthew_jacksonville_copy;
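The comment above says to compare the count on the copy before dropping the original; a minimal check (run before the DROP statements, using the same table names) might look like:

SELECT
    (SELECT count(*) FROM matthew_jacksonville) AS original_count,
    (SELECT count(*) FROM matthew_jacksonville_copy) AS copy_count,
    (SELECT count(*) FROM matthew_jacksonville_copy WHERE displacement IS NOT NULL) AS copy_rows_with_displacement;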
-- Typically skip this dropping part.
-- Used when dropping and re-adding the table; run first.
DROP INDEX matthew_jacksonville_twitter_uid_index;
DROP INDEX matthew_jacksonville_created_index;
DROP INDEX matthew_jacksonville_guid_index;
DROP INDEX matthew_jacksonville_latitude_index;
DROP INDEX matthew_jacksonville_longitude_index;
DROP INDEX matthew_jacksonville_prior_created_index;
DROP INDEX matthew_jacksonville_localdate_index;
DROP INDEX matthew_jacksonville_prior_guid_index;
DROP INDEX matthew_jacksonville_the_geom;
DROP INDEX matthew_jacksonville_prior_geom;
DROP INDEX matthew_jacksonville_displacement_index;

-- Runs in less than 10 minutes for 8 million records.
-- Re-add the 3 indexes from step 8, since they were deleted when copying back the data.
CREATE INDEX matthew_jacksonville_twitter_uid_index ON matthew_jacksonville (twitter_uid);
CREATE INDEX matthew_jacksonville_created_index ON matthew_jacksonville (created);
CREATE INDEX matthew_jacksonville_guid_index ON matthew_jacksonville (guid);
CREATE INDEX matthew_jacksonville_latitude_index ON matthew_jacksonville (latitude);
CREATE INDEX matthew_jacksonville_longitude_index ON matthew_jacksonville (longitude);
CREATE INDEX matthew_jacksonville_prior_created_index ON matthew_jacksonville (prior_created);
CREATE INDEX matthew_jacksonville_localdate_index ON matthew_jacksonville (localdate);
CREATE INDEX matthew_jacksonville_prior_guid_index ON matthew_jacksonville (prior_guid);
CREATE INDEX matthew_jacksonville_the_geom ON matthew_jacksonville USING GIST (geography(the_geom));
CREATE INDEX matthew_jacksonville_prior_geom ON matthew_jacksonville USING GIST (geography(prior_geom));
CREATE INDEX matthew_jacksonville_displacement_index ON matthew_jacksonville (displacement);

-- Run the following to see all indexes on the table.
-- Includes the key for the unique index on guid.
-- CASE SENSITIVE: use lowercase, since capital letters are ignored in the initial table and index inserts.
SELECT * FROM pg_indexes WHERE tablename = 'matthew_jacksonville';

-- Show which columns are indexed. Excludes the 2 geom columns.
SELECT t.relname AS table_name,
       i.relname AS index_name,
       array_to_string(array_agg(a.attname), ', ') AS column_names
FROM pg_class t, pg_class i, pg_index ix, pg_attribute a
WHERE t.oid = ix.indrelid
  AND i.oid = ix.indexrelid
  AND a.attrelid = t.oid
  AND a.attnum = ANY(ix.indkey)
  AND t.relkind = 'r'
  AND t.relname LIKE 'matthew_jacksonville'
GROUP BY t.relname, i.relname
ORDER BY t.relname, i.relname;
Add a table with a twitter_uid column containing known bots. russian-troll-tweets - list of bots. Ran in: matthew_jacksonville, se2017, se2018, tw2016, harvey_houston, hongkong, ja2014 (none found).
CREATE TABLE public.bots (
    twitter_uid bigint NOT NULL
)
WITH (
    OIDS = FALSE
);

ALTER TABLE public.bots OWNER to dataportal;

-- After importing the bot list, check for bot accounts in the storm table
SELECT count(*)
FROM bots
JOIN matthew_jacksonville ON bots.twitter_uid = matthew_jacksonville.twitter_uid;
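If any bots are found, a possible way to exclude them in downstream counts (an anti-join; a sketch, not a step in the original workflow) would be:

SELECT count(*) AS rows_excluding_bots
FROM matthew_jacksonville m
WHERE NOT EXISTS (
    SELECT 1 FROM bots b WHERE b.twitter_uid = m.twitter_uid
);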
Using VACUUM FULL locks the table. Vacuuming large tables may exceed storage capacity.
The verbose report returns all 0's, so selecting it may not be needed.
(22 seconds on 6.5 million rows after copying and indexing, so vacuuming may not be necessary immediately after adding indexes.)
-- Right-click the table name and choose Maintenance. Retain FULL as NO.
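The SQL equivalent of the Maintenance dialog with FULL left as NO (a sketch; adjust the table name per storm) is:

VACUUM (VERBOSE, ANALYZE) matthew_jacksonville;
-- Rebuild indexes separately if needed (blocks writes while running)
REINDEX TABLE matthew_jacksonville;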
15 days before and 17 days after the storm (plus overlap for 3am end-of-day times).
Dynamic-Function.md (private repo)
Use isimpactday='YES' to display the day of impact.
Norm-Impact-Chart.md (private repo)
Uses Norm-Impact-Chart to output 6 rows for each storm.
Includes output of degrees to match figure 1. Also includes percent change.
Perturbation-Chart.md (private repo)
Utilizes the *_gyration tables generated by Dynamic-Function.md: 240,611 user days of gyration. Of these, 81,024 have 5 or more steps. Does not include step_type since each gyration covers the whole city.
Gyration-All.md (private repo)
Utilizes all_norm_and_impact table.
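The gyration scripts themselves live in the private repo. For reference, a minimal sketch of a per-user daily radius-of-gyration query over the pre-processed table above (not the project's actual implementation; the centroid here is a planar approximation) could look like:

WITH daily AS (
    SELECT twitter_uid,
           date_trunc('day', localdate) AS local_day,
           the_geom
    FROM matthew_jacksonville
),
centers AS (
    -- Centroid of each user's points for the local day
    SELECT twitter_uid, local_day, ST_Centroid(ST_Collect(the_geom)) AS center
    FROM daily
    GROUP BY twitter_uid, local_day
)
SELECT d.twitter_uid,
       d.local_day,
       -- Root-mean-square distance (km) from the day's centroid
       sqrt(avg(power(ST_DistanceSphere(d.the_geom, c.center), 2))) / 1000 AS radius_of_gyration_km
FROM daily d
JOIN centers c ON c.twitter_uid = d.twitter_uid AND c.local_day = d.local_day
GROUP BY d.twitter_uid, d.local_day;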