
Data Processing

A backup of the Postgres database prepared for this study resides in Amazon RDS > Snapshots within the LH account.

After data is captured from Twitter, the following steps pre-process the geotagged records for mobility studies, allowing fast searches and visual analytics.

Pre-processing occurs in PostgreSQL using an imported CSV file containing an id, latitude, longitude and created date. The following steps are preliminary to generating CSV output with 5-day norm-to-impact data, 6 displacement step tiers, and daily radius of gyration calculations for a 33-day span surrounding each storm.

To try: Using the Query Editor for Aurora Serverless


A. Import and Prep Data within PostgreSQL

In the first three steps, data is merged and duplicates in the source data are removed. The process then adds local times (accounting for daylight saving time) and position geometry (geom) for the current location and the prior post location. The geom is used to calculate the displacement distance traveled between posts (within days starting at 3am), and a universally unique ID (guid) is added to prevent future duplicates.

1. CREATE A NEW TABLE

Generate the table with the script below, then import and merge multiple months and any missing days. AWS recommends managing the PostgreSQL database with the client app pgAdmin, which includes CSV import and export tools.

CREATE TABLE public.matthew_jacksonville
(
    twitter_uid bigint NOT NULL,
    created TIMESTAMPTZ,
    longitude real,
    latitude real
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;
ALTER TABLE public.matthew_jacksonville
    OWNER to dataportal;

-- Refresh your table list after creating new table

When importing a Twitter CSV file from a Northeastern University Mongo export, use the following script. The columns provided are: id, user_id, lon, lat, created_at, text. The "created" column is initially in text format to match the data source.

-- FOR DATA PROVIDED BY NORTHEASTERN UNIVERSITY
CREATE TABLE public.matthew_jacksonville 
(
    id bigint NOT NULL,
    twitter_uid bigint NOT NULL,
    longitude real NOT NULL,
    latitude real NOT NULL,
    createdText text NOT NULL,
    text text
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;
ALTER TABLE public.matthew_jacksonville
    OWNER to dataportal;

For 2013 Northeastern data, run after adding other indexes 
(omit the created index in step 5) and importing:

ALTER TABLE public.matthew_jacksonville 
ADD COLUMN created TIMESTAMPTZ;
UPDATE matthew_jacksonville SET created = to_timestamp(createdText::int);
CREATE INDEX matthew_jacksonville_created_index ON matthew_jacksonville (created);
ALTER TABLE public.matthew_jacksonville 
DROP COLUMN createdText,
DROP COLUMN id,
DROP COLUMN text;

2. IMPORT/MERGE INTO ONE TABLE

Import one or more CSV files into the same table when the data spans multiple source files.

Large CSV files need to be reduced to below 1 GB before importing into PostgreSQL with pgAdmin. The CSVSplitter app will split files to 5 million rows each, yielding roughly 5 files per GB.
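If splitting is impractical, psql's \copy can stream the full file from the command line and sidesteps the pgAdmin size limit noted above. A sketch (the file name is hypothetical; adjust the column list to match the CSV's column order):

\copy public.matthew_jacksonville (twitter_uid, latitude, longitude, created) FROM 'matthew_jacksonville.csv' WITH (FORMAT csv, HEADER true)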

In pgAdmin, right-click the table and choose Import/Export. (Refresh tables if not visible.)
- Works without setting Encoding.
- Set to Import. Set Header to Yes.
- The next two are not necessary unless importing descriptions:
- Set the delimiter to comma

Set the Escape character to " to avoid errors from Arabic text.
(This is only an issue when the CSV has unquoted text fields containing the username or tweet text.)

The import will return the count after the word COPY under More Details.
Record the count and distinct count on the storm's GitHub repo readme page.
(Wait to remove duplicates until after adding the guid, as one step, so vacuum only needs to be run once.)

Wait until after the import to add the extra columns (step 4), to avoid the error: missing data for column.
If storing altitude, use geometry PointZ rather than Point (see the sketch below).
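A sketch of the PointZ variant (the altitude column is hypothetical; this workflow uses 2D points):

-- Hypothetical: only applies if an altitude value is captured alongside latitude/longitude
ALTER TABLE matthew_jacksonville ADD COLUMN the_geom geometry(PointZ, 4326);
UPDATE matthew_jacksonville SET the_geom = ST_SetSRID(ST_MakePoint(longitude, latitude, altitude), 4326);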

3. ADD THE_GEOM - DELETES DUPS

The raw data often contains posts that occur at the same time and location. Use the following to create a temp table containing the distinct rows, then copy back to original table name.

-- Make note of total vs unique rows
SELECT count(*) total, (SELECT count(*) FROM (SELECT DISTINCT twitter_uid, created, 
latitude, longitude FROM matthew_jacksonville)x) AS distinctTotal FROM matthew_jacksonville;

-- The following takes 3 minutes for 20 million rows

-- step 1 - Move to a new table to make distinct
CREATE TABLE matthew_jacksonville_temp (LIKE matthew_jacksonville);
 
-- step 2
INSERT INTO matthew_jacksonville_temp(twitter_uid, created, latitude, longitude)
SELECT DISTINCT twitter_uid, created, latitude, longitude FROM matthew_jacksonville; 

-- RUN "SET the_geom" here. (Difference of hours verses over 8 days for 20 million

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
ALTER TABLE matthew_jacksonville_temp ADD COLUMN the_geom geometry(Point,4326);

UPDATE matthew_jacksonville_temp SET the_geom = ST_SetSrid(ST_MakePoint(longitude, latitude), 4326);

-- step 3 - Replace original table
DROP TABLE matthew_jacksonville;
 
CREATE TABLE matthew_jacksonville (LIKE matthew_jacksonville_temp);

INSERT INTO matthew_jacksonville(twitter_uid, created, latitude, longitude, the_geom)
SELECT DISTINCT twitter_uid, created, latitude, longitude, the_geom FROM matthew_jacksonville_temp;

DROP TABLE matthew_jacksonville_temp;

4. ADD COLUMNS

-- 3.5 minutes for 20 million rows

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
ALTER TABLE matthew_jacksonville
ADD COLUMN localdate TIMESTAMPTZ,
ADD COLUMN displacement real,
ADD COLUMN displacement_hav real,
ADD COLUMN prior_created TIMESTAMPTZ,
ADD COLUMN prior_localdate TIMESTAMPTZ,
ADD COLUMN prior_latitude real,
ADD COLUMN prior_longitude real,
-- ADD COLUMN the_geom geometry(Point,4326), -- Added in prior step
ADD COLUMN prior_geom geometry(Point,4326),
ADD COLUMN guid uuid DEFAULT uuid_generate_v4() NOT NULL unique,
ADD COLUMN prior_guid uuid,
ADD COLUMN datasource smallint;

6. MERGE DATA

You'll typically skip this step since merges occur before extra columns are added. Used occasionally to add missing data.

-- Simple additions from a table being merged into the main table that already has all columns

ALTER TABLE matthew_jacksonville_MONTH ADD COLUMN guid uuid DEFAULT uuid_generate_v4() NOT NULL unique;

INSERT INTO matthew_jacksonville (twitter_uid, created, longitude, latitude, guid)
SELECT twitter_uid, created, longitude, latitude, guid
FROM matthew_jacksonville_MONTH WHERE created >= '2017-10-06 00:00';

-- If from an external source, include datasource.

INSERT INTO matthew_jacksonville (twitter_uid, created, longitude, latitude, guid, datasource)
SELECT twitter_uid, created1, longitude, latitude, guid, 1
FROM matthew_jacksonville_MONTH;

INSERT INTO matthew_jacksonville (twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource) 
SELECT twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource 
FROM ph2014 WHERE created 
BETWEEN SYMMETRIC '2014-06-20 04:00' AND '2017-08-05 04:00';

7. SET THE LOCAL DATE

Times in the "created" date field are GMT for both sources (NDL and Northeastern). Yan's day-light-savings times are incorrect by one hour for the summer of 2018: 11 March 2018 2am to 4 November 2018 2am. She changed the time on her Twitter capture server.

-- RUN ONE OF THE FOLLOWING, OR TWO IF THE SPAN CROSSES A NOVEMBER OR MARCH CUTOFF.
-- CHOOSE THE DATABASE (created is in GMT and should never be changed. Change localdate.)
-- For USA storms during daylight saving time (spring through fall), subtract 4 hours

UPDATE japan SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 9;
UPDATE hongkong SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;
UPDATE tw2016 SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;
UPDATE ja2014 SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;
UPDATE ph2014 SET localdate = created AT TIME ZONE 'GMT' + interval '1h' * 8;

-- Nov cutoff date here (run both)
UPDATE se2017 SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 4 WHERE created < '2017-11-5 02:00:00';
UPDATE se2017 SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 5 WHERE created >= '2017-11-5 02:00:00';

-- harveyEdgar 2017-05-19 03:18:20+00 to 2018-10-04 06:04:45+00 (10 minutes to run)
UPDATE harveyEdgar SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 5 WHERE created < '2017-11-5 02:00:00';
UPDATE harveyEdgar SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 6 WHERE created >= '2017-11-5 02:00:00';


UPDATE matthew_jacksonville SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 4 WHERE created < '2016-11-6 02:00:00';
UPDATE matthew_jacksonville SET localdate = created AT TIME ZONE 'GMT' - interval '1h' * 5 WHERE created >= '2016-11-6 02:00:00';


Standard time is 
Sunday 4 November 2018 02:00 local time to Sunday 10 March 2019 02:00 local time

Subtract 4 hours for a US storm when the created date falls within:
2013    10 March 2013 2am  to 3 November 2013 2am
2014    9 March 2014 2am   to 2 November 2014 2am
2015    8 March 2015 2am   to 1 November 2015 2am
2016    13 March 2016 2am  to 6 November 2016 2am
2017    12 March 2017 2am  to 5 November 2017 2am
2018    11 March 2018 2am  to 4 November 2018 2am

Otherwise subtract 5 hours (standard time in the winter)

Past and future ranges: https://greenwichmeantime.com/time-zone/rules/usa/

The actual time zone is NOT saved with any of the Postgres date/time types.

Each date file generated from the pickle files originates at midnight Eastern Time.
The GMT time therefore differs from midnight by either 4 or 5 hours, depending on daylight saving time.
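As an alternative to hard-coding the cutoff dates above, the offset can be delegated to Postgres's time zone database, which applies the daylight-saving rules automatically. A sketch for the East Coast tables, following the same pattern as the updates above (created stays in GMT):

-- 'America/New_York' handles the March/November cutoffs; 'America/Chicago' would cover the -5/-6 Central time tables
UPDATE matthew_jacksonville SET localdate = created AT TIME ZONE 'America/New_York';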

8. GET THE USER'S NEXT TWEET AND INSERT DISPLACEMENT DISTANCE

For smaller datasets:

Discontinue using this first query.
Runtimes:
- 6.5 million rows over 3 days (canceled) when 3 indexes were already added. Try with no indexes next on a table over 1 million rows.
- 24 minutes, now 31 (after adding guid and renaming id to twitter_uid) on se, 201,309 rows (Irma).
- 1 minute for hongkong (Mangkhut).
- More than 12 hours on harveyEdgar, even though all prior queries were fast for harveyEdgar, which contains 1.2 million rows.

-- Table alias "b" is the next tweet; insert values from the individual's prior tweet row "a"
WITH subquery AS (
  SELECT
    b.guid, a.guid prior_guid, a.created, a.localdate,
    ST_DistanceSphere(a.the_geom,b.the_geom)/1000 displacement,
    a.the_geom, a.latitude, a.longitude
  FROM
    matthew_jacksonville a JOIN matthew_jacksonville b -- Select the user's next upcoming row
    ON a.twitter_uid = b.twitter_uid AND b.created > a.created AND
    b.guid = (SELECT c.guid FROM matthew_jacksonville c WHERE c.twitter_uid = a.twitter_uid
    AND c.created > a.created ORDER by c.created LIMIT 1)
)
UPDATE matthew_jacksonville theData
SET displacement = subquery.displacement,
prior_created = subquery.created,
prior_localdate = subquery.localdate,
prior_geom = subquery.the_geom,
prior_latitude = subquery.latitude,
prior_longitude = subquery.longitude,
prior_guid = subquery.prior_guid,
-- http://daynebatten.com/2015/09/latitude-longitude-distance-sql/
-- 6371 = radius of the earth in kilometers
displacement_hav = 2 * 6371 * asin(sqrt((sin(radians((theData.latitude - subquery.latitude) / 2))) ^ 2 + cos(radians(subquery.latitude)) * cos(radians(theData.latitude)) * (sin(radians((theData.longitude - subquery.longitude) / 2))) ^ 2))
FROM subquery
WHERE theData.guid = subquery.guid;

For large datasets exceeding 1 million rows, join an indexed table to a copy of the table where the updates will occur; this avoids PostgreSQL's slow simple UPDATE queries and reduces runtime from days to hours (3 hours for 6.5 million rows, versus over 3 days otherwise).

Learn more about dropping all the indexes, triggers, and foreign keys while the update runs, then recreating them at the end.

If indexes remain in place, there is not much speed improvement from running VACUUM, ANALYZE and REINDEX. (2.5 hrs for 8.5 million rows without running them; still 3 hrs for 6.5 million after running them with retained indexes.)
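For reference, those maintenance commands can be issued directly in SQL rather than through pgAdmin (a sketch; per the note above, they are optional when the indexes are retained):

VACUUM (VERBOSE, ANALYZE) matthew_jacksonville;
REINDEX TABLE matthew_jacksonville;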

-- Add indexes on the three fields to be used by the join (twitter_uid, created, guid)
CREATE INDEX matthew_jacksonville_twitter_uid_index ON matthew_jacksonville (twitter_uid);
CREATE INDEX matthew_jacksonville_created_index ON matthew_jacksonville (created);
CREATE INDEX matthew_jacksonville_guid_index ON matthew_jacksonville (guid);

CREATE TABLE matthew_jacksonville_copy (LIKE matthew_jacksonville);

-- Note: this cannot be changed to use DISTINCT since localdate is already populated; the DISTINCT step occurred earlier.
INSERT INTO matthew_jacksonville_copy(twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource)
SELECT twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource FROM matthew_jacksonville;

WITH subquery AS (
  SELECT
    b.guid, a.guid prior_guid, a.created, a.localdate,
    ST_DistanceSphere(a.the_geom,b.the_geom)/1000 displacement,
    a.the_geom, a.latitude, a.longitude
  FROM
    matthew_jacksonville a JOIN matthew_jacksonville b -- Select the user's next upcoming row 
    ON a.twitter_uid = b.twitter_uid AND b.created > a.created AND 
    b.guid = (SELECT c.guid FROM matthew_jacksonville c WHERE c.twitter_uid = a.twitter_uid 
    AND c.created > a.created ORDER by c.created LIMIT 1)
)
UPDATE matthew_jacksonville_copy theData
SET displacement = subquery.displacement,
prior_created = subquery.created,
prior_localdate = subquery.localdate,
prior_geom = subquery.the_geom,
prior_latitude = subquery.latitude,
prior_longitude = subquery.longitude,
prior_guid = subquery.prior_guid
-- http://daynebatten.com/2015/09/latitude-longitude-distance-sql/
-- 6371=radius earth in kilometers
--displacement_hav = 2 * 6371 * asin(sqrt((sin(radians((theData.latitude - subquery.latitude) / 2))) ^ 2 + cos(radians(subquery.latitude)) * cos(radians(theData.latitude)) * (sin(radians((theData.longitude - subquery.longitude) / 2))) ^ 2))
FROM subquery
WHERE theData.guid = subquery.guid;


------------------

-- Skip the following. Only used for comparing the displacement_hav formula.
-- Populates displacement_hav. Also requires two tables so the indexes in one speed up finding the prior displacement.
CREATE TABLE matthew_jacksonville_copy (LIKE matthew_jacksonville);
INSERT INTO matthew_jacksonville_copy(twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource)
SELECT twitter_uid, created, latitude, longitude, the_geom, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, prior_geom, guid, prior_guid, datasource FROM matthew_jacksonville;
WITH subquery AS (
  SELECT
    b.guid, a.guid prior_guid, a.created, a.localdate,
    a.the_geom, a.latitude, a.longitude
  FROM
    matthew_jacksonville a JOIN matthew_jacksonville b -- Select the user's next upcoming row 
    ON a.twitter_uid = b.twitter_uid AND b.created > a.created AND 
    b.guid = (SELECT c.guid FROM matthew_jacksonville c WHERE c.twitter_uid = a.twitter_uid 
    AND c.created > a.created ORDER by c.created LIMIT 1)
)
UPDATE matthew_jacksonville_copy theData
SET displacement_hav = 2 * 6371 * asin(sqrt((sin(radians((theData.latitude - subquery.latitude) / 2))) ^ 2 + cos(radians(subquery.latitude)) * cos(radians(theData.latitude)) * (sin(radians((theData.longitude - subquery.longitude) / 2))) ^ 2))
FROM subquery
WHERE theData.guid = subquery.guid;



-- Run the following after the first query above.
-- If storage is exceeded, the process above may end without notification,
-- so we pause here. Otherwise, check that the displacement field is populated to confirm the copy sequence completed.
-- Delete the initial table and copy the data back under the original name (compare the count on the copy first).
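A quick verification sketch before dropping the original (some NULL displacements are expected, since a user's final post has no next post):

SELECT
    (SELECT count(*) FROM matthew_jacksonville) AS original_rows,
    count(*) AS copy_rows,
    count(displacement) AS copy_rows_with_displacement
FROM matthew_jacksonville_copy;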

DROP TABLE matthew_jacksonville;
CREATE TABLE matthew_jacksonville (LIKE matthew_jacksonville_copy);

INSERT INTO matthew_jacksonville(twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource)
SELECT DISTINCT twitter_uid, created, longitude, latitude, localdate, displacement, displacement_hav, prior_created, prior_localdate, prior_latitude, prior_longitude, the_geom, prior_geom, guid, prior_guid, datasource FROM matthew_jacksonville_copy;

DROP TABLE matthew_jacksonville_copy;

9. ADD INDEXES

We add indexes last, because table constraints and indexes heavily delay every write.

-- Typically skip this dropping part.
-- Used when dropping and re-creating the table; run first.
DROP INDEX matthew_jacksonville_twitter_uid_index;
DROP INDEX matthew_jacksonville_created_index;
DROP INDEX matthew_jacksonville_guid_index;

DROP INDEX matthew_jacksonville_latitude_index;
DROP INDEX matthew_jacksonville_longitude_index;
DROP INDEX matthew_jacksonville_prior_created_index;
DROP INDEX matthew_jacksonville_localdate_index;
DROP INDEX matthew_jacksonville_prior_guid_index;
DROP INDEX matthew_jacksonville_the_geom;
DROP INDEX matthew_jacksonville_prior_geom;
DROP INDEX matthew_jacksonville_displacement_index;




-- Runs in less than 10 minutes for 8 million records.

-- Re-add the 3 indexes from step 8, since they were deleted when copying back the data
CREATE INDEX matthew_jacksonville_twitter_uid_index ON matthew_jacksonville (twitter_uid);
CREATE INDEX matthew_jacksonville_created_index ON matthew_jacksonville (created);
CREATE INDEX matthew_jacksonville_guid_index ON matthew_jacksonville (guid);

CREATE INDEX matthew_jacksonville_latitude_index ON matthew_jacksonville (latitude);
CREATE INDEX matthew_jacksonville_longitude_index ON matthew_jacksonville (longitude);
CREATE INDEX matthew_jacksonville_prior_created_index ON matthew_jacksonville (prior_created);
CREATE INDEX matthew_jacksonville_localdate_index ON matthew_jacksonville (localdate);
CREATE INDEX matthew_jacksonville_prior_guid_index ON matthew_jacksonville (prior_guid);
CREATE INDEX matthew_jacksonville_the_geom ON matthew_jacksonville USING GIST (geography(the_geom));
CREATE INDEX matthew_jacksonville_prior_geom ON matthew_jacksonville USING GIST (geography(prior_geom));
CREATE INDEX matthew_jacksonville_displacement_index ON matthew_jacksonville (displacement);




-- Run the following to see all indexes on the table:
-- Includes the key for the unique index on guid
-- CASE SENSITIVE: use lowercase, since unquoted identifiers are folded to lowercase when the tables and indexes are created.
SELECT * FROM pg_indexes WHERE tablename = 'matthew_jacksonville';

-- Show which columns are indexed. Excludes the 2 geom columns.
select
    t.relname as table_name,
    i.relname as index_name,
    array_to_string(array_agg(a.attname), ', ') as column_names
from
    pg_class t,
    pg_class i,
    pg_index ix,
    pg_attribute a
where
    t.oid = ix.indrelid
    and i.oid = ix.indexrelid
    and a.attrelid = t.oid
    and a.attnum = ANY(ix.indkey)
    and t.relkind = 'r'
    and t.relname like 'matthew_jacksonville'
group by
    t.relname,
    i.relname
order by
    t.relname,
    i.relname;




10. Remove Bots (none found previously)

Add a table with a twitter_uid column containing known bots (russian-troll-tweets - list of bots). Ran in: matthew_jacksonville, se2017, se2018, tw2016, harvey_houston, hongkong, ja2014 (none found).

CREATE TABLE public.bots
(
    twitter_uid bigint NOT NULL
)
WITH (
    OIDS = FALSE
);

ALTER TABLE public.bots
    OWNER to dataportal;

-- After importing the bot list, check for matches:

SELECT count(*) FROM bots JOIN matthew_jacksonville 
ON bots.twitter_uid = matthew_jacksonville.twitter_uid;
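If the count returns matches, a hedged deletion sketch (not needed so far, since no bots were found):

DELETE FROM matthew_jacksonville
USING bots
WHERE matthew_jacksonville.twitter_uid = bots.twitter_uid;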

11. FOR LARGE DATASET, VACUUM

Using VACUUM FULL locks the table. Vacuuming large tables may exceed storage capacity. The verbose report returns all 0's, so selecting VERBOSE may not be needed.

(22 seconds on 6.5 million rows after copying and indexing, so vacuuming may not be necessary immediately after adding indexes.)

-- Right-click the table name and choose Maintenance. Leave FULL set to NO.

12. PRE-PROCESSING IN POSTGRESQL

The remaining steps for data preparation are managed internally. Contact us with special requests if you are working on research that needs additional pre-processing.


B.) 33 Day Span AND Radii of Gyration for days within 33 Day Span

15 days before and 17 days after storm (plus overlap for 3am end-of-day times).
Dynamic-Function.md (private repo)
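For reference, a simplified sketch of the standard radius-of-gyration definition in SQL (illustrative only: it uses calendar days rather than the project's 3am day boundaries, approximates each day's center as the mean latitude/longitude, and is not the code in Dynamic-Function.md):

-- Daily radius of gyration (km) per user: RMS distance from the day's mean position
WITH centers AS (
    SELECT twitter_uid, date_trunc('day', localdate) AS day,
           avg(longitude) AS c_lon, avg(latitude) AS c_lat
    FROM matthew_jacksonville
    GROUP BY twitter_uid, date_trunc('day', localdate)
)
SELECT p.twitter_uid, c.day,
       sqrt(avg(power(ST_DistanceSphere(p.the_geom,
            ST_SetSRID(ST_MakePoint(c.c_lon, c.c_lat), 4326)) / 1000, 2))) AS radius_of_gyration_km
FROM matthew_jacksonville p
JOIN centers c
  ON c.twitter_uid = p.twitter_uid
 AND c.day = date_trunc('day', p.localdate)
GROUP BY p.twitter_uid, c.day;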

C.) 5-Day Norm and Impact - All Storms

Use isimpactday='YES' to display the day of impact.
Norm-Impact-Chart.md (private repo)
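For example (a sketch, assuming the all_norm_and_impact table referenced in section F carries the isimpactday flag):

SELECT * FROM all_norm_and_impact WHERE isimpactday = 'YES';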

D.) Perturbation - 6 Discrete Displacement Steps - All Storms

Uses Norm-Impact-Chart to output 6 rows for each storm. Includes output of degrees to match figure 1. Also includes percent change.
Perturbation-Chart.md (private repo)

E.) Generate Gyration All

Utilizes the *_gyration tables generated by Dynamic-Function.md: 240,611 user-days of gyration, of which 81,024 have 5 or more steps. Does not include step_type since each gyration covers the whole city. Gyration-All.md (private repo)

F.) Generate all_summary and all_displacement (PowerLaw)

Utilizes the all_norm_and_impact table.