Data Pipeline

About Zip Data

Our repository provides zipcode-related economic data, automatically updated via GitHub Actions workflows. The data is sourced from the U.S. Census API, processed, and stored in a DuckDB database, and exported to CSV files for easy access. Developers can also explore the code to customize the data processing pipeline and also use our NAICS .csv files prepared for the entire US, individual states and counties.

At the zipcode level, employees and payroll are often omitted by the census to protect privacy.
Our NAICS Imputation using ML can be updated to estimate blank values.

Table of Contents

Overview

This repository is designed to provide users with up-to-date economic data by ZIP code, categorized by industry levels. Currently, the data is configured to include industry levels 2, 5, and 6. Data is automatically fetched from the U.S. Census API, processed, and stored in a structured format, ensuring easy access and utilization.

Data Storage & Structure

Database Structure

The database is designed to store economic data related to various ZIP codes and industries. It consists of four tables: DimYear, DimNaics, DimZipCode, and DataEntry. Key constraints are not enforced in the current database to optimize data ingestion; however, data can be migrated to a database with constraints if needed in the future.

Table: DimYear

Table: DimNaics

Table: DimZipCode

Table: DataEntry

Relationships:

CSV Files

Data is exported to CSV files, stored in a structured directory. Files are named based on the state, industry level, and year:

industries/naics/US/zip/AK/US-AK-census-naics6-zip-2012.csv

Accessing the Data

Users can directly access the CSV files or the DuckDB database files to retrieve the data they need. It is recommended to access the DuckDB files directly for better efficiency, as this approach eliminates the overhead of creating pandas DataFrames from the CSV files. The files are updated regularly, ensuring that the latest data is always available in this repository.

For Developers

Developers interested in modifying the data pipeline should fork the repository and work on their own branch. The main scripts involved in data processing are located in the industries/naics/duck_zipcode_db/ directory:

Deprecated Files: Any deprecated files have been moved to a deprecated folder in the repo. If you need to use these files, ensure to create a new branch and move them back to the original location for testing or other purpose.

For detailed information on the configuration of the data management workflow (GitHub Actions), please refer to the Data Management Workflow README.

Documentation

For detailed logic and implementation, refer to the inline documentation within the scripts located in the repository. This includes explanations for the populator and exporter scripts.


If you're looking to understand how data is automatically updated, check the .github/workflows.md for details on the workflow.

Process Industry NAICS by Zip Code

For use in local US EPA impact comparisons

Our Community Datasets for industries use naics-annual.ipynb to generate country, states and county files.

Each zip code .csv will has 5 columns:

Folders are nested 3/0/3/1/8 to avoid GitHub's files-per-folder limit.

The results will be used in our Industry Comparison location filters, similar to countries, states and counties.

The old zip code files reside here:
community-data/us/zipcodes/naics - 30318

The new zip code files will reside at:
community-data/industries/naics/US/zip

For each year, there will be 5 zip code files for each state:

US/zip/NY/US-NY-census-naics2-zip-2023.csv
US/zip/NY/US-NY-census-naics3-zip-2023.csv
US/zip/NY/US-NY-census-naics4-zip-2023.csv
US/zip/NY/US-NY-census-naics5-zip-2023.csv
US/zip/NY/US-NY-census-naics6-zip-2023.csv

Some payroll fields provide from the census will be null to protect privacy.
Here's our work on Estimating using Machine Learning

Zip code files are pulled from the Census API and saved as DuckDB here in the duck_zipcode_db subfolder (by David C in June 2024)

DuckDB is faster at processing than Pandas. The DuckDB database was too big to deploy to GitHub.

TO DO: Create DuckDB files by year so they are small enough to share with GitHub.

TO DO: Process new data once a year using a GitHub Action.

David included additional fields like city in the DuckDB, which could be useful later for filling in gaps with ML.

In July, Badri ran this in VS Code terminal. Launch in "naics" folder using code .

python3 -m venv env
source env/bin/activate
python -m pip install ipykernel -U --force-reinstall
pip install duckdb
pip install pandas
pip install tqdm

Hit "Change Kernel"

Go to: naics/duck_zipcode_db/zip_data/duck_db_manager/duckdb_database.ipynb

Run the 1st and 4th cells in duckdb_database.ipynb

TO DO: Make the notebook intuitive by eliminating the need to skip a step.
Move retained files to naics/duckdb/prep. Use dashes instead of underscores in files names.

TO DO: Output nested folders within a new "zip" folder in a fork of community-data/industries/naics/US/zip

TO DO: This process needs to be run annually via a Github Action. We could avoid sending to DuckDB when just updating one year at a time.

NAICS Zip Code data

Starting with 2019, ZIP Codes Business Patterns (ZBP) are available in the
County Business Patterns (CBP) API that use for state and county naics processing.

Old page to revise: Processing zips codes prior to 2019

Note from David

The sqlite and zip utility folders are deprecated. We can delete them and the associated files.

TO DO: Delete the folders as David has advised. Thanks!

Key Changes (this is a note from David):

Replaced SQLite with DuckDB for its high-performance read capabilities.
Updated database connection logic to use DuckDB for all data operations.
Implemented a function that queries the database by year, industry level, and the first digit of the zipcode, which allows for more targeted data retrieval.
Implemented a function that exports the database to csv files for better portability.
Implemented a function that can rebuild the database from the csv files.
Optimized export_to_csv and import_csv_files functions to handle data segmentation by year and industry level. This makes sure that each CSV file does not greatly exceed 25MB. (Loren adds: Let's avoid deploying the year-industry files. We could instead rebuild DuckDB from the zipcode files in 3/0/3/1/8 folders.)

Also from David:

DuckDB zip database

Located in duck_zipcode_db > zip_data > duck_db_manager > database

Note: Only run the the 'populate_database' notebook if for whatever reason, you don't have the required CSV files. Otherwise, instantiating a 'DuckDBManager' object will automatically build a duckdb database from the CSV files.

If you're looking to update the database, instantiate a 'ZipPopulator' object and then run 'getzipforyear(year)' in the populatedatabase notebook.

Querying the database through Python (In progress)

This is a work in progress. The idea here is to make Python functions to easily query the database that would be in the 'zip_data' folder. This way, you can query data in a notebook environment, and get them in a csv file format.