Data Pipeline

About Zip Data

Our repository provides ZIP code-level economic data that is updated automatically via GitHub Actions workflows. The data is sourced from the U.S. Census API, processed, stored in a DuckDB database, and exported to CSV files for easy access. Developers can explore the code to customize the data processing pipeline, or use our NAICS .csv files prepared for the entire US, individual states, and counties.

At the ZIP code level, the Census often omits employee counts and payroll figures to protect privacy.
Our NAICS Imputation using ML can be updated to estimate these blank values.

Overview

This repository provides up-to-date economic data by ZIP code, categorized by industry level. The data is currently configured to include industry levels 2, 5, and 6. It is fetched automatically from the U.S. Census API, processed, and stored in a structured format.

Data Storage & Structure

Database Structure

The database is designed to store economic data related to various ZIP codes and industries. It consists of four tables: DimYear, DimNaics, DimZipCode, and DataEntry. Key constraints are not enforced in the current database to optimize data ingestion; however, data can be migrated to a database with constraints if needed in the future.

Table: DimYear

Table: DimNaics

Table: DimZipCode

Table: DataEntry

Relationships:

CSV Files

Data is exported to CSV files, stored in a structured directory. Files are named based on the state, industry level, and year:

industries/naics/US/zip/AK/US-AK-census-naics6-zip-2012.csv
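
The naming convention above can be sketched in a few lines of Python. This is a minimal illustration, not code from the repository: the pattern is inferred from the single example path shown, and the helper names `csv_path` and `parse_csv_name` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: every export follows the layout of the one example path above.
BASE_DIR = PurePosixPath("industries/naics/US/zip")
_NAME_RE = re.compile(r"^US-([A-Z]{2})-census-naics(\d)-zip-(\d{4})\.csv$")


def csv_path(state: str, level: int, year: int) -> str:
    """Build the repository-relative path for one state/level/year export."""
    state = state.upper()
    name = f"US-{state}-census-naics{level}-zip-{year}.csv"
    return str(BASE_DIR / state / name)


def parse_csv_name(path: str) -> dict:
    """Recover state, industry level, and year from an export path."""
    match = _NAME_RE.match(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    state, level, year = match.groups()
    return {"state": state, "level": int(level), "year": int(year)}
```

Calling `csv_path("AK", 6, 2012)` reproduces the example path shown above, and `parse_csv_name` reverses it.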

Accessing the Data

Users can directly access the CSV files or the DuckDB database files to retrieve the data they need. It is recommended to access the DuckDB files directly for better efficiency, as this approach eliminates the overhead of creating pandas DataFrames from the CSV files. The files are updated regularly, ensuring that the latest data is always available in this repository.
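
As a rough sketch of direct DuckDB access, the snippet below opens a database file in read-only mode and counts the rows in each of the four tables named under Database Structure. The helper names `open_read_only` and `count_rows` are hypothetical, the database file path is left to the caller because its location is not stated here, and the `duckdb` Python package is assumed to be installed.

```python
# Table names are taken from the "Database Structure" section.
TABLES = ("DimYear", "DimNaics", "DimZipCode", "DataEntry")


def open_read_only(db_path):
    """Open a DuckDB database file in read-only mode."""
    import duckdb  # third-party (pip install duckdb); imported lazily

    return duckdb.connect(db_path, read_only=True)


def count_rows(con, tables=TABLES):
    """Return {table_name: row_count} for each table.

    `con` is any connection whose execute() returns an object with
    fetchone(), which includes duckdb connections.
    """
    counts = {}
    for table in tables:
        # Table names are fixed constants, so formatting them in is safe here.
        row = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        counts[table] = row[0]
    return counts
```

Usage might look like `con = open_read_only("path/to/database.duckdb")` followed by `print(count_rows(con))`, where the path is a placeholder for the actual database file. Read-only mode is a reasonable default for consumers, since it avoids accidental writes to a file that the workflow regenerates.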

For Developers

Developers interested in modifying the data pipeline should fork the repository and work on their own branch. The main scripts involved in data processing are located in the industries/naics/duck_zipcode_db/ directory:

Deprecated Files: Deprecated files have been moved to a deprecated folder in the repo. If you need to use them, create a new branch and move them back to their original location for testing or other purposes.

For detailed information on the configuration of the data management workflow (GitHub Actions), please refer to the Data Management Workflow README.

Documentation

For detailed logic and implementation, refer to the inline documentation within the scripts located in the repository. This includes explanations for the populator and exporter scripts.


To understand how the data is updated automatically, see .github/workflows.md for details on the workflow.