Every tool on this site is built on one public dataset: New Jersey's annual Fall Enrollment Reports. This is everything we learned wrangling 28 years of it — the format eras, the categories that changed and exactly when, the traps, and the shape of the database we normalized it all into.
The data comes from the New Jersey Department of Education Fall Enrollment Reports, published at nj.gov/education/doedata/enr. It's an annual fall headcount of every student in every public school and district in the state, broken out by grade, race/ethnicity, and economic and program status. The state posts one ZIP file per school year, and the series runs from 1998–99 through 2025–26 — 28 years.
Each ZIP holds a single file (CSV or Excel) whose name and shape depend on the year. The grain underneath is consistent in spirit — numbers for each school — but the container, the column names, and even which categories exist have all changed over the years. That's the whole challenge.
The state rebuilt the report format twice. There are three distinct generations, and a parser has to detect which one it's looking at before it can read a row.
| Era | Years | Container | Shape |
|---|---|---|---|
| A | 1998–99 → 2008–09 (11 yrs) | One CSV, STAT_ENR.CSV | One row per school × program/grade. Race is split into sex-coded columns: WHM/WHF (white male/female), BLM/BLF, HIM/HIF, and so on. You sum male+female to get a race count. |
| B | 2009–10 → 2018–19 (10 yrs) | Excel "long" workbook. 2009–10 and 2010–11 are legacy binary .xls (must be converted, e.g. via LibreOffice, before a modern library will open them); 2011–12 onward are .xlsx. Filenames drift: enr.xls, enr.xlsx, EnrollmentReport.xlsx. | A long "SQL Results"–style sheet: one row per school × grade × program, with race still as race × gender columns. Header naming shifts (WhiteM vs WHM), and some years mislabel the grade column "Total" on every row, keeping the real grade in a separate PRGCODE. |
| C | 2019–20 → 2025–26 (7 yrs) | Modern .xlsx with separate State / District / School sheets. | One row per school, with plainly-named columns (White, Black, Hispanic…) and a percentage column next to each count. Read the count, skip the percent. |
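As a rough illustration of the "sum male+female" step for the two legacy eras, here is a minimal Python sketch. It lists only the column pairs named in the table; `WhiteF` is an assumed counterpart to the `WhiteM` variant, and the function name, the dict-per-row input, and the choice to leave a count null when either half is missing are all assumptions rather than a description of the actual build script.

```python
# Partial, illustrative mapping: only the pairs named in the table above.
# "WhiteF" is an ASSUMED counterpart to the "WhiteM" header variant.
RACE_SEX_COLUMNS = {
    "white": [("WHM", "WHF"), ("WhiteM", "WhiteF")],
    "black": [("BLM", "BLF")],
    "hispanic": [("HIM", "HIF")],
}


def race_counts(row):
    """Sum the male and female columns into one count per race.

    `row` is a dict of column name -> numeric value (or None) for one
    legacy-era row. Returns a dict of race -> count (or None).
    """
    counts = {}
    for race, variants in RACE_SEX_COLUMNS.items():
        counts[race] = None
        for male_col, female_col in variants:
            if male_col in row and female_col in row:
                male, female = row[male_col], row[female_col]
                # Assumption: keep the count null if either half is missing.
                if male is not None and female is not None:
                    counts[race] = male + female
                break  # first naming variant present in this row wins
    return counts
```

A real parser would extend the mapping to every race pair and header spelling it actually encounters in the files.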
Two practical notes that cost us time: header rows are not always the first row (scan the first several rows for a known label like a district-code column), and the way to tell Era B from Era C apart — both are .xlsx — is to look at the sheet names: if there are separate "School" and "District" sheets, it's Era C; otherwise treat it as an Era B long workbook.
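Both notes can be sketched in a few lines of Python. This is a hedged illustration, not the build script itself: the function names are hypothetical, the sheet names are matched case-insensitively on the assumption that they read "School" and "District", and the header labels you pass in are whatever your files actually use.

```python
def detect_era(filename, sheet_names=()):
    """Guess the format era from the file extension and workbook sheet names."""
    name = filename.lower()
    if name.endswith(".csv"):
        return "A"
    if name.endswith((".xls", ".xlsx")):
        lowered = {sheet.strip().lower() for sheet in sheet_names}
        # Separate School and District sheets mark the modern format.
        if {"school", "district"} <= lowered:
            return "C"
        return "B"  # any other workbook: treat as the long format
    raise ValueError(f"unrecognized file type: {filename}")


def find_header_row(rows, known_labels, max_scan=10):
    """Return the index of the first row containing a known header label.

    `rows` is a list of rows, each a list of cell values. Only the first
    `max_scan` rows are checked; returns None if no label is found.
    """
    wanted = {label.strip().lower() for label in known_labels}
    for index, row in enumerate(rows[:max_scan]):
        cells = {str(cell).strip().lower() for cell in row if cell is not None}
        if cells & wanted:
            return index
    return None
```

Taking the sheet names as a plain list keeps the check independent of whichever Excel library opens the workbook.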
The single most important thing to understand: the race/ethnicity categories are not constant across the 28 years.
Five categories run the entire series: White, Black, Hispanic, Asian, and Native American. Four of them keep the same meaning throughout; Asian narrows in 2006–07. Two more categories appear partway through. Here's exactly where:
| Category | Present | Note |
|---|---|---|
| White, Black, Hispanic, Native American | 1998–99 → 2025–26 (all 28 yrs) | Stable throughout. |
| Asian | 1998–99 → 2025–26 (all 28 yrs) | But its meaning narrows in 2006–07 — see below. |
| Native Hawaiian / Pacific Islander | 2006–07 → 2025–26 (20 yrs) | Split out as its own category. Before 2006–07 these students were counted inside Asian. |
| Two or More Races | 2006–07 → 2025–26 (20 yrs) | New multiracial category. Before 2006–07 there was no way to be multiracial; students were assigned a single race. |
| Dimension | Coverage |
|---|---|
| Free / reduced-price lunch (economic disadvantage) | Back to 1998–99, but blank ~2019–20 → 2021–22 in the source. |
| Multilingual / English learners (labeled LEP, then ELL, then ML over time — same dimension) | From ~2005–06. |
| Migrant | From ~2005–06. |
| Military-connected, Homeless | Modern only — 2022–23 onward. |
| Gender (male/female by race) | Exists in the legacy source files (Eras A and B, 1998–99 → 2018–19), but we do not load it. |
Because 2006–07 reshuffled the categories, you cannot naively compare raw race shares across that line. Two rules make any two years comparable:
Fold Pacific Islander back into Asian. Since Asian included Pacific Islander before 2006–07, the consistent measure across all 28 years is Asian + Native Hawaiian/Pacific Islander.
Only set aside "Two or More Races" when a comparison reaches back before it existed. Because the category (and a separate Pacific Islander count) didn't exist before 2006–07, including it in a comparison that crosses that line would count a group that one side structurally couldn't have. So the rule is best-available: when both chosen years are 2006–07 or later, use all seven groups; when the range reaches earlier, fall back to the five consistent ones (White, Black, Hispanic, Asian including Pacific Islander, and Native American) with "Two or More Races" set aside. Either way, the shares are renormalized over the groups in use so they sum to 100%.
That is exactly what the demographic-shift explorer does. Its Demographic Shift Index is ½ · Σ |Δ share| over whichever group set the chosen years support: the share of students who would have to be a different race to turn the earlier year's mix into the later one's.
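To make the two rules and the index concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the explorer's actual code: the function names are hypothetical, the dictionary keys borrow the column names from the schema, and null counts are treated as zero when computing shares.

```python
CONSISTENT_FIVE = ("white", "black", "hispanic", "asian", "native_american")
ALL_SEVEN = CONSISTENT_FIVE + ("hawaiian_pi", "two_or_more")


def comparable_shares(counts, use_all_seven):
    """Return race shares that sum to 1.0 over the chosen group set."""

    def value(group):
        # Assumption: a null or absent count contributes zero to the shares.
        return counts.get(group) or 0.0

    if use_all_seven:
        groups = {g: value(g) for g in ALL_SEVEN}
    else:
        groups = {g: value(g) for g in CONSISTENT_FIVE}
        groups["asian"] += value("hawaiian_pi")  # fold Pacific Islander back in
    total = sum(groups.values())
    if total == 0:
        raise ValueError("no students in the chosen groups")
    return {g: n / total for g, n in groups.items()}


def shift_index(earlier, later, earlier_year, later_year):
    """Half the sum of absolute share changes between two years."""
    # Seven groups only when BOTH years start in 2006 or later.
    both_modern = min(int(earlier_year[:4]), int(later_year[:4])) >= 2006
    before = comparable_shares(earlier, both_modern)
    after = comparable_shares(later, both_modern)
    return 0.5 * sum(abs(after[g] - before[g]) for g in before)
```

Because each set of shares sums to 1, the index falls between 0 (identical mix) and 1 (no overlap at all).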
In the CSV and long-workbook eras, each school has many rows: an authoritative "Total" row plus per-grade rows (and sometimes special-education-by-disability rows that are already inside the Total). Take a school's total, race, and lunch counts from the Total row; take per-grade counts from the grade rows; never add the two together. Only fall back to summing grade rows when a Total row is genuinely absent.
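One way to express the Total-row rule is sketched below in Python. The grade labels (`PK`, `K`, `01` to `12`), the `grade`/`count` keys, and the function name are hypothetical; the real files spell these differently from year to year.

```python
# Hypothetical grade labels; real files vary in how grades are spelled.
GRADE_LABELS = {"PK", "K"} | {f"{n:02d}" for n in range(1, 13)}


def summarize_school(rows):
    """Split one school-year's rows into (total, per-grade counts).

    `rows` is a list of dicts with a "grade" label and a numeric "count".
    The Total row is authoritative; grade rows are summed only when no
    Total row exists at all.
    """
    total = None
    has_total_row = False
    by_grade = {}
    for row in rows:
        label = str(row["grade"]).strip().upper()
        if label == "TOTAL":
            total, has_total_row = row["count"], True
        elif label in GRADE_LABELS:
            by_grade[label] = row["count"]
        # Any other row (e.g. disability breakdowns) is already inside Total.
    if not has_total_row:
        known = [count for count in by_grade.values() if count is not None]
        total = sum(known) if known else None
    return total, by_grade
```

For the Era B years that label every row "Total", the real grade would have to be read from PRGCODE before a routine like this could be applied.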
Shared-time students (those who split attendance between schools) are reported as halves. Counts are not always integers — there are roughly 4,400 fractional totals in the series. Preserve them as-is; don't round on import.
The files embed aggregate rows. A school code of 999, or a school name containing "TOTAL," is the district total, not a building. A district code of 9999, or names like "County Total" / "State Total," are county/state rollups. Separate the district aggregates into their own table and drop the county/state rollups, or they'll inflate everything.
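A small classifier along these lines can route each row before loading. This is a sketch under assumptions: the function name and return labels are hypothetical, codes are expected as already-normalized strings, and since the source files don't always say which name field carries "County Total" / "State Total", both are checked.

```python
def classify_row(district_code, school_code, district_name="", school_name=""):
    """Label a row as a 'school', a 'district' total, or a county/state 'rollup'."""
    names = f"{district_name} {school_name}".upper()
    # County and state rollups: drop these entirely.
    if district_code == "9999" or "COUNTY TOTAL" in names or "STATE TOTAL" in names:
        return "rollup"
    # District totals: keep, but in their own table.
    if school_code == "999" or "TOTAL" in school_name.upper():
        return "district"
    return "school"
```

The rollup check runs first so that a "State Total" row carrying school code 999 is dropped rather than stored as a district.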
Missing or suppressed values show up as ., *, N, -, or empty. Treat all of them as null, not zero.
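The null markers and the no-rounding rule for shared-time halves fit in one small cell parser. The sketch below is illustrative: the function name is hypothetical, and stripping thousands separators and matching the markers case-insensitively are assumptions about how the cells might arrive.

```python
NULL_MARKERS = {"", ".", "*", "N", "-"}


def parse_count(cell):
    """Convert a raw cell to a float, or None if it is missing or suppressed."""
    if cell is None:
        return None
    if isinstance(cell, (int, float)):
        return float(cell)  # keeps fractional values such as 12.5
    text = str(cell).strip()
    if text.upper() in NULL_MARKERS:
        return None  # null, never zero
    # Assumption: counts may carry thousands separators like "1,234".
    return float(text.replace(",", ""))
```

Returning a float rather than an int is what lets the fractional totals survive into the REAL columns of the database.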
Normalize the codes — district to 4 digits, school to 3 — and key schools on district_code + school_code, which is globally unique in New Jersey and stable across years. Do not key on county_code; it's unreliable in some legacy files. School names, meanwhile, drift constantly and sometimes change outright. One building in South Orange-Maplewood (district 4900, school 090) reads:
| Name as reported | Years |
|---|---|
| JEFFERSON | 1998-99 → 2009-10 |
| Jefferson E.S. | 2010-11 |
| Jefferson Elementary School | 2011-12 → 2021-22 |
| Delia Bolden Elementary School | 2022-23 → 2025-26 |
Same code throughout, four names. Track the history (we keep a school_name_history table) so a rename doesn't read as a school appearing and disappearing.
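Code normalization and name tracking might look like the following Python sketch. The function names and the tuple-based record layout are hypothetical, and trimming a trailing ".0" assumes some codes arrive from Excel as floats.

```python
def normalize_code(value, width):
    """Zero-pad a district or school code to a fixed width."""
    text = str(value).strip()
    if text.endswith(".0"):  # assumption: Excel may hand back 90.0 for "090"
        text = text[:-2]
    return text.zfill(width)


def school_key(district_code, school_code):
    """Build the (district, school) key: 4-digit district, 3-digit school."""
    return normalize_code(district_code, 4), normalize_code(school_code, 3)


def name_history(records):
    """Collect the first and last year each name was seen for each school.

    `records` is an iterable of (school_year, district_code, school_code,
    school_name). Returns {(district, school, name): (first_year, last_year)}.
    """
    spans = {}
    # School-year strings like "1998-99" sort chronologically as text.
    for year, district, school, name in sorted(records, key=lambda r: r[0]):
        key = school_key(district, school) + (name.strip(),)
        first_year, _ = spans.get(key, (year, year))
        spans[key] = (first_year, year)
    return spans
```

A name that disappears and later returns would be recorded here as one span covering the gap, which may or may not be what you want.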
All 28 years normalize into one SQLite file with three tables. Counts are stored as REAL (to preserve the fractional shared-time values), and missing values are NULL.
| Table | Grain | Rows |
|---|---|---|
| enrollment | one row per school × year | ~69,300 |
| district_enrollment | one row per district × year (authoritative district total; synthesized from member schools where the source omits it, and flagged) | ~18,400 |
| school_name_history | one row per code × name, with first/last year; catches renames | - |
Coverage: 744 districts, ~3,038 schools, all 21 NJ counties (plus charter / state-operated groupings), 1998–99 → 2025–26.
The enrollment schema:
CREATE TABLE enrollment (
school_year TEXT, county_code TEXT, county_name TEXT,
district_code TEXT, district_name TEXT, school_code TEXT, school_name TEXT,
total REAL,
-- race / ethnicity (hawaiian_pi & two_or_more only populated from 2006-07)
white REAL, black REAL, hispanic REAL, asian REAL,
native_american REAL, hawaiian_pi REAL, two_or_more REAL,
-- grades
pk REAL, k REAL, g1 REAL, g2 REAL, g3 REAL, g4 REAL, g5 REAL, g6 REAL,
g7 REAL, g8 REAL, g9 REAL, g10 REAL, g11 REAL, g12 REAL,
-- economic / program
free_lunch REAL, reduced_lunch REAL, ml_learners REAL,
migrant REAL, military REAL, homeless REAL,
PRIMARY KEY (school_year, county_code, district_code, school_code)
);
-- district_enrollment mirrors these columns (no school_code/name; adds
-- synthesized INTEGER), keyed (school_year, county_code, district_code).
An example — every school's economic-disadvantage rate for a given year:
SELECT school_year, district_name, school_name,
ROUND(100.0 * (free_lunch + reduced_lunch) / total, 1) AS pct_frl
FROM enrollment
WHERE school_year = '2024-25' AND total >= 100
  AND free_lunch IS NOT NULL AND reduced_lunch IS NOT NULL
ORDER BY pct_frl DESC;
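The same query can be run from Python's built-in `sqlite3` module. The sketch below uses an in-memory table with only the columns the query touches and four made-up rows (the "Example School" names and numbers are illustrative, not real data); the helper name is hypothetical, and the extra null check on `reduced_lunch` is an added guard.

```python
import sqlite3


def frl_rates(conn, school_year, min_total=100):
    """Return (school_name, pct_frl) pairs for one year, highest rate first."""
    query = """
        SELECT school_name,
               ROUND(100.0 * (free_lunch + reduced_lunch) / total, 1) AS pct_frl
        FROM enrollment
        WHERE school_year = ? AND total >= ?
          AND free_lunch IS NOT NULL AND reduced_lunch IS NOT NULL
        ORDER BY pct_frl DESC
    """
    return conn.execute(query, (school_year, min_total)).fetchall()


# Reduced, illustrative table: only the columns this query needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment (school_year TEXT, school_name TEXT, "
    "total REAL, free_lunch REAL, reduced_lunch REAL)"
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)",
    [
        ("2024-25", "Example School A", 400.0, 80.0, 20.0),
        ("2024-25", "Example School B", 250.5, 100.0, 0.5),  # fractional counts
        ("2024-25", "Example School C", 300.0, None, None),  # lunch data blank
        ("2024-25", "Example School D", 60.0, 30.0, 0.0),    # under the size floor
    ],
)
rates = frl_rates(conn, "2024-25")
```

Schools with blank lunch data drop out of the result instead of showing up as 0%, which is the reason missing values are stored as NULL.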
A single build script reproduces the whole database from the public source. For each year it: downloads the ZIP from the state archive; detects the era (CSV → Era A; an Excel workbook with School + District sheets → Era C; any other workbook → the Era B long format, converting legacy .xls first); reads each row by matching column names against the known naming variants; applies the Total-row / grade-row / rollup rules above; normalizes codes; and writes the three tables. Everything in sections 2–5 is the spec it implements — given the raw ZIPs, that's enough to rebuild the same schema from scratch.
The pipeline's output reproduces figures the South Orange-Maplewood district published itself (from its Nov 2025 integration forum), which validates the parse end to end across eras:
| Figure | Our extract | District-published |
|---|---|---|
| SOMSD White %, 1998–99 | 43.8% | 44% |
| SOMSD % free/reduced lunch, 1998–99 | 17.0% | 16.9% |
| South Mountain White %, 1998–99 | 56.1% | 56% |
| SOMSD White %, 2019–20 | 55.4% | 55.4% |
| SOMSD % free/reduced lunch, 2024–25 | 14.1% | 14.1% |
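A check like this is easy to automate. The Python sketch below hard-codes the five pairs from the table and reports the largest gap; the 0.5-point tolerance is an assumption, chosen because some published figures are rounded to whole percentages.

```python
# (figure, our extract, district-published), copied from the table above.
CHECKS = [
    ("SOMSD White %, 1998-99", 43.8, 44.0),
    ("SOMSD % free/reduced lunch, 1998-99", 17.0, 16.9),
    ("South Mountain White %, 1998-99", 56.1, 56.0),
    ("SOMSD White %, 2019-20", 55.4, 55.4),
    ("SOMSD % free/reduced lunch, 2024-25", 14.1, 14.1),
]

TOLERANCE = 0.5  # percentage points; an assumed allowance for rounding


def largest_gap(checks):
    """Return the biggest absolute difference between the paired figures."""
    return max(abs(ours - published) for _, ours, published in checks)


def failures(checks, tolerance=TOLERANCE):
    """List the figure names whose gap exceeds the tolerance."""
    return [name for name, ours, published in checks
            if abs(ours - published) > tolerance]
```

With the figures above, `failures(CHECKS)` returns an empty list.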
Source data: New Jersey Department of Education, Fall Enrollment Reports, 1998–99 through 2025–26 (nj.gov/education/doedata/enr).