Nightly Cleanup of Duplicate Data & Fixing Invalid Items
  • 10 Apr 2024
  • 6 minute read
  • Dark
    Light
  • PDF

Nightly Cleanup of Duplicate Data & Fixing Invalid Items

  • Dark
    Light
  • PDF

Article summary

Every night in production databases, cleanup scripts run to remove duplicate data, and fix invalid items.

The overnight cleanup process is placed into a queue of the overall nightly process, which begins at 2:30 AM Eastern time. Depending on its placement in the queue and other cluster database activity, the cleanup process can run anytime after 2:30 AM Eastern time. Typically everything is concluded by 10:30 AM Eastern time. However, there may be occasions when the process runs later in the afternoon.

Person Records

Phone numbers

There are three phone number columns on the person table:

  • phone

  • mobile

  • business

These columns are reformatted so that they are uniform, using proper phone formatting.

The Phone numbers columns are populated with the rank 1 value from devices that have a Device Type with a category of phone, mobile, or business. The category corresponds with the person table column. Other items stored on the devices table, such as email addresses, are not included in this process.

Citizenship

Secondary Citizenship is cleared if it has the same value as Primary Citizenship.

Birthdate

Birthdates before January 1, 1800, are cleared.

Slate Reference ID

Slate Reference IDs are nine-digit values assigned automatically to every person record. A check is run for duplicate Slate Reference IDs. A new value is assigned to the most recently created person record if there is a duplicate Slate Reference ID. The Slate Reference ID clean also runs hourly throughout the day in production databases.

📝 Note

This process does not check override IDs. Users set override IDs, and nothing prevents a user from duplicating a Slate ID. Likewise, a user can assign a nine-digit override ID that already exists (or can exist in the future) as an automatically generated Slate ID. These are not deduped.

Application Records

Slate Reference ID

Slate Reference IDs are nine-digit values assigned automatically to every Application record. A check is run for duplicate Slate Reference IDs. A new value is assigned to the most recently created Application record if there is a duplicate Slate Reference ID. The Slate Reference ID clean also runs hourly throughout the day in production databases.

📝 Note

This process does not check override IDs. Users set override IDs, and nothing prevents a user from duplicating a Slate ID. Likewise, a user can assign a nine-digit override ID that already exists (or can exist in the future) as an automatically generated Slate ID. These are not deduped.

Addresses

Addresses can be scrubbed in production and test databases on-demand in Database > Scrub Address Records.

Region

If an address is in the United States, and the region is the full name of the state rather than the abbreviated state code, it is updated to the abbreviated state code.

This can only occur if the name of the region matches the region name in Slate. For example, Connecticut will be updated to CT, but Conn will remain Conn.

Postal code

If an address is in the United States, and the postal code is only three or four characters long, leading 0s are added to make it five characters long.

Duplicate addresses

Duplicate addresses (matching record, type, effective, expires, first eight characters of the street) are deleted.

Priority is given to addresses based on their priority (high > normal > low > inactive), followed by the quality (from an applicant > not validated from an import), and then the most recently updated address.

Time-sensitive addresses

The ranks of addresses that have expired, or are not yet effective, are updated to NULL so that they are not included in exports or filters.

So that they can be used in exports and filters, addresses without ranks will be given one if they satisfy one of the following conditions:

  • They do not have an expires date

  • They haven’t expired yet and do not have an effective date

  • They have an effective date in the past

Geocodes

Geocodes (used for geographic proximity) are updated for addresses nightly for any addresses that do not yet have a geocode.

Address validation

Addresses that have not yet been validated get scrubbed (via USPS).

Devices

Duplicate Devices

Duplicate devices (matching record, type, value) are deleted. Priority is given to devices based on the rank, which is based on the priority (high > normal > low > inactive), and in the case of a tie, updated, then created dates.

Phone number device types

Phone number device types (mobile, business, phone) are deleted if the only numeric characters end with "0000000."

Fields

CommonApp

CommonApp fields scoped to the person are updated to be application-scoped if the person record has a submitted application. The application used is the one that is ranked 1.

Duplicate fields

Duplicate field values (matching field, record, order, prompt, index, related) are deleted.

Multiple values in single value field

If a record has multiple field values for a field configured to store a single value, all but a single field is deleted. The priority is based on the field ID (which is random).

Origins

Person records updated in the past 48 hours will have their first and last origins recalculated using the Origin Groups and Origin Sources configured in the Admin Tool. To force process these (if they make changes to their Origin Groups or Origin Sources), they can run a Retroactive Refresh in the Rules Editor. For more information about how Origins work, see the Query Filters/Exports Specialist Knowledge article "Origin Sources: Exports/Filters, and other important information."

Passwords

  • Passwords for person records are automatically expired if a person has not attempted to log in for 12 months.

  • PINs for person records are automatically expired if the PIN was set more than 12 months ago.

Schools

  • Duplicate schools (matching record, level of study, CEEB, name, from, to, degree, conferred, field, and major) are deleted. 

  • If one of the schools has a transcript attached to it, then that is the school that is kept. Otherwise, the school that was most recently updated is kept.

  • Duplicate schools (matching record, CEEB, and name) are deleted IF level of study is blank, from date is blank, to date is blank, there is no transcript associated with it, GPA is blank, class size is blank, and class rank is blank. 

  • Preference for the school record to keep is based on the most recently updated school.

Tests

  • Duplicate test scores (matching record, type, subtype, date, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.

  • Duplicate SELF-REPORTED test scores (matching record, type, subtype, date) are deleted IF a VERIFIED test score for the same test type and subtype exists for the same date.

  • Duplicate GMAT test scores (matching record, type, subtype, test year, test month, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.

  • To handle superscores (such as those that are imported from CommonApp, where the highest SAT Math score is reported with the highest SAT Math Date, and highest SAT Critical Reading score is reported with the highest SAT Critical Reading date, where those dates may or may not be the same), self-reported test scores that have different component scores on the same date are merged. Test types that are eligible for the merge are: ACT; SATI; SATII; AP; IB; TOEFL; GRE; IB; GMAT; IELTS. The following items must match for it to be considered for a merge: record, type, subtype, date, confirmed (verified status), and location. Test scores with no date are not eligible. The three other duplicate merge items above take place first, so if all of the component scores above existed and matched, they would not exist when this merge takes place.

  • To handle test types that expire (Database > Tests > "Expires After x Days" setting), the script will mark the tests as canceled if the current date is greater than the test date + the number of days specified.

  • "Score Report" checklist items are deleted if the record no longer has a self-reported test score of that type or if the Checklist setting for the test type in the Tests section of the Admin Tool is no longer set to Yes.


Was this article helpful?