- 12 Dec 2024
- 6 minute read
- Print
- DarkLight
- PDF
Nightly Cleanup of Duplicate Data & Fixing Invalid Items
- Updated 12 Dec 2024
- 6 minute read
- Print
- DarkLight
- PDF
Every night in production databases, cleanup scripts run to remove duplicate data, and fix invalid items.
The overnight cleanup process is placed into a queue of the overall nightly process, which begins at 2:30 AM Eastern time. Depending on its placement in the queue and other cluster database activity, the cleanup process can run anytime after 2:30 AM Eastern time. Typically everything is concluded by 10:30 AM Eastern time. However, there may be occasions when the process runs later in the afternoon.
๐ NoteSome of these scripts may be executed on demand. This should be done sparingly in production so as not to disrupt your normal workflows. Learn more about the Data Cleanup Tool in our Database Cleanup Tools article.
Person Records
Phone numbers
There are three phone number columns on the person table:
phone
mobile
business
These columns are reformatted so that they are uniform, using proper phone formatting.
The Phone numbers columns are populated with the rank 1 value from devices that have a Device Type with a category of phone, mobile, or business. The category corresponds with the person table column. Other items stored on the devices table, such as email addresses, are not included in this process.
Citizenship
Secondary Citizenship is cleared if it has the same value as Primary Citizenship.
Birthdate
Birthdates before January 1, 1800, are cleared.
Slate Reference ID
Slate Reference IDs are nine-digit values assigned automatically to every person record. A check is run for duplicate Slate Reference IDs. A new value is assigned to the most recently created person record if there is a duplicate Slate Reference ID. The Slate Reference ID clean also runs hourly throughout the day in production databases.
๐ NoteThis process does not check override IDs. Users set override IDs, and nothing prevents a user from duplicating a Slate ID. Likewise, a user can assign a nine-digit override ID that already exists (or can exist in the future) as an automatically generated Slate ID. These are not deduped.
Application Records
Slate Reference ID
Slate Reference IDs are nine-digit values assigned automatically to every Application record. A check is run for duplicate Slate Reference IDs. A new value is assigned to the most recently created Application record if there is a duplicate Slate Reference ID. The Slate Reference ID clean also runs hourly throughout the day in production databases.
๐ NoteThis process does not check override IDs. Users set override IDs, and nothing prevents a user from duplicating a Slate ID. Likewise, a user can assign a nine-digit override ID that already exists (or can exist in the future) as an automatically generated Slate ID. These are not deduped.
Addresses
Addresses can be scrubbed in production and test databases on-demand in Database > Scrub Address Records.
Region
If an address is in the United States, and the region is the full name of the state rather than the abbreviated state code, it is updated to the abbreviated state code.
This can only occur if the name of the region matches the region name in Slate. For example, Connecticut will be updated to CT, but Conn will remain Conn.
Postal code
If an address is in the United States, and the postal code is only three or four characters long, leading 0s are added to make it five characters long.
Duplicate addresses
Duplicate addresses (matching record, type, effective, expires, first eight characters of the street) are deleted.
Priority is given to addresses based on their priority (high > normal > low > inactive), followed by the quality (from an applicant > not validated from an import), and then the most recently updated address.
Time-sensitive addresses
The ranks of addresses that have expired, or are not yet effective, are updated to NULL
so that they are not included in exports or filters.
So that they can be used in exports and filters, addresses without ranks will be given one if they satisfy one of the following conditions:
They do not have an expires date
They havenโt expired yet and do not have an effective date
They have an effective date in the past
Geocodes
Geocodes (used for geographic proximity) are updated for addresses nightly for any addresses that do not yet have a geocode.
Address validation
Addresses that have not yet been validated get scrubbed (via USPS).
Devices
Duplicate Devices
Duplicate devices (matching record, type, value) are deleted. Priority is given to devices based on the rank, which is based on the priority (high > normal > low > inactive), and in the case of a tie, updated, then created dates.
Phone number device types
Phone number device types (mobile, business, phone) are deleted if the only numeric characters end with "0000000."
Fields
CommonApp
CommonApp fields scoped to the person are updated to be application-scoped if the person record has a submitted application. The application used is the one that is ranked 1.
Duplicate fields
Duplicate field values (matching field, record, order, prompt, index, related) are deleted.
Multiple values in single value field
If a record has multiple field values for a field configured to store a single value, all but a single field is deleted. The priority is based on the field ID (which is random).
Origins
Person records updated in the past 48 hours will have their first and last origins recalculated using the Origin Groups and Origin Sources configured in the Admin Tool. To force process these (if they make changes to their Origin Groups or Origin Sources), they can run a Retroactive Refresh in the Rules Editor. For more information about how Origins work, see the Query Filters/Exports Specialist Knowledge article "Origin Sources: Exports/Filters, and other important information."
Passwords
Passwords for person records are automatically expired if a person has not attempted to log in for 12 months.
PINs for person records are automatically expired if the PIN was set more than 12 months ago.
Schools
Duplicate schools (matching record, level of study, CEEB, name, from, to, degree, conferred, field, and major) are deleted.
If one of the schools has a transcript attached to it, then that is the school that is kept. Otherwise, the school that was most recently updated is kept.
Duplicate schools (matching record, CEEB, and name) are deleted IF level of study is blank, from date is blank, to date is blank, there is no transcript associated with it, GPA is blank, class size is blank, and class rank is blank.
Preference for the school record to keep is based on the most recently updated school.
Tests
Duplicate test scores (matching record, type, subtype, date, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.
Duplicate SELF-REPORTED test scores (matching record, type, subtype, date) are deleted IF a VERIFIED test score for the same test type and subtype exists for the same date.
Duplicate GMAT test scores (matching record, type, subtype, test year, test month, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.
To handle superscores (such as those that are imported from CommonApp, where the highest SAT Math score is reported with the highest SAT Math Date, and highest SAT Critical Reading score is reported with the highest SAT Critical Reading date, where those dates may or may not be the same), self-reported test scores that have different component scores on the same date are merged. Test types that are eligible for the merge are: ACT; SATI; SATII; AP; IB; TOEFL; GRE; IB; GMAT; IELTS. The following items must match for it to be considered for a merge: record, type, subtype, date, confirmed (verified status), and location. Test scores with no date are not eligible. The three other duplicate merge items above take place first, so if all of the component scores above existed and matched, they would not exist when this merge takes place.
To handle test types that expire (Database > Tests > "Expires After x Days" setting), the script will mark the tests as canceled if the current date is greater than the test date + the number of days specified.
"Score Report" checklist items are deleted if the record no longer has a self-reported test score of that type or if the Checklist setting for the test type in the Tests section of the Admin Tool is no longer set to Yes.