Data Cleansing

Author - Webner

Data Hygiene or Data Cleansing or Data Cleaning – call it by the name you prefer – all point to the process of making the data better than it is before the process is applied. “Better data” means removing what is not required, fixing what is not right, restoring data correctness, and putting in place the rules to not to get incorrect data in future.

As an example, suppose there is a database in which there are tables to store Account and Contacts information. There is Account name, office email, address, phone number, fax number etc. In Contacts table information of company’s contacts, with whom communication goes on, is stored via fields like contact first name, last name, email, phone number and account id (company to which contact belongs). At the time of creation it was not clear if company name, contact first name, company email or contact email should be set to unique or not. If these fields should be set to not null or not. If spam data prevention measures should be put in place or not to not to store irrelevant email ids or other contact information. So any value can be set in table fields – duplicate, null or invalid value. That is just an assumption for this case. In other cases half of the measure might already be in place but not all.

What will be the result of this? We already know it’s hard to stop spammers from filling forms available over the internet. A lot of spam accounts and contacts may get created in the database (including all sort of funny and not so funny, even abusive adjectives in use). There may also be a lot of duplicate accounts and contacts stored in the system as there is no check on uniqueness. It is possible a contact changed the job but it is now connected to both old account and new account. Null values may get through polluting the data further. Data may seem impossible to use or fix.

All these problems severely affect the system. Reports generated on top of this data will give incorrect information. Emails being sent by an application built over this database will see huge bounce backs and more scary is wrong information will be sent to the contacts since contact relationship might not be accurately stored.

What is the solution now when the data gets corrupted? The solution is data cleansing. There are a lot of tools available in the market that support defining the rules and then accordingly they try to clean the data. Apart from these tools (because a tool cannot be the silver bullet to solve each type of problem and it can also be pretty expensive), custom scripts can prove to be less expensive and more effective to the specific situation. Regardless of the approach followed, duplicate records can be merged into single record, obsolete ones can be discarded, null value records can be removed, spam records can be identified based on domain name with which email id ends or spam words that an email id or name may contain. These are examples of the rules that need to be defined.

Process to clean entire data can be slow. Depending on size of the data, level of errors and correction measures, it may take hours or even days for it to complete. Needless to say, it should be a background process in most of the cases (except in case of less data) written by experts who have dealt with large data before and know what to avoid to write optimized code, how much should be the hardware resources on the server where scripts will run and what is the resumption plan in case data hygiene script gets obstructed. These are the key questions that one needs to plan for and prepare for to accomplish more in less time with certainty.

Further, database needs to be tightened up by defining constraints on columns (unique, check, foreign key etc) and writing triggers if required. A new column in contacts table can indicate if contact is still with the account or has moved on to another one. Outside of the database, the API layer letting the data stored into the system needs to be very secure and strict about data verification and only valid data should be permitted by it to push into the database. User Interface layer should have additional checks (like captcha, security questions to make sure user is a valid user etc). Beyond it firewall security must prevent unauthorized access to the database and applications.

With these measures in place not only existing data can be cleaned but its future cleanliness can also be assured. Contact us today for your Data Hygiene projects irrespective of whether you data is in database, excel files, CRM or any other software system.

Webner Solutions is a Software Development company focused on developing CRM apps (Salesforce, Zoho), LMS Apps (Moodle/Totara), Websites and Mobile apps. If you need Web development or any other software development assistance please contact us at webdevelopment@webners.com

Webner Solutions is a Software Development company focused on developing Insurance Agency Management Systems, Learning Management Systems and Salesforce apps. Contact us at dev@webners.com for your Insurance, eLearning and Salesforce applications.

Leave a Reply

Your email address will not be published. Required fields are marked *