Data Quality
- Poor data quality negatively affects many data processing efforts
- Example: a classification model for detecting people who are loan risks is built using poor data
º Some credit-worthy candidates are denied loans
º More loans are given to individuals who default
① Noise
- Objects: noise is an extraneous object
- Attributes: noise refers to distortion of original values
② Outliers
- Objects with characteristics that are considerably different from those of most other data objects
Causes
- Data entry errors: human errors made while entering or recording the data
- Measurement errors: e.g., weights measured on a faulty machine
- Intentional errors: outliers may be intentionally introduced to test the robustness of statistical methods or machine learning models.
- Data processing errors: errors introduced during scaling, normalization, and similar steps
- Sampling errors: the sample was drawn incorrectly, e.g., it mistakenly included data from another source
- Natural outliers: Genuine rare events or extreme observations
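One common way to flag candidate outliers is the interquartile range (IQR) rule. The sketch below is only an illustration, not a method prescribed by these notes: the weight values are made up, and the 1.5 multiplier is a conventional choice rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical weights in kg; 710.0 stands in for a data entry error (71.0 mistyped).
weights = pd.Series([68.2, 71.0, 69.5, 70.3, 710.0, 67.8, 72.1])

# IQR rule: flag values that lie far outside the middle 50% of the data.
q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # 1.5 is a conventional multiplier, assumed here
upper = q3 + 1.5 * iqr

is_outlier = (weights < lower) | (weights > upper)
outliers = weights[is_outlier]
print(outliers)
```

A flag like this says only that a value is unusual, not why. Deciding whether it is an error to correct or a natural outlier to keep still requires looking at the cause.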
③ Wrong data
④ Fake data
⑤ Missing values
Reasons for missing values
- Information is not collected
- Attributes may not be applicable to all cases (Annual income is not applicable to children)
Handling missing values
- Eliminate data objects or variables
- Estimate missing values
º Time series of temperature
º Census results
- Ignore the missing value during analysis
Example (pandas): .dropna() eliminates the rows or columns that contain missing values
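The three strategies for handling missing values can be sketched with pandas as follows. The temperature readings are hypothetical, and linear interpolation is just one possible way to estimate a missing value in a time series.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperature readings with one missing value.
temps = pd.Series([20.0, 21.0, np.nan, 23.0, 24.0])

# 1) Eliminate: drop the observations that have missing values.
dropped = temps.dropna()

# 2) Estimate: fill the gap from the neighbouring readings
#    (interpolate() uses linear interpolation by default).
estimated = temps.interpolate()

# 3) Ignore: pandas skips NaN by default when computing statistics,
#    so the mean is taken over the four observed readings only.
mean_ignoring_missing = temps.mean()

print(len(dropped), estimated.iloc[2], mean_ignoring_missing)
```

Which option is appropriate depends on the data: eliminating rows is simple but discards information, while estimating assumes the neighbouring values are informative about the gap.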
⑥ Duplicate data
- A data set may include data objects that are duplicates, or almost duplicates, of one another
- Major issue when merging data from heterogeneous sources
Examples
- Same person with multiple email addresses
Data cleaning
- Process of dealing with duplicate data issues
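A minimal pandas sketch of removing exact duplicates and near duplicates is shown below. The table is hypothetical, and treating name plus birth date as the identity of a person is an assumption made only for this illustration. Real record matching is usually harder.

```python
import pandas as pd

# Hypothetical customer table merged from two sources.
customers = pd.DataFrame({
    "name": ["Ana Kim", "Ana Kim", "Ben Lee", "Ana Kim"],
    "birth": ["1990-01-01", "1990-01-01", "1985-05-05", "1990-01-01"],
    "email": ["ana@work.example", "ana@work.example",
              "ben@mail.example", "ana@home.example"],
})

# Exact duplicates: rows that are identical in every column.
exact_removed = customers.drop_duplicates()

# Near duplicates: the same person with multiple email addresses.
# Assumption: name + birth date identify a person; keep the first record seen.
one_per_person = customers.drop_duplicates(subset=["name", "birth"], keep="first")

print(len(customers), len(exact_removed), len(one_per_person))
```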
When should duplicate data not be removed?
- When the duplicate data is meaningful (e.g., a person purchased the same item several times)
- When duplicates represent real-world frequency (e.g., repeated reviews of a movie by the same user may be a way of emphasizing their opinion)
- When modeling event occurrence (log data and access records may look duplicated, but they record normal events that occurred repeatedly)
- When detecting fraud or abnormal behavior (repetitive behavior may itself indicate a fraud attempt)
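The cases above can be illustrated by counting repeated rows instead of dropping them. This is a sketch under stated assumptions: the access log, its column names, and the threshold of 3 are all hypothetical.

```python
import pandas as pd

# Hypothetical access log: the repeated rows are real events, not errors.
log = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u1"],
    "action": ["login_failed"] * 5,
})

# Keep the duplicates and count them: the frequency is the signal.
attempts = log.groupby("user").size()

# Hypothetical rule: 3 or more failed logins is worth a closer look.
THRESHOLD = 3
suspicious = attempts[attempts >= THRESHOLD].index.tolist()

# For contrast: deduplicating first leaves one row per user,
# so the repeated attempts by u1 are no longer visible.
deduped = log.drop_duplicates()

print(attempts.to_dict(), suspicious, len(deduped))
```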