Cleaning data is one of the largest topics to cover when working with data. There's so many sources of data and variability in how it's stored, organized, formatted, etc. So there's various ways to get consistent data in your work. So a large preoccupation of working in data oriented fields is cleaning the data.
This is a continuation of content that was moved from the PCDE course module content document.
Data often comes in a format that is not quite ready for efficient analysis. In fact, to perform an accurate analysis, it is of paramount importance that the data is in a tidy format.
Regardless of the programming language you use,
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table or database and refers to identifying incomplete, inaccurate, incorrect, inaccurate or irrelevant parts of the data and then replacing, or deleting the dirty or coarse data -- Wu, 2013
In a similar way as you have seen for Python, cleaning data in SQL includes a set of techniques that are normally performed by developers when the data doesn't come in an easy to use format.
As you have learned so far, when a database is presented to you, the most common data types are numeric, string and datetime. In the case of numeric data types, a problem you may encounter is that the data comes in a numeric data type that does not suit what that property really describes.
For example,
suppose you have a table with a column Age
and
that this column contains floats so that
the ages are really written as 10.0, 34.0, etc.
In this case, it would be a good idea to convert these values to integers,
as it doesn't make sense to have an age expressed as floats.
Another issue may arise when dealing with zero values.
Consider the following example.
Suppose that your table also has a column titled BloodPressure
,
that contains zero values.
This doesn't make sense either,
since the blood pressure of a living person cannot be zero.
In this case,
it would be a good idea to replace the zero values with more meaningful entries,
such as the average blood pressure for all the individuals in that table.
As in, imputing the data that is clearly corrupt or missing.
TODO: Relate this to a larger note on data cleaning in general
and link to respective topics in the SQL and Python versions
In general, it is always important to deal with erroneous or NULL entries to ensure an accurate analysis. Like in Python, the two most common techniques for data cleaning in SQL are:
Another situation where
you may have problems when dealing with data that is not clean is table joining.
Consider the following scenario.
Suppose that you want to join two tables,
MedicalRecords
and Prognosis
,
on the key PatientID
.
Assume further that PatientID
is in integer format in the first table and
in float format in the second one.
When you try to join the two tables,
you will encounter an error because
the datatypes for the same column in each table are mismatched.
String values are also very common in databases. An issue you may face is that within the dame column, values that are supposed to represent the same value are written in a different way.
For example,
in the Medical Records
table above,
you may have a column CardioExams
,
which represents the type of exam a patient had to undergo.
If this column contains values such as "EKG" or "ECG",
which both describe the same type of exam (electrocardiogram),
this may cause inaccuracy when grouping data.
The most common problem that arises when working with data in a date or time format in SQL is that although the entries appear in date or time format, they are not actually saved as the appropriate data type.
A solution to this is to cast the original entries into the proper format to allow manipulation and analysis.
In general, when cleaning data of any data type in SQL, you must pay attention to the following:
Wu, Shaomin. "A Review on Coarse Warranty Data & Analysis". 2013 https://kar.kent.ac.uk/32972/1/LatestVersionV01.pdf
We'll be taking a look at the gender column of
the customer database from the bad_sakila
database.
SELECT gender
FROM customer;
If you look at the results,
you'll see a lot of different ways to
represent genders male and female.
Everything from FEMALE
, f
, F
, female
, FEM
are used for
female customers.
This is a mess and we need to clean the data into a consistent value for
male or female customers.
In this case we'll use a CASE
transformation to
transform various kinds of values into one consistent one.
To clean data values,
we can use the CASE
clause in a SELECT
statement to
transform the inconsistent values of a gender column for female customers.
SELECT *,
CASE
WHEN gender = 'FEMALE' THEN 'female'
WHEN gender = 'f' THEN 'female'
WHEN gender = 'FEM' THEN 'female'
ELSE gender
END AS gender_cleaned
FROM customer;
We start with the old SELECT
statement,
then we use the CASE
clause followed by WHEN-THEN
expressions to
search for different possible values for female.
The THEN
expression then renames those values as female
.
Then the AS gender_cleaned
clause adds this labeled column of cleaned values.
Remember values for gender like this could come from customers manually entering the data themselves. Or any other inconsistent data source. So it's often valuable to be able to validate the data like this.
We could also explore an example of creating categories for generations. We might want to know which customers are greatest generation (born before 1945), boomers (before 1964), Gen X (before 1976), millennials (before 1995)
SELECT *,
CASE
WHEN birthdate < '1945-01-01' THEN 'Greatest Generation'
WHEN birthdate < '1964-01-01' THEN 'Boomers'
WHEN birthdate < '1976-01-01' THEN 'Generation X'
WHEN birthdate < '1995-01-01' THEN 'Millennials'
ELSE 'Generation Z'
END AS generation
FROM customer;
This CASE
-WHEN
-THEN
clause divides birthdate
s into
generational categories.
This sort of procedure can be used to clean any kinds of data given to
a database.
When you're trying to determine what NULL is, because NULL is... NULL, you need to do some interpreting. Remember, ideally NULL represents non-presence of data.
So let's start with the film
table and its original_language
column.
The below query returns a lot of NULL values in original_language
which makes it effectively worthless.
Let's determine that any NULL we see means that;
the original language is unknown.
SELECT *
FROM film;
Let's rewrite this query.
Start off by selecting columns: title
, rating
, original_language_id
,
then for the final column,
COALESCE
original_language_id
NULLs into the string unknown
.
SELECT title, rating, original_language_id,
COALESCE(original_language_id, 'unknown') AS OriginalLanguage
FROM film;
The COALESCE
function is a built-in function that
returns the first non-null value.
As in,
you hand it a selected column and a value for it to coalesce to when null,
or to use the same column value given when it's non-null.
Now to perform a similar type of inspection.
You will also want to look at certain columns and change something within them.
When looking at the film
table again, but this time on the rental_rate
column,
it shows the cost to rent the film as decimal numbers.
At least most of the time,
some values are strings with the $
sign attached.
That's where the REPLACE()
function comes in.
SELECT *,
REPLACE(rental_rate, '$', '') AS clean
FROM film;
The REPLACE()
function replaces an existing value with a new string.
And here we REPLACE
any value in rental_rate
containing $
with ''
,
or an empty string to get rid of the dollar sign.
The first parameter in the function determines the column,
the second the substring to look for,
the third the string that replaces the substring.
The result is now a new column clean
that
displays prices in consistent format, without losing data.
Sometimes cleaning data involves extracting data from one query then using those results in aggregate within a cleaning query. Though this example doesn't really clean data per se, it demonstrates a way of using one query in another to transform data.
SELECT
COUNT(*) AS count,
FLOOR(YEAR(birthdate) / 10) * 10 AS decade
FROM customer
GROUP BY decade
Here we pull birthdate
from customer
table.
Then using the FLOOR()
function to transform the values and
GROUP
ing BY
the relabeled decade you have transformed the
birthdate
data and grouped it into decades and counts.
This data extraction can now be used to clean the data.
Sometimes you might need a sanity check on data. In the above example where we extracted decade counts of birthdates, we may wonder if a mistake was made in constructing the query. Using that query as the basis for the next one, let's see how it can instruct us on how to construct queries that allow us to visually check if the query makes sense for the data.
SELECT
-- COUNT(*) AS count,
birthdate,
FLOOR(YEAR(birthdate) / 10) * 10 AS decade
FROM customer
-- GROUP BY decade
ORDER BY decade, birthdate;
This will give a results table with the same data as the query before,
but it isn't aggregated.
Instead of GROUP
ing the data,
we order it.
This way,
the disaggregated data is presented next to
the transformed version of the data, in this case the decade,
where we can visually inspect that the extracted decade is correct
for the passed on birthdate
.
And it's easier to visually inspect when the dates are ordered.
Essentially the process of visual inspection of queries goes like this:
SELECT
statement.ORDER BY
the GROUP BY
column instead.