In real world applications involving databases, often the data has to be cleaned. In this post, some techniques are presented, such as converting the data type of a field, creating a script to repeat the steps in different columns and using functions from MySQL such as ENUM(), IF(), SUBSTRING(), CONCAT(), REGEX, PREPARE and EXECUTE statement. Some of the commands are sent from the MySQL Workbench while other commands are sent directly from the windows prompt.
- Importing a public csv file using mysql workbench wizard, that contains the data as well as the headers that represent the different table fields.
- Selecting the data types to be initially imported as text. They will be converted to the desired data type afterwards.
- Converting the 'dete_start_date' column from text to Small Integer unsigned, after have deleted the 'Not Stated' rows. For example text '2014' to SMALLINT UN 2014. Right click on the table and then click on 'Alter Table...'
- Converting the 'business_unit' column from text to VARCHAR(255), following the same steps: Right click on the table and then click on 'Alter Table...'
- Converting the 'gender' column from text to ENUM('Male','Female'), after deleting the rows where the gender was different from these 2 options.
- Converting the 'career_move_to_public_sector' column from text to Boolean. In fact the values in this field are either 'True' or 'False', so first a conversion is done using IF('True',1,0) and after the data type is changed from Text to TINYINT UNSIGNED. This data type is required because MySQL does not have BOOL data type, so TINYINT is used instead.
Checking the results of the conversion with a query:
Creating a Stored Procedure to automate the steps above. However it is not possible to pass table or column names as inputs to be directly updated on the Procedure, due to safety reasons.
One solution to overcome these limitations is to CONCAT() the desired SQL commands and then use a PREPARE statement and EXECUTE this statement.
- Converting the 'cease_date' column from text to SMALL INT UN. Note that some rows in this field had a date in this format MM/YYYY while the majority was on YYYY format. Using REGEX it was possible to filter these records and update them accordingly using the SET and SUBSTRING() functions.
No comments:
Post a Comment