January 1, 2021

#85 - Using Pandas to transform categorical data for machine learning models

When building machine learning models in Python with libraries such as numpy, pandas and sklearn, many algorithms require the input data to be in numerical format.

This post shows two ways to convert string (or object) columns to numerical data types, using different approaches from the pandas library. If this conversion is skipped, fitting the model can fail with an error such as: “ValueError: could not convert string to float: ‘male’”.
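To make the error concrete, here is a minimal sketch that reproduces it without training any model. The toy DataFrame is a hypothetical stand-in for the real dataset, and the conversion to a float array is only an assumption about what a numerical estimator does internally with its input.

```python
import pandas as pd

# Hypothetical toy data standing in for a dataset with a string column.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "gender": ["male", "female", "female"],
})

try:
    # Assumption: a numerical model needs its input as a float array,
    # so the string column cannot be converted.
    df.to_numpy(dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```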

The first approach uses one-hot encoding with get_dummies, which creates one indicator column per category:

    X = pd.get_dummies(df[features].fillna(-1))
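As a self-contained sketch of this approach, the snippet below applies the same call to a small made-up DataFrame. The column names and values are assumptions chosen for illustration, not the actual Titanic data.

```python
import pandas as pd

# Hypothetical toy data: one numerical column with a missing value
# and one categorical column stored as strings.
df = pd.DataFrame({
    "age": [22.0, None, 26.0],
    "gender": ["male", "female", "female"],
})
features = ["age", "gender"]

# Fill missing values with -1, then one-hot encode the string columns.
# Numerical columns such as "age" pass through unchanged.
X = pd.get_dummies(df[features].fillna(-1))

print(X)
```

Each distinct value of gender becomes its own indicator column (gender_female, gender_male), so the number of columns grows with the number of categories.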

The second approach converts the column to the category data type and replaces each value with its integer code:

    X = df[features].fillna(-1)
    X['gender'] = X['gender'].astype('category').cat.codes
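The same steps can be sketched on a small made-up DataFrame. The data and the code_to_label helper dictionary are assumptions added for illustration; the dictionary simply records which integer code stands for which original label.

```python
import pandas as pd

# Hypothetical toy data, not the actual Titanic dataset.
df = pd.DataFrame({
    "age": [22.0, None, 26.0],
    "gender": ["male", "female", "female"],
})
features = ["age", "gender"]

X = df[features].fillna(-1)

# Convert to the category dtype once so the categories can be inspected later.
gender_cat = X["gender"].astype("category")
X["gender"] = gender_cat.cat.codes

# Mapping from integer code back to the original label.
code_to_label = dict(enumerate(gender_cat.cat.categories))

print(X)
print(code_to_label)
```

Unlike get_dummies, this keeps a single column. The integer codes carry an order that may not mean anything, so this encoding tends to suit tree-based models better than linear ones. Note that cat.codes assigns -1 to any value that is still missing.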

These techniques are particularly useful when working with grouped data such as gender or geographical location (country, state, city), and with many other categorical variables.

The program output below uses the Titanic dataset, and the source code can be found at this GitHub link:
