January 1, 2021

#68 - Key points about Pandas (Python library)

Pandas is one of the key Python libraries for data manipulation and data analysis. It is built on top of NumPy, which is another important Python library.
It can be installed from the command line:
> pip install pandas

To import this library in a .py program:
> import pandas as pd

Note: The alias pd is the conventional abbreviation.

Pandas has two fundamental data structures: DataFrame and Series.
A DataFrame handles 2D objects and is ideal for working with tabular data such as database tables, spreadsheets, csv files, etc. It has three main attributes: columns, index and values.
A Series is similar to a vector: it is one-dimensional and holds the values of a single column. It has an index (row labels) but no column labels.
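To make the difference between the two structures concrete, here is a minimal sketch that builds a small DataFrame and pulls one column out of it as a Series. The data, column names and row labels are invented for the example.

```python
import pandas as pd

# Hypothetical data: column names and row labels are invented for this example.
df = pd.DataFrame(
    {"Col1": ["xxx", "yyy", "zzz"], "Col2": [1, 2, 3]},
    index=["a", "b", "c"],
)

# The three main attributes of a DataFrame
print(list(df.columns))   # column labels
print(list(df.index))     # row labels
print(df.values.shape)    # shape of the underlying 2D array of values

# A single column is a Series: it keeps the row labels (index),
# but has only a name instead of column labels.
s = df["Col2"]
print(type(s).__name__, list(s.index), s.name)
```

Passing index= is optional; without it Pandas numbers the rows 0, 1, 2, and so on.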

Below are some examples using Pandas:
  • Reading/Exporting data from a file. Many formats can be read such as csv, xls, html, json, SQL. Options are also available for file compression/decompression.
> df = pd.read_csv('https://..././mycsv.csv')
> df.to_csv('path/../tocsv.csv', index=False)
  • Create a df (values, index, columns):
> mydictionary = {"Col1": ["xxx", "yyy", "zzz"],
                  "Col2": [1, 2, 3],
                  "Col3": ["True", "False", "True"]}
> df = pd.DataFrame(mydictionary)
  • Create a Series ("vector"):
> mylist = [1, 2, 3]
> mySerie = pd.Series(mylist)
  • Display df info:
> df, df.info(), df.columns, df.index, df.values, df.dtypes, df["mycol"].shape
  • Display first/last n rows of df:
> df.head(n), df.tail(n)
  • Calculate some stats on the df fields (numeric_only=True skips text columns):
> df.describe(), df.max(), df.min(), df.mean(numeric_only=True), df.corr(numeric_only=True)
  • Counting, sorting and grouping values in df:
> df['column_name'].value_counts()
> df['column_name'].value_counts().sort_index()
> df.sort_values('column_name', ascending=False)
> df.groupby(['mycol'])['mynumfield'].mean()
  • Slicing columns and filtering rows in df:
> mycol = df['column_name']
> mydf = df[['col1', 'col2']]
> mydf = df[df['col1'] > 0]    # any boolean condition on the column works here
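The counting, sorting, grouping and filtering calls listed above can be tried together on a small table. This is only a sketch: the city/units data is invented for illustration.

```python
import pandas as pd

# Hypothetical sales data, invented for illustration.
df = pd.DataFrame({
    "city": ["Lima", "Quito", "Lima", "Bogota", "Quito", "Lima"],
    "units": [10, 4, 7, 12, 6, 3],
})

# How many rows there are per city, most frequent first
counts = df["city"].value_counts()

# Sort the rows by a column, largest first
by_units = df.sort_values("units", ascending=False)

# Average units per city
mean_units = df.groupby("city")["units"].mean()

# Keep only the rows that satisfy a condition (boolean mask)
big = df[df["units"] > 6]

print(counts)
print(by_units.head(3))
print(mean_units)
print(big)
```

Each of these calls returns a new object and leaves df itself unchanged, so the results can be chained or stored under new names.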

There are many more useful methods in the API, and it is really worth checking the documentation at the link. The ones I find particularly interesting are those for working with time series and for handling missing data.
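As a taste of the missing-data and time-series methods, here is a minimal sketch on a made-up daily series with two gaps. The dates and values are assumptions for the example, and interpolation is just one of several ways to fill gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two missing values.
dates = pd.date_range("2021-01-01", periods=6, freq="D")
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0], index=dates)

# Missing data
n_missing = s.isna().sum()   # count the missing values
dropped = s.dropna()         # discard the missing entries
filled = s.fillna(0.0)       # replace them with a constant
interp = s.interpolate()     # linear interpolation between neighbours

# Time series: aggregate the daily values into 2-day bins
two_day = interp.resample("2D").sum()

print(n_missing)
print(interp)
print(two_day)
```

Which strategy is appropriate (drop, fill or interpolate) depends on the data, so it is worth comparing the results before settling on one.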

Another tip is to call help() from the Python prompt to learn more about a method, such as its syntax and arguments.

For example:

> help(type(df))
> help('pandas.core.frame.DataFrame')
> help('pandas.core.series.Series.sort_index')

A DataFrame can also be exported to HTML and opened in the browser for easier viewing (this needs the os and webbrowser modules):
> import os, webbrowser
> html = df.loc[0:99,:].to_html()
> with open('myweb.html','w') as f:
        f.write(html)

> myfile = os.path.abspath('myweb.html')
> webbrowser.open('file://{}'.format(myfile))

Links:
