Pandas is one of the key
libraries used in Python for data manipulation and data analysis. It is built on
top of NumPy, another important Python library.
The installation can be
done by command line:
> pip install pandas
To import this library
in a .py program:
> import pandas as pd
Note: by convention, the
alias pd is used as the abbreviation.
Pandas has two
fundamental data structures, called DataFrame and Series.
A DataFrame handles 2D
objects and is ideal for working with tabular data such as databases,
spreadsheets, csv files, etc. A DataFrame has three main attributes:
columns, index and values.
A Series is similar to a vector: it is one-dimensional, with values and row labels (an index) but no column labels.
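As a minimal sketch of the two structures, the snippet below builds a small DataFrame and looks at its columns, index and values, then pulls out one column as a Series. The data and the row labels "a", "b", "c" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame(
    {"Col1": ["xxx", "yyy", "zzz"], "Col2": [1, 2, 3]},
    index=["a", "b", "c"],
)

print(list(df.columns))  # column labels
print(list(df.index))    # row labels
print(df.values.shape)   # shape of the underlying 2D array of values

# Selecting a single column gives a Series: values plus the same row index
s = df["Col2"]
print(type(s).__name__)
print(list(s.index))
```

Note that the Series keeps the row labels of the DataFrame it came from.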
Below are some examples
using Pandas:
- Reading/Exporting data from a file. Many formats can be read such as csv, xls, html, json, SQL. Options are also available for file compression/decompression.
> df = pd.read_csv('https://..././mycsv.csv')
> df.to_csv('path/../tocsv.csv', index=False)
- Create a df (values, index, columns):
> mydictionary = {"Col1": ["xxx","yyy","zzz"],
"Col2": [1, 2, 3],
"Col3": ["True", "False", "True"]}
> df = pd.DataFrame(mydictionary)
- Create a Series ("vector"):
> mylist = [1, 2, 3]
> mySerie = pd.Series(mylist)
- Display df info:
> df, df.info(), df.columns, df.index, df.values, df.dtypes, df["mycol"].shape
- Display first/last n rows of df:
> df.head(n), df.tail(n)
- Calculate some stats with the df fields:
> df.describe(), df.max(), df.min(), df.mean(numeric_only=True), df.corr(numeric_only=True)
- Count, sort and group values in df:
> df['column_name'].value_counts()
> df['column_name'].value_counts().sort_index()
> df.sort_values('column_name', ascending=False)
> df.groupby(['mycol'])['mynumfield'].mean()
- Selecting columns and filtering rows in df:
> mycol = df['column_name']
> mydf = df[['col1','col2']]
> mydf = df[df['col1'] == some_value]  # or any other boolean condition
There are many more
useful methods in the Pandas API, and it is really worth checking the official
documentation. The ones I
find particularly interesting are those for working with time series and for handling missing
data.
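As a rough illustration of those two areas, the sketch below uses a made-up daily series with two gaps. It shows a few common ways to deal with the missing values, followed by a simple time-based aggregation. The dates and numbers are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two missing values
dates = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0], index=dates)

n_missing = s.isna().sum()  # count the missing entries
dropped = s.dropna()        # discard them
filled = s.fillna(0.0)      # or replace them with a constant
interp = s.interpolate()    # or fill linearly between neighbours

# Time series: aggregate the daily values into 2-day bins
two_day_mean = filled.resample("2D").mean()

print(n_missing)
print(interp)
print(two_day_mean)
```

Which strategy is appropriate depends on the data: dropping, filling and interpolating can give quite different results downstream.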
Another tip is to use help() from the Python prompt to get more details about a method, such as its syntax and arguments.
For example:
> help(type(df))
> help('pandas.core.frame.DataFrame')
> help('pandas.core.series.Series.sort_index')
Another is to export the DataFrame to html so it can be viewed more comfortably in a browser (this needs os and webbrowser to be imported):
> import os, webbrowser
> html = df.loc[0:99,:].to_html()
> with open('myweb.html','w') as f:
      f.write(html)
> myfile = os.path.abspath('myweb.html')
> webbrowser.open('file://{}'.format(myfile))