Pandas - start

https://pandas.pydata.org/

https://pandas.pydata.org/docs/user_guide/10min.html

https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb

https://www.w3schools.com/python/pandas/

INSTALLING PANDAS


If you already have Python, you can install Pandas with:

ANACONDA
conda install pandas

PIP (in a virtual environment)
pip install pandas

APT (system-wide, Debian-based systems)
Debian packages: python3-pandas (Python 3) and python-pandas (Python 2, older releases only), plus dependencies.

INTRODUCTION

'pandas' is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. 'pandas' aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Main features:

The two primary data structures of pandas are Series (1-dimensional, a single column) and DataFrame (2-dimensional). A DataFrame contains one or more Series and a name for each Series.

DataFrame objects can be created by passing a dict that maps string column names to their respective Series. The Series are aligned by index label; where a Series has no value for a label (for example, because it is shorter than the others), the cell is filled with a special NA/NaN value.
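
A minimal sketch of the NA/NaN filling described above, using made-up sample data (the names `names`, `values`, and `df` are illustrative only). Note that an integer column that receives a missing value is stored as float64.

```python
import pandas as pd

# Two Series of different lengths (hypothetical sample data).
names = pd.Series(['a1', 'a2', 'a3'])
values = pd.Series([11, 22])   # no value for index label 2

df = pd.DataFrame({'col1': names, 'col2': values})
print(df)
#   col1  col2
# 0   a1  11.0
# 1   a2  22.0
# 2   a3   NaN

# isna() marks the cells that were filled in.
print(df['col2'].isna().tolist())   # [False, False, True]
```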


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # for plots

print(pd.__version__)   # 0.23.3 in Debian 10

# pd.DataFrame
#         pd.Series
#         |      |
#         o      o
#        col1   col2  o-- columns described by keys
#      +------+------+      (default 0,1,...)
# row1 | 'a1' |  11  |
#      +------+------+
# row2 | 'a2' |  22  |
#      +------+------+
# row3 | 'a3' |  33  |
#   o  +------+------+
#   |
# rows described by indexes (default 0,1,...)

# Examples from Colab.

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])   # strings
population = pd.Series([852469, 1015785, 485199])   # numbers (int)

# Constructing DataFrame from a dictionary.
df1 = pd.DataFrame({'City name': city_names, 'Population': population})
print(df1)
#        City name  Population     # column names are shown
# 0  San Francisco      852469     # row indexes are default (0,1,...)
# 1       San Jose     1015785
# 2     Sacramento      485199

# Accessing a Series (column) with Python dict/list-style operations.
print(df1['City name'])
print(df1['Population'])
print(df1.Population)   # attribute access works only if the label is a valid Python identifier
print(df1['Population'][1])   # 1015785
print(df1.Population[1])   # 1015785

# New columns can be added by assigning to a new label.
df1['Area square miles'] = pd.Series([46.87, 176.53, 97.92])

# Calculations with numpy.
print(np.log(population))   # returns a new Series
# 0    13.655892
# 1    13.831172
# 2    13.092314
# dtype: float64

# Transforming series.
# population.apply(lambda item: item > 1000000)   # returns a new boolean Series
print(population > 1000000)   # numpy style (elementwise)
# 0    False
# 1     True
# 2    False
# dtype: bool
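
The two forms in the "Transforming series" snippet above should give the same values. A small sketch that checks this, reusing the population figures from the notes (the names `by_apply` and `by_compare` are illustrative only):

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Element-by-element with apply(): the lambda is called once per item.
by_apply = population.apply(lambda item: item > 1000000)

# Vectorized comparison: the whole Series is compared in one operation.
by_compare = population > 1000000

print(by_apply.tolist())     # [False, True, False]
print(by_compare.tolist())   # [False, True, False]
```

The vectorized form is usually preferred for simple comparisons and arithmetic; apply() is useful when the per-item logic cannot be expressed that way.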

# Indexes.
print(df1.index)   # RangeIndex(start=0, stop=3, step=1)
# df1a = df1.reindex([2, 0, 1])   # changing order of rows
# df1b = df1.reindex(np.random.permutation(df1.index))   # random order of rows
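
A short sketch of what the commented-out reindex() calls above do, rebuilt as a self-contained example with the same city data. The variable names and the extra label 3 are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City name': ['San Francisco', 'San Jose', 'Sacramento'],
                   'Population': [852469, 1015785, 485199]})

# Reorder rows by listing the existing index labels in the desired order.
reordered = df.reindex([2, 0, 1])
print(reordered)
#        City name  Population
# 2     Sacramento      485199
# 0  San Francisco      852469
# 1       San Jose     1015785

# A label that is not in the original index produces a row of NaN values.
extended = df.reindex([0, 1, 2, 3])
print(extended.loc[3].isna().all())   # True

# Random row order: the same labels, shuffled.
shuffled = df.reindex(np.random.permutation(df.index))
print(shuffled)
```

reindex() returns a new DataFrame; the original df keeps its row order.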