Pandas - start

https://pandas.pydata.org/

https://pandas.pydata.org/docs/user_guide/10min.html

https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb

https://www.w3schools.com/python/pandas/

INSTALLING PANDAS

If you already have Python, you can install Pandas with:

ANACONDA
conda install pandas

PIP (in a virtual environment)
pip install pandas

APT (for the entire computer)
Debian packages: python3-pandas (Python 3) or python-pandas (Python 2, Debian 10 and older), plus dependencies.

INTRODUCTION

'pandas' is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. 'pandas' aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

Main features:

Column-oriented data analysis API.
Easy handling of missing data (float and non-float).
Label-based slicing, indexing, and subsetting of large data sets.
Size mutability (columns can be inserted and deleted).
Easy conversion of different data into DataFrame objects.
Robust IO tools for loading data from flat files (CSV and other delimited text), Excel files, and the HDF5 format.
Time series-specific functionality.
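
A minimal sketch of three of the features listed above (missing data, label-based slicing, size mutability). The temperature values and the names 'temps', 'temp_c' and 'temp_f' are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: daily temperatures with one missing value.
temps = pd.Series(
    [21.5, np.nan, 19.0, 22.5],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Missing data: reductions skip NaN by default.
mean_temp = temps.mean()           # mean of the three known values
filled = temps.fillna(mean_temp)   # new Series with NaN replaced

# Label-based slicing: both end labels are included.
first_two = temps.loc["2024-01-01":"2024-01-02"]

# Size mutability: insert a column, then delete another one.
frame = temps.to_frame(name="temp_c")
frame["temp_f"] = frame["temp_c"] * 9 / 5 + 32
del frame["temp_c"]

print(mean_temp)
print(first_two)
print(frame)
```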

The two primary data structures of pandas are Series (1-dimensional, a single column) and DataFrame (2-dimensional). A DataFrame contains one or more Series and a name for each Series.

DataFrame objects can be created by passing a dict that maps string column names to their respective Series. The Series are aligned on their index labels, so if they differ in length (or in labels), the missing entries are filled with special NA/NaN values.
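
A small sketch of the length-mismatch case described above. The names 'short_col', 'long_col' and 'shuffled' are illustrative only.

```python
import pandas as pd

short_col = pd.Series(['a1', 'a2'])    # index labels 0, 1
long_col = pd.Series([11, 22, 33])     # index labels 0, 1, 2

# Label 2 has no value in short_col, so that cell becomes NaN.
df = pd.DataFrame({'col1': short_col, 'col2': long_col})
print(df)

# Values are matched by index label, not by position in the list.
shuffled = pd.Series([33, 11, 22], index=[2, 0, 1])
df_aligned = pd.DataFrame({'col2': long_col, 'col3': shuffled})
print(df_aligned)
```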

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # for plots

print(pd.__version__)   # 0.23.3 in Debian 10

# pd.DataFrame
#         pd.Series
#         |      |
#         o      o
#        col1   col2  o-- columns described by keys
#      +------+------+      (default 0,1,...)
# row1 | 'a1' |  11  |
#      +------+------+
# row2 | 'a2' |  22  |
#      +------+------+
# row3 | 'a3' |  33  |
#   o  +------+------+
#   |
# rows described by indexes (default 0,1,...)

# Examples from Colab.

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])   # strings
population = pd.Series([852469, 1015785, 485199])   # numbers (int)

# Constructing DataFrame from a dictionary.
df1 = pd.DataFrame({'City name': city_names, 'Population': population})
print(df1)
#        City name  Population     # column names are shown
# 0  San Francisco      852469     # row indexes are default (0,1,...)
# 1       San Jose     1015785
# 2     Sacramento      485199

# Accessing Series and their items with Python dict/list syntax.
print(df1['City name'])
print(df1['Population'])
print(df1.Population)   # only if labels are proper Python identifiers
print(df1['Population'][1])   # 1015785
print(df1.Population[1])   # 1015785
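
The chained form df1['Population'][1] works for reading, but pandas also offers the explicit .loc (label-based) and .iloc (position-based) accessors. A short sketch that rebuilds the same table so it can run on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

by_label = df.loc[1, 'Population']   # row label 1, column label 'Population'
by_position = df.iloc[1, 1]          # second row, second column
first_rows = df.iloc[0:2]            # positional slice, end excluded
labeled_rows = df.loc[0:1]           # label slice, end included

print(by_label, by_position)
print(first_rows)
```

With the default index (0,1,...) labels and positions coincide, so both slices select the same two rows.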

# New columns can be added (with new labels).
df1['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
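
A new column can also be computed from existing ones. The sketch below derives a population density column; the label 'Population density' is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
    'Area square miles': [46.87, 176.53, 97.92],
})

# Elementwise division of two columns gives a new Series,
# which is stored under a new label.
df['Population density'] = df['Population'] / df['Area square miles']
print(df)
```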

# Calculations with numpy.
print(np.log(population))   # returns a new Series
# 0    13.655892
# 1    13.831172
# 2    13.092314
# dtype: float64

# Transforming series.
# population.apply(lambda item: item > 1000000)   # returns a new boolean Series
print(population > 1000000)   # numpy style (elementwise)
# 0    False
# 1     True
# 2    False
# dtype: bool
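
A boolean Series like the one above is typically used as a mask to select rows. A self-contained sketch (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
    'Area square miles': [46.87, 176.53, 97.92],
})

# Boolean mask: keep only the rows where the condition is True.
mask = df['Population'] > 1000000
big_cities = df[mask]

# apply() with a lambda gives the same mask, one item at a time.
same_mask = df['Population'].apply(lambda item: item > 1000000)

# Combine conditions with & (and) or | (or); each comparison needs parentheses.
wide_san = df[df['City name'].str.startswith('San') & (df['Area square miles'] > 50)]

print(big_cities)
print(wide_san)
```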

# Indexes.
print(df1.index)   # RangeIndex(start=0, stop=3, step=1)
# df1a = df1.reindex([2, 0, 1])   # changing order of rows
# df1b = df1.reindex(np.random.permutation(df1.index))   # random order of rows
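
A short sketch of what reindex does, including the case where a requested label is not in the original index. The label 4 is picked only to show the missing-label behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Reordering: rows keep their original labels.
reordered = df.reindex([2, 0, 1])

# Label 4 does not exist, so reindex creates a row filled with NaN.
extended = df.reindex([0, 4, 2])

# reset_index(drop=True) renumbers the rows 0,1,... again.
renumbered = reordered.reset_index(drop=True)

print(reordered)
print(extended)
print(renumbered)
```

reindex returns a new DataFrame; the original df is left unchanged.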