Python Pandas Basics

Installation depends on your environment (i.e. whether you use pip or conda):

pip install pandas
conda install pandas
+-------+---------+
| LABEL | VALUE   |
+-------+---------+
| AA    |  233.00 |
+-------+---------+
| XZ    | 4555.00 |
+-------+---------+
| DF    |  667.00 |
+-------+---------+
| FS    |  890.00 |
+-------+---------+
The above in CSV format (some_data.csv):

LABEL,VALUE
AA,233.00
XZ,4555.00
DF,667.00
FS,890.00
import pandas as pd

df = pd.read_csv("some_data.csv")  # df is short for DataFrame; it's the conventional name, but you can name it anything you want
print(df)

>> OUTPUT (DATAFRAME):
  LABEL   VALUE
0    AA   233.0
1    XZ  4555.0
2    DF   667.0
3    FS   890.0
(see also head() and tail() later in this article)
Behold a yummy Dataframe
print(df.sum())

>>
LABEL    AAXZDFFS
VALUE      6345.0
dtype: object

Or, prettier, just selecting the VALUE column:

print(f"TOTAL VALUE: {df['VALUE'].sum()}")

>> TOTAL VALUE: 6345.0

•·················•·················•

# 2: Summary statistics:
print(df.describe())

>>
             VALUE
count     4.00000
mean   1586.25000
std    1997.87777
min     233.00000
25%     558.50000
50%     778.50000
75%    1806.25000
max    4555.00000
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("some_data.csv")
df.plot(x='LABEL', y='VALUE')
plt.show()
🔌 Power Tip (1): When scraping tabular data from a web page or a document, I usually find it easier to select and copy-paste from the browser into Excel as text, then save from Excel as a CSV for later manipulation in Pandas. The alternative is read_html(), which may or may not work depending on the page.
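If you do reach for read_html(), a minimal sketch (the URL here is a placeholder, and you'll need a parser such as lxml installed):

import pandas as pd

# read_html() returns a list of DataFrames, one per <table> found on the page:
tables = pd.read_html("https://example.com/page_with_tables.html")
print(f"Found {len(tables)} table(s)")
df = tables[0]  # pick the table you're after
print(df.head())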
🔌 Power Tip (2): Add some oomph and a bunch of other niceties to your plots with Seaborn; it only takes a couple of lines to get started:

import seaborn as sns
sns.set()  # call before plotting
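In context, that looks something like this (a sketch reusing some_data.csv from above):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply seaborn styling to all subsequent matplotlib plots

df = pd.read_csv("some_data.csv")
df.plot(x='LABEL', y='VALUE')
plt.show()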

More Features

Series

Let's say you have this column (it could also be a row) of data:

+----+
| 20 |
+----+
| 40 |
+----+
| 60 |
+----+
| 40 |
+----+
| 20 |
+----+
| 10 |
+----+
•·················Script starts >>> ·················•

import pandas as pd

# Instead of a csv file, let's turn it into a simple Python list:
someNumbers = [20, 40, 60, 40, 20, 10]

# And feed it to pandas (note the Series word):
ser = pd.Series(someNumbers)
# Print it (note that there are no column names)
print(ser)
>>0 20
1 40
2 60
3 40
4 20
5 10
dtype: int64
And operate on it:

print(f"Max number is: {ser.max()}")

>> Max number is: 60
You can add and extract series anywhere in the dataframe, not just under the bear.
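For instance, selecting a single column of a DataFrame gives you a Series, and assigning a Series to a new column name adds one back (a small sketch using some_data.csv from earlier):

import pandas as pd

df = pd.read_csv("some_data.csv")

extracted = df['VALUE']        # a single column is a Series
print(type(extracted))         # <class 'pandas.core.series.Series'>

df['DOUBLED'] = extracted * 2  # assigning a Series adds a new column
print(df)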

Time Series

+---------+-------+
| DATE    | VALUE |
+---------+-------+
| 3/24/20 |    14 |
+---------+-------+
| 3/25/20 |    15 |
+---------+-------+
| 3/26/20 |    86 |
+---------+-------+
| 3/27/20 |    34 |
+---------+-------+
| 3/28/20 |     9 |
+---------+-------+
date_data.csv:

DATE,VALUE
3/24/20,14
3/25/20,15
3/26/20,86
3/27/20,34
3/28/20,9
•·················Scripts start >>> ·················•

import pandas as pd

Time_Series = pd.read_csv('date_data.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)

# A few notes: squeeze=True reads the data as a Series (a single column) rather than a DataFrame,
# index_col=0 makes the first column the index, parse_dates=[0] attempts to parse the first
# column [0] as dates, and header=0 says the column names are in the first row.
# (Note: recent pandas versions removed the squeeze argument; use
# pd.read_csv(...).squeeze("columns") instead.)

print(Time_Series)

>> DATE
2020-03-24 14
2020-03-25 15
2020-03-26 86
2020-03-27 34
2020-03-28 9
# Give me the last 2 days of data:
print(Time_Series.last('2D'))

>> DATE
2020-03-27 34
2020-03-28 9
Note that we are using a new notation for selecting the last 2 days, 2D: see Offset Aliases.

∘₊✧──────✧₊∘

# Something more complex: add a value at the end of our time series, offset 4 days from the last record:
lastRecord = Time_Series.last('1D')

# This is both the date and the value; we need just the date (aka the index):
lastDate = lastRecord.index[0]  # 2020-03-28 00:00:00

# Now we need to add 4 days. Note it's not a simple + addition; we use pd.DateOffset:
offsetDate = lastDate + pd.DateOffset(4)

# And finally insert a value at this date:
Time_Series[offsetDate] = 100
print(Time_Series)

>>> DATE
2020-03-24     14
2020-03-25     15
2020-03-26     86
2020-03-27     34
2020-03-28      9
2020-04-01    100
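Those offset aliases work in more places than last(); a quick sketch on the same series (the alias choices here are arbitrary):

# first() mirrors last(), and aliases combine a count with a unit:
print(Time_Series.first('3D'))  # the first 3 calendar days
print(Time_Series.last('1W'))   # the last calendar week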
date_data_multi.csv:

DATE,VALUE1,VALUE2
3/24/20,14,140
3/25/20,15,150
3/26/20,86,860
3/27/20,34,340
3/28/20,9,900
•·················Script starts >>> ·················•

Time_Series_Multi = pd.read_csv('date_data_multi.csv', header=0, parse_dates=[0], index_col=0)
print(Time_Series_Multi)

>>            VALUE1  VALUE2
DATE
2020-03-24 14 140
2020-03-25 15 150
2020-03-26 86 860
2020-03-27 34 340
2020-03-28 9 900
# And plot it:
import matplotlib.pyplot as plt
Time_Series_Multi.plot()
plt.show()
>>> (the resulting chart: VALUE1 and VALUE2 plotted as lines against DATE)

Selecting

Our dataset:

+------+------+------+
| VAL1 | VAL2 | VAL3 |
+------+------+------+
|   97 |   44 |   93 |
+------+------+------+
|   27 |   72 |   75 |
+------+------+------+
|   38 |    2 |   51 |
+------+------+------+
|  100 |    8 |   30 |
+------+------+------+
some_data_2.csv:

VAL1,VAL2,VAL3
97,44,93
27,72,75
38,2,51
100,8,30
•·················Script starts >>> ·················•

import pandas as pd
df = pd.read_csv("some_data_2.csv")
print(df)
>>> VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30

⚠️ Note that the index was added for us, but we can also make a column the index; for instance, if we want the first column to be our index:
df = pd.read_csv("some_data_2.csv", index_col=[0])
print(df)
VAL2 VAL3
VAL1
97 44 93
27 72 75
38 2 51
100 8 30
We'll use the first example...
Our dataframe once more for reference...
VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30
•·················Script starts >>> ·················•

# SELECT the 3rd ROW (the one with index 2):
row_we_want = df.loc[2]
print(row_we_want)
>>>
VAL1 38
VAL2 2
VAL3 51
Name: 2, dtype: int64
•·················||·················•

# From that ROW, select VAL2:
print(row_we_want['VAL2'])
# We can also select directly:
# row_Column = df.loc[2]['VAL2']

>>> 2

•·················||·················•

# SELECT the VAL3 COLUMN:
column_we_want = df['VAL3']
print(column_we_want)

>>> 0    93
1 75
2 51
3 30
Name: VAL3, dtype: int64
•·················||·················•

# FROM that column, select the 3rd value down (index 2):
print(column_we_want[2])

# We can also select directly:
# column_row = df['VAL3'][2]

>>> 51

•·················||·················•

# Get only the values greater than 70, replacing the rest with 'under 70':
greater_than = df.where(df > 70, 'under 70')
print(greater_than)
>>>VAL1 VAL2 VAL3
0 97 under 70 93
1 under 70 72 75
2 under 70 under 70 under 70
3 100 under 70 under 70

Note: If you want the raw values of a dataframe, you can use the values/numpy methods; here's a list, for instance:

print(greater_than.values.tolist())

>>> [[97, 'under 70', 93], ['under 70', 72, 75], ['under 70', 'under 70', 'under 70'], [100, 'under 70', 'under 70']]
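A related selection pattern worth knowing (a small sketch on the original numeric df, not part of the example above): boolean indexing filters out whole rows instead of masking individual values:

# Keep only the rows where VAL1 is greater than 70:
print(df[df['VAL1'] > 70])

>>>    VAL1  VAL2  VAL3
0    97    44    93
3   100     8    30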

Operations

You can do dataframe-wide operations, for instance:

add()
sub()
mul()
div()
Our dataframe once more for reference...

   VAL1  VAL2  VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30
•·················Script starts >>> ·················•

# ADD 10 to every value...
print(df.add(10))
>>> VAL1 VAL2 VAL3
0 107 54 103
1 37 82 85
2 48 12 61
3 110 18 40
Same logic applies to the other operations; try them out.

•·················||·················•

But maybe you are more interested in column/row operations, a popular choice:

# Sum of columns (you can also use sum(), or any other operation with the usual +, -, /, * operators):
sumColumns = df['VAL1'] + df['VAL2'] + df['VAL3']
print(sumColumns)

>>> 0    234
1 174
2 91
3 138
# Most likely you want this as a column in your dataframe, so let's add it:
df['SUM'] = sumColumns
print(df)

>>>    VAL1  VAL2  VAL3  SUM
0 97 44 93 234
1 27 72 75 174
2 38 2 51 91
3 100 8 30 138
•·················||·················•

# Row operations (using the above dataframe):
sumRows = df.loc[0] + df.loc[1] + df.loc[2] + df.loc[3]
print(sumRows)

>>> VAL1    262
VAL2 126
VAL3 249
SUM 637
# Add this to the bottom as a SUM row:
df.loc['SUM'] = sumRows
print(df)

>>>      VAL1  VAL2  VAL3  SUM
0 97 44 93 234
1 27 72 75 174
2 38 2 51 91
3 100 8 30 138
SUM 262 126 249 637
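As an aside, pandas can build both of those sums in one call each with the axis argument, rather than spelling out every column or row (a sketch, starting from a fresh df):

# axis=1 sums across the columns for each row; axis=0 (the default) sums down each column:
df['SUM'] = df[['VAL1', 'VAL2', 'VAL3']].sum(axis=1)  # same SUM column as above
df.loc['SUM'] = df.sum(axis=0)                        # same SUM row as above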
•·················||·················•

Advanced operations with applymap() (i.e. when you have a function you want to apply to every value of your dataframe):

# Function to use (you can also add arguments):
def conditionalFunc(val):
    if val > 50:
        return val * 2
    else:
        return val * 4

df = df.applymap(conditionalFunc)
# Note: recent pandas versions renamed applymap() to DataFrame.map()
---+ Original +---

   VAL1  VAL2  VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30
---+ After applymap +---

   VAL1  VAL2  VAL3
0 194 176 186
1 108 144 150
2 152 8 102
3 200 32 120
Also see apply() for a single column/row/Series.
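A minimal apply() sketch, reusing conditionalFunc from above on a single column, plus a per-row example:

# apply() works on a Series, so target one column:
df['VAL1'] = df['VAL1'].apply(conditionalFunc)

# Or run a function across each row with axis=1:
row_ranges = df.apply(lambda row: row.max() - row.min(), axis=1)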

Complexities / odds & ends

Our dataframe from previous examples...

   VAL1  VAL2  VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30
•·················Scripts start >>> ·················•

head() and tail() show you the first and last n rows, useful for taking a peek at your data...

print(df.head(2))

>>>    VAL1  VAL2  VAL3
0 97 44 93
1 27 72 75
print(df.tail(2))

>>>    VAL1  VAL2  VAL3
2 38 2 51
3 100 8 30
Computational Tools:

While you are most likely to use Pandas in conjunction with data science or statistics libraries, you can quickly calculate some complex relationships within pandas itself. Let's say you want to see the percentage change from one row to the next in all columns:

print(df.pct_change())

>>        VAL1      VAL2      VAL3
0 NaN NaN NaN
1 -0.721649 0.636364 -0.193548
2 0.407407 -0.972222 -0.320000
3 1.631579 3.000000 -0.411765
Or even a correlation matrix with df.corr() (with this data it's meaningless, though). Check out the docs on computational tools if you need statistics before heading elsewhere.
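As a taste of those computational tools, a rolling mean is a one-liner (a sketch; the window size of 2 is arbitrary):

# Average each value with the one before it:
print(df['VAL1'].rolling(window=2).mean())

>>> 0     NaN
1    62.0
2    32.5
3    69.0
Name: VAL1, dtype: float64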
Topics to explore on your own next: pivot/reshape tables, group by, sorting, cleaning data, categorical data, and dealing with large datasets.

AI, Software Developer, Designer : www.k3no.com
