
Python Pandas Basics
With Pandas?
One of the most popular libraries in Python is arguably Pandas. If you want a dry, technical and scholarly explanation, this is probably not it; I am here to give you the basics, with simple code snippets and the occasional panda cartoon. Your choice.

If I ever had to explain Pandas at a cocktail party, I would probably start by mentioning Excel and spreadsheets: you have some data that fits in columns and rows, you want to perform some operations between cells, columns and rows, and then you see the result in your spreadsheet…
Pandas is very similar: you feed it some data that fits into rows and columns, you perform some operations and you get a result. You just do all of this through commands that manipulate dataframes and series instead of spreadsheets; how you represent the information at the end is up to you (a chart, a report, back to Excel!).
In short, Pandas is a data wrangling library. Pull up a computer and let's start by installing it:
Installation depends on your environment (i.e. whether you have pip or installed conda):

pip install pandas
# or, with conda:
conda install pandas
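A quick way to confirm the install worked is to import Pandas and print its version (a sketch; your version number will differ):

```python
import pandas as pd

# If this runs without an ImportError, you are good to go:
print(pd.__version__)
```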

As mentioned, Pandas lets you feed data into a dataframe for further manipulation. The origin of said data can be many places (Excel, JSON, HTML tables, etc.); for data scientists and generalists like myself it's usually a csv file. Let's say we have some tabular data:
+-------+---------+
| LABEL | VALUE |
+-------+---------+
| AA | 233.00 |
+-------+---------+
| XZ | 4555.00 |
+-------+---------+
| DF | 667.00 |
+-------+---------+
| FS | 890.00 |
+-------+---------+

The above in csv format ( some_data.csv ):

LABEL,VALUE
AA,233.00
XZ,4555.00
DF,667.00
FS,890.00
I am cheating here by showing you the data beforehand; usually you have no idea what the data looks like until you get it into a Pandas dataframe and examine or print it:
import pandas as pd

# df is short for dataframe and is widely used, but you can name it anything you want:
df = pd.read_csv("some_data.csv")
print(df)

>> OUTPUT (DATAFRAME):

  LABEL   VALUE
0    AA   233.0
1    XZ  4555.0
2    DF   667.0
3    FS   890.0

(see also head() and tail() later in this article)
Great, we converted a csv file into a dataframe. Notice that in this case the dataframe added an extra column on the left side; this is an index and serves as a reference point for the data on the right ( more on indexes later ).
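A csv file is not the only way in; if you just want to experiment, you can build the same dataframe from an in-memory Python dict (a minimal sketch using the data above):

```python
import pandas as pd

# Same data as some_data.csv, but as a plain dict:
data = {"LABEL": ["AA", "XZ", "DF", "FS"],
        "VALUE": [233.0, 4555.0, 667.0, 890.0]}
df = pd.DataFrame(data)

print(df.shape)          # (4, 2) -> 4 rows, 2 columns
print(list(df.columns))  # ['LABEL', 'VALUE']
```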

We'll cover more complex operations later, but to round out this brief overview let me show you two very easy operations we can do on our dataframe:
# 1: Sum:
print(df.sum())

>> LABEL    AAXZDFFS
   VALUE      6345.0

Or prettier, selecting just the VALUE column:

print(f"TOTAL VALUE: {df['VALUE'].sum()}")

>> TOTAL VALUE: 6345.0

•·················•·················•

# 2: Summary statistics:
print(df.describe())

>>             VALUE
count 4.00000
mean 1586.25000
std 1997.87777
min 233.00000
25% 558.50000
50% 778.50000
75% 1806.25000
max 4555.00000
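If the describe() numbers look mysterious, they can be reproduced by hand; for instance the mean is just the sum divided by the count (a sketch using the same four values):

```python
import pandas as pd

values = pd.Series([233.0, 4555.0, 667.0, 890.0])

total = values.sum()        # 6345.0
mean = total / len(values)  # 1586.25, the same mean describe() reports
print(total, mean)
```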
The next step after reading, analyzing and operating on data is to get it out of Python into something more useful, like a report or, most likely, a plot. It's fairly easy to plot things with Pandas:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("some_data.csv")
df.plot(x='LABEL', y='VALUE')
plt.show()

To wrap this section, here are a couple of power tips:
🔌 Power Tip (1): When scraping tabular data from the internet or a document, I usually find it easier to select and copy-paste from your browser into Excel as text, and from Excel save as a csv for later manipulation in Pandas. Your alternative is read_html(), which may or may not work depending on the page.
🔌 Power Tip (2): Add some oomph and a bunch of other niceties to your plots with Seaborn; it only takes a couple of lines to get started:

import seaborn as sns
sns.set()  # call before plotting

More Features
The above examples hopefully serve as an introduction to the Pandas life, but there are a few other features worth mentioning ( Pandas can get complex ), things one might struggle with since they are not obvious at first…
Series
While the dataframe is the data structure usually associated with Pandas, it is not the only one; a series, for instance, deals with one single column of data along with an index:
Let's say you have this column ( it could also be a row ) of data:

+----+
| 20 |
+----+
| 40 |
+----+
| 60 |
+----+
| 40 |
+----+
| 20 |
+----+
| 10 |
+----+

•·················Script starts >>> ·················•

import pandas as pd

# Instead of a csv file, let's use a simple python list:
someNumbers = [20, 40, 60, 40, 20, 10]

# And feed it to pandas (note the Series word):
ser = pd.Series(someNumbers)

# Print it (note that there are no column names):
print(ser)

>> 0    20
1 40
2 60
3 40
4 20
5 10
dtype: int64

# And operate on it:
print(f"Max number is: {ser.max()}")

>> Max number is: 60
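max() is only one of many series methods; a couple more worth knowing, sketched on the same numbers:

```python
import pandas as pd

ser = pd.Series([20, 40, 60, 40, 20, 10])

print(ser.max())           # 60
print(ser.idxmax())        # 2 -> the index where the max lives
print(ser.value_counts())  # how many times each value appears
```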
The not so obvious part is that a dataframe is made out of series that share an index, with the addition of a column name for each:

This gives you the flexibility of virtually assembling and disassembling dataframes or using the series data by selecting it for further operations.
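For instance, assembling a dataframe out of two series, and taking a series back out, can be sketched like this:

```python
import pandas as pd

# Two series sharing the default integer index:
labels = pd.Series(["AA", "XZ", "DF", "FS"])
values = pd.Series([233.0, 4555.0, 667.0, 890.0])

# Assemble: each series becomes a named column:
df = pd.DataFrame({"LABEL": labels, "VALUE": values})

# Disassemble: selecting one column hands you a series again:
col = df["VALUE"]
print(type(col))  # <class 'pandas.core.series.Series'>
```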
Time Series
Pandas can also deal with data that is ordered in time ( aka time series ). While it is undeniably powerful, in my experience it can be hard to use, so don't despair if at first you don't get it or need to look up a specific answer for your case. Anyway, let's start with some date data…
+---------+-------+
| DATE | VALUE |
+---------+-------+
| 3/24/20 | 14 |
+---------+-------+
| 3/25/20 | 15 |
+---------+-------+
| 3/26/20 | 86 |
+---------+-------+
| 3/27/20 | 34 |
+---------+-------+
| 3/28/20 | 9 |
+---------+-------+

The above in csv format ( date_data.csv ):

DATE,VALUE
3/24/20,14
3/25/20,15
3/26/20,86
3/27/20,34
3/28/20,9

•·················Script starts >>> ·················•

import pandas as pd

Time_Series = pd.read_csv('date_data.csv', header=0, parse_dates=[0], index_col=0).squeeze("columns")

# A few notes: squeeze("columns") turns the single-column result into a series rather than a
# dataframe (older pandas accepted squeeze=True in read_csv itself, but that argument was
# removed in pandas 2.0), index_col=0 makes the first column the index, parse_dates=[0]
# attempts to parse the first column as dates, and header=0 says the column names are in
# the first row...

print(Time_Series)

>> DATE
2020-03-24 14
2020-03-25 15
2020-03-26 86
2020-03-27 34
2020-03-28 9

Now you might be thinking what’s the big deal, well the big deal is that we can now use dates as the index for our series and do date/time like operations:
# Give me the last 2 days of data:
print(Time_Series.last('2D'))

>> DATE
2020-03-27    34
2020-03-28     9

Note that '2D' is a new notation for selecting the last 2 days: see Offset Aliases in the pandas docs.

∘₊✧──────✧₊∘

# Something more complex: add a value at the end of our time series,
# with an offset of 4 days from the last record:
lastRecord = Time_Series.last('1D')

# This is both the date and value; we need just the date (aka index):
lastDate = lastRecord.index[0]  # 2020-03-28 00:00:00

# Now we need to add 4 days. Note it is not a simple + addition; we use pd.DateOffset:
offsetDate = lastDate + pd.DateOffset(4)

# And finally insert a value at this date:
Time_Series[offsetDate] = 100
print(Time_Series)

>> DATE
2020-03-24 14
2020-03-25 15
2020-03-26 86
2020-03-27 34
2020-04-01 100
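You don't need a csv to play with this; pd.date_range() builds a date index directly, and date-aware selection works the same (a sketch using the same five values):

```python
import pandas as pd

# Five daily timestamps starting 2020-03-24:
dates = pd.date_range("2020-03-24", periods=5, freq="D")
ts = pd.Series([14, 15, 86, 34, 9], index=dates)

# Look up a value by date string:
print(ts["2020-03-26"])  # 86

# The offset-alias selection from above works here too:
print(ts.last("2D"))
```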
You can also use dates as indexes in a dataframe with multiple columns:
date_data_multi.csv :

DATE,VALUE1,VALUE2
3/24/20,14,140
3/25/20,15,150
3/26/20,86,860
3/27/20,34,340
3/28/20,9,900

•·················Script starts >>> ·················•

Time_Series_Multi = pd.read_csv('date_data_multi.csv', header=0, parse_dates=[0], index_col=0)
print(Time_Series_Multi)

>>          VALUE1 VALUE2
DATE
2020-03-24 14 140
2020-03-25 15 150
2020-03-26 86 860
2020-03-27 34 340
2020-03-28 9 900

# And plot it:
import matplotlib.pyplot as plt

Time_Series_Multi.plot()
plt.show()
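Another payoff of a date index is resample(), which aggregates values into time buckets; a sketch with the single-column series from before, summing the data in 2-day windows:

```python
import pandas as pd

dates = pd.date_range("2020-03-24", periods=5, freq="D")
ts = pd.Series([14, 15, 86, 34, 9], index=dates)

# Downsample the daily values into 2-day buckets and sum each bucket:
print(ts.resample("2D").sum())
# 2020-03-24 -> 29 (14 + 15), 2020-03-26 -> 120 (86 + 34), 2020-03-28 -> 9
```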

Selecting
Another thing that might not be clear at first is how to select a subset of data from a dataframe or series. I'll try covering a few common cases, but know that there are multiple notations; this is just one set of them:
Our dataset:

+------+------+------+
| VAL1 | VAL2 | VAL3 |
+------+------+------+
| 97 | 44 | 93 |
+------+------+------+
| 27 | 72 | 75 |
+------+------+------+
| 38 | 2 | 51 |
+------+------+------+
| 100 | 8 | 30 |
+------+------+------+

some_data_2.csv:

VAL1,VAL2,VAL3
97,44,93
27,72,75
38,2,51
100,8,30

•·················Script starts >>> ·················•

import pandas as pd

df = pd.read_csv("some_data_2.csv")
print(df)

>> VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30
⚠️ Note the index was added for us, but we can also define a column to be the index. If, for instance, we want the first column to be our index:

df = pd.read_csv("some_data_2.csv", index_col=[0])
print(df)

>>    VAL2 VAL3
VAL1
97 44 93
27 72 75
38 2 51
100 8 30

We'll use the first example...
Our dataframe once more for reference..
VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30•·················Script starts >>> ·················•# SELECT 3rd ROW with Index 2:
row_we_want = df.loc[2]print(row_we_want)
>>>VAL1 38
VAL2 2
VAL3 51
Name: 2, dtype: int64

•·················||·················•

# From that ROW, select VAL2:
print(row_we_want['VAL2'])

# We can also select directly:
# row_Column = df.loc[2]['VAL2']

>> 2

•·················||·················•

# SELECT the VAL3 COLUMN:
column_we_want = df['VAL3']
print(column_we_want)

>> 0    93
1 75
2 51
3 30
Name: VAL3, dtype: int64

•·················||·················•

# FROM that column, select the 3rd value down (index 2):
print(column_we_want[2])

# We can also select directly:
# column_row = df['VAL3'][2]

>> 51

•·················||·················•

# Get only the values greater than 70, replace the rest with 'under 70':
greater_than = df.where(df > 70, 'under 70')
print(greater_than)

>> VAL1 VAL2 VAL3
0 97 under 70 93
1 under 70 72 75
2 under 70 under 70 under 70
3 100 under 70 under 70
Note: If you want the raw values of a dataframe you can use the values / to_numpy() methods; here's a nested list for instance:

print(greater_than.values.tolist())

>> [[97, 'under 70', 93], ['under 70', 72, 75], ['under 70', 'under 70', 'under 70'], [100, 'under 70', 'under 70']]
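One more selection pattern worth knowing is boolean masking: filtering rows by a condition instead of by position (a sketch on the same dataframe, built inline so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({"VAL1": [97, 27, 38, 100],
                   "VAL2": [44, 72, 2, 8],
                   "VAL3": [93, 75, 51, 30]})

# Keep only the rows where VAL1 is greater than 50:
big = df[df["VAL1"] > 50]
print(big)  # rows 0 and 3 survive
```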

Operations
So we now know the basics of loading and selecting data. What you do after this depends on what your specific project requires, but let's see what some common operations look like:
You can do dataframe wide operations, for instance:

add()
sub()
mul()
div()

Our dataframe once more for reference..

 VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30

•·················Script starts >>> ·················•

# ADD 10 to every value…
print(df.add(10))

>> VAL1 VAL2 VAL3
0 107 54 103
1 37 82 85
2 48 12 61
3 110 18 40

Same logic applies with the other operations; try them out.

•·················||·················•

But maybe you are more interested in column/row operations, a popular choice:

# Sum of columns ( you can also use sum(), or the usual +, -, /, * operators ):
sumColumns = df['VAL1'] + df['VAL2'] + df['VAL3']
print(sumColumns)

>> 0    234
1 174
2 91
3    138

# Most likely you want this as a column in your dataframe, so let's add it:
df['SUM'] = sumColumns
print(df)

>> VAL1 VAL2 VAL3 SUM
0 97 44 93 234
1 27 72 75 174
2 38 2 51 91
3 100 8 30 138

•·················||·················•

# Row operations (using the above dataframe):
sumRows = df.loc[0] + df.loc[1] + df.loc[2] + df.loc[3]
print(sumRows)

>> VAL1    262
VAL2 126
VAL3 249
SUM     637

# Add this to the bottom as a SUM row:
df.loc['SUM'] = sumRows
print(df)

>> VAL1 VAL2 VAL3 SUM
0 97 44 93 234
1 27 72 75 174
2 38 2 51 91
3 100 8 30 138
SUM 262 126 249 637

•·················||·················•

Advanced operations with applymap() ( i.e. you have a function you want to apply to every value in your dataframe ):

# Function to use ( you can also add arguments ):
def conditionalFunc(val):
    if val > 50:
        return val*2
    else:
        return val*4

df = df.applymap(conditionalFunc)  # note: in pandas >= 2.1, applymap() was renamed to DataFrame.map()

---+ Original +---

 VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30

---+ After applymap +---

 VAL1 VAL2 VAL3
0 194 176 186
1 108 144 150
2 152 8 102
3 200 32 120

Also see apply() for a single column/row/series.
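As a contrast with applymap(), here is the same conditionalFunc applied with apply() to just one column (a sketch; the dataframe is rebuilt inline so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({"VAL1": [97, 27, 38, 100],
                   "VAL2": [44, 72, 2, 8],
                   "VAL3": [93, 75, 51, 30]})

def conditionalFunc(val):
    if val > 50:
        return val * 2
    else:
        return val * 4

# apply() runs the function over a single series, here the VAL1 column:
df["VAL1"] = df["VAL1"].apply(conditionalFunc)
print(df["VAL1"].tolist())  # [194, 108, 152, 200]
```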

Complexities / odds & ends
Before wrapping things up, here are a few things worth mentioning in this overview:
Our dataframe from previous examples...

 VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75
2 38 2 51
3 100 8 30

•·················Scripts start >>> ·················•

# head() and tail() show you the first and last n rows, useful for taking a peek at your data...
print(df.head(2))

>> VAL1 VAL2 VAL3
0 97 44 93
1 27 72 75

print(df.tail(2))

>> VAL1 VAL2 VAL3
2 38 2 51
3 100 8 30
Computational Tools:

While you are most likely to use Pandas in conjunction with data science or statistics libraries, you can quickly calculate some complex relationships within Pandas itself. Let's say you want to see the percentage change from one row to the next in all columns:

print(df.pct_change())

>> VAL1 VAL2 VAL3
0 NaN NaN NaN
1 -0.721649 0.636364 -0.193548
2 0.407407 -0.972222 -0.320000
3  1.631579  3.000000 -0.411765

Or even a correlation matrix with df.corr() ( with this data it's meaningless though ). Check out the docs on computational tools if you need statistics before heading elsewhere.
And of course there are a ton of functions and features I didn't cover so as not to overwhelm you. Think of what we just covered as the tip of a sometimes deep iceberg; or, going back to Excel: things you can do in Excel you can probably also do in Pandas. Some specific examples:

Pivot / reshape tables
Group by
Sort
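A quick sketch of two of those Excel-like operations, group by and sort (the GROUP/VALUE data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"GROUP": ["A", "B", "A", "B"],
                   "VALUE": [10, 20, 30, 40]})

# Group by, then aggregate each group (like a pivot table's row totals):
print(df.groupby("GROUP")["VALUE"].sum())  # A -> 40, B -> 60

# Sort the whole dataframe by a column, largest first:
print(df.sort_values("VALUE", ascending=False))
```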
Then there are the demands of your specific project; if you are doing data science, you will most likely need to become proficient in some niche aspects of Pandas:
Cleaning Data
Categorical Data
Dealing with large Datasets
You are in a good place to follow the official documentation, where you will expand on the core concepts we've just seen; after that, the user guide and tutorials are next.
I sincerely hope this helped you in going from zero to some Pandas.

Thanks for reading.