In my brief experience working with data, I have always found it interesting, from creating and mapping data to feed a neural network to building visualizations that demonstrate an idea. This note will serve as the base of my learning as I come across new concepts.
Pandas
DataFrame Basics
Passing a dictionary to the DataFrame constructor creates a table-like structure where the keys are the column labels and the values are the column data (each column is a Series). Let’s look at the following example:
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["x", "y", "z"],
        "C": [10.5, 20.1, 30.2],
        "D": [1, "two", 3.0],  # mixed types in one column
    },
    index=["row1", "row2", "row3"],  # custom row labels
)

Some observations:
- Accessing df["A"] returns a Series of integers.
- Roughly speaking, data is stored by column. There is no stored row object, so a row is constructed on the fly when you ask for one.
- If no index is given, pandas autogenerates integer row labels from 0 to n-1, which can be used with loc.
- You can specify custom row labels with index=.
- Column "D" in this case receives the object data type because its values are heterogeneous, which may have performance implications.
Inspecting Data
- df.head(): first 5 rows
- df.tail(): last 5 rows
- df.shape: tuple of the form (rows, cols)
- df.info(): column data types and non-null counts
- df.describe(): summary stats for numeric columns
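A quick sketch of the inspection calls on a three-row frame. The frame here is a trimmed version of the earlier example (without the mixed-type column), and the variable names are my own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["x", "y", "z"],
        "C": [10.5, 20.1, 30.2],
    }
)

shape = df.shape         # a tuple: (number of rows, number of columns)
first_two = df.head(2)   # head and tail accept a row count; the default is 5
last_one = df.tail(1)
summary = df.describe()  # by default only the numeric columns A and C

print(shape)
print(first_two)
print(last_one)
print(summary)

# info() prints its report directly and returns None.
df.info()
```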
Selecting Data
- df["A"]: Series (single column)
- df[["A", "D"]]: DataFrame (multiple columns)
- df.iloc[0]: row by integer position
- df.loc[0]: row by label (the label is an integer only with the default index; with the custom index above it would be df.loc["row1"])
- df.loc[0, "A"]: element at row 0, column "A"
- df[df["A"] > 1]: filter rows
- df.loc[df["A"] > 1]: filter rows with loc
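The difference between iloc and loc is easiest to see with custom row labels. This is a minimal sketch using a two-column version of the earlier frame; label_zero_found is a made-up flag for the demonstration.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["x", "y", "z"]},
    index=["row1", "row2", "row3"],
)

single = df["A"]            # Series
subset = df[["A", "B"]]     # DataFrame
by_position = df.iloc[0]    # first row, whatever its label is
by_label = df.loc["row1"]   # the row labelled "row1"
cell = df.loc["row1", "A"]  # a single value

# With string labels, 0 is not a valid label, so loc raises KeyError.
try:
    df.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_position.equals(by_label))
print(cell)
print(label_zero_found)
```

With the default index the labels happen to be 0, 1, 2, which is why df.loc[0] and df.iloc[0] look interchangeable until a custom index is involved.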
Additional reading: The last two examples are a little difficult to decipher and require a stronger foundation in how df[] and df.loc differ. Take the following simplification with a grain of salt: loc is capable of more and is generally recommended for most usage.

Looking at df[df["A"] > 1]: Here, df["A"] returns a column of values. Each value in this column is compared against 1, which produces a boolean Series. For A = [1, 2, 3] this is [False, True, True]. When df[] receives a boolean mask like this, pandas keeps only the rows where the mask is True. So this expression selects all rows where the value of column A is greater than 1.

This SQL analogy may help:
SELECT * FROM df WHERE A > 1;
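To make the mask visible, this sketch splits the filter into two steps. It uses the default integer index, and the names mask, filtered, filtered_loc and only_b are my own.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# Step 1: the comparison produces a boolean Series aligned with the rows.
mask = df["A"] > 1

# Step 2: both forms keep only the rows where the mask is True.
filtered = df[mask]
filtered_loc = df.loc[mask]

# loc can also pick columns in the same step, which df[] cannot.
only_b = df.loc[mask, "B"]

print(mask.tolist())
print(filtered)
print(only_b.tolist())
```

The surviving rows keep their original labels (1 and 2 here) rather than being renumbered.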
Modifying Data
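- df["D"] = df["A"] * 2: add a column computed from another (overwrites "D" if it already exists)
- df["E"] = df["B"].str.upper(): add a column using a string method
- df.drop(columns="C", inplace=True): remove column(s)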
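A short sketch of the same modifications on a fresh frame. Here I use drop without inplace=True so the returned frame can be compared with the original; trimmed is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["x", "y", "z"],
        "C": [10.5, 20.1, 30.2],
    }
)

# Arithmetic on a Series is applied element by element.
df["D"] = df["A"] * 2

# The .str accessor applies string methods to every value in the column.
df["E"] = df["B"].str.upper()

# Without inplace=True, drop returns a new frame and leaves df untouched.
trimmed = df.drop(columns="C")

print(df)
print(trimmed.columns.tolist())
```

Reassigning the result (df = df.drop(columns="C")) has the same effect as inplace=True for this case.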
Common Operations
df["A"].mean()
df["C"].sum()
df.max()
df.sort_values("C", ascending=False)
df.rename(columns={"A":"Alpha"}, inplace=True)
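The operations above, run on a small frame so the return types are visible. This is a sketch under a couple of assumptions: I pass numeric_only=True to max so the string column is skipped, and I use rename without inplace=True so the original frame stays unchanged. The variable names are my own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["x", "y", "z"],
        "C": [10.5, 20.1, 30.2],
    }
)

mean_a = df["A"].mean()                # a single number
total_c = df["C"].sum()                # a single number
col_max = df.max(numeric_only=True)    # a Series with one maximum per numeric column
by_c_desc = df.sort_values("C", ascending=False)  # a new, sorted DataFrame
renamed = df.rename(columns={"A": "Alpha"})       # a new DataFrame with the new label

print(mean_a)
print(total_c)
print(col_max)
print(by_c_desc)
print(renamed.columns.tolist())
```

Sorting keeps the original row labels attached to their rows, so the index of by_c_desc is no longer in ascending order.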