In my brief experience working with data, I have always found it interesting, from creating and mapping data to feed a neural network to building visualizations that demonstrate an idea. This note will serve as the base of my learning as I come across new concepts.

Pandas

DataFrame Basics

Passing in a dictionary when constructing a DataFrame will create a table-like structure where the keys are the column labels, and values are column data (of type Series). Let’s look at the following example:

import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["x", "y", "z"],
        "C": [10.5, 20.1, 30.2],
        "D": [1, "two", 3.0],  # mixed types on purpose
    },
    index=["row1", "row2", "row3"],  # custom row labels
)

Some observations:

  • Accessing df["A"] will return a Series of integers.
  • Data is stored column-wise internally, so a row is not a stored object; it is assembled on the fly when you ask for one.
  • If no index is given, pandas autogenerates integer row labels from 0 to n-1, which loc can then use.
  • You can specify custom row labels with index=
  • Column "D" in this case will receive the object data type due to heterogeneous data, which may have performance implications.
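To check these observations for myself, here is a small sketch. It rebuilds the same DataFrame and adds a second one, default_df, which is my own name for a frame created without index= and is not part of the example above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["x", "y", "z"],
        "C": [10.5, 20.1, 30.2],
        "D": [1, "two", 3.0],
    },
    index=["row1", "row2", "row3"],
)

# Each column is a Series with its own dtype.
col_a = df["A"]
print(type(col_a).__name__)  # Series
print(df.dtypes)             # A is an integer dtype, C is float64, D is object

# Without index=, pandas falls back to integer labels 0..n-1.
# default_df is a hypothetical second frame, used only for this comparison.
default_df = pd.DataFrame({"A": [1, 2, 3]})
print(list(default_df.index))  # [0, 1, 2]
```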

Inspecting Data

  • df.head(): first 5 rows (by default)
  • df.tail(): last 5 rows (by default)
  • df.shape: tuple of the form (rows, cols)
  • df.info(): column data types, non-null counts, and memory usage
  • df.describe(): summary stats for numeric columns
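A quick sketch of these inspection calls on a small numeric frame. The two-column frame here is an assumption made to keep the output short; the same calls work on the larger example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "C": [10.5, 20.1, 30.2]})

rows, cols = df.shape          # shape is a tuple, so it unpacks directly
print(rows, cols)              # 3 2

print(df.head(2))              # head/tail accept a row count; the default is 5
stats = df.describe()          # summary statistics, returned as a DataFrame
print(stats.loc["mean", "A"])  # 2.0
```

Since describe() returns a DataFrame, its result can be indexed with loc like any other frame.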

Selecting Data

  • df["A"]: Series (single column)
  • df[["A", "D"]]: DataFrame (multiple columns)
  • df.iloc[0]: row by integer position
  • df.loc[0]: row by label (labels are the integers 0 to n-1 by default; with the custom index from the example, use df.loc["row1"])
  • df.loc[0, "A"]: element at row label 0, column "A"
  • df[df["A"] > 1]: filter rows
  • df.loc[df["A"] > 1]: filter rows with loc
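The difference between position and label only shows up once the index is not the default one. A minimal sketch, assuming the string row labels from the first example; the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["x", "y", "z"]},
    index=["row1", "row2", "row3"],
)

first_by_position = df.iloc[0]   # first row, whatever its label is
first_by_label = df.loc["row1"]  # same row, looked up by label
value = df.loc["row1", "A"]      # single element: row label, column label

# With string labels there is no label 0, so df.loc[0] raises KeyError.
try:
    df.loc[0]
    loc_zero_failed = False
except KeyError:
    loc_zero_failed = True

print(value, loc_zero_failed)  # 1 True
```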

Additional reading: The last two examples are a little difficult to decipher and require a stronger understanding of how df[] and df.loc[] differ. Take the following simplification with a grain of salt: loc is capable of more and is generally recommended for most usage.

Looking at df[df["A"] > 1]: Here, df["A"] returns a column of values, and the comparison val > 1 is applied to each value in it. The result is a boolean Series; for the example above it is [False, True, True]. When df[] receives a boolean mask like this, pandas keeps only the rows where the mask is True. So this expression selects all rows where the value of column A is greater than 1.
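Splitting the expression into two steps makes the mask visible. This is a sketch on a reduced version of the example frame; mask, filtered, same_rows and a_only are names I picked for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

mask = df["A"] > 1             # boolean Series, one entry per row
print(mask.tolist())           # [False, True, True]

filtered = df[mask]            # keeps rows where the mask is True
same_rows = df.loc[mask]       # equivalent selection via loc
a_only = df.loc[mask, "A"]     # loc can also pick columns in the same step
print(filtered["A"].tolist())  # [2, 3]
```

The last line is one concrete case of loc being "capable of more": it filters rows and selects a column in a single call.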

This SQL analogy may help: SELECT * FROM df WHERE A > 1;

Modifying Data

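  • df["D"] = df["A"] * 2: add a column computed from another one (overwrites "D" if it already exists)
  • df["E"] = df["B"].str.upper(): add a column holding the uppercase version of the strings in "B"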
  • df.drop(columns="C", inplace=True): remove column(s)
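The three modifications can be tried together as follows. One deliberate difference from the list above: the sketch reassigns the result of drop instead of passing inplace=True, which is simply the variant I find easier to follow.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10.5, 20.1, 30.2]}
)

df["D"] = df["A"] * 2          # element-wise arithmetic on a column
df["E"] = df["B"].str.upper()  # vectorised string method via .str
df = df.drop(columns="C")      # drop returns a new DataFrame without "C"

print(list(df.columns))        # ['A', 'B', 'D', 'E']
print(df["D"].tolist())        # [2, 4, 6]
print(df["E"].tolist())        # ['X', 'Y', 'Z']
```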

Common Operations

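  • df["A"].mean(): mean of column "A"
  • df["C"].sum(): sum of column "C"
  • df.max(): maximum of each column
  • df.sort_values("C", ascending=False): rows sorted by "C", largest first (returns a new DataFrame)
  • df.rename(columns={"A": "Alpha"}, inplace=True): rename column "A" to "Alpha" in place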
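A short sketch of these operations on a numeric-only frame, which is an assumption made so that every call applies cleanly. Here rename is used without inplace=True, so the original frame stays untouched.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "C": [10.5, 20.1, 30.2]})

mean_a = df["A"].mean()                      # scalar mean of one column
col_max = df.max()                           # Series: maximum of each column
by_c = df.sort_values("C", ascending=False)  # new DataFrame, largest C first
renamed = df.rename(columns={"A": "Alpha"})  # new DataFrame; df is unchanged

print(mean_a)                 # 2.0
print(by_c["A"].tolist())     # [3, 2, 1]
print(list(renamed.columns))  # ['Alpha', 'C']
```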