Pandas备忘录

记录数据分析中常用的 Pandas 函数和一些实用技巧。
数据分析
Author

Tom

Published

January 16, 2024

import pandas as pd

print(pd.__version__)
2.2.1

读取 CSV 文件

pandas.read_csv

  • dtype:指定数据类型,比如指定股票代码这一列为 str 类型
data = pd.read_csv("./stocks.csv", dtype={"secCode":str})
  • parse_dates:尝试将某一列解析为日期。设定为 True,则解析索引。也可以使用列表指定解析某一列
data = pd.read_csv("./stocks.csv", parse_dates=[1])        # 解析第二列
data = pd.read_csv("./stocks.csv", pares_dates=["date"])   # 解析 date 列
  • on_bad_lines:{‘error’, ‘warn’, ‘skip’}

修改列名/索引

pandas.DataFrame.rename

  • mappermapperaxis 搭配使用
  • columns:修改列名
  • index:修改索引
  • inplace:是否原地修改
  • axis
data = pd.DataFrame(
    {"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]}, index=["x", "y", "z"]
)

data
A B C
x 1 2 3
y 2 3 4
z 3 4 5
data.rename(columns={"A": "a", "B": "b"})
a b C
x 1 2 3
y 2 3 4
z 3 4 5
data.rename(index={"x": "X", "y": "Y"})
A B C
X 1 2 3
Y 2 3 4
z 3 4 5
data.rename(mapper=lambda c: c.lower(), axis=1)
a b c
x 1 2 3
y 2 3 4
z 3 4 5

插入/修改列

pandas.DataFrame.insert

  • loc:插入的索引位置
  • column:列名
  • value:Scalar/Series/array-like,插入列的内容
data = pd.DataFrame(
    {"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]}, index=["x", "y", "z"]
)

data
A B C
x 1 2 3
y 2 3 4
z 3 4 5
data.insert(0, column="D", value=0)
data
D A B C
x 0 1 2 3
y 0 2 3 4
z 0 3 4 5
data.insert(0, column="E", value=[6, 6, 6])
data
E D A B C
x 6 0 1 2 3
y 6 0 2 3 4
z 6 0 3 4 5

pandas.DataFrame.assign

data = pd.DataFrame(
    {"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]}, index=["x", "y", "z"]
)

data
A B C
x 1 2 3
y 2 3 4
z 3 4 5
data.assign(D=lambda df: df.C + 1)
A B C D
x 1 2 3 4
y 2 3 4 5
z 3 4 5 6
data.assign(D=lambda df: df.C + 1, E=lambda df: df.D + 1)
A B C D E
x 1 2 3 4 5
y 2 3 4 5 6
z 3 4 5 6 7

loc/iloc

选择已有列名则修改原始数据;新定义列名则插入新的数据

data = pd.DataFrame(
    {"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]}, index=["x", "y", "z"]
)

data
A B C
x 1 2 3
y 2 3 4
z 3 4 5
data.loc[:, "D"] = 0
data
A B C D
x 1 2 3 0
y 2 3 4 0
z 3 4 5 0
data.loc[:, "D"] = 1
data
A B C D
x 1 2 3 1
y 2 3 4 1
z 3 4 5 1

插入/修改行

loc/iloc

data = pd.DataFrame(
    {"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]}, index=["x", "y", "z"]
)

data
A B C
x 1 2 3
y 2 3 4
z 3 4 5
data.loc["x"] = 0
data
A B C
x 0 0 0
y 2 3 4
z 3 4 5
data.loc["m"] = 1
data
A B C
x 0 0 0
y 2 3 4
z 3 4 5
m 1 1 1

删除行/列

pandas.DataFrame.drop

data = pd.DataFrame(
    {"A": [1, 2, 3], "B": [2, 3, 4], "C": [3, 4, 5]}, index=["x", "y", "z"]
)

data
A B C
x 1 2 3
y 2 3 4
z 3 4 5
data.drop(labels=["A", "B"], axis=1)
C
x 3
y 4
z 5
data.drop(columns=["A", "B"])
C
x 3
y 4
z 5
data.drop(labels=["x"], axis=0)
A B C
y 2 3 4
z 3 4 5
data.drop(index=["x"], inplace=True)
data
A B C
y 2 3 4
z 3 4 5

设置/重置索引

pandas.DataFrame.set_index

  • keys:label or array-like or list of labels/arrays
  • drop:是否删除作为索引的列,默认为 True
  • append:Whether to append columns to existing index, default False.
  • inplace
data = pd.DataFrame(
    {"month": [1, 4, 7, 10], "year": [2012, 2014, 2013, 2014], "sale": [55, 40, 84, 31]}
)

data
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31
data.set_index("month", drop=False)
month year sale
month
1 1 2012 55
4 4 2014 40
7 7 2013 84
10 10 2014 31
data.set_index(["month", "year"])
sale
month year
1 2012 55
4 2014 40
7 2013 84
10 2014 31
data.set_index("month", append=True)
year sale
month
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31

pandas.DataFrame.reset_index

data = pd.DataFrame(
    [("bird", 389.0), ("bird", 24.0), ("mammal", 80.5), ("mammal", 78)],
    index=["falcon", "parrot", "lion", "monkey"],
    columns=("class", "max_speed"),
)

data
class max_speed
falcon bird 389.0
parrot bird 24.0
lion mammal 80.5
monkey mammal 78.0
data.reset_index()
index class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal 78.0
data.reset_index(drop=True)
class max_speed
0 bird 389.0
1 bird 24.0
2 mammal 80.5
3 mammal 78.0

字符串方法

pandas.Series.str.split

这个方法我常用在需要从日期中提取年份时,比如:

data = pd.DataFrame(
    {"ticker": ["000001", "000018", "600201"],
     "date": ["2018-08-03", "2019-02-12", "2020-12-31"]},
)

data
ticker date
0 000001 2018-08-03
1 000018 2019-02-12
2 600201 2020-12-31
# 提取年份,expand=True
data.loc[:, "year"] = data.date.str.split("-", expand=True, n=1).iloc[:,0]
data
ticker date year
0 000001 2018-08-03 2018
1 000018 2019-02-12 2019
2 600201 2020-12-31 2020

pandas.series.str.zfill

在读取股价数据时,有时候股票代码前面的 0 缺失,可以使用这个方法来填充。

data = pd.DataFrame(
    {"ticker": ["1", "18", "600201"],
     "date": ["2018-08-03", "2019-02-12", "2020-12-31"]},
)

data
ticker date
0 1 2018-08-03
1 18 2019-02-12
2 600201 2020-12-31
data.loc[:, "ticker"] = data.ticker.str.zfill(6)
data
ticker date
0 000001 2018-08-03
1 000018 2019-02-12
2 600201 2020-12-31