Ten Minutes to Pandas

pandas is a Python data analysis library.

For installation, see the tutorial on the official website; if you have Anaconda installed, pandas is already included and does not need to be installed separately.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Creating Pandas Objects

pandas has three basic data structures:

  • Series
    • 1D labeled homogeneously-typed array
  • DataFrame
    • General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
  • Panel
    • General 3D labeled, also size-mutable array (deprecated and removed in later versions of pandas)

Series

A one-dimensional Series can be initialized from a one-dimensional list:

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])

print(s)
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

By default the index of a Series consists of integers (a custom index can be supplied with an extra parameter), and all values share a single dtype.
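
For example, a minimal sketch of supplying the index explicitly (the labels 'a', 'b', 'c' below are arbitrary, chosen only for illustration):

s2 = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
print(s2)   # the labels 'a', 'b', 'c' replace the default 0, 1, 2; the dtype is int64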

DataFrame

A DataFrame is a two-dimensional structure. Here we first build a range of dates to serve as the index along the first dimension:

In [3]:
dates = pd.date_range('20130101', periods=6)

print(dates)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Then create a DataFrame:

In [4]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df
Out[4]:
A B C D
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584
2013-01-02 -0.165408 0.388338 1.187187 1.819818
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067
2013-01-04 1.289305 0.497115 -0.225351 0.040239
2013-01-05 0.038232 0.875057 -0.092526 0.934432
2013-01-06 -2.163453 -0.010279 1.699886 1.291653

By default, if the index and columns arguments are not given, both are filled with integer labels starting from 0.
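
A quick check of that default behaviour (a minimal sketch; df0 is just a throwaway name):

df0 = pd.DataFrame(np.random.randn(2, 3))   # no index or columns supplied
print(df0.index)    # integer labels 0, 1
print(df0.columns)  # integer labels 0, 1, 2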

Besides passing a two-dimensional array to the DataFrame, we can also pass in data as a dict:

In [5]:
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })

df2
Out[5]:
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo

Each key of the dict becomes a column, and its value can be any object that can be converted into a Series.

Unlike a Series, which requires all of its values to share one type, a DataFrame only requires that each column have a uniform type:

In [6]:
df2.dtypes
Out[6]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing Data

Head and Tail

The head and tail methods show the first and the last few rows of the data respectively (5 by default):

In [7]:
df.head()
Out[7]:
A B C D
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584
2013-01-02 -0.165408 0.388338 1.187187 1.819818
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067
2013-01-04 1.289305 0.497115 -0.225351 0.040239
2013-01-05 0.038232 0.875057 -0.092526 0.934432

The last 3 rows:

In [8]:
df.tail(3)
Out[8]:
A B C D
2013-01-04 1.289305 0.497115 -0.225351 0.040239
2013-01-05 0.038232 0.875057 -0.092526 0.934432
2013-01-06 -2.163453 -0.010279 1.699886 1.291653

Index, Columns, and Values

The row index is accessed through the index attribute:

In [9]:
df.index
Out[9]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

The column labels are accessed through the columns attribute:

In [10]:
df.columns
Out[10]:
Index([u'A', u'B', u'C', u'D'], dtype='object')

The underlying data is accessed through values:

In [11]:
df.values
Out[11]:
array([[-0.60593585, -0.86165752, -1.00192387,  1.52858443],
       [-0.16540784,  0.38833783,  1.18718697,  1.81981793],
       [ 0.06525454, -1.60807414, -1.2823306 , -0.28606716],
       [ 1.28930486,  0.49711531, -0.22535143,  0.04023897],
       [ 0.03823179,  0.87505664, -0.0925258 ,  0.93443212],
       [-2.16345271, -0.01027865,  1.69988608,  1.29165337]])

Summary Statistics

View a quick statistical summary of the data:

In [12]:
df.describe()
Out[12]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.257001 -0.119917 0.047490 0.888110
std 1.126657 0.938705 1.182629 0.841529
min -2.163453 -1.608074 -1.282331 -0.286067
25% -0.495804 -0.648813 -0.807781 0.263787
50% -0.063588 0.189030 -0.158939 1.113043
75% 0.058499 0.469921 0.867259 1.469352
max 1.289305 0.875057 1.699886 1.819818

Transpose

In [13]:
df.T
Out[13]:
2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A -0.605936 -0.165408 0.065255 1.289305 0.038232 -2.163453
B -0.861658 0.388338 -1.608074 0.497115 0.875057 -0.010279
C -1.001924 1.187187 -1.282331 -0.225351 -0.092526 1.699886
D 1.528584 1.819818 -0.286067 0.040239 0.934432 1.291653

Sorting

The sort_index(axis=0, ascending=True) method sorts by the axis labels; axis=0 sorts along the row index.

In [14]:
df.sort_index(ascending=False)
Out[14]:
A B C D
2013-01-06 -2.163453 -0.010279 1.699886 1.291653
2013-01-05 0.038232 0.875057 -0.092526 0.934432
2013-01-04 1.289305 0.497115 -0.225351 0.040239
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067
2013-01-02 -0.165408 0.388338 1.187187 1.819818
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584
In [15]:
df.sort_index(axis=1, ascending=False)
Out[15]:
D C B A
2013-01-01 1.528584 -1.001924 -0.861658 -0.605936
2013-01-02 1.819818 1.187187 0.388338 -0.165408
2013-01-03 -0.286067 -1.282331 -1.608074 0.065255
2013-01-04 0.040239 -0.225351 0.497115 1.289305
2013-01-05 0.934432 -0.092526 0.875057 0.038232
2013-01-06 1.291653 1.699886 -0.010279 -2.163453

The sort_values(by, axis=0, ascending=True) method sorts by the values of the column given by by, for example by column B:

In [16]:
df.sort_values(by="B")
Out[16]:
A B C D
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584
2013-01-06 -2.163453 -0.010279 1.699886 1.291653
2013-01-02 -0.165408 0.388338 1.187187 1.819818
2013-01-04 1.289305 0.497115 -0.225351 0.040239
2013-01-05 0.038232 0.875057 -0.092526 0.934432

Indexing

Although a DataFrame supports standard Python/NumPy indexing syntax, the recommended accessors are .at, .iat, .loc, .iloc and .ix (note that .ix has since been deprecated and removed in newer pandas releases).

Getting Data

Selecting a single column:

In [17]:
df["A"]
Out[17]:
2013-01-01   -0.605936
2013-01-02   -0.165408
2013-01-03    0.065255
2013-01-04    1.289305
2013-01-05    0.038232
2013-01-06   -2.163453
Freq: D, Name: A, dtype: float64

The same column can also be accessed with df.A:

In [18]:
df.A
Out[18]:
2013-01-01   -0.605936
2013-01-02   -0.165408
2013-01-03    0.065255
2013-01-04    1.289305
2013-01-05    0.038232
2013-01-06   -2.163453
Freq: D, Name: A, dtype: float64

Selecting multiple rows with a slice:

In [19]:
df[0:3]
Out[19]:
A B C D
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584
2013-01-02 -0.165408 0.388338 1.187187 1.819818
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067

Slicing also works with index labels:

In [20]:
df["20130101":"20130103"]
Out[20]:
A B C D
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584
2013-01-02 -0.165408 0.388338 1.187187 1.819818
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067

Selection by Label

loc makes it convenient to index by label:

In [21]:
df.loc[dates[0]]
Out[21]:
A   -0.605936
B   -0.861658
C   -1.001924
D    1.528584
Name: 2013-01-01 00:00:00, dtype: float64

Multiple columns:

In [22]:
df.loc[:,['A','B']]
Out[22]:
A B
2013-01-01 -0.605936 -0.861658
2013-01-02 -0.165408 0.388338
2013-01-03 0.065255 -1.608074
2013-01-04 1.289305 0.497115
2013-01-05 0.038232 0.875057
2013-01-06 -2.163453 -0.010279

Selecting multiple rows and columns:

In [23]:
df.loc['20130102':'20130104',['A','B']]
Out[23]:
A B
2013-01-02 -0.165408 0.388338
2013-01-03 0.065255 -1.608074
2013-01-04 1.289305 0.497115

Reducing the dimensions of the returned object:

In [24]:
df.loc['20130102',['A','B']]
Out[24]:
A   -0.165408
B    0.388338
Name: 2013-01-02 00:00:00, dtype: float64

Getting a scalar value:

In [25]:
df.loc[dates[0],'B']
Out[25]:
-0.86165751902832299

However, a scalar value can also be obtained with at, which is faster:

In [26]:
%timeit -n100 df.loc[dates[0],'B']
%timeit -n100 df.at[dates[0],'B']

print(df.at[dates[0],'B'])
100 loops, best of 3: 329 µs per loop
100 loops, best of 3: 31.1 µs per loop
-0.861657519028

Selection by Position

iloc indexes by integer position:

In [27]:
df.iloc[3]
Out[27]:
A    1.289305
B    0.497115
C   -0.225351
D    0.040239
Name: 2013-01-04 00:00:00, dtype: float64

Contiguous slices:

In [28]:
df.iloc[3:5,0:2]
Out[28]:
A B
2013-01-04 1.289305 0.497115
2013-01-05 0.038232 0.875057

Selecting non-contiguous positions:

In [29]:
df.iloc[[1,2,4],[0,2]]
Out[29]:
A C
2013-01-02 -0.165408 1.187187
2013-01-03 0.065255 -1.282331
2013-01-05 0.038232 -0.092526

Selecting whole rows:

In [30]:
df.iloc[1:3,:]
Out[30]:
A B C D
2013-01-02 -0.165408 0.388338 1.187187 1.819818
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067

Whole columns:

In [31]:
df.iloc[:, 1:3]
Out[31]:
B C
2013-01-01 -0.861658 -1.001924
2013-01-02 0.388338 1.187187
2013-01-03 -1.608074 -1.282331
2013-01-04 0.497115 -0.225351
2013-01-05 0.875057 -0.092526
2013-01-06 -0.010279 1.699886

A scalar value:

In [32]:
df.iloc[1,1]
Out[32]:
0.3883378290420279

Of course, iat is faster for indexing a scalar value:

In [33]:
%timeit -n100 df.iloc[1,1]
%timeit -n100 df.iat[1,1]

df.iat[1,1]
100 loops, best of 3: 236 µs per loop
100 loops, best of 3: 14.5 µs per loop
Out[33]:
0.3883378290420279

Boolean Indexing

All rows where column A is greater than 0:

In [34]:
df[df.A > 0]
Out[34]:
A B C D
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067
2013-01-04 1.289305 0.497115 -0.225351 0.040239
2013-01-05 0.038232 0.875057 -0.092526 0.934432

Keeping only the values that are greater than 0 (the rest become NaN):

In [35]:
df[df > 0]
Out[35]:
A B C D
2013-01-01 NaN NaN NaN 1.528584
2013-01-02 NaN 0.388338 1.187187 1.819818
2013-01-03 0.065255 NaN NaN NaN
2013-01-04 1.289305 0.497115 NaN 0.040239
2013-01-05 0.038232 0.875057 NaN 0.934432
2013-01-06 NaN NaN 1.699886 1.291653

Filtering with the isin method:

In [36]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']

df2
Out[36]:
A B C D E
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584 one
2013-01-02 -0.165408 0.388338 1.187187 1.819818 one
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067 two
2013-01-04 1.289305 0.497115 -0.225351 0.040239 three
2013-01-05 0.038232 0.875057 -0.092526 0.934432 four
2013-01-06 -2.163453 -0.010279 1.699886 1.291653 three
In [37]:
df2[df2['E'].isin(['two','four'])]
Out[37]:
A B C D E
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067 two
2013-01-05 0.038232 0.875057 -0.092526 0.934432 four

Setting Values

In [38]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))

s1
Out[38]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

Just as with a dict, we can assign s1 directly as column F. The assignment aligns on df's existing index: index entries missing from s1 are set to NaN, and extra entries in s1 are discarded:

In [39]:
df['F'] = s1

df
Out[39]:
A B C D F
2013-01-01 -0.605936 -0.861658 -1.001924 1.528584 NaN
2013-01-02 -0.165408 0.388338 1.187187 1.819818 1
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067 2
2013-01-04 1.289305 0.497115 -0.225351 0.040239 3
2013-01-05 0.038232 0.875057 -0.092526 0.934432 4
2013-01-06 -2.163453 -0.010279 1.699886 1.291653 5

A single value can also be modified with at or iat:

In [40]:
df.at[dates[0],'A'] = 0

df
Out[40]:
A B C D F
2013-01-01 0.000000 -0.861658 -1.001924 1.528584 NaN
2013-01-02 -0.165408 0.388338 1.187187 1.819818 1
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067 2
2013-01-04 1.289305 0.497115 -0.225351 0.040239 3
2013-01-05 0.038232 0.875057 -0.092526 0.934432 4
2013-01-06 -2.163453 -0.010279 1.699886 1.291653 5
In [41]:
df.iat[0, 1] = 0

df
Out[41]:
A B C D F
2013-01-01 0.000000 0.000000 -1.001924 1.528584 NaN
2013-01-02 -0.165408 0.388338 1.187187 1.819818 1
2013-01-03 0.065255 -1.608074 -1.282331 -0.286067 2
2013-01-04 1.289305 0.497115 -0.225351 0.040239 3
2013-01-05 0.038232 0.875057 -0.092526 0.934432 4
2013-01-06 -2.163453 -0.010279 1.699886 1.291653 5

Setting an entire column:

In [42]:
df.loc[:,'D'] = np.array([5] * len(df))

df
Out[42]:
A B C D F
2013-01-01 0.000000 0.000000 -1.001924 5 NaN
2013-01-02 -0.165408 0.388338 1.187187 5 1
2013-01-03 0.065255 -1.608074 -1.282331 5 2
2013-01-04 1.289305 0.497115 -0.225351 5 3
2013-01-05 0.038232 0.875057 -0.092526 5 4
2013-01-06 -2.163453 -0.010279 1.699886 5 5

Setting the values that satisfy a condition:

In [43]:
df2 = df.copy()

df2[df2 > 0] = -df2

df2
Out[43]:
A B C D F
2013-01-01 0.000000 0.000000 -1.001924 -5 NaN
2013-01-02 -0.165408 -0.388338 -1.187187 -5 -1
2013-01-03 -0.065255 -1.608074 -1.282331 -5 -2
2013-01-04 -1.289305 -0.497115 -0.225351 -5 -3
2013-01-05 -0.038232 -0.875057 -0.092526 -5 -4
2013-01-06 -2.163453 -0.010279 -1.699886 -5 -5

Missing Data

In [44]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1

df1
Out[44]:
A B C D F E
2013-01-01 0.000000 0.000000 -1.001924 5 NaN 1
2013-01-02 -0.165408 0.388338 1.187187 5 1 1
2013-01-03 0.065255 -1.608074 -1.282331 5 2 NaN
2013-01-04 1.289305 0.497115 -0.225351 5 3 NaN

Dropping every row that contains missing data gives a new DataFrame:

In [45]:
df1.dropna(how='any')
Out[45]:
A B C D F E
2013-01-02 -0.165408 0.388338 1.187187 5 1 1

Filling in missing data:

In [46]:
df1.fillna(value=5)
Out[46]:
A B C D F E
2013-01-01 0.000000 0.000000 -1.001924 5 5 1
2013-01-02 -0.165408 0.388338 1.187187 5 1 1
2013-01-03 0.065255 -1.608074 -1.282331 5 2 5
2013-01-04 1.289305 0.497115 -0.225351 5 3 5

Checking where data is missing:

In [47]:
pd.isnull(df1)
Out[47]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True

Operations

Statistics

The mean of each column:

In [48]:
df.mean()
Out[48]:
A   -0.156012
B    0.023693
C    0.047490
D    5.000000
F    3.000000
dtype: float64

The mean of each row:

In [49]:
df.mean(1)
Out[49]:
2013-01-01    0.999519
2013-01-02    1.482023
2013-01-03    0.834970
2013-01-04    1.912214
2013-01-05    1.964153
2013-01-06    1.905231
Freq: D, dtype: float64

For operations between multiple objects, if the dimensions do not match, pandas automatically applies its broadcasting mechanism:

In [50]:
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

print(s)
2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64

Subtracting, df - s:

In [51]:
df.sub(s, axis='index')
Out[51]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.934745 -2.608074 -2.282331 4 1
2013-01-04 -1.710695 -2.502885 -3.225351 2 0
2013-01-05 -4.961768 -4.124943 -5.092526 0 -1
2013-01-06 NaN NaN NaN NaN NaN

apply

Similar to apply in R, it takes a function and by default applies it to every column:

In [52]:
df.apply(np.cumsum)
Out[52]:
A B C D F
2013-01-01 0.000000 0.000000 -1.001924 5 NaN
2013-01-02 -0.165408 0.388338 0.185263 10 1
2013-01-03 -0.100153 -1.219736 -1.097067 15 3
2013-01-04 1.189152 -0.722621 -1.322419 20 6
2013-01-05 1.227383 0.152436 -1.414945 25 10
2013-01-06 -0.936069 0.142157 0.284941 30 15

The difference between the maximum and the minimum of each column:

In [53]:
df.apply(lambda x: x.max() - x.min())
Out[53]:
A    3.452758
B    2.483131
C    2.982217
D    0.000000
F    4.000000
dtype: float64

Histogramming

In [54]:
s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
0    2
1    5
2    6
3    6
4    6
5    3
6    5
7    0
8    4
9    4
dtype: int64

The histogram (value counts):

In [55]:
print(s.value_counts())
6    3
5    2
4    2
3    1
2    1
0    1
dtype: int64

Plotting the histogram:

In [56]:
h = s.hist()

String Methods

When a Series, or a column of a DataFrame, holds strings, the .str accessor applies the usual string operations to the whole array:

In [57]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

print(s.str.lower())
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Merging

Concatenation

In [58]:
df = pd.DataFrame(np.random.randn(10, 4))

df
Out[58]:
0 1 2 3
0 -2.346373 0.105651 -0.048027 0.010637
1 -0.682198 0.943043 0.147312 -0.657871
2 0.515766 -0.768286 0.361570 1.146278
3 -0.607277 -0.003086 -1.499001 1.165728
4 -1.226279 -0.177246 -1.379631 -0.639261
5 0.807364 -1.855060 0.325968 1.898831
6 0.438539 -0.728131 -0.009924 0.398360
7 1.497457 -1.506314 -1.557624 0.869043
8 0.945985 -0.519435 -0.510359 -1.077751
9 1.597679 -0.285955 -1.060736 0.608629

The pd.concat function concatenates multiple pandas objects:

In [59]:
pieces = [df[:2], df[4:5], df[7:]]

pd.concat(pieces)
Out[59]:
0 1 2 3
0 -2.346373 0.105651 -0.048027 0.010637
1 -0.682198 0.943043 0.147312 -0.657871
4 -1.226279 -0.177246 -1.379631 -0.639261
7 1.497457 -1.506314 -1.557624 0.869043
8 0.945985 -0.519435 -0.510359 -1.077751
9 1.597679 -0.285955 -1.060736 0.608629

Database-style Joins

merge implements SQL-style join operations:

In [60]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

print(left)
print(right)
   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5
In [61]:
pd.merge(left, right, on='key')
Out[61]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

append

Appending rows to a DataFrame:

In [62]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

df
Out[62]:
A B C D
0 1.587778 -0.110297 0.602245 1.212597
1 -0.551109 0.337387 -0.220919 0.363332
2 1.207373 -0.128394 0.619937 -0.612694
3 -0.978282 -1.038170 0.048995 -0.788973
4 0.843893 -1.079021 0.092212 0.485422
5 -0.056594 1.831206 1.910864 -1.331739
6 -0.487106 -1.495367 0.853440 0.410854
7 1.830852 -0.014893 0.254025 0.197422

Append the row at position 3 to the end:

In [63]:
s = df.iloc[3]

df.append(s, ignore_index=True)
Out[63]:
A B C D
0 1.587778 -0.110297 0.602245 1.212597
1 -0.551109 0.337387 -0.220919 0.363332
2 1.207373 -0.128394 0.619937 -0.612694
3 -0.978282 -1.038170 0.048995 -0.788973
4 0.843893 -1.079021 0.092212 0.485422
5 -0.056594 1.831206 1.910864 -1.331739
6 -0.487106 -1.495367 0.853440 0.410854
7 1.830852 -0.014893 0.254025 0.197422
8 -0.978282 -1.038170 0.048995 -0.788973
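
Note that DataFrame.append has since been deprecated and removed in recent pandas releases. A rough equivalent using pd.concat (a sketch, not part of the original notebook run) is:

# turn the row Series into a one-row DataFrame, then concatenate it below df
pd.concat([df, s.to_frame().T], ignore_index=True)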

Grouping

In [64]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

df
Out[64]:
A B C D
0 foo one 0.773062 0.206503
1 bar one 1.414609 -0.346719
2 foo two 0.964174 0.706623
3 bar three 0.182239 -1.516509
4 foo two -0.096255 0.494177
5 bar two -0.759471 -0.389213
6 foo one -0.257519 -1.411693
7 foo three -0.109368 0.241862

Grouping by the values of A:

In [65]:
df.groupby('A').sum()
Out[65]:
C D
A
bar 0.837377 -2.252441
foo 1.274094 0.237472

Grouping by the values of A and B:

In [66]:
df.groupby(['A', 'B']).sum()
Out[66]:
C D
A B
bar one 1.414609 -0.346719
three 0.182239 -1.516509
two -0.759471 -0.389213
foo one 0.515543 -1.205191
three -0.109368 0.241862
two 0.867919 1.200800

Reshaping

Stack

Create a DataFrame with a MultiIndex:

In [67]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df
Out[67]:
A B
first second
bar one -0.109174 0.958551
two -0.254743 -0.975924
baz one -0.132039 -0.119009
two 0.587063 -0.819037
foo one -0.754123 0.430747
two -0.426544 0.389822
qux one -0.382501 -0.562910
two -0.529287 0.826337

The stack method moves the columns into a new level of the index:

In [68]:
df2 = df[:4]

stacked = df2.stack()

stacked
Out[68]:
first  second   
bar    one     A   -0.109174
               B    0.958551
       two     A   -0.254743
               B   -0.975924
baz    one     A   -0.132039
               B   -0.119009
       two     A    0.587063
               B   -0.819037
dtype: float64

unstack() puts the last level of the index back into the columns:

In [69]:
stacked.unstack()
Out[69]:
A B
first second
bar one -0.109174 0.958551
two -0.254743 -0.975924
baz one -0.132039 -0.119009
two 0.587063 -0.819037

Other levels can also be specified:

In [70]:
stacked.unstack(1)
Out[70]:
second one two
first
bar A -0.109174 -0.254743
B 0.958551 -0.975924
baz A -0.132039 0.587063
B -0.119009 -0.819037

Time Series

Time series data is frequently used in financial analysis:

In [71]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)

ts
Out[71]:
2012-03-06    1.096788
2012-03-07    0.029678
2012-03-08    0.511461
2012-03-09   -0.332369
2012-03-10    1.720321
Freq: D, dtype: float64

Representing the times in UTC:

In [72]:
ts_utc = ts.tz_localize('UTC')

ts_utc
Out[72]:
2012-03-06 00:00:00+00:00    1.096788
2012-03-07 00:00:00+00:00    0.029678
2012-03-08 00:00:00+00:00    0.511461
2012-03-09 00:00:00+00:00   -0.332369
2012-03-10 00:00:00+00:00    1.720321
Freq: D, dtype: float64

Converting to another time zone:

In [73]:
ts_utc.tz_convert('US/Eastern')
Out[73]:
2012-03-05 19:00:00-05:00    1.096788
2012-03-06 19:00:00-05:00    0.029678
2012-03-07 19:00:00-05:00    0.511461
2012-03-08 19:00:00-05:00   -0.332369
2012-03-09 19:00:00-05:00    1.720321
Freq: D, dtype: float64

Categoricals

In [74]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

df
Out[74]:
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e

The raw grades can be turned into a categorical type:

In [75]:
df["grade"] = df["raw_grade"].astype("category")

df["grade"]
Out[75]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Rename the categories to more meaningful labels:

In [76]:
df["grade"].cat.categories = ["very good", "good", "very bad"]

df["grade"]
Out[76]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
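
Assigning to cat.categories in place has also been deprecated in newer pandas releases; a forward-compatible sketch of the same renaming uses rename_categories (an alternative spelling, not the original notebook's code):

# same mapping as above: a -> very good, b -> good, e -> very bad
df["grade"] = df["raw_grade"].astype("category").cat.rename_categories(
    ["very good", "good", "very bad"])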

Reorder the categories and add the missing ones:

In [77]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
Out[77]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Group by grade:

In [78]:
df.groupby("grade").size()
Out[78]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

Use the ggplot style:

In [79]:
plt.style.use('ggplot')

Plotting a Series:

In [80]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

p = ts.cumsum().plot()

Plotting a DataFrame, with one line per column:

In [81]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])

df.cumsum().plot()
p = plt.legend(loc="best")

File I/O

CSV

Writing to a file:

In [82]:
df.to_csv('foo.csv')

Reading from the file:

In [83]:
pd.read_csv('foo.csv').head()
Out[83]:
Unnamed: 0 A B C D
0 2000-01-01 -1.011554 1.200283 -0.310949 -1.060734
1 2000-01-02 -1.030894 0.660518 -0.214002 -0.422014
2 2000-01-03 -0.488692 1.709209 -0.602208 1.115456
3 2000-01-04 -0.440243 0.826692 0.321648 -0.351698
4 2000-01-05 -0.165684 1.297303 0.817233 0.174767

HDF5

Writing to a file:

In [84]:
df.to_hdf("foo.h5", "df")

Reading from the file:

In [85]:
pd.read_hdf('foo.h5','df').head()
Out[85]:
A B C D
2000-01-01 -1.011554 1.200283 -0.310949 -1.060734
2000-01-02 -1.030894 0.660518 -0.214002 -0.422014
2000-01-03 -0.488692 1.709209 -0.602208 1.115456
2000-01-04 -0.440243 0.826692 0.321648 -0.351698
2000-01-05 -0.165684 1.297303 0.817233 0.174767

Excel

Writing to a file:

In [86]:
df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from the file:

In [87]:
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']).head()
Out[87]:
A B C D
2000-01-01 -1.011554 1.200283 -0.310949 -1.060734
2000-01-02 -1.030894 0.660518 -0.214002 -0.422014
2000-01-03 -0.488692 1.709209 -0.602208 1.115456
2000-01-04 -0.440243 0.826692 0.321648 -0.351698
2000-01-05 -0.165684 1.297303 0.817233 0.174767

Clean up the generated temporary files:

In [88]:
import glob
import os

for f in glob.glob("foo*"):
    os.remove(f)