In [11]: df2 Out[11]: A B C D E F 01.02013-01-02 1.03 test foo 11.02013-01-02 1.03 train foo 21.02013-01-02 1.03 test foo 31.02013-01-02 1.03 train foo
4、查看不同列的数据类型:
1 2 3 4 5 6 7 8 9
In [12]: df2.dtypes Out[12]: A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object
In [14]: df.head() Out[14]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 2013-01-04 0.721555 -0.706771 -1.0395750.271860 2013-01-05 -0.4249720.5670200.276232 -1.087401
In [15]: df.tail(3) Out[15]: A B C D 2013-01-04 0.721555 -0.706771 -1.0395750.271860 2013-01-05 -0.4249720.5670200.276232 -1.087401 2013-01-06 -0.6736900.113648 -1.4784270.524988
In [24]: df[0:3] Out[24]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804
In [25]: df['20130102':'20130104'] Out[25]: A B C D 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 2013-01-04 0.721555 -0.706771 -1.0395750.271860
通过标签选择
1、 使用标签来获取一个交叉的区域
1 2 3 4 5 6 7
In [26]: df.loc[dates[0]] Out[26]: A 0.469112 B -0.282863 C -1.509059 D -1.135632 Name: 2013-01-01 00:00:00, dtype: float64
2、 通过标签来在多个轴上进行选择
1 2 3 4 5 6 7 8 9
In [27]: df.loc[:,['A','B']] Out[27]: A B 2013-01-01 0.469112 -0.282863 2013-01-02 1.212112 -0.173215 2013-01-03 -0.861849 -2.104569 2013-01-04 0.721555 -0.706771 2013-01-05 -0.4249720.567020 2013-01-06 -0.6736900.113648
3、 标签切片
1 2 3 4 5 6
In [28]: df.loc['20130102':'20130104',['A','B']] Out[28]: A B 2013-01-02 1.212112 -0.173215 2013-01-03 -0.861849 -2.104569 2013-01-04 0.721555 -0.706771
4、 对于返回的对象进行维度缩减
1 2 3 4 5
In [29]: df.loc['20130102',['A','B']] Out[29]: A 1.212112 B -0.173215 Name: 2013-01-02 00:00:00, dtype: float64
5、 获取一个标量
1 2
In [30]: df.loc[dates[0],'A'] Out[30]: 0.46911229990718628
6、 快速访问一个标量(与上一个方法等价)
1 2
In [31]: df.at[dates[0],'A'] Out[31]: 0.46911229990718628
通过位置选择
1、 通过传递数值进行位置选择(选择的是行)
1 2 3 4 5 6 7
In [32]: df.iloc[3] Out[32]: A 0.721555 B -0.706771 C -1.039575 D 0.271860 Name: 2013-01-04 00:00:00, dtype: float64
2、 通过数值进行切片,与numpy/python中的情况类似
1 2 3 4 5
In [33]: df.iloc[3:5,0:2] Out[33]: A B 2013-01-040.721555 -0.706771 2013-01-05 -0.4249720.567020
3、 通过指定一个位置的列表,与numpy/python中的情况类似
1 2 3 4 5 6
In [34]: df.iloc[[1,2,4],[0,2]] Out[34]: A C 2013-01-02 1.2121120.119209 2013-01-03 -0.861849 -0.494929 2013-01-05 -0.4249720.276232
4、 对行进行切片
1 2 3 4 5
In [35]: df.iloc[1:3,:] Out[35]: A B C D 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804
5、 对列进行切片
1 2 3 4 5 6 7 8 9
In [36]: df.iloc[:,1:3] Out[36]: B C 2013-01-01 -0.282863 -1.509059 2013-01-02 -0.1732150.119209 2013-01-03 -2.104569 -0.494929 2013-01-04 -0.706771 -1.039575 2013-01-05 0.5670200.276232 2013-01-06 0.113648 -1.478427
6、 获取特定的值
1 2
In [37]: df.iloc[1,1] Out[37]: -0.17321464905330858
获得快速访问的一个标量(等同于以前的做法)
1 2
In [38]: df.iat[1,1] Out[38]: -0.17321464905330858
布尔索引
1、 使用一个单独列的值来选择数据:
1 2 3 4 5 6
In [39]: df[df.A > 0] Out[39]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-04 0.721555 -0.706771 -1.0395750.271860
2、 使用where操作来选择数据:
1 2 3 4 5 6 7 8 9
In [40]: df[df > 0] Out[40]: A B C D 2013-01-01 0.469112 NaN NaN NaN 2013-01-02 1.212112 NaN 0.119209 NaN 2013-01-03 NaN NaN NaN 1.071804 2013-01-04 0.721555 NaN NaN 0.271860 2013-01-05 NaN 0.5670200.276232 NaN 2013-01-06 NaN 0.113648 NaN 0.524988
3、 使用isin()方法来过滤:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one','two','three','four','three']
In [43]: df2 Out[43]: A B C D E 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one 2013-01-02 1.212112 -0.1732150.119209 -1.044236 one 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 two 2013-01-04 0.721555 -0.706771 -1.0395750.271860 three 2013-01-05 -0.4249720.5670200.276232 -1.087401 four 2013-01-06 -0.6736900.113648 -1.4784270.524988 three
In [44]: df2[df2['E'].isin(['two','four'])] Out[44]: A B C D E 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 two 2013-01-05 -0.4249720.5670200.276232 -1.087401 four
设置
1、 设置一个新的列:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [45]: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
In [51]: df Out[51]: A B C D F 2013-01-010.0000000.000000 -1.5090595 NaN 2013-01-021.212112 -0.1732150.11920951.0 2013-01-03 -0.861849 -2.104569 -0.49492952.0 2013-01-040.721555 -0.706771 -1.03957553.0 2013-01-05 -0.4249720.5670200.27623254.0 2013-01-06 -0.6736900.113648 -1.47842755.0
5、 通过where操作来设置新的值:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2 Out[54]: A B C D F 2013-01-01 0.0000000.000000 -1.509059 -5 NaN 2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0 2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0 2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0 2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0 2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1],'E'] = 1
In [57]: df1 Out[57]: A B C D F E 2013-01-01 0.0000000.000000 -1.5090595 NaN 1.0 2013-01-02 1.212112 -0.1732150.11920951.01.0 2013-01-03 -0.861849 -2.104569 -0.49492952.0 NaN 2013-01-04 0.721555 -0.706771 -1.03957553.0 NaN
2、 去掉包含缺失值的行:
1 2 3 4
In [58]: df1.dropna(how='any') Out[58]: A B C D F E 2013-01-021.212112 -0.1732150.11920951.01.0
3、 对缺失值进行填充:
1 2 3 4 5 6 7
In [59]: df1.fillna(value=5) Out[59]: A B C D F E 2013-01-01 0.0000000.000000 -1.50905955.01.0 2013-01-02 1.212112 -0.1732150.11920951.01.0 2013-01-03 -0.861849 -2.104569 -0.49492952.05.0 2013-01-04 0.721555 -0.706771 -1.03957553.05.0
4、 对数据进行布尔填充:
1 2 3 4 5 6 7
In [60]: pd.isnull(df1) Out[60]: A B C D F E 2013-01-01 FalseFalseFalseFalseTrueFalse 2013-01-02 FalseFalseFalseFalseFalseFalse 2013-01-03 FalseFalseFalseFalseFalseTrue 2013-01-04 FalseFalseFalseFalseFalseTrue
五、相关操作
统计(相关操作通常情况下不包括缺失值)
1、 执行描述性统计:
1 2 3 4 5 6 7 8
In [61]: df.mean() Out[61]: A -0.004474 B -0.383981 C -0.687758 D 5.000000 F 3.000000 dtype: float64
In [63]: s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
In [64]: s Out[64]: 2013-01-01 NaN 2013-01-02 NaN 2013-01-03 1.0 2013-01-04 3.0 2013-01-05 5.0 2013-01-06 NaN Freq: D, dtype: float64
In [65]: df.sub(s, axis='index') Out[65]: A B C D F 2013-01-01 NaN NaN NaN NaN NaN 2013-01-02 NaN NaN NaN NaN NaN 2013-01-03 -1.861849 -3.104569 -1.4949294.01.0 2013-01-04 -2.278445 -3.706771 -4.0395752.00.0 2013-01-05 -5.424972 -4.432980 -4.7237680.0 -1.0 2013-01-06 NaN NaN NaN NaN NaN
Apply
1、 对数据应用函数:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
In [66]: df.apply(np.cumsum) Out[66]: A B C D F 2013-01-01 0.0000000.000000 -1.5090595 NaN 2013-01-02 1.212112 -0.173215 -1.389850101.0 2013-01-03 0.350263 -2.277784 -1.884779153.0 2013-01-04 1.071818 -2.984555 -2.924354206.0 2013-01-05 0.646846 -2.417535 -2.6481222510.0 2013-01-06 -0.026844 -2.303886 -4.1265493015.0
In [67]: df.apply(lambda x: x.max() - x.min()) Out[67]: A 2.073961 B 2.671590 C 1.785291 D 0.000000 F 4.000000 dtype: float64
In [87]: df Out[87]: ABCD 0 foo one -1.202872-0.055224 1 bar one -1.8144702.395985 2 foo two 1.0186011.552825 3 bar three -0.5954470.166599 4 foo two 1.3954330.047609 5 bar two -0.392670-0.136473 6 foo one 0.007207-0.561757 7 foo three 1.928123-1.623033
1、 分组并对每个分组执行sum函数:
1 2 3 4 5 6
In [88]: df.groupby('A').sum() Out[88]: C D A bar -2.8025882.42611 foo 3.146492 -0.63958
2、 通过多个列进行分组形成一个层次索引,然后执行函数:
1 2 3 4 5 6 7 8 9 10
In [89]: df.groupby(['A','B']).sum() Out[89]: C D A B bar one -1.8144702.395985 three -0.5954470.166599 two -0.392670 -0.136473 foo one -1.195665 -0.616981 three 1.928123 -1.623033 two 2.4140341.600434
In [91]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [92]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [93]: df2 = df[:4]
In [94]: df2 Out[94]: AB first second bar one 0.029399-0.542108 two 0.282696-0.087302 baz one -1.5751701.771208 two 0.8164821.100230
stack()方法在DataFrame列的层次上压缩
1 2 3 4 5 6 7 8 9 10 11 12 13 14
In [95]: stacked = df2.stack()
In [96]: stacked Out[96]: first second bar one A 0.029399 B -0.542108 two A 0.282696 B -0.087302 baz one A -1.575170 B 1.771208 two A 0.816482 B 1.100230 dtype: float64
In [101]: df Out[101]: A B C D E 0 one A foo 1.418757 -0.179666 1 one B foo -1.8790241.291836 2 two C foo 0.536826 -0.009614 3 three A bar 1.0061600.392149 4 one B bar -0.0297160.264599 5 one C bar -1.146178 -0.057409 6 two A foo 0.100900 -1.425638 7 three B foo -1.0350181.024098 8 one C foo 0.314665 -0.106062 9 one A bar -0.7737231.824375 10 two B bar -1.1706530.595974 11 three C bar 0.6487401.167115
可以从这个数据中轻松的生成数据透视表:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [102]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) Out[102]: C bar foo A B one A -0.7737231.418757 B -0.029716 -1.879024 C -1.1461780.314665 three A1.006160 NaN B NaN -1.035018 C 0.648740 NaN two A NaN 0.100900 B -1.170653 NaN C NaN 0.536826
In [127]: df["grade"] Out[127]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]
4、 排序是按照Categorical的顺序进行的而不是按照字典顺序进行:
1 2 3 4 5 6 7 8 9
In [128]: df.sort_values(by="grade") Out[128]: id raw_grade grade 56 e very bad 12b good 23b good 01a very good 34a very good 45a very good
5、 对Categorical列进行排序时存在空的类别:
1 2 3 4 5 6 7 8 9
In [129]: df.groupby("grade").size() Out[129]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64
十一、画图
1 2 3 4 5 6
In [130]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
In [131]: ts = ts.cumsum()
In [132]: ts.plot() Out[132]: <matplotlib.axes._subplots.AxesSubplot at 0x10efd5a90>
>>> if pd.Series([False, True, False]): print("I was true") Traceback ... ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().