Python | Working with Data as Flexibly as in Excel

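Excel is a familiar tool for data processing, analysis, statistics, and visualization. If you plan to learn data analysis, Excel should be one of your fundamentals: you need not only to know it, but to know it very well.

Fortunately, Excel is everyday office software that almost everyone who has been to university can use. Anyone with some Excel experience knows how flexibly it lets us work with data: totals, averages, transposing rows and columns, pivot tables, and so on.

When you first learn Python and go through its basic data types, you find that none of them corresponds to an Excel table, and you start to wonder how Python handles data at all. There is no need to worry. Python's standard library has no table-like data type, but the third-party library pandas lets us work with tabular data in Python very flexibly. Below are a few examples of handling data with pandas.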

Before using pandas, it has to be installed and imported. If you run a Python version downloaded from the official Python website, install pandas from the command line with:

pip install pandas

Then import pandas in your Python program:

import pandas as pd

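If you use the Anaconda data-analysis distribution that I recommend, pandas comes preinstalled, so the import statement is all you need. In the import statement above, pd is an alias for pandas: the code has to refer to the library by name, and since that name is long to type, we use the short form.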
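A quick way to confirm that the installation and the import both worked is to print the installed version. This is a small sketch; the variable name version is only for illustration.

```python
import pandas as pd

# If this line runs without ImportError, pandas is installed and importable.
version = pd.__version__
print(version)
```

The version is worth noting, because the set of functions a DataFrame offers differs between pandas releases.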

Next, we load a data file in Excel format from the local disk. The read_excel function in pandas reads such a file directly, which is very convenient.

df = pd.read_excel("E:/data/superstore.xlsx")

After loading the data, you can look at its first two rows (output omitted here):

df.head(2)
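If you do not have the superstore.xlsx file at hand, the same call can be tried on a small table built in memory. The sketch below is a stand-in, not the real dataset: the column names are borrowed from the article, but the six rows and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the article's dataset.
df = pd.DataFrame({
    "细分": ["公司", "公司", "消费者", "消费者", "小型企业", "小型企业"],
    "类别": ["家具", "技术", "家具", "技术", "家具", "技术"],
    "销售额": [1200.0, 800.0, 450.0, 990.0, 300.0, 1500.0],
    "利润": [300.0, 200.0, 50.0, 330.0, 60.0, 450.0],
})

# head(2) returns a new DataFrame holding only the first two rows.
first_two = df.head(2)
print(first_two)
```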

Next, print how many rows the dataset has. The call below outputs the number of values in each variable (column), and the corresponding number of rows is 10000.

df.count()

ID      10000
订单ID    10000
订单日期    10000
发货日期    10000
邮寄方式    10000
客户名称    10000
细分      10000
城市      10000
省       10000
地区      10000
产品ID    10000
类别      10000
子类别     10000
销售额     10000
数量      10000
折扣      10000
利润      10000
dtype: int64
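One caveat: count() reports the number of non-missing values per column, so it equals the row count only for columns without missing values. A minimal sketch with a made-up three-row table (the names small, a, and b are illustrative) shows the difference, along with two direct ways to get the row count, len() and the shape attribute.

```python
import numpy as np
import pandas as pd

# Made-up data: column "b" has one missing value.
small = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, np.nan, 30.0]})

non_null_counts = small.count()          # per-column count of non-missing values
n_rows = len(small)                      # number of rows, missing values included
n_rows_from_shape, n_cols = small.shape  # (rows, columns)

print(non_null_counts)
print(n_rows, n_rows_from_shape, n_cols)  # prints: 3 3 2
```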

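You can also check which variables the table contains, which is very useful when a dataset has many of them. You may have noticed that the earlier calls, viewing the first two rows and counting the rows, used parentheses, while viewing the variable names here does not. Parentheses mean you are calling a function on the dataset; no parentheses means you are reading one of its attributes.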

df.columns

Index(['ID', '订单ID', '订单日期', '发货日期', '邮寄方式', '客户名称', '细分', '城市', '省', '地区', '产品ID', '类别', '子类别', '销售额', '数量', '折扣', '利润'], dtype='object')
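The difference between calling a function and reading an attribute can be checked directly. In this sketch (the table small is made up), leaving the parentheses off a function does not raise an error; it simply hands back the function object without computing anything, which is an easy mistake to make.

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Attributes are read without parentheses.
column_names = list(small.columns)
shape = small.shape

# Functions must be called. Without parentheses you only get the function itself.
method_object = small.count   # a bound method; nothing has been counted yet
result = small.count()        # a Series with the per-column counts

print(column_names)             # prints: ['a', 'b']
print(callable(method_object))  # prints: True
print(result["a"])              # prints: 3
```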

Now for a pivot-style operation: compute the average profit by market segment (细分) and product category (类别). You can do the same thing in Excel and check that the results agree.

df.groupby(['细分', '类别'])['利润'].mean()

细分    类别
公司    办公用品    129.251945
      家具      339.108491
      技术      343.436380
小型企业  办公用品    148.291325
      家具      301.062607
      技术      387.178446
消费者   办公用品    131.149533
      家具      246.053086
      技术      381.008612
Name: 利润, dtype: float64
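For a layout closer to an Excel pivot table, with one segment per row and one category per column, pivot_table can produce the same averages in wide form. The sketch below uses a small made-up table rather than the superstore data; only the column names match the article.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the real dataset.
sample = pd.DataFrame({
    "细分": ["公司", "公司", "公司", "消费者", "消费者", "消费者"],
    "类别": ["家具", "家具", "技术", "家具", "技术", "技术"],
    "利润": [100.0, 300.0, 250.0, 80.0, 120.0, 280.0],
})

# Long form: one row per (segment, category) pair.
grouped = sample.groupby(["细分", "类别"])["利润"].mean()

# Wide form: segments as rows, categories as columns.
wide = sample.pivot_table(index="细分", columns="类别", values="利润", aggfunc="mean")

print(grouped)
print(wide)
```

Both results hold the same numbers; which one to use depends on whether you want to process the values further (long form) or read them like a spreadsheet (wide form).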

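Beyond the functions demonstrated so far, the pandas DataFrame object (the df created above by loading the data is one) supports many more: in the pandas version used here, as many as 220 functions and attributes. (The table below lists what the DataFrame object supports.)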

abs	bfill	describe	from_csv	insert
add	blocks	diff	from_dict	interpolate
add_prefix	bool	div	from_items	is_copy
add_suffix	boxplot	divide	from_records	isin
agg	clip	dot	ftypes	isnull
aggregate	clip_lower	drop	ge	items
align	clip_upper	drop_duplicates	get	iteritems
all	columns	dropna	get_dtype_counts	iterrows
any	combine	dtypes	get_ftype_counts	itertuples
append	combine_first	duplicated	get_value	ix
apply	compound	empty	get_values	join
applymap	consolidate	eq	groupby	keys
as_blocks	convert_objects	equals	gt	kurt
as_matrix	copy	eval	head	kurtosis
asfreq	corr	ewm	hist	last
asof	corrwith	expanding	iat	last_valid_index
assign	count	ffill	ID	le
astype	cov	fillna	idxmax	loc
at	cummax	filter	idxmin	lookup
at_time	cummin	first	iloc	lt
axes	cumprod	first_valid_index	index	mad
between_time	cumsum	floordiv	info	mask
max	pop	round	style	to_period
mean	pow	rpow	sub	to_pickle
median	prod	rsub	subtract	to_records
melt	product	rtruediv	sum	to_sparse
memory_usage	quantile	sample	swapaxes	to_sql
merge	query	select	swaplevel	to_stata
min	radd	select_dtypes	T	to_string
mod	rank	sem	tail	to_timestamp
mode	rdiv	set_axis	take	to_xarray
mul	reindex	set_index	to_clipboard	transform
multiply	reindex_axis	set_value	to_csv	transpose
ndim	reindex_like	shape	to_dense	truediv
ne	rename	shift	to_dict	truncate
nlargest	rename_axis	size	to_excel	tshift
notnull	reorder_levels	skew	to_feather	tz_convert
nsmallest	replace	slice_shift	to_gbq	tz_localize
nunique	resample	sort_index	to_hdf	unstack
pct_change	reset_index	sort_values	to_html	update
pipe	rfloordiv	sortlevel	to_json	values
pivot	rmod	squeeze	to_latex	var
pivot_table	rmul	stack	to_msgpack	where
plot	rolling	std	to_panel	xs
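A few entries from the table map directly onto the everyday Excel operations mentioned at the start: sum, mean, T (transpose), and sort_values. The sketch below applies them to a small made-up table; the row labels r1 to r3 and all values are illustrative.

```python
import pandas as pd

# Hypothetical three-row table for illustration.
scores = pd.DataFrame(
    {"销售额": [100.0, 250.0, 400.0], "利润": [10.0, 40.0, 100.0]},
    index=["r1", "r2", "r3"],
)

column_totals = scores.sum()    # total of each column, like SUM in Excel
column_means = scores.mean()    # average of each column, like AVERAGE
transposed = scores.T           # rows and columns swapped
top_by_profit = scores.sort_values("利润", ascending=False)  # sort by a column

print(column_totals["销售额"])  # prints: 750.0
print(column_means["利润"])     # prints: 50.0
print(transposed.shape)         # prints: (2, 3)
```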

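With this many functions, I obviously cannot demonstrate them all, and you cannot memorize them all either. So what to do? Fortunately, looking up a function's help documentation is very easy. Suppose you want to check how a function works before calling it. Taking the groupby function used above as an example, appending a question mark to its name (this works in IPython and Jupyter) displays its help documentation, which describes the function's parameters and usage in detail. Then simply use the function as documented.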

df.groupby?

Signature: df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Docstring:
Group series using mapper (dict or key function, apply given function
to group, return result as series) or by a series of columns.

Parameters
----------
by : mapping, function, str, or iterable
    Used to determine the groups for the groupby.
    If ``by`` is a function, it's called on each value of the object's
    index. If a dict or Series is passed, the Series or dict VALUES
    will be used to determine the groups (the Series' values are first
    aligned; see ``.align`` method). If an ndarray is passed, the
    values are used as-is determine the groups. A str or list of strs
    may be passed to group by the columns in ``self``
axis : int, default 0
level : int, level name, or sequence of such, default None
    If the axis is a MultiIndex (hierarchical), group by a particular
    level or levels
as_index : boolean, default True
    For aggregated output, return object with group labels as the
    index. Only relevant for DataFrame input. as_index=False is
    effectively "SQL-style" grouped output
sort : boolean, default True
    Sort group keys. Get better performance by turning this off.
    Note this does not influence the order of observations within each
    group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
    When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
    reduce the dimensionality of the return type if possible,
    otherwise return a consistent type

Examples
--------
DataFrame results

>>> data.groupby(func, axis=0).mean()
>>> data.groupby(['col1', 'col2'])['col3'].mean()

DataFrame with hierarchical index

>>> data.groupby(['col1', 'col2']).mean()

Returns
-------
GroupBy object
File:      c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py
Type:      method
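The question-mark syntax is specific to IPython and Jupyter. In a plain Python interpreter, the built-in help() function and the standard inspect module give access to the same documentation. The sketch below is one way to do it; the parameters it reports depend on the installed pandas version and may differ from the signature shown above.

```python
import inspect

import pandas as pd

# help(pd.DataFrame.groupby) prints the full help text in an interactive session.

# Retrieve the signature and the docstring as ordinary Python objects.
signature = inspect.signature(pd.DataFrame.groupby)
parameter_names = list(signature.parameters)
doc = inspect.getdoc(pd.DataFrame.groupby)

print(parameter_names)
print(doc.splitlines()[0])  # first line of the docstring
```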

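pandas is a must-have third-party Python library for data analysis. Use this article as a starting point and practice with pandas on your own. If you go job hunting and can say that you have mastered all 220 pandas functions in Python, you will certainly stand out, because even I have not mastered them all yet!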