高性能Python程序

2015-05-10 | 振导社会 | 程序设计高性能计算 Python

程序性能剖析

程序性能剖析的目的在于解决如下基本问题：

程序运行有多快？
影响速度的瓶颈在哪里？
使用了多少内存？
存在内存泄漏么？

Python本身就内置了丰富的性能分析工具。Profiler是Python自带的一组程序，能够描述程序运行时候的性能，并提供各种统计帮助用户定位程序的性能瓶颈。Python标准模块提供三种profilers：cProfile、profile以及hotshot。

利用time工具计时

利用Linux／Unix的time工具，快速估计代码运行时间：

$ time python yourprogram.py

real    0m1.028s
user    0m0.001s
sys     0m0.003s

其中，每项表示的含义如下：

real：实际运行时间；
user：内核之外的CPU运行时间；
sys：内核特定函数的CPU运行时间。

通过将sys和user时间加起来，可以估计程序的CPU运行时间。如果该时间远小于real时间，很可能是IO等待影响了程序性能。

利用time包直接计时

from time import time 
#### some code  
print "total run time:"
print time() - t

利用上下文管理器计时

计时程序timer.py为：

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        self.end = time.time()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs

为了使用该程序，需要用with包裹被计时的代码：

from timer import Timer
from redis import Redis
rdb = Redis()

with Timer() as t:
    rdb.lpush("foo", "bar")
print "=> elasped lpush: %s s" % t.secs

with Timer() as t:
    rdb.lpop("foo")
print "=> elasped lpop: %s s" % t.secs

利用profiler按行计时

line_profiler可以估计每行运行的时间和频率。使用前需要安装：

$ sudo pip install line_profiler

安装之后可以使用模块line_profiler以及可执行脚本kernprof.py。

为了使用该计时工具，首先修改源代码，为需要计时的代码增加修饰器@profile。启用该修饰器不需要导入任何东西。kernprof.py脚本会自动在运行时插入相应代码。需要primes.py计时的代码如下：

@profile
def primes(n): 
    if n==2:
        return [2]
    elif n<2:
        return []
    s=range(3,n+1,2)
    mroot = n ** 0.5
    half=(n+1)/2-1
    i=0
    m=3
    while m <= mroot:
        if s[i]:
            j=(m*m-3)/2
            s[j]=0
            while j<half:
                s[j]=0
                j+=m
        i=i+1
        m=2*i+3
    return [2]+[x for x in s if x]
primes(100)

然后测试程序运行时间：

$ kernprof -l -v primes.py

注意：目前该包按修饰器的方式不可用

Wrote profile results to primes.py.lprof
Timer unit: 1e-06 s
Traceback (most recent call last):
  File "/usr/local/bin/kernprof", line 9, in <module>
    load_entry_point('line-profiler==1.0', 'console_scripts', 'kernprof')()
  File "/usr/local/lib/python2.7/site-packages/kernprof.py", line 221, in main
    execfile(script_file, ns, ns)
  File "primes.py", line 2, in <module>
    @profile
NameError: name 'profile' is not defined

利用cProfile分析程序性能

$ python -m cProfile  primes.py

利用profile包分析程序性能

import profile 

def profileTest(): 
    Total =1; 
    for i in range(10): 
        Total=Total*(i+1) 
        print Total 
    return Total 

if __name__ == "__main__": 
    profile.run("profileTest()")

上述代码的输出为：

1
2
6
24
120
720
5040
40320
362880
3628800
         5 function calls in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(range)
        1    0.001    0.001    0.001    0.001 :0(setprofile)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 profile:0(profileTest())
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    0.000    0.000 test.py:3(profileTest)


[Finished in 0.1s]

其中输出每列的具体解释如下：

ncalls：表示函数调用的次数；
tottime：表示指定函数的总的运行时间，除掉函数中调用子函数的运行时间；
percall：（第一个 percall）等于 tottime/ncalls
cumtime：表示该函数及其所有子函数的调用运行的时间，即函数开始调用到返回的时间；
percall：（第二个 percall）即函数运行一次的平均时间，等于 cumtime/ncalls；
filename：lineno(function)：每个函数调用的具体信息。

如果需要将输出以日志的形式保存，只需要在调用的时候加入另外一个参数。如profile.run("profileTest()", "testprof")。

对于profile的剖析数据，如果以二进制文件的时候保存结果的时候，可以通过pstats模块进行文本报表分析，它支持多种形式的报表输出，是文本界面下一个较为实用的工具，使用非常简单：

import pstats 

p = pstats.Stats('testprof') 
p.sort_stats("name").print_stats()

其中sort_stats()方法能够对剖分数据进行排序，可以接受多个排序字段，如sort_stats('name', 'file') 将首先按照函数名称进行排序，然后再按照文件名进行排序。常见的排序字段包括：calls（被调用的次数）、time（函数内部运行时间）、cumulative（运行的总时间）等。此外 pstats也提供了命令行交互工具，执行python -m pstats后可以通过help了解更多使用方式。

对于大型应用程序，如果能够将性能分析的结果以图形的方式呈现，将会非常实用和直观，常见的可视化工具有Gprof2Dot、visualpytune、KCacheGrind 等。

利用memory_profiler分析内存占用

利用memory profiler，可以测试程序占用的内存，首先安装相关的包：

$ sudo pip install -U memory_profiler
$ sudo pip install psutil

安装psutil可以极大提高memory_profiler的性能。

与line_profiler类似，代码需要加入修饰器@profile：

@profile
def primes(n): 
    ...
    ...

然后查看程序内存占用情况：

$ python -m memory_profiler primes.py

IPython中性能测试

line_profiler和memory_profiler都有从IPython访问的快捷命令，仅需要在IPython的会话中设置：

%load_ext memory_profiler
%load_ext line_profiler

然后就可以这样就使用命令%lprun和%mprun：

from primes import primes
%mprun -f primes primes(1000)
%lprun -f primes primes(1000)

在IPython中这样使用，不再需要@profile修饰器。

内存泄漏检测

cPython通过引用计数纪录内存使用状态，当计数为0时释放内存。当对象不再被使用时，仍然有到该对象的引用，就认为发生了内存泄漏。快速发现内存泄漏的方法是利用objgraph包，可以查看内存中对象的数目，以及到这些对象的引用。

首先，安装objgraph包：

$ sudo pip install objgraph

然后，在代码中插入调用调试器的语句：

import pdb; pdb.set_trace()

具体用法可参考“Where’s the memory leak?”。

基于Python特性的优化

选择合适的数据结构

一、用字典（dictionary）代替列表（list）

Python字典中使用了hash table，因此查找操作的复杂度为O(1)，而list实际是个数组，在list中，查找需要遍历整个list，其复杂度为O(n)，因此对成员的查找访问等操作字典要比list更快。

from time import time 

t = time() 
list = ['a','b','is','python','jason','hello','hill','with','phone','test', 
'dfdf','apple','pddf','ind','basic','none','baecr','var','bana','dd','wrd'] 
#####list = dict.fromkeys(list,True) 
print list
filter = [] 
for i in range (1000000): 
    for find in ['is','hat','new','list','old','.']: 
        if find not in list: 
            filter.append(find) 
print "total run time:"
print time() - t

上述代码运行大概需要3.17秒。如果去掉行list = dict.fromkeys(list,True)的注释，将 list转换为字典之后再运行，时间大约为1.74秒。在需要频繁查找或访问多数据成员时，使用dict而不是list。

二、用集合（set）代替列表（list）

set的union，intersection、difference操作要比list的迭代要快。因此如果涉及到求list交集，并集或者差的问题可以转换为set来操作。

from time import time 

t = time() 
lista = [1,2,3,4,5,6,7,8,9,13,34,53,42,44] 
listb = [2,4,6,9,23] 
intersection=[] 
for i in range (1000000):
    for a in lista:
        for b in listb:
            if a == b:
                intersection.append(a) 
print "total run time:"
print time() - t

上述程序的运行时间大概为8.32秒，改为如下set操作：

from time import time 

t = time() 
lista=[1,2,3,4,5,6,7,8,9,13,34,53,42,44] 
listb=[2,4,6,9,23] 
intersection=[] 
for i in range (1000000): 
    list(set(lista) & set(listb)) 
print "total run time:"
print time() - t

运行时间缩减为1.75秒。

优化循环

优化循环遵循的原则是尽量减少循环过程中的计算量，有多重循环的尽量将内层的计算提到上一层。

from time import time 

t = time() 
lista = [1,2,3,4,5,6,7,8,9,10] 
listb = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.01] 
for i in range (1000000):      
    for a in range(len(lista)):
        for b in range(len(listb)):
            x = lista[a] + listb[b] 
print "total run time:"
print time() - t

上面的代码大概的运行时间约为27.14秒。将长度计算提到循环外，range用xrange代替，同时将第三层的计算提到循环的第二层：

from time import time 

t = time() 
lista = [1,2,3,4,5,6,7,8,9,10] 
listb = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.01] 
len1 = len(lista) 
len2 = len(listb) 
for i in xrange (1000000):
    for a in xrange(len1):
        temp = lista[a]
        for b in xrange(len2):
            x = temp + listb[b] 
print "total run time:"
print time() - t

上述优化后的程序其运行时间缩短为23.19秒。

利用Lazy if-evaluation特性

条件表达式是lazy evaluation的，也就是说如果存在条件表达式if x and y，在x为false的情况下y表达式的值将不再计算。因此可以利用该特性在一定程度上提高程序效率。

from time import time 

t = time() 
abbreviations = ['cf.', 'e.g.', 'ex.', 'etc.', 'fig.', 'i.e.', 'Mr.', 'vs.'] 
for i in range (1000000): 
    for w in ('Mr.', 'Hat', 'is', 'chasing', 'the', 'black', 'cat', '.'):
        if w in abbreviations: 
        #if w[-1] == '.' and w in abbreviations: 
            pass
print "total run time:"
print time() - t

上述代码的运行时间大概为1.78秒，如果使用注释行代替第一个if，运行的时间大概为1.38秒。

字符串优化

Python中的字符串对象是不可改变的，因此对任何字符串的操作（如拼接、修改等）都将产生一个新的字符串对象，这种持续的copy会在一定程度上影响性能。

一、使用`join()`代替`+`

from time import time 

t = time() 
s = ""
list = ['a','b','b','d','e','f','g','h','i','j','k','l','m','n'] 
for i in range (10000): 
    for substr in list: 
        s += substr     
print "total run time:"
print time() - t

上述代码使用+耗时约为0.027秒，改为如下使用join的代码时耗约为0.0048秒。

from time import time 

t = time() 
s = ""
list = ['a','b','b','d','e','f','g','h','i','j','k','l','m','n'] 
s = "".join(["".join(list) for i in range (10000)]) 
print "total run time:"
print time() - t

二、优先使用内置函数

当对字符串可使用正则表达式或内置函数时，选择内置函数。如str.isalpha()、str.isdigit()、str.startswith(('x', 'yz'))、str.endswith(('x', 'yz'))等。

三、优先使用格式化操作

对字符进行格式化比直接串联读取要快，因此要使用

out = "<html>%s%s%s%s</html>" % (head, prologue, query, tail)

避免使用

out = "<html>" + head + prologue + query + tail + "</html>"

使用列表解析（list comprehension）和生成器表达式（generator expression）

from time import time 

t = time() 
list = ['a','b','is','python','jason','hello','hill','with','phone','test', 
'dfdf','apple','pddf','ind','basic','none','baecr','var','bana','dd','wrd'] 
total=[] 
for i in range (1000000): 
    for w in list: 
        total.append(w) 
print "total run time:"
print time() - t

上述代码耗时约为3.09秒，若循环改为如下列表解析耗时约为2.33秒。

for i in range (1000000): 
    a = [w for w in list]

生成器表达式的优势较为明显，它不创建列表，只是返回生成器，效率较高。

for i in range (1000000): 
    a = (w for w in list)

若进一步改为生成器耗时约为0.67秒。

其它优化技巧

交换两个变量的值使用a, b = b, a而不借助中间变量；
循环时使用xrange而不是range；
使用局部变量，避免global关键字；
在耗时较多的循环中，把函数的调用改为内联的方式；
build-in函数通常较快，add(a, b)要优于a + b（受特定应用影响）¹；
if done is not None比if done != None更快；
while 1比while True更快；
使用级联比较x < y < z代替x < y and y < z。

闭包优化

工具优化

将关键Python代码部分重写成C扩展模块，或者选用在性能上更为优化的解释器等，这些统称为优化工具。Python有很多自带的优化工具，如PyPy、Cython、CFFI、Pyrex等。

PyPy

PyPy 表示“用 Python 实现的 Python”，但实际上它使用一个称为 RPython 的 Python 子集实现，能够将 Python 代码转成 C、.NET、Java 等语言和平台的代码。PyPy 使用的是基于 Trace 的 JIT 技术，大幅提升了 Python 的性能。和许多编译器和解释器不同，它不关心 Python 代码的词法分析和语法树。因为它是用 Python 语言写的，所以它直接利用 Python 语言的 Code Object。Code Object是 Python字节码的表示，也就是说 PyPy 直接分析 Python 代码所对应的字节码，这些字节码既不是以字符形式也不是以某种二进制格式保存在文件中，而在 Python 运行环境中。

OSX 中可通过 brew 安装 PyPy：

$ brew install pypy

安装了之后，还可用 pip_pypy 和 easy_install_pypy 进行包管理。将循环优化的代码存为 loop.py，然后使用 PyPy：

$ pypy loop.py

运行时间从 27.14 秒减少到了 0.93 秒。

Numba

Numba 通过 Python 直接实现的高性能函数，提高程序性能。利用少量的标注，基于数组、数学密集型的 Python 代码能被 JIT 编译为原生机器指令，获得与 C、C++ 和 Fortran 相近的性能。

安装 Numba，需要 llvm 的支持：

$ brew install llvm --with-rtti
$ LLVM_CONFIG=/usr/local/opt/llvm/bin/llvm-config pip install numba

安装 Numba 会默认安装依赖包 llvmlite（不再使用 llvmpy）。通过添加 @jit 标签，很容易极大提升代码性能：

from numba import jit
from numpy import arange
from datetime import datetime

@jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

a = arange(9999999).reshape(3333333,3)
start = datetime.now()
print(sum2d(a))
stop = datetime.now()
print(stop-start)

Cython

Cython是一种编程语言，它使得给Python写C扩展就像Python自己编程一样容易。它的目标是成为Python语言的超集，具备高级层次、面向对象、函数式和动态的编程特性。在此基础上，其主要特性是将支持可选的静态类型声明作为语言的一部分。源代码被翻译为优化的C/C++代码，并被编译为Python的扩展模块。这使得程序运行不仅快，而且和外部C库集成紧密，同时保持了Python语言著称的高产特性。用它写出来的库都可以通过import来载入，性能比Python的快。Cython里可以载入Python 扩展（比如import math），也可以载入C的库的头文件（比如cdef extern from "math.h"），另外也可以用它来写 Python代码。将关键部分重写成C扩展模块。

可通过Python的包管理器安装Cython：

$ sudo pip install Cython

Cython代码与Python不同，必须先编译。

一、直接编译

编译一般经过两个阶段：

将pyx文件编译为.c文件；
将.c文件编译为动态链接库（OSX中为.dylib类型）文件。

测试代码文件sum.pyx内容为：

def sum(int a, int b): 
    print a + b

首先，编译为.c文件：

$ cython sum.pyx

然后，编译为动态链接库：

$ clang -shared -O3 -Wall -lpython2.7 -o sum.dylib sum.c \
        -I/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Headers

二、使用distutils编译

首先，建立一个setup.py脚本：

from distutils.core import setup 
from distutils.extension import Extension 
from Cython.Distutils import build_ext 

ext_modules = [Extension("sum", ["sum.pyx"])] 
setup(
    name = 'sum app', 
    cmdclass = {'build_ext': build_ext}, 
    ext_modules = ext_modules 
)

然后编译：

$ python setup.py build_ext --inplace

最后，导入使用：

import sum
sum.sum(1, 3)

三、性能对比

ctest.pyx文件内容为：

from time import time 

def test(int n): 
    cdef int a = 0
    cdef int i
    for i in xrange(n):
        a += i 
    return a 
t = time() 
test(10000000) 
print "total run time:"
print time() - t

采用如下方法导入，其运行时间为1.8835067749e-05秒。

import pyximport; pyximport.install() 
import ctest

ctest.py文件内容为：

from time import time 

def test(n):
    a = 0
    for i in xrange(n):
        a += i 
    return a 
t = time() 
test(10000000) 
print "total run time:"
print time() - t

采用如下方法，其运行时耗为0.47秒。

$ python ctest.py

PyParallel

PyParallel 目前是 Python 3.3.5 的 fork，专为多核、快速固态硬盘、NUMA 体系结构和 10Gb+ 以太网络优化：

消除了全局解释器锁（GIL, Global Interpreter Lock）；
适用于多核的异步 IO；
多核和 IO 带宽线性伸缩；
NumPy、datrie、pyodbc 和 IPython 等有兼容版本；
超低的时延，高并发，最大的吞吐量。

PyParallel 目前还是实验性的 alpha 版本，只提供了 Windows 7/8/10 64-bit 的下载版本。

Pyston

Pyston 是一个 Dropbox 推出的新的基于 JIT 的 Python 2.7 的实现。Pyston 解析 Python 代码并转换到 LLVM 的 intermediate representation (IR)，然后 IR 通过 LLVM 优化器处理后在 LLVM JIT 引擎上执行，其结果是机器码的执行，但目前不支持 Mac OSX 系统（Pyston does not currently work on Mac OSX, and it is not clear when it will.）。

Nuitka

Nuitka 是 Python 编译器。它是 Python 解释器的无缝替代品或扩展，能够编译 CPython 支持的所有部分。然后以非常和谐的方式执行未编译代码和编译代码。它可以自由使用所有 Python 模块和扩展模块。它将 Python 转换为 C 级程序，然后与 CPython 相同的方式使用 libpython 。

参考资料

当a = b = 1时，numpy.add(a, b)就比a + b慢得多。 ↩

打赏作者

2016-10-24 » NNML（03）：BP 学习
2016-10-16 » NNML（02）：感知器学习
2016-10-09 » 鲁棒及自适应控制（2）：模型
2016-10-09 » NNML（01）：引言
2016-09-24 » 家用监控设备用于电网的可行性分析
2016-09-19 » 无人机电缆隧道巡检可行性调研报告
2016-09-18 » 鲁棒及自适应控制（1）：概论
2016-09-13 » Gradient-Based Learning Applied to Document Recognition

上一篇：Tracking-Learning-Detection 下一篇：计算机视觉中的多视几何：序言（by Olivier Faugeras）

目录