The Python Quants

Technical News from the Python Financial Analytics Front

Dr. Yves J. Hilpisch

The Python Quants GmbH

team@tpq.io | http://tpq.io

Quick Overview

In this talk, I cover the following libraries:

  • TsTables — High Performance Time Series Management
  • bcolz — High Performance Columnar Data Store
  • ibis — Efficient and Performant Handling of SQL Sources
  • blaze — Data Blending and Analysis

TsTables – High Performance Time Series Management

TsTables is a Python library by Andy Fiedler built on top of the popular PyTables HDF5 database library. It is meant to handle large amounts of high-frequency time series data in "append once, retrieve many times" scenarios (cf. the GitHub page). The focus lies on retrieving chunks of data from large data sets as quickly as possible.
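To give a sense of the API before the detailed walk-through, here is a minimal, hedged sketch of the typical workflow; the file name, column layout and dates are purely illustrative and not part of the original example.

import numpy as np
import pandas as pd
import tables as tb
import tstables  # importing tstables adds create_ts() to PyTables File objects
from datetime import datetime

class Prices(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)  # integer timestamp column required by TsTables
    price = tb.Float64Col(pos=1)

# a tiny illustrative DataFrame with a DatetimeIndex
idx = pd.date_range('2017-1-1', periods=10, freq='1s')
data = pd.DataFrame({'price': np.arange(10.)}, index=idx)

h5 = tb.open_file('prices.h5', 'w')
ts = h5.create_ts('/', 'PRICES', Prices)  # TsTables-managed time series group
ts.append(data)  # append once ...
rows = ts.read_range(datetime(2017, 1, 1),  # ... retrieve many times
                     datetime(2017, 1, 2))
h5.close()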

In [1]:
import os
import numpy as np
import pandas as pd
import tables as tb
import tstables as tstb
import random
from time import time
from datetime import datetime
import seaborn as sns; sns.set()
%matplotlib inline

Sample Time Series Data

Let us generate a decent number of sample data points.

In [2]:
no = 30000000  # number of data points
co = 3  # number of time series (columns)
dt = 1. / (12 * 30 * 24 * 60)  # time step as year fraction
vol = 0.2  # volatility

We generate data at one-second intervals.

In [3]:
dr = pd.date_range('2017-1-1', periods=no, freq='1s')
In [4]:
dr
Out[4]:
DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 00:00:01',
               '2017-01-01 00:00:02', '2017-01-01 00:00:03',
               '2017-01-01 00:00:04', '2017-01-01 00:00:05',
               '2017-01-01 00:00:06', '2017-01-01 00:00:07',
               '2017-01-01 00:00:08', '2017-01-01 00:00:09',
               ...
               '2017-12-14 05:19:50', '2017-12-14 05:19:51',
               '2017-12-14 05:19:52', '2017-12-14 05:19:53',
               '2017-12-14 05:19:54', '2017-12-14 05:19:55',
               '2017-12-14 05:19:56', '2017-12-14 05:19:57',
               '2017-12-14 05:19:58', '2017-12-14 05:19:59'],
              dtype='datetime64[ns]', length=30000000, freq='S')

In-memory generation is quite quick.

In [5]:
%%time
# simulate driftless geometric Brownian motion paths (log-Euler scheme)
da = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * dt +
        vol * np.sqrt(dt) * np.random.standard_normal((no, co)), axis=0))
da[0] = 100  # set the initial values of the three paths to exactly 100
CPU times: user 5.76 s, sys: 854 ms, total: 6.62 s
Wall time: 6.61 s
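For reference, the vectorized cell above simulates driftless geometric Brownian motion via the standard log-Euler scheme; per time step and per series it computes

S_t = S_(t - dt) * exp(-0.5 * vol ** 2 * dt + vol * sqrt(dt) * z_t),  z_t ~ N(0, 1)

with S_0 = 100, which is exactly what the cumulative sum inside np.exp vectorizes over all 30 million steps.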
In [6]:
df = pd.DataFrame(da, index=dr, columns=['ts1', 'ts2', 'ts3'])
In [7]:
df.count()
Out[7]:
ts1    30000000
ts2    30000000
ts3    30000000
dtype: int64

The starting values of the three time series.

In [8]:
df.head()
Out[8]:
                            ts1         ts2         ts3
2017-01-01 00:00:00  100.000000  100.000000  100.000000
2017-01-01 00:00:01   99.972947  100.017864   99.962609
2017-01-01 00:00:02   99.990383   99.991254   99.951958
2017-01-01 00:00:03   99.957912  100.027486   99.976107
2017-01-01 00:00:04   99.953263  100.006583  100.010659

And a plot of the time series data (every 100,000th data point).

In [9]:
df[::100000].plot(figsize=(10, 6));

Storage and Retrieval with TsTables

To store the time series data in a PyTables table, we first define the table structure.

In [10]:
class TS(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)  # integer timestamp column required by TsTables
    ts1 = tb.Float64Col(pos=1)
    ts2 = tb.Float64Col(pos=2)
    ts3 = tb.Float64Col(pos=3)

Second, we open a database file and create the table object.

In [11]:
path = '/notebooks/fpq/yves/data/'
os.environ['path'] = path
In [12]:
h5 = tb.open_file(path + 'ts.h5', 'w')

TsTables adds a new function create_ts to PyTables.

In [13]:
ts = h5.create_ts('/','TS', TS)

Third, we append the time series data to the table object.

In [14]:
%time ts.append(df)
CPU times: user 2.49 s, sys: 2.19 s, total: 4.68 s
Wall time: 46.7 s
In [15]:
!ls -n $path/ts*
-rw-r--r-- 1 3613 8 935432352 Apr 27 12:26 /notebooks/fpq/yves/data//ts.h5

The approach of TsTables is to apply a highly structured storage hierarchy.

In [16]:
a = str(h5)
print(a[:508])
/notebooks/fpq/yves/data/ts.h5 (File) ''
Last modif.: 'Wed Apr 27 12:26:02 2016'
Object Tree: 
/ (RootGroup) ''
/TS (Group/Timeseries) ''
/TS/y2016 (Group) ''
/TS/y2017 (Group) ''
/TS/y2017/m01 (Group) ''
/TS/y2017/m02 (Group) ''
/TS/y2017/m03 (Group) ''
/TS/y2017/m04 (Group) ''
/TS/y2017/m05 (Group) ''
/TS/y2017/m06 (Group) ''
/TS/y2017/m07 (Group) ''
/TS/y2017/m08 (Group) ''
/TS/y2017/m09 (Group) ''
/TS/y2017/m10 (Group) ''
/TS/y2017/m11 (Group) ''
/TS/y2017/m12 (Group) ''
/TS/y2017/m12/d01 (Group) ''

The strength of TsTables lies in retrieving chunks of time series data defined by a start date and an end date (which obviously is a typical case in finance, e.g. in backtesting strategies or risk management).

In [17]:
read_start_dt = datetime(2017, 2, 1, 0, 0)
read_end_dt = datetime(2017, 2, 2, 0, 0)

TsTables tries to make such an operation as fast as possible.

In [18]:
%time rows = ts.read_range(read_start_dt, read_end_dt)
CPU times: user 11 ms, sys: 4 ms, total: 15 ms
Wall time: 262 ms

Let us try it with random intervals.

In [19]:
t0 = time()
its = 100
for _ in range(its):
    day = random.randint(1, 27)
    read_start_dt = datetime(2017, 2, day, 0, 0)
    read_end_dt = datetime(2017, 2, day + 1, 0, 0)
    rows = ts.read_range(read_start_dt, read_end_dt)
t1 = time()

The results are convincing.

In [20]:
print("time for %d random accesses    %5.3f seconds" %(its, t1 - t0))
print("average time for random access  %5.3f seconds" %((t1 - t0) / its))
time for 100 random accesses    0.870 seconds
average time for random access  0.009 seconds

Conveniently, the returned object is a pandas DataFrame.

In [21]:
rows.count()
Out[21]:
ts1    86401
ts2    86401
ts3    86401
dtype: int64
In [22]:
rows.head()
Out[22]:
                           ts1         ts2        ts3
2017-02-08 00:00:00  41.714617  112.266922  63.477136
2017-02-08 00:00:01  41.721620  112.260084  63.494282
2017-02-08 00:00:02  41.723495  112.327356  63.500059
2017-02-08 00:00:03  41.734147  112.326363  63.540484
2017-02-08 00:00:04  41.708918  112.336727  63.570569
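Since the returned object is a regular DataFrame, the usual pandas machinery applies directly. As a hedged illustration (not part of the original example), the second-level data could be resampled to one-minute bars:

bars = rows['ts1'].resample('1min').ohlc()
  # open/high/low/close per minute for the first series
bars.head()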

A look at a sub-set of the retrieved data (every 500th point).

In [23]:
rows[::500].plot(figsize=(10, 6));
In [24]:
h5.close()
In [25]:
!rm $path/ts.h5

bcolz – High Performance Columnar Data Store

bcolz is a columnar data store for fast data storage and retrieval with built-in high performance compression. It supports both in-memory and out-of-memory storage and operations. Cf. http://bcolz.blosc.org/.

In [26]:
import bcolz

ctable Example

The first example is based on the ctable class for data in table format. The example data set is roughly 1 GB in size (uncompressed).

In [27]:
N = 100000 * 1000
print(N)
100000000

In-Memory Storage

We first generate an in-memory object using high compression. Since we work with integers, good compression ratios are to be expected. It takes about 45 seconds to generate the ctable object from a generator via the fromiter method.

In [28]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                    dtype="i4, i8",
                    count=N,
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 46.4 s, sys: 128 ms, total: 46.5 s
Wall time: 45 s

The in-memory size is only about 150 MB, which translates into a compression ratio of 7+.

In [29]:
ct
Out[29]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 151.84 MB; ratio: 7.54
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]
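Compression behavior is controlled via bcolz.cparams. As a hedged sketch (assuming the lz4 codec is available in the local Blosc build), one could trade compression ratio against speed by choosing a different codec and level:

ct_lz4 = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                        dtype="i4, i8",
                        count=N,
                        cparams=bcolz.cparams(clevel=5, cname='lz4'))
ct_lz4  # compare nbytes/cbytes with the blosclz result above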

You can now implement fast numerical operations on this data object (note that the output is a carray object).

In [30]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.55 s, sys: 442 ms, total: 4.99 s
Wall time: 1.17 s
Out[30]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]

Disk-Based Storage

The same tasks can be implemented with disk-based storage; to this end, we only need to specify the rootdir parameter. At about 50 seconds, generation takes a bit longer on disk.

In [31]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                    dtype="i4, i8",
                    count=N, rootdir=path + 'ct',
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 45.9 s, sys: 289 ms, total: 46.2 s
Wall time: 49.3 s

Everything else (especially the object handling) remains almost the same, however.

In [32]:
ct
Out[32]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 151.84 MB; ratio: 7.54
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/ct'
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]

The numerical operations work in the same fashion and hardly take any longer, thanks to native multithreading and optimized caching.

In [33]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.58 s, sys: 514 ms, total: 5.09 s
Wall time: 1.19 s
Out[33]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]

Let us finally verify system disk usage.

In [34]:
!ls $path/ct
__attrs__  f0  f1  __rootdirs__
In [35]:
!du -hs $path/ct
# system disk usage
156M	/notebooks/fpq/yves/data//ct
In [36]:
!rm -r $path/ct

carray Example

This example is about mid-sized data that (in general) would not fit into memory without compression.

In [37]:
import numpy as np

As the basis, we generate a NumPy ndarray object of 32 MB in size.

In [38]:
n = 2000
a = np.arange(n * n).reshape(n, n) 
a.nbytes
Out[38]:
32000000

In-Memory Storage

Let us first work in-memory again. The carray object holds the initial ndarray object plus 4,000 appended copies. The in-memory generation takes just under 30 seconds.

In [39]:
%%time
it = 4000
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 27.3 s, sys: 317 ms, total: 27.6 s
Wall time: 27.9 s

The carray object stores 120 GB worth of data in less than 1 GB of memory, for a compression ratio of more than 130.

In [40]:
ca
Out[40]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]

Let us implement the evaluation of a numerical expression on this data set. The syntax and handling are the same as with NumPy ndarray objects.

In [41]:
%time ca[:5000] ** 2 + np.sqrt(ca[10000:15000])
CPU times: user 140 ms, sys: 97 ms, total: 237 ms
Wall time: 236 ms
Out[41]:
array([[  0.00000000e+00,   2.00000000e+00,   5.41421356e+00, ...,
          3.98805369e+06,   3.99204870e+06,   3.99604571e+06],
       [  4.00004472e+06,   4.00404573e+06,   4.00804874e+06, ...,
          1.59760722e+07,   1.59840672e+07,   1.59920642e+07],
       [  1.60000632e+07,   1.60080643e+07,   1.60160673e+07, ...,
          3.59640864e+07,   3.59760814e+07,   3.59880785e+07],
       ..., 
       [  3.97603600e+12,   3.97603999e+12,   3.97604398e+12, ...,
          3.98400403e+12,   3.98400802e+12,   3.98401201e+12],
       [  3.98401600e+12,   3.98401999e+12,   3.98402399e+12, ...,
          3.99199201e+12,   3.99199601e+12,   3.99200001e+12],
       [  3.99200400e+12,   3.99200800e+12,   3.99201199e+12, ...,
          3.99998800e+12,   3.99999200e+12,   3.99999600e+12]])

Another approach is to use the eval function of bcolz.

In [42]:
x = ca[:10000]  # 10,000 rows as sub-set
In [43]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(clevel=9))
  # output carray object compressed
CPU times: user 746 ms, sys: 55 ms, total: 801 ms
Wall time: 216 ms
Out[43]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Disk-Based Storage

Now for disk-based storage of multiple copies of the array object. We append the object 4,000 times to a single disk-based carray object. It takes only a bit more than a minute to compress and store 120 GB worth of data on disk.

In [44]:
%%time
it = 4000
ca = bcolz.carray(a, rootdir=path + 'ca',
                 cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 33 s, sys: 4.62 s, total: 37.6 s
Wall time: 1min 9s

The compression ratio in this case is again 130+.

In [45]:
ca
Out[45]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/ca'
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]

Simple numerical operations are easy to implement.

In [46]:
%time np.sum(ca[:1000] + ca[4000:5000])
CPU times: user 53 ms, sys: 4 ms, total: 57 ms
Wall time: 56.6 ms
Out[46]:
3999998000000

Let us try the previous, mathematically more demanding operation – again with a sub-set of the data.

In [47]:
x = ca[:10000]  # 10,000 rows as sub-set

First, with an in-memory carray results object.

In [48]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9))
CPU times: user 732 ms, sys: 17 ms, total: 749 ms
Wall time: 205 ms
Out[48]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Second, with an on-disk results object. The time difference compared to the in-memory case is not that large.

In [49]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9), rootdir=path + 'out')
CPU times: user 848 ms, sys: 62 ms, total: 910 ms
Wall time: 397 ms
Out[49]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/out'
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Finally, we verify system disk usage.

In [50]:
# system disk usage data object
!du -hs $path/ca
985M	/notebooks/fpq/yves/data//ca
In [51]:
# system disk usage results object
!du -hs $path/out
39M	/notebooks/fpq/yves/data//out
In [52]:
!rm -r $path/out
!rm -r $path/ca

ibis — Efficient and Performant Handling of SQL Sources

ibis is one of the newer projects by Wes McKinney. The content that follows is based on the tutorial found at http://blog.ibis-project.org/sqlite-crunchbase-quickstart/.

Importing & Database Connection

First, let us establish a connection to a SQLite3 database.

In [53]:
import ibis
ibis.options.interactive = True
In [54]:
con = ibis.sqlite.connect('crunchbase.db')

Basic Operations and Lookups

Certain operations and lookups are easily accomplished without any SQL (1).

In [55]:
con.list_tables()
Out[55]:
['acquisitions', 'companies', 'investments', 'rounds']

Certain operations and lookups are easily accomplished without any SQL (2).

In [56]:
rounds = con.table('rounds')
rounds.info()
Table rows: 87161

Column                   Type    Non-null #
------                   ----    ----------
company_permalink        string  87161     
company_name             string  87161     
company_category_list    string  83464     
company_market           string  80523     
company_country_code     string  81291     
company_state_code       string  59296     
company_region           string  81016     
company_city             string  80208     
funding_round_permalink  string  87161     
funding_round_type       string  87161     
funding_round_code       string  23648     
funded_at                string  87161     
funded_month             string  87147     
funded_quarter           string  87147     
funded_year              int64   87147     
raised_amount_usd        double  73406     

Certain operations and lookups are easily accomplished without any SQL (3).

In [57]:
rounds.funding_round_type.value_counts()
Out[57]:
       funding_round_type  count
0                   angel   4602
1        convertible_note    838
2          debt_financing   5869
3     equity_crowdfunding   2401
4                   grant   1523
5   non_equity_assistance     69
6           post_ipo_debt     92
7         post_ipo_equity    425
8          private_equity   1864
9    product_crowdfunding    241
10       secondary_market     61
11                   seed  22053
12            undisclosed   4128
13                venture  42995
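Filtering and projecting work in the same declarative fashion. A hedged sketch using column names from the info() output above (the 100 million USD threshold is arbitrary):

big_rounds = rounds[rounds.raised_amount_usd > 1e8]
expr = big_rounds[['company_name', 'funding_round_type',
                   'raised_amount_usd']].limit(5)
expr.execute()  # returns a pandas DataFrame with five of the matches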

Instead of writing SQL, you create Pythonic expressions with ibis.

In [58]:
acquisitions = con.table('acquisitions')
expr = (acquisitions.price_amount
        .isnull()
        .name('has_price')
        .value_counts())
expr
Out[58]:
   has_price  count
0          0   3816
1          1  51424

Executing expressions gives back DataFrame objects by default for table-like results.

In [59]:
df = expr.execute()
df
Out[59]:
   has_price  count
0          0   3816
1          1  51424
In [60]:
type(expr)
Out[60]:
ibis.expr.types.TableExpr
In [61]:
type(df)
Out[61]:
pandas.core.frame.DataFrame

Another case is a single-value (scalar) result.

In [62]:
companies = con.table('companies')
expr = companies.funding_total_usd.mean()
type(expr)
Out[62]:
ibis.expr.types.DoubleScalar
In [63]:
expr.execute()
Out[63]:
16198405.416388474
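Aggregations can also be grouped. For example, a hedged sketch of the average total funding by country (country_code is a column of the companies table, as the generated SQL further below confirms):

expr = (companies
        .group_by('country_code')
        .aggregate(avg_funding=companies.funding_total_usd.mean()))
expr.limit(10).execute()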

Funding Metrics

Let us have a look at some funding metrics. The time stamps do not seem to be entirely accurate (note the implausible years below).

In [64]:
funded_at = rounds.funded_at.cast('timestamp')
funded_at.year().value_counts()
Out[64]:
    unnamed  count
0         1      3
1         7      1
2        11      1
3        20      2
4        29      1
5        30      1
6       201      1
7       203      1
8       214      1
9      1921      1
10     1960      3
11     1973      1
12     1974      2
13     1979      1
14     1982      3
15     1983      1
16     1984      4
17     1985      6
18     1986      4
19     1987      6
20     1988      2
21     1989      4
22     1990     14
23     1991     10
24     1992     12
25     1993     15
26     1994     15
27     1995     21
28     1996     32
29     1997     30
30     1998     62
31     1999    178
32     2000    267
33     2001    200
34     2002    207
35     2003    271
36     2004    434
37     2005   1762
38     2006   2663
39     2007   3562
40     2008   4257
41     2009   6553
42     2010   8692
43     2011  10550
44     2012  12648
45     2013  16194
46     2014  17670
47     2015    790
48     2104      1
49     2105      1

Another look at different funding stages.

In [65]:
rounds.funding_round_code.value_counts()
Out[65]:
  funding_round_code  count
0               None  63513
1                  A  11382
2                  B   6548
3                  C   3329
4                  D   1530
5                  E    608
6                  F    201
7                  G     45
8                  H      5

Next, a somewhat more complex analysis ...

In [66]:
year = funded_at.year().name('year')
type(year)
Out[66]:
ibis.expr.types.Int32Array
In [67]:
year
Out[67]:
0       2015
1       2014
2       2012
3       2010
4       2010
5       2012
6       2011
7       2014
8       2014
9       2013
10      2007
11      2008
12      2014
13      2014
14      2013
15      2013
16      2010
17      2011
18      2014
19      2013
20      2014
21      2010
22      2011
23      2009
24      2011
25      2011
26      2009
27      2011
28      2010
29      2011
        ... 
9970    2010
9971    2011
9972    2010
9973    2012
9974    2011
9975    2010
9976    2007
9977    2006
9978    2004
9979    2011
9980    2013
9981    2006
9982    2014
9983    2013
9984    2014
9985    2009
9986    2011
9987    2014
9988    2006
9989    2012
9990    2014
9991    2012
9992    2012
9993    2013
9994    2014
9995    2014
9996    2009
9997    2012
9998    2013
9999    2011
Name: tmp, dtype: int64

... selecting venture fundings between 2000 and 2015 and grouping them by year and funding round code. The syntax/API is similar to pandas.

In [68]:
expr = (rounds[(rounds.funding_round_type == 'venture') &
               year.between(2000, 2015) &
               rounds.funding_round_code.notnull()]
        .group_by([year, 'funding_round_code'])
        .size())

A selection of the results.

In [69]:
results = expr.execute()
type(results)
Out[69]:
pandas.core.frame.DataFrame
In [70]:
results[:15]
Out[70]:
    year funding_round_code  count
0   2000                  A     79
1   2000                  B     32
2   2000                  C     10
3   2000                  D      8
4   2000                  E      2
5   2001                  A     50
6   2001                  B     30
7   2001                  C     17
8   2001                  D      4
9   2001                  E      1
10  2002                  A     35
11  2002                  B     39
12  2002                  C     25
13  2002                  D      5
14  2002                  E      2

Pivot tables are quickly generated out of such results objects.

In [71]:
pivoted = (results.set_index(['year', 'funding_round_code'])
           .unstack('funding_round_code')
           .fillna(0))
pivoted
Out[71]:
                   count
funding_round_code     A    B    C    D    E   F   G  H
year
2000                  79   32   10    8    2   0   0  0
2001                  50   30   17    4    1   0   0  0
2002                  35   39   25    5    2   2   0  0
2003                  68   45   14   13    1   0   1  0
2004                 146   76   32   15    3   1   0  0
2005                 513  334  176   67   26   6   0  0
2006                 717  465  226   91   35   7   1  0
2007                 956  583  281  110   49   7   1  0
2008                 979  653  308  120   54  17   1  0
2009                 753  531  290  147   55  28   0  0
2010                1013  598  369  149   52  18   2  0
2011                1250  700  334  175   60  18   5  0
2012                1242  610  345  184   69  16   7  0
2013                1606  796  377  185   81  38   6  0
2014                1757  952  471  223  108  36  18  5
2015                  88   71   34   28    8   5   3  0
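Such a pivot table lends itself to a quick visualization. A hedged sketch using the pandas plotting interface (assuming the pivoted object from above):

pivoted['count'].plot(kind='bar', stacked=True, figsize=(10, 6));
  # stacked bars: venture rounds per year, split by funding round code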

Now an analysis with regard to funding sum buckets.

In [72]:
funding_buckets = [0, 1000000, 10000000, 50000000, 100000000, 
                   500000000, 1000000000]
In [73]:
bucket = (companies
          .funding_total_usd
          .bucket(funding_buckets, include_over=True))
bucket.value_counts()
Out[73]:
   unnamed  count
0      NaN  12055
1        0  15965
2        1  15754
3        2   7839
4        3   1532
5        4   1022
6        5     88
7        6     37

Adding bucket names makes the results output more meaningful.

In [74]:
bucket_names = ['0 to 1m', '1m to 10m', '10m to 50m', 
                '50m to 100m', '100m to 500m',
                '500m to 1b', 'Over 1b']
In [75]:
counts = bucket.name('bucket').value_counts()
labeled = counts.bucket.label(bucket_names)
with_names = counts.mutate(bucket_name=labeled)
with_names
Out[75]:
   bucket  count   bucket_name
0     NaN  12055          None
1       0  15965       0 to 1m
2       1  15754     1m to 10m
3       2   7839    10m to 50m
4       3   1532   50m to 100m
5       4   1022  100m to 500m
6       5     88    500m to 1b
7       6     37       Over 1b

Let us add another results column, containing the total funding received per bucket.

In [76]:
metrics = (companies.group_by(bucket.name('bucket'))
           .aggregate(count=companies.count(),
                      total_funding=companies.funding_total_usd.sum())
           .mutate(bucket_name=lambda x: x.bucket.label(bucket_names)))
metrics
Out[76]:
   bucket  count  total_funding   bucket_name
0     NaN  12055            NaN          None
1       0  15965   4.505177e+09       0 to 1m
2       1  15754   5.712283e+10     1m to 10m
3       2   7839   1.724166e+11    10m to 50m
4       3   1532   1.054132e+11   50m to 100m
5       4   1022   1.826600e+11  100m to 500m
6       5     88   5.804196e+10    500m to 1b
7       6     37   1.040123e+11       Over 1b
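Once executed, further derived columns are easily added on the pandas side. A hedged sketch computing the average funding per company in each bucket:

m = metrics.execute()  # materialize the ibis expression as a DataFrame
m['avg_funding'] = m['total_funding'] / m['count']
m[['bucket_name', 'count', 'avg_funding']]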

Finally, a more complex query using the buckets and differentiating across the status of the companies.

In [77]:
joined = (companies.mutate(bucket=bucket,
                           status=companies.status.fillna('Unknown'))
          [(companies.founded_at > '2010-01-01') |
           companies.founded_at.isnull()]
          .group_by(['bucket', 'status'])
          .size()
          .mutate(bucket_name=lambda x: (x.bucket.label(bucket_names)
                                         .fillna('Unknown'))))

table = joined.execute()

table.set_index(['status', 'bucket', 'bucket_name'])['count'].unstack('status')
Out[77]:
status                     Unknown  acquired  closed  operating
bucket bucket_name
NaN    Unknown                3208       217     251       5731
0      0 to 1m                 353       227     632      11195
1      1m to 10m               197       530     369       8166
2      10m to 50m               62       276     124       2572
3      50m to 100m               6        47      20        360
4      100m to 500m             16        14      11        255
5      500m to 1b               NaN         2       1         23
6      Over 1b                  NaN         1     NaN          9

Generated SQL Queries

A major benefit of working with ibis is the automatic generation of SQL queries out of Pythonic code/expressions (in the background).

In [78]:
print(ibis.impala.compile(joined))
SELECT *,
  isnull(CASE `bucket`
    WHEN 0 THEN '0 to 1m'
    WHEN 1 THEN '1m to 10m'
    WHEN 2 THEN '10m to 50m'
    WHEN 3 THEN '50m to 100m'
    WHEN 4 THEN '100m to 500m'
    WHEN 5 THEN '500m to 1b'
    WHEN 6 THEN 'Over 1b'
    ELSE NULL
  END, 'Unknown') AS `bucket_name`
FROM (
  SELECT `bucket`, `status`, count(*) AS `count`
  FROM (
    SELECT `permalink`, `name`, `homepage_url`, `category_list`, `market`,
           `funding_total_usd`, isnull(`status`, 'Unknown') AS `status`,
           `country_code`, `state_code`, `region`, `city`, `funding_rounds`,
           `founded_at`, `founded_month`, `founded_quarter`, `founded_year`,
           `first_funding_at`, `last_funding_at`,
      CASE
        WHEN (`funding_total_usd` >= 0) AND (`funding_total_usd` < 1000000) THEN 0
        WHEN (`funding_total_usd` >= 1000000) AND (`funding_total_usd` < 10000000) THEN 1
        WHEN (`funding_total_usd` >= 10000000) AND (`funding_total_usd` < 50000000) THEN 2
        WHEN (`funding_total_usd` >= 50000000) AND (`funding_total_usd` < 100000000) THEN 3
        WHEN (`funding_total_usd` >= 100000000) AND (`funding_total_usd` < 500000000) THEN 4
        WHEN (`funding_total_usd` >= 500000000) AND (`funding_total_usd` <= 1000000000) THEN 5
        WHEN `funding_total_usd` > 1000000000 THEN 6
        ELSE NULL
      END AS `bucket`
    FROM companies
    WHERE (`founded_at` > '2010-01-01') OR `founded_at` IS NULL
  ) t1
  GROUP BY 1, 2
) t0

blaze – Data Blending and Analysis

blaze gives Python users a familiar interface for querying data that lives in diverse data storage systems. Cf. http://blaze.pydata.org/.

In [79]:
import blaze as bz

Simple Example

The first example constructs a blaze.Data object from native Python objects.

In [80]:
t = bz.Data([('Henry', 'boy', 8),
              ('Lilli', 'girl', 14)],
            fields=['name', 'gender', 'age'])
In [81]:
t
Out[81]:
    name gender  age
0  Henry    boy    8
1  Lilli   girl   14
In [82]:
t[t.age > 10]
Out[82]:
    name gender  age
0  Lilli   girl   14
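Aggregations and further column expressions work on such objects as well; two brief, hedged examples:

t.age.mean()  # average age across all rows
t[t.gender == 'girl']['name']  # names of all girls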

Data from NumPy Array

Let us read data from an in-memory NumPy ndarray object.

In [83]:
import numpy as np
In [84]:
a = np.random.standard_normal((1000000, 5))
  # 1mn data rows, 5 columns
In [85]:
df = bz.DataFrame(a, columns=['f0', 'f1', 'f2', 'f3', 'f4'])
  # blaze DataFrame constructor

A look at the data structure.

In [86]:
df.head()
Out[86]:
         f0        f1        f2        f3        f4
0  2.056930  0.029329 -1.001722  1.004306 -0.937600
1 -1.452573 -0.934812  0.729341 -0.325649 -2.133633
2 -0.089087  0.932304  0.543675 -1.040536 -0.685716
3 -0.867179  0.606548  0.211712 -0.010828  0.014817
4  1.679594 -2.270351  1.643111  0.531190  0.602561

The data itself is stored as a NumPy ndarray object.

In [87]:
df.values
Out[87]:
array([[ 2.05693019,  0.02932883, -1.00172196,  1.00430592, -0.93760034],
       [-1.45257295, -0.93481184,  0.72934074, -0.3256488 , -2.13363315],
       [-0.08908673,  0.93230404,  0.54367499, -1.04053639, -0.6857165 ],
       ..., 
       [ 2.18432667, -0.29418322,  0.36483589,  0.27630689, -0.58188977],
       [-1.31239497,  1.1626392 , -0.80855171, -1.04623562, -0.60156394],
       [-1.69589157,  0.40294101, -0.14898031,  0.94228704, -0.51654518]])

Data from CSV File

We first generate a CSV file using the random data from before.

In [88]:
%time df.to_csv(path + 'data.csv', index=False)
CPU times: user 7.73 s, sys: 91 ms, total: 7.82 s
Wall time: 9.46 s

Let us read the data with blaze. Actually, we only generate a view.

In [89]:
%time csv = bz.CSV(path + 'data.csv')
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 41.5 µs
In [90]:
%time t1 = bz.Data(csv)
CPU times: user 6 ms, sys: 0 ns, total: 6 ms
Wall time: 75.8 ms

Now, we can work with the data. Note, however, that iterating, slicing, etc. are not (yet) implemented.

In [91]:
%time t1.count()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 476 µs
Out[91]:
1000000

The backend is a CSV object. And a look at the first rows.

In [92]:
t1.data
Out[92]:
<odo.backends.csv.CSV at 0x7f784e4c6208>
In [93]:
t1
Out[93]:
          f0        f1        f2        f3        f4
0   2.056930  0.029329 -1.001722  1.004306 -0.937600
1  -1.452573 -0.934812  0.729341 -0.325649 -2.133633
2  -0.089087  0.932304  0.543675 -1.040536 -0.685716
3  -0.867179  0.606548  0.211712 -0.010828  0.014817
4   1.679594 -2.270351  1.643111  0.531190  0.602561
5  -1.164137  0.963810  0.277807 -0.365502  0.731875
6  -0.587053 -1.063658 -1.765022 -0.193745 -1.193952
7  -0.122269  0.405408  0.161799 -1.787346 -1.891628
8   0.292853  1.062099  1.130397 -0.481681  0.495915
9  -0.735962 -0.570641 -0.535854  0.029475 -0.612822
10 -0.028088 -1.269165  0.077696  0.139571 -0.914147

Data from SQL

We now generate a SQLite3 table with the dummy data from before.

In [94]:
import sqlite3 as sq3
In [95]:
con = sq3.connect(path + 'data.sql')
try:
    con.execute('DROP TABLE numbers')
    # delete in case it exists
except:
    pass

We write the data into an appropriate table.

In [96]:
con.execute(
    'CREATE TABLE numbers (f0 real, f1 real, f2 real, f3 real, f4 real)'
    )
Out[96]:
<sqlite3.Cursor at 0x7f784e4fb9d0>
In [97]:
%time con.executemany('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', a)
CPU times: user 7.23 s, sys: 93 ms, total: 7.32 s
Wall time: 8.16 s
Out[97]:
<sqlite3.Cursor at 0x7f784e4fb960>
In [98]:
con.commit()
In [99]:
con.close()

Now reading the data with blaze (i.e. just generating a view).

In [100]:
%time t2 = bz.Data('sqlite:///%sdata.sql::numbers' % path)
CPU times: user 7 ms, sys: 0 ns, total: 7 ms
Wall time: 7.03 ms

The SQL backend and the first rows again.

In [101]:
t2
Out[101]:
          f0        f1        f2        f3        f4
0   2.056930  0.029329 -1.001722  1.004306 -0.937600
1  -1.452573 -0.934812  0.729341 -0.325649 -2.133633
2  -0.089087  0.932304  0.543675 -1.040536 -0.685716
3  -0.867179  0.606548  0.211712 -0.010828  0.014817
4   1.679594 -2.270351  1.643111  0.531190  0.602561
5  -1.164137  0.963810  0.277807 -0.365502  0.731875
6  -0.587053 -1.063658 -1.765022 -0.193745 -1.193952
7  -0.122269  0.405408  0.161799 -1.787346 -1.891628
8   0.292853  1.062099  1.130397 -0.481681  0.495915
9  -0.735962 -0.570641 -0.535854  0.029475 -0.612822
10 -0.028088 -1.269165  0.077696  0.139571 -0.914147

Working with the blaze Objects

blaze provides an abstraction logic for computations/queries.

In [102]:
# generic table description -- independent of the target data structure
ts = bz.TableSymbol('ts',
        '{f0: float64, f1: float64, f2: float64, f3: float64, f4: float64}')
# generic expression -- independent of the target data structure
expr = ts[ts['f0'] + ts['f3'] > 2.5]['f1']

The blaze compiler specializes the generic objects to different data structures.

In [103]:
%time np.array(bz.compute(expr, a))  # NumPy ndarray object
CPU times: user 40 ms, sys: 4 ms, total: 44 ms
Wall time: 607 ms
Out[103]:
array([ 0.02932883, -1.23164119,  0.67050742, ...,  1.6657984 ,
       -1.19846161, -1.30715482])
In [104]:
%time np.array(bz.compute(expr, df))  # DataFrame object
CPU times: user 19 ms, sys: 0 ns, total: 19 ms
Wall time: 9.78 ms
Out[104]:
array([ 0.02932883, -1.23164119,  0.67050742, ...,  1.6657984 ,
       -1.19846161, -1.30715482])
In [105]:
%time np.array(bz.compute(expr, csv))  # CSV file representation
CPU times: user 565 ms, sys: 30 ms, total: 595 ms
Wall time: 603 ms
Out[105]:
array([ 0.02932883, -1.23164119,  0.67050742, ...,  1.6657984 ,
       -1.19846161, -1.30715482])

In similar fashion, blaze allows unified expression evaluations for different backends (I).

In [106]:
%time t1[t1['f0'] + t1['f3'] > 2.5]['f1'].head()
  # table representation 1
  # from CSV
CPU times: user 2 ms, sys: 0 ns, total: 2 ms
Wall time: 1.92 ms
Out[106]:
           f1
0    0.029329
18  -1.231641
36   0.670507
39   0.717779
68  -0.224904
111 -1.886033
144 -0.673477
158 -0.350397
161 -1.144460
174 -0.295445

In similar fashion, blaze allows unified expression evaluations for different backends (II).

In [107]:
%time t2[t2['f0'] + t2['f3'] > 2.5]['f1'].head()
  # table representation 2
  # from SQL database
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.57 ms
Out[107]:
         f1
0  0.029329
1 -1.231641
2  0.670507
3  0.717779
4 -0.224904
5 -1.886033
6 -0.673477
7 -0.350397
8 -1.144460
9 -0.295445

Typical aggregation operations work as well.

In [108]:
%time t1.f0.sum()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 433 µs
Out[108]:
-1816.535944971288
In [109]:
%time t2.f3.max()
CPU times: user 1 ms, sys: 0 ns, total: 1 ms
Wall time: 625 µs
Out[109]:
4.538421437830834

Transforming Data Formats

If you work intensively with data sets, it might be beneficial to transform them once into highly performant binary data formats (e.g. bcolz or HDF5).

Using bcolz as Data Store

In [110]:
%time bz.into(path + 'data.bcolz', path + 'data.csv')
  # natively done by blaze
  # cparams=bcolz.cparams(9) could be added
  # no effect here due to random floats
CPU times: user 960 ms, sys: 99 ms, total: 1.06 s
Wall time: 1.06 s
Out[110]:
ctable((1000000,), [('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
  nbytes: 38.15 MB; cbytes: 38.53 MB; ratio: 0.99
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/data.bcolz'
[ (2.0569301865053324, 0.029328830260023194, -1.0017219637000396, 1.0043059163971206, -0.937600336039536)
 (-1.4525729543596462, -0.9348118352288594, 0.729340739702494, -0.3256487991023368, -2.1336331523344527)
 (-0.08908673382989611, 0.9323040448749188, 0.5436749909190117, -1.0405363851357037, -0.6857164993229661)
 ...,
 (2.1843266726204096, -0.2941832224261781, 0.3648358949245896, 0.2763068869380623, -0.5818897659149579)
 (-1.3123949679575355, 1.1626392014644351, -0.8085517108946355, -1.0462356207393482, -0.6015639357751871)
 (-1.6958915689699, 0.40294100740191385, -0.1489803134025054, 0.9422870416487872, -0.5165451785994732)]

We can now connect to the bcolz disk-based ctable object.

In [111]:
import bcolz as bc
In [112]:
b = bc.ctable(rootdir=path + 'data.bcolz') 

Now the power of bcolz for numerical computations can be brought to bear.

In [113]:
%time nex = b.eval('sqrt(abs(f0)) + log(abs(f1))')
CPU times: user 139 ms, sys: 12 ms, total: 151 ms
Wall time: 50.6 ms
In [114]:
nex
Out[114]:
carray((1000000,), float64)
  nbytes: 7.63 MB; cbytes: 7.45 MB; ratio: 1.02
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[-2.09498409  1.13781733  0.22837772 ...,  0.25439428  1.29629068
  0.39329891]
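The resulting carray is easily materialized for further analysis. A hedged sketch pulling the decompressed values into pandas:

import pandas as pd
s = pd.Series(nex[:])  # nex[:] decompresses into a NumPy ndarray
s.describe()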

Using HDF5

Similarly, we can use PyTables and HDF5 as an efficient binary store.

In [115]:
import pandas as pd
In [116]:
%%time
con = sq3.connect(path + 'data.sql')
pd.HDFStore(path + 'data.h5')['sql'] = pd.read_sql('SELECT * FROM numbers', con)
  # simultaneously reading whole SQL table and writing it to HDF5 store
con.close()
CPU times: user 1.41 s, sys: 142 ms, total: 1.55 s
Wall time: 1.66 s

Now, data can be efficiently retrieved.

In [117]:
%%time
%matplotlib inline
pd.HDFStore(path + 'data.h5')['sql'][::1000].cumsum().plot(figsize=(10, 6))
  # simultaneously reading data from HDF5 store and plotting it
CPU times: user 63 ms, sys: 12 ms, total: 75 ms
Wall time: 156 ms
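Equivalently, the stored table can be pulled back in a single call via pd.read_hdf; a brief, hedged sketch:

%time df_h5 = pd.read_hdf(path + 'data.h5', 'sql')
df_h5.head()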

Cleaning Up

In [118]:
!du -h $path
12K	/notebooks/fpq/yves/data/data.bcolz/f2/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f2/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f2
12K	/notebooks/fpq/yves/data/data.bcolz/f4/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f4/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f4
12K	/notebooks/fpq/yves/data/data.bcolz/f1/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f1/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f1
12K	/notebooks/fpq/yves/data/data.bcolz/f0/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f0/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f0
12K	/notebooks/fpq/yves/data/data.bcolz/f3/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f3/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f3
48M	/notebooks/fpq/yves/data/data.bcolz
282M	/notebooks/fpq/yves/data/
In [119]:
!ls -n $path/dat*.*
-rw-r--r-- 1 3613 8 98156364 Apr 27 12:30 /notebooks/fpq/yves/data//data.csv
-rw-r--r-- 1 3613 8 48007208 Apr 27 12:30 /notebooks/fpq/yves/data//data.h5
-rw-r--r-- 1 3613 8 54446080 Apr 27 12:30 /notebooks/fpq/yves/data//data.sql

/notebooks/fpq/yves/data//data.bcolz:
total 28
-rw-r--r-- 1 3613 8    3 Apr 27 12:30 __attrs__
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f0
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f1
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f2
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f3
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f4
-rw-r--r-- 1 3613 8   42 Apr 27 12:30 __rootdirs__
In [120]:
# cleaning up
!rm -r $path/dat*.*

Conclusion

High performance (hardware-bound) I/O operations and highly efficient data blending and analytics are among Python's key strengths.

The Python Quants

http://tpq.io | yves@tpq.io | @dyjh | hilpisch.com

Quant Platform | quant-platform.com

Python for Finance | python-for-finance.com

Derivatives Analytics with Python | derivatives-analytics-with-python.com

Python for Finance Trainings | training.tpq.io