The Python Quants

Technical News from the Python Financial Analytics Front

Dr. Yves J. Hilpisch

The Python Quants GmbH

team@tpq.io | http://tpq.io

Quick Overview

In this talk, I cover the following libraries:

  • TsTables — High Performance Time Series Management
  • bcolz — High Performance Columnar Data Store
  • ibis — Efficient and Performant Handling of SQL Sources
  • blaze — Data Blending and Analysis

TsTables – High Performance Time Series Management

TsTables is a Python library by Andy Fiedler built on top of the popular PyTables HDF5 database library. It is meant to handle large amounts of high-frequency time series data in "append once, retrieve many times" scenarios (cf. the GitHub page). The focus lies on retrieving chunks of data from large data sets as quickly as possible.
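To give a sense of the API before the detailed walk-through, here is a minimal, hedged sketch of the typical workflow; the file name, column layout and dates are purely illustrative and not part of the original example.

import numpy as np
import pandas as pd
import tables as tb
import tstables  # importing tstables adds create_ts() to PyTables File objects
from datetime import datetime

class Prices(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)  # integer timestamp column required by TsTables
    price = tb.Float64Col(pos=1)

# a tiny illustrative DataFrame with a DatetimeIndex
idx = pd.date_range('2017-1-1', periods=10, freq='1s')
data = pd.DataFrame({'price': np.arange(10.)}, index=idx)

h5 = tb.open_file('prices.h5', 'w')
ts = h5.create_ts('/', 'PRICES', Prices)  # TsTables-managed time series group
ts.append(data)  # append once ...
rows = ts.read_range(datetime(2017, 1, 1),  # ... retrieve many times
                     datetime(2017, 1, 2))
h5.close()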

In [1]:
import os
import numpy as np
import pandas as pd
import tables as tb
import tstables as tstb
import random
from time import time
from datetime import datetime
import seaborn as sns; sns.set()
%matplotlib inline

Sample Time Series Data

Let us generate a decent number of sample data points.

In [2]:
no = 30000000  # number of data points
co = 3  # number of time series (columns)
dt = 1. / (12 * 30 * 24 * 60)  # time step as year fraction
vol = 0.2  # volatility

We generate data at one-second intervals.

In [3]:
dr = pd.date_range('2017-1-1', periods=no, freq='1s')
In [4]:
dr
Out[4]:
DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 00:00:01',
               '2017-01-01 00:00:02', '2017-01-01 00:00:03',
               '2017-01-01 00:00:04', '2017-01-01 00:00:05',
               '2017-01-01 00:00:06', '2017-01-01 00:00:07',
               '2017-01-01 00:00:08', '2017-01-01 00:00:09',
               ...
               '2017-12-14 05:19:50', '2017-12-14 05:19:51',
               '2017-12-14 05:19:52', '2017-12-14 05:19:53',
               '2017-12-14 05:19:54', '2017-12-14 05:19:55',
               '2017-12-14 05:19:56', '2017-12-14 05:19:57',
               '2017-12-14 05:19:58', '2017-12-14 05:19:59'],
              dtype='datetime64[ns]', length=30000000, freq='S')

In-memory generation is quite quick.

In [5]:
%%time
# simulate driftless geometric Brownian motion paths (log-Euler scheme)
da = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * dt +
        vol * np.sqrt(dt) * np.random.standard_normal((no, co)), axis=0))
da[0] = 100  # set the initial values of the three paths to exactly 100
CPU times: user 5.76 s, sys: 854 ms, total: 6.62 s
Wall time: 6.61 s
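For reference, the vectorized cell above simulates driftless geometric Brownian motion via the standard log-Euler scheme; per time step and per series it computes

S_t = S_(t - dt) * exp(-0.5 * vol ** 2 * dt + vol * sqrt(dt) * z_t),  z_t ~ N(0, 1)

with S_0 = 100, which is exactly what the cumulative sum inside np.exp vectorizes over all 30 million steps.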
In [6]:
df = pd.DataFrame(da, index=dr, columns=['ts1', 'ts2', 'ts3'])
In [7]:
df.count()
Out[7]:
ts1    30000000
ts2    30000000
ts3    30000000
dtype: int64

The starting values of the three time series.

In [8]:
df.head()
Out[8]:
                            ts1         ts2         ts3
2017-01-01 00:00:00  100.000000  100.000000  100.000000
2017-01-01 00:00:01   99.972947  100.017864   99.962609
2017-01-01 00:00:02   99.990383   99.991254   99.951958
2017-01-01 00:00:03   99.957912  100.027486   99.976107
2017-01-01 00:00:04   99.953263  100.006583  100.010659

And a plot of the time series data (every 100,000th data point).

In [9]:
df[::100000].plot(figsize=(10, 6));

Storage and Retrieval with TsTables

To store the time series data in a PyTables table, we first define the table structure.

In [10]:
class TS(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)  # integer timestamp column required by TsTables
    ts1 = tb.Float64Col(pos=1)
    ts2 = tb.Float64Col(pos=2)
    ts3 = tb.Float64Col(pos=3)

Second, we open a database file and create the table object.

In [11]:
path = '/notebooks/fpq/yves/data/'
os.environ['path'] = path
In [12]:
h5 = tb.open_file(path + 'ts.h5', 'w')

TsTables adds a new function create_ts to PyTables.

In [13]:
ts = h5.create_ts('/','TS', TS)

Third, we append the time series data to the table object.

In [14]:
%time ts.append(df)
CPU times: user 2.49 s, sys: 2.19 s, total: 4.68 s
Wall time: 46.7 s
In [15]:
!ls -n $path/ts*
-rw-r--r-- 1 3613 8 935432352 Apr 27 12:26 /notebooks/fpq/yves/data//ts.h5

The approach of TsTables is to apply a highly structured storage hierarchy.

In [16]:
a = str(h5)
print(a[:508])
/notebooks/fpq/yves/data/ts.h5 (File) ''
Last modif.: 'Wed Apr 27 12:26:02 2016'
Object Tree: 
/ (RootGroup) ''
/TS (Group/Timeseries) ''
/TS/y2016 (Group) ''
/TS/y2017 (Group) ''
/TS/y2017/m01 (Group) ''
/TS/y2017/m02 (Group) ''
/TS/y2017/m03 (Group) ''
/TS/y2017/m04 (Group) ''
/TS/y2017/m05 (Group) ''
/TS/y2017/m06 (Group) ''
/TS/y2017/m07 (Group) ''
/TS/y2017/m08 (Group) ''
/TS/y2017/m09 (Group) ''
/TS/y2017/m10 (Group) ''
/TS/y2017/m11 (Group) ''
/TS/y2017/m12 (Group) ''
/TS/y2017/m12/d01 (Group) ''

The strength of TsTables lies in retrieving chunks of time series data defined by a start date and an end date (which obviously is a typical case in finance, e.g. in backtesting strategies or risk management).

In [17]:
read_start_dt = datetime(2017, 2, 1, 0, 0)
read_end_dt = datetime(2017, 2, 2, 0, 0)

TsTables tries to make such an operation as fast as possible.

In [18]:
%time rows = ts.read_range(read_start_dt, read_end_dt)
CPU times: user 11 ms, sys: 4 ms, total: 15 ms
Wall time: 262 ms

Let us try it with random intervals.

In [19]:
t0 = time()
its = 100
for _ in range(its):
    day = random.randint(1, 27)
    read_start_dt = datetime(2017, 2, day, 0, 0)
    read_end_dt = datetime(2017, 2, day + 1, 0, 0)
    rows = ts.read_range(read_start_dt, read_end_dt)
t1 = time()

The results are convincing.

In [20]:
print("time for %d random accesses    %5.3f seconds" %(its, t1 - t0))
print("average time for random access  %5.3f seconds" %((t1 - t0) / its))
time for 100 random accesses    0.870 seconds
average time for random access  0.009 seconds

Conveniently, the returned object is a pandas DataFrame.

In [21]:
rows.count()
Out[21]:
ts1    86401
ts2    86401
ts3    86401
dtype: int64
In [22]:
rows.head()
Out[22]:
                           ts1         ts2        ts3
2017-02-08 00:00:00  41.714617  112.266922  63.477136
2017-02-08 00:00:01  41.721620  112.260084  63.494282
2017-02-08 00:00:02  41.723495  112.327356  63.500059
2017-02-08 00:00:03  41.734147  112.326363  63.540484
2017-02-08 00:00:04  41.708918  112.336727  63.570569
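Since the returned object is a regular DataFrame, the usual pandas machinery applies directly. As a hedged illustration (not part of the original example), the second-level data could be resampled to one-minute bars:

bars = rows['ts1'].resample('1min').ohlc()
  # open/high/low/close per minute for the first series
bars.head()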

A look at a sub-set of the retrieved data (every 500th point).

In [23]:
rows[::500].plot(figsize=(10, 6));
In [24]:
h5.close()
In [25]:
!rm $path/ts.h5

bcolz – High Performance Columnar Data Store

bcolz is a columnar data store for fast data storage and retrieval with built-in high performance compression. It supports both in-memory and out-of-memory storage and operations. Cf. http://bcolz.blosc.org/.

In [26]:
import bcolz

ctable Example

The first example is based on the ctable class for data in table format. The example data set is roughly 1 GB in size (uncompressed).

In [27]:
N = 100000 * 1000
print(N)
100000000

In-Memory Storage

We first generate an in-memory object using high compression. Since we work with integers, good compression ratios are to be expected. It takes about 45 seconds to generate the ctable object from a generator via the fromiter method.

In [28]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                    dtype="i4, i8",
                    count=N,
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 46.4 s, sys: 128 ms, total: 46.5 s
Wall time: 45 s

The in-memory size is only about 150 MB, which translates into a compression ratio of 7+.

In [29]:
ct
Out[29]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 151.84 MB; ratio: 7.54
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]
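Compression behavior is controlled via bcolz.cparams. As a hedged sketch (assuming the lz4 codec is available in the local Blosc build), one could trade compression ratio against speed by choosing a different codec and level:

ct_lz4 = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                        dtype="i4, i8",
                        count=N,
                        cparams=bcolz.cparams(clevel=5, cname='lz4'))
ct_lz4  # compare nbytes/cbytes with the blosclz result above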

You can now implement fast numerical operations on this data object (note that the output is a carray object).

In [30]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.55 s, sys: 442 ms, total: 4.99 s
Wall time: 1.17 s
Out[30]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]

Disk-Based Storage

The same tasks can be implemented with disk-based storage; to this end, we only need to specify the rootdir parameter. At about 50 seconds, generation takes a bit longer on disk.

In [31]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                    dtype="i4, i8",
                    count=N, rootdir=path + 'ct',
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 45.9 s, sys: 289 ms, total: 46.2 s
Wall time: 49.3 s

Everything else (especially the object handling) remains almost the same, however.

In [32]:
ct
Out[32]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 151.84 MB; ratio: 7.54
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/ct'
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]

The numerical operations work in the same fashion and hardly take any longer, thanks to native multithreading and optimized caching.

In [33]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.58 s, sys: 514 ms, total: 5.09 s
Wall time: 1.19 s
Out[33]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]

Let us finally verify system disk usage.

In [34]:
!ls $path/ct
__attrs__  f0  f1  __rootdirs__
In [35]:
!du -hs $path/ct
# system disk usage
156M	/notebooks/fpq/yves/data//ct
In [36]:
!rm -r $path/ct

carray Example

This example is about mid-sized data that (in general) would not fit into memory without compression.

In [37]:
import numpy as np

As the basis, we generate a NumPy ndarray object of 32 MB in size.

In [38]:
n = 2000
a = np.arange(n * n).reshape(n, n) 
a.nbytes
Out[38]:
32000000

In-Memory Storage

Let us first work in-memory again. The carray object holds the initial ndarray object plus 4,000 appended copies. The in-memory generation takes just under 30 seconds.

In [39]:
%%time
it = 4000
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 27.3 s, sys: 317 ms, total: 27.6 s
Wall time: 27.9 s

The carray object stores 120 GB worth of data in less than 1 GB of memory, for a compression ratio of more than 130.

In [40]:
ca
Out[40]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]

Let us implement the evaluation of a numerical expression on this data set. The syntax and handling are the same as with NumPy ndarray objects.

In [41]:
%time ca[:5000] ** 2 + np.sqrt(ca[10000:15000])
CPU times: user 140 ms, sys: 97 ms, total: 237 ms
Wall time: 236 ms
Out[41]:
array([[  0.00000000e+00,   2.00000000e+00,   5.41421356e+00, ...,
          3.98805369e+06,   3.99204870e+06,   3.99604571e+06],
       [  4.00004472e+06,   4.00404573e+06,   4.00804874e+06, ...,
          1.59760722e+07,   1.59840672e+07,   1.59920642e+07],
       [  1.60000632e+07,   1.60080643e+07,   1.60160673e+07, ...,
          3.59640864e+07,   3.59760814e+07,   3.59880785e+07],
       ..., 
       [  3.97603600e+12,   3.97603999e+12,   3.97604398e+12, ...,
          3.98400403e+12,   3.98400802e+12,   3.98401201e+12],
       [  3.98401600e+12,   3.98401999e+12,   3.98402399e+12, ...,
          3.99199201e+12,   3.99199601e+12,   3.99200001e+12],
       [  3.99200400e+12,   3.99200800e+12,   3.99201199e+12, ...,
          3.99998800e+12,   3.99999200e+12,   3.99999600e+12]])

Another approach is to use the eval function of bcolz.

In [42]:
x = ca[:10000]  # 10,000 rows as sub-set
In [43]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(clevel=9))
  # output carray object compressed
CPU times: user 746 ms, sys: 55 ms, total: 801 ms
Wall time: 216 ms
Out[43]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Disk-Based Storage

Now for disk-based storage of multiple copies of the array object. We append the object 4,000 times to a single disk-based carray object. It takes only a bit more than a minute to compress and store 120 GB worth of data on disk.

In [44]:
%%time
it = 4000
ca = bcolz.carray(a, rootdir=path + 'ca',
                 cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 33 s, sys: 4.62 s, total: 37.6 s
Wall time: 1min 9s

The compression ratio in this case is again 130+.

In [45]:
ca
Out[45]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/ca'
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]

Simple numerical operations are easy to implement.

In [46]:
%time np.sum(ca[:1000] + ca[4000:5000])
CPU times: user 53 ms, sys: 4 ms, total: 57 ms
Wall time: 56.6 ms
Out[46]:
3999998000000

Let us try the previous, mathematically more demanding operation – again with a sub-set of the data.

In [47]:
x = ca[:10000]  # 10,000 rows as sub-set

First, with an in-memory carray results object.

In [48]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9))
CPU times: user 732 ms, sys: 17 ms, total: 749 ms
Wall time: 205 ms
Out[48]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Second, with an on-disk results object. The time difference compared to the in-memory case is not that large.

In [49]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9), rootdir=path + 'out')
CPU times: user 848 ms, sys: 62 ms, total: 910 ms
Wall time: 397 ms
Out[49]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/out'
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Finally, we verify system disk usage.

In [50]:
# system disk usage data object
!du -hs $path/ca
985M	/notebooks/fpq/yves/data//ca
In [51]:
# system disk usage results object
!du -hs $path/out
39M	/notebooks/fpq/yves/data//out
In [52]:
!rm -r $path/out
!rm -r $path/ca

ibis — Efficient and Performant Handling of SQL Sources

ibis is one of the newer projects by Wes McKinney. The content that follows is based on the tutorial found at http://blog.ibis-project.org/sqlite-crunchbase-quickstart/.

Importing & Database Connection

First, let us establish a connection to a SQLite3 database.

In [53]:
import ibis
ibis.options.interactive = True
In [54]:
con = ibis.sqlite.connect('crunchbase.db')

Basic Operations and Lookups

Certain operations and lookups are easily accomplished without any SQL (1).

In [55]:
con.list_tables()
Out[55]:
['acquisitions', 'companies', 'investments', 'rounds']

Certain operations and lookups are easily accomplished without any SQL (2).

In [56]:
rounds = con.table('rounds')
rounds.info()
Table rows: 87161

Column                   Type    Non-null #
------                   ----    ----------
company_permalink        string  87161     
company_name             string  87161     
company_category_list    string  83464     
company_market           string  80523     
company_country_code     string  81291     
company_state_code       string  59296     
company_region           string  81016     
company_city             string  80208     
funding_round_permalink  string  87161     
funding_round_type       string  87161     
funding_round_code       string  23648     
funded_at                string  87161     
funded_month             string  87147     
funded_quarter           string  87147     
funded_year              int64   87147     
raised_amount_usd        double  73406     

Certain operations and lookups are easily accomplished without any SQL (3).

In [57]:
rounds.funding_round_type.value_counts()
Out[57]:
       funding_round_type  count
0                   angel   4602
1        convertible_note    838
2          debt_financing   5869
3     equity_crowdfunding   2401
4                   grant   1523
5   non_equity_assistance     69
6           post_ipo_debt     92
7         post_ipo_equity    425
8          private_equity   1864
9    product_crowdfunding    241
10       secondary_market     61
11                   seed  22053
12            undisclosed   4128
13                venture  42995
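Filtering and projecting work in the same declarative fashion. A hedged sketch using column names from the info() output above (the 100 million USD threshold is arbitrary):

big_rounds = rounds[rounds.raised_amount_usd > 1e8]
expr = big_rounds[['company_name', 'funding_round_type',
                   'raised_amount_usd']].limit(5)
expr.execute()  # returns a pandas DataFrame with five of the matches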

Instead of writing SQL, you create Pythonic expressions with ibis.

In [58]:
acquisitions = con.table('acquisitions')
expr = (acquisitions.price_amount
        .isnull()
        .name('has_price')
        .value_counts())
expr
Out[58]:
   has_price  count
0          0   3816
1          1  51424

Executing expressions gives back DataFrame objects by default for table-like results.

In [59]:
df = expr.execute()
df
Out[59]:
   has_price  count
0          0   3816
1          1  51424
In [60]:
type(expr)
Out[60]:
ibis.expr.types.TableExpr
In [61]:
type(df)
Out[61]:
pandas.core.frame.DataFrame

Another case is a single-value (scalar) result.

In [62]:
companies = con.table('companies')
expr = companies.funding_total_usd.mean()
type(expr)
Out[62]:
ibis.expr.types.DoubleScalar
In [63]:
expr.execute()
Out[63]:
16198405.416388474
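Aggregations can also be grouped. For example, a hedged sketch of the average total funding by country (country_code is a column of the companies table, as the generated SQL further below confirms):

expr = (companies
        .group_by('country_code')
        .aggregate(avg_funding=companies.funding_total_usd.mean()))
expr.limit(10).execute()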

Funding Metrics

Let us have a look at some funding metrics. The time stamps do not seem to be entirely accurate (note the implausible years below).

In [64]:
funded_at = rounds.funded_at.cast('timestamp')
funded_at.year().value_counts()
Out[64]:
    unnamed  count
0         1      3
1         7      1
2        11      1
3        20      2
4        29      1
5        30      1
6       201      1
7       203      1
8       214      1
9      1921      1
10     1960      3
11     1973      1
12     1974      2
13     1979      1
14     1982      3
15     1983      1
16     1984      4
17     1985      6
18     1986      4
19     1987      6
20     1988      2
21     1989      4
22     1990     14
23     1991     10
24     1992     12
25     1993     15
26     1994     15
27     1995     21
28     1996     32
29     1997     30
30     1998     62
31     1999    178
32     2000    267
33     2001    200
34     2002    207
35     2003    271
36     2004    434
37     2005   1762
38     2006   2663
39     2007   3562
40     2008   4257
41     2009   6553
42     2010   8692
43     2011  10550
44     2012  12648
45     2013  16194
46     2014  17670
47     2015    790
48     2104      1
49     2105      1

Another look at different funding stages.

In [65]:
rounds.funding_round_code.value_counts()
Out[65]:
  funding_round_code  count
0               None  63513
1                  A  11382
2                  B   6548
3                  C   3329
4                  D   1530
5                  E    608
6                  F    201
7                  G     45
8                  H      5

Next, a somewhat more complex analysis ...

In [66]:
year = funded_at.year().name('year')
type(year)
Out[66]:
ibis.expr.types.Int32Array
In [67]:
year
Out[67]:
0       2015
1       2014
2       2012
3       2010
4       2010
5       2012
6       2011
7       2014
8       2014
9       2013
10      2007
11      2008
12      2014
13      2014
14      2013
15      2013
16      2010
17      2011
18      2014
19      2013
20      2014
21      2010
22      2011
23      2009
24      2011
25      2011
26      2009
27      2011
28      2010
29      2011
        ... 
9970    2010
9971    2011
9972    2010
9973    2012
9974    2011
9975    2010
9976    2007
9977    2006
9978    2004
9979    2011
9980    2013
9981    2006
9982    2014
9983    2013
9984    2014
9985    2009
9986    2011
9987    2014
9988    2006
9989    2012
9990    2014
9991    2012
9992    2012
9993    2013
9994    2014
9995    2014
9996    2009
9997    2012
9998    2013
9999    2011
Name: tmp, dtype: int64

... selecting venture fundings between 2000 and 2015 and grouping them by year and funding round code. The syntax/API is similar to pandas.

In [68]:
expr = (rounds[(rounds.funding_round_type == 'venture') &
               year.between(2000, 2015) &
               rounds.funding_round_code.notnull()]
        .group_by([year, 'funding_round_code'])
        .size())

A selection of the results.

In [69]:
results = expr.execute()
type(results)
Out[69]:
pandas.core.frame.DataFrame
In [70]:
results[:15]
Out[70]:
    year funding_round_code  count
0   2000                  A     79
1   2000                  B     32
2   2000                  C     10
3   2000                  D      8
4   2000                  E      2
5   2001                  A     50
6   2001                  B     30
7   2001                  C     17
8   2001                  D      4
9   2001                  E      1
10  2002                  A     35
11  2002                  B     39
12  2002                  C     25
13  2002                  D      5
14  2002                  E      2

Pivot tables are quickly generated out of such results objects.

In [71]:
pivoted = (results.set_index(['year', 'funding_round_code'])
           .unstack('funding_round_code')
           .fillna(0))
pivoted
Out[71]:
                   count
funding_round_code     A    B    C    D    E   F   G  H
year
2000                  79   32   10    8    2   0   0  0
2001                  50   30   17    4    1   0   0  0
2002                  35   39   25    5    2   2   0  0
2003                  68   45   14   13    1   0   1  0
2004                 146   76   32   15    3   1   0  0
2005                 513  334  176   67   26   6   0  0
2006                 717  465  226   91   35   7   1  0
2007                 956  583  281  110   49   7   1  0
2008                 979  653  308  120   54  17   1  0
2009                 753  531  290  147   55  28   0  0
2010                1013  598  369  149   52  18   2  0
2011                1250  700  334  175   60  18   5  0
2012                1242  610  345  184   69  16   7  0
2013                1606  796  377  185   81  38   6  0
2014                1757  952  471  223  108  36  18  5
2015                  88   71   34   28    8   5   3  0
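Such a pivot table lends itself to a quick visualization. A hedged sketch using the pandas plotting interface (assuming the pivoted object from above):

pivoted['count'].plot(kind='bar', stacked=True, figsize=(10, 6));
  # stacked bars: venture rounds per year, split by funding round code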

Now an analysis with regard to funding sum buckets.

In [72]:
funding_buckets = [0, 1000000, 10000000, 50000000, 100000000, 
                   500000000, 1000000000]
In [73]:
bucket = (companies
          .funding_total_usd
          .bucket(funding_buckets, include_over=True))
bucket.value_counts()
Out[73]:
   unnamed  count
0      NaN  12055
1        0  15965
2        1  15754
3        2   7839
4        3   1532
5        4   1022
6        5     88
7        6     37

Adding bucket names makes the results output more meaningful.

In [74]:
bucket_names = ['0 to 1m', '1m to 10m', '10m to 50m', 
                '50m to 100m', '100m to 500m',
                '500m to 1b', 'Over 1b']
In [75]:
counts = bucket.name('bucket').value_counts()
labeled = counts.bucket.label(bucket_names)
with_names = counts.mutate(bucket_name=labeled)
with_names
Out[75]:
   bucket  count   bucket_name
0     NaN  12055          None
1       0  15965       0 to 1m
2       1  15754     1m to 10m
3       2   7839    10m to 50m
4       3   1532   50m to 100m
5       4   1022  100m to 500m
6       5     88    500m to 1b
7       6     37       Over 1b

Let us add another results column, containing the total funding received per bucket.

In [76]:
metrics = (companies.group_by(bucket.name('bucket'))
           .aggregate(count=companies.count(),
                      total_funding=companies.funding_total_usd.sum())
           .mutate(bucket_name=lambda x: x.bucket.label(bucket_names)))
metrics
Out[76]:
   bucket  count  total_funding   bucket_name
0     NaN  12055            NaN          None
1       0  15965   4.505177e+09       0 to 1m
2       1  15754   5.712283e+10     1m to 10m
3       2   7839   1.724166e+11    10m to 50m
4       3   1532   1.054132e+11   50m to 100m
5       4   1022   1.826600e+11  100m to 500m
6       5     88   5.804196e+10    500m to 1b
7       6     37   1.040123e+11       Over 1b
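Once executed, further derived columns are easily added on the pandas side. A hedged sketch computing the average funding per company in each bucket:

m = metrics.execute()  # materialize the ibis expression as a DataFrame
m['avg_funding'] = m['total_funding'] / m['count']
m[['bucket_name', 'count', 'avg_funding']]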

Finally, a more complex query using the buckets and differentiating across the status of the companies.

In [77]:
joined = (companies.mutate(bucket=bucket,
                           status=companies.status.fillna('Unknown'))
          [(companies.founded_at > '2010-01-01') |
           companies.founded_at.isnull()]
          .group_by(['bucket', 'status'])
          .size()
          .mutate(bucket_name=lambda x: (x.bucket.label(bucket_names)
                                         .fillna('Unknown'))))

table = joined.execute()

table.set_index(['status', 'bucket', 'bucket_name'])['count'].unstack('status')
Out[77]:
status                     Unknown  acquired  closed  operating
bucket bucket_name
NaN    Unknown                3208       217     251       5731
0      0 to 1m                 353       227     632      11195
1      1m to 10m               197       530     369       8166
2      10m to 50m               62       276     124       2572
3      50m to 100m               6        47      20        360
4      100m to 500m             16        14      11        255
5      500m to 1b               NaN         2       1         23
6      Over 1b                  NaN         1     NaN          9

Generated SQL Queries

A major benefit of working with ibis is the automatic generation of SQL queries out of Pythonic code/expressions (in the background).

In [78]:
print(ibis.impala.compile(joined))
SELECT *,
  isnull(CASE `bucket`
    WHEN 0 THEN '0 to 1m'
    WHEN 1 THEN '1m to 10m'
    WHEN 2 THEN '10m to 50m'
    WHEN 3 THEN '50m to 100m'
    WHEN 4 THEN '100m to 500m'
    WHEN 5 THEN '500m to 1b'
    WHEN 6 THEN 'Over 1b'
    ELSE NULL
  END, 'Unknown') AS `bucket_name`
FROM (
  SELECT `bucket`, `status`, count(*) AS `count`
  FROM (
    SELECT `permalink`, `name`, `homepage_url`, `category_list`, `market`,
           `funding_total_usd`, isnull(`status`, 'Unknown') AS `status`,
           `country_code`, `state_code`, `region`, `city`, `funding_rounds`,
           `founded_at`, `founded_month`, `founded_quarter`, `founded_year`,
           `first_funding_at`, `last_funding_at`,
      CASE
        WHEN (`funding_total_usd` >= 0) AND (`funding_total_usd` < 1000000) THEN 0
        WHEN (`funding_total_usd` >= 1000000) AND (`funding_total_usd` < 10000000) THEN 1
        WHEN (`funding_total_usd` >= 10000000) AND (`funding_total_usd` < 50000000) THEN 2
        WHEN (`funding_total_usd` >= 50000000) AND (`funding_total_usd` < 100000000) THEN 3
        WHEN (`funding_total_usd` >= 100000000) AND (`funding_total_usd` < 500000000) THEN 4
        WHEN (`funding_total_usd` >= 500000000) AND (`funding_total_usd` <= 1000000000) THEN 5
        WHEN `funding_total_usd` > 1000000000 THEN 6
        ELSE NULL
      END AS `bucket`
    FROM companies
    WHERE (`founded_at` > '2010-01-01') OR `founded_at` IS NULL
  ) t1
  GROUP BY 1, 2
) t0

blaze – Data Blending and Analysis

blaze gives Python users a familiar interface for querying data that lives in diverse data storage systems. Cf. http://blaze.pydata.org/.

In [79]:
import blaze as bz

Simple Example

The first example constructs a blaze.Data object from native Python objects.

In [80]:
t = bz.Data([('Henry', 'boy', 8),
              ('Lilli', 'girl', 14)],
            fields=['name', 'gender', 'age'])
In [81]:
t
Out[81]:
    name gender  age
0  Henry    boy    8
1  Lilli   girl   14
In [82]:
t[t.age > 10]
Out[82]:
    name gender  age
0  Lilli   girl   14
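Aggregations and further column expressions work on such objects as well; two brief, hedged examples:

t.age.mean()  # average age across all rows
t[t.gender == 'girl']['name']  # names of all girls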

Data from NumPy Array

Let us read data from an in-memory NumPy ndarray object.

In [83]:
import numpy as np
In [84]:
a = np.random.standard_normal((1000000, 5))
  # 1mn data rows, 5 columns
In [85]:
df = bz.DataFrame(a, columns=['f0', 'f1', 'f2', 'f3', 'f4'])
  # blaze DataFrame constructor

A look at the data structure.

In [86]:
df.head()
Out[86]:
         f0        f1        f2        f3        f4
0  2.056930  0.029329 -1.001722  1.004306 -0.937600
1 -1.452573 -0.934812  0.729341 -0.325649 -2.133633
2 -0.089087  0.932304  0.543675 -1.040536 -0.685716
3 -0.867179  0.606548  0.211712 -0.010828  0.014817
4  1.679594 -2.270351  1.643111  0.531190  0.602561

The data itself is stored as a NumPy ndarray object.

In [87]:
df.values
Out[87]:
array([[ 2.05693019,  0.02932883, -1.00172196,  1.00430592, -0.93760034],
       [-1.45257295, -0.93481184,  0.72934074, -0.3256488 , -2.13363315],
       [-0.08908673,  0.93230404,  0.54367499, -1.04053639, -0.6857165 ],
       ..., 
       [ 2.18432667, -0.29418322,  0.36483589,  0.27630689, -0.58188977],
       [-1.31239497,  1.1626392 , -0.80855171, -1.04623562, -0.60156394],
       [-1.69589157,  0.40294101, -0.14898031,  0.94228704, -0.51654518]])

Data from CSV File

We first generate a CSV file using the random data from before.

In [88]:
%time df.to_csv(path + 'data.csv', index=False)
CPU times: user 7.73 s, sys: 91 ms, total: 7.82 s
Wall time: 9.46 s

Let us read the data with blaze. Actually, we only generate a view.

In [89]:
%time csv = bz.CSV(path + 'data.csv')
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 41.5 µs
In [90]:
%time t1 = bz.Data(csv)
CPU times: user 6 ms, sys: 0 ns, total: 6 ms
Wall time: 75.8 ms

Now, we can work with the data. Note, however, that iterating, slicing, etc. are not (yet) implemented.

In [91]:
%time t1.count()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 476 µs
Out[91]:
1000000

The backend is a CSV object. And a look at the first rows.

In [92]:
t1.data
Out[92]:
<odo.backends.csv.CSV at 0x7f784e4c6208>
In [93]:
t1
Out[93]:
          f0        f1        f2        f3        f4
0   2.056930  0.029329 -1.001722  1.004306 -0.937600
1  -1.452573 -0.934812  0.729341 -0.325649 -2.133633
2  -0.089087  0.932304  0.543675 -1.040536 -0.685716
3  -0.867179  0.606548  0.211712 -0.010828  0.014817
4   1.679594 -2.270351  1.643111  0.531190  0.602561
5  -1.164137  0.963810  0.277807 -0.365502  0.731875
6  -0.587053 -1.063658 -1.765022 -0.193745 -1.193952
7  -0.122269  0.405408  0.161799 -1.787346 -1.891628
8   0.292853  1.062099  1.130397 -0.481681  0.495915
9  -0.735962 -0.570641 -0.535854  0.029475 -0.612822
10 -0.028088 -1.269165  0.077696  0.139571 -0.914147

Data from SQL

We now generate a SQLite3 table with the dummy data from before.

In [94]:
import sqlite3 as sq3
In [95]:
con = sq3.connect(path + 'data.sql')
try:
    con.execute('DROP TABLE numbers')
    # delete in case it exists
except:
    pass

We write the data into an appropriate table.

In [96]:
con.execute(
    'CREATE TABLE numbers (f0 real, f1 real, f2 real, f3 real, f4 real)'
    )
Out[96]:
<sqlite3.Cursor at 0x7f784e4fb9d0>
In [97]:
%time con.executemany('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', a)
CPU times: user 7.23 s, sys: 93 ms, total: 7.32 s
Wall time: 8.16 s
Out[97]:
<sqlite3.Cursor at 0x7f784e4fb960>
In [98]:
con.commit()
In [99]:
con.close()

Now reading the data with blaze (i.e. just generating a view).

In [100]:
%time t2 = bz.Data('sqlite:///%sdata.sql::numbers' % path)
CPU times: user 7 ms, sys: 0 ns, total: 7 ms
Wall time: 7.03 ms

The SQL backend and the first rows again.

In [101]:
t2
Out[101]:
          f0        f1        f2        f3        f4
0   2.056930  0.029329 -1.001722  1.004306 -0.937600
1  -1.452573 -0.934812  0.729341 -0.325649 -2.133633
2  -0.089087  0.932304  0.543675 -1.040536 -0.685716
3  -0.867179  0.606548  0.211712 -0.010828  0.014817
4   1.679594 -2.270351  1.643111  0.531190  0.602561
5  -1.164137  0.963810  0.277807 -0.365502  0.731875
6  -0.587053 -1.063658 -1.765022 -0.193745 -1.193952
7  -0.122269  0.405408  0.161799 -1.787346 -1.891628
8   0.292853  1.062099  1.130397 -0.481681  0.495915
9  -0.735962 -0.570641 -0.535854  0.029475 -0.612822
10 -0.028088 -1.269165  0.077696  0.139571 -0.914147

Working with the blaze Objects

blaze provides an abstraction logic for computations/queries.

In [102]:
# generic table description -- independent of the target data structure
ts = bz.TableSymbol('ts',
        '{f0: float64, f1: float64, f2: float64, f3: float64, f4: float64}')
# generic expression -- independent of the target data structure
expr = ts[ts['f0'] + ts['f3'] > 2.5]['f1']

The blaze compiler specializes the generic objects to different data structures.

In [103]:
%time np.array(bz.compute(expr, a))  # NumPy ndarray object
CPU times: user 40 ms, sys: 4 ms, total: 44 ms
Wall time: 607 ms
Out[103]:
array([ 0.02932883, -1.23164119,  0.67050742, ...,  1.6657984 ,
       -1.19846161, -1.30715482])
In [104]:
%time np.array(bz.compute(expr, df))  # DataFrame object
CPU times: user 19 ms, sys: 0 ns, total: 19 ms
Wall time: 9.78 ms
Out[104]:
array([ 0.02932883, -1.23164119,  0.67050742, ...,  1.6657984 ,
       -1.19846161, -1.30715482])
In [105]:
%time np.array(bz.compute(expr, csv))  # CSV file representation
CPU times: user 565 ms, sys: 30 ms, total: 595 ms
Wall time: 603 ms
Out[105]:
array([ 0.02932883, -1.23164119,  0.67050742, ...,  1.6657984 ,
       -1.19846161, -1.30715482])

In similar fashion, blaze allows unified expression evaluations for different backends (I).

In [106]:
%time t1[t1['f0'] + t1['f3'] > 2.5]['f1'].head()
  # table representation 1
  # from CSV
CPU times: user 2 ms, sys: 0 ns, total: 2 ms
Wall time: 1.92 ms
Out[106]:
           f1
0    0.029329
18  -1.231641
36   0.670507
39   0.717779
68  -0.224904
111 -1.886033
144 -0.673477
158 -0.350397
161 -1.144460
174 -0.295445

In similar fashion, blaze allows unified expression evaluations for different backends (II).

In [107]:
%time t2[t2['f0'] + t2['f3'] > 2.5]['f1'].head()
  # table representation 2
  # from SQL database
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.57 ms
Out[107]:
         f1
0  0.029329
1 -1.231641
2  0.670507
3  0.717779
4 -0.224904
5 -1.886033
6 -0.673477
7 -0.350397
8 -1.144460
9 -0.295445

Typical aggregation operations work as well.

In [108]:
%time t1.f0.sum()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 433 µs
Out[108]:
-1816.535944971288
In [109]:
%time t2.f3.max()
CPU times: user 1 ms, sys: 0 ns, total: 1 ms
Wall time: 625 µs
Out[109]:
4.538421437830834

Transforming Data Formats

If you work intensively with data sets, it might be beneficial to transform them once into highly performant binary data formats (e.g. bcolz or HDF5).

Using bcolz as Data Store

In [110]:
%time bz.into(path + 'data.bcolz', path + 'data.csv')
  # natively done by blaze
  # cparams=bcolz.cparams(9) could be added
  # no effect here due to random floats
CPU times: user 960 ms, sys: 99 ms, total: 1.06 s
Wall time: 1.06 s
Out[110]:
ctable((1000000,), [('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
  nbytes: 38.15 MB; cbytes: 38.53 MB; ratio: 0.99
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := '/notebooks/fpq/yves/data/data.bcolz'
[ (2.0569301865053324, 0.029328830260023194, -1.0017219637000396, 1.0043059163971206, -0.937600336039536)
 (-1.4525729543596462, -0.9348118352288594, 0.729340739702494, -0.3256487991023368, -2.1336331523344527)
 (-0.08908673382989611, 0.9323040448749188, 0.5436749909190117, -1.0405363851357037, -0.6857164993229661)
 ...,
 (2.1843266726204096, -0.2941832224261781, 0.3648358949245896, 0.2763068869380623, -0.5818897659149579)
 (-1.3123949679575355, 1.1626392014644351, -0.8085517108946355, -1.0462356207393482, -0.6015639357751871)
 (-1.6958915689699, 0.40294100740191385, -0.1489803134025054, 0.9422870416487872, -0.5165451785994732)]

We can now connect to the bcolz disk-based ctable object.

In [111]:
import bcolz as bc
In [112]:
b = bc.ctable(rootdir=path + 'data.bcolz') 

Now the power of bcolz for numerical computations can be brought to bear.

In [113]:
%time nex = b.eval('sqrt(abs(f0)) + log(abs(f1))')
CPU times: user 139 ms, sys: 12 ms, total: 151 ms
Wall time: 50.6 ms
In [114]:
nex
Out[114]:
carray((1000000,), float64)
  nbytes: 7.63 MB; cbytes: 7.45 MB; ratio: 1.02
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[-2.09498409  1.13781733  0.22837772 ...,  0.25439428  1.29629068
  0.39329891]
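The resulting carray is easily materialized for further analysis. A hedged sketch pulling the decompressed values into pandas:

import pandas as pd
s = pd.Series(nex[:])  # nex[:] decompresses into a NumPy ndarray
s.describe()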

Using HDF5

Similarly, we can use PyTables and HDF5 as an efficient binary store.

In [115]:
import pandas as pd
In [116]:
%%time
con = sq3.connect(path + 'data.sql')
pd.HDFStore(path + 'data.h5')['sql'] = pd.read_sql('SELECT * FROM numbers', con)
  # simultaneously reading whole SQL table and writing it to HDF5 store
con.close()
CPU times: user 1.41 s, sys: 142 ms, total: 1.55 s
Wall time: 1.66 s

Now, data can be efficiently retrieved.

In [117]:
%%time
%matplotlib inline
pd.HDFStore(path + 'data.h5')['sql'][::1000].cumsum().plot(figsize=(10, 6))
  # simultaneously reading data from HDF5 store and plotting it
CPU times: user 63 ms, sys: 12 ms, total: 75 ms
Wall time: 156 ms
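Equivalently, the stored table can be pulled back in a single call via pd.read_hdf; a brief, hedged sketch:

%time df_h5 = pd.read_hdf(path + 'data.h5', 'sql')
df_h5.head()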

Cleaning Up

In [118]:
!du -h $path
12K	/notebooks/fpq/yves/data/data.bcolz/f2/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f2/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f2
12K	/notebooks/fpq/yves/data/data.bcolz/f4/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f4/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f4
12K	/notebooks/fpq/yves/data/data.bcolz/f1/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f1/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f1
12K	/notebooks/fpq/yves/data/data.bcolz/f0/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f0/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f0
12K	/notebooks/fpq/yves/data/data.bcolz/f3/meta
9.6M	/notebooks/fpq/yves/data/data.bcolz/f3/data
9.6M	/notebooks/fpq/yves/data/data.bcolz/f3
48M	/notebooks/fpq/yves/data/data.bcolz
282M	/notebooks/fpq/yves/data/
In [119]:
!ls -n $path/dat*.*
-rw-r--r-- 1 3613 8 98156364 Apr 27 12:30 /notebooks/fpq/yves/data//data.csv
-rw-r--r-- 1 3613 8 48007208 Apr 27 12:30 /notebooks/fpq/yves/data//data.h5
-rw-r--r-- 1 3613 8 54446080 Apr 27 12:30 /notebooks/fpq/yves/data//data.sql

/notebooks/fpq/yves/data//data.bcolz:
total 28
-rw-r--r-- 1 3613 8    3 Apr 27 12:30 __attrs__
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f0
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f1
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f2
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f3
drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f4
-rw-r--r-- 1 3613 8   42 Apr 27 12:30 __rootdirs__
In [120]:
# cleaning up
!rm -r $path/dat*.*

Conclusion

High performance (hardware-bound) I/O operations and highly efficient data blending and analytics are among Python's key strengths.

The Python Quants

http://tpq.io | yves@tpq.io | @dyjh | hilpisch.com

Quant Platform | quant-platform.com

Python for Finance | python-for-finance.com

Derivatives Analytics with Python | derivatives-analytics-with-python.com

Python for Finance Trainings | training.tpq.io