In this talk, I cover the following libraries: TsTables, bcolz, ibis, and blaze.
TsTables is a Python library by Andy Fiedler built on top of the popular PyTables HDF5 database library. It is meant to handle large amounts of high-frequency time series data in append-once, retrieve-many-times scenarios (cf. the GitHub page). The focus lies on retrieving chunks of data from large data sets as quickly as possible.
import os
import numpy as np
import pandas as pd
import tables as tb
import tstables as tstb
import random
from time import time
from datetime import datetime
import seaborn as sns; sns.set()
%matplotlib inline
Let us generate a decent number of sample data points.
no = 30000000
co = 3
dt = 1. / (12 * 30 * 24 * 60)
vol = 0.2
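For reference, the simulation below implements the standard log discretization of a geometric Brownian motion with volatility $\sigma$ (vol), time step $\Delta t$ (dt), and starting value 100 — a sketch of the scheme the code cell further down realizes:

$$ S_{t_i} = S_0 \exp\left( \sum_{j=1}^{i} \left( -\tfrac{1}{2}\sigma^2\, \Delta t + \sigma \sqrt{\Delta t}\, z_j \right) \right), \qquad z_j \sim \mathcal{N}(0, 1), \quad S_0 = 100 $$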
We generate the data at one-second intervals.
dr = pd.date_range('2017-1-1', periods=no, freq='1s')
dr
DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 00:00:01', '2017-01-01 00:00:02', '2017-01-01 00:00:03', '2017-01-01 00:00:04', '2017-01-01 00:00:05', '2017-01-01 00:00:06', '2017-01-01 00:00:07', '2017-01-01 00:00:08', '2017-01-01 00:00:09', ... '2017-12-14 05:19:50', '2017-12-14 05:19:51', '2017-12-14 05:19:52', '2017-12-14 05:19:53', '2017-12-14 05:19:54', '2017-12-14 05:19:55', '2017-12-14 05:19:56', '2017-12-14 05:19:57', '2017-12-14 05:19:58', '2017-12-14 05:19:59'], dtype='datetime64[ns]', length=30000000, freq='S')
In-memory generation is quite quick.
%%time
da = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * dt +
                  vol * np.sqrt(dt) * np.random.standard_normal((no, co)), axis=0))
da[0] = 100
CPU times: user 5.76 s, sys: 854 ms, total: 6.62 s Wall time: 6.61 s
df = pd.DataFrame(da, index=dr, columns=['ts1', 'ts2', 'ts3'])
df.count()
ts1 30000000 ts2 30000000 ts3 30000000 dtype: int64
The starting values of the three time series.
df.head()
 | ts1 | ts2 | ts3 |
---|---|---|---|
2017-01-01 00:00:00 | 100.000000 | 100.000000 | 100.000000 |
2017-01-01 00:00:01 | 99.972947 | 100.017864 | 99.962609 |
2017-01-01 00:00:02 | 99.990383 | 99.991254 | 99.951958 |
2017-01-01 00:00:03 | 99.957912 | 100.027486 | 99.976107 |
2017-01-01 00:00:04 | 99.953263 | 100.006583 | 100.010659 |
And a plot of the time series data (every 100000th point).
df[::100000].plot(figsize=(10, 6));
To store the time series data in a PyTables table, we first define the table structure.
class TS(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)  # timestamp column required by TsTables
    ts1 = tb.Float64Col(pos=1)
    ts2 = tb.Float64Col(pos=2)
    ts3 = tb.Float64Col(pos=3)
Second, we open a database file and create the table object.
path = '/notebooks/fpq/yves/data/'
os.environ['path'] = path
h5 = tb.open_file(path + 'ts.h5', 'w')
TsTables adds a new function create_ts to PyTables.
ts = h5.create_ts('/', 'TS', TS)
Third, we append the time series data to the table object.
%time ts.append(df)
CPU times: user 2.49 s, sys: 2.19 s, total: 4.68 s Wall time: 46.7 s
!ls -n $path/ts*
-rw-r--r-- 1 3613 8 935432352 Apr 27 12:26 /notebooks/fpq/yves/data//ts.h5
The approach of TsTables is to apply a highly structured storage hierarchy.
a = str(h5)
print(a[:508])
/notebooks/fpq/yves/data/ts.h5 (File) '' Last modif.: 'Wed Apr 27 12:26:02 2016' Object Tree: / (RootGroup) '' /TS (Group/Timeseries) '' /TS/y2016 (Group) '' /TS/y2017 (Group) '' /TS/y2017/m01 (Group) '' /TS/y2017/m02 (Group) '' /TS/y2017/m03 (Group) '' /TS/y2017/m04 (Group) '' /TS/y2017/m05 (Group) '' /TS/y2017/m06 (Group) '' /TS/y2017/m07 (Group) '' /TS/y2017/m08 (Group) '' /TS/y2017/m09 (Group) '' /TS/y2017/m10 (Group) '' /TS/y2017/m11 (Group) '' /TS/y2017/m12 (Group) '' /TS/y2017/m12/d01 (Group) ''
The strength of TsTables lies in retrieving chunks of time series data defined by a start date and an end date (which obviously is a typical case in finance, e.g. in backtesting strategies or risk management).
read_start_dt = datetime(2017, 2, 1, 0, 0)
read_end_dt = datetime(2017, 2, 2, 0, 0)
TsTables tries to make such an operation as fast as possible.
%time rows = ts.read_range(read_start_dt, read_end_dt)
CPU times: user 11 ms, sys: 4 ms, total: 15 ms Wall time: 262 ms
Let us try it with random intervals.
t0 = time()
its = 100
for _ in range(its):
    day = random.randint(1, 27)
    read_start_dt = datetime(2017, 2, day, 0, 0)
    read_end_dt = datetime(2017, 2, day + 1, 0, 0)
    rows = ts.read_range(read_start_dt, read_end_dt)
t1 = time()
The results are convincing.
print("time for %d random accesses %5.3f seconds" %(its, t1 - t0))
print("average time for random access %5.3f seconds" %((t1 - t0) / its))
time for 100 random accesses 0.870 seconds average time for random access 0.009 seconds
Conveniently, the returned object is a pandas DataFrame.
rows.count()
ts1 86401 ts2 86401 ts3 86401 dtype: int64
rows.head()
 | ts1 | ts2 | ts3 |
---|---|---|---|
2017-02-08 00:00:00 | 41.714617 | 112.266922 | 63.477136 |
2017-02-08 00:00:01 | 41.721620 | 112.260084 | 63.494282 |
2017-02-08 00:00:02 | 41.723495 | 112.327356 | 63.500059 |
2017-02-08 00:00:03 | 41.734147 | 112.326363 | 63.540484 |
2017-02-08 00:00:04 | 41.708918 | 112.336727 | 63.570569 |
A look at a data sub-set.
rows[::500].plot(figsize=(10, 6));
h5.close()
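For completeness, a stored TsTables series can be re-attached in a later session without re-writing anything. A minimal sketch, assuming the _f_get_timeseries() accessor that TsTables attaches to group nodes:

h5 = tb.open_file(path + 'ts.h5', 'r')  # re-open the HDF5 file read-only
ts = h5.root.TS._f_get_timeseries()  # re-attach to the stored time series
ts.read_range(datetime(2017, 3, 5), datetime(2017, 3, 6)).head()  # query as before
h5.close()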
!rm $path/ts.h5
bcolz is a columnar data store for fast data storage and retrieval with built-in high-performance compression. It supports both in-memory and out-of-memory storage and operations. Cf. http://bcolz.blosc.org/.
import bcolz
The first example is based on the ctable class for data in table format. The example data set is 1 GB in size.
N = 100000 * 1000
print(N)
100000000
We first generate an in-memory object using high compression. Since we work with integers, good compression ratios are to be expected. It takes about 45 seconds to generate the ctable object from a generator via the fromiter method.
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                    dtype="i4, i8",
                    count=N,
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 46.4 s, sys: 128 ms, total: 46.5 s Wall time: 45 s
The in-memory size is about 150 MB only, which translates into a compression ratio of 7+.
ct
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')]) nbytes: 1.12 GB; cbytes: 151.84 MB; ratio: 7.54 cparams := cparams(clevel=9, shuffle=True, cname='blosclz') [(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009) (99999998, 9999999600000004) (99999999, 9999999800000001)]
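Columns of a ctable are themselves compressed carray objects and can be pulled out individually; a small sketch, assuming the standard bcolz accessors (todataframe() converts to pandas, memory permitting):

ct['f0']  # a single column as a compressed carray
ct[:3]  # the first records, decompressed into a NumPy structured array
# ct.todataframe()  # full conversion to a pandas DataFrame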
You can now implement fast numerical operations on this data object (note that the output is a carray object).
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.55 s, sys: 442 ms, total: 4.99 s Wall time: 1.17 s
carray((100000000,), float64) nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20 cparams := cparams(clevel=5, shuffle=True, cname='blosclz') [ 0.00000000e+00 2.00000000e+00 6.00000000e+00 ..., 1.37491943e+09 1.57491943e+09 1.77491942e+09]
The same tasks can be implemented with disk-based storage; to this end, one only needs to specify the rootdir parameter. At just under a minute, the generation takes a bit longer on disk.
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                    dtype="i4, i8",
                    count=N, rootdir=path + 'ct',
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 45.9 s, sys: 289 ms, total: 46.2 s Wall time: 49.3 s
Everything else (especially the object handling) remains almost the same however.
ct
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')]) nbytes: 1.12 GB; cbytes: 151.84 MB; ratio: 7.54 cparams := cparams(clevel=9, shuffle=True, cname='blosclz') rootdir := '/notebooks/fpq/yves/data/ct' [(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009) (99999998, 9999999600000004) (99999999, 9999999800000001)]
The numerical operations work in the same fashion and hardly take longer, due to native multithreading and optimized caching.
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.58 s, sys: 514 ms, total: 5.09 s Wall time: 1.19 s
carray((100000000,), float64) nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20 cparams := cparams(clevel=5, shuffle=True, cname='blosclz') [ 0.00000000e+00 2.00000000e+00 6.00000000e+00 ..., 1.37491943e+09 1.57491943e+09 1.77491942e+09]
Let us finally verify system disk usage.
!ls $path/ct
__attrs__ f0 f1 __rootdirs__
!du -hs $path/ct
# system disk usage
156M /notebooks/fpq/yves/data//ct
!rm -r $path/ct
This example is about medium-sized data that does not (in general) fit into memory without compression.
import numpy as np
We generate, as the basis, a NumPy ndarray object of 32 MB in size.
n = 2000
a = np.arange(n * n).reshape(n, n)
a.nbytes
32000000
Let us again work in memory first. Our carray object will contain 4,000 copies of the ndarray object. The in-memory generation of the object takes less than 30 seconds.
%%time
it = 4000
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 27.3 s, sys: 317 ms, total: 27.6 s Wall time: 27.9 s
The carray object stores 120 GB worth of data in less than 1 GB of memory, for a compression ratio of more than 130.
ca
carray((8002000, 2000), int64) nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86 cparams := cparams(clevel=9, shuffle=True, cname='blosclz') [[ 0 1 2 ..., 1997 1998 1999] [ 2000 2001 2002 ..., 3997 3998 3999] [ 4000 4001 4002 ..., 5997 5998 5999] ..., [3994000 3994001 3994002 ..., 3995997 3995998 3995999] [3996000 3996001 3996002 ..., 3997997 3997998 3997999] [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]
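The reported ratio can be double-checked from the object's size attributes, nbytes (uncompressed) and cbytes (compressed):

ca.nbytes / ca.cbytes  # compression ratio, roughly 134 here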
Let us implement the evaluation of a numerical expression on this data set. The syntax and handling are the same as with NumPy ndarray objects.
%time ca[:5000] ** 2 + np.sqrt(ca[10000:15000])
CPU times: user 140 ms, sys: 97 ms, total: 237 ms Wall time: 236 ms
array([[ 0.00000000e+00, 2.00000000e+00, 5.41421356e+00, ..., 3.98805369e+06, 3.99204870e+06, 3.99604571e+06], [ 4.00004472e+06, 4.00404573e+06, 4.00804874e+06, ..., 1.59760722e+07, 1.59840672e+07, 1.59920642e+07], [ 1.60000632e+07, 1.60080643e+07, 1.60160673e+07, ..., 3.59640864e+07, 3.59760814e+07, 3.59880785e+07], ..., [ 3.97603600e+12, 3.97603999e+12, 3.97604398e+12, ..., 3.98400403e+12, 3.98400802e+12, 3.98401201e+12], [ 3.98401600e+12, 3.98401999e+12, 3.98402399e+12, ..., 3.99199201e+12, 3.99199601e+12, 3.99200001e+12], [ 3.99200400e+12, 3.99200800e+12, 3.99201199e+12, ..., 3.99998800e+12, 3.99999200e+12, 3.99999600e+12]])
Another approach is to use the eval function of bcolz.
x = ca[:10000] # 10,000 rows as sub-set
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(clevel=9))
# output carray object compressed
CPU times: user 746 ms, sys: 55 ms, total: 801 ms Wall time: 216 ms
carray((10000, 2000), float64) nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94 cparams := cparams(clevel=9, shuffle=True, cname='blosclz') [[ 0.00000000e+00 2.00000000e+00 5.41421356e+00 ..., 3.98805369e+06 3.99204870e+06 3.99604571e+06] [ 4.00004472e+06 4.00404573e+06 4.00804874e+06 ..., 1.59760722e+07 1.59840672e+07 1.59920642e+07] [ 1.60000632e+07 1.60080643e+07 1.60160673e+07 ..., 3.59640864e+07 3.59760814e+07 3.59880785e+07] ..., [ 1.59520360e+13 1.59520440e+13 1.59520520e+13 ..., 1.59679920e+13 1.59680000e+13 1.59680080e+13] [ 1.59680160e+13 1.59680240e+13 1.59680320e+13 ..., 1.59839800e+13 1.59839880e+13 1.59839960e+13] [ 1.59840040e+13 1.59840120e+13 1.59840200e+13 ..., 1.59999760e+13 1.59999840e+13 1.59999920e+13]]
Now for disk-based storage of multiple versions of the array object: we write the object 4,000 times into a single carray object on disk. It takes only about 1 minute to compress and store 120 GB worth of data on disk.
%%time
it = 4000
ca = bcolz.carray(a, rootdir=path + 'ca',
                  cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 33 s, sys: 4.62 s, total: 37.6 s Wall time: 1min 9s
The compression ratio in this case is again 130+.
ca
carray((8002000, 2000), int64) nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86 cparams := cparams(clevel=9, shuffle=True, cname='blosclz') rootdir := '/notebooks/fpq/yves/data/ca' [[ 0 1 2 ..., 1997 1998 1999] [ 2000 2001 2002 ..., 3997 3998 3999] [ 4000 4001 4002 ..., 5997 5998 5999] ..., [3994000 3994001 3994002 ..., 3995997 3995998 3995999] [3996000 3996001 3996002 ..., 3997997 3997998 3997999] [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]
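Since the object is fully persisted, it could be re-attached in a later session without regenerating anything; a minimal sketch, assuming bcolz.open() for disk-based objects:

cb = bcolz.open(rootdir=path + 'ca')  # re-attach to the on-disk carray
cb.shape  # (8002000, 2000); the data stays compressed on disk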
Simple numerical operations are easy to implement.
%time np.sum(ca[:1000] + ca[4000:5000])
CPU times: user 53 ms, sys: 4 ms, total: 57 ms Wall time: 56.6 ms
3999998000000
Let us try the previous, mathematically more demanding operation – again with a sub-set of the data.
x = ca[:10000] # 10,000 rows as sub-set
First, with an in-memory carray results object.
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9))
CPU times: user 732 ms, sys: 17 ms, total: 749 ms Wall time: 205 ms
carray((10000, 2000), float64) nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94 cparams := cparams(clevel=9, shuffle=True, cname='blosclz') [[ 0.00000000e+00 2.00000000e+00 5.41421356e+00 ..., 3.98805369e+06 3.99204870e+06 3.99604571e+06] [ 4.00004472e+06 4.00404573e+06 4.00804874e+06 ..., 1.59760722e+07 1.59840672e+07 1.59920642e+07] [ 1.60000632e+07 1.60080643e+07 1.60160673e+07 ..., 3.59640864e+07 3.59760814e+07 3.59880785e+07] ..., [ 1.59520360e+13 1.59520440e+13 1.59520520e+13 ..., 1.59679920e+13 1.59680000e+13 1.59680080e+13] [ 1.59680160e+13 1.59680240e+13 1.59680320e+13 ..., 1.59839800e+13 1.59839880e+13 1.59839960e+13] [ 1.59840040e+13 1.59840120e+13 1.59840200e+13 ..., 1.59999760e+13 1.59999840e+13 1.59999920e+13]]
Second, with an on-disk results object. The time difference is not that huge.
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9), rootdir=path + 'out')
CPU times: user 848 ms, sys: 62 ms, total: 910 ms Wall time: 397 ms
carray((10000, 2000), float64) nbytes: 152.59 MB; cbytes: 38.70 MB; ratio: 3.94 cparams := cparams(clevel=9, shuffle=True, cname='blosclz') rootdir := '/notebooks/fpq/yves/data/out' [[ 0.00000000e+00 2.00000000e+00 5.41421356e+00 ..., 3.98805369e+06 3.99204870e+06 3.99604571e+06] [ 4.00004472e+06 4.00404573e+06 4.00804874e+06 ..., 1.59760722e+07 1.59840672e+07 1.59920642e+07] [ 1.60000632e+07 1.60080643e+07 1.60160673e+07 ..., 3.59640864e+07 3.59760814e+07 3.59880785e+07] ..., [ 1.59520360e+13 1.59520440e+13 1.59520520e+13 ..., 1.59679920e+13 1.59680000e+13 1.59680080e+13] [ 1.59680160e+13 1.59680240e+13 1.59680320e+13 ..., 1.59839800e+13 1.59839880e+13 1.59839960e+13] [ 1.59840040e+13 1.59840120e+13 1.59840200e+13 ..., 1.59999760e+13 1.59999840e+13 1.59999920e+13]]
Finally, we verify system disk usage.
# system disk usage data object
!du -hs $path/ca
985M /notebooks/fpq/yves/data//ca
# system disk usage results object
!du -hs $path/out
39M /notebooks/fpq/yves/data//out
!rm -r $path/out
!rm -r $path/ca
ibis is one of the newer projects by Wes McKinney. The content that follows is based on the tutorial found at http://blog.ibis-project.org/sqlite-crunchbase-quickstart/.
First, let us establish a connection to a SQLite3 database.
import ibis
ibis.options.interactive = True
con = ibis.sqlite.connect('crunchbase.db')
Certain operations and lookups are easily accomplished without any SQL (1).
con.list_tables()
['acquisitions', 'companies', 'investments', 'rounds']
Certain operations and lookups are easily accomplished without any SQL (2).
rounds = con.table('rounds')
rounds.info()
Table rows: 87161 Column Type Non-null # ------ ---- ---------- company_permalink string 87161 company_name string 87161 company_category_list string 83464 company_market string 80523 company_country_code string 81291 company_state_code string 59296 company_region string 81016 company_city string 80208 funding_round_permalink string 87161 funding_round_type string 87161 funding_round_code string 23648 funded_at string 87161 funded_month string 87147 funded_quarter string 87147 funded_year int64 87147 raised_amount_usd double 73406
Certain operations and lookups are easily accomplished without any SQL (3).
rounds.funding_round_type.value_counts()
funding_round_type count 0 angel 4602 1 convertible_note 838 2 debt_financing 5869 3 equity_crowdfunding 2401 4 grant 1523 5 non_equity_assistance 69 6 post_ipo_debt 92 7 post_ipo_equity 425 8 private_equity 1864 9 product_crowdfunding 241 10 secondary_market 61 11 seed 22053 12 undisclosed 4128 13 venture 42995
Instead of writing SQL, you create Pythonic expressions with ibis.
acquisitions = con.table('acquisitions')
expr = (acquisitions.price_amount
        .isnull()
        .name('has_price')
        .value_counts())
expr
has_price count 0 0 3816 1 1 51424
Executing expressions returns pandas DataFrame objects by default for table-like results.
df = expr.execute()
df
 | has_price | count |
---|---|---|
0 | 0 | 3816 |
1 | 1 | 51424 |
type(expr)
ibis.expr.types.TableExpr
type(df)
pandas.core.frame.DataFrame
Another case is a single value result.
companies = con.table('companies')
expr = companies.funding_total_usd.mean()
type(expr)
ibis.expr.types.DoubleScalar
expr.execute()
16198405.416388474
Let us have a look at some funding metrics. The time stamping does not seem to be too accurate (note the implausible years in the output).
funded_at = rounds.funded_at.cast('timestamp')
funded_at.year().value_counts()
unnamed count 0 1 3 1 7 1 2 11 1 3 20 2 4 29 1 5 30 1 6 201 1 7 203 1 8 214 1 9 1921 1 10 1960 3 11 1973 1 12 1974 2 13 1979 1 14 1982 3 15 1983 1 16 1984 4 17 1985 6 18 1986 4 19 1987 6 20 1988 2 21 1989 4 22 1990 14 23 1991 10 24 1992 12 25 1993 15 26 1994 15 27 1995 21 28 1996 32 29 1997 30 30 1998 62 31 1999 178 32 2000 267 33 2001 200 34 2002 207 35 2003 271 36 2004 434 37 2005 1762 38 2006 2663 39 2007 3562 40 2008 4257 41 2009 6553 42 2010 8692 43 2011 10550 44 2012 12648 45 2013 16194 46 2014 17670 47 2015 790 48 2104 1 49 2105 1
Another look at different funding stages.
rounds.funding_round_code.value_counts()
funding_round_code count 0 None 63513 1 A 11382 2 B 6548 3 C 3329 4 D 1530 5 E 608 6 F 201 7 G 45 8 H 5
Next, a somewhat more complex analysis ...
year = funded_at.year().name('year')
type(year)
ibis.expr.types.Int32Array
year
0 2015 1 2014 2 2012 3 2010 4 2010 5 2012 6 2011 7 2014 8 2014 9 2013 10 2007 11 2008 12 2014 13 2014 14 2013 15 2013 16 2010 17 2011 18 2014 19 2013 20 2014 21 2010 22 2011 23 2009 24 2011 25 2011 26 2009 27 2011 28 2010 29 2011 ... 9970 2010 9971 2011 9972 2010 9973 2012 9974 2011 9975 2010 9976 2007 9977 2006 9978 2004 9979 2011 9980 2013 9981 2006 9982 2014 9983 2013 9984 2014 9985 2009 9986 2011 9987 2014 9988 2006 9989 2012 9990 2014 9991 2012 9992 2012 9993 2013 9994 2014 9995 2014 9996 2009 9997 2012 9998 2013 9999 2011 Name: tmp, dtype: int64
... selecting venture fundings between 2000 and 2015 and grouping them by year and funding rounds. The syntax/API is similar to pandas.
expr = (rounds[(rounds.funding_round_type == 'venture') &
               year.between(2000, 2015) &
               rounds.funding_round_code.notnull()]
        .group_by([year, 'funding_round_code'])
        .size())
A selection of the results.
results = expr.execute()
type(results)
pandas.core.frame.DataFrame
results[:15]
 | year | funding_round_code | count |
---|---|---|---|
0 | 2000 | A | 79 |
1 | 2000 | B | 32 |
2 | 2000 | C | 10 |
3 | 2000 | D | 8 |
4 | 2000 | E | 2 |
5 | 2001 | A | 50 |
6 | 2001 | B | 30 |
7 | 2001 | C | 17 |
8 | 2001 | D | 4 |
9 | 2001 | E | 1 |
10 | 2002 | A | 35 |
11 | 2002 | B | 39 |
12 | 2002 | C | 25 |
13 | 2002 | D | 5 |
14 | 2002 | E | 2 |
Pivot tables are quickly generated out of such results objects.
pivoted = (results.set_index(['year', 'funding_round_code'])
           .unstack('funding_round_code')
           .fillna(0))
pivoted
year / funding_round_code | A | B | C | D | E | F | G | H |
---|---|---|---|---|---|---|---|---|
2000 | 79 | 32 | 10 | 8 | 2 | 0 | 0 | 0 |
2001 | 50 | 30 | 17 | 4 | 1 | 0 | 0 | 0 |
2002 | 35 | 39 | 25 | 5 | 2 | 2 | 0 | 0 |
2003 | 68 | 45 | 14 | 13 | 1 | 0 | 1 | 0 |
2004 | 146 | 76 | 32 | 15 | 3 | 1 | 0 | 0 |
2005 | 513 | 334 | 176 | 67 | 26 | 6 | 0 | 0 |
2006 | 717 | 465 | 226 | 91 | 35 | 7 | 1 | 0 |
2007 | 956 | 583 | 281 | 110 | 49 | 7 | 1 | 0 |
2008 | 979 | 653 | 308 | 120 | 54 | 17 | 1 | 0 |
2009 | 753 | 531 | 290 | 147 | 55 | 28 | 0 | 0 |
2010 | 1013 | 598 | 369 | 149 | 52 | 18 | 2 | 0 |
2011 | 1250 | 700 | 334 | 175 | 60 | 18 | 5 | 0 |
2012 | 1242 | 610 | 345 | 184 | 69 | 16 | 7 | 0 |
2013 | 1606 | 796 | 377 | 185 | 81 | 38 | 6 | 0 |
2014 | 1757 | 952 | 471 | 223 | 108 | 36 | 18 | 5 |
2015 | 88 | 71 | 34 | 28 | 8 | 5 | 3 | 0 |
Now an analysis with regard to funding sum buckets.
funding_buckets = [0, 1000000, 10000000, 50000000, 100000000,
                   500000000, 1000000000]
bucket = (companies
          .funding_total_usd
          .bucket(funding_buckets, include_over=True))
bucket.value_counts()
unnamed count 0 NaN 12055 1 0 15965 2 1 15754 3 2 7839 4 3 1532 5 4 1022 6 5 88 7 6 37
Adding bucket names makes the results output more meaningful.
bucket_names = ['0 to 1m', '1m to 10m', '10m to 50m',
                '50m to 100m', '100m to 500m',
                '500m to 1b', 'Over 1b']
counts = bucket.name('bucket').value_counts()
labeled = counts.bucket.label(bucket_names)
with_names = counts.mutate(bucket_name=labeled)
with_names
bucket count bucket_name 0 NaN 12055 None 1 0 15965 0 to 1m 2 1 15754 1m to 10m 3 2 7839 10m to 50m 4 3 1532 50m to 100m 5 4 1022 100m to 500m 6 5 88 500m to 1b 7 6 37 Over 1b
Let us add another results column with the total funding received per bucket.
metrics = (companies.group_by(bucket.name('bucket'))
           .aggregate(count=companies.count(),
                      total_funding=companies.funding_total_usd.sum())
           .mutate(bucket_name=lambda x: x.bucket.label(bucket_names)))
metrics
bucket count total_funding bucket_name 0 NaN 12055 NaN None 1 0 15965 4.505177e+09 0 to 1m 2 1 15754 5.712283e+10 1m to 10m 3 2 7839 1.724166e+11 10m to 50m 4 3 1532 1.054132e+11 50m to 100m 5 4 1022 1.826600e+11 100m to 500m 6 5 88 5.804196e+10 500m to 1b 7 6 37 1.040123e+11 Over 1b
Finally, a more complex query using the buckets and differentiating across company statuses.
joined = (companies.mutate(bucket=bucket,
                           status=companies.status.fillna('Unknown'))
          [(companies.founded_at > '2010-01-01') |
           companies.founded_at.isnull()]
          .group_by(['bucket', 'status'])
          .size()
          .mutate(bucket_name=lambda x: (x.bucket.label(bucket_names)
                                          .fillna('Unknown'))))
table = joined.execute()
table.set_index(['status', 'bucket', 'bucket_name'])['count'].unstack('status')
bucket | bucket_name | Unknown | acquired | closed | operating |
---|---|---|---|---|---|
NaN | Unknown | 3208 | 217 | 251 | 5731 |
0 | 0 to 1m | 353 | 227 | 632 | 11195 |
1 | 1m to 10m | 197 | 530 | 369 | 8166 |
2 | 10m to 50m | 62 | 276 | 124 | 2572 |
3 | 50m to 100m | 6 | 47 | 20 | 360 |
4 | 100m to 500m | 16 | 14 | 11 | 255 |
5 | 500m to 1b | NaN | 2 | 1 | 23 |
6 | Over 1b | NaN | 1 | NaN | 9 |
A major benefit of working with ibis is the automatic generation of SQL queries out of Pythonic code/expressions (in the background).
print(ibis.impala.compile(joined))
SELECT *, isnull(CASE `bucket` WHEN 0 THEN '0 to 1m' WHEN 1 THEN '1m to 10m' WHEN 2 THEN '10m to 50m' WHEN 3 THEN '50m to 100m' WHEN 4 THEN '100m to 500m' WHEN 5 THEN '500m to 1b' WHEN 6 THEN 'Over 1b' ELSE NULL END, 'Unknown') AS `bucket_name` FROM ( SELECT `bucket`, `status`, count(*) AS `count` FROM ( SELECT `permalink`, `name`, `homepage_url`, `category_list`, `market`, `funding_total_usd`, isnull(`status`, 'Unknown') AS `status`, `country_code`, `state_code`, `region`, `city`, `funding_rounds`, `founded_at`, `founded_month`, `founded_quarter`, `founded_year`, `first_funding_at`, `last_funding_at`, CASE WHEN (`funding_total_usd` >= 0) AND (`funding_total_usd` < 1000000) THEN 0 WHEN (`funding_total_usd` >= 1000000) AND (`funding_total_usd` < 10000000) THEN 1 WHEN (`funding_total_usd` >= 10000000) AND (`funding_total_usd` < 50000000) THEN 2 WHEN (`funding_total_usd` >= 50000000) AND (`funding_total_usd` < 100000000) THEN 3 WHEN (`funding_total_usd` >= 100000000) AND (`funding_total_usd` < 500000000) THEN 4 WHEN (`funding_total_usd` >= 500000000) AND (`funding_total_usd` <= 1000000000) THEN 5 WHEN `funding_total_usd` > 1000000000 THEN 6 ELSE NULL END AS `bucket` FROM companies WHERE (`founded_at` > '2010-01-01') OR `founded_at` IS NULL ) t1 GROUP BY 1, 2 ) t0
blaze provides Python users with a familiar interface to query data living in diverse data storage systems (cf. http://blaze.pydata.org/).
import blaze as bz
The first example constructs a blaze.Data object from native Python objects.
t = bz.Data([('Henry', 'boy', 8),
             ('Lilli', 'girl', 14)],
            fields=['name', 'gender', 'age'])
t
 | name | gender | age |
---|---|---|---|
0 | Henry | boy | 8 |
1 | Lilli | girl | 14 |
t[t.age > 10]
 | name | gender | age |
---|---|---|---|
0 | Lilli | girl | 14 |
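Reductions work on the interactive object in the same way; a tiny illustrative expression (with interactive mode on, the result is computed and displayed directly):

t.age.mean()  # mean over the 'age' column, 11.0 for this small table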
Let us read data from an in-memory NumPy ndarray object.
import numpy as np
a = np.random.standard_normal((1000000, 5))
# 1mn data rows, 5 columns
df = bz.DataFrame(a, columns=['f0', 'f1', 'f2', 'f3', 'f4'])
# blaze DataFrame constructor
A look at the data structure.
df.head()
 | f0 | f1 | f2 | f3 | f4 |
---|---|---|---|---|---|
0 | 2.056930 | 0.029329 | -1.001722 | 1.004306 | -0.937600 |
1 | -1.452573 | -0.934812 | 0.729341 | -0.325649 | -2.133633 |
2 | -0.089087 | 0.932304 | 0.543675 | -1.040536 | -0.685716 |
3 | -0.867179 | 0.606548 | 0.211712 | -0.010828 | 0.014817 |
4 | 1.679594 | -2.270351 | 1.643111 | 0.531190 | 0.602561 |
The data itself is stored as a NumPy ndarray object.
df.values
array([[ 2.05693019, 0.02932883, -1.00172196, 1.00430592, -0.93760034], [-1.45257295, -0.93481184, 0.72934074, -0.3256488 , -2.13363315], [-0.08908673, 0.93230404, 0.54367499, -1.04053639, -0.6857165 ], ..., [ 2.18432667, -0.29418322, 0.36483589, 0.27630689, -0.58188977], [-1.31239497, 1.1626392 , -0.80855171, -1.04623562, -0.60156394], [-1.69589157, 0.40294101, -0.14898031, 0.94228704, -0.51654518]])
We first generate a CSV file using the random data from before.
%time df.to_csv(path + 'data.csv', index=False)
CPU times: user 7.73 s, sys: 91 ms, total: 7.82 s Wall time: 9.46 s
Let us read the data with blaze. Actually, we only generate a view.
%time csv = bz.CSV(path + 'data.csv')
CPU times: user 0 ns, sys: 0 ns, total: 0 ns Wall time: 41.5 µs
%time t1 = bz.Data(csv)
CPU times: user 6 ms, sys: 0 ns, total: 6 ms Wall time: 75.8 ms
Now, we can work with the data. Note, however, that iterating, slicing, etc. are not (yet) implemented.
%time t1.count()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns Wall time: 476 µs
The backend is a CSV object. Below, a look at the first rows.
t1.data
<odo.backends.csv.CSV at 0x7f784e4c6208>
t1
 | f0 | f1 | f2 | f3 | f4 |
---|---|---|---|---|---|
0 | 2.056930 | 0.029329 | -1.001722 | 1.004306 | -0.937600 |
1 | -1.452573 | -0.934812 | 0.729341 | -0.325649 | -2.133633 |
2 | -0.089087 | 0.932304 | 0.543675 | -1.040536 | -0.685716 |
3 | -0.867179 | 0.606548 | 0.211712 | -0.010828 | 0.014817 |
4 | 1.679594 | -2.270351 | 1.643111 | 0.531190 | 0.602561 |
5 | -1.164137 | 0.963810 | 0.277807 | -0.365502 | 0.731875 |
6 | -0.587053 | -1.063658 | -1.765022 | -0.193745 | -1.193952 |
7 | -0.122269 | 0.405408 | 0.161799 | -1.787346 | -1.891628 |
8 | 0.292853 | 1.062099 | 1.130397 | -0.481681 | 0.495915 |
9 | -0.735962 | -0.570641 | -0.535854 | 0.029475 | -0.612822 |
10 | -0.028088 | -1.269165 | 0.077696 | 0.139571 | -0.914147 |
We now generate a SQLite3 table with the dummy data from before.
import sqlite3 as sq3
con = sq3.connect(path + 'data.sql')
try:
    con.execute('DROP TABLE numbers')
    # delete the table in case it already exists
except:
    pass
We write the data into an appropriate table.
con.execute(
    'CREATE TABLE numbers (f0 real, f1 real, f2 real, f3 real, f4 real)'
)
<sqlite3.Cursor at 0x7f784e4fb9d0>
%time con.executemany('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', a)
CPU times: user 7.23 s, sys: 93 ms, total: 7.32 s Wall time: 8.16 s
<sqlite3.Cursor at 0x7f784e4fb960>
con.commit()
con.close()
Now reading the data with blaze (i.e. just generating a view).
%time t2 = bz.Data('sqlite:///%sdata.sql::numbers' % path)
CPU times: user 7 ms, sys: 0 ns, total: 7 ms Wall time: 7.03 ms
The SQL backend and the first rows again.
t2
 | f0 | f1 | f2 | f3 | f4 |
---|---|---|---|---|---|
0 | 2.056930 | 0.029329 | -1.001722 | 1.004306 | -0.937600 |
1 | -1.452573 | -0.934812 | 0.729341 | -0.325649 | -2.133633 |
2 | -0.089087 | 0.932304 | 0.543675 | -1.040536 | -0.685716 |
3 | -0.867179 | 0.606548 | 0.211712 | -0.010828 | 0.014817 |
4 | 1.679594 | -2.270351 | 1.643111 | 0.531190 | 0.602561 |
5 | -1.164137 | 0.963810 | 0.277807 | -0.365502 | 0.731875 |
6 | -0.587053 | -1.063658 | -1.765022 | -0.193745 | -1.193952 |
7 | -0.122269 | 0.405408 | 0.161799 | -1.787346 | -1.891628 |
8 | 0.292853 | 1.062099 | 1.130397 | -0.481681 | 0.495915 |
9 | -0.735962 | -0.570641 | -0.535854 | 0.029475 | -0.612822 |
10 | -0.028088 | -1.269165 | 0.077696 | 0.139571 | -0.914147 |
blaze provides an abstraction logic for computations/queries.
# generic table description -- independent of the target data structure
ts = bz.TableSymbol('ts',
                    '{f0: float64, f1: float64, f2: float64, f3: float64, f4: float64}')
# generic expression -- independent of the target data structure
expr = ts[ts['f0'] + ts['f3'] > 2.5]['f1']
The blaze compiler specializes the generic objects to different data structures.
%time np.array(bz.compute(expr, a)) # NumPy ndarray object
CPU times: user 40 ms, sys: 4 ms, total: 44 ms Wall time: 607 ms
array([ 0.02932883, -1.23164119, 0.67050742, ..., 1.6657984 , -1.19846161, -1.30715482])
%time np.array(bz.compute(expr, df)) # DataFrame object
CPU times: user 19 ms, sys: 0 ns, total: 19 ms Wall time: 9.78 ms
array([ 0.02932883, -1.23164119, 0.67050742, ..., 1.6657984 , -1.19846161, -1.30715482])
%time np.array(bz.compute(expr, csv)) # CSV file representation
CPU times: user 565 ms, sys: 30 ms, total: 595 ms Wall time: 603 ms
array([ 0.02932883, -1.23164119, 0.67050742, ..., 1.6657984 , -1.19846161, -1.30715482])
In similar fashion, blaze allows unified expression evaluations for different backends (I).
%time t1[t1['f0'] + t1['f3'] > 2.5]['f1'].head()
# table representation 1
# from CSV
CPU times: user 2 ms, sys: 0 ns, total: 2 ms Wall time: 1.92 ms
 | f1 |
---|---|
0 | 0.029329 |
18 | -1.231641 |
36 | 0.670507 |
39 | 0.717779 |
68 | -0.224904 |
111 | -1.886033 |
144 | -0.673477 |
158 | -0.350397 |
161 | -1.144460 |
174 | -0.295445 |
In similar fashion, blaze allows unified expression evaluations for different backends (II).
%time t2[t2['f0'] + t2['f3'] > 2.5]['f1'].head()
# table representation 2
# from SQL database
CPU times: user 4 ms, sys: 0 ns, total: 4 ms Wall time: 3.57 ms
 | f1 |
---|---|
0 | 0.029329 |
1 | -1.231641 |
2 | 0.670507 |
3 | 0.717779 |
4 | -0.224904 |
5 | -1.886033 |
6 | -0.673477 |
7 | -0.350397 |
8 | -1.144460 |
9 | -0.295445 |
Typical aggregation operations work as well.
%time t1.f0.sum()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns Wall time: 433 µs
%time t2.f3.max()
CPU times: user 1 ms, sys: 0 ns, total: 1 ms Wall time: 625 µs
If you work intensively with data sets, it might be beneficial to transform them once into highly performant binary data formats (e.g. bcolz or HDF5).
%time bz.into(path + 'data.bcolz', path + 'data.csv')
# natively done by blaze
# cparams=bcolz.cparams(9) could be added
# no effect here due to random floats
CPU times: user 960 ms, sys: 99 ms, total: 1.06 s Wall time: 1.06 s
ctable((1000000,), [('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')]) nbytes: 38.15 MB; cbytes: 38.53 MB; ratio: 0.99 cparams := cparams(clevel=5, shuffle=True, cname='blosclz') rootdir := '/notebooks/fpq/yves/data/data.bcolz' [ (2.0569301865053324, 0.029328830260023194, -1.0017219637000396, 1.0043059163971206, -0.937600336039536) (-1.4525729543596462, -0.9348118352288594, 0.729340739702494, -0.3256487991023368, -2.1336331523344527) (-0.08908673382989611, 0.9323040448749188, 0.5436749909190117, -1.0405363851357037, -0.6857164993229661) ..., (2.1843266726204096, -0.2941832224261781, 0.3648358949245896, 0.2763068869380623, -0.5818897659149579) (-1.3123949679575355, 1.1626392014644351, -0.8085517108946355, -1.0462356207393482, -0.6015639357751871) (-1.6958915689699, 0.40294100740191385, -0.1489803134025054, 0.9422870416487872, -0.5165451785994732)]
We can now connect to the bcolz disk-based ctable object.
import bcolz as bc
b = bc.ctable(rootdir=path + 'data.bcolz')
Now, the power of bcolz for numerical computations can be played out.
%time nex = b.eval('sqrt(abs(f0)) + log(abs(f1))')
CPU times: user 139 ms, sys: 12 ms, total: 151 ms Wall time: 50.6 ms
nex
carray((1000000,), float64) nbytes: 7.63 MB; cbytes: 7.45 MB; ratio: 1.02 cparams := cparams(clevel=5, shuffle=True, cname='blosclz') [-2.09498409 1.13781733 0.22837772 ..., 0.25439428 1.29629068 0.39329891]
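If a plain NumPy array is needed downstream, the compressed result can simply be sliced out, since slicing a carray decompresses into an ndarray:

res = nex[:]  # decompress the full result into a NumPy ndarray
type(res), res.nbytes  # (numpy.ndarray, 8000000)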
Similarly, we can use PyTables and HDF5 as an efficient binary store.
import pandas as pd
%%time
con = sq3.connect(path + 'data.sql')
pd.HDFStore(path + 'data.h5')['sql'] = pd.read_sql('SELECT * FROM numbers', con)
# simultaneously reading whole SQL table and writing it to HDF5 store
con.close()
CPU times: user 1.41 s, sys: 142 ms, total: 1.55 s Wall time: 1.66 s
Now, data can be efficiently retrieved.
%%time
%matplotlib inline
pd.HDFStore(path + 'data.h5')['sql'][::1000].cumsum().plot(figsize=(10, 6))
# simultaneously reading data from HDF5 store and plotting it
CPU times: user 63 ms, sys: 12 ms, total: 75 ms Wall time: 156 ms
!du -h $path
12K /notebooks/fpq/yves/data/data.bcolz/f2/meta 9.6M /notebooks/fpq/yves/data/data.bcolz/f2/data 9.6M /notebooks/fpq/yves/data/data.bcolz/f2 12K /notebooks/fpq/yves/data/data.bcolz/f4/meta 9.6M /notebooks/fpq/yves/data/data.bcolz/f4/data 9.6M /notebooks/fpq/yves/data/data.bcolz/f4 12K /notebooks/fpq/yves/data/data.bcolz/f1/meta 9.6M /notebooks/fpq/yves/data/data.bcolz/f1/data 9.6M /notebooks/fpq/yves/data/data.bcolz/f1 12K /notebooks/fpq/yves/data/data.bcolz/f0/meta 9.6M /notebooks/fpq/yves/data/data.bcolz/f0/data 9.6M /notebooks/fpq/yves/data/data.bcolz/f0 12K /notebooks/fpq/yves/data/data.bcolz/f3/meta 9.6M /notebooks/fpq/yves/data/data.bcolz/f3/data 9.6M /notebooks/fpq/yves/data/data.bcolz/f3 48M /notebooks/fpq/yves/data/data.bcolz 282M /notebooks/fpq/yves/data/
!ls -n $path/dat*.*
-rw-r--r-- 1 3613 8 98156364 Apr 27 12:30 /notebooks/fpq/yves/data//data.csv -rw-r--r-- 1 3613 8 48007208 Apr 27 12:30 /notebooks/fpq/yves/data//data.h5 -rw-r--r-- 1 3613 8 54446080 Apr 27 12:30 /notebooks/fpq/yves/data//data.sql /notebooks/fpq/yves/data//data.bcolz: total 28 -rw-r--r-- 1 3613 8 3 Apr 27 12:30 __attrs__ drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f0 drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f1 drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f2 drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f3 drwxr-sr-x 4 3613 8 4096 Apr 27 12:30 f4 -rw-r--r-- 1 3613 8 42 Apr 27 12:30 __rootdirs__
# cleaning up
!rm -r $path/dat*.*
High performance (hardware-bound) I/O operations and highly efficient data blending and analytics are among Python's key strengths.
http://tpq.io | yves@tpq.io | @dyjh | hilpisch.com
Quant Platform — quant-platform.com
Python for Finance — python-for-finance.com
Derivatives Analytics with Python — derivatives-analytics-with-python.com
Python for Finance Trainings — training.tpq.io