Dask array

Dask arrays let you work with multi-dimensional arrays.

This code creates a vector of 100,000 elements and reshapes it into a matrix of 200 rows and 500 columns, then wraps it in a Dask array with chunks of 100 x 100.

In [50]:

import numpy as np
import dask.array as da

data = np.arange(100_000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a

Out[50]:

	Array	Chunk
Bytes	781.25 kiB	78.12 kiB
Shape	(200, 500)	(100, 100)
Dask graph	10 chunks in 1 graph layer
Data type	int64 numpy.ndarray

Dask has internally split the matrix into 10 chunks arranged in 2 rows and 5 columns, which gives a 2x5 grid of chunks.
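The figures in the summary table can be reproduced by hand. The sketch below uses only the standard library and assumes 8 bytes per element (int64, as reported above); the variable names are illustrative.

```python
import math

shape = (200, 500)   # array shape used above
chunk = (100, 100)   # requested chunk shape
itemsize = 8         # bytes per int64 element (assumption taken from the table)

# Number of chunks along each axis; ceil division would also cover partial chunks.
grid = tuple(math.ceil(n / c) for n, c in zip(shape, chunk))
n_chunks = math.prod(grid)

# Sizes in KiB, as reported in the "Bytes" row.
chunk_kib = math.prod(chunk) * itemsize / 1024
total_kib = math.prod(shape) * itemsize / 1024

print(grid, n_chunks, chunk_kib, total_kib)
```

The table rounds the chunk size for display, which is why it shows 78.12 kiB.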

In [51]:

a.chunks

Out[51]:

((100, 100), (100, 100, 100, 100, 100))

In [52]:

a.visualize()

Out[52]:

The first chunk (0, 0) contains the first 100 rows and the first 100 columns.

In [53]:

a.blocks[0, 0]

Out[53]:

	Array	Chunk
Bytes	78.12 kiB	78.12 kiB
Shape	(100, 100)	(100, 100)
Dask graph	1 chunks in 2 graph layers
Data type	int64 numpy.ndarray

In [54]:

a.blocks[1, 3]

Out[54]:

	Array	Chunk
Bytes	78.12 kiB	78.12 kiB
Shape	(100, 100)	(100, 100)
Dask graph	1 chunks in 2 graph layers
Data type	int64 numpy.ndarray
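To see which rows and columns a given block covers, the mapping from a block index to array slices can be sketched with plain NumPy. This assumes uniform 100 x 100 chunks as in this example; `block_slices` is a hypothetical helper, not part of Dask.

```python
import numpy as np

data = np.arange(100_000).reshape(200, 500)
chunk = (100, 100)  # uniform chunk shape assumed

def block_slices(i, j, chunk=chunk):
    """Return the row and column slices covered by block (i, j)."""
    rows = slice(i * chunk[0], (i + 1) * chunk[0])
    cols = slice(j * chunk[1], (j + 1) * chunk[1])
    return rows, cols

# Block (1, 3): rows 100..199 and columns 300..399 of the full array.
rows, cols = block_slices(1, 3)
block = data[rows, cols]
print(rows, cols, block.shape)
```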

Indexing and data access

The API is similar to slicing in Python, NumPy and pandas DataFrames.

In [55]:

a[:50, 200]

Out[55]:

	Array	Chunk
Bytes	400 B	400 B
Shape	(50,)	(50,)
Dask graph	1 chunks in 2 graph layers
Data type	int64 numpy.ndarray

In [56]:

a[-1, -1]

Out[56]:

	Array	Chunk
Bytes	8 B	8 B
Shape	()	()
Dask graph	1 chunks in 2 graph layers
Data type	int64 numpy.ndarray

In [57]:

a[-1, -1].compute()

Out[57]:

np.int64(99999)
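For comparison, the same indexing expressions applied directly to the underlying NumPy array return values immediately, with the same shapes as those announced by the Dask summaries. A minimal sketch (`col` and `last` are illustrative names):

```python
import numpy as np

data = np.arange(100_000).reshape(200, 500)

# First 50 rows of column 200: a 1-D array of 50 elements.
col = data[:50, 200]

# Last row, last column: a scalar.
last = data[-1, -1]

print(col.shape, col[:3], last)
```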

Methods and API

Similar to NumPy and pandas DataFrames. Always remember to call compute() to obtain the final result. Method chaining is possible.

In [58]:

m = a.mean()
m

Out[58]:

	Array	Chunk
Bytes	8 B	8 B
Shape	()	()
Dask graph	1 chunks in 5 graph layers
Data type	float64 numpy.ndarray

In [59]:

m.compute()

Out[59]:

np.float64(49999.5)
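A reduction over a chunked array can be thought of as per-chunk partial results that are then combined. The sketch below illustrates that idea with plain NumPy; it is a conceptual illustration only and not a description of Dask's actual task graph.

```python
import numpy as np

data = np.arange(100_000).reshape(200, 500)

# Cut the array into the same 2 x 5 grid of 100 x 100 blocks.
blocks = [data[r:r + 100, c:c + 100]
          for r in range(0, 200, 100)
          for c in range(0, 500, 100)]

# Partial result per block: (sum, number of elements).
partials = [(b.sum(), b.size) for b in blocks]

# Combine the partial results into the global mean.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
print(mean)
```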

In [60]:

m.visualize()

Out[60]:

In [61]:

s = np.sin(a)
s

Out[61]:

	Array	Chunk
Bytes	781.25 kiB	78.12 kiB
Shape	(200, 500)	(100, 100)
Dask graph	10 chunks in 2 graph layers
Data type	float64 numpy.ndarray

In [62]:

s.compute()

Out[62]:

array([[ 0.        ,  0.84147098,  0.90929743, ...,  0.58781939,
         0.99834363,  0.49099533],
       [-0.46777181, -0.9964717 , -0.60902011, ..., -0.89796748,
        -0.85547315, -0.02646075],
       [ 0.82687954,  0.9199906 ,  0.16726654, ...,  0.99951642,
         0.51387502, -0.4442207 ],
       ...,
       [-0.99720859, -0.47596473,  0.48287891, ..., -0.76284376,
         0.13191447,  0.90539115],
       [ 0.84645538,  0.00929244, -0.83641393, ...,  0.37178568,
        -0.5802765 , -0.99883514],
       [-0.49906936,  0.45953849,  0.99564877, ...,  0.10563876,
         0.89383946,  0.86024828]], shape=(200, 500))

In [63]:

s.visualize()

Out[63]:

In [64]:

a.T.compute()

Out[64]:

array([[    0,   500,  1000, ..., 98500, 99000, 99500],
       [    1,   501,  1001, ..., 98501, 99001, 99501],
       [    2,   502,  1002, ..., 98502, 99002, 99502],
       ...,
       [  497,   997,  1497, ..., 98997, 99497, 99997],
       [  498,   998,  1498, ..., 98998, 99498, 99998],
       [  499,   999,  1499, ..., 98999, 99499, 99999]], shape=(500, 200))

In [65]:

a.compute()

Out[65]:

array([[    0,     1,     2, ...,   497,   498,   499],
       [  500,   501,   502, ...,   997,   998,   999],
       [ 1000,  1001,  1002, ...,  1497,  1498,  1499],
       ...,
       [98500, 98501, 98502, ..., 98997, 98998, 98999],
       [99000, 99001, 99002, ..., 99497, 99498, 99499],
       [99500, 99501, 99502, ..., 99997, 99998, 99999]], shape=(200, 500))

Axis: the axis along which the computation moves. The other axes stay fixed.

Axis 0: the first dimension (rows). The computation moves along the rows, so sum(axis=0) computes the sum of each column: the column is fixed and the sum advances row by row.
Axis 1: the second dimension (columns). The computation moves along the columns, so sum(axis=1) gives the sum of each row.
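A small NumPy matrix makes the axis rule easy to check by hand. This is a minimal sketch with illustrative names; the same `axis` argument applies to Dask arrays.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])  # 2 rows, 3 columns

# axis=0: move down the rows, column fixed -> one value per column.
col_sums = m.sum(axis=0)

# axis=1: move across the columns, row fixed -> one value per row.
row_sums = m.sum(axis=1)

print(col_sums, row_sums)
```

The axis passed to the reduction is the one that disappears from the result: a (2, 3) input gives shape (3,) for axis=0 and (2,) for axis=1.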

The following code computes the max of each row (with axis=1 the max is taken over the columns of each row, i.e. the row is fixed and the column moves), reverses the order of the result and adds 10 to each value. The result is therefore a vector of 200 elements.

In [66]:

b = a.max(axis=1)[::-1] + 10
b

Out[66]:

	Array	Chunk
Bytes	1.56 kiB	800 B
Shape	(200,)	(100,)
Dask graph	2 chunks in 6 graph layers
Data type	int64 numpy.ndarray

In [67]:

b.compute()

Out[67]:

array([100009,  99509,  99009,  98509,  98009,  97509,  97009,  96509,
        96009,  95509,  95009,  94509,  94009,  93509,  93009,  92509,
        92009,  91509,  91009,  90509,  90009,  89509,  89009,  88509,
        88009,  87509,  87009,  86509,  86009,  85509,  85009,  84509,
        84009,  83509,  83009,  82509,  82009,  81509,  81009,  80509,
        80009,  79509,  79009,  78509,  78009,  77509,  77009,  76509,
        76009,  75509,  75009,  74509,  74009,  73509,  73009,  72509,
        72009,  71509,  71009,  70509,  70009,  69509,  69009,  68509,
        68009,  67509,  67009,  66509,  66009,  65509,  65009,  64509,
        64009,  63509,  63009,  62509,  62009,  61509,  61009,  60509,
        60009,  59509,  59009,  58509,  58009,  57509,  57009,  56509,
        56009,  55509,  55009,  54509,  54009,  53509,  53009,  52509,
        52009,  51509,  51009,  50509,  50009,  49509,  49009,  48509,
        48009,  47509,  47009,  46509,  46009,  45509,  45009,  44509,
        44009,  43509,  43009,  42509,  42009,  41509,  41009,  40509,
        40009,  39509,  39009,  38509,  38009,  37509,  37009,  36509,
        36009,  35509,  35009,  34509,  34009,  33509,  33009,  32509,
        32009,  31509,  31009,  30509,  30009,  29509,  29009,  28509,
        28009,  27509,  27009,  26509,  26009,  25509,  25009,  24509,
        24009,  23509,  23009,  22509,  22009,  21509,  21009,  20509,
        20009,  19509,  19009,  18509,  18009,  17509,  17009,  16509,
        16009,  15509,  15009,  14509,  14009,  13509,  13009,  12509,
        12009,  11509,  11009,  10509,  10009,   9509,   9009,   8509,
         8009,   7509,   7009,   6509,   6009,   5509,   5009,   4509,
         4009,   3509,   3009,   2509,   2009,   1509,   1009,    509])

In [68]:

b.visualize()

Out[68]:

The intermediate result below helps to understand the computation above.

In [69]:

a.max(axis=1).compute()

Out[69]:

array([  499,   999,  1499,  1999,  2499,  2999,  3499,  3999,  4499,
        4999,  5499,  5999,  6499,  6999,  7499,  7999,  8499,  8999,
        9499,  9999, 10499, 10999, 11499, 11999, 12499, 12999, 13499,
       13999, 14499, 14999, 15499, 15999, 16499, 16999, 17499, 17999,
       18499, 18999, 19499, 19999, 20499, 20999, 21499, 21999, 22499,
       22999, 23499, 23999, 24499, 24999, 25499, 25999, 26499, 26999,
       27499, 27999, 28499, 28999, 29499, 29999, 30499, 30999, 31499,
       31999, 32499, 32999, 33499, 33999, 34499, 34999, 35499, 35999,
       36499, 36999, 37499, 37999, 38499, 38999, 39499, 39999, 40499,
       40999, 41499, 41999, 42499, 42999, 43499, 43999, 44499, 44999,
       45499, 45999, 46499, 46999, 47499, 47999, 48499, 48999, 49499,
       49999, 50499, 50999, 51499, 51999, 52499, 52999, 53499, 53999,
       54499, 54999, 55499, 55999, 56499, 56999, 57499, 57999, 58499,
       58999, 59499, 59999, 60499, 60999, 61499, 61999, 62499, 62999,
       63499, 63999, 64499, 64999, 65499, 65999, 66499, 66999, 67499,
       67999, 68499, 68999, 69499, 69999, 70499, 70999, 71499, 71999,
       72499, 72999, 73499, 73999, 74499, 74999, 75499, 75999, 76499,
       76999, 77499, 77999, 78499, 78999, 79499, 79999, 80499, 80999,
       81499, 81999, 82499, 82999, 83499, 83999, 84499, 84999, 85499,
       85999, 86499, 86999, 87499, 87999, 88499, 88999, 89499, 89999,
       90499, 90999, 91499, 91999, 92499, 92999, 93499, 93999, 94499,
       94999, 95499, 95999, 96499, 96999, 97499, 97999, 98499, 98999,
       99499, 99999])
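The whole chain (row maximum, reversal, addition) can be cross-checked step by step against the underlying NumPy array. A minimal sketch, where `row_max` and `b_np` are illustrative names:

```python
import numpy as np

data = np.arange(100_000).reshape(200, 500)

# Step 1: max of each row; the last column holds the largest value of a row.
row_max = data.max(axis=1)

# Step 2: reverse the order, then add 10 to every element.
b_np = row_max[::-1] + 10

print(b_np.shape, b_np[0], b_np[-1])
```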

A short aside on slices

In [70]:

l = list(range(1, 20))
l[::]

Out[70]:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [71]:

# l[start index:end index (optional, defaults to the last index in the direction of travel):step (optional, defaults to 1)]
l[:]

Out[71]:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [72]:

l[::-1]

Out[72]:

[19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [73]:

l[3:1:-1]

Out[73]:

[4, 3]

In [74]:

l[15:2:-3]

Out[74]:

[16, 13, 10, 7, 4]
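To see which indices a slice actually visits, the built-in `slice.indices` method resolves the start, stop and step for a sequence of a given length. The sketch below applies it to the last example; the stop index is excluded, which is why index 2 is never reached.

```python
l = list(range(1, 20))  # 19 elements, indices 0..18

# Resolve l[15:2:-3] into concrete start, stop and step values.
s = slice(15, 2, -3)
start, stop, step = s.indices(len(l))

# Indices visited by the slice, then the corresponding values.
visited = list(range(start, stop, step))
values = [l[i] for i in visited]
print(visited, values)

# With omitted bounds and a negative step, the defaults run from the last
# element back to (and including) the first one.
full_reverse = slice(None, None, -1).indices(len(l))
print(full_reverse)
```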