Darr API Documentation#

Two types of numeric data structures are supported:

arrays
raggedarrays

Darr is a Python library for storing numeric data arrays in a format that is as open and simple as possible. It also provides easy memory-mapped access to such disk-based data using numpy indexing.

Darr objects can be created from array-like objects, such as numpy arrays and lists, using the asarray function. Alternatively, darr arrays can be created from scratch by the create_array function. Existing Darr data on disk can be accessed through the Array constructor. To remove a Darr array from disk, use delete_array.

Arrays #

Accessing arrays #

class darr.Array(path, accessmode='r')#

Instantiate a Darr array from disk.

A darr array corresponds to a directory containing 1) a binary file with the raw numeric array values, 2) a text file (json format) describing the numeric type, array shape, and other format information, 3) a README text file documenting the data format, including code to read the array data in other languages.
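The layout described above can be imitated with plain NumPy. The following is only a minimal sketch of the idea, not Darr's own implementation: the directory name `demo.darr`, the JSON file name `description.json` and its keys are assumptions made for this illustration. Only the binary file name `arrayvalues.bin` is taken from the read code that Darr generates.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical directory and JSON layout, chosen for illustration only.
root = Path(tempfile.mkdtemp()) / "demo.darr"
root.mkdir()

data = np.arange(6, dtype="<i4").reshape(2, 3)

# 1) raw numeric values in a flat binary file, written in C order
data.tofile(root / "arrayvalues.bin")

# 2) a small JSON text file describing how to interpret the bytes
description = {"dtype": data.dtype.str, "shape": list(data.shape), "order": "C"}
(root / "description.json").write_text(json.dumps(description))

# Reading back needs only these two files: memory-map the binary file
# using the dtype and shape stored in the JSON description.
info = json.loads((root / "description.json").read_text())
values = np.memmap(root / "arrayvalues.bin", dtype=info["dtype"],
                   mode="r", shape=tuple(info["shape"]))
restored = np.array(values)  # copy the mapped data into memory
del values                   # release the memory map
```

Because the binary file holds nothing but the values, any language that can read raw bytes and a JSON file can load the data.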

Parameters:

path (str or pathlib.Path) – Path to disk-based array directory.
accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data. r means read-only, r+ means read-write. There is no w mode. To create new darr arrays, potentially overwriting another one, use the asarray or create_array functions.

property accessmode#: Data access mode of metadata, {‘r’, ‘r+’}.

append(array)#

Append array-like objects to the end of the darr array.

Data will be appended along the first axis. The shape of the data and the darr must be compliant. When appending data repeatedly it is more efficient to use iterappend.

Parameters:: array (array-like object) – This can be a numpy array or a sequence that can be converted into a numpy array.
Return type:: None

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(4,2), overwrite=True)
>>> d.append([[1,2],[3,4],[5,6]])
>>> print(d)
[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 1.  2.]
 [ 3.  4.]
 [ 5.  6.]]
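What "compliant" means is not spelled out here. A reasonable reading, and an assumption of this sketch, is that the appended data must have the same number of dimensions and the same lengths on all axes except the first. The helper names below are hypothetical and not part of Darr.

```python
def can_append(existing_shape, new_shape):
    """Return True if data of new_shape can be appended along axis 0."""
    # Same number of dimensions, and all axes after the first must match.
    return (len(existing_shape) == len(new_shape)
            and tuple(existing_shape[1:]) == tuple(new_shape[1:]))


def appended_shape(existing_shape, new_shape):
    """Return the shape that results from appending along the first axis."""
    if not can_append(existing_shape, new_shape):
        raise ValueError(f"shape {new_shape} cannot be appended to {existing_shape}")
    return (existing_shape[0] + new_shape[0],) + tuple(existing_shape[1:])


# Mirrors the example above: 3 rows of length 2 appended to a (4, 2) array.
result_shape = appended_shape((4, 2), (3, 2))
```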

archive(filepath=None, compressiontype='xz', overwrite=False)#

Archive array data into a single compressed file.

Parameters:

filepath (str) – Name of the archive. If None, it will be derived from the data’s path name.
compressiontype (str) – One of ‘xz’, ‘gz’, or ‘bz2’, corresponding to the lzma, gzip, and bz2 compression algorithms supported by the Python standard library.
overwrite ((True, False), optional) – Overwrites existing archive if it exists. Default is False.

Returns:

The path of the created archive

Return type:

pathlib.Path

Notes

See the tarfile library for more info on archiving formats
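For orientation, this is roughly what archiving a data directory with the standard-library tarfile module looks like. It is a sketch under assumptions, not Darr's implementation: the function name `archive_directory` and the way the archive name is derived from the directory name are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_directory(dirpath, compressiontype="xz"):
    """Pack a directory into a single compressed tar file next to it."""
    dirpath = Path(dirpath)
    if compressiontype not in ("xz", "gz", "bz2"):
        raise ValueError(f"unsupported compression type: {compressiontype}")
    # Assumed naming scheme: <directory name>.tar.<compressiontype>
    archivepath = dirpath.parent / f"{dirpath.name}.tar.{compressiontype}"
    with tarfile.open(archivepath, f"w:{compressiontype}") as tf:
        tf.add(dirpath, arcname=dirpath.name)
    return archivepath


# Build a tiny stand-in data directory and archive it.
workdir = Path(tempfile.mkdtemp())
datadir = workdir / "demo.darr"
datadir.mkdir()
(datadir / "arrayvalues.bin").write_bytes(b"\x00" * 16)

archive = archive_directory(datadir)
with tarfile.open(archive) as tf:
    names = tf.getnames()
```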

copy(path, dtype=None, chunklen=None, accessmode='r', overwrite=False)#

Copy darr to a different path, potentially changing its dtype.

The copying is performed in chunks to avoid RAM memory overflow for very large darr arrays.

Parameters:

path (str or pathlib.Path) – File system path to which the copy will be saved.
dtype (<dtype, None>) – Numpy data type of the copy. Default is None, which corresponds to the dtype of the darr to be copied.
chunklen (<int, None>) – The length of chunks (along first axis) that are written during creation. If None, it is chosen so that chunks are 10 Mb in total size.
accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data of the returned Darr object. r means read-only, r+ means read-write.
overwrite ((True, False), optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns:

copy of the darr array

Return type:

Array
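The idea of copying in chunks can be sketched with NumPy alone. This is an illustration of the technique, not Darr's code: the function `copy_in_chunks` and the file names are hypothetical, and the sketch works on a bare binary file rather than on a full Darr directory.

```python
import tempfile
from pathlib import Path

import numpy as np


def copy_in_chunks(src, dst, shape, src_dtype, dst_dtype, chunklen):
    """Copy a raw binary array file chunk by chunk, casting to dst_dtype."""
    source = np.memmap(src, dtype=src_dtype, mode="r", shape=shape)
    with open(dst, "wb") as f:
        # Only chunklen rows along the first axis are in memory at a time.
        for start in range(0, shape[0], chunklen):
            chunk = source[start:start + chunklen]
            f.write(chunk.astype(dst_dtype).tobytes())
    del source  # release the memory map


tmp = Path(tempfile.mkdtemp())
original = np.arange(10, dtype="float64").reshape(5, 2)
original.tofile(tmp / "src.bin")

copy_in_chunks(tmp / "src.bin", tmp / "dst.bin", shape=(5, 2),
               src_dtype="float64", dst_dtype="int32", chunklen=2)
copied = np.fromfile(tmp / "dst.bin", dtype="int32").reshape(5, 2)
```

A smaller chunklen lowers peak memory use at the cost of more write calls.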

property datadir#: Data directory object with many useful methods, such as writing information to text or json files, archiving all data, calculating checksums etc.

property dtype#: Numpy data type of the array values.

property itemsize#: The size in bytes of each item in the array.

iterappend(arrayiterable)#

Iteratively append data from a data iterable.

The iterable has to yield chunks of data that are array-like objects compliant with Darr arrays.

Parameters:: arrayiterable (an iterable that yields array-like objects) –
Return type:: None

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(3,2), overwrite=True)
>>> def ga():
...     yield [[1,2],[3,4]]
...     yield [[5,6],[7,8],[9,10]]
>>> d.iterappend(ga())
>>> print(d)
[[  0.   0.]
 [  0.   0.]
 [  0.   0.]
 [  1.   2.]
 [  3.   4.]
 [  5.   6.]
 [  7.   8.]
 [  9.  10.]]

iterchunks(chunklen, stepsize=None, startindex=None, endindex=None, include_remainder=True, accessmode=None)#

Iterate over the array, yielding chunks of a given length and with a given stepsize.

This method keeps the underlying data file open during iteration, and is therefore relatively fast.

Parameters:

chunklen (int) – Size of chunk along the first axis. Note that the last chunk may be smaller than chunklen, depending on the size of the first axis.
stepsize (<int, None>) – Size of the shift per iteration across the first axis. Default is None, which means that stepsize equals chunklen.
startindex (<int, None>) – Start index value. Default is None, which means to start at the beginning.
endindex (<int, None>) – End index value. Default is None, which means to end at the end.
include_remainder (<True, False>) – Determines whether the remainder at the end of the array, if it exists, should be yielded or not. The remainder is smaller than chunklen. Default is True.
accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data. r means read-only, r+ means read-write.

Returns:

a generator that produces numpy array chunks.

Return type:

generator

Examples

>>> import darr as da
>>> fillfunc = lambda i: i # fill with index number
>>> d1 = da.create_array('test1.da', shape=(12,), fillfunc=fillfunc)
>>> print(d1)
[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.]
>>> d2 = da.asarray('test2.da', d1.iterchunks(chunklen=2, stepsize=3))
>>> print(d2)
[  0.   1.   3.   4.   6.   7.   9.  10.]

iterindices(chunklen, stepsize=None, startindex=None, endindex=None, include_remainder=True)#

Generate indices of chunks of a given length and with a given stepsize.

This method keeps the underlying data file open during iteration, and is therefore relatively fast.

Parameters:

chunklen (int) – Size of chunk along the first axis. Note that the last chunk may be smaller than chunklen, depending on the size of the first axis.
stepsize (<int, None>) – Size of the shift per iteration across the first axis. Default is None, which means that stepsize equals chunklen.
startindex (<int, None>) – Start index value. Default is None, which means to start at the beginning.
endindex (<int, None>) – End index value. Default is None, which means to end at the end.
include_remainder (<True, False>) – Determines whether the indices of the remainder at the end of the array, if it exists, should be yielded or not. The remainder is smaller than chunklen. Default is True.

Returns:

a generator that produces (start, end) index tuples of the chunks.

Return type:

generator

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(12,), accessmode='r+', overwrite=True)
>>> for start, end in d.iterindices(chunklen=2, stepsize=3):
...     d[start:end] = 1
>>> print(d)
[ 1.  1.  0.  1.  1.  0.  1.  1.  0.  1.  1.  0.]
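The chunking rules described by the parameters can be written down in a few lines of plain Python. The generator below is a hypothetical sketch of those rules, not Darr's implementation. In particular, it assumes that iteration stops after the first chunk that does not fit completely.

```python
def chunk_indices(length, chunklen, stepsize=None, startindex=None,
                  endindex=None, include_remainder=True):
    """Yield (start, end) pairs for chunks along an axis of a given length."""
    if stepsize is None:
        stepsize = chunklen          # consecutive, non-overlapping chunks
    start = 0 if startindex is None else startindex
    end = length if endindex is None else endindex
    while start < end:
        stop = start + chunklen
        if stop > end:
            # What is left is shorter than chunklen: the remainder.
            if include_remainder:
                yield start, end
            return
        yield start, stop
        start += stepsize


# Same settings as the example above: 12 values, chunks of 2, step of 3.
pairs = list(chunk_indices(12, chunklen=2, stepsize=3))

# Setting those slices to 1 in a list of zeros reproduces the printed pattern.
pattern = [0] * 12
for s, e in pairs:
    pattern[s:e] = [1] * (e - s)
```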

property mb#: Array size in megabytes, excluding metadata.

property metadata#: Dictionary-like interface to metadata.

property nbytes#: Array size in bytes, excluding metadata.

property ndim#: Number of dimensions

open_array(accessmode=None)#

Open the array for efficient multiple read or write operations.

Although read and write operations can be performed conveniently using indexing notation on the Darr object, this can be relatively slow when performing many access operations in succession. For each operation, the disk file needs to be opened, the data copied into memory, and the file closed again. In such cases, it is much faster to first open the disk-based data.

Parameters:: accessmode ({'r', 'r+'}, default 'r') – File access mode of the disk array data. r means read-only, r+ means read-write.
Yields:: None

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(1000,3), overwrite=True)
>>> with d.open_array(accessmode='r+'):
...     s1 = d[:10,1:].sum()
...     s2 = d[20:25,:2].sum()
...     d[500:] = 3.33

property path#: File system path to array data

readcode(language, abspath=False, basepath=None)#

Generate code to read the array in a different language.

Note that this does not include reading the metadata, which is just based on a text file in JSON format.

Parameters:

language (str) – One of the languages that are supported. Choose from: ‘darr’, ‘idl’, ‘julia_ver0’, ‘julia_ver1’, ‘mathematica’, ‘matlab’, ‘maple’, ‘numpy’, ‘numpymemmap’, ‘R’.
abspath (bool) – Should the paths to the data files be absolute or not? Default: False.
basepath (str or pathlib.Path or None) – Path relative to which the binary array data file should be provided. Default: None.

Example

>>> import darr
>>> a = darr.asarray('test.darr', [[1,2,3],[4,5,6]])
>>> print(a.readcode('matlab'))
fileid = fopen('arrayvalues.bin');
a = fread(fileid, [3, 2], '*int32', 'ieee-le');
fclose(fileid);
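Note that the MATLAB code reads the data with size [3, 2], although the array was saved with shape (2, 3). MATLAB's fread fills arrays in column-major order, whereas the values on disk are in 'C' (row-major) order. The NumPy sketch below, which is an illustration and not code generated by Darr, shows that a column-major reader given the reversed shape sees the transpose of the original array.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype="<i4")

# The bytes as they would sit in the binary file: row after row (C order).
raw = a.tobytes(order="C")
flat = np.frombuffer(raw, dtype="<i4")

# A row-major reader uses the original shape.
as_c = flat.reshape((2, 3))

# A column-major reader (as in MATLAB or Fortran) uses the reversed shape
# and ends up with the transpose of the original array.
as_f = flat.reshape((3, 2), order="F")
```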

property readcodelanguages#: Tuple of the languages that the readcode method can produce reading code for. Code in these languages is also included in the README.txt file that is stored as part of the array.

property shape#: Tuple with sizes of each axis of the data array.

property size#: Total number of values in the data array.

Creating arrays #

darr.asarray(path, array, dtype=None, accessmode='r', metadata=None, chunklen=None, overwrite=False)#

Save an array or array generator as a Darr array to file system path.

Data is always written in ‘C’ order to disk, independent of the order of array.

Parameters:

path (str or pathlib.Path) – File system path to which the array will be saved. Note that this will be a directory containing multiple files.
array (array-like object or generator yielding array-like objects) – This can be a numpy array, a sequence that can be converted into a numpy array, or a generator that yields such objects. The latter will be concatenated along the first dimension.
dtype (numpy dtype, optional) – Is inferred from the data if None. If dtype is provided the data will be cast to dtype. Default is None.
accessmode ({r, r+}, optional) – File access mode of the darr that is returned. r means read-only, r+ means read-write. In the latter case, data can be changed. Default r.
metadata ({None, dict}) – Dictionary with metadata to be saved in a separate JSON file. Default is None. If so, and the array has a ‘metadata’ attribute, Darr will try to use it as metadata of the output array.
chunklen (<int, None>) – The length of chunks (along first axis) that are read and written during the process. If None and the array is a numpy array or darr, it is chosen so that chunks are 10 Mb in total size. If None and array is a generator or sequence, chunklen will be 1.
overwrite ((True, False), optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns:

A Darr array instance.

Return type:

Array

Deleting arrays #

darr.delete_array(da)#

Delete Darr array data from disk.

Parameters:: da (Array or str or pathlib.Path) – The darr object to be deleted or file system path to it.

Truncating arrays #

darr.truncate_array(a, index)#

Truncate darr data.

Parameters:

a (array or str or pathlib.Path) – The darr object to be truncated or file system path to it.
index (int) – The index along the first axis at which the darr should be truncated. Negative indices can be used but the resulting length of the truncated darr should be larger than 0 and smaller than the current length.

Examples

>>> import darr as da
>>> fillfunc = lambda i: i
>>> a = da.create_array('testarray.da', shape=(5,2), fillfunc=fillfunc)
>>> a
darr([[ 0.,  0.],
           [ 1.,  1.],
           [ 2.,  2.],
           [ 3.,  3.],
           [ 4.,  4.]]) (r+)
>>> da.truncate_array(a, 3)
>>> a
darr([[ 0.,  0.],
           [ 1.,  1.],
           [ 2.,  2.]]) (r+)
>>> da.truncate_array(a, -1)
>>> a
darr([[ 0.,  0.],
           [ 1.,  1.]]) (r+)
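Because the values are stored as a flat binary file in row-major order, truncating along the first axis amounts to shortening that file. The standard-library sketch below shows only this file-level step. The function `truncate_rows` is hypothetical, and a real implementation would also have to update the stored shape information.

```python
import os
import tempfile
from array import array


def truncate_rows(path, nrows, rowlen, itemsize, index):
    """Shorten a raw row-major file so that only the first `index` rows remain."""
    if index < 0:
        index += nrows               # negative indices count from the end
    if not 0 < index < nrows:
        raise IndexError("resulting length must be > 0 and < current length")
    os.truncate(path, index * rowlen * itemsize)
    return index


# Five rows of two float64 values, as in the example above.
fd, path = tempfile.mkstemp(suffix=".bin")
values = array("d", [0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
with os.fdopen(fd, "wb") as f:
    values.tofile(f)

nrows = truncate_rows(path, nrows=5, rowlen=2, itemsize=values.itemsize, index=3)
nrows = truncate_rows(path, nrows=nrows, rowlen=2, itemsize=values.itemsize, index=-1)

remaining = array("d")
with open(path, "rb") as f:
    remaining.frombytes(f.read())
```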

Ragged Arrays #

Accessing ragged arrays #

class darr.RaggedArray(path, accessmode='r')#

Instantiate a Darr ragged array from disk.

A ragged array is a sequence of subarrays that may be multidimensional, with the restriction that their dimensional shape is the same except for their first axis. In the simplest case it is a sequence of variable-length one-dimensional subarrays, e.g.:

[[1,2],
 [3,4,5],
 [6],
 [7,8,9,10]]

The subarrays can also be multidimensional, e.g.:

[[[1,2],[3,4]],
 [[5,6],[7,8],[9,10]],
 [[11,12]],
 [[13,14],[15,16]]]

In the latter case they are two-dimensional, and the length of their second axis is fixed: length 2. The atom shape of the array is said to be (2,). If subarrays were four-dimensional their atom shape could be, e.g., (2,7,5), and their shape (N,2,7,5), where N is an integer. In Darr N can also be zero.

Ragged arrays are often used for time series data that has been collected in multiple episodes of varying duration, although other use cases exist.

On disk, a Darr ragged array corresponds to a directory containing 1) a Darr array called ‘values’ in which all subarrays have been concatenated along their first dimension, 2) a Darr array called ‘indices’, which is two-dimensional and holds the indices to obtain the subarrays from the values array, 3) a text file (JSON format) describing the numeric type, array size and length, and other format information, and 4) a README text file documenting the data format, including code to read the array data in other programming languages.

A RaggedArray can be indexed with an integer to get the subarrays as NumPy arrays.
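The values/indices scheme can be illustrated with plain Python lists. The sketch below is not Darr code and its function names are hypothetical. It assumes that each row of the indices array holds a start index and an exclusive end index into the values array, which is consistent with the MATLAB read code that Darr generates for ragged arrays.

```python
def to_values_and_indices(subarrays):
    """Concatenate subarrays and record where each one starts and ends."""
    values, indices = [], []
    for sub in subarrays:
        start = len(values)
        values.extend(sub)                     # concatenate along the first axis
        indices.append((start, len(values)))   # end index is exclusive
    return values, indices


def get_subarray(values, indices, i):
    """Retrieve the i-th subarray from the concatenated values."""
    start, end = indices[i]
    return values[start:end]


values, indices = to_values_and_indices([[1, 2], [3, 4, 5], [6], [7, 8, 9, 10]])
second = get_subarray(values, indices, 1)
```

The same bookkeeping works for multidimensional subarrays, since only positions along the first axis are recorded.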

Parameters:

path (str or pathlib.Path) – Path to disk-based array directory.
accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data. r means read-only, r+ means read-write. There is no w mode. To create new darr ragged arrays, potentially overwriting another one, use the asraggedarray or create_raggedarray functions.

Examples

>>> import darr
>>> ra = darr.asraggedarray('test.darr', [[1,2],[1,2,3,4],[5,6,7]], dtype='int32')
>>> ra
RaggedArray (3 subarrays with atom shape (), r+)
>>> del ra
>>> ra = darr.RaggedArray('test.darr')
>>> ra[1] # will return a NumPy array
array([1, 2, 3, 4], dtype=int32)

property accessmode#: Data access mode of metadata, {‘r’, ‘r+’}.

append(array)#

Append array-like objects to the ragged array.

The shape of the data and the darr must be compliant. The length of its first axis may vary, but if there are more axes, these should have the same lengths as those of all other subarrays (which is the atom of the ragged array). When appending data repeatedly it is more efficient to use iterappend.

Parameters:: array (array-like object) – This can be a numpy array or a sequence that can be converted into a numpy array.
Return type:: None

archive(filepath=None, compressiontype='xz', overwrite=False)#

Archive ragged array data into a single compressed file.

Parameters:

filepath (str) – Name of the archive. If None, it will be derived from the data’s path name.
compressiontype (str) – One of ‘xz’, ‘gz’, or ‘bz2’, corresponding to the lzma, gzip, and bz2 compression algorithms supported by the Python standard library.
overwrite ((True, False), optional) – Overwrites existing archive if it exists. Default is False.

Returns:

The path of the created archive

Return type:

pathlib.Path

Notes

See the tarfile library for more info on archiving formats

property atom#: Dimensions of the non-variable axes of the arrays.

copy(path, dtype=None, accessmode='r', overwrite=False)#

Copy darr to a different path, potentially changing its dtype.

The copying is performed in chunks to avoid RAM memory overflow for very large darr arrays.

Parameters:

path (str or pathlib.Path) – File system path to which the copy will be saved.
dtype (<dtype, None>) – Numpy data type of the copy. Default is None, which corresponds to the dtype of the darr to be copied.
accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data of the returned Darr object. r means read-only, r+ means read-write.
overwrite ((True, False), optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns:

copy of the darr ragged array

Return type:

RaggedArray

property datadir#: Data directory object with many useful methods, such as writing information to text or json files, archiving all data, calculating checksums etc.

property dtype#: Numpy data type of the array values.

iter_arrays(startindex=0, endindex=None, stepsize=1, accessmode=None)#

Iterate over ragged array yielding subarrays.

Parameters:

startindex (int) – Index of the first subarray to yield. Default is 0, which means to start at the beginning.
endindex (<int, None>) – End index value. Default is None, which means to end at the end.
stepsize (int) – Step between the indices of successively yielded subarrays. Default is 1, which means that every subarray is yielded.

iterappend(arrayiterable)#

Iteratively append data from a data iterable.

The iterable has to yield array-like objects compliant with darr. The length of first dimension of these objects may be different, but the length of other dimensions, if any, has to be the same.

Parameters:: arrayiterable (an iterable that yields array-like objects) –
Return type:: None

property mb#: Storage size in megabytes of the ragged array.

property metadata#: Dictionary of meta data.

property narrays#: Number of subarrays in the RaggedArray.

property path#: File system path to array data

readcode(language, abspath=False, basepath=None)#

Generate code to read the array in a different language.

Note that this does not include reading the metadata, which is just based on a text file in JSON format.

Parameters:

language (str) – One of the languages that are supported. Choose from: ‘matlab’, ‘numpymemmap’, ‘R’.
abspath (bool) – Should the paths to the data files be absolute or not? Default: False.
basepath (str or pathlib.Path or None) – Path relative to which the binary array data file should be provided. Default: None.

Example

>>> import darr
>>> a = darr.asraggedarray('test.darr', [[1],[2,3],[4,5,6],[7,8,9,10]], overwrite=True)
>>> print(a.readcode('matlab'))
fileid = fopen('indices/arrayvalues.bin');
i = fread(fileid, [2, 4], '*int64', 'ieee-le');
fclose(fileid);
fileid = fopen('values/arrayvalues.bin');
v = fread(fileid, 10, '*int32', 'ieee-le');
fclose(fileid);
% example to read third subarray
startindex = i(1,3) + 1;  % matlab starts counting from 1
endindex = i(2,3);  % matlab has inclusive end index
a = v(startindex:endindex);

property readcodelanguages#: Tuple of the languages that the readcode method can produce reading code for. Code in these languages is also included in the README.txt file that is stored as part of the array.

property size#: Total number of values in the ragged array.

Creating ragged arrays #

darr.asraggedarray(path, arrayiterable, dtype=None, metadata=None, accessmode='r+', indextype='int64', overwrite=False)#

Save an iterable of array-like objects as a Darr ragged array to file system path.

Parameters:

path (str or pathlib.Path) – Path to disk-based array directory.
arrayiterable (iterator yielding array-like objects) – This can be a numpy array, a sequence that can be converted into a numpy array, or an iterator that yields such objects. The latter will be concatenated along the first dimension.
dtype (dtype, optional) – The type of the darr. Default is None, in which case it is inferred from the data.
metadata ({None, dict}) – Dictionary with metadata to be saved in a separate JSON file. Default is None.
accessmode ({'r', 'r+'}, default 'r+') – File access mode of the darr data. r means read-only, r+ means read-write. There is no w mode. To create new darr arrays, potentially overwriting another one, use the asarray or create_array functions.
indextype (dtype, optional) – The dtype of the index array underlying the disk-based ragged array format. Defaults to ‘int64’. Other possibilities are ‘int8’, ‘uint8’, ‘int16’, ‘uint16’, ‘int32’, and ‘uint32’. This determines the maximum length of the ragged array, but also how compatible the array is with other languages.
overwrite (<True, False>, optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns:

A Darr RaggedArray instance.

Return type:

RaggedArray

darr.create_raggedarray(path, atom=(), dtype='float64', metadata=None, accessmode='r+', indextype='int64', overwrite=False)#

Creates an empty RaggedArray.

Parameters:

path (str or pathlib.Path) – Path to disk-based array directory.
atom (tuple) – shape of the subarray dimensions, except the first dimension. Default is (), meaning that the ragged array consists of one-dimensional subarrays. If subarrays, e.g., have a second and third dimension of length 4 and 7, the atom would be (4,7).
dtype (dtype, optional) – The type of the darr. Default is ‘float64’
metadata ({None, dict}) – Dictionary with metadata to be saved in a separate JSON file. Default None
accessmode ({'r', 'r+'}, default 'r+') – File access mode of the darr data. r means read-only, r+ means read-write. There is no w mode. To create new darr arrays, potentially overwriting another one, use the asarray or create_array functions.
indextype (dtype, optional) – The dtype of the index array underlying the disk-based ragged array format. Defaults to ‘int64’. Other possibilities are ‘int8’, ‘uint8’, ‘int16’, ‘uint16’, ‘int32’, and ‘uint32’. This determines the maximum length of the ragged array, but also how compatible the array is with other languages.
overwrite (<True, False>, optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns:

A Darr RaggedArray instance.

Return type:

RaggedArray
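To get a feeling for the trade-off behind indextype, the largest value each integer type can hold can be looked up with NumPy. How exactly Darr uses this limit is not documented here. The assumption in this sketch is that the indices array stores positions in the concatenated values array, so that the largest representable index bounds the total length of that array.

```python
import numpy as np

index_types = ("int8", "uint8", "int16", "uint16", "int32", "uint32", "int64")

# Largest index that each candidate index type can represent.
max_index = {name: int(np.iinfo(np.dtype(name)).max) for name in index_types}

for name, limit in max_index.items():
    print(f"{name:>6}: {limit}")
```

Smaller types save disk space for the index and are more widely supported in other languages, at the cost of a lower maximum length.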

Deleting ragged arrays #

darr.delete_raggedarray(ra)#

Delete Darr ragged array data from disk.

Parameters:: ra (RaggedArray or str or pathlib.Path) – The darr ragged array to be deleted or file system path to it.

Truncating ragged arrays #

darr.truncate_raggedarray(ra, index)#

Truncate darr ragged array.

Parameters:

ra (array or str or pathlib.Path) – The darr object to be truncated or file system path to it.
index (int) – The index along the first axis at which the darr ragged array should be truncated. Negative indices can be used but the resulting length of the truncated darr should be 0 or larger and smaller than the current length.
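In terms of the values/indices layout of ragged arrays, truncation means dropping rows from the indices and cutting the values at the end of the last subarray that is kept. The following plain-Python sketch illustrates this. The function `truncate_ragged` is hypothetical and operates on lists, not on Darr objects, and it assumes (start, exclusive end) index pairs.

```python
def truncate_ragged(values, indices, index):
    """Keep only the first `index` subarrays of a values/indices pair."""
    n = len(indices)
    if index < 0:
        index += n                   # negative indices count from the end
    if not 0 <= index < n:
        raise IndexError("resulting length must be >= 0 and < current length")
    kept_indices = indices[:index]
    # Values are cut where the last remaining subarray ends.
    end = kept_indices[-1][1] if kept_indices else 0
    return values[:end], kept_indices


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
indices = [(0, 2), (2, 5), (5, 6), (6, 10)]   # (start, exclusive end) pairs

short_values, short_indices = truncate_ragged(values, indices, 2)
empty_values, empty_indices = truncate_ragged(values, indices, 0)
```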
