module
dframeio.parquet
Access parquet datasets using pyarrow.
Classes
ParquetBackend — Backend to read and write parquet datasets
class
dframeio.parquet.ParquetBackend(base_path, partitions=None, rows_per_file=0)
Backend to read and write parquet datasets
Parameters
base_path (str) — Base path for the parquet files. Only files in this folder or subfolders can be read from or written to.
partitions (iterable of str, optional) — (For writing only) Columns to use for partitioning. If given, the write functions split the data into a parquet dataset. Subfolders with the naming schema column_name=value are created when writing. By default data is written as a single file. Cannot be combined with rows_per_file.
rows_per_file (int, optional) — (For writing only) If a positive integer value is given, it specifies the desired number of rows per file. The data is then written to multiple files. By default data is written as a single file. Cannot be combined with partitions.
Raises
TypeError — If any of the input arguments has a different type than documented
ValueError — If any of the input arguments is outside of the documented value ranges or if conflicting arguments are given
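Example: a minimal sketch of constructing backends with the documented arguments. The base path, partition column and row count are hypothetical placeholders, not part of the API.

    from dframeio.parquet import ParquetBackend

    # Plain backend: reads and writes single parquet files below /data (hypothetical path)
    backend = ParquetBackend("/data")

    # Partitioned writes: creates subfolders such as country=DE (hypothetical column name)
    partitioned_backend = ParquetBackend("/data", partitions=["country"])

    # Chunked writes: aim for roughly one million rows per output file
    chunked_backend = ParquetBackend("/data", rows_per_file=1_000_000)

    # Combining partitions and rows_per_file raises a ValueError, as documented above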
Methods
read_to_dict(source, columns, row_filter, limit, sample, drop_duplicates) (dict(str: list)) — Read a parquet dataset from disk into a dictionary of columns
read_to_pandas(source, columns, row_filter, limit, sample, drop_duplicates) (DataFrame) — Read a parquet dataset from disk into a pandas DataFrame
write_append(target, dataframe) — Write data in append-mode
write_replace(target, dataframe) — Write data with full replacement of an existing dataset
method
read_to_pandas(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)
Read a parquet dataset from disk into a pandas DataFrame
Parameters
source (str) — The path of the file or folder with a parquet dataset to read
columns (list of str, optional) — List of column names to limit the reading to
row_filter (str, optional) — Filter expression for selecting rows
limit (int, optional) — Maximum number of rows to return (limit to first n rows)
sample (int, optional) — Size of a random sample to return
drop_duplicates (bool, optional) — Whether to drop duplicate rows from the final selection
Returns (DataFrame)
A pandas DataFrame with the requested data.
Raises
ValueError — If the path specified with source is outside of the base path
The logic of the filtering arguments is as documented for
AbstractDataFrameReader.read_to_pandas().
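Example: an illustrative read using the filtering arguments. The dataset path, column names and filter expression are hypothetical; the filter syntax follows AbstractDataFrameReader.read_to_pandas() as noted above.

    from dframeio.parquet import ParquetBackend

    backend = ParquetBackend("/data")  # hypothetical base path

    # Read two columns of the first 1000 rows matching the (illustrative) filter expression
    df = backend.read_to_pandas(
        "sales.parquet",               # hypothetical dataset below the base path
        columns=["country", "price"],
        row_filter="price > 0",
        limit=1000,
    )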
method
read_to_dict(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)
Read a parquet dataset from disk into a dictionary of columns
Parameters
source (str) — The path of the file or folder with a parquet dataset to read
columns (list of str, optional) — List of column names to limit the reading to
row_filter (str, optional) — Filter expression for selecting rows
limit (int, optional) — Maximum number of rows to return (limit to first n rows)
sample (int, optional) — Size of a random sample to return
drop_duplicates (bool, optional) — (Not supported!) Whether to drop duplicate rows
Returns (dict(str: list))
A dictionary with column names as keys and lists of column values as values
Raises
NotImplementedError — When drop_duplicates is specified
The logic of the filtering arguments is as documented for
AbstractDataFrameReader.read_to_pandas().
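Example: an illustrative call returning plain Python data structures. The dataset path, column names and sample values are hypothetical.

    from dframeio.parquet import ParquetBackend

    backend = ParquetBackend("/data")  # hypothetical base path

    # Returns a dict mapping column names to lists of values,
    # e.g. {"country": ["DE", "FR", ...], "price": [9.99, ...]}
    data = backend.read_to_dict("sales.parquet", columns=["country", "price"])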
method
write_replace(target, dataframe)
Write data with full replacement of an existing dataset
Parameters
target (str) — The path of the file or folder to write to. The path may be absolute or relative to the base_path given in the __init__() function.
dataframe (Union(DataFrame, dict(str: list))) — The data to write, as a pandas.DataFrame or as a Python dictionary in the format column_name: [column_data]
Raises
TypeError — When the dataframe is neither a pandas.DataFrame nor a dictionary
ValueError — If the dataframe does not contain the columns to partition by as specified in the __init__() function
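Example: an illustrative replacement write with a partitioned backend. The base path, target folder, column names and values are hypothetical.

    import pandas as pd

    from dframeio.parquet import ParquetBackend

    # Hypothetical base path and partition column
    backend = ParquetBackend("/data", partitions=["country"])

    df = pd.DataFrame({"country": ["DE", "FR"], "price": [9.99, 14.99]})

    # Replaces the dataset in /data/sales (hypothetical target) with df,
    # written as a parquet dataset partitioned by country
    backend.write_replace("sales", df)

    # A plain dictionary of columns is accepted as well
    backend.write_replace("sales", {"country": ["DE"], "price": [4.99]})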
method
write_append(target, dataframe)
Write data in append-mode