module
dframeio.parquet
Access parquet datasets using pyarrow.
Classes
ParquetBackend — Backend to read and write parquet datasets
class
dframeio.parquet.ParquetBackend(base_path, partitions=None, rows_per_file=0)
Backend to read and write parquet datasets
Parameters
base_path (str) — Base path for the parquet files. Only files in this folder or subfolders can be read from or written to.
partitions (iterable of str, optional) — (For writing only) Columns to use for partitioning. If given, the write functions split the data into a parquet dataset. Subfolders with the naming schema column_name=value are created when writing. By default data is written as a single file. Cannot be combined with rows_per_file.
rows_per_file (int, optional) — (For writing only) If a positive integer value is given, it specifies the desired number of rows per file. The data is then written to multiple files. By default data is written as a single file. Cannot be combined with partitions.
Raises
TypeError — If any of the input arguments has a different type than documented
ValueError — If any of the input arguments is outside of the documented value ranges or if conflicting arguments are given
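Example: a minimal sketch of constructing backends with the documented arguments. The base path, partition column and row count are hypothetical placeholders, not part of the API.

    from dframeio.parquet import ParquetBackend

    # Plain backend: reads and writes single parquet files below /data (hypothetical path)
    backend = ParquetBackend("/data")

    # Partitioned writes: creates subfolders such as country=DE (hypothetical column name)
    partitioned_backend = ParquetBackend("/data", partitions=["country"])

    # Chunked writes: aim for roughly one million rows per output file
    chunked_backend = ParquetBackend("/data", rows_per_file=1_000_000)

    # Combining partitions and rows_per_file raises a ValueError, as documented above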
Methods
read_to_dict(source, columns, row_filter, limit, sample, drop_duplicates) (dict(str: list)) — Read a parquet dataset from disk into a dictionary of columns
read_to_pandas(source, columns, row_filter, limit, sample, drop_duplicates) (DataFrame) — Read a parquet dataset from disk into a pandas DataFrame
write_append(target, dataframe) — Write data in append-mode
write_replace(target, dataframe) — Write data with full replacement of an existing dataset
method
read_to_pandas(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)
Read a parquet dataset from disk into a pandas DataFrame
Parameters
source (str) — The path of the file or folder with a parquet dataset to read
columns (list of str, optional) — List of column names to limit the reading to
row_filter (str, optional) — Filter expression for selecting rows
limit (int, optional) — Maximum number of rows to return (limit to first n rows)
sample (int, optional) — Size of a random sample to return
drop_duplicates (bool, optional) — Whether to drop duplicate rows from the final selection
Returns (DataFrame)
A pandas DataFrame with the requested data.
Raises
ValueError — If the path specified with source is outside of the base path
The logic of the filtering arguments is as documented for
AbstractDataFrameReader.read_to_pandas().
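Example: an illustrative read using the filtering arguments. The dataset path, column names and filter expression are hypothetical; the filter syntax follows AbstractDataFrameReader.read_to_pandas() as noted above.

    from dframeio.parquet import ParquetBackend

    backend = ParquetBackend("/data")  # hypothetical base path

    # Read two columns of the first 1000 rows matching the (illustrative) filter expression
    df = backend.read_to_pandas(
        "sales.parquet",               # hypothetical dataset below the base path
        columns=["country", "price"],
        row_filter="price > 0",
        limit=1000,
    )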
method
read_to_dict(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)
Read a parquet dataset from disk into a dictionary of columns
Parameters
source (str) — The path of the file or folder with a parquet dataset to read
columns (list of str, optional) — List of column names to limit the reading to
row_filter (str, optional) — Filter expression for selecting rows
limit (int, optional) — Maximum number of rows to return (limit to first n rows)
sample (int, optional) — Size of a random sample to return
drop_duplicates (bool, optional) — (Not supported!) Whether to drop duplicate rows
Returns (dict(str: list))
A dictionary with column names as keys and lists of column values as values
Raises
NotImplementedError — When drop_duplicates is specified
The logic of the filtering arguments is as documented for
AbstractDataFrameReader.read_to_pandas().
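Example: an illustrative call returning plain Python data structures. The dataset path, column names and sample values are hypothetical.

    from dframeio.parquet import ParquetBackend

    backend = ParquetBackend("/data")  # hypothetical base path

    # Returns a dict mapping column names to lists of values,
    # e.g. {"country": ["DE", "FR", ...], "price": [9.99, ...]}
    data = backend.read_to_dict("sales.parquet", columns=["country", "price"])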
method
write_replace(target, dataframe)
Write data with full replacement of an existing dataset
Parameters
target (str) — The path of the file or folder to write to. The path may be absolute or relative to the base_path given in the __init__() function.
dataframe (Union(DataFrame, dict(str: list))) — The data to write, as a pandas.DataFrame or as a Python dictionary in the format column_name: [column_data]
Raises
TypeError — When the dataframe is neither a pandas.DataFrame nor a dictionary
ValueError — If the dataframe does not contain the columns to partition by as specified in the __init__() function
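Example: an illustrative replacement write with a partitioned backend. The base path, target folder, column names and values are hypothetical.

    import pandas as pd

    from dframeio.parquet import ParquetBackend

    # Hypothetical base path and partition column
    backend = ParquetBackend("/data", partitions=["country"])

    df = pd.DataFrame({"country": ["DE", "FR"], "price": [9.99, 14.99]})

    # Replaces the dataset in /data/sales (hypothetical target) with df,
    # written as a parquet dataset partitioned by country
    backend.write_replace("sales", df)

    # A plain dictionary of columns is accepted as well
    backend.write_replace("sales", {"country": ["DE"], "price": [4.99]})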
method
write_append(target, dataframe)
Write data in append-mode