module dframeio.parquet
Access parquet datasets using pyarrow.
Classes
ParquetBackend — Backend to read and write parquet datasets
class dframeio.parquet.ParquetBackend(base_path, partitions=None, rows_per_file=0)
Backend to read and write parquet datasets
Parameters
base_path (str) — Base path for the parquet files. Only files in this folder or its subfolders can be read from or written to.
partitions (iterable of str, optional) — (For writing only) Columns to use for partitioning. If given, the write functions split the data into a parquet dataset, creating subfolders with the naming schema column_name=value. By default data is written as a single file. Cannot be combined with rows_per_file.
rows_per_file (int, optional) — (For writing only) If a positive integer is given, it specifies the desired number of rows per file and the data is written to multiple files. By default data is written as a single file. Cannot be combined with partitions.
Raises
TypeError — If any of the input arguments has a different type than documented
ValueError — If any of the input arguments is outside of the documented value range or if conflicting arguments are given
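Example (a minimal sketch; the base path "/data" and the column name "country" are made-up for illustration):

    >>> from dframeio.parquet import ParquetBackend
    >>> # Partitioned writing: subfolders like country=DE are created on write
    >>> backend = ParquetBackend("/data", partitions=["country"])
    >>> # Alternatively, split the output into files of about 10000 rows each.
    >>> # partitions and rows_per_file are mutually exclusive (ValueError).
    >>> backend = ParquetBackend("/data", rows_per_file=10000)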
Methods
read_to_dict(source, columns, row_filter, limit, sample, drop_duplicates) (dict(str: list)) — Read a parquet dataset from disk into a dictionary of columns
read_to_pandas(source, columns, row_filter, limit, sample, drop_duplicates) (DataFrame) — Read a parquet dataset from disk into a pandas DataFrame
write_append(target, dataframe) — Write data in append-mode
write_replace(target, dataframe) — Write data with full replacement of an existing dataset
method read_to_pandas(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)
Read a parquet dataset from disk into a pandas DataFrame
Parameters
source (str) — The path of the file or folder with a parquet dataset to read
columns (list of str, optional) — List of column names to limit the reading to
row_filter (str, optional) — Filter expression for selecting rows
limit (int, optional) — Maximum number of rows to return (limit to first n rows)
sample (int, optional) — Size of a random sample to return
drop_duplicates (bool, optional) — Whether to drop duplicate rows from the final selection
Returns (DataFrame)
A pandas DataFrame with the requested data.
Raises
ValueError — If the path specified with source is outside of the base path
The logic of the filtering arguments is as documented for AbstractDataFrameReader.read_to_pandas().
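Example (a sketch; the dataset name "sales", the column names, and the shown filter syntax are assumptions, not part of this reference):

    >>> backend = ParquetBackend("/data")
    >>> df = backend.read_to_pandas(
    ...     "sales",                      # folder below base_path
    ...     columns=["order_id", "amount"],
    ...     row_filter="amount > 100",    # keep only matching rows
    ...     limit=1000,                   # at most the first 1000 rows
    ... )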
method read_to_dict(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)
Read a parquet dataset from disk into a dictionary of columns
Parameters
source (str) — The path of the file or folder with a parquet dataset to read
columns (list of str, optional) — List of column names to limit the reading to
row_filter (str, optional) — Filter expression for selecting rows
limit (int, optional) — Maximum number of rows to return (limit to first n rows)
sample (int, optional) — Size of a random sample to return
drop_duplicates (bool, optional) — (Not supported!) Whether to drop duplicate rows
Returns (dict(str: list))
A dictionary with column names as keys and lists of column values as values
Raises
NotImplementedError — When drop_duplicates is specified
The logic of the filtering arguments is as documented for AbstractDataFrameReader.read_to_pandas().
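Example (same hypothetical dataset as above; the printed result is illustrative only):

    >>> backend = ParquetBackend("/data")
    >>> # Same filtering arguments as read_to_pandas, but passing
    >>> # drop_duplicates=True raises NotImplementedError here.
    >>> data = backend.read_to_dict("sales", columns=["order_id", "amount"])
    >>> # e.g. {"order_id": [1, 2], "amount": [42.0, 13.5]}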
method write_replace(target, dataframe)
Write data with full replacement of an existing dataset
Parameters
target (str) — The path of the file or folder to write to. The path may be absolute or relative to the base_path given in the __init__() function.
dataframe (Union(DataFrame, dict(str: list))) — The data to write, as a pandas.DataFrame or as a Python dictionary in the format column_name: [column_data]
Raises
TypeError — When the dataframe is neither a pandas.DataFrame nor a dictionary
ValueError — If the dataframe does not contain the columns to partition by as specified in the __init__() function
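Example (a sketch reusing the hypothetical "country" partitioning and "sales" target from above):

    >>> import pandas as pd
    >>> backend = ParquetBackend("/data", partitions=["country"])
    >>> df = pd.DataFrame({"country": ["DE", "FR"], "amount": [42.0, 13.5]})
    >>> # Replaces any existing dataset under /data/sales,
    >>> # writing one subfolder per country value (country=DE, country=FR)
    >>> backend.write_replace("sales", df)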
method write_append(target, dataframe)
Write data in append-mode
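Example (a sketch; this assumes write_append accepts the same target and dataframe arguments as write_replace, as the methods summary suggests):

    >>> backend = ParquetBackend("/data")
    >>> # Appends the rows to the existing dataset instead of replacing it
    >>> backend.write_append("sales", {"country": ["IT"], "amount": [7.0]})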