

Access parquet datasets using pyarrow.


dframeio.parquet.ParquetBackend(base_path, partitions=None, rows_per_file=0)

Backend to read and write parquet datasets

  • base_path (str) Base path for the parquet files. Only files in this folder or subfolders can be read from or written to.
  • partitions (iterable of str, optional) (For writing only) Columns to use for partitioning. If given, the write functions split the data into a parquet dataset. Subfolders with the following naming schema are created when writing: column_name=value.
    By default, data is written as a single file.
    Cannot be combined with rows_per_file.
  • rows_per_file (int, optional) (For writing only) If a positive integer value is given this specifies the desired number of rows per file. The data is then written to multiple files.
    By default, data is written as a single file.
    Cannot be combined with partitions.
  • TypeError If any of the input arguments has a different type than documented
  • ValueError If any of the input arguments are outside of the documented value ranges or if conflicting arguments are given.
  • read_to_dict(source, columns, row_filter, limit, sample, drop_duplicates) (dict(str: )) Read a parquet dataset from disk into a dictionary of columns
  • read_to_pandas(source, columns, row_filter, limit, sample, drop_duplicates) (DataFrame) Read a parquet dataset from disk into a pandas DataFrame
  • write_append(target, dataframe) Write data in append-mode
  • write_replace(target, dataframe) Write data with full replacement of an existing dataset
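
The constructor options partitions and rows_per_file can be illustrated with a small, dependency-free sketch. The helper names below (check_write_options, partition_folders, split_rows) are hypothetical and not part of dframeio; they only mirror the documented behaviour on a plain dictionary of columns. Nesting several partition columns as column_a=x/column_b=y is an assumption based on the usual parquet dataset layout.

```python
def check_write_options(partitions=None, rows_per_file=0):
    """Mirror the documented argument checks (hypothetical helper, not dframeio code)."""
    if not isinstance(rows_per_file, int):
        raise TypeError("rows_per_file must be an integer")
    if partitions is not None and rows_per_file > 0:
        raise ValueError("partitions and rows_per_file cannot be combined")


def partition_folders(data, partitions):
    """Return the 'column_name=value' subfolder for every row of a column dict."""
    n_rows = len(next(iter(data.values())))
    return [
        # Assumption: multiple partition columns are nested as subfolders.
        "/".join(f"{col}={data[col][i]}" for col in partitions)
        for i in range(n_rows)
    ]


def split_rows(data, rows_per_file):
    """Split a column dict into chunks of at most rows_per_file rows each."""
    n_rows = len(next(iter(data.values())))
    return [
        {col: values[start:start + rows_per_file] for col, values in data.items()}
        for start in range(0, n_rows, rows_per_file)
    ]


data = {"year": [2020, 2020, 2021], "value": [1, 2, 3]}
folders = partition_folders(data, ["year"])
chunks = split_rows(data, 2)
```

With partitions=["year"], each row lands in the subfolder named after its year value; with rows_per_file=2, the same three rows would instead be spread over two files.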

read_to_pandas(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)

Read a parquet dataset from disk into a pandas DataFrame

  • source (str) The path of the file or folder with a parquet dataset to read
  • columns (list of str, optional) List of column names to limit the reading to
  • row_filter (str, optional) Filter expression for selecting rows
  • limit (int, optional) Maximum number of rows to return (limit to first n rows)
  • sample (int, optional) Size of a random sample to return
  • drop_duplicates (bool, optional) Whether to drop duplicate rows from the final selection
Returns (DataFrame)

A pandas DataFrame with the requested data.

  • ValueError If the path specified with source is outside of the base path

The logic of the filtering arguments is as documented for AbstractDataFrameReader.read_to_pandas().
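
The restriction to the base path can be pictured with the following sketch. It is one possible way to perform such a check using only pathlib, not necessarily how the backend implements it; resolve_inside_base is a hypothetical helper name.

```python
from pathlib import Path


def resolve_inside_base(base_path, source):
    """Resolve source against base_path and reject paths that escape it.

    Hypothetical helper for illustration; not part of dframeio.
    """
    base = Path(base_path).resolve()
    # Joining with an absolute source yields the absolute source itself,
    # so both relative and absolute inputs go through the same check.
    candidate = (base / source).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"{source!r} is outside of the base path {str(base)!r}")
    return candidate


inside = resolve_inside_base("/tmp/base", "sales/2020.parquet")
```

A relative source such as "../other.parquet" resolves to a location above the base path and is rejected with a ValueError in this sketch.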


read_to_dict(source, columns=None, row_filter=None, limit=-1, sample=-1, drop_duplicates=False)

Read a parquet dataset from disk into a dictionary of columns

  • source (str) The path of the file or folder with a parquet dataset to read
  • columns (list of str, optional) List of column names to limit the reading to
  • row_filter (str, optional) Filter expression for selecting rows
  • limit (int, optional) Maximum number of rows to return (limit to first n rows)
  • sample (int, optional) Size of a random sample to return
  • drop_duplicates (bool, optional) (Not supported!) Whether to drop duplicate rows
Returns (dict(str: ))

A dictionary with column names as keys and lists of column values as values

  • NotImplementedError When drop_duplicates is specified

The logic of the filtering arguments is as documented for AbstractDataFrameReader.read_to_pandas().
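
To make the returned layout concrete, the sketch below works on a hand-written dictionary of columns in the documented shape. Both helpers (iter_rows, head) are hypothetical and not part of dframeio. Treating a negative limit as "no limit" is an assumption derived from the default value of -1.

```python
def iter_rows(columns):
    """Yield one dict per row from a dictionary of equally long columns."""
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


def head(columns, limit=-1):
    """Keep the first `limit` rows; a negative limit keeps everything (assumption)."""
    if limit < 0:
        return {name: list(values) for name, values in columns.items()}
    return {name: list(values[:limit]) for name, values in columns.items()}


# Example data in the shape read_to_dict() is documented to return.
data = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
rows = list(iter_rows(data))
first_two = head(data, 2)
```

Each key maps to one list per column, so row-wise access requires zipping the lists together as iter_rows does.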


write_replace(target, dataframe)

Write data with full replacement of an existing dataset

  • target (str) The path of the file or folder to write to. The path may be absolute or relative to the base_path given in the __init__() function.
  • dataframe (Union(dataframe, dict(str: ))) The data to write as pandas.DataFrame or as a Python dictionary in the format column_name: [column_data]
  • TypeError When the dataframe is neither a pandas.DataFrame nor a dictionary
  • ValueError If the dataframe does not contain the columns to partition by as specified in the __init__() function.
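
The two error conditions above can be sketched for the dictionary input format column_name: [column_data]. The helper check_writable is hypothetical and not part of dframeio. To stay dependency-free, it accepts only dictionaries, whereas the backend is documented to accept a pandas.DataFrame as well.

```python
def check_writable(dataframe, partitions=None):
    """Validate a column dict before writing (hypothetical helper, not dframeio code)."""
    # Simplification: the pandas.DataFrame case is omitted in this sketch.
    if not isinstance(dataframe, dict):
        raise TypeError("dataframe must be a dictionary of columns")
    missing = [col for col in (partitions or []) if col not in dataframe]
    if missing:
        raise ValueError(f"partition columns missing from the data: {missing}")
    return True


data = {"year": [2020, 2021], "value": [1.5, 2.5]}
ok = check_writable(data, partitions=["year"])
```

Passing partitions=["month"] with the same data would raise a ValueError in this sketch, because no month column is present.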

write_append(target, dataframe)

Write data in append-mode