The notebooks in this repository were used to execute the experimental evaluation of our paper "Towards Realistic Error Models for Tabular Data". Specifically,
- The notebook `dataset_generation.ipynb` contains the procedure we followed to generate datasets corresponding to the error scenarios we describe in our paper.
- The notebook `dataset_analysis.ipynb` contains our analysis of the `HOSP` dataset.
- The notebook `plots.ipynb` contains the procedure we use to generate the figures in our publication. It reads the experiments' results from the `error_paper/measurements/` directory -- check the notebook's code for details.
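For orientation, here is a minimal sketch of how such measurement files could be read and aggregated. The CSV layout and the column names `scenario` and `f1` are assumptions for illustration only; `plots.ipynb` contains the actual procedure.

```python
# Hypothetical sketch: load measurement files and draw a simple comparison plot.
# File format and column names ("scenario", "f1") are assumed, not the real layout.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

measurement_dir = Path("error_paper/measurements")

# Concatenate all CSV files found in the measurements directory into one frame.
frames = [pd.read_csv(path) for path in sorted(measurement_dir.glob("*.csv"))]
results = pd.concat(frames, ignore_index=True)

# Aggregate one metric per error scenario and plot it.
results.groupby("scenario")["f1"].mean().plot(kind="bar", ylabel="mean F1")
plt.tight_layout()
plt.show()
```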
We use Poetry to manage dependencies. Simply run `poetry install` to install them.
In our experiments, we use `tab_err` to examine the impact of errors on data cleaning and on downstream machine learning tasks.
- In the first part of the data cleaning experiments, we generate various erroneous versions of the `HOSP` dataset and clean them with HoloClean (`benchmarks/hosp-impact`).
- We then generate various erroneous versions of the `bridges`, `beers`, `restaurant`, and `cars` datasets and correct them with the algorithms Baran & Raha, HoloClean, and Renuver (`benchmarks/cleaning-impact`).
- For the downstream machine learning task impact, we look at how ML models behave given data with various errors (`benchmarks/ml_downstream_experiments`).
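As a rough illustration of the downstream-impact setup (not the benchmark code in `benchmarks/ml_downstream_experiments`), the sketch below corrupts a fraction of cells in a toy dataset and compares a model trained on clean data with one trained on corrupted data. The random-NaN corruption routine is a stand-in for the `tab_err` error models, and the dataset, model, and corruption rate are assumptions for illustration.

```python
# Illustrative sketch only: compare downstream model quality on clean vs. corrupted data.
# The random cell corruption below is a stand-in for the error models provided by tab_err.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Load a toy dataset and split it once so the clean and dirty runs share the same test set.
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def corrupt(df: pd.DataFrame, fraction: float = 0.2) -> pd.DataFrame:
    """Set a random fraction of cells to NaN (a crude 'missing value' error model)."""
    mask = rng.random(df.shape) < fraction
    return df.mask(mask)

for label, train_features in [("clean", X_train), ("corrupted", corrupt(X_train))]:
    # Impute corrupted cells naively so the model can be fit at all.
    model = RandomForestClassifier(random_state=0)
    model.fit(train_features.fillna(train_features.mean()), y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print(f"{label} training data -> test accuracy {score:.3f}")
```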
Check the documentation in `benchmarks/README.md` for instructions on how to replicate our measurements.
We also measured the memory consumption and runtime of `tab_err` for various error models and dataset sizes. See the directory `benchmarks/profiling` for examples.
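For orientation, a minimal sketch of how runtime and peak memory could be profiled while scaling the dataset size is shown below. It times a placeholder corruption routine rather than an actual `tab_err` error model, and the dataset sizes and helper names are assumptions; refer to `benchmarks/profiling` for the real setup.

```python
# Illustrative sketch: profile runtime and peak memory while growing the dataset size.
# The corrupt() routine is a placeholder for an actual tab_err error model.
import time
import tracemalloc

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

def corrupt(df: pd.DataFrame, fraction: float = 0.1) -> pd.DataFrame:
    """Placeholder error model: overwrite a random fraction of cells with NaN."""
    return df.mask(rng.random(df.shape) < fraction)

for n_rows in (1_000, 10_000, 100_000):
    data = pd.DataFrame(rng.normal(size=(n_rows, 10)))

    tracemalloc.start()
    start = time.perf_counter()
    corrupt(data)
    runtime = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"{n_rows:>7} rows: {runtime:.3f} s, peak {peak_bytes / 1e6:.1f} MB")
```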