Clean Messy Data with OpenRefine

OpenRefine Basics

"A power tool for working with messy data"

David Huynh, the original creator, described Refine as:

  • more powerful than a spreadsheet
  • more interactive and visual than scripting
  • more provisional / exploratory / experimental / playful than a database

Explore - navigate and evaluate quality with visualizations and filters that help dig deeply into the data so you can get to know it better

Clean - efficiently discover and fix inconsistencies with faceting, clustering, cell transforms, and GREL (General Refine Expression Language) expressions
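To give a feel for what clustering does, here is a minimal Python sketch of a key-collision approach, loosely modeled on the "fingerprint" idea of normalizing each value into a key so that near-duplicate spellings land in the same group. It is an illustration only, not OpenRefine's own code; the function names and the sample names are made up for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so near-duplicate spellings collide."""
    # Trim, lowercase, and reduce accented letters to plain ASCII.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then sort the unique whitespace-separated tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep only groups with several variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical messy column values.
names = ["Müller, Anna", "anna muller", "Anna Müller ", "Smith, John"]
clusters = cluster(names)
print(clusters)
```

The three spellings of the same person share the key "anna muller" and form one cluster, while "Smith, John" has no variants and is left out. In OpenRefine you would then review each suggested cluster and choose the value to merge to.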

Transform - easily change formats, subset, or reshape with split/join multi-valued cells, split columns, and transpose columns/rows
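As a rough picture of what splitting multi-valued cells means, the Python sketch below expands a cell holding several values into one row per value. The helper name, the column names, and the ";" separator are assumptions for this example, and leaving the other columns blank on the added rows is meant to imitate how the result looks in OpenRefine rather than to reproduce its implementation.

```python
def split_multi_valued(rows, column, separator=";"):
    """Return new rows with one value per row in `column`."""
    result = []
    for row in rows:
        parts = [part.strip() for part in str(row[column]).split(separator)]
        for index, part in enumerate(parts):
            if index == 0:
                # The first value stays on the original row.
                result.append({**row, column: part})
            else:
                # Extra values go on added rows; other columns are left blank.
                result.append({key: (part if key == column else "") for key in row})
    return result


# Hypothetical data: one book has two authors in a single cell.
books = [
    {"title": "Book A", "authors": "Lee; Kim"},
    {"title": "Book B", "authors": "Park"},
]
for row in split_multi_valued(books, "authors"):
    print(row)
```

Joining multi-valued cells is the reverse step: the values on the added rows are merged back into a single cell with a separator of your choice.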

Enrich - extend and enhance data by combining files, merging projects, fetching URLs, and reconciling with online databases

Automate - record and preserve your processing routine for transparency, then automate reuse by exporting operation history in JSON
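An exported operation history is a JSON list with one entry per step. The Python sketch below reads a small, hand-written history and prints a plain-language summary, which can be handy when documenting a workflow. The two steps and the column name "city" are hypothetical, and the fields shown are only meant to resemble a typical export; a real file may carry more detail per step.

```python
import json

# Hypothetical history, shaped like the JSON that OpenRefine exports.
history_json = """
[
  {
    "op": "core/text-transform",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "city",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column city using expression value.trim()"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "city",
    "newColumnName": "City",
    "description": "Rename column city to City"
  }
]
"""


def summarize(history_text):
    """Return (operation id, description) pairs in the order they were applied."""
    operations = json.loads(history_text)
    return [(step["op"], step.get("description", "")) for step in operations]


for number, (op, description) in enumerate(summarize(history_json), start=1):
    print(f"{number}. {op}: {description}")
```

Because the history is plain text, it can be saved alongside your data, shared with collaborators, or pasted into another project to repeat the same steps.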

Illustration from the Openscapes blog by Julia Lowndes and Allison Horst.

 

OpenRefine Features

OpenRefine

  • works on large-ish datasets, in the low six-figure row range
  • does not change or attempt to interpret your data when uploading
  • does not modify your original dataset
  • saves your work as you go, and allows changes to be easily reversed
  • saves an operation history that can be exported and applied to future datasets, or shared with collaborators

OpenRefine is NOT

  • a spreadsheet for data entry
  • a statistical analysis tool
  • cloud-based, even though it runs in your browser and uses the language of upload and download; it runs locally on your own computer