
Cleaning the data with OpenRefine

By Tony Hirst from the Open Knowledge Foundation

Whenever you work with a real dataset, it's not uncommon to find that the data doesn't quite work with the tool you're trying to load it into, or it isn't structured in a way that you can conveniently use.

 

A good example of this is data that describes financial information, where the values of an "amount" column come in a form that makes sense to us, but that confuses spreadsheets or chart plotting tools. For example, amounts may have a currency label (such as $5000 or €2000) or contain separators that mark the value out as a piece of text rather than a number, such as 10,000.12 rather than 10000.12, or 1.500.250,12 rather than 1500250,12 (or maybe even 1500250.12, depending on the numbering convention being used).

Whilst it is often possible to use a text editor or spreadsheet to search for and replace any unwanted characters within a dataset, there are several dangers associated with taking such an approach. For example, a global search and replace may remove legitimate characters that appear elsewhere in the dataset. In addition, without a record of the changes made to the original dataset, difficulties may be encountered when it comes to demonstrating the provenance or correctness of the cleaned dataset.

A far better approach is to use a tool such as OpenRefine (available from OpenRefine.org) or the OKF Labs Data Explorer. Here's an example using OpenRefine:
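
The expression itself is not reproduced here, but based on the description that follows it would look something like the following snippet of GREL (OpenRefine's expression language), applied to the amount column as a cell transform:

    value.replace(/[€$£]/, '')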

What this expression says is: replace any character (value.replace()) in the set (/[]/) that contains the characters €$£ with nothing (that is, an empty string: ''). One advantage of using tools such as these is that they can generate "change logs" that itemise the changes that have been applied to an original dataset, and thus allow you to verify or check any changes that have been made to a dataset whilst cleaning it. OpenRefine also allows you to undo any incorrect changes you have made to a dataset whilst you are working out how to clean it!

When is a number not a number?

Even if you have removed any unwanted punctuation from a number, it might still be characterised as a text string (that is, as a sequence of text characters), rather than a numerical quantity. Sorting numbers as text strings rather than as actual numbers can have many unfortunate side effects, because the values are compared character by character rather than by numerical size, giving an order that whilst "correct" is not necessarily very meaningful. For example, here's the result of sorting a set of numbers represented as character strings, with the numbers appearing in the order we might expect if they were included in a dictionary.

In such a case, it may be necessary to declare the type of the column as a number type such as numeric, integer or float (that is, a floating point number - typically, a real number with a decimal point in it) before sorting on it in order to get a numerical ordering.
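
As a quick illustration of the difference (a hypothetical sketch in Python rather than anything taken from the original article), compare sorting the same values as text and as numbers:

    # Numbers stored as text sort in "dictionary" order...
    values = ["10", "2", "1000", "35", "7"]
    print(sorted(values))             # ['10', '1000', '2', '35', '7']

    # ...whereas converting them to a numeric type first gives the expected order.
    print(sorted(values, key=float))  # ['2', '7', '10', '35', '1000']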

Similar issues around unexpected orderings can arise when trying to sort columns containing date values if the column is not recognised as representing things that should be interpreted as dates.

Out of Bounds

We all know how easy it is to make a typographical error when writing a piece of text, but errors also abound when dealing with numbers. One important part of cleaning a dataset is to check that any numbers you are dealing with fall within expected limits, even if you can't check that every number is actually correct. If you have a column of numbers in a dataset, it often makes sense to have a quick look at the range and distribution of the numbers to see if there are any obvious outliers. For example, suppose I am expecting a set of numbers to have values that fall within the range 0 to 1000, but the numeric facet view in OpenRefine shows me numbers outside that range.
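
Away from OpenRefine's facet view, a rough equivalent of that kind of range check, sketched in Python with made-up values, might look like this:

    # Hypothetical amounts that are expected to fall between 0 and 1000
    amounts = [120, 56, 998, 15000, 430, -3, 999]

    out_of_range = [a for a in amounts if not 0 <= a <= 1000]
    print(out_of_range)   # [15000, -3] - candidates for closer inspection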

By using the sliders, we can filter the data rows to see which ones contain data elements in the specified column that fall outside the expected range.

When things that should be the same aren't...

Another problem commonly associated with datasets arises from things that are supposed to be the same being written in different ways, and not necessarily just as the result of a typographical error. For example, here is a partial view over a supposedly unique list of company names that appear in a particular column of a particular dataset:

Example Company Ltd
Example Company Limited
Example Company Ltd.
Example Company limited
Example Company Ltd
EXAMPLE COMPANY LIMITED
Another Company Ltd

In this case, we have several fictional company names, the majority of which look as if they should be the same, but which differ slightly in capitalisation or full spelling. Two of the names also appear to be identical (Example Company Ltd) - in this case, there is a blank space at the end of one of the names that makes it different to the other in terms of exactly matching text strings. To cope with such strings, we may be able to run a tool that removes any unwanted white space characters, or that "normalises" the capitalisation according to a particular convention (such as capitalising the first letter of each word, or just the first letter in a sentence). In more extreme cases, we may want to convert everything to either lower case or upper case characters.
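
A tiny, hypothetical sketch of that kind of normalisation in Python (not how OpenRefine implements it) might be:

    names = [
        "Example Company Ltd ",      # note the trailing space
        "EXAMPLE COMPANY LIMITED",
        "Example Company limited",
    ]

    # Collapse stray whitespace and normalise the capitalisation
    cleaned = [" ".join(n.split()).lower() for n in names]
    print(cleaned)
    # ['example company ltd', 'example company limited', 'example company limited']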

For strings that almost match we can use a family of techniques often referred to as partial string matching. These techniques tend to look for strings that differ only in a small number of characters, or that require only a small number of edits to change one string into the other. Having detected partially matching strings, we can then decide whether we want to "normalise" them - that is, rewrite some or all of the partially matched strings so that they are the same.
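
For instance, Python's standard difflib module offers a simple form of partial string matching based on how much of two strings lines up; a rough sketch, with made-up names and an arbitrary similarity cutoff, might be:

    import difflib

    names = ["Example Company Ltd", "Example Company Ltd.", "Another Company Ltd"]

    # Find entries that are "close enough" to a target string (the cutoff is arbitrary)
    matches = difflib.get_close_matches("Example Company Limited", names, n=3, cutoff=0.8)
    print(matches)   # ['Example Company Ltd', 'Example Company Ltd.']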

Sometimes we may want to run string clustering repeatedly, or we may choose to merge clusters detected in different ways and rewrite them as the same thing. As well as similarity measures based on the characters used in each string, we might also be able to cluster items on a "soundalike" basis:
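
OpenRefine's key collision clustering works by reducing each value to a normalised "fingerprint" and grouping values that share a key. The following Python sketch is an approximation of that idea rather than OpenRefine's actual implementation; a phonetic key such as Soundex or Metaphone could be swapped in for the fingerprint function to cluster on a soundalike basis instead.

    import re
    from collections import defaultdict

    def fingerprint(value):
        # Lower-case, strip punctuation, then de-duplicate and sort the tokens
        tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
        return " ".join(sorted(set(tokens)))

    names = ["Example Company Ltd", "example company ltd.", "Ltd Example Company"]

    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)

    print(dict(clusters))
    # {'company example ltd': ['Example Company Ltd', 'example company ltd.', 'Ltd Example Company']}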

Being able to tidy data in this way can be a really powerful technique, although care needs to be taken not to merge items that are actually distinct.

Getting Your Data into Shape

Another form of data preparation relates to the reshaping of a dataset to get it into a form that you might be able to display using charting tools such as Datawrapper, or the charting tools in Data Explorer. For example, you may have data in a form where the values in a particular column might instead be used to cluster together data from several rows into a single row with additional columns:
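
One way to picture that "several rows into one" reshaping is a pivot, sketched here in Python using the pandas library with made-up regional figures (nothing from the original dataset):

    import pandas as pd

    # Hypothetical "long" data: one row per (region, year) combination
    long_df = pd.DataFrame({
        "region": ["North", "North", "South", "South"],
        "year":   [2013, 2014, 2013, 2014],
        "amount": [100, 120, 80, 95],
    })

    # Gather the rows for each region into a single row, with one column per year
    wide_df = long_df.pivot(index="region", columns="year", values="amount")
    print(wide_df)
    # year    2013  2014
    # region
    # North    100   120
    # South     80    95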

Conversely, you may have data contained within a particular column that might allow you to map one row onto many rows for one reason or another:
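
Going the other way - one row becoming several - might look like this rough pandas sketch, again with made-up values, where several items packed into a single cell are split out into one row each:

    import pandas as pd

    # Hypothetical data with several values packed into a single cell
    df = pd.DataFrame({
        "company":   ["Example Company Ltd"],
        "directors": ["Ann Smith; Bob Jones; Cat Brown"],
    })

    # Split the packed cell, then map the single row onto one row per director
    df["directors"] = df["directors"].str.split("; ")
    print(df.explode("directors"))
    #                company  directors
    # 0  Example Company Ltd  Ann Smith
    # 0  Example Company Ltd  Bob Jones
    # 0  Example Company Ltd  Cat Brown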

In other cases, you may simply have "gaps" in your data that need filling in. Again, data cleansing tools can help you fix this sort of problem in a straightforward way.
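
One common "gap", for example, is a grouping value that is only written on the first row of each group; an operation such as OpenRefine's fill down (or the equivalent sketched below in pandas, with made-up data) copies the last value seen into the blanks beneath it:

    import pandas as pd

    # Hypothetical data where the region is only recorded on the first row of each group
    df = pd.DataFrame({
        "region": ["North", None, None, "South", None],
        "amount": [100, 120, 95, 80, 60],
    })

    # Fill each gap with the last non-empty value above it ("fill down")
    df["region"] = df["region"].ffill()
    print(df)
    #   region  amount
    # 0  North     100
    # 1  North     120
    # 2  North      95
    # 3  South      80
    # 4  South      60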

More complex reshaping operations are also possible, but that should give you a taste...

Putting All the Pieces Together

As we have seen, cleaning a dataset may involve multiple different steps, from removing unwanted characters to matching text strings that refer to the same thing but that, as far as most applications are concerned, differ from each other. Data cleansing may also involve some other data preparation related activities, such as reshaping a dataset to get it into a format that allows you to work with it using readily available charting tools that expect data to be presented in a particular way.

The photo in this article was published by JD Hancock on Flickr under a CC BY 2.0 License.