With the increase in the amount of work done in the field of data analytics, the time analysts spend getting data into the desired format has also gone up. Though there have been continuous improvements in the data analysis tools used today, analysts still spend a lot of time on activities like reformatting data values, integrating data from multiple sources, and handling missing values. It is for this purpose that a tool like ‘Data Wrangler’ is used. Wrangler is an interactive data transformation tool. It allows direct manipulation of visualized data and provides automatic suggestions for relevant transformations. Analysts can iterate through the applicable operations and preview their effects. The use of Wrangler has been shown to reduce specification time significantly.
INTRODUCTION
For data analysis to begin, the data needs to be molded into an appropriate format before analysts can actually work on it. This is a time-consuming task: it is estimated that in a typical data warehousing project, data cleaning takes up about 80% of the total time and cost incurred. Before starting any analysis, analysts have to restructure the data so as to:
- Make it compatible with statistical packages, databases and other tools
- Address issues like misspellings and missing data
- Resolve duplicates, outliers, etc.
- Perform transformations
As such, this is a very tedious task that involves writing long and complex scripts in languages like Python, or manual editing using tools like Microsoft Excel.
Wrangler is a system for interactive data transformation. It helps analysts with transformations and minimizes repetition. Wrangler uses a combination of:
- a mixed-initiative user interface
- a declarative transformation language.
With the help of Data Wrangler, users can build a sequence of simple transforms to specify any transformation. When the user selects any data, applicable transformations are suggested by the tool based on the current context of interaction. Wrangler uses a modeling technique to enumerate and rate the possible transformations. This model combines the user's input with the diversity, frequency and specification difficulty of applicable transform types. For ease of use, Wrangler provides short natural language descriptions of the transforms as well as visual previews of their results. This helps analysts assess viable transforms quickly. Wrangler's interactive history viewer records and shows the sequence of transforms applied to the data set, so as to facilitate reuse and provide documentation of data provenance. Wrangler scripts can be run in a web browser using JavaScript or translated into MapReduce or Python code.
WRANGLER TRANSFORMATION LANGUAGE
Wrangler uses an underlying declarative data transformation language. This language consists of eight classes of transforms:
1) MAP: This transform maps one input data row to zero, one or multiple output rows.
· One to zero: A delete transform removes the input row.
· One to one: This includes transforms like extracting, cutting or splitting values into multiple columns, reformatting, simple arithmetic, and value updates.
· One to many: This includes operations for splitting data into multiple rows and unnesting arrays and sets.
2) Lookups and Joins: These incorporate data from external tables. Wrangler includes lookup tables to support mappings such as zip codes to state names for aggregation across states. Wrangler supports two types of joins, equi-joins and approximate joins, which help with lookups and with correcting typos for known data types.
3) Reshape: This manipulates data structure and schema. Wrangler provides two reshaping operators:
· Fold: collapses multiple columns into two or more columns containing key-value sets.
· Unfold: creates new column headers from data values.
4) Positional: This includes the fill and lag operators:
· Fill: generates values based on neighboring values in a row or column.
· Lag: shifts the values of a column up or down by a specified number of rows.
5) Sorting
6) Aggregation
7) Key Generation
8) Schema Transforms:
These are used to set column names, specify column data types, and assign semantic roles. Wrangler supports standard data types like strings, numbers and integers, as well as higher-level semantic roles like geographic locations, currencies and classification codes. This helps in data validation as well as transformation.
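As an illustration only (Wrangler's language is declarative, not Python), several of these transform classes map naturally onto pandas operations. The table and values below are toy data invented for the sketch:

```python
import pandas as pd

# Toy wide table, one crime-rate column per year (values are made up)
df = pd.DataFrame({
    "State": ["Alaska", "Arizona"],
    "2004": [4.2, 5.1],
    "2005": [4.0, 5.3],
})

# Fold: collapse the year columns into key-value (Year, Rate) pairs
folded = pd.melt(df, id_vars=["State"], var_name="Year", value_name="Rate")

# Unfold: promote the key column back into column headers
unfolded = folded.pivot(index="State", columns="Year", values="Rate")

# Lookup / equi-join: incorporate data from an external table
regions = pd.DataFrame({"State": ["Alaska", "Arizona"],
                        "Region": ["West", "West"]})
joined = folded.merge(regions, on="State", how="left")

# Fill: populate empty cells from the nearest value above
filled = pd.Series(["Alaska", None, None, "Arizona"]).ffill()

# Lag: shift a column's values down by one row
lagged = folded["Rate"].shift(1)
```

Wrangler's own fold and join operators are richer than these stand-ins; approximate joins, for instance, go beyond the exact matching shown here.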
DATA SET USED
The data we are using is the crime data from the U.S. Bureau of Justice Statistics, available as sample data in the Wrangler program. The data were downloaded as a CSV (comma-separated values) file, but are not immediately usable by other tools: the file contains empty lines, U.S. states are organized in disjoint matrices, and the state names are embedded in other text. We will demonstrate how Data Wrangler can be used to interactively transform this difficult data into an easily usable format.
THE WRANGLER INTERFACE
BASIC INTERACTIONS
Wrangler supports six basic interactions within the data table. Users can:
- select rows
- select columns
- click bars in the data quality meter
- select text within a cell
- edit data values within the table
- assign column names, data types or semantic roles.
Wrangler suggests a set of transforms to analysts while they interact with the data. These suggestions are ranked by relevance according to the type, frequency and diversity of previous transforms. In many cases the suggestions can run into the hundreds, which makes browsing them cumbersome. Hence, instead of scanning all possible suggestions, users can prune and reorder the space in three ways:
Ø Users can provide more examples so as to remove ambiguity in the input to the inference engine.
Ø Users can select a particular operator from the transform menu and thereby filter the space of transforms.
Ø Users can edit a transform by altering its parameters to a desired state.
Wrangler gives concise natural language descriptions of the transform types and parameters. These are generated so that they are readable by non-experts as well. For example, regular expressions are translated into natural language via pattern substitution (e.g., (\d+) becomes ‘number’). This translation can make some descriptions less concise but increases readability. Translation is only performed on regular expressions generated by the Wrangler inference engine. This aids understanding and removes any confusion. The descriptions themselves are editable.
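The pattern-substitution idea can be sketched in a few lines of Python. The substitution table below is hypothetical, not Wrangler's actual mapping:

```python
# Map literal regex fragments to reader-friendly phrases.
# This table is a hypothetical sketch, not Wrangler's mapping.
SUBSTITUTIONS = [
    (r"(\d+)", "number"),
    (r"[a-z]+", "lowercase word"),
    (r"[A-Z]+", "uppercase word"),
    (r"\s+", "whitespace"),
]

def describe(regex: str) -> str:
    """Translate a regular expression into a rough natural
    language description via literal substring substitution."""
    for pattern, phrase in SUBSTITUTIONS:
        regex = regex.replace(pattern, phrase)
    return regex
```

For example, describe(r"in (\d+) rows") yields "in number rows", which is less precise than the raw pattern but far easier for a non-expert to read.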
VISUAL TRANSFORMATION PREVIEWS
Wrangler provides visual previews to enable analysts to quickly see and evaluate the effect of a transform. For almost all transforms, Wrangler shows these previews in place within the source data, rather than in separate side-by-side before and after views. Wrangler maps each transform to at least one of five preview classes:
Ø selection
Ø deletion
Ø update
Ø column
Ø table.
In defining these mappings, Wrangler conveys a transform’s effect with minimal displacement of the original data. This stability allows users to continue interacting with the original data, e.g., to provide new selection examples.
TRANSFORMATION HISTORIES AND EXPORT
Wrangler adds the descriptions of transforms, as they are applied, to an interactive transformation history viewer. This enables the user to view and edit individual transform descriptions, and to selectively enable or disable any prior transform. Whenever a change is made, the data table is updated by re-running the edited script.
Analysts can export both generated scripts and transformed data. They can later run saved or exported scripts on new data sources, modifying the script as needed. Script export is a useful option for handling large data sets: first transform a sample of your data in the Wrangler interface, and then run the resulting script on the full data set. Wrangler currently supports output scripts in two languages: Python and JavaScript.
THE WRANGLER INFERENCE ENGINE
The Wrangler inference engine generates a ranked list of suggested transforms. Its input consists of:
(a) user interactions
(b) current working transform
(c) data descriptions
(d) corpus of historical usage statistics
Transform suggestion works in three phases:
(a) inferring transform parameters from user interactions
(b) generating candidate transforms from inferred parameters
(c) ranking the results
USAGE CORPUS AND TRANSFORMATION EQUIVALENCE
Wrangler's inference engine uses a corpus of usage statistics to generate transforms and then to rank them. The initial corpus was built by wrangling a collection of data sets, and it is updated as more analysts use Wrangler. The corpus consists of frequency counts of transform descriptors and initiating interactions.
Exact matches in the corpus are unlikely. To overcome this problem and obtain useful transform frequencies, a relaxed matching routine is defined, in which two transforms in the corpus are considered equivalent if they have
(a) identical transform type
(b) equivalent parameters
The four basic types of parameters are rows, columns, text selections and enumerables. Two row selections are treated as equivalent if they both contain filtering conditions or both match all rows in a table. Column selections are treated as equivalent if they refer to columns with the same data type or semantic role. Text selections are treated as equivalent if both are index-based selections or both contain regular expressions. Enumerable parameters are treated as equivalent if and only if they match exactly.
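A minimal sketch of this relaxed matching, assuming a simplified record of a transform's type and parameter categories (the field names are ours, not Wrangler's internal schema):

```python
from dataclasses import dataclass

# Hypothetical record of a transform's type and parameter categories
@dataclass
class Transform:
    type: str        # e.g. "split", "delete", "fill"
    row_sel: str     # "predicate" or "all"
    col_type: str    # data type / semantic role of selected columns
    text_sel: str    # "index" or "regex"
    enumerable: str  # e.g. a fill direction such as "down"

def equivalent(a: Transform, b: Transform) -> bool:
    """Relaxed matching: compare categories, not exact values."""
    return (a.type == b.type                # identical transform type
        and a.row_sel == b.row_sel          # both predicates, or both all-rows
        and a.col_type == b.col_type        # same column type or role
        and a.text_sel == b.text_sel        # both index-based, or both regex
        and a.enumerable == b.enumerable)   # enumerables match exactly
```

Under this scheme, two split transforms with different regular expressions still count as equivalent for frequency purposes, because both fall in the "regex" text selection category.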
INFERRING PARAMETER SETS FROM USER INTERACTION
Wrangler attempts to infer three types of transform parameters in response to a user interaction: row, column or text selections. Possible parameter values are enumerated for each type, resulting in a collection of inferred parameter sets. Each parameter's values are inferred independently of the others.
Ø Case 1: Text selections: We infer regular expressions based solely on the selected text, a process independent of which rows or columns are selected.
Ø Case 2: Row selections: We infer expressions based on row indices and predicate matching.
Ø Case 3: Column selections: We simply return the columns that users have interacted with.
Emitted text selections are either simple index ranges or inferred regular expressions. Regular expressions are generated by tokenizing the text within a cell and extracting both the selected text and any surrounding text within a five-token window. The tokens are then annotated with one or more labels of the form number, word, uppercase word, lowercase word, or whitespace. Next, we enumerate label sequences that match the text before, within, and after the selection range; sequences can contain either an annotation label or the exact token text. Finally, we enumerate all combinations of before, within, and after sequences that match every current text selection, and translate the matching label sequences into regular expressions.
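The tokenize-and-annotate step can be sketched as follows; the token pattern is an assumption chosen to match the label set named above:

```python
import re

# Hypothetical tokenizer: numbers, capitalized/uppercase/lowercase
# runs, whitespace, then any single remaining character.
TOKEN = re.compile(r"\d+|[A-Z][a-z]+|[A-Z]+|[a-z]+|\s+|.")

def annotate(text: str):
    """Tokenize a cell value and attach a label to each token."""
    labels = []
    for tok in TOKEN.findall(text):
        if tok.isdigit():
            labels.append((tok, "number"))
        elif tok.isspace():
            labels.append((tok, "whitespace"))
        elif tok.isupper():
            labels.append((tok, "uppercase word"))
        elif tok.islower():
            labels.append((tok, "lowercase word"))
        else:
            labels.append((tok, "word"))
    return labels
```

From such annotations, the enumeration step can then emit either the exact token text or its label when building candidate before/within/after sequences.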
GENERATING SUGGESTED TRANSFORMS
Wrangler generates a list of transform suggestions after inferring parameter sets. For each parameter set, it loops over each transform type in the language, emitting the types that can accept all parameters in the set, and instantiates each emitted transform with parameters from the set. During this process, complex criteria such as row predicates or regular expressions are not inferred, but index-based row selections and column inputs are. The suggestion sets are then filtered to remove “degenerate” transforms that would have no effect on the data.
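A sketch of this generation loop, with a hypothetical table of which parameter kinds each transform type accepts (the type names and helpers are ours, for illustration):

```python
# Hypothetical map from transform type to the parameter kinds it accepts
TRANSFORM_TYPES = {
    "split":   {"column", "text"},
    "extract": {"column", "text"},
    "delete":  {"row"},
    "fill":    {"column"},
}

def candidates(param_set: dict):
    """Emit (type, instantiated-parameters) pairs for every
    transform type that accepts all parameters in the set."""
    out = []
    for name, accepted in TRANSFORM_TYPES.items():
        if set(param_set) <= accepted:        # type accepts all parameters
            out.append((name, dict(param_set)))
    return out

# A column plus text selection suggests split and extract, not delete
suggested = candidates({"column": "col0", "text": r"(\d+)"})
```

A real implementation would additionally drop degenerate instantiations, such as a split whose delimiter matches nothing in the column.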
RANKING SUGGESTED TRANSFORMS
Wrangler rank-orders transform suggestions according to five criteria. The first three criteria rank transforms by their type; the remaining two criteria rank transforms within types. Ensuring that transforms of the same type are adjacent helps users compare varying parameterizations more easily.
Ø Criterion 1: We consider explicit interactions: if a user chooses a transform from the menu or selects a current working transform, we assign a higher rank to transforms of that type.
Ø Criterion 2: We consider specification difficulty. We have observed that row and text selection predicates are harder to specify than other parameters. We thus label row and text selections as hard and all others as easy, and sort transform types by the count of hard parameters they can accept.
Ø Criterion 3: We rank transform types based on their corpus frequency, conditioned on their initiating user interaction (e.g., text or column selection). In the case of text selection, we also consider the length of the selected text. If a user selects three or fewer characters, split transforms are ranked above extract transforms; the opposite is true for longer selections.
Ø Criterion 4: We then sort transforms within each type: first by the frequency of equivalent transforms in the corpus, and second in ascending order by a simple measure of transform complexity. The goal is to preferentially rank simpler transforms because users can evaluate their descriptions more quickly. Transform complexity is defined as the sum of complexity scores for each parameter: the complexity of a row selection predicate is the number of clauses it contains, the complexity of a regular expression is the number of tokens (described previously) in its description, and all other parameters have a complexity score of zero.
Ø Criterion 5: Finally, we attempt to surface diverse transform types in the final suggestion list. We filter the transforms so that no type accounts for more than one-third of the suggestions, unless the transform type matches the working transform or the filter would result in fewer suggestions than can appear in the interface.
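The complexity measure described above (clause count for row predicates, token count for regular expressions, zero otherwise) can be sketched directly; the parameter representation, a dict of kinds to values, is hypothetical:

```python
# Hypothetical parameter representation: a dict mapping a parameter
# kind to its value (clause list, token list, or plain value).
def complexity(params: dict) -> int:
    """Sum parameter complexities: clause count for row selection
    predicates, token count for regular expression descriptions,
    zero for everything else."""
    score = 0
    for kind, value in params.items():
        if kind == "row_predicate":
            score += len(value)   # number of clauses
        elif kind == "regex_tokens":
            score += len(value)   # number of tokens in the description
    return score

# Within a type, lower scores rank first
a = complexity({"row_predicate": ["is empty"], "column": "Year"})
b = complexity({"regex_tokens": ["word", "whitespace", "number"]})
```

Sorting candidate transforms by this score puts a one-clause row deletion ahead of a three-token extraction pattern, matching the intuition that shorter descriptions are quicker to evaluate.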
GETTING STARTED
Data Wrangler is a browser-based tool. Open the following link and click ‘Try it now’: http://vis.stanford.edu/wrangler/
DATA ENTRY
This is the input screen of Data Wrangler. The analyst pastes his/her data into the space provided and then clicks the Wrangle button. The data can be supplied as CSV, TSV or manual entry. Here we have used the example data provided with the Wrangler tool, which shows crime reports across the U.S.

| Transform | Description |
| --- | --- |
| Cut | Remove selected text from cells in specified columns. |
| Delete | Remove rows that match given indices or predicates. |
| Drop | Remove specified columns from the table. |
| Edit | Edit the text in each cell of the specified columns. |
| Extract | Copy text from cells in a column into a new column. |
| Fill | Fill empty cells using values from adjacent cells. |
| Fold | Reshape a table into columns of key-value sets; selected rows map to keys, selected columns to values. |
| Merge | Concatenate multiple columns into a single column. |
| Promote | Promote row values to be the column names. |
| Split | Split a column into multiple columns by delimiters. |
| Translate | Shift the position of cell values by a given offset. |
| Transpose | Transpose the rows and columns of the table. |
| Unfold | Reshape a table by mapping key-value sets to a collection of new columns, one per unique key. |

DELETE TRANSFORM: Next we select the empty rows in the table. In response, Wrangler offers suggested transforms with natural language descriptions.

When we mouse over a description, Wrangler previews the transform’s effect on the table. The highlight indicates which rows will be deleted.

When we execute the transform, Wrangler adds it to the transform history.

EXTRACT TRANSFORM: We use Wrangler’s text selection to extract state names from the first column, which represents the year. We select the text ‘Alaska’ in row 6. Wrangler guesses we are selecting text between positions 18 and 24. It highlights the matching text in each row and previews the derived column.

We can provide more examples to help Wrangler generalize our selection, updating the selection by selecting another state. We execute the highlighted suggestion and Wrangler adds it to the history.

This is the final derived result after executing the transform.

RENAMING THE COLUMN: We then rename the derived column ‘states’. This column is sparsely populated; the missing values are indicated by the grey bar at the top of the column in the data quality meter.

FILL TRANSFORM: When we click the grey bar, Wrangler suggests transforms for the missing values. We choose the desired transform and execute it.

The resultant table is:

DELETE TRANSFORM: We now remove the rows containing the word ‘Reported’. When we select the word, Wrangler suggests transforms like extract, cut and split, but no delete.

We can reorder the transform ranks by clicking the Delete option in the header menu, and then execute the command. After this transform, the data is in a relational format.

UNFOLD TRANSFORM: We now create a cross tabulation of crime rates, using the unfold operation to restructure the data; it is akin to a pivot table. We select the year and the property crime rate columns, preview the result of the suggested transform, and execute it.

The resultant table looks like this. This data can be exported directly to Excel for further analysis.

- DATA
The output of Data Wrangler is a script which can be exported and run on the data set. The output table itself can be exported as-is in the form of tab-separated values for an Excel table.

- SCRIPT
The transformation script can also be exported as either Python or JavaScript code.
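As a rough sketch of what such a script's logic amounts to, the walkthrough above corresponds approximately to the following pandas steps. This is a stand-in, not the code Wrangler actually generates, and the toy rows below merely mimic the shape of the sample data:

```python
import pandas as pd

# Toy rows mimicking the sample data's layout: a state header line,
# empty lines, and year,rate rows (values are made up)
raw = pd.DataFrame({"col": [
    "Reported crime in Alaska", "", "2004,4.2", "2005,4.0",
    "", "Reported crime in Arizona", "", "2004,5.1", "2005,5.3",
]})

df = raw[raw["col"] != ""].copy()                                # delete empty rows
df["State"] = df["col"].str.extract(r"in (\w+)$", expand=False)  # extract state names
df["State"] = df["State"].ffill()                                # fill missing values down
df = df[~df["col"].str.contains("Reported")].copy()              # delete 'Reported' rows
df[["Year", "Rate"]] = df["col"].str.split(",", expand=True)     # split on the comma
table = df.pivot(index="Year", columns="State", values="Rate")   # unfold to a cross tab
```

The value of Wrangler is that this sequence is built interactively from selections and previews, then exported, rather than hand-written.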