What is Data Wrangling?
Data wrangling is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. While in many cases data transformation involves both code and human intervention, ideally it is fully automated using a repeatable script.
A data wrangler is a person who performs these transformation operations. The process includes gathering the data (for example, downloading a file from the web, scraping a web page, querying an API or a database), assessing its quality, cleaning it (fixing or removing problematic records), and finally storing it in a way that makes subsequent analysis easy.
Data Wrangling Methods
Data wrangling can be broadly categorized into two main types:
- Structural changes — changing the format of your data to make it easier to work with (for example, converting an Excel file into a CSV file or JSON file) or changing how the data is stored (for example, storing tabular data in a relational database rather than an Excel file).
- Semantic changes — changing the meaning of your data (for example, categorizing strings representing colors into a set of color ranges).
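The two kinds of change can be sketched in a few lines of Python. This is an illustrative example, not a prescribed workflow: the sample data and the `color_ranges` mapping are made up, and the structural change here is CSV-to-JSON rather than the Excel conversion mentioned above.

```python
import csv
import io
import json

# Structural change: convert tabular CSV data into JSON records.
# (The sample data is invented for illustration.)
raw_csv = "name,color\nshirt,light blue\nmug,dark red\nhat,green\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
as_json = json.dumps(rows)  # same data, different format

# Semantic change: collapse specific color strings into broader
# color ranges, changing what each value means.
color_ranges = {
    "light blue": "blue",
    "dark blue": "blue",
    "light red": "red",
    "dark red": "red",
    "green": "green",
}
for row in rows:
    row["color_range"] = color_ranges.get(row["color"], "other")

print([r["color_range"] for r in rows])  # ['blue', 'red', 'green']
```

Note that the structural change leaves every value untouched, while the semantic change replaces values with coarser categories, which is a lossy but often useful step.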
Data wrangling can also include a number of other steps such as identifying issues in your dataset and fixing them, either manually or automatically. This often includes filling in missing values or removing duplicates. Often these issues are identified by looking at basic summaries and plots of your data.
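A minimal sketch of that issue-spotting step, using pandas on a made-up table (the column names and the median-imputation choice are assumptions for illustration):

```python
import pandas as pd

# A small, invented table containing two common issues:
# one missing value and one fully duplicated row.
df = pd.DataFrame({
    "name": ["alice", "bob", "bob", "carol"],
    "age": [34, 29, 29, None],
})

# Basic summaries usually surface the issues first.
missing_per_column = df.isna().sum()     # age has 1 missing value
duplicate_count = df.duplicated().sum()  # 1 duplicated row

# Fix: drop exact duplicates, then fill the missing age with the median.
cleaned = df.drop_duplicates()
cleaned = cleaned.fillna({"age": cleaned["age"].median()})
```

Whether to fill a missing value or drop the whole row depends on the analysis; imputing with the median is just one common default.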
The Data Wrangling Process
Data wrangling is not a single task, but a series of tasks that typically require human judgment. The process of data wrangling consists of:
- Gathering data. This step involves identifying what data is needed and obtaining that data. Data can be gathered from many sources, including APIs, HTML web pages, and flat files such as .csv or .tsv files.
- Assessing data. Once the data has been gathered, assess whether the gathered data is sufficient for the purposes at hand. For example, if you want to analyze tweets about a topic, but there are only 10 tweets with that topic in your collection of 200 million tweets, the analysis may not be sufficiently robust or accurate.
- Cleaning data. Cleaning refers to detecting and removing errors or inconsistencies from the raw gathered data so that it can be analyzed without bias or interference from those errors. For example, when collecting information from several sources, some entries often have incorrect values or are missing attributes (such as age).
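The three steps above can be sketched as a tiny pipeline. In a real project the gathering step would be something like `pd.read_csv(url)` or an API call; here an in-memory string stands in for the downloaded file, and the topic threshold and column names are invented for illustration:

```python
import io

import pandas as pd

# Gather: an in-memory CSV stands in for a file fetched from the web.
raw = io.StringIO("user,topic,age\nu1,python,34\nu2,python,\nu3,rust,29\n")
df = pd.read_csv(raw)

# Assess: is there enough data on the topic we care about?
python_rows = df[df["topic"] == "python"]
enough_data = len(python_rows) >= 2

# Clean: one row is missing its age; here we simply drop
# incomplete rows rather than try to repair them.
cleaned = df.dropna(subset=["age"])
```

The assessment check mirrors the tweets example above: before investing in analysis, confirm the relevant subset of the data is large enough to support it.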
Some examples of activities that data wranglers would engage in include:
- Finding gaps, empty cells, or values that don't fit the expected format of a column or row
- Removing irrelevant or duplicated data
- Combining several data sources into one set for analysis
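The last activity, combining sources, is often a join. A hedged sketch with two invented tables (the `user` key, the column names, and the choice of a left join are assumptions for illustration):

```python
import pandas as pd

# Two made-up sources describing the same users.
profiles = pd.DataFrame({"user": ["u1", "u2", "u3"], "age": [34, 29, 41]})
orders = pd.DataFrame({"user": ["u1", "u1", "u3"], "amount": [10.0, 5.0, 7.5]})

# Combine into one table for analysis; a left join keeps every
# profile even when a user has no matching orders.
combined = profiles.merge(orders, on="user", how="left")
```

With a left join, users without orders appear with a missing `amount`, which feeds straight back into the gap-finding activity above.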