23 Apr
Data analysis – 5 steps to prepare data for visualization
Where does a data analyst spend most of their time? Anyone who is not an analyst will surely answer that the job is to analyze data and present the results. In an ideal world this would certainly be the reality, and reporting would be an easy and pleasant activity. What, then, is the most challenging part of every data analyst's work? In a word: data preparation.
Any data reporter would like to receive clean data sets with standardized column names, free of unusual values, outliers, and unfilled (Null) values scattered everywhere. Once those problems are dealt with, all that remains is to generate visualizations, interpret the results, and surprise superiors with deeply hidden insights, unprecedented discoveries, and valuable recommendations. Then the company grows far faster than the market, in time becomes a world leader, and the analyst a living legend. A beautiful vision, but the world looks different. Why is it so hard to find good data that can be trusted? The problem very often arises at the preliminary stage of data analysis.
The Challenges of a Data Analyst
The biggest challenge for any specialist whose main activity is data analysis is handling, transforming, and preparing the data set for analysis. Unfortunately, data describing events in a company will always require proper management: control of correct completion, handling of unusual events, and implementation of new information structures. The scale of this task often justifies a separate position responsible for data engineering – preparing the data, planning the reporting structure, and designing the technological architecture behind the whole process.
However, regardless of the scale of the enterprise, the structure of the analytical department, and the quality of the data, every data analysis should start with preparing the set for analysis and visualization. This is where Exploratory Data Analysis (EDA) comes in. It is a process carried out at the beginning of an analyst's work that lets them get familiar with the existing structure and handle potential problems. At this stage, the analyst learns about the dataset itself, begins to understand it, and notices unusual values and errors that require additional action. What does the process of Exploratory Data Analysis consist of?
Explore 5 simple steps on how to properly prepare your data for effective analysis and visualization of your findings.
Data analysis and its business purpose – understand your dataset
The key element for data analysis is to check and understand what we are dealing with. At this stage we should answer the questions:
- Where is the data coming from: who is filling it in, when, how?
- What exactly does each column/variable/table mean (e.g., is the “date an employee was added” the date they were hired, the date they were added to the system, the date they started working)?
- What values can be found in each slice of the set (e.g., does the column with the employee’s hire date consist of only valid dates or maybe numbers and sentences as well)?
- What do I ultimately want to achieve (is this set appropriate for the data analysis to be successful)?
It may seem that the answers to these questions are obvious and this step is just a formality. Unfortunately, it is often very labor-intensive. Values that we do not understand may appear in various places in the dataset, so it is necessary to explain why they occur and what they mean. At this stage we may learn, for example, that the variable corresponding to the transaction value contains not only numbers but also expressions such as "canceled", "unknown", and "installments". It may also turn out that a team member delegated to the United States filled in contract dates in MM/DD/YYYY format, while the rest of the team used DD/MM/YYYY.
Consequently, this stage involves both a preliminary analysis of the dataset in an analytical tool (Business Intelligence tools like Tableau, integrated development environments like PyCharm, etc.) and a thorough review of the dataset's metadata, followed by a comparison of our findings with the team to clarify all doubts.
Carried out correctly, this stage tells us which transformations the data requires; performing them is necessary for further analysis.
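A first pass over the data can be sketched in pandas. The table below is hypothetical, built only to mirror the problems described above (a "value" column mixing numbers with words such as "canceled"); the column names are assumptions, not part of any real dataset.

```python
import pandas as pd

# Hypothetical transactions table illustrating the problems described above:
# the "value" column mixes numbers with expressions such as "canceled".
df = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "value": ["100.50", "canceled", "250.00", "unknown"],
    "contract_date": ["03/04/2021", "04/03/2021", "12/05/2021", "05/12/2021"],
})

# Inspect the structure and declared types first.
print(df.dtypes)

# List the values that cannot be parsed as numbers -- these are the
# expressions that need clarifying with the team.
numeric = pd.to_numeric(df["value"], errors="coerce")
suspicious = df.loc[numeric.isna(), "value"].unique()
print(suspicious)
```

The same `errors="coerce"` trick works for dates via `pd.to_datetime`, which makes the MM/DD vs DD/MM ambiguity visible early.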
Optimize your dataset
In the second step of the process, we should focus on the technical optimization of the dataset. This means answering two questions for each column/variable/table under study:
- Is the object (e.g., table, column, variable) useful for this analysis (if not, should it be removed)?
- Does the object have the appropriate data type (e.g., are the numbers definitely integers/floats and not strings)?
This step, while simple to apply, is extremely important. For medium and large data sets, it can speed up work on them several times over. When the analysis is performed on a flat table containing several thousand records and several dozen columns, this step can reduce its size significantly, which translates directly into analysis time: generating a simple chart takes one second instead of a dozen or so, and executing a full analytical script takes a few minutes instead of a dozen or so. These are not large numbers, but remember that building and testing a script involves running it many times.
In this step, we perform a transformation of the available data. This is a simple task that can be done in most analytical tools. Keep in mind the data types supported by our analytical environment and always aim to minimize the memory required.
Avoid the most “memory-intensive” types, such as string (words, strings of characters). Instead, aim for the most memory-efficient ones, such as boolean (true/false).
Manage data gaps appropriately
The next step in the data preparation process is the proper management of missing data. In every production database, we will encounter a situation where a value for a given object is empty. A classic mistake in such a case is assigning zero to this value. However, the lack of data in selected cells does not indicate a zero value there, and there is no basis for such action. Moreover, what if the column should contain a date or a product description rather than a number? In that case, we certainly cannot replace the missing value with zero.
First, we need to classify the gaps by their origin. In this way we distinguish:
- human errors (a given value should have been filled in, but the person responsible for it did not do it);
- real gaps (due to the specifics of the record, indeed the value should be empty, e.g. the value of the variable “last purchased product” should be empty for a customer who has not yet made a purchase on our platform).
For real gaps, the handling depends on our analysis. Most often, however, records with such gaps are left unchanged and treated as a separate category when presenting results. This is correct, although some analyses are not possible on such a dataset.
Gaps caused by human error require an agreed handling policy within the team or management. The options include:
- defining a value that will replace the gaps (e.g., imputing the date 01/01/2000 for each gap in the "date of employment" field, or the average order value for a gap in the "order value" field; alternatively, gaps can be imputed according to the distributions of the variables);
- filling in the gaps with appropriate values by the team members responsible (which may not be possible if the scale of the gaps is significant or they relate to events that are far removed in time);
- leaving gaps (if their occurrence does not significantly affect the analysis);
- removing defective records from the analysis (which entails the loss of some information, but sometimes may be necessary if the deficiencies may significantly distort reality).
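The options above map onto a few one-liners in pandas. The employee table here is hypothetical, built only to show one gap of each kind discussed above.

```python
import pandas as pd

# Hypothetical employee table with gaps of both kinds discussed above:
# a human-error gap in hire_date and real gaps in last_purchase.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cid"],
    "hire_date": [pd.Timestamp("2019-05-01"), pd.NaT,
                  pd.Timestamp("2021-02-15")],
    "last_purchase": [None, "Laptop", None],
})

# Option 1: impute an agreed sentinel value for a human-error gap.
df["hire_date"] = df["hire_date"].fillna(pd.Timestamp("2000-01-01"))

# Option 2: treat real gaps as their own category when presenting results.
df["last_purchase"] = df["last_purchase"].fillna("no purchase yet")

# Option 3 (alternative): drop the defective records entirely.
# df = df.dropna(subset=["hire_date"])
```

Whichever option is chosen, the key point from the text stands: the choice is a team decision, not a default of the tool.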
Determine the distributions of the variables and identify outliers
Once the data has been properly cleaned according to the above principles, it is time to better understand its structure. We now know, for example, that the values corresponding to invoice payment delays are integers corresponding to the number of days late. But how do these values break down in terms of frequency/density of occurrence? Are customers most often late by a maximum of one week? In our case, are these values evenly distributed throughout the year? Are there any outliers, i.e. customers are generally one week late at most, but one contractor is 600 days late?
A glance at the data
As part of this step, a density plot or histogram should be constructed for each object under study (below is an example of an analysis of discounts offered at a particular store). Such visualizations will help us understand how our values are distributed and where we can expect outliers to occur.
Using the above graph as an example, we can see that the scale of price reductions is not symmetrically distributed. Small discounts (up to 20%) dominate. At the same time, significant discounts, in the range of 70–80%, are already third in terms of frequency. This may reflect the company's discount strategy: a discount should be either small or significant, because mid-range discounts do not attract attention while still costing money.
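The binning behind such a histogram can be reproduced numerically. The store's real figures are not available, so the discount values below are simulated to resemble the shape described above (small discounts dominating, a second cluster around 70–80%).

```python
import numpy as np

# Simulated discount percentages (hypothetical -- the real store data
# is not available), shaped like the distribution described in the text.
rng = np.random.default_rng(0)
discounts = np.concatenate([
    rng.uniform(0, 20, 700),    # small discounts dominate
    rng.uniform(70, 80, 200),   # a second cluster of deep discounts
    rng.uniform(20, 70, 100),   # mid-range discounts are rare
])

# Bin the values in 10-point steps; a bar chart of `counts` is the histogram.
counts, edges = np.histogram(discounts, bins=range(0, 101, 10))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{int(lo):3d}-{int(hi):3d}%: {c}")
```

Passing the same array to a plotting library (e.g. `matplotlib.pyplot.hist`) yields the visual version of the same counts.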
After such analysis, further statistics should be checked for each object: the mean, median, standard deviation, and the first and third quartiles. To interpret them easily, you can use a box plot (chart below), which shows some of these statistics.
From the above graph we can read:
- median (the border of the two gray rectangles in the middle),
- the first and third quartiles (the lower and upper outer limits of the two gray rectangles),
- the interquartile range (the length of the two gray rectangles),
- the first and third quartiles, shifted down and up respectively by 1.5 times the interquartile range (the outermost horizontal lines, the so-called "whiskers"; in this graph the lower limit stops at zero because the sales value cannot be negative),
- the results of individual salespeople (dots).
From the sample chart, we can see that in terms of sales value, two salespeople stand out from the rest. They are Richard Martinez and Robert Wilson. At the same time, their results are not improbable, so there is no reason to suspect an error in the data.
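The box-plot statistics listed above can be computed directly. The sales figures below are hypothetical stand-ins for the salespeople's results, with two deliberately high values playing the role of the stand-out dots.

```python
import numpy as np

# Hypothetical sales results per salesperson (in thousands);
# the last two values play the role of the two stand-out performers.
sales = np.array([12, 15, 14, 18, 16, 13, 17, 15, 40, 45])

q1, median, q3 = np.percentile(sales, [25, 50, 75])
iqr = q3 - q1

# Whisker limits as drawn on the box plot; sales cannot be negative,
# so the lower whisker is clipped at zero.
lower = max(q1 - 1.5 * iqr, 0)
upper = q3 + 1.5 * iqr

# Points beyond the whiskers are the individual dots on the chart.
outliers = sales[(sales < lower) | (sales > upper)]
print(median, iqr, outliers)
```

As in the chart, the high results are flagged for attention but not automatically treated as errors.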
Conclusions from a cursory analysis
Although typical analytical work already appears at this stage – graphs and statistics are generated – this element should not be considered a full data analysis. This stage only introduces the researcher to the characteristics of the data set and allows them to make any necessary transformations before the actual analysis.
At the same time, at this stage, we should decide on the approach to outliers, for example, for the mentioned case of invoice payment delay equal to 600 days. In the literature, one can find an approach that suggests the application of the “three-sigma” rule. This term means rejecting all observations that are more than three standard deviations away from the mean. Note, however, that this approach is justified only in the case of a normal probability distribution, where indeed 99.7% of the observations are within the range: mean +/- three standard deviations.
In reality, we can rarely confirm this distribution in our data. A classic error when applying the rule is to exclude, say, a customer whose purchase value falls outside the indicated range. Such a high value distorts the analysis, but it may also indicate that the buyer is a business customer rather than a private one. Consequently, it should not be excluded but analyzed separately, in a subgroup of records corresponding to businesses.
Accordingly, the best approach to managing outliers is to approach each variable individually and decide according to its distribution, statistics, and informational value. As with missing data, we can change that value to another, remove it from the set, or leave it unchanged.
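For completeness, the mechanics of the three-sigma rule look like this. The delay values are hypothetical (mostly short delays plus one 600-day case, echoing the example in the text), and, as noted above, the cut is only statistically justified for roughly normal data; here it simply flags candidates for a closer look.

```python
import numpy as np

# Hypothetical invoice delays in days: mostly under a week, plus the
# 600-day case from the text. The rule flags candidates for separate
# analysis -- not for blind removal.
rng = np.random.default_rng(1)
delays = np.concatenate([rng.integers(0, 8, size=99), [600]])

mean, std = delays.mean(), delays.std()
mask = np.abs(delays - mean) <= 3 * std
flagged = delays[~mask]
print(flagged)
```

Note that on very small samples a single outlier can never exceed three standard deviations (it inflates the standard deviation itself), which is one more reason to inspect each variable individually rather than apply the cut mechanically.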
Check relationships and correlations between variables
The final step of a proper Exploratory Data Analysis is to check the relationships and correlations that exist between the variables under study. This step involves checking the co-occurrence of given categories and values of individual variables in our data set.
The primary activity of this step is determining correlation measures. In the most classical case, this will be Pearson's linear correlation coefficient, which indicates whether, as the value of one variable increases, the values of the other variable increase or decrease. Alternatively, e.g. for a variable with outliers, the Spearman rank correlation coefficient can be used, as it better reflects the relationships in such a case.
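The difference between the two coefficients is easy to demonstrate on toy data: the series below are invented, perfectly monotone except for one outlier pair, so the rank-based Spearman coefficient stays at 1 while Pearson's drops.

```python
import pandas as pd

# Hypothetical paired measurements: a linear relationship (y = 2x)
# plus one outlier pair (100, 25) that breaks the linearity but not
# the monotone ordering.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
y = pd.Series([2, 4, 6, 8, 10, 12, 14, 16, 18, 25])

pearson = x.corr(y, method="pearson")
spearman = x.corr(y, method="spearman")

# The outlier drags Pearson's coefficient down, while Spearman,
# computed on ranks, still reports a perfect monotone relationship.
print(round(pearson, 3), round(spearman, 3))
```

This is exactly the situation described above: for variables with outliers, the rank-based measure is the more faithful summary.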
Correlation and co-occurrence analysis lets us check whether our data is valid and consistent with intuition. If the analysis shows that higher order values occur for individual customers rather than for companies, this may indicate errors in the set that were not detected at any of the previous levels of analysis. At the same time, this is the stage at which we can already identify the first dependencies, and so we can smoothly proceed to the proper analysis of an already clean data set.
Proper execution of the above steps significantly shortens the analytical work on the set.
Thanks to the successive stages, the researcher can thoroughly understand the data – the material they work with. This prevents many mistakes and misunderstandings in subsequent stages, and the data analysis can be carried out effectively.
We also encourage you to follow us on LinkedIn.
Article by Karol Michalak.