Project - Data Visualisation using D3.JS

05/2024

Table of Contents

The Project

This project was relatively short in scope having me create 3 visualisations that answer some kind of question. Me and my team mate chose to answer whether or not life expectency was affected by alcohol consumption.

Obviosuly, no, that wasn't the case. There are many more factors that contribute to a countries medical success. This of course meant that we found no correlation to a country's life expectency and alcohol consumption.

The deployed website can be found here .

This page is my team's visual design process book (submission for the final assignment) in HTML format.

Data Processing

Due to the scope of the project, it was decided that creating a SQL database to process the data would be beneficial, and help us transform the data into any shope or form much quicker. Not only that, but while data was being inserted into the data into the database, the data could be cleaned and validated before it ever touched the database.

Software

After researching which DBMS (database management system) should be used for processing the data, it was decided that MariaDB should be used as it is free and open source. MySQL had some issues with licensing in the past leading to some confusion about whether or not MySQL was free as in freedom (libre). Hence, the new MariaDB team forked the MySQL source code and aims to be a drop-in replacement for it. MariaDB is actually included in the base Debian 12 bullseye repository, unlike MySQL which has to be installed through a third-party repository because of said licensing issues.

Python was used to due to its simplicity. However, upon completing the project, another language should have been considered due to Python's poor imlementation of loops, lists and other core features required to make a simple data processing program. Python is to "simple" for its own good.

Data Collection

Firstly, the team downloaded all the datasets needed to come to a decision. Bellow are the sources for each of the .csv files used:

Sadly, the requirements of the task demanded that only the data from the OECD are to be used. This means that there is a limited amount of data to work with. However, they do have huge amounts of aggregate data from all countries included .

Analysing the data

Once the data has been gathered, it was thoroughly analysed the data and stripped any un-needed columns away. This was done quickly using Excel. As a result, parsing and processing the data in Python would be made easier and potentially faster.

Some of the things noted were that some countries did not record data for alcohol life expectency, or even both for a certain time period. The reason for this could be financial or politically motivated, though it entirely depends on the country. The decision was made to simply implement some kind of data extrapolation as it could be done in code, and a warning could be included to allow the reader to to come to a decision about what to make of the data.

Here are the things that had to be done in Excel:

  1. Use Excel to determine which columns should be included in the final SQL database
  2. Strip metadata from each record ( STRUCTURE , STRUCTURE_ID , etc.) as they aren't needed in the database
Columns dropped in alcohol_consumption.csv
  1. FREQ - Redundant, always measured anually
  2. MEASURE , UNIT_MEASURE , Unit of measurement - Redundant, always "alcohol consumption litres per person"
  3. AGE , Age - Always 15 years or over
  4. METHODOLOGY , Measurement method - Always not applicable
  5. Time period - Redundant, TIME_PERIOD was used instead
  6. UNIT_MULT , Unit multiplier - Always set to 0 units
Columns dropped in life_expectency.csv
  1. FREQ - Always measured annually
  2. Time period - Redundant, TIME_PERIOD was used instead
  3. UNIT_MULT , Unit multiplier Always set to 0 units
  4. METHODOLOGY , Measurement methodolyg - Always not applicable
  5. SOCIO_ECON_STATUS , Socio-economic status , DEATH_CAUSE ,
  6. Cause of death , CALC_METHODOLOGY , Calculation methodology , GESTATION_THRESHOLD , Gestation period threshold , HEALTH_STATUS , Health status , DISEASE , Disease , CANCER_SITE , Cancer site - Always not applicable / empty

Now that all unneccessary columns have been purged, we had to properly understand what the data we were collecting means, and how we can use it to extract meaning out of it. Thankfully, both datasets contain the column Unit of measurement which helped out a lot. We found that we simply cannot combine each variable into the same chart and axis as they are different measurements (alcohol consumption in litres per person, and life expectancy). Though, they did have one thing in common: year. This meant we could at least compare both value's trends overtime.

Processing the data in Python

The Python script does a few things:

  1. Establishes a connection to the MariaDB database on the Debian virtual machine, using the python-mariadb-connector Python package

    Python - Connect to MariaDB database

    try:
         connection = mariadb.connect(
             user="foo",
             host="bar",
             port=3306,
         )
         Except mariadb.Error as e:
             Print("Mariadb error: " + e)
             Sys.exit()
    
         cur = connection.cursor()
  2. Create all tables in the database using the mariadb python package

    The entity relationship diagram of the database generated by JetBrains DataGrip The entity relationship diagram of the database generated by JetBrains DataGrip
  3. Open the alcohol_consumption.csv dataset, loop through all lines, and repeat for other datasets

    Python - Connect to MariaDB database

    ref_area_codes = {}
    obs_status_codes = {}
    
    def time_preiod_mariadb_format(time) -> datetime.datetime:
        split = time.split('-')
        return datetime.datetime(int(split[0]), int(split[1]), 1)
    
    def insert_ref_area_if_not_exists(db_cursor: mariadb.Cursor, ref_area_code, ref_area_name):
        if ref_area_code not in ref_area_codes:
            db_cursor.execute("insert into reference_area(code, name) values(?, ?)", (ref_area_code, ref_area_name)) 
            ref_area_codes[ref_area_code] = ref_area_name
    
    def insert_obs_status_code(db_cursor: mariadb.Cursor, obs_status_id, obs_status):
        if obs_status_id not in obs_status_codes:
            db_cursor.execute("insert into obs_status(id, name) values(?, ?)", (obs_status_id, obs_status))
            obs_status_codes[obs_status_id] = obs_status

    with open(dataset_alcohol_consumption, "r") as f:
        lines = f.readlines()
        line_c = len(lines)
        index = 0
    
        print(f"\nParsing {str(line_c)} records in ALCOHOL CONSUMPTION dataset")
        for row in lines:
            if index == 0:
                index += 1
                continue
    
            if row.strip() == '':
                continue
    
            records = row.split(',')
            ref_area_code = records[0]
            ref_area_name = records[1]
            time_period = records[2]
            obs_value = records[3]
            obs_status_id = records[4]
            obs_status_name = records[5]
    
            insert_ref_area_if_not_exists(cur, ref_area_code, ref_area_name)
            insert_obs_status_code(cur, obs_status_id, obs_status_name)
            cur.execute("insert into alcohol_consumption(ref_area_code,
            obs_status_id, time_period, obs_value) values(?, ?, ?, ?)", (ref_area_code,
            obs_status_id, time_period, str(obs_value)))
            index+=1

Querying the Database

Now that everything has been inserted into the database, we can execute simple SQL queries to get any kind of data we want. We will discuss the visualisations in the next section, but table 5 shows an example query that was used to create the stacked area chart:

SQL - Example Queries

DROP VIEW IF EXISTS times;
CREATE VIEW times AS
SELECT alcohol_consumption.time_period FROM alcohol_consumption
UNION SELECT life_expectency.time_period FROM life_expectency

SELECT reference_area.name, times.time_period, life_expectency.obs_value as life_expectency
FROM reference_area
CROSS JOIN times
LEFT JOIN alcohol_consumption ON reference_area.code = alcohol_consumption.ref_area_code AND times.time_period = alcohol_consumption.time_period
LEFT JOIN life_expectency ON reference_area.code = life_expectency.ref_area_code AND times.time_period = life_expectency.time_period
ORDER BY reference_area.code, times.time_period;

With this one simple trick, the data can be manipulated to get any kind of output very easily, and be generally fast (Power-BI users will hate this)! You can see how in this query, the times view was CROSS JOIN ed into the alcohol_consumption table. The times view is combining all years that are present in each dataset into a temporary table that could then be used. Then, the developer can CROSS JOIN it in the reference_area table and return a dataset containing each country, alcohol consumption and life expectancy for each year.

Because the query performs a left join the life_expecancy and alcohol_consumption table, we will get some null data. Again, as mentioned before, The D3 website will extrapolate the data in order to fill in the gaps.

Once the SQL query was executed, it could then be exported into a .csv file to then be turned into a visualisation in D3.

name code time_period alcohol_consuption life_expectency
Argentina ARG 2010 8.2
Argentina ARG 2011 7.7
Argentina ARG 2012 8
Argentina ARG 2013 8.3
Argentina ARG 2014 7.9
Argentina ARG 2015 8.5 76.8
Argentina ARG 2016 8.4 76.8
Argentina ARG 2017 8.3 76.3
Argentina ARG 2018 8 76.8
Argentina ARG 2019 8 77
Argentina ARG 2020 75.3
Argentina ARG 2021 75.9
Argentina ARG 2022 75.4

Extrapolation

The data had to be extrapolated because it would make the chart borderline unreadable. Also, upon analysing the data, with our initial line chart. Life expectancy and alcohol consumption in most countries show a mostly linear, downward trend. This means that by using the extrapolation function as shown in Table 7, we won’t be too far off. Although it is worth noting that other political events during those times may affect the validity of the data, a tooltip should be added to notify the user that this data point is estimated. The formula provided by the stack exchange post by TecHunter *(which was sourced from a post from classroom.synonym.com ) to find the slope of the dataset. Below is the formula used to calculate the slope ($$$y$$$ increases as $$$x$$$ increases by one year), which we use to fill in the gaps:

$$$\text{slope} = \frac{n \sum (xy) - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$$

le_ac_chart.js - source

/** 
 * Fills null data by calculating trendline of existing data, and using extrapolated data to fill it in
 * @param {Array} data
 * @param {string} The index of the y axis to calculate the extrapolated values for
 * @param {string} An optional string that when set to something, will acces data[i]["isEstimated"] and flip to true if it was
 *  modified during the function
 */
function fillMissingData(data, yAxisParameter, isEstimatedParameter) { // Formula from: https://math.stackexchange.com/a/204021

    // KEY TERIMNOLOGY:
    // [null, null, *6*, 8, 3, **9**, null, null ]
    //    - LEFT MOST:  *6*
    //    - RIGHT MOST: **9**

    // 1. We want to find the left most value, and loop for all other null values until reach index 0
    // 2. Find right most value, loop until reach index (size of array)
    // 3. Calculate `y = mx` (no `+c`, we only want gradient)
    //    1. Calculate slope of equation `m` 
    //    2. Calculate offset `c`

    let leftMostIndex = -1;
    let rightMostIndex = -1;
    for (let i = 0; i < data.length; i++) {
        if (leftMostIndex == -1) {
            // We have not found the left most index, meaning we should loop until we find it
            if (data[i][yAxisParameter] != null) 
                leftMostIndex = i
        } else {
            // We have found it, so we'll need to find the right most index
            // Loop until the next number is either a stack overflow, or null
            if (i == data.length - 1)
                rightMostIndex = data.length
            else if (data[i + 1][yAxisParameter] == null)
                rightMostIndex = i

            if (rightMostIndex != -1) break // We found one. Nicer than having 'break' in each 'if elseif' statement imo
        }
    }

    // THIS COULD BE BAD
    // rightMostIndex + 1 seems like something that should not be done 
    // (Could cause a stack overflow exception)
    const dataSubset = data.slice(leftMostIndex, rightMostIndex + 1)
    const dataRise = calculateSlopeOfDataset(dataSubset, (d) => d[yAxisParameter])

    // This loop starts the left most value, and loops unitl the index = 0. Since we want to calculate the extrapolated value using
    // the linear equation we created, we can keep a store of the current X and Y, then for each index, subtract rise and run and set
    // that data point's life expectency to it

    if (leftMostIndex != 0) {
        let d = data[leftMostIndex]
        let currentY = d[yAxisParameter]
        for (let i = leftMostIndex - 1; i <= 0; i--) {
            // We are going backward, so subtract
            currentY -= dataRise
            data[i][isEstimatedParameter] = true
            data[i][yAxisParameter] = currentY
        }
    }


    if (rightMostIndex < data.length) {
        d = data[rightMostIndex]
        currentY = d[yAxisParameter]
        for (let i = rightMostIndex + 1; i < data.length; i++) {
            currentY += dataRise
            data[i][isEstimatedParameter] = true
            data[i][yAxisParameter] = currentY
        }
    }
}

/**
 * Calculates the slope of Y in a dataset. Formula from - https://math.stackexchange.com/a/204021
 * @param {Array} data The sliced dataset, containing NO null/NaN values
 * @param {Function} A function that takes each data point as an argument, to return which value to use. Null to just use that value
 */
function calculateSlopeOfDataset(data, func) {
    let sumOfX = 0
    let sumOfXSquared = 0
    let sumOfY = 0
    let sumOfXMultY = 0 // summtion (x * y)
    let n = data.length
    for (let i = 0; i < n; i++) {
        const d = data[i]
        const x = d.timePeriod.getFullYear()
        const y = func(d) ?? d

        sumOfX += x
        sumOfXSquared += x * x
        sumOfY += y
        sumOfXMultY += x * y
    }

    return (n * sumOfXMultY - sumOfX * sumOfY) / (n * sumOfXSquared - (sumOfX * sumOfX))
}

Visualisation Design

For the effective communication of the message that was to be conveyed, appropriate visualisations were needed. So, visualisations were proposed, and 3 were selected that best encoded the data that that was collected. After there had a few datasets,

Edward Tufte's Design Princaples

Tufte’s design princaples are a list of design decisions that the designer must make when creating their visualisation. In this case, his decisions were put into consideration. Bellow are some of the princaples that were applied. The reader is advised to familiarise themselves with them before reading on.

Scrapped Ideas

Radar Chart

The radar chart mock-up The radar chart mock-up

This visualisation intends to make displaying multiple data points at once.

While sketching up different visualisations, one of the first charts we thought of was a radar chart. This chart makes it easy to compare each variable against each other country, and by design, also makes it easy to find causality between variables. This chart does not show any other documentation that would be included such as the country name, drop-down for each country, etc... but if it were to be implemented, we would include them. We use hue to distinguish each country from each other on the radio chart, allowing users to select multiple countries and clearly see which one is which. By design, this chart integrates multiple variables (life expectancy, cancer, crime, unemployment, and alcohol consumption) together for the audience to form an opinion.

It was decided that we shouldn’t use it due to it’s complexity to program. It could be done, but we felt as if it would take too much time to do under the strict deadline.

Line Chart

The scrapped line chart mock-up The scrapped line chart mock-up

The interactive line chart above illustrates the trends in alcohol consumption over the various years. At the top of the visualisation, the user can decide the start and the end date that the visualisation covers. Along the x-axis, years are plotted, and the y- axis has the amount of alcohol consumed during those years. When hovered over, the exact amount of alcohol consumed and the corresponding year. This would show the trend in countries and the ability to compare different countries. When clicked, it would show a pie chart that compares the ratio of male drinkers to female drinkers in the country in that given year.

To elaborate, this allows the user to find any causations between each country. It is multivariant as the user can select which countries they want to select and has adequate context for the data points to be compared against. This will be one of the charts we consider for our final website.

Selected Visualisations

First Visualisation - Choropleth map of average alcohol consumption

The goal of this visualisation is to give the audience an overall view of alcohol consumption on a map.

Initial sketch of bar chart [Version 1 and 2]

Version 1 (sketch) of alcohol consumption visualisation Version 1 (sketch) of alcohol consumption visualisation

Our vision was to start simple and build upon the ideas we start with. The sketch shows the two visualisations created. This section will talk about version 1, on the left page. As you can see, we started off with a simple bar chart, known for its good data/ink ratio, and can easily be modified to include axis lines, tooltips, etc thanks to D3. However, no matter what we try, there are simply too many countries to to include. Even if we were to add checkboxes as shown in Version 2, it would be a nightmare to scroll through if all you need is alcohol consumption. So, we thought of some other ideas for representing this data. Some of the ideas here would eventually be incorporated in other charts.

Pros Cons
  • When few countries are selected, it is easy to compare alcohol consumption between countries
  • Easy to program in D3, meaning less development time
  • More countries being compared, the harder it becomes to read
  • Countries checkbox may be hard to implement

In conclusion, while this chart is very simple, it does not accomplish the goal it sets out to accomplish of allowing the audience to see every country’s alcohol consumption.

Choropleth [Version 3] Sketch

Version 4 (sketch) of alcohol consumption on a map Version 4 (sketch) of alcohol consumption on a map

Our second idea was to create a choropleth map. Its simple design makes interacting with the map and finding the data points you need easy, intuitive, and engaging. Upon that, it also allows users to compare against different countries at once. However, to show multiple variables, it must be split in 2, but that won’t be necessary.

Due to its simplicity, it easily accomplishes the goal as discussed above, so we decided to use this and make some modifications during the development phase, discussed in the next section.

Final product of Choropleth

We landed on the visualisation containing a choropleth of global alcohol consumption. After implementing it, we have made some changes to how we designed the chart, so we will talk more about it here.

This final visualisation is a choropleth map that depicts the average alcohol consumption per person across the world. The map uses a color gradient/hue shift from light beige to dark red and represents the level of alcohol consumption ranging from the least(1 liter per person) to the most(14 liters per person). Countries with no available data are marked as light gray (HTML white smoke), as per one of Tufte’s design princaples.

The map allows users to hover over different countries in the map to show the exact value of alcohol consumption per person enhancing user interaction and providing more accurate data on demand to avoid any significant lie factors.

Going back to the color gradient, it is a good way of representing quantitative data at a glance and comparison is easy and clear. A choropleth map is perfect for this type of data because it provides a clear geographical distribution of alcohol consumption rates across different countries. This also allows the reader to see patterns within an area. For example, we can say that alcohol consumption in Russia is higher than that of China and the alcohol consumption of different European countries is close to one another.

These are factors that Tufte recomends visualisations incorperate. A visualisation should allow the reader to sift through the data and form opinions based on what is shown.

Second Visualisation - Bar chart for top 10 alcohol consumers

We were trying to analyze the data to find any correlation between data points. We did not find much just by pointing and clicking at different countries. So, we decided to make a quick visualisation in Datagrip to show us the max, min, and median alcohol consumers, to which we found Russia as the maximum alcohol consumer, Turkey to be the least, and New Zealand to be the median. We used those countries to see if there was even a little bit of correlation. Below is the table generated from the SQL query showing the countries with the least alcohol consumption:

SQL - Finding top 10 alcohol consumers

try:
SELECT
    reference_area.name AS 'ref_area_name',
    reference_area.code AS 'ref_area_code',
    alcohol_consumption.obs_value AS 'average_alcohol_consumption',
    life_expectency.obs_value AS 'average_life_expectancy'
FROM
    reference_area
CROSS JOIN
    times
LEFT JOIN
    alcohol_consumption ON reference_area.code = alcohol_consumption.ref_area_code AND
    times.time_period = alcohol_consumption.time_period
LEFT JOIN
    life_expectency ON reference_area.code = life_expectency.ref_area_code AND
    times.time_period = life_expectency.time_period
WHERE
    times.time_period = 2019
ORDER BY
    alcohol_consumption.obs_value DESC
LIMIT
    10



ID Ref Area Code Time Period Alcohol Consumption (liters per person) Life Expectancy (years)
1 CZE 2019 11.9 79.3
2 AUT 2019 11.6 82
3 LVA 2019 11.6 75.7
4 FRA 2019 11.4 83
5 LTU 2019 11.1 76.5
6 LUX 2019 11 82.7
7 RUS 2019 10.8 73.2
8 HUN 2019 10.7 76.5
9 DEU 2019 10.6 81.3
10 EST 2019 10.4 79

The visualisation focuses on the top 10 countries in terms of alcohol consumption in 2019 and compares their alcohol consumption rates and the difference in life expectancies across the countries. This visualisation was chosen to represent two different variables across the ten countries highlighting how the amount of alcohol consumed influences the life expectancy of the population. Each bar represents a country, coded by its country abbreviation, and displays the average liters of alcohol consumed per person. Additionally, the life expectancy for each country is overlaid on the y-axis using a gradient to represent the life expectancy of the different countries ranging from 72 with the lowest saturation of blue to 85 with the highest saturation of blue. Like in the first visualization, data on demand is provided when the different bars are hovered over. The full name of the country, exact alcohol consumption value, and exact life expectancy values are displayed to similarly avoid any significant lie factors.

A bar chart was chosen as it is the most effective for comparing quantitative values. In this case, it facilitates easy comparison of alcohol consumption rates. Also, the bars are colored in a spectrum that represents the life expectancy of the top ten countries in terms of alcohol consumption. The use of two different variables also facilitates easy comparison to see a correlation between the two. Additionally, ordering of the bars according to alcohol consumption or life expectancy is a feature of the visualisation to help see any patterns in the data to allow the user to interact with the chart more thoroughly to make conclusions more rapidly, without having more ink than data, as this is something that Tufte is concern about with visualisations.

Third Visualisation - Stacked Area Chart

Iteration Designs

1 st Iteration - Dual Column Chart

This design takes two variables, alcohol consumption per person, and life expectancy and combines them side by side in a dual-column chart. This means the reader can search through the list of countries to find what they are looking for, and they can see both. The design has more way too much data (in terms of data/ink ratio), which can become overwhelming. This was also discussed in 1st visualisation’s first iteration. The solution for that would be to incorporate a scroll bar with a bunch of countries which would work much better here, but there is something that won’t work about this. There is simply no room to compare them over time. So we had to scrap this one.

Sadly, it has more way too much data (in terms of data/ink ratio), which can become overwhelming. This was also discussed in 1st visualisation’s first iteration. The solution for that would be to incorporate a scroll bar with a bunch of countries which would work much better here, but there is something that won’t work about this.

There is simply no room to compare them over time. So we had to scrap this one.

2 nd Iteration - Scatter Plot

The scatter plot has a very similar problem as the dual column chart, it does introduce a much simpler UI instead of the convoluted dual bar chart solution. This one instead has a dropdown where you can select througheach year, with the $$$x$$$ and $$$y$$$ axis being life expectancy and alcohol consumption respectively.

This chart is simply bad though, each variable has different measurements, and would make no sense to be put on a different axis. The fact is, is that life expectancy and lcohol consumption are dependent on the year, meaning it cannot be put on a scatterplot by definition.

Depending on how many countries there are, this chart could have a bad data/ink ratio and makes it hard to incorporate any kind of color, thus making it hard to abide by Tufte’s design princaples. This does depend on how the circles/data points are designed, but no matter what happens, this simply won’t be good enough. Tufte is not impressed.

Iteration - Stacked Area Chart

Thes stacked area chart is the design we stuck with. It combines both life expectancy and alcohol consumption together, along a common $$$x$$$ axis of time. It allows users to select from a range of countries to filter which ones they’d like to look at. Upon that, it allows the user to view the trend of a country’s life expectancy and alcohol consumption to see if there is any trend.

Final Design

The interactive time series chart here tracks alcohol consumption (liters per person) and life expectancy over a period of time (2010 – 2021) in the selected country. This graph combines life expectancy and alcohol consumption together as a stacked area chart. Changes in life expectancy and alcohol consumption can be compared with ease using the third visualisation helping to find a correlation between alcohol consumption and life expectancy, thus allowing the reader to find any causations between the data points (if any).

Like the other visualisations, this graph also has a hover feature that shows the exact amount of alcohol consumed and the life expectancy each year of the selected country. As discussed in the data processing section, it was found that some countries don’t have recorded data in some years, so any missing values were extrapolated to make the chart look better and provide estimates for empty values that could otherwise be problematic for the overall design. To ensure this chart is up to Tufte’s standards the lie factor is minimised, a legend was incorporated to show which data points are extrapolated.

Upon that, it allows the user to choose which data to show (life expectancy and alcohol consumption on sidebar).

Conclusion

Throughout the course of the project, our team embarked on an extensive iterative design process to determine the most effective methods of visual representation for the data that was curated on alcohol consumption and life expectancy. The journey through various visualisation designs included exploring an array of different visualisation types, each contributing to our understanding of the best visualisation practices in data visualisation.

After careful and multiple iterations, we finalized a set of visualisations that clearly articulated the relationship between the data. Each selected visualisation was tailoredto highlight specific aspects of data, ensuring clarity and ease of understanding for the audience. The tools not only facilitated understanding of the trends and patterns but also enhanced the interactive experience, allowing users to engage deeply.

Our analysis focused on comparing life expectancy with alcohol consumption across various regions and over time. Contrary to expectations, our observations revealed that there is no significant correlation between alcohol consumption and life expectancy. This conclusion underscores the complexity of public health data and suggests that multiple factors influence life expectancy, with alcohol consumption not playing a pivotal role as hypothesized. The insights are valuable for researchers, policymakers, and public health professionals, providing a nuanced understanding of the factors that impact life expectancy.