Masterset statistical library (mirror) https://git.rol.so/colttaine/masterset
colttaine eab499c97b Initial commit 2023-02-28 13:38:59 +11:00

Masterset

Masterset is a library of statistics collected from a variety of online sources.

All of the data has been standardised to a common format with a common key-name schema, which makes comparisons between data sets easy.

Masterset includes R scripts to easily load the data sets and merge them into a common dataframe (called masterset) as well as a metadata dataframe (called masterset_meta).

Gross Domestic Product (GDP) versus National Average Intelligence Quotient (IQ)

Why JSON?

JSON is perhaps not the obvious choice for statistical data; however, it solves a number of problems that exist with other similar data formats.

The first and most obvious reason to choose JSON is that it is a human-readable plain-text format. This (in my opinion) is always preferable to proprietary formats like Microsoft Excel or binary database formats like SQLite, which unnecessarily require additional software to view and edit.

The second and more notable problem is the storage of "metadata". This perhaps isn't an issue for data from your own self-contained lab study; however, since the data in Masterset is aggregated from a variety of different online sources, it is necessary to store metadata about where the statistics came from, who compiled them, and when. Storing metadata like this is not possible (or at least not easy) with other plain-text data formats like CSV. With JSON, the metadata can easily be stored in the very same file as the data itself.

The Data

The data is organised into several different master "scopes", and then categorised by directory within those scopes. The current scopes are:

global - Country-level statistics (for comparing countries with one another as of the current year).

historical - Country-level statistics which have changed over time (for analysing how a country has changed over time, or potentially comparing one country's change to another's).

regional/united-states - United States county-level statistics (for comparing US counties with one another).

regional/united-kingdom - United Kingdom constituency-level statistics (for comparing UK constituencies with one another).

lifespan - Demographic information which changes over an individual person's lifetime (for comparing different life outcomes by race, gender, profession, etc.).

Be mindful that the data in each of these "scopes" is loaded and merged into the R masterset dataframe in a slightly different way. Whilst the Masterset R script is generally "smart enough" to detect the scope of the data it is importing, you may need to slightly adjust your own custom R scripts depending on the data set. Also be mindful that you cannot really mix and match data from different scopes. E.g., it doesn't make sense to try to merge lifespan data about cancer rates over an individual's life with global GDP data across different countries.
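To illustrate why merging only makes sense within a single scope, here is a minimal sketch using base R's merge(). It is not the Masterset script itself: the key name country, the column names, and all values are hypothetical placeholders. Two data sets from the same scope share a key and can be joined on it, whereas data sets from different scopes have no such shared key.

```r
# Hypothetical stand-ins for two data sets from the same scope (e.g. global).
# The key name "country" and every value here are placeholders, not real data.
gdp <- data.frame(
  country = c("AAA", "BBB", "CCC"),
  gdp = c(1.0, 2.0, 3.0),
  stringsAsFactors = FALSE
)
life <- data.frame(
  country = c("BBB", "CCC", "DDD"),
  life_expectancy = c(70, 75, 80),
  stringsAsFactors = FALSE
)

# Inner join: keeps only the keys present in both data sets.
merged <- merge(gdp, life, by = "country")

# Full outer join: keeps every key and fills the gaps with NA.
merged_all <- merge(gdp, life, by = "country", all = TRUE)

print(merged)
print(merged_all)
```

Whether you want the inner or the outer join depends on whether rows with missing values are useful for your analysis.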

Each JSON data file has a metadata header with information about what the statistical data is, where it was sourced from, who compiled the data, and when. After the metadata, the actual data is stored in a two-dimensional list (i.e., basically just CSV data wrapped in JSON array tags).
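As a rough sketch of what reading such a file could look like, the example below parses an inline JSON string with the jsonlite package and turns the two-dimensional list into a dataframe. The field names (meta, data, title, source, compiler, date), the values, and the use of jsonlite are all assumptions made for illustration; check the real files and the included R scripts for the actual key names and loading code.

```r
library(jsonlite)

# Hypothetical file contents: field names and values are placeholders only.
json_text <- '{
  "meta": {
    "title": "Example statistic",
    "source": "https://example.org/data",
    "compiler": "Example Author",
    "date": "2023-01-01"
  },
  "data": [
    ["country", "value"],
    ["AAA", 1.5],
    ["BBB", 2.5]
  ]
}'

# Keep the nested list structure rather than letting jsonlite simplify it.
parsed <- fromJSON(json_text, simplifyVector = FALSE)

# Assume the first row holds the column names and the rest holds the records.
header <- unlist(parsed$data[[1]])
rows <- parsed$data[-1]

# Each row becomes a character vector; stack the rows into a dataframe.
df <- as.data.frame(do.call(rbind, lapply(rows, unlist)),
                    stringsAsFactors = FALSE)
names(df) <- header

# Everything is character after stacking, so convert numeric columns back.
df$value <- as.numeric(df$value)

print(parsed$meta$source)
print(df)
```

Because the metadata sits next to the data, the same parsed object gives you both the source information and the values.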

Using Masterset

Clone the Git repo into your project directory:

git clone https://git.rol.so/colttaine/masterset.git

There are several example scripts included for you to reference. The main thing is to make sure that, in your own scripts, you change the setwd() call at the beginning of the script to point to the masterset directory.
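A minimal sketch of that first step might look like the following. The path is a hypothetical placeholder (it assumes the repo was cloned into the current project directory), and the .json file extension is an assumption; adjust both to match your setup.

```r
# Hypothetical location of the cloned repo; change this to your own path.
masterset_dir <- file.path(getwd(), "masterset")

if (dir.exists(masterset_dir)) {
  # Point the working directory at the masterset directory.
  setwd(masterset_dir)

  # List the data files under data/ (assumes they use a .json extension).
  json_files <- list.files("data", pattern = "\\.json$", recursive = TRUE)
  print(head(json_files))
} else {
  message("Masterset directory not found at: ", masterset_dir)
}
```

Checking that the directory exists before calling setwd() avoids a hard error when the script is run from a different location.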

License

This project is licensed under a modified version of the GNU GPLv3. For the exact terms, refer to the included LICENSE file.