Tools

Data Languages

Python

The R Project for Statistical Computing or RStudio

Anaconda (mainly when using Python, to install most libraries)

Data Cleansing and Extraction

Unstructured data

Other

OpenRefine

import.io

Tabula (extract tables from PDF files)

RegExr

Data Visualisation

DIVE

D3.js and D3plus

Processing

CoffeeScript

AngularJS

pyjs

Brython

Tableau

RAWgraphs

Datawrapper

Libraries:

(Integration with Python)

Matplotlib

Plotly

Geoplotlib

Seaborn

ggplot

Bokeh

pygal

(Integration with R)

ggplot2

Plotly

Some Methods (with Example Libraries) for Data Imputation

List of R Packages

MICE (Multivariate Imputation by Chained Equations; also supports random forest, CART, etc.)

Methods:

  • PMM (Predictive Mean Matching) – for numeric variables

  • logreg (Logistic Regression) – for binary variables (2 levels)

  • polyreg (Bayesian polytomous regression) – for unordered factor variables (>= 2 levels)

  • polr (Proportional odds model) – for ordered factor variables (>= 2 levels)
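Predictive mean matching, listed above, can be illustrated with a small standard-library Python sketch. It is only a simplified, single-predictor version and not the MICE implementation: it fits one least-squares line and does not draw the regression parameters from a posterior. The names `fit_line`, `pmm_impute`, `k`, and `seed` are hypothetical, chosen for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=3, seed=0):
    """Fill None entries of ys by (simplified) predictive mean matching on xs."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    # Each donor is (predicted value, actually observed value).
    donors = [(intercept + slope * x, y) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        target = intercept + slope * x
        # Keep the k donors whose predictions are closest to the target,
        # then borrow the observed value of one of them at random.
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])
    return completed


xs = [1.0, 2.0, 3.2, 4.0, 5.1, 6.0]
ys = [2.1, 3.9, None, 8.2, None, 12.1]
completed = pmm_impute(xs, ys)
print(completed)
```

Because every imputed value is copied from an observed case, the filled-in values always stay within the range of real data, which is the main reason the method suits numeric variables.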

Amelia and Amelia II: multiple imputation (generating several imputed data sets) to deal with missing values. It uses the bootstrap-based EMB (expectation-maximization with bootstrapping) algorithm, which makes it fast and robust when imputing many variables, including cross-sectional and time-series data.

It makes the following assumptions:

  • All variables in a data set jointly follow a multivariate normal (MVN) distribution.
  • It uses means and covariances to summarize data.
  • Missing data are missing at random (MAR).
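To make the role of means and covariances concrete, here is a minimal Python sketch of conditional-mean imputation under a bivariate normal model. It is not Amelia's EMB algorithm: there is no bootstrapping, no EM iteration, and only one completed data set is produced. The function name `conditional_mean_impute` and the toy data are assumptions made for illustration.

```python
def conditional_mean_impute(rows):
    """Fill missing y in (x, y) rows with E[y | x] under a bivariate normal model.

    The model is summarized only by means, a variance, and a covariance,
    all estimated from the complete rows.
    """
    complete = [(x, y) for x, y in rows if y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in complete) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in complete) / (n - 1)
    filled = []
    for x, y in rows:
        if y is None:
            # E[y | x] = mean_y + cov_xy / var_x * (x - mean_x)
            y = mean_y + (cov_xy / var_x) * (x - mean_x)
        filled.append((x, y))
    return filled


rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0), (5.0, None)]
completed = conditional_mean_impute(rows)
print(completed)
```

A multiple-imputation tool would repeat a step like this on several resampled versions of the data and add random noise, so that the spread across completed data sets reflects the uncertainty about the missing values.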

missForest (random forest, non-parametric imputation)

Hmisc (linear regression, logistic regression and Cox regression)

mi (multiple imputation with diagnostics)

Missing values (Rubin, Donald B. “Inference and Missing Data.” Biometrika, vol. 63, no. 3, 1976, pp. 581–592. JSTOR, www.jstor.org/stable/2335739)

Reproducible Research Tools

Notebooks and repositories

R:

R Markdown

Python (and R):

Jupyter

Repos

DKAN Open Data Platform

GitHub