Data Science & Statistical Analysis Methodology

I tend to be an extremely methodical student. I take notes, make outlines, and study by teaching myself the topics I’m learning. I also like to create step-by-step methodologies for solving problems. For example, in calculus classes I would write step-by-step outlines for computing derivatives, solving differential equations, and so on. In the same spirit, here is the data and statistical analysis workflow I follow each time I start a new project. Of course, the exact steps depend on the data at hand, but the workflow helps organize the work into data munging, visualization, analysis, and even theory.

##Data Science Methodology & Workflow

###Preparation

* Formulate a problem and think of solutions and applications
* Create a source control repository to save all work

###Gathering Data

* Any data that we can possibly acquire and that could be relevant to the problem
* Web scraping
* APIs
* Import files
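For the file-import route, here is a minimal sketch in Python using only the standard library. Python is my choice for illustration, not something the workflow prescribes. The column names and values are made up, and an in-memory string stands in for a real file on disk.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from open("some_file.csv").
RAW_CSV = """city,population,median_rent
Springfield,167000,950
Shelbyville,,1100
Ogdenville,42000,NA
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row, keeping values as strings."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(RAW_CSV)
print(len(rows), "rows with columns", list(rows[0].keys()))
```

Keeping every value as a string at this stage is deliberate: blanks and markers such as `NA` stay visible for the cleaning step instead of being silently converted.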

###Cleaning data

* Check for NAs or NaNs
* Check for missing data
* Consider irrelevant indicators
* Understand values for quantitative data and categories for qualitative data
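The missing-value checks can be sketched as a per-column count. This is a rough illustration in plain Python, assuming the data is a list of dict rows. The set of strings treated as "missing" and the sample records are assumptions made up for the example.

```python
import math

# Markers treated as "missing" here are an assumption; adjust them to the data set.
MISSING_MARKERS = {"", "NA", "N/A", "NaN", "nan", "null"}

def is_missing(value):
    """Return True for None, float NaN, or a string that matches a missing marker."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in MISSING_MARKERS

def missing_counts(rows):
    """Count missing values per column across a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts

# Hypothetical records, made up for illustration.
records = [
    {"city": "Springfield", "population": 167000.0, "median_rent": "950"},
    {"city": "Shelbyville", "population": float("nan"), "median_rent": "1100"},
    {"city": "Ogdenville", "population": 42000.0, "median_rent": "NA"},
    {"city": None, "population": 30000.0, "median_rent": ""},
]

print(missing_counts(records))
# {'city': 1, 'population': 1, 'median_rent': 2}
```

A column with many missing values is a candidate for dropping or imputing; which one is appropriate depends on the problem.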

###Exploratory analysis

* Plot the data with graphs and charts; consider the distribution (it will affect the model you pick)
* Compute basic statistical correlations between variables, if applicable
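As one example of a basic correlation, here is a small sketch of the Pearson correlation coefficient written out by hand in Python. The function name `pearson_r` and the numbers are hypothetical; a statistics library would normally do this for you.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers for illustration only.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]

print(round(pearson_r(hours_studied, exam_score), 3))
```

Pearson's coefficient only measures linear association, so it is worth reading alongside the plots rather than instead of them.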

###Visualization

* Create visualizations to show, explain, and more deeply understand your data; use tools in R, Tableau, or d3
* Use visualizations to determine which predictors are relevant and consider using these as features for your model

###Statistical – Predictive modeling

* Begin formulating and implementing a statistical model for the data. Do I want to predict? Do I want to find out if there is a relationship between variables? Do I want to find clusters in the data?
* Look at a scatterplot matrix of your data
* Choose features and create new features from the variables you have: log or quadratic versions, new categorical variables from quantitative ones
* Tip: Use the Box-Cox transformation to bring data closer to normality (only for some data; it requires strictly positive values); recall that transforming data alters the interpretation of coefficients
* Choose a model, but understand the underlying model assumptions, the importance of features, overfitting, and other shortcomings of the model
* Recall that correlation does not imply causation and that all models are wrong, but some are useful
* Evaluate your model. Look at R^2, adjusted R^2, plots of residuals versus fitted values, and other evaluation metrics
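To make the transformation and evaluation steps concrete, here is a minimal sketch in plain Python. It assumes a single predictor fitted by ordinary least squares and a Box-Cox lambda picked by hand (a real analysis would usually estimate lambda from the data). The function names and the numbers are hypothetical.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform with a hand-picked lambda; values must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox needs strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def evaluate(xs, ys, intercept, slope, n_predictors=1):
    """Return residuals, R^2, and adjusted R^2 for a fitted line."""
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    n = len(ys)
    # Adjusted R^2 penalizes extra predictors relative to the sample size.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return residuals, r2, adj_r2

# Made-up numbers for illustration only.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]

intercept, slope = fit_line(hours_studied, exam_score)
residuals, r2, adj_r2 = evaluate(hours_studied, exam_score, intercept, slope)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.3f} adj R^2={adj_r2:.3f}")

# A log feature is the lambda = 0 case of Box-Cox.
log_hours = box_cox(hours_studied, 0)
```

The `residuals` list is what you would plot against the fitted values: a visible pattern there suggests the model is missing structure, even when R^2 looks high.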

##I would love to hear about your data science workflow!

March 03, 2015
