Data Science

technical

The practice of extracting knowledge and insight from structured and unstructured data through statistics, machine learning, and analytical reasoning.

Max Level

250

Attribute Contributions

Intelligence 50%, Wisdom 25%, Creativity 15%, Dexterity 10%

Prerequisites

Programming Lv 15

Overview

Data science is the interdisciplinary practice of extracting knowledge and actionable insights from data through a combination of statistical analysis, machine learning, programming, and domain understanding. A data scientist typically works through a complete analytical cycle: framing a business question as an analytical problem, acquiring and cleaning data, exploring and visualizing it to understand structure and relationships, building and validating statistical or machine learning models, communicating findings to stakeholders, and deploying models into production systems where they inform decisions at scale.

The field sits at the intersection of statistics, computer science, and domain expertise, a framing popularized by Drew Conway's data science Venn diagram. Statistical knowledge provides the mathematical tools for inference and modeling; programming skill enables working with large datasets and implementing algorithms; domain expertise determines which questions are worth asking and which findings are meaningful in context. Strength in one or two of these areas without the third tends to produce analyses that are technically sophisticated but practically useless, or practically relevant but statistically invalid.

Getting Started

Python is the dominant language for data science, with a mature ecosystem of libraries for every stage of the analytical workflow. NumPy and pandas for data manipulation, Matplotlib and Seaborn for visualization, and scikit-learn for machine learning form the foundational toolkit. Jupyter notebooks are a widely used interactive environment for exploration and analysis.

Exploratory data analysis (EDA) is the initial examination of a dataset to understand its structure, identify patterns, detect anomalies, and check assumptions. It is the first substantive step in any data project and often reveals the questions most worth answering before any modeling begins. Systematically exploring a new dataset before building models (summarizing distributions, checking for nulls and outliers, and examining relationships between variables through visualization and correlation) avoids wasting sophisticated modeling on poorly understood data.
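
A first EDA pass can be sketched in a few lines of pandas. The dataset below is a made-up toy example (the column names and values are assumptions for illustration), and the 1.5 × IQR outlier rule is one common rule of thumb rather than the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [23, 35, 41, np.nan, 29, 52, 38, 95],
    "income": [32_000, 54_000, 61_000, 48_000, np.nan, 88_000, 59_000, 61_500],
    "segment": ["a", "b", "b", "a", "a", "c", "b", "c"],
})

def eda_summary(frame: pd.DataFrame) -> dict:
    """Collect a few first-pass facts about a DataFrame."""
    numeric = frame.select_dtypes(include="number")
    # Flag values more than 1.5 * IQR beyond the quartiles.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "shape": frame.shape,
        "null_counts": frame.isna().sum().to_dict(),
        "describe": numeric.describe(),
        "outlier_counts": outliers.to_dict(),
        "correlations": numeric.corr(),
    }

summary = eda_summary(df)
print(summary["null_counts"])
print(summary["outlier_counts"])
```

On real data, the same summary would normally be paired with histograms and scatter plots, since a flagged value may be a data-entry error or a genuine extreme case.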

Statistical foundations are the mathematical backbone of data science. Probability distributions, hypothesis testing, confidence intervals, and linear regression provide the conceptual framework for understanding more complex methods. Machine learning methods are applied most reliably by practitioners who understand those foundations: why regularization reduces overfitting, what the bias-variance tradeoff means in practice, and why cross-validation gives more reliable performance estimates than a single train-test split.
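
As one illustration of the regularization point, the sketch below fits ridge regression in closed form with NumPy and shows the coefficients shrinking as the penalty grows. The simulated data, the `ridge_fit` helper, and the choice of penalty values are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))          # few rows, many features: easy to overfit
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]               # only two features actually matter
y = X @ true_w + rng.normal(scale=1.0, size=30)

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, a))) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:>6}: coefficient norm = {n:.3f}")
```

A larger penalty pulls the coefficients toward zero. That adds bias but lowers variance, which is the tradeoff a practitioner tunes, typically by cross-validation.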

Common Pitfalls

Data leakage, the accidental inclusion of information in training data that would not be available at prediction time, produces models that perform excellently in evaluation but fail in production. The error is particularly insidious because the inflated evaluation scores appear to validate the model. Rigorous train-test splitting, including time-ordered splits for time-series data and group-based splits for grouped records, together with fitting any preprocessing on the training portion only, guards against leakage.
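
The two splitting strategies mentioned above can be checked directly with scikit-learn's splitters. This is a minimal sketch on a made-up index array; the `groups` array stands in for something like a customer ID and is purely hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)       # 12 observations in time order
groups = np.repeat([0, 1, 2, 3], 3)    # hypothetical: 4 customers, 3 rows each

# Time-ordered splits: every training index precedes every test index.
time_ok = all(
    train.max() < test.min()
    for train, test in TimeSeriesSplit(n_splits=3).split(X)
)

# Group splits: no customer appears on both sides of a split.
group_ok = all(
    set(groups[train]).isdisjoint(groups[test])
    for train, test in GroupKFold(n_splits=4).split(X, groups=groups)
)

print(time_ok, group_ok)  # True True
```

A plain shuffled split would violate both properties on this kind of data, which is how future or same-entity information ends up in the training set.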

Overfitting — building models that fit training data well but generalize poorly to new data — is the most fundamental failure mode in machine learning. The discipline of holding out test data and using cross-validation to estimate generalization performance, rather than evaluating on the same data used to train, prevents overfit models from appearing better than they are.
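
A small simulation makes the gap between training and held-out performance concrete. The sketch below uses a hand-written 1-nearest-neighbor classifier on simulated data with 25% label noise; the data generator and helper names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Two features; the label depends on the first, with 25% of labels flipped."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] > 0).astype(int)
    flip = rng.random(n) < 0.25
    return X, np.where(flip, 1 - y, y)

def predict_1nn(X_train, y_train, X_query):
    """Predict each query point's label from its single nearest training point."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[dists.argmin(axis=1)]

X_train, y_train = make_data(300)
X_test, y_test = make_data(300)

# Evaluating on the training data: each point is its own nearest neighbor.
train_acc = float((predict_1nn(X_train, y_train, X_train) == y_train).mean())
# Evaluating on held-out data gives the honest estimate.
test_acc = float((predict_1nn(X_train, y_train, X_test) == y_test).mean())
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The model memorizes the noisy training labels, so its training accuracy is perfect while its held-out accuracy is far lower. Only the held-out number says anything about generalization.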

Confusing correlation with causation in observational data analysis produces incorrect conclusions that lead to bad decisions. Most data science uses observational data, where correlations may reflect confounding rather than causal relationships. Understanding when causal inference is warranted — and when only predictive, correlational claims can be made — is the statistical literacy that distinguishes rigorous from misleading analysis.
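
Confounding is easy to reproduce in simulation. In the sketch below, a hypothetical confounder `z` drives both `x` and `y`, while `x` has no effect on `y` at all; the variable names and the `residualize` helper are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
z = rng.normal(size=n)            # confounder
x = z + rng.normal(size=n)        # driven by z
y = z + rng.normal(size=n)        # driven by z only; x has no effect on y

def residualize(v, z):
    """Remove the part of v that a straight-line fit on z explains."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

raw_corr = float(np.corrcoef(x, y)[0, 1])
partial_corr = float(np.corrcoef(residualize(x, z), residualize(y, z))[0, 1])
print(f"raw correlation: {raw_corr:.2f}, after adjusting for z: {partial_corr:.2f}")
```

The raw correlation is substantial and vanishes once `z` is accounted for. The adjustment works here only because the confounder is measured; in real observational data it often is not.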

Milestones

Completing a full data science project on a real dataset, covering problem framing, EDA, feature engineering, model training and evaluation, and communication of results, marks foundational competency. Deploying a trained model as a production API that serves predictions to an application marks the production engineering milestone. Winning or placing in a Kaggle competition that requires original feature engineering and model selection marks competitive modeling skill.

Advanced data scientists contribute to the methodology of their domain, develop novel analytical approaches to business problems, and lead multi-person analytical projects.

Where to Specialize

Machine learning engineering focuses on production model deployment, serving infrastructure, and model monitoring. Natural language processing (NLP) applies data science to text data, including classification, generation, and information extraction. Time series analysis focuses on modeling temporal data for forecasting and anomaly detection. Causal inference applies quasi-experimental methods to support causal claims from observational data. Experimentation, including A/B testing, covers designing and analyzing randomized experiments to measure the causal impact of product changes.
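
For the experimentation specialty, a basic A/B test analysis can be sketched with the standard library alone. The conversion counts below are invented, and the pooled two-proportion z-test shown is one common choice among several valid analyses.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: control converts 200 of 4000, treatment 260 of 4000.
z, p_value = two_proportion_z_test(200, 4000, 260, 4000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Randomized assignment is what allows the difference to be read causally. The same arithmetic applied to observational groups would only describe a correlation.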

Tips for Success

  • Do thorough EDA before any modeling — understanding your data reveals the real questions and prevents building models on misunderstood or dirty data.
  • Learn the statistics behind models, not just the APIs — understanding regularization, bias-variance tradeoff, and cross-validation makes model selection defensible.
  • Guard against data leakage obsessively: accidentally including future information in training data is a common source of unreliable evaluation metrics.
  • Use cross-validation rather than a single train-test split to estimate generalization performance — a single split can mislead through chance.
  • Communicate uncertainty honestly — point estimates without confidence intervals or error bars overstate the precision of analytical findings.
  • Frame problems carefully before building — the most technically sophisticated model cannot answer a question that was not clearly defined before the analysis began.
  • Learn the domain you are working in — data science without domain expertise produces technically sound analyses that answer unimportant questions.
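
The tip about communicating uncertainty can be made concrete with a percentile bootstrap, which attaches an interval to a point estimate without distributional formulas. The sample below is simulated, and `bootstrap_ci` is a hypothetical helper written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical sample: 80 observed order values, simulated here for illustration.
sample = rng.exponential(scale=50.0, size=80)

def bootstrap_ci(values, n_resamples=5000, level=0.95):
    """Percentile bootstrap confidence interval for the mean."""
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = values[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(means, [tail, 1.0 - tail])
    return float(lower), float(upper)

point = float(sample.mean())
lower, upper = bootstrap_ci(sample)
print(f"mean = {point:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

Reporting the interval alongside the mean shows stakeholders how much the estimate could move with a different sample of the same size.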

Practice Quests

Suggested activities for building your Data Science skill at different intensities.

Daily Quests

EDA Practice 1.00 hr

Perform exploratory data analysis on one dataset — summarizing distributions, checking for nulls and outliers, and visualizing key relationships — without building any models.

Feature Engineering Session 0.50 hrs

Create five new features from a raw dataset — transformations, aggregations, or interaction terms — and evaluate their potential predictive value through visualization.

Statistics Review 0.50 hrs

Work through five statistics problems — hypothesis testing, confidence intervals, or probability distributions — connecting each to a specific data science application.
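
For the Feature Engineering Session quest, the three feature types it names (transformations, aggregations, and interaction terms) might look like the pandas sketch below. The `orders` table and every column in it are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw order table, invented for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "price": [10.0, 250.0, 40.0, 35.0, 5.0, 1200.0],
    "quantity": [2, 1, 3, 1, 10, 1],
    "ordered_at": pd.to_datetime([
        "2024-01-01", "2024-01-06", "2024-01-02",
        "2024-01-03", "2024-01-07", "2024-01-05",
    ]),
})

feat = orders.assign(
    log_price=np.log1p(orders["price"]),              # transformation: tame skew
    revenue=orders["price"] * orders["quantity"],     # interaction term
    order_dow=orders["ordered_at"].dt.dayofweek,      # date part, Monday = 0
)
# Aggregation: each customer's mean price, broadcast back to every row.
feat["customer_mean_price"] = feat.groupby("customer_id")["price"].transform("mean")
# Ratio feature: how unusual is this order for this customer?
feat["price_vs_customer_mean"] = feat["price"] / feat["customer_mean_price"]

print(feat.drop(columns="ordered_at"))
```

When such features feed a model, group aggregates should be computed from training rows only; computing them over the full table is a form of data leakage.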

Weekly Quests

Complete Analysis Project 6.00 hrs

Complete a full analysis cycle — question framing, EDA, modeling, evaluation, and write-up — on one Kaggle dataset or business problem within the week.

Model Comparison Study 4.00 hrs

Train five different models on the same dataset using cross-validation, compare their performance systematically, and explain why the best-performing model wins.
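
One way to set up the Model Comparison Study is sketched below with scikit-learn. The dataset is synthetic, and the five models and their settings are arbitrary picks for illustration, not a recommended shortlist.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, generated only for this sketch.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k_nearest_neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on 5 folds
    results[name] = (float(scores.mean()), float(scores.std()))

for name, (mean, std) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:<22} accuracy = {mean:.3f} +/- {std:.3f}")
```

Comparing the fold-to-fold spread with the gaps between models helps judge whether the "best" model is meaningfully better or just lucky on these folds.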

Monthly Quests

Domain Deep Study 15.00 hrs

Study the data science methods specific to one domain — time series, NLP, or recommendation systems — reading one text and implementing a complete example.

Kaggle Competition 20.00 hrs

Enter one Kaggle competition, making at least five model submissions with documented experiments and achieving a score in the top half of participants.

Notable Practitioners

Hadley Wickham

New Zealand statistician and chief scientist at Posit who created ggplot2, dplyr, and the tidyverse ecosystem that transformed how data scientists work with R.

Wes McKinney

American software developer who created the Pandas library while at AQR Capital Management, providing the foundational data manipulation tool for Python data science.

Andrew Ng

British-born American computer scientist, co-founder of Google Brain and former chief scientist at Baidu, whose Coursera machine learning courses introduced millions to data science and AI.

Nate Silver

American statistician and founder of FiveThirtyEight whose data-driven political and sports forecasting made statistical reasoning visible and accessible to a broad public.

Learning Resources

Website: Kaggle (Data Science Competitions)
YouTube: StatQuest with Josh Starmer
Website: Wikipedia, Data Science
Website: fast.ai (Practical Deep Learning)
