Data Science
Technical
The practice of extracting knowledge and insight from structured and unstructured data through statistics, machine learning, and analytical reasoning.
Max Level
250
Overview
Data science is the interdisciplinary practice of extracting knowledge and actionable insights from data through a combination of statistical analysis, machine learning, programming, and domain understanding. A data scientist typically works through a complete analytical cycle: framing a business question as an analytical problem, acquiring and cleaning data, exploring and visualizing it to understand structure and relationships, building and validating statistical or machine learning models, communicating findings to stakeholders, and deploying models into production systems where they inform decisions at scale.
The field sits at the intersection of statistics, computer science, and domain expertise, a framing popularized by Drew Conway's data science Venn diagram. Statistical knowledge provides the mathematical tools for inference and modeling; programming skill enables working with large datasets and implementing algorithms; domain expertise determines which questions are worth asking and which findings are meaningful in context. Strength in only one or two of these areas tends to produce analyses that are technically sophisticated but practically useless, or practically relevant but statistically invalid.
Getting Started
Python is the dominant language for data science, with a mature ecosystem of libraries for every stage of the analytical workflow. NumPy and pandas handle data manipulation, Matplotlib and Seaborn handle visualization, and scikit-learn covers machine learning; together they form the foundational toolkit. Jupyter notebooks are the standard interactive environment for exploration and analysis.
Exploratory data analysis (EDA) is the initial examination of a dataset to understand its structure, identify patterns, detect anomalies, and check assumptions. It is the first substantive step in any data project and often reveals which questions are most worth answering. Systematically exploring a new dataset (summarizing distributions, checking for nulls and outliers, and examining relationships between variables through visualization and correlation) before building any models avoids wasting sophisticated modeling on poorly understood data.
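As a rough illustration of that first pass, the sketch below runs the four checks on a small, made-up table. The column names and values are hypothetical; in practice the frame would come from a real source such as a CSV file. The 1.5 × IQR rule used for outliers is one common convention, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for a real file or query result.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, np.nan, 52, 38, 31, 27, 95],
    "income": [32000, 54000, 61000, 45000, 50000, np.nan,
               58000, 47000, 39000, 250000],
    "segment": ["a", "b", "b", "a", "c", "b", "c", "a", "a", "b"],
})

# 1. Summary statistics for every column (numeric and categorical).
summary = df.describe(include="all")

# 2. Count of missing values per column.
null_counts = df.isna().sum()

# 3. Flag outliers with the 1.5 * IQR rule (an assumed convention).
def iqr_outliers(series):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return series[mask]

income_outliers = iqr_outliers(df["income"])

# 4. Pairwise correlations between the numeric columns.
correlations = df[["age", "income"]].corr()

print(summary)
print(null_counts)
print(income_outliers)
print(correlations)
```

Plots (histograms, scatter plots) would normally accompany these tables; they are left out here to keep the sketch short.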
Statistical foundations are the mathematical backbone of data science. Understanding probability distributions, hypothesis testing, confidence intervals, and linear regression provides the conceptual framework within which more complex methods are understood. Machine learning methods are most reliably applied by practitioners who understand their statistical foundations: why regularization reduces overfitting, what the bias-variance tradeoff means in practice, and why cross-validation produces more reliable performance estimates than a single train-test split.
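To make the regularization and cross-validation points concrete, here is a minimal sketch on synthetic data, assuming scikit-learn is available. The data-generating process, the sample sizes, and the choice of `alpha=10.0` are all illustrative assumptions. It compares ordinary least squares with ridge regression when there are many features relative to samples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic problem (assumed for illustration): 60 rows, 40 features,
# only the first 5 features actually influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 60, 40
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_coef + rng.normal(scale=2.0, size=n_samples)

models = {"ols": LinearRegression(), "ridge": Ridge(alpha=10.0)}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Five held-out R^2 scores per model instead of one split's single number.
cv_r2 = {name: cross_val_score(model, X, y, cv=cv, scoring="r2")
         for name, model in models.items()}

# The ridge penalty shrinks the coefficient vector toward zero.
coef_norm = {name: float(np.linalg.norm(model.fit(X, y).coef_))
             for name, model in models.items()}

for name in models:
    print(name, "mean CV R^2:", cv_r2[name].mean(),
          "coef norm:", coef_norm[name])
```

The spread of the five fold scores is itself informative: a single split would report only one of them, with no indication of how much it depends on which rows happened to be held out.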
Common Pitfalls
Data leakage, the accidental inclusion in training data of information that would not be available at prediction time, produces models that perform excellently in evaluation but fail in production. This error is particularly insidious because it produces high evaluation scores that appear to validate the model. Splitting the data before any preprocessing is fit, and using splits that respect time order for time-series data and group boundaries for grouped data, guards against most leakage.
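One way to guard against two frequent leakage routes is sketched below, assuming scikit-learn and synthetic data whose rows are taken to be in chronological order. Wrapping preprocessing in a pipeline means the scaler is fit on each training fold only, and `TimeSeriesSplit` keeps every training fold strictly earlier than its test fold. For grouped data, `GroupKFold` plays the analogous role.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed); rows are treated as ordered in time.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The scaler lives inside the pipeline, so it is re-fit on each
# training fold and never sees the corresponding test fold.
model = make_pipeline(StandardScaler(), LogisticRegression())

# Each split trains on the past and evaluates on the future.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)

# Sanity check: no training index comes after any test index.
split_is_chronological = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(X)
)

print("fold accuracies:", scores)
print("chronological splits:", split_is_chronological)
```

Fitting the scaler on the full dataset before splitting would let test-fold statistics influence training, which is exactly the kind of subtle leak the pipeline avoids.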
Overfitting — building models that fit training data well but generalize poorly to new data — is the most fundamental failure mode in machine learning. The discipline of holding out test data and using cross-validation to estimate generalization performance, rather than evaluating on the same data used to train, prevents overfit models from appearing better than they are.
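The gap between training and held-out performance can be seen directly in a small experiment. The sketch below assumes scikit-learn and a synthetic classification problem with deliberately noisy labels; the sizes and the `max_depth=3` comparison model are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy problem (assumed): flip_y randomizes a share of labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained tree can memorize the training set, noise included.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth-limited tree has less capacity to memorize.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    X_train, y_train
)

results = {
    "deep": (deep_tree.score(X_train, y_train), deep_tree.score(X_test, y_test)),
    "shallow": (shallow_tree.score(X_train, y_train),
                shallow_tree.score(X_test, y_test)),
}

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

Evaluating only on the training data would make the unconstrained tree look perfect; the held-out score is what exposes the problem.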
Confusing correlation with causation in observational data analysis produces incorrect conclusions that lead to bad decisions. Most data science uses observational data, where correlations may reflect confounding rather than causal relationships. Understanding when causal inference is warranted — and when only predictive, correlational claims can be made — is the statistical literacy that distinguishes rigorous from misleading analysis.
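A short simulation shows how confounding can manufacture a correlation. In this sketch, which uses NumPy and entirely synthetic variables, `x` and `y` have no effect on each other; both simply depend on a shared cause `z`. Removing the linear effect of `z` from each variable makes the association largely disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# z is the confounder; x and y each depend on z but not on each other.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

# The raw correlation looks substantial even though x does not cause y.
raw_corr = np.corrcoef(x, y)[0, 1]

def residualize(values, confounder):
    """Remove the linear effect of the confounder from values."""
    slope, intercept = np.polyfit(confounder, values, deg=1)
    return values - (slope * confounder + intercept)

# Correlation between what is left of x and y after accounting for z.
partial_corr = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print("raw correlation:", raw_corr)
print("correlation after adjusting for z:", partial_corr)
```

The adjustment only works here because the confounder was measured. With real observational data, unmeasured confounders are the harder problem, which is why randomized experiments carry more causal weight.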
Milestones
Completing a full data science project — problem framing, EDA, feature engineering, model training and evaluation, and communication of results — on a real dataset marks foundational competency. Deploying a trained model as a production API that serves predictions to an application marks the engineering-to-production milestone. Winning or placing in a Kaggle competition requiring original feature engineering and model selection marks competitive modeling skill.
Advanced data scientists contribute to the methodology of their domain, develop novel analytical approaches to business problems, and lead multi-person analytical projects.
Where to Specialize
Machine learning engineering focuses on production model deployment, serving infrastructure, and model monitoring. NLP applies data science to text data, including classification, generation, and information extraction. Time series analysis focuses on temporal data modeling for forecasting and anomaly detection. Causal inference applies quasi-experimental and related methods to support causal claims from observational data. A/B testing and experimentation covers the design and analysis of randomized experiments that measure the causal impact of product changes.
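As a small taste of the experimentation specialty, the sketch below analyzes a hypothetical A/B test with a pooled two-proportion z-test, using only the Python standard library. The function name and the conversion counts are made up for illustration, and the test relies on the usual large-sample normal approximation.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control converts 200 of 2000 users,
# the variant converts 260 of 2000.
z, p_value = two_proportion_ztest(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

In a real experiment the sample size and the decision threshold would be fixed before data collection; checking the p-value repeatedly as data arrives inflates the false-positive rate.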
Tips for Success
- Do thorough EDA before any modeling: understanding your data reveals the real questions and prevents building models on misunderstood or dirty data.
- Learn the statistics behind models, not just the APIs: understanding regularization, the bias-variance tradeoff, and cross-validation makes model selection defensible.
- Guard against data leakage obsessively: accidentally including future information in training data is a common source of unreliable evaluation metrics.
- Use cross-validation rather than a single train-test split to estimate generalization performance: a single split can mislead through chance.
- Communicate uncertainty honestly: point estimates without confidence intervals or error bars overstate the precision of analytical findings.
- Frame problems carefully before building: the most technically sophisticated model cannot answer a question that was not clearly defined before the analysis began.
- Learn the domain you are working in: data science without domain expertise produces technically sound analyses that answer unimportant questions.
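For the tip on communicating uncertainty, one simple option is a percentile bootstrap interval around a point estimate. The sketch below assumes NumPy and a synthetic, skewed sample standing in for a real metric; the 95% level and 5000 resamples are conventional but arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical skewed metric, e.g. a duration or a spend amount.
sample = rng.exponential(scale=30.0, size=200)
point_estimate = sample.mean()

# Resample the data with replacement and recompute the mean each time.
n_boot = 5000
boot_idx = rng.integers(0, len(sample), size=(n_boot, len(sample)))
boot_means = sample[boot_idx].mean(axis=1)

# The 2.5th and 97.5th percentiles give an approximate 95% interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean = {point_estimate:.1f}, "
      f"95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```

Reporting the interval alongside the mean tells the reader how much the estimate could plausibly move with a different sample of the same size.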
Practice Quests
Suggested activities for building your Data Science skill at different intensities.
Daily Quests
Perform exploratory data analysis on one dataset — summarizing distributions, checking for nulls and outliers, and visualizing key relationships — without building any models.
Create five new features from a raw dataset — transformations, aggregations, or interaction terms — and evaluate their potential predictive value through visualization.
Work through five statistics problems — hypothesis testing, confidence intervals, or probability distributions — connecting each to a specific data science application.
Weekly Quests
Complete a full analysis cycle — question framing, EDA, modeling, evaluation, and write-up — on one Kaggle dataset or business problem within the week.
Train five different models on the same dataset using cross-validation, compare their performance systematically, and explain why the best-performing model wins.
Monthly Quests
Study the data science methods specific to one domain — time series, NLP, or recommendation systems — reading one text and implementing a complete example.
Enter one Kaggle competition, making at least five model submissions with documented experiments and achieving a score in the top half of participants.
Notable Practitioners
Hadley Wickham: New Zealand statistician and chief scientist at Posit who created ggplot2, dplyr, and the tidyverse ecosystem that transformed how data scientists work with R.
Wes McKinney: American software developer who created the pandas library while at AQR Capital Management, providing the foundational data manipulation tool for Python data science.
Andrew Ng: British-American computer scientist, co-founder and former head of Google Brain and former chief scientist at Baidu, whose Coursera machine learning courses introduced millions to data science and AI.
Nate Silver: American statistician and founder of FiveThirtyEight whose data-driven political and sports forecasting made statistical reasoning visible and accessible to a broad public.