Data Science for Social Good: Analytics to Drive Impact - Chapter 2
Published 2026-03-02 06:04
# Chapter 2: Foundations of Data Science
Data science for social good is built on a shared mathematical and computational vocabulary. This chapter lays the groundwork so that the reader can comfortably read a statistical paper, code a model, and interpret results—all while keeping the **impact** of the analysis in sight.
## 2.1 Statistical Fundamentals
| Concept | Definition | Why It Matters for Social Impact | Example |
|---------|------------|-----------------------------------|---------|
| Population | The entire set of entities we wish to study (e.g., all households in a city). | Defines the scope of a policy recommendation. | Estimating the average household income in Lagos, Nigeria. |
| Sample | A subset of the population drawn for analysis. | Enables inference when the population is too large to survey in full. | A random sample of 1,200 households from Lagos. |
| Parameter | A numerical characteristic of a population (e.g., μ = mean income). | Target of estimation. | Population median household income. |
| Statistic | A numerical characteristic of a sample (e.g., x̄ = sample mean). | Basis for inference. | Sample median household income. |
### Key Takeaway
- **Inference** bridges the gap between limited data and broad policy decisions. Remember: *the goal is to estimate parameters that reflect real‑world populations*.
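To make the parameter/statistic distinction concrete, here is a minimal, self-contained sketch with a simulated income population (the numbers are made up for illustration; a real study would of course survey actual households):

```python
import random

random.seed(42)

# Hypothetical population of 100,000 household incomes (arbitrary units)
population = [random.lognormvariate(10, 0.5) for _ in range(100_000)]
population_mean = sum(population) / len(population)  # the parameter (usually unknown)

# A simple random sample of 1,200 households, as in the Lagos example
sample = random.sample(population, 1_200)
sample_mean = sum(sample) / len(sample)              # the statistic we actually observe

print(f"parameter (population mean): {population_mean:,.0f}")
print(f"statistic (sample mean):     {sample_mean:,.0f}")
```

With a well-drawn random sample, the statistic lands close to the parameter, which is exactly what licenses the leap from 1,200 surveyed households to a city-wide policy claim.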
## 2.2 Probability Concepts
Probability is the engine that powers statistical inference. Below are the pillars that we use most often in social‑impact projects.
### 2.2.1 Random Variables
- **Discrete**: Counts or categories (e.g., number of days a student attends school).
- **Continuous**: Measurements on a continuum (e.g., test scores).
### 2.2.2 Distributions
| Distribution | Common Use | Example |
|---------------|------------|---------|
| Binomial | Success/failure experiments | Probability a child passes an exam (success = pass). |
| Poisson | Rare events per unit | Number of traffic accidents in a week in a borough. |
| Normal | Many real‑world metrics | Heights of adults in a city. |
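The three distributions in the table can be evaluated directly from their formulas; the sketch below uses only the standard library, with illustrative parameter values (pass probability, accident rate, and height mean/spread are all invented):

```python
import math

# Binomial: probability of exactly k successes in n independent trials with success probability p
def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson: probability of k events per unit when the average rate is lam
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Normal: density at x for mean mu and standard deviation sigma
def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

print(binomial_pmf(8, 10, 0.7))  # 8 of 10 children pass, each with p = 0.7
print(poisson_pmf(2, 3.0))       # 2 accidents in a week averaging 3
print(normal_pdf(170, 168, 7))   # density at height 170 cm, mean 168, sd 7
```

A useful sanity check in practice: the binomial probabilities over all k must sum to 1.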
### 2.2.3 Expectation & Variance
- **Expectation (E[X])**: The average value you would expect over many repetitions.
- **Variance (Var[X])**: How spread out the values are.
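For a discrete random variable, both quantities are simple weighted sums. A toy example (the attendance probabilities below are invented for illustration):

```python
# X = days attended in a 5-day school week, with hypothetical probabilities
values = [0, 1, 2, 3, 4, 5]
probs = [0.02, 0.03, 0.05, 0.10, 0.30, 0.50]

# E[X] = sum of value * probability
expectation = sum(v * p for v, p in zip(values, probs))

# Var[X] = expected squared deviation from E[X]
variance = sum((v - expectation) ** 2 * p for v, p in zip(values, probs))

print(f"E[X]   = {expectation:.2f}")
print(f"Var[X] = {variance:.2f}")
```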
### 2.2.4 Central Limit Theorem (CLT)
> For a large enough sample size, the distribution of the sample mean approaches a normal distribution, regardless of the population distribution.
**Why CLT matters**: It justifies using t‑tests and confidence intervals even when we don't know the population distribution.
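The CLT is easy to see by simulation. The sketch below draws from an exponential distribution, which is strongly right-skewed, yet the means of repeated samples of size 50 cluster tightly and symmetrically around the true mean (the sample sizes and repetition count are arbitrary choices):

```python
import random

random.seed(0)

# Mean of one sample of size n from a skewed Exponential(1) population
def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

# Repeat the experiment many times and study the distribution of the means
means = [sample_mean(50) for _ in range(2_000)]

grand_mean = sum(means) / len(means)
grand_var = sum((m - grand_mean) ** 2 for m in means) / len(means)

# Exponential(1) has mean 1 and variance 1, so the CLT predicts the sample
# means center near 1 with variance near 1/50 = 0.02
print(f"mean of sample means:     {grand_mean:.3f}")
print(f"variance of sample means: {grand_var:.4f}")
```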
## 2.3 Basics of Machine Learning
Machine learning (ML) is the toolbox that turns raw numbers into actionable insights. Here we discuss the high‑level concepts that underpin most models.
### 2.3.1 Supervised vs. Unsupervised Learning
| Type | Goal | Typical Algorithms |
|-------|------|--------------------|
| Supervised | Predict a target variable given features | Linear regression, logistic regression, decision trees |
| Unsupervised | Discover structure without explicit labels | k‑means clustering, PCA |
### 2.3.2 Loss Functions & Optimization
- **Loss Function** quantifies error (e.g., Mean Squared Error for regression).
- **Optimization** finds parameter values that minimize loss (e.g., gradient descent).
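The two bullets above fit together in a few lines: define a loss, then iteratively step parameters against its gradient. A minimal sketch for simple linear regression on made-up data (learning rate and iteration count are arbitrary choices):

```python
# Toy data generated exactly by y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)

for _ in range(5_000):
    # Gradients of the loss MSE = (1/n) * sum((w*x + b - y)^2)
    grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
    # Gradient descent: move each parameter downhill
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")  # should approach 2 and 1
```

Libraries like scikit-learn solve this particular problem in closed form, but the same loop, with fancier gradients, is what trains most modern models.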
### 2.3.3 Model Evaluation
| Metric | When to Use | Interpretation |
|--------|-------------|----------------|
| R² | Regression | Proportion of variance explained (at most 1; near 0 means little explanatory power). |
| Accuracy | Classification | % of correct predictions. |
| AUC‑ROC | Binary classification | Probability that a randomly chosen positive case is ranked above a randomly chosen negative one. |
| Silhouette | Clustering | How similar an object is to its own cluster vs. others. |
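Accuracy and AUC-ROC are simple enough to compute by hand, which makes their meanings concrete. A self-contained sketch on invented classifier scores:

```python
# Hypothetical true labels and predicted probabilities from a binary classifier
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

# Accuracy: fraction of correct predictions at a 0.5 threshold
preds = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

# AUC: probability a random positive is scored above a random negative
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"accuracy = {accuracy:.3f}, AUC = {auc:.3f}")
```

Note that accuracy depends on the chosen threshold while AUC does not, which is one reason AUC is often preferred when the intervention threshold is still a policy decision.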
### 2.3.4 Overfitting & Regularization
- **Overfitting**: Model learns noise instead of signal.
- **Regularization**: Penalizes complexity (L1, L2 penalties).
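The effect of an L2 penalty is easiest to see with correlated features, where unpenalized least squares produces unstable, offsetting coefficients. A sketch using the closed-form ridge solution (the data, the near-duplicate feature, and the penalty strength are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# y depends only on x1, but x2 is a near copy of x1 (almost collinear)
x1 = rng.normal(size=20)
x2 = x1 + rng.normal(scale=0.01, size=20)
X = np.column_stack([x1, x2])
y = 1.5 * x1 + rng.normal(scale=0.1, size=20)

def ridge(X, y, lam):
    # Closed-form L2-regularized least squares: (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)  # no penalty: large offsetting coefficients
w_l2 = ridge(X, y, 1.0)   # L2 penalty: small, stable coefficients

print("OLS  :", np.round(w_ols, 2))
print("Ridge:", np.round(w_l2, 2))
```

The penalized coefficients are smaller in norm and split the shared signal roughly evenly between the two correlated features, while their sum still recovers the true effect.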
### 2.3.5 Feature Engineering
- **Domain knowledge** is the most powerful engine.
- **Transformations** (log, Box‑Cox) can stabilize variance and improve model performance.
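A quick illustration of why a log transform helps: simulated right-skewed incomes become nearly symmetric after taking logs, which stabilizes variance-based statistics (the distribution parameters here are invented):

```python
import math
import random

random.seed(1)

# Right-skewed incomes drawn from a log-normal distribution
incomes = [random.lognormvariate(10, 1.0) for _ in range(5_000)]
log_incomes = [math.log(x) for x in incomes]

def skewness(xs):
    # Standardized third moment: 0 for a symmetric distribution
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

print(f"skewness before log: {skewness(incomes):.2f}")
print(f"skewness after log:  {skewness(log_incomes):.2f}")
```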
## 2.4 Key Terminology for Social Impact Projects
- **Bias**: Systematic error that leads to unfair outcomes.
- **Variance**: Sensitivity to fluctuations in the training data.
- **Fairness**: The absence of discrimination across protected attributes.
- **Explainability**: The degree to which a model's decisions can be understood by humans.
- **Scalability**: Ability to handle larger datasets or more complex models.
## 2.5 Case Study: Predicting School Dropout Risk
We illustrate the concepts above with a practical example. Suppose we have a dataset of 5,000 students from an urban school district, including:
| Feature | Type | Example |
|---------|------|---------|
| Age | Numeric | 14 |
| Gender | Categorical | Male |
| AttendanceRate | Numeric | 0.92 |
| ParentalEducation | Categorical | High School |
| SocioeconomicStatus | Numeric | 3 (on a 1–5 scale) |
| PriorTestScore | Numeric | 78 |
| Dropout | Binary | 0 (No), 1 (Yes) |
### 2.5.1 Data Overview
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the student-level dataset and inspect summary statistics
df = pd.read_csv("school_dropout.csv")
print(df.describe())
```
### 2.5.2 Preprocessing Steps
1. Encode categorical variables using one‑hot encoding.
2. Impute missing `PriorTestScore` with the mean.
3. Scale numeric features with `StandardScaler`.
4. Split into train/test (80/20).
```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_features = ['Age', 'AttendanceRate', 'SocioeconomicStatus', 'PriorTestScore']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # step 2: mean imputation
    ('scaler', StandardScaler())])                # step 3: standardization

categorical_features = ['Gender', 'ParentalEducation']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')  # step 1

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

X = df.drop('Dropout', axis=1)
y = df['Dropout']

# Step 4: 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```
### 2.5.3 Model Building
We’ll use a **logistic regression** model with L2 regularization to predict dropout risk.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score

# LogisticRegression applies an L2 penalty by default in scikit-learn
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)

# Predicted probability of dropout (class 1) on the held-out test set
pred_proba = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, pred_proba)
print(f"AUC-ROC: {roc_auc:.3f}")
```
**Result Interpretation**:
- An AUC‑ROC of 0.78 indicates decent discriminatory power (0.5 is chance; 1.0 is perfect).
- Feature coefficients (interpreted on the standardized feature scale) reveal that lower attendance and lower prior test scores are strong predictors.
### 2.5.4 Ethical Reflection
- **Bias Check**: Verify that predictions do not disproportionately flag girls or lower‑SES students.
- **Explainability**: Use SHAP values to communicate which factors drive a particular student’s risk.
- **Intervention**: Pair predictions with actionable outreach (e.g., tutoring or counseling).
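The bias check in particular can start very simply: compare the model's flag rate across groups defined by a protected attribute. A self-contained sketch with invented predictions (a real audit would use held-out predictions and a formal toolkit such as Fairlearn or AIF360, per the checklist below):

```python
# Hypothetical model outputs: each record has a protected attribute and
# whether the model flagged the student as at-risk
records = [
    {"gender": "F", "flagged": 1}, {"gender": "F", "flagged": 0},
    {"gender": "F", "flagged": 0}, {"gender": "F", "flagged": 1},
    {"gender": "M", "flagged": 1}, {"gender": "M", "flagged": 0},
    {"gender": "M", "flagged": 1}, {"gender": "M", "flagged": 1},
]

def flag_rate(group):
    # Share of students in the group that the model flags
    rows = [r for r in records if r["gender"] == group]
    return sum(r["flagged"] for r in rows) / len(rows)

rate_f, rate_m = flag_rate("F"), flag_rate("M")
print(f"flag rate F: {rate_f:.2f}, M: {rate_m:.2f}")
# A large gap between group rates warrants investigation before deployment
```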
## 2.6 Practical Checklist for Foundations
| Step | Action | Tool | Tips |
|------|--------|------|------|
| 1 | Define population & sampling frame | Survey design | Use stratified sampling to capture sub‑groups. |
| 2 | Establish key metrics | KPI table | Align metrics with policy goals. |
| 3 | Compute descriptive stats | Pandas, NumPy | Visualize distributions to spot anomalies. |
| 4 | Formulate hypothesis | Statistical tests | Use 95% CI unless domain demands otherwise. |
| 5 | Choose model type | scikit‑learn, TensorFlow | Start simple; increase complexity only if needed. |
| 6 | Evaluate fairness | Fairlearn, AIF360 | Test across protected attributes. |
| 7 | Document assumptions | Jupyter Notebook | Include version control for reproducibility. |
## 2.7 Takeaway
The foundation of data science is **conceptual clarity**—understanding how probability, statistics, and machine learning intertwine—and **ethical mindfulness**—ensuring that every analytical choice serves the people at the heart of your project. Master these fundamentals, and you will be ready to dive into the hands‑on techniques of the subsequent chapters.
---
*End of Chapter 2.*