Data Science Fundamentals - Complete Mastery
Introduction: The Data Revolution
Data science sits at the intersection of statistics, programming, and domain expertise. By one widely cited estimate, the world generates about 2.5 quintillion bytes of data daily. Understanding data science is essential for:
- Making data-driven decisions
- Building predictive models
- Automating decision-making
- Understanding customer behavior
- Optimizing business processes
- Creating competitive advantages
This guide covers core concepts with practical implementations.
1. Statistics Fundamentals
Descriptive Statistics
Measures of Central Tendency:
Mean (Average):
μ = (x₁ + x₂ + ... + xₙ) / n
Example: [2, 4, 6, 8] → mean = 20/4 = 5
Use: General average, affected by outliers
Median (Middle Value):
50th percentile - half values above, half below
Example: [2, 4, 6, 8] → median = (4 + 6) / 2 = 5
Use: When outliers present (robust)
Mode (Most Frequent):
Value appearing most often
Example: [1, 2, 2, 3, 3, 3] → mode = 3
Use: Categorical data
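Python's standard-library statistics module computes all three measures directly; a quick check on the small lists above (illustrative values only):

```python
import statistics

grades = [1, 2, 2, 3, 3, 3]       # same data as the mode example above
print(statistics.mean(grades))    # 2.3333... (14/6)
print(statistics.median(grades))  # 2.5 (average of the two middle values)
print(statistics.mode(grades))    # 3 (most frequent value)
```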
Choosing the right measure:
- Symmetric distribution: mean ≈ median
- Skewed right: mean > median
- Skewed left: mean < median

import numpy as np
import pandas as pd
data = [2, 4, 6, 8, 100] # 100 is outlier
print(f"Mean: {np.mean(data)}")      # 24.0 (pulled up by the outlier)
print(f"Median: {np.median(data)}")  # 6.0 (robust to the outlier)
# Measures of Spread
print(f"Range: {max(data) - min(data)}")     # 98
print(f"Variance: {np.var(data)}")           # 1448.0 (spread squared)
print(f"Std Deviation: {np.std(data):.2f}")  # 38.05 (square root of variance)
print(f"IQR: Q3 - Q1 = {np.percentile(data, 75) - np.percentile(data, 25)}")  # 4.0

Probability Distributions
Normal Distribution (Bell Curve):
- Mean = median = mode
- 68% within 1 std dev
- 95% within 2 std devs
- 99.7% within 3 std devs
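The 68–95–99.7 rule can be verified numerically from the normal CDF (a small sanity check, not part of the original text):

```python
from scipy.stats import norm

# P(mu - k*sigma < X < mu + k*sigma) for a standard normal
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} std dev(s): {p:.1%}")
# within 1 std dev(s): 68.3%
# within 2 std dev(s): 95.4%
# within 3 std dev(s): 99.7%
```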
Real examples:
- Human heights
- Test scores
- Measurement errors
- Many natural phenomena
Probability Density Function (PDF):
f(x) = 1/(σ√(2π)) * e^(-(x-μ)²/(2σ²))
μ = mean, σ = standard deviation

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate normal distribution
data = np.random.normal(loc=100, scale=15, size=10000)
plt.hist(data, bins=50, density=True, alpha=0.7, label='Data')
# Plot theoretical distribution
x = np.linspace(data.min(), data.max(), 100)
plt.plot(x, norm.pdf(x, loc=100, scale=15), 'r-', label='Normal Distribution')
plt.legend()
plt.show()
# Probability calculations
# What's P(X < 115)?
prob = norm.cdf(115, loc=100, scale=15)  # 0.8413 (84.13%)

Hypothesis Testing
Concept: Is observed difference real or random chance?
Process:
1. Null hypothesis (H₀): No difference
2. Alternative hypothesis (H₁): Difference exists
3. Collect data
4. Calculate p-value
5. Compare to significance level (α = 0.05)
If p-value < α: Reject H₀ (difference is significant)
If p-value ≥ α: Fail to reject H₀ (no evidence of difference)
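These five steps can be run end-to-end with an exact binomial test — for instance, testing whether a coin that lands heads 65 times in 100 flips is fair (using `scipy.stats.binomtest`, available in SciPy ≥ 1.7):

```python
from scipy.stats import binomtest

# H0: p = 0.5 (fair coin); two-sided alternative by default
result = binomtest(k=65, n=100, p=0.5)
print(f"p-value: {result.pvalue:.4f}")  # ~0.0035
if result.pvalue < 0.05:
    print("Reject H0: evidence the coin is biased")
```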
Example: Testing if coin is fair
H₀: p = 0.5 (fair coin)
H₁: p ≠ 0.5 (unfair coin)
Flip 100 times → 65 heads
p-value ≈ 0.0035 < 0.05 → Reject H₀ (coin is biased)

from scipy import stats
# T-test: Compare two groups
group1 = [85, 88, 92, 78, 95]  # Control group scores
group2 = [92, 95, 98, 91, 94]  # Treatment group scores
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")
# Chi-square test: Categorical data
from scipy.stats import chi2_contingency
# Example: test whether two categorical variables are independent
observed = [[20, 30], [25, 25]]  # Contingency table
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.4f}")

2. Probability and Bayesian Thinking
Basic Probability
P(A) = favorable outcomes / total outcomes
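As a quick check of this counting definition, the sample space for two dice can be enumerated directly (an illustrative example):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
favorable = [o for o in outcomes if sum(o) == 7]
print(f"P(sum = 7) = {len(favorable)}/{len(outcomes)}")  # 6/36 ≈ 0.1667
```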
Probability Rules:
- P(A) ∈ [0, 1] - probability between 0 and 1
- P(A) + P(not A) = 1
- P(A and B) = P(A) × P(B) - if independent
- P(A or B) = P(A) + P(B) - P(A and B)

Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
Interpretation:
P(A|B) = Posterior (what we want to know)
P(B|A) = Likelihood (how likely is evidence given hypothesis)
P(A) = Prior (belief before seeing evidence)
P(B) = Evidence (normalizing factor)
Real Example: Medical Test
Disease exists: A
Test positive: B
P(A|B) = P(disease | positive test) = ?
Given:
- P(A) = 0.01 (1% of population has disease)
- P(B|A) = 0.99 (test 99% accurate if disease present)
- P(B|not A) = 0.02 (test 2% false positive rate)
P(B) = P(B|A)×P(A) + P(B|not A)×P(not A)
= 0.99×0.01 + 0.02×0.99 = 0.0297
P(A|B) = 0.99×0.01 / 0.0297 = 0.333 (only 33.3% sure!)
Insight: Low disease prevalence + imperfect test = surprising result

# Bayes' theorem implementation
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    """Calculate P(A|B) using Bayes' theorem"""
    p_not_a = 1 - p_a
    p_b = (p_b_given_a * p_a) + (p_b_given_not_a * p_not_a)
    p_a_given_b = (p_b_given_a * p_a) / p_b
    return p_a_given_b
# Medical test example
p_disease = 0.01
p_positive_if_disease = 0.99
p_positive_if_no_disease = 0.02
result = bayes_theorem(p_disease, p_positive_if_disease, p_positive_if_no_disease)
print(f"P(Disease | Positive Test) = {result:.4f}")  # 0.3333

3. Machine Learning Fundamentals
Supervised Learning
Regression: Predict continuous values
Goal: Find relationship between features and continuous target
Classic Problem: Predict house price
Features: square feet, bedrooms, location
Target: price
Example Output:
Input: 2000 sq ft, 3 bedrooms → Output: $450,000
Algorithms:
- Linear Regression: y = mx + b
- Polynomial Regression: y = ax² + bx + c
- Ridge/Lasso Regression: regularized linear models

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Sample data: hours studied → test score
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([50, 60, 70, 75, 85, 90])
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print(f"Coefficient: {model.coef_[0]:.2f}")  # 8.00
print(f"Intercept: {model.intercept_:.2f}")  # 43.67
# Evaluate
r2 = r2_score(y, predictions)
rmse = np.sqrt(mean_squared_error(y, predictions))
print(f"R² Score: {r2:.3f}") # 0-1, higher better
print(f"RMSE: {rmse:.2f}") # lower better
# Predict new value
new_hours = np.array([[7]])
predicted_score = model.predict(new_hours)
print(f"Study 7 hours → Expected score: {predicted_score[0]:.0f}")  # ~100

Classification: Predict categories
Goal: Predict which category data belongs to
Classic Problem: Email classification
Features: text, sender, time
Target: spam or not spam
Algorithms:
- Logistic Regression: Binary classification
- Decision Trees: Non-linear, interpretable
- Random Forest: Ensemble, robust
- SVM: High-dimensional spaces
- Neural Networks: Complex patterns

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Load data
iris = load_iris()
X = iris.data
y = iris.target
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train random forest
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")  # typically 0.93–1.00 on iris
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")

Unsupervised Learning
Clustering: Group similar data
Goal: Find natural groupings without labels
Classic Problem: Customer segmentation
Data: purchase history, demographics
Output: Groups of similar customers
K-Means Algorithm:
1. Choose k (number of clusters)
2. Initialize k random centers
3. Assign points to nearest center
4. Update centers
5. Repeat until converged
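Those five steps translate almost line-for-line into NumPy; a minimal from-scratch sketch (the toy data and variable names here are illustrative, not from the original):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    """Minimal K-Means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize k centers from randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels, _ = kmeans(X, k=2)
print(labels)  # the two tight pairs land in separate clusters
```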
Choosing k:
- Elbow method: plot inertia vs k
- Silhouette score: measure cluster quality
- Domain knowledge: what makes sense?

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample customer data: spending, visits
X = np.array([
    [100, 10],  # high spender, frequent visitor
    [150, 12],
    [50, 3],    # low spender, rare visitor
    [45, 2],
    [500, 50],  # very high spender
    [480, 48]
])
# Standardize features
X_scaled = StandardScaler().fit_transform(X)
# Find optimal k using elbow method
inertias = []
for k in range(1, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(range(1, 6), inertias, 'bo-')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Use k=2 (elbow point)
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
print(f"Clusters: {clusters}")  # the two very high spenders form one cluster, the rest the other
# Cluster quality: silhouette score (closer to 1 = better separated)
from sklearn.metrics import silhouette_score
print(f"Silhouette: {silhouette_score(X_scaled, clusters):.2f}")

Dimensionality Reduction
Problem: Too many features → overfitting, slowness
Solution: Reduce to most important features
PCA (Principal Component Analysis):
1. Find directions with most variance
2. Project data onto these directions
3. Keep top k components
Result: Fewer dimensions, most information preserved

from sklearn.decomposition import PCA
# High-dimensional data
X = np.random.randn(100, 50) # 100 samples, 50 features
# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {sum(pca.explained_variance_ratio_):.2%}")

4. Working with Data in Python
Pandas Basics
import pandas as pd
import numpy as np
# Create DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)
# Basic operations
print(df.head()) # First rows
print(df.info()) # Data types, nulls
print(df.describe()) # Statistical summary
print(df['Salary'].mean()) # Column mean
# Filtering
high_earners = df[df['Salary'] > 55000]
print(high_earners)
# Grouping
grouped = df.groupby('Age')['Salary'].mean()
print(grouped)
# Merging
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Department': ['IT', 'HR']
})
merged = df.merge(df2, on='Name')
print(merged)
# Handling missing data
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})
print(df_missing.fillna(0)) # Replace NaN with 0
print(df_missing.dropna()) # Remove rows with NaN
print(df_missing.interpolate())  # Linear interpolation

NumPy for Numerical Computing
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
print(np.mean(arr)) # Average
print(np.median(arr)) # Median
print(np.std(arr)) # Standard deviation
print(np.sum(arr)) # Sum
# Linear algebra
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b)) # Matrix multiplication
print(np.linalg.inv(a)) # Matrix inverse
eigenvalues, eigenvectors = np.linalg.eig(a)

5. Data Visualization
Matplotlib
import matplotlib.pyplot as plt
# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y, label='sin(x)')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Sine Function')
plt.legend()
plt.show()
# Histogram
data = np.random.normal(100, 15, 1000)
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Normal Distribution')
plt.show()
# Scatter plot
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)
plt.scatter(x, y, alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()

Seaborn (Statistical Visualization)
import seaborn as sns
# Correlation heatmap
data = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
# Distribution plot
sns.histplot(data['A'], kde=True)
plt.show()
# Box plot (quartiles, outliers)
sns.boxplot(data=data)
plt.show()
# Pairplot (all correlations)
df = pd.DataFrame({
    'X1': np.random.randn(50),
    'X2': np.random.randn(50),
    'X3': np.random.randn(50)
})
sns.pairplot(df)
plt.show()

6. Model Evaluation
Classification Metrics
Confusion Matrix:
                  Predicted Positive | Predicted Negative
Actual Positive:         TP          |        FN
Actual Negative:         FP          |        TN
TP (True Positive): Correctly predicted positive
FP (False Positive): Incorrectly predicted positive (Type I error)
TN (True Negative): Correctly predicted negative
FN (False Negative): Incorrectly predicted negative (Type II error)
Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) - of predicted positive, how many correct?
Recall = TP / (TP + FN) - of actual positive, how many found?
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

from sklearn.metrics import confusion_matrix, classification_report
y_true = [0, 1, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
# Confusion matrix (rows = actual, columns = predicted; class order 0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 0]
#  [2 3]]
# By hand, for class 1: TP = 3, FP = 0, FN = 2
# Precision = 3/3 = 1.0, Recall = 3/5 = 0.6, F1 = 0.75
# Classification report
print(classification_report(y_true, y_pred))

Regression Metrics
Mean Absolute Error (MAE):
Average absolute difference between predicted and actual
Root Mean Squared Error (RMSE):
Square root of average squared differences
Penalizes large errors more than MAE
R² Score (Coefficient of Determination):
Proportion of variance explained
1 = perfect predictions; 0 = no better than predicting the mean (it can even be negative)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0, 2, 8]
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae}")        # 0.5
print(f"RMSE: {rmse:.3f}")  # ~0.612
print(f"R²: {r2:.3f}")      # ~0.949

7. Real-World Example: Predicting House Prices
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
# 1. Load data
df = pd.read_csv('house_prices.csv')
# 2. Explore
print(df.head())
print(df.info())
print(df.describe())
# 3. Handle missing data
df = df.dropna()
# 4. Feature engineering
df['age'] = 2024 - df['year_built']
# Note: avoid features derived from the target (e.g. price per square foot),
# which would leak 'price' into the inputs
# 5. Prepare data
X = df.drop(columns=['price'])
y = df['price']
# 6. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 7. Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 8. Train
model = RandomForestRegressor(n_estimators=100, max_depth=10)
model.fit(X_train_scaled, y_train)
# 9. Evaluate
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R² Score: {r2:.3f}")
print(f"RMSE: ${rmse:,.0f}")
# 10. Feature importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)

Key Takeaways
- Statistics First - Probability, distributions, hypothesis testing foundation
- Exploratory Data Analysis (EDA) - Understand data before modeling
- Supervised Learning - Regression for continuous, classification for categories
- Unsupervised Learning - Clustering for discovery
- Feature Engineering - Create meaningful features from raw data
- Model Evaluation - Always measure on test set, use appropriate metrics
- Avoid Overfitting - Use train/test split, regularization, cross-validation
- Real Data is Messy - Handle missing values, outliers, imbalanced classes
- Iterate - Try different models, tune hyperparameters
- Domain Knowledge - ML + business understanding = valuable insights
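The cross-validation mentioned above can be sketched with scikit-learn's cross_val_score (iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold CV: train on 4 folds, score on the held-out 5th, then rotate
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold accuracies: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
```

Averaging over folds gives a more stable estimate than a single train/test split, at the cost of training the model k times.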