Data Science Basics - Because Numbers Tell Stories
Introduction: Why I Started Learning This
I was frustrated. My app collected tons of user data but I had no idea what to do with it. Users came and went, some features worked, others didn't. I was basically guessing.
I started learning data science not because I wanted to be a "data scientist." I just wanted to understand: Why do some users stick around? Which features actually matter? What's my user really doing?
Turned out: statistics and basic machine learning made these questions answerable. Not magic. Just ways to find patterns in data and make decisions based on actual numbers instead of guesses.
This guide covers the fundamentals. Statistics that matter, probability thinking, and enough machine learning to actually use it on your own data. Nothing fancy.
1. Statistics Fundamentals
Descriptive Statistics
Measures of Central Tendency:
Mean (Average):
μ = (x₁ + x₂ + ... + xₙ) / n
Example: [2, 4, 6, 8] → mean = 20/4 = 5
Use: General average, affected by outliers
Median (Middle Value):
50th percentile - half values above, half below
Example: [2, 4, 6, 8] → median = (4 + 6) / 2 = 5
Use: When outliers present (robust)
Mode (Most Frequent):
Value appearing most often
Example: [1, 2, 2, 3, 3, 3] → mode = 3
Use: Categorical data
Choosing the right measure:
- Symmetric distribution: mean ≈ median
- Skewed right: mean > median
- Skewed left: mean < median
import numpy as np
import pandas as pd
data = [2, 4, 6, 8, 100] # 100 is outlier
print(f"Mean: {np.mean(data)}") # 24 (affected by outlier)
print(f"Median: {np.median(data)}") # 6 (robust to outlier)
print(f"Std Dev: {np.std(data)}") # 41.8 (shows spread)
# Measures of Spread
print(f"Range: {max(data) - min(data)}") # 98
print(f"Variance: {np.var(data)}") # 1748 (spread²)
print(f"Std Deviation: {np.std(data)}") # 41.8 (spread)
print(f"IQR: Q3 - Q1 = {np.percentile(data, 75) - np.percentile(data, 25)}")Probability Distributions
Normal Distribution (Bell Curve):
- Mean = median = mode
- 68% within 1 std dev
- 95% within 2 std devs
- 99.7% within 3 std devs (see the quick check below)
Real examples:
- Human heights
- Test scores
- Measurement errors
- Many natural phenomena
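A quick way to convince yourself of the 68-95-99.7 rule is to check it on simulated data. A minimal sketch, assuming only NumPy:
import numpy as np
samples = np.random.normal(loc=0, scale=1, size=100_000)
for k in (1, 2, 3):
    within = np.mean(np.abs(samples) < k)  # fraction of samples within k std devs
    print(f"Within {k} std dev(s): {within:.1%}") # ≈ 68.3%, 95.4%, 99.7%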
Probability Density Function (PDF):
f(x) = 1/(σ√(2π)) * e^(-(x-μ)²/(2σ²))
μ = mean, σ = standard deviation
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate normal distribution
data = np.random.normal(loc=100, scale=15, size=10000)
plt.hist(data, bins=50, density=True, alpha=0.7, label='Data')
# Plot theoretical distribution
x = np.linspace(data.min(), data.max(), 100)
plt.plot(x, norm.pdf(x, loc=100, scale=15), 'r-', label='Normal Distribution')
plt.legend()
plt.show()
# Probability calculations
# What's P(X < 115)?
prob = norm.cdf(115, loc=100, scale=15) # 0.8413 (84.13%)
Hypothesis Testing
Concept: Is observed difference real or random chance?
Process:
1. Null hypothesis (H₀): No difference
2. Alternative hypothesis (H₁): Difference exists
3. Collect data
4. Calculate p-value
5. Compare to significance level (α = 0.05)
If p-value < α: Reject H₀ (difference is significant)
If p-value ≥ α: Fail to reject H₀ (not enough evidence of a difference)
Example: Testing if coin is fair
H₀: p = 0.5 (fair coin)
H₁: p ≠ 0.5 (unfair coin)
Flip 100 times → 65 heads
p-value ≈ 0.0035 (two-sided) < 0.05 → Reject H₀ (coin is biased)
from scipy import stats
# T-test: Compare two groups
group1 = [85, 88, 92, 78, 95] # Control group scores
group2 = [92, 95, 98, 91, 94] # Treatment group scores
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")
# Chi-square test: Categorical data
from scipy.stats import chi2_contingency
# Example: test whether two categorical variables (e.g. group vs. outcome) are independent
observed = [[20, 30], [25, 25]] # Contingency table
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.4f}")2. Probability and Bayesian Thinking
Basic Probability
P(A) = favorable outcomes / total outcomes
Probability Rules:
- P(A) ∈ [0, 1] - probability between 0 and 1
- P(A) + P(not A) = 1
- P(A and B) = P(A) × P(B) - if independent
- P(A or B) = P(A) + P(B) - P(A and B)
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
Interpretation:
P(A|B) = Posterior (what we want to know)
P(B|A) = Likelihood (how likely is evidence given hypothesis)
P(A) = Prior (belief before seeing evidence)
P(B) = Evidence (normalizing factor)
Real Example: Medical Test
Disease exists: A
Test positive: B
P(A|B) = P(disease | positive test) = ?
Given:
- P(A) = 0.01 (1% of population has disease)
- P(B|A) = 0.99 (test 99% accurate if disease present)
- P(B|not A) = 0.02 (test 2% false positive rate)
P(B) = P(B|A)×P(A) + P(B|not A)×P(not A)
= 0.99×0.01 + 0.02×0.99 = 0.0297
P(A|B) = 0.99×0.01 / 0.0297 = 0.333 (only 33.3% sure!)
Insight: Low disease prevalence + imperfect test = surprising result
# Bayes' theorem implementation
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    """Calculate P(A|B) using Bayes' theorem."""
    p_not_a = 1 - p_a
    p_b = (p_b_given_a * p_a) + (p_b_given_not_a * p_not_a)
    p_a_given_b = (p_b_given_a * p_a) / p_b
    return p_a_given_b
# Medical test example
p_disease = 0.01
p_positive_if_disease = 0.99
p_positive_if_no_disease = 0.02
result = bayes_theorem(p_disease, p_positive_if_disease, p_positive_if_no_disease)
print(f"P(Disease | Positive Test) = {result:.4f}") # 0.33333. Machine Learning Fundamentals
Supervised Learning
Regression: Predict continuous values
Goal: Find relationship between features and continuous target
Classic Problem: Predict house price
Features: square feet, bedrooms, location
Target: price
Example Output:
Input: 2000 sq ft, 3 bedrooms → Output: $450,000
Algorithms:
- Linear Regression: y = mx + b
- Polynomial Regression: y = ax² + bx + c
- Ridge/Lasso Regression: regularized linear regression (a short Ridge sketch appears in the code below)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Sample data: hours studied → test score
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([50, 60, 70, 75, 85, 90])
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print(f"Coefficient: {model.coef_[0]:.2f}") # ~7.86
print(f"Intercept: {model.intercept_:.2f}") # ~42
# Evaluate
r2 = r2_score(y, predictions)
rmse = np.sqrt(mean_squared_error(y, predictions))
print(f"R² Score: {r2:.3f}") # 0-1, higher better
print(f"RMSE: {rmse:.2f}") # lower better
# Predict new value
new_hours = np.array([[7]])
predicted_score = model.predict(new_hours)
print(f"Study 7 hours → Expected score: {predicted_score[0]:.0f}")Classification: Predict categories
Goal: Predict which category data belongs to
Classic Problem: Email classification
Features: text, sender, time
Target: spam or not spam
Algorithms:
- Logistic Regression: Binary classification
- Decision Trees: Non-linear, interpretable
- Random Forest: Ensemble, robust
- SVM: High-dimensional spaces
- Neural Networks: Complex patterns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Load data
iris = load_iris()
X = iris.data
y = iris.target
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train random forest
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}") # 1.0 (perfect on iris)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")Unsupervised Learning
Clustering: Group similar data
Goal: Find natural groupings without labels
Classic Problem: Customer segmentation
Data: purchase history, demographics
Output: Groups of similar customers
K-Means Algorithm:
1. Choose k (number of clusters)
2. Initialize k random centers
3. Assign points to nearest center
4. Update centers
5. Repeat until converged
Choosing k:
- Elbow method: plot inertia vs k
- Silhouette score: measure cluster quality
- Domain knowledge: what makes sense?
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample customer data: spending, visits
X = np.array([
[100, 10], # high spender, frequent visitor
[150, 12],
[50, 3], # low spender, rare visitor
[45, 2],
[500, 50], # very high spender
[480, 48]
])
# Standardize features
X_scaled = StandardScaler().fit_transform(X)
# Find optimal k using elbow method
inertias = []
for k in range(1, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(range(1, 6), inertias, 'bo-')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Use k=2 (elbow point)
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
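# Silhouette score (mentioned above) as a second check on cluster quality.
# A minimal sketch; values near 1 mean tight, well-separated clusters
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, clusters)
print(f"Silhouette score: {score:.2f}")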
print(f"Clusters: {clusters}") # [0, 0, 1, 1, 0, 0]Dimensionality Reduction
Problem: Too many features → overfitting, slowness
Solution: Reduce to most important features
PCA (Principal Component Analysis):
1. Find directions with most variance
2. Project data onto these directions
3. Keep top k components
Result: Fewer dimensions, most information preserved
from sklearn.decomposition import PCA
# High-dimensional data
X = np.random.randn(100, 50) # 100 samples, 50 features
# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {sum(pca.explained_variance_ratio_):.2%}")4. Working with Data in Python
Pandas Basics
import pandas as pd
import numpy as np
# Create DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)
# Basic operations
print(df.head()) # First rows
print(df.info()) # Data types, nulls
print(df.describe()) # Statistical summary
print(df['Salary'].mean()) # Column mean
# Filtering
high_earners = df[df['Salary'] > 55000]
print(high_earners)
# Grouping
grouped = df.groupby('Age')['Salary'].mean()
print(grouped)
# Merging
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Department': ['IT', 'HR']
})
merged = df.merge(df2, on='Name')
print(merged)
# Handling missing data
df_missing = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
print(df_missing.fillna(0)) # Replace NaN with 0
print(df_missing.dropna()) # Remove rows with NaN
print(df_missing.interpolate()) # Linear interpolation
NumPy for Numerical Computing
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
print(np.mean(arr)) # Average
print(np.median(arr)) # Median
print(np.std(arr)) # Standard deviation
print(np.sum(arr)) # Sum
# Linear algebra
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b)) # Matrix multiplication
print(np.linalg.inv(a)) # Matrix inverse
eigenvalues, eigenvectors = np.linalg.eig(a)
5. Data Visualization
Matplotlib
import matplotlib.pyplot as plt
# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y, label='sin(x)')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Sine Function')
plt.legend()
plt.show()
# Histogram
data = np.random.normal(100, 15, 1000)
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Normal Distribution')
plt.show()
# Scatter plot
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)
plt.scatter(x, y, alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()
Seaborn (Statistical Visualization)
import seaborn as sns
# Correlation heatmap
data = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
# Distribution plot
sns.histplot(data['A'], kde=True)
plt.show()
# Box plot (quartiles, outliers)
sns.boxplot(data=data)
plt.show()
# Pairplot (all correlations)
df = pd.DataFrame({
'X1': np.random.randn(50),
'X2': np.random.randn(50),
'X3': np.random.randn(50)
})
sns.pairplot(df)
plt.show()
6. Model Evaluation
Classification Metrics
Confusion Matrix:
                     Predicted Positive | Predicted Negative
Actual Positive:     TP                 | FN
Actual Negative:     FP                 | TN
TP (True Positive): Correctly predicted positive
FP (False Positive): Incorrectly predicted positive (Type I error)
TN (True Negative): Correctly predicted negative
FN (False Negative): Incorrectly predicted negative (Type II error)
Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) - of predicted positive, how many correct?
Recall = TP / (TP + FN) - of actual positive, how many found?
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
from sklearn.metrics import confusion_matrix, classification_report
y_true = [0, 1, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 0]
#  [2 3]]
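# Precision, recall, and F1 computed directly, matching the formulas above.
# A minimal sketch using sklearn's metric functions
from sklearn.metrics import precision_score, recall_score, f1_score
print(f"Precision: {precision_score(y_true, y_pred):.2f}") # 3 / (3 + 0) = 1.00
print(f"Recall: {recall_score(y_true, y_pred):.2f}")        # 3 / (3 + 2) = 0.60
print(f"F1: {f1_score(y_true, y_pred):.2f}")                # 2 × (1.0 × 0.6) / 1.6 = 0.75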
# Classification report
print(classification_report(y_true, y_pred))
Regression Metrics
Mean Absolute Error (MAE):
Average absolute difference between predicted and actual
Root Mean Squared Error (RMSE):
Square root of average squared differences
Penalizes large errors more than MAE
R² Score (Coefficient of Determination):
Proportion of variance explained
0 = model useless
1 = perfect predictions
from sklearn.metrics import mean_absolute_error, r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0, 2, 8]
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae}") # 0.5
print(f"R²: {r2}") # 0.957. Real-World Example: Predicting House Prices
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
# 1. Load data
df = pd.read_csv('house_prices.csv')
# 2. Explore
print(df.head())
print(df.info())
print(df.describe())
# 3. Handle missing data
df = df.dropna()
# 4. Feature engineering
df['price_per_sqft'] = df['price'] / df['square_feet'] # useful for exploration, but computed from the target
df['age'] = 2024 - df['year_built']
# 5. Prepare data (drop the target and price_per_sqft; a feature derived from price would leak the answer)
X = df.drop(columns=['price', 'price_per_sqft'])
y = df['price']
# 6. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 7. Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 8. Train
model = RandomForestRegressor(n_estimators=100, max_depth=10)
model.fit(X_train_scaled, y_train)
# 9. Evaluate
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R² Score: {r2:.3f}")
print(f"RMSE: ${rmse:,.0f}")
# 10. Feature importance
importance = pd.DataFrame({
'Feature': X.columns,
'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)
Key Takeaways
- Statistics First - Probability, distributions, hypothesis testing foundation
- Exploratory Data Analysis (EDA) - Understand data before modeling
- Supervised Learning - Regression for continuous, classification for categories
- Unsupervised Learning - Clustering for discovery
- Feature Engineering - Create meaningful features from raw data
- Model Evaluation - Always measure on test set, use appropriate metrics
- Avoid Overfitting - Use train/test split, regularization, cross-validation (see the short sketch after this list)
- Real Data is Messy - Handle missing values, outliers, imbalanced classes
- Iterate - Try different models, tune hyperparameters
- Domain Knowledge - ML + business understanding = valuable insights
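Cross-validation is listed above but never shown in this guide, so here is a minimal sketch, assuming scikit-learn and reusing the iris data from the classification example:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
iris = load_iris()
# 5-fold CV: train and evaluate on 5 different splits instead of a single train/test split
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), iris.data, iris.target, cv=5)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")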