Data Science Basics - Because Numbers Tell Stories
Introduction: Why I Started Learning This
I was frustrated. My app collected tons of user data but I had no idea what to do with it. Users came and went, some features worked, others didn't. I was basically guessing.
I started learning data science not because I wanted to be a "data scientist." I just wanted to understand: Why do some users stick around? Which features actually matter? What's my user really doing?
Turned out: statistics and basic machine learning made these questions answerable. Not magic. Just ways to find patterns in data and make decisions based on actual numbers instead of guesses.
This guide covers the fundamentals. Statistics that matter, probability thinking, and enough machine learning to actually use it on your own data. Nothing fancy.
1. Statistics Fundamentals
Descriptive Statistics
Measures of Central Tendency:
Mean (Average):
μ = (x₁ + x₂ + ... + xₙ) / n
Example: [2, 4, 6, 8] → mean = 20/4 = 5
Use: General average, affected by outliers
Median (Middle Value):
50th percentile - half values above, half below
Example: [2, 4, 6, 8] → median = (4 + 6) / 2 = 5
Use: When outliers present (robust)
Mode (Most Frequent):
Value appearing most often
Example: [1, 2, 2, 3, 3, 3] → mode = 3
Use: Categorical data
Choosing the right measure:
- Symmetric distribution: mean ≈ median
- Skewed right: mean > median
- Skewed left: mean < median
import numpy as np
import pandas as pd
data = [2, 4, 6, 8, 100] # 100 is outlier
print(f"Mean: {np.mean(data)}") # 24 (affected by outlier)
print(f"Median: {np.median(data)}") # 6 (robust to outlier)
print(f"Std Dev: {np.std(data)}") # 41.8 (shows spread)
# Measures of Spread
print(f"Range: {max(data) - min(data)}") # 98
print(f"Variance: {np.var(data)}") # 1748 (spread²)
print(f"Std Deviation: {np.std(data)}") # 41.8 (spread)
print(f"IQR: Q3 - Q1 = {np.percentile(data, 75) - np.percentile(data, 25)}")Probability Distributions
Normal Distribution (Bell Curve):
- Mean = median = mode
- 68% within 1 std dev
- 95% within 2 std devs
- 99.7% within 3 std devs (see the quick check below)
Real examples:
- Human heights
- Test scores
- Measurement errors
- Many natural phenomena
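A quick way to convince yourself of the 68-95-99.7 rule is to check it on simulated data. A minimal sketch, assuming only NumPy:
import numpy as np
samples = np.random.normal(loc=0, scale=1, size=100_000)
for k in (1, 2, 3):
    within = np.mean(np.abs(samples) < k)  # fraction of samples within k std devs
    print(f"Within {k} std dev(s): {within:.1%}") # ≈ 68.3%, 95.4%, 99.7%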
Probability Density Function (PDF):
f(x) = 1/(σ√(2π)) * e^(-(x-μ)²/(2σ²))
μ = mean, σ = standard deviation
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate normal distribution
data = np.random.normal(loc=100, scale=15, size=10000)
plt.hist(data, bins=50, density=True, alpha=0.7, label='Data')
# Plot theoretical distribution
x = np.linspace(data.min(), data.max(), 100)
plt.plot(x, norm.pdf(x, loc=100, scale=15), 'r-', label='Normal Distribution')
plt.legend()
plt.show()
# Probability calculations
# What's P(X < 115)?
prob = norm.cdf(115, loc=100, scale=15) # 0.8413 (84.13%)
Hypothesis Testing
Concept: Is observed difference real or random chance?
Process:
1. Null hypothesis (H₀): No difference
2. Alternative hypothesis (H₁): Difference exists
3. Collect data
4. Calculate p-value
5. Compare to significance level (α = 0.05)
If p-value < α: Reject H₀ (difference is significant)
If p-value ≥ α: Fail to reject H₀ (not enough evidence of a difference)
Example: Testing if coin is fair
H₀: p = 0.5 (fair coin)
H₁: p ≠ 0.5 (unfair coin)
Flip 100 times → 65 heads
p-value ≈ 0.0035 (two-sided) < 0.05 → Reject H₀ (coin is biased)
from scipy import stats
# T-test: Compare two groups
group1 = [85, 88, 92, 78, 95] # Control group scores
group2 = [92, 95, 98, 91, 94] # Treatment group scores
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")
# Chi-square test: Categorical data
from scipy.stats import chi2_contingency
# Example: test whether two categorical variables (e.g. group vs. outcome) are independent
observed = [[20, 30], [25, 25]] # Contingency table
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.4f}")2. Probability and Bayesian Thinking
Basic Probability
P(A) = favorable outcomes / total outcomes
Probability Rules:
- P(A) ∈ [0, 1] - probability between 0 and 1
- P(A) + P(not A) = 1
- P(A and B) = P(A) × P(B) - if independent
- P(A or B) = P(A) + P(B) - P(A and B)
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
Interpretation:
P(A|B) = Posterior (what we want to know)
P(B|A) = Likelihood (how likely is evidence given hypothesis)
P(A) = Prior (belief before seeing evidence)
P(B) = Evidence (normalizing factor)
Real Example: Medical Test
Disease exists: A
Test positive: B
P(A|B) = P(disease | positive test) = ?
Given:
- P(A) = 0.01 (1% of population has disease)
- P(B|A) = 0.99 (test 99% accurate if disease present)
- P(B|not A) = 0.02 (test 2% false positive rate)
P(B) = P(B|A)×P(A) + P(B|not A)×P(not A)
= 0.99×0.01 + 0.02×0.99 = 0.0297
P(A|B) = 0.99×0.01 / 0.0297 = 0.333 (only 33.3% sure!)
Insight: Low disease prevalence + imperfect test = surprising result
# Bayes' theorem implementation
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    """Calculate P(A|B) using Bayes' theorem."""
    p_not_a = 1 - p_a
    p_b = (p_b_given_a * p_a) + (p_b_given_not_a * p_not_a)
    p_a_given_b = (p_b_given_a * p_a) / p_b
    return p_a_given_b
# Medical test example
p_disease = 0.01
p_positive_if_disease = 0.99
p_positive_if_no_disease = 0.02
result = bayes_theorem(p_disease, p_positive_if_disease, p_positive_if_no_disease)
print(f"P(Disease | Positive Test) = {result:.4f}") # 0.33333. Machine Learning Fundamentals
Supervised Learning
Regression: Predict continuous values
Goal: Find relationship between features and continuous target
Classic Problem: Predict house price
Features: square feet, bedrooms, location
Target: price
Example Output:
Input: 2000 sq ft, 3 bedrooms → Output: $450,000
Algorithms:
- Linear Regression: y = mx + b
- Polynomial Regression: y = ax² + bx + c
- Ridge/Lasso Regression: regularized linear regression (a short Ridge sketch appears in the code below)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Sample data: hours studied → test score
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([50, 60, 70, 75, 85, 90])
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print(f"Coefficient: {model.coef_[0]:.2f}") # ~7.86
print(f"Intercept: {model.intercept_:.2f}") # ~42
# Evaluate
r2 = r2_score(y, predictions)
rmse = np.sqrt(mean_squared_error(y, predictions))
print(f"R² Score: {r2:.3f}") # 0-1, higher better
print(f"RMSE: {rmse:.2f}") # lower better
# Predict new value
new_hours = np.array([[7]])
predicted_score = model.predict(new_hours)
print(f"Study 7 hours → Expected score: {predicted_score[0]:.0f}")Classification: Predict categories
Goal: Predict which category data belongs to
Classic Problem: Email classification
Features: text, sender, time
Target: spam or not spam
Algorithms:
- Logistic Regression: Binary classification
- Decision Trees: Non-linear, interpretable
- Random Forest: Ensemble, robust
- SVM: High-dimensional spaces
- Neural Networks: Complex patterns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Load data
iris = load_iris()
X = iris.data
y = iris.target
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train random forest
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}") # 1.0 (perfect on iris)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")Unsupervised Learning
Clustering: Group similar data
Goal: Find natural groupings without labels
Classic Problem: Customer segmentation
Data: purchase history, demographics
Output: Groups of similar customers
K-Means Algorithm:
1. Choose k (number of clusters)
2. Initialize k random centers
3. Assign points to nearest center
4. Update centers
5. Repeat until converged
Choosing k:
- Elbow method: plot inertia vs k
- Silhouette score: measure cluster quality
- Domain knowledge: what makes sense?
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample customer data: spending, visits
X = np.array([
[100, 10], # high spender, frequent visitor
[150, 12],
[50, 3], # low spender, rare visitor
[45, 2],
[500, 50], # very high spender
[480, 48]
])
# Standardize features
X_scaled = StandardScaler().fit_transform(X)
# Find optimal k using elbow method
inertias = []
for k in range(1, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(range(1, 6), inertias, 'bo-')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Use k=2 (elbow point)
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
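# Silhouette score (mentioned above) as a second check on cluster quality.
# A minimal sketch; values near 1 mean tight, well-separated clusters
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, clusters)
print(f"Silhouette score: {score:.2f}")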
print(f"Clusters: {clusters}") # [0, 0, 1, 1, 0, 0]Dimensionality Reduction
Problem: Too many features → overfitting, slowness
Solution: Reduce to most important features
PCA (Principal Component Analysis):
1. Find directions with most variance
2. Project data onto these directions
3. Keep top k components
Result: Fewer dimensions, most information preserved
from sklearn.decomposition import PCA
# High-dimensional data
X = np.random.randn(100, 50) # 100 samples, 50 features
# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {sum(pca.explained_variance_ratio_):.2%}")4. Working with Data in Python
Pandas Basics
import pandas as pd
import numpy as np
# Create DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)
# Basic operations
print(df.head()) # First rows
print(df.info()) # Data types, nulls
print(df.describe()) # Statistical summary
print(df['Salary'].mean()) # Column mean
# Filtering
high_earners = df[df['Salary'] > 55000]
print(high_earners)
# Grouping
grouped = df.groupby('Age')['Salary'].mean()
print(grouped)
# Merging
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Department': ['IT', 'HR']
})
merged = df.merge(df2, on='Name')
print(merged)
# Handling missing data
df_missing = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8]
})
print(df_missing.fillna(0)) # Replace NaN with 0
print(df_missing.dropna()) # Remove rows with NaN
print(df_missing.interpolate()) # Linear interpolation
NumPy for Numerical Computing
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
print(np.mean(arr)) # Average
print(np.median(arr)) # Median
print(np.std(arr)) # Standard deviation
print(np.sum(arr)) # Sum
# Linear algebra
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b)) # Matrix multiplication
print(np.linalg.inv(a)) # Matrix inverse
eigenvalues, eigenvectors = np.linalg.eig(a)
5. Data Visualization
Matplotlib
import matplotlib.pyplot as plt
# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y, label='sin(x)')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Sine Function')
plt.legend()
plt.show()
# Histogram
data = np.random.normal(100, 15, 1000)
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Normal Distribution')
plt.show()
# Scatter plot
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)
plt.scatter(x, y, alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()
Seaborn (Statistical Visualization)
import seaborn as sns
# Correlation heatmap
data = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
# Distribution plot
sns.histplot(data['A'], kde=True)
plt.show()
# Box plot (quartiles, outliers)
sns.boxplot(data=data)
plt.show()
# Pairplot (all correlations)
df = pd.DataFrame({
'X1': np.random.randn(50),
'X2': np.random.randn(50),
'X3': np.random.randn(50)
})
sns.pairplot(df)
plt.show()
6. Model Evaluation
Classification Metrics
Confusion Matrix:
                     Predicted Positive | Predicted Negative
Actual Positive:     TP                 | FN
Actual Negative:     FP                 | TN
TP (True Positive): Correctly predicted positive
FP (False Positive): Incorrectly predicted positive (Type I error)
TN (True Negative): Correctly predicted negative
FN (False Negative): Incorrectly predicted negative (Type II error)
Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) - of predicted positive, how many correct?
Recall = TP / (TP + FN) - of actual positive, how many found?
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
from sklearn.metrics import confusion_matrix, classification_report
y_true = [0, 1, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 0]
#  [2 3]]
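# Precision, recall, and F1 computed directly, matching the formulas above.
# A minimal sketch using sklearn's metric functions
from sklearn.metrics import precision_score, recall_score, f1_score
print(f"Precision: {precision_score(y_true, y_pred):.2f}") # 3 / (3 + 0) = 1.00
print(f"Recall: {recall_score(y_true, y_pred):.2f}")        # 3 / (3 + 2) = 0.60
print(f"F1: {f1_score(y_true, y_pred):.2f}")                # 2 × (1.0 × 0.6) / 1.6 = 0.75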
# Classification report
print(classification_report(y_true, y_pred))
Regression Metrics
Mean Absolute Error (MAE):
Average absolute difference between predicted and actual
Root Mean Squared Error (RMSE):
Square root of average squared differences
Penalizes large errors more than MAE
R² Score (Coefficient of Determination):
Proportion of variance explained
0 = model useless
1 = perfect predictions
from sklearn.metrics import mean_absolute_error, r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0, 2, 8]
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae}") # 0.5
print(f"R²: {r2}") # 0.957. Real-World Example: Predicting House Prices
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
# 1. Load data
df = pd.read_csv('house_prices.csv')
# 2. Explore
print(df.head())
print(df.info())
print(df.describe())
# 3. Handle missing data
df = df.dropna()
# 4. Feature engineering
df['price_per_sqft'] = df['price'] / df['square_feet'] # useful for exploration, but computed from the target
df['age'] = 2024 - df['year_built']
# 5. Prepare data (drop the target and price_per_sqft; a feature derived from price would leak the answer)
X = df.drop(columns=['price', 'price_per_sqft'])
y = df['price']
# 6. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 7. Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 8. Train
model = RandomForestRegressor(n_estimators=100, max_depth=10)
model.fit(X_train_scaled, y_train)
# 9. Evaluate
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R² Score: {r2:.3f}")
print(f"RMSE: ${rmse:,.0f}")
# 10. Feature importance
importance = pd.DataFrame({
'Feature': X.columns,
'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)
Key Takeaways
- Statistics First - Probability, distributions, hypothesis testing foundation
- Exploratory Data Analysis (EDA) - Understand data before modeling
- Supervised Learning - Regression for continuous, classification for categories
- Unsupervised Learning - Clustering for discovery
- Feature Engineering - Create meaningful features from raw data
- Model Evaluation - Always measure on test set, use appropriate metrics
- Avoid Overfitting - Use train/test split, regularization, cross-validation (see the short sketch after this list)
- Real Data is Messy - Handle missing values, outliers, imbalanced classes
- Iterate - Try different models, tune hyperparameters
- Domain Knowledge - ML + business understanding = valuable insights
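Cross-validation is listed above but never shown in this guide, so here is a minimal sketch, assuming scikit-learn and reusing the iris data from the classification example:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
iris = load_iris()
# 5-fold CV: train and evaluate on 5 different splits instead of a single train/test split
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), iris.data, iris.target, cv=5)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")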