Data Science Fundamentals - Complete Mastery
Introduction: The Data Revolution
Data science sits at the intersection of statistics, programming, and domain expertise. By one widely cited estimate, the world generates about 2.5 quintillion bytes of data daily. Understanding data science is essential for:
- Making data-driven decisions
- Building predictive models
- Automating decision-making
- Understanding customer behavior
- Optimizing business processes
- Creating competitive advantages
This guide covers core concepts with practical implementations.
1. Statistics Fundamentals
Descriptive Statistics
Measures of Central Tendency:
Mean (Average):
μ = (x₁ + x₂ + ... + xₙ) / n
Example: [2, 4, 6, 8] → mean = 20/4 = 5
Use: General average, affected by outliers
Median (Middle Value):
50th percentile - half values above, half below
Example: [2, 4, 6, 8] → median = (4 + 6) / 2 = 5
Use: When outliers present (robust)
Mode (Most Frequent):
Value appearing most often
Example: [1, 2, 2, 3, 3, 3] → mode = 3
Use: Categorical data
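Python's standard-library statistics module computes all three measures directly; a quick check on the small lists above (illustrative values only):

```python
import statistics

grades = [1, 2, 2, 3, 3, 3]       # same data as the mode example above
print(statistics.mean(grades))    # 2.3333... (14/6)
print(statistics.median(grades))  # 2.5 (average of the two middle values)
print(statistics.mode(grades))    # 3 (most frequent value)
```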
Choosing the right measure:
- Symmetric distribution: mean ≈ median
- Skewed right: mean > median
- Skewed left: mean < median

import numpy as np
import pandas as pd
data = [2, 4, 6, 8, 100] # 100 is outlier
print(f"Mean: {np.mean(data)}")      # 24.0 (pulled up by the outlier)
print(f"Median: {np.median(data)}")  # 6.0 (robust to the outlier)
# Measures of Spread
print(f"Range: {max(data) - min(data)}")     # 98
print(f"Variance: {np.var(data)}")           # 1448.0 (spread squared)
print(f"Std Deviation: {np.std(data):.2f}")  # 38.05 (square root of variance)
print(f"IQR: Q3 - Q1 = {np.percentile(data, 75) - np.percentile(data, 25)}")  # 4.0

Probability Distributions
Normal Distribution (Bell Curve):
- Mean = median = mode
- 68% within 1 std dev
- 95% within 2 std devs
- 99.7% within 3 std devs
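The 68–95–99.7 rule can be verified numerically from the normal CDF (a small sanity check, not part of the original text):

```python
from scipy.stats import norm

# P(mu - k*sigma < X < mu + k*sigma) for a standard normal
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} std dev(s): {p:.1%}")
# within 1 std dev(s): 68.3%
# within 2 std dev(s): 95.4%
# within 3 std dev(s): 99.7%
```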
Real examples:
- Human heights
- Test scores
- Measurement errors
- Many natural phenomena
Probability Density Function (PDF):
f(x) = 1/(σ√(2π)) * e^(-(x-μ)²/(2σ²))
μ = mean, σ = standard deviation

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Generate normal distribution
data = np.random.normal(loc=100, scale=15, size=10000)
plt.hist(data, bins=50, density=True, alpha=0.7, label='Data')
# Plot theoretical distribution
x = np.linspace(data.min(), data.max(), 100)
plt.plot(x, norm.pdf(x, loc=100, scale=15), 'r-', label='Normal Distribution')
plt.legend()
plt.show()
# Probability calculations
# What's P(X < 115)?
prob = norm.cdf(115, loc=100, scale=15)  # 0.8413 (84.13%)

Hypothesis Testing
Concept: Is observed difference real or random chance?
Process:
1. Null hypothesis (H₀): No difference
2. Alternative hypothesis (H₁): Difference exists
3. Collect data
4. Calculate p-value
5. Compare to significance level (α = 0.05)
If p-value < α: Reject H₀ (difference is significant)
If p-value ≥ α: Fail to reject H₀ (no evidence of difference)
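These five steps can be run end-to-end with an exact binomial test — for instance, testing whether a coin that lands heads 65 times in 100 flips is fair (using `scipy.stats.binomtest`, available in SciPy ≥ 1.7):

```python
from scipy.stats import binomtest

# H0: p = 0.5 (fair coin); two-sided alternative by default
result = binomtest(k=65, n=100, p=0.5)
print(f"p-value: {result.pvalue:.4f}")  # ~0.0035
if result.pvalue < 0.05:
    print("Reject H0: evidence the coin is biased")
```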
Example: Testing if coin is fair
H₀: p = 0.5 (fair coin)
H₁: p ≠ 0.5 (unfair coin)
Flip 100 times → 65 heads
p-value ≈ 0.0035 < 0.05 → Reject H₀ (coin is biased)

from scipy import stats
# T-test: Compare two groups
group1 = [85, 88, 92, 78, 95]  # Control group scores
group2 = [92, 95, 98, 91, 94]  # Treatment group scores
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference between groups")
else:
    print("No significant difference")
# Chi-square test: Categorical data
from scipy.stats import chi2_contingency
# Example: test whether two categorical variables are independent
observed = [[20, 30], [25, 25]]  # Contingency table
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.4f}")

2. Probability and Bayesian Thinking
Basic Probability
P(A) = favorable outcomes / total outcomes
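As a quick check of this counting definition, the sample space for two dice can be enumerated directly (an illustrative example):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
favorable = [o for o in outcomes if sum(o) == 7]
print(f"P(sum = 7) = {len(favorable)}/{len(outcomes)}")  # 6/36 ≈ 0.1667
```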
Probability Rules:
- P(A) ∈ [0, 1] - probability between 0 and 1
- P(A) + P(not A) = 1
- P(A and B) = P(A) × P(B) - if independent
- P(A or B) = P(A) + P(B) - P(A and B)

Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
Interpretation:
P(A|B) = Posterior (what we want to know)
P(B|A) = Likelihood (how likely is evidence given hypothesis)
P(A) = Prior (belief before seeing evidence)
P(B) = Evidence (normalizing factor)
Real Example: Medical Test
Disease exists: A
Test positive: B
P(A|B) = P(disease | positive test) = ?
Given:
- P(A) = 0.01 (1% of population has disease)
- P(B|A) = 0.99 (test 99% accurate if disease present)
- P(B|not A) = 0.02 (test 2% false positive rate)
P(B) = P(B|A)×P(A) + P(B|not A)×P(not A)
= 0.99×0.01 + 0.02×0.99 = 0.0297
P(A|B) = 0.99×0.01 / 0.0297 = 0.333 (only 33.3% sure!)
Insight: Low disease prevalence + imperfect test = surprising result

# Bayes' theorem implementation
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    """Calculate P(A|B) using Bayes' theorem"""
    p_not_a = 1 - p_a
    p_b = (p_b_given_a * p_a) + (p_b_given_not_a * p_not_a)
    p_a_given_b = (p_b_given_a * p_a) / p_b
    return p_a_given_b
# Medical test example
p_disease = 0.01
p_positive_if_disease = 0.99
p_positive_if_no_disease = 0.02
result = bayes_theorem(p_disease, p_positive_if_disease, p_positive_if_no_disease)
print(f"P(Disease | Positive Test) = {result:.4f}")  # 0.3333

3. Machine Learning Fundamentals
Supervised Learning
Regression: Predict continuous values
Goal: Find relationship between features and continuous target
Classic Problem: Predict house price
Features: square feet, bedrooms, location
Target: price
Example Output:
Input: 2000 sq ft, 3 bedrooms → Output: $450,000
Algorithms:
- Linear Regression: y = mx + b
- Polynomial Regression: y = ax² + bx + c
- Ridge/Lasso Regression: regularized linear models

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Sample data: hours studied → test score
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([50, 60, 70, 75, 85, 90])
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
print(f"Coefficient: {model.coef_[0]:.2f}")  # 8.00
print(f"Intercept: {model.intercept_:.2f}")  # 43.67
# Evaluate
r2 = r2_score(y, predictions)
rmse = np.sqrt(mean_squared_error(y, predictions))
print(f"R² Score: {r2:.3f}") # 0-1, higher better
print(f"RMSE: {rmse:.2f}") # lower better
# Predict new value
new_hours = np.array([[7]])
predicted_score = model.predict(new_hours)
print(f"Study 7 hours → Expected score: {predicted_score[0]:.0f}")  # ~100

Classification: Predict categories
Goal: Predict which category data belongs to
Classic Problem: Email classification
Features: text, sender, time
Target: spam or not spam
Algorithms:
- Logistic Regression: Binary classification
- Decision Trees: Non-linear, interpretable
- Random Forest: Ensemble, robust
- SVM: High-dimensional spaces
- Neural Networks: Complex patterns

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Load data
iris = load_iris()
X = iris.data
y = iris.target
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train random forest
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")  # typically 0.93–1.00 on iris
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")

Unsupervised Learning
Clustering: Group similar data
Goal: Find natural groupings without labels
Classic Problem: Customer segmentation
Data: purchase history, demographics
Output: Groups of similar customers
K-Means Algorithm:
1. Choose k (number of clusters)
2. Initialize k random centers
3. Assign points to nearest center
4. Update centers
5. Repeat until converged
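Those five steps translate almost line-for-line into NumPy; a minimal from-scratch sketch (the toy data and variable names here are illustrative, not from the original):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    """Minimal K-Means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize k centers from randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels, _ = kmeans(X, k=2)
print(labels)  # the two tight pairs land in separate clusters
```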
Choosing k:
- Elbow method: plot inertia vs k
- Silhouette score: measure cluster quality
- Domain knowledge: what makes sense?

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample customer data: spending, visits
X = np.array([
    [100, 10],  # high spender, frequent visitor
    [150, 12],
    [50, 3],    # low spender, rare visitor
    [45, 2],
    [500, 50],  # very high spender
    [480, 48]
])
# Standardize features
X_scaled = StandardScaler().fit_transform(X)
# Find optimal k using elbow method
inertias = []
for k in range(1, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(range(1, 6), inertias, 'bo-')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Use k=2 (elbow point)
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
print(f"Clusters: {clusters}")  # the two very high spenders form one cluster, the rest the other
# Cluster quality: silhouette score (closer to 1 = better separated)
from sklearn.metrics import silhouette_score
print(f"Silhouette: {silhouette_score(X_scaled, clusters):.2f}")

Dimensionality Reduction
Problem: Too many features → overfitting, slowness
Solution: Reduce to most important features
PCA (Principal Component Analysis):
1. Find directions with most variance
2. Project data onto these directions
3. Keep top k components
Result: Fewer dimensions, most information preserved

from sklearn.decomposition import PCA
# High-dimensional data
X = np.random.randn(100, 50) # 100 samples, 50 features
# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {sum(pca.explained_variance_ratio_):.2%}")

4. Working with Data in Python
Pandas Basics
import pandas as pd
import numpy as np
# Create DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 75000]
}
df = pd.DataFrame(data)
# Basic operations
print(df.head()) # First rows
print(df.info()) # Data types, nulls
print(df.describe()) # Statistical summary
print(df['Salary'].mean()) # Column mean
# Filtering
high_earners = df[df['Salary'] > 55000]
print(high_earners)
# Grouping
grouped = df.groupby('Age')['Salary'].mean()
print(grouped)
# Merging
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Department': ['IT', 'HR']
})
merged = df.merge(df2, on='Name')
print(merged)
# Handling missing data
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})
print(df_missing.fillna(0)) # Replace NaN with 0
print(df_missing.dropna()) # Remove rows with NaN
print(df_missing.interpolate())  # Linear interpolation

NumPy for Numerical Computing
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
print(np.mean(arr)) # Average
print(np.median(arr)) # Median
print(np.std(arr)) # Standard deviation
print(np.sum(arr)) # Sum
# Linear algebra
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b)) # Matrix multiplication
print(np.linalg.inv(a)) # Matrix inverse
eigenvalues, eigenvectors = np.linalg.eig(a)

5. Data Visualization
Matplotlib
import matplotlib.pyplot as plt
# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y, label='sin(x)')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Sine Function')
plt.legend()
plt.show()
# Histogram
data = np.random.normal(100, 15, 1000)
plt.hist(data, bins=30, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Normal Distribution')
plt.show()
# Scatter plot
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)
plt.scatter(x, y, alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()

Seaborn (Statistical Visualization)
import seaborn as sns
# Correlation heatmap
data = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
# Distribution plot
sns.histplot(data['A'], kde=True)
plt.show()
# Box plot (quartiles, outliers)
sns.boxplot(data=data)
plt.show()
# Pairplot (all correlations)
df = pd.DataFrame({
    'X1': np.random.randn(50),
    'X2': np.random.randn(50),
    'X3': np.random.randn(50)
})
sns.pairplot(df)
plt.show()

6. Model Evaluation
Classification Metrics
Confusion Matrix:
                  Predicted Positive | Predicted Negative
Actual Positive:         TP          |        FN
Actual Negative:         FP          |        TN
TP (True Positive): Correctly predicted positive
FP (False Positive): Incorrectly predicted positive (Type I error)
TN (True Negative): Correctly predicted negative
FN (False Negative): Incorrectly predicted negative (Type II error)
Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) - of predicted positive, how many correct?
Recall = TP / (TP + FN) - of actual positive, how many found?
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

from sklearn.metrics import confusion_matrix, classification_report
y_true = [0, 1, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
# Confusion matrix (rows = actual, columns = predicted; class order 0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 0]
#  [2 3]]
# By hand, for class 1: TP = 3, FP = 0, FN = 2
# Precision = 3/3 = 1.0, Recall = 3/5 = 0.6, F1 = 0.75
# Classification report
print(classification_report(y_true, y_pred))

Regression Metrics
Mean Absolute Error (MAE):
Average absolute difference between predicted and actual
Root Mean Squared Error (RMSE):
Square root of average squared differences
Penalizes large errors more than MAE
R² Score (Coefficient of Determination):
Proportion of variance explained
1 = perfect predictions; 0 = no better than predicting the mean (it can even be negative)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0, 2, 8]
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae}")        # 0.5
print(f"RMSE: {rmse:.3f}")  # ~0.612
print(f"R²: {r2:.3f}")      # ~0.949

7. Real-World Example: Predicting House Prices
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
# 1. Load data
df = pd.read_csv('house_prices.csv')
# 2. Explore
print(df.head())
print(df.info())
print(df.describe())
# 3. Handle missing data
df = df.dropna()
# 4. Feature engineering
df['age'] = 2024 - df['year_built']
# Note: avoid features derived from the target (e.g. price per square foot),
# which would leak 'price' into the inputs
# 5. Prepare data
X = df.drop(columns=['price'])
y = df['price']
# 6. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 7. Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 8. Train
model = RandomForestRegressor(n_estimators=100, max_depth=10)
model.fit(X_train_scaled, y_train)
# 9. Evaluate
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R² Score: {r2:.3f}")
print(f"RMSE: ${rmse:,.0f}")
# 10. Feature importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)

Key Takeaways
- Statistics First - Probability, distributions, hypothesis testing foundation
- Exploratory Data Analysis (EDA) - Understand data before modeling
- Supervised Learning - Regression for continuous, classification for categories
- Unsupervised Learning - Clustering for discovery
- Feature Engineering - Create meaningful features from raw data
- Model Evaluation - Always measure on test set, use appropriate metrics
- Avoid Overfitting - Use train/test split, regularization, cross-validation
- Real Data is Messy - Handle missing values, outliers, imbalanced classes
- Iterate - Try different models, tune hyperparameters
- Domain Knowledge - ML + business understanding = valuable insights
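The cross-validation mentioned above can be sketched with scikit-learn's cross_val_score (iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold CV: train on 4 folds, score on the held-out 5th, then rotate
scores = cross_val_score(model, X, y, cv=5)
print(f"Fold accuracies: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
```

Averaging over folds gives a more stable estimate than a single train/test split, at the cost of training the model k times.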