Classification Model
Project Overview
This project implements a machine learning classifier that automatically categorizes Piazza forum posts using natural language processing and Bayesian classification techniques. The system trains on historical labeled data to learn statistical patterns about word distributions across different categories, then applies this learned knowledge to predict labels for new, unseen posts. The classifier can categorize posts either by topic (e.g., "euchre", "calculator", "exam") or by author type ("instructor" vs "student"), demonstrating practical applications of supervised machine learning in text classification.
Technical Architecture
The system follows a classic machine learning pipeline architecture with three primary stages: training, model representation, and prediction. During training, the classifier ingests CSV-formatted Piazza posts and builds a statistical model by computing various probability distributions. The core data structures leverage C++ Standard Library containers—particularly std::map and std::set—to efficiently store and retrieve word frequencies, label counts, and conditional probabilities. The architecture employs a bag-of-words model that abstracts posts into sets of unique words, discarding word order and frequency information to simplify the classification problem while maintaining reasonable accuracy.
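A minimal sketch of the state such a design implies is shown below; the class and member names are hypothetical, chosen for illustration rather than taken from the project's actual API:

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative state for a bag-of-words Bernoulli classifier; names are
// hypothetical and may not match the project's actual members.
class Classifier {
public:
    int num_posts = 0;                            // total training posts
    std::set<std::string> vocabulary;             // every unique word seen
    std::map<std::string, int> label_counts;      // number of posts per label
    std::map<std::string, int> word_counts;       // number of posts containing each word
    std::map<std::string, std::map<std::string, int>>
        label_word_counts;                        // posts with label C containing word w

    void train(const std::string &label, const std::string &content);
    std::string predict(const std::string &content) const;
    double log_likelihood(const std::string &label,
                          const std::string &word) const;
};
```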
Core Algorithmic Components
Multivariate Bernoulli Naive Bayes Classification: The classifier implements a simplified version of Naive Bayes that treats each word as a binary feature (present or absent). It computes log-probability scores rather than raw probabilities to avoid numerical underflow issues common when multiplying many small probabilities.
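The motivation is easy to demonstrate in isolation: multiplying a few hundred small word probabilities underflows a double to zero, while summing their logarithms stays finite. A self-contained illustration:

```cpp
#include <cmath>
#include <iostream>

int main() {
    double product = 1.0;
    double log_sum = 0.0;
    for (int i = 0; i < 500; ++i) {
        product *= 1e-3;            // underflows to 0.0 well before i == 500
        log_sum += std::log(1e-3);  // stays finite: 500 * log(1e-3)
    }
    std::cout << product << "\n";   // prints 0
    std::cout << log_sum << "\n";   // prints roughly -3453.88
    return 0;
}
```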
Training Algorithm: During training, the system maintains several statistical counts: total posts, vocabulary size, per-label post counts, per-word document frequency, and per-label-word co-occurrence counts. These statistics form the foundation for all probability calculations.
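Continuing the hypothetical class sketched above, a training pass might update those counts as follows (unique_words is an assumed helper, not a documented project function):

```cpp
#include <sstream>

// Hypothetical helper: split a string on whitespace into a set of unique words.
std::set<std::string> unique_words(const std::string &text) {
    std::istringstream source(text);
    std::set<std::string> words;
    std::string word;
    while (source >> word) {
        words.insert(word);
    }
    return words;
}

// Record one labeled post, updating every count the predictor will need.
void Classifier::train(const std::string &label, const std::string &content) {
    ++num_posts;
    ++label_counts[label];
    for (const std::string &word : unique_words(content)) {
        vocabulary.insert(word);
        ++word_counts[word];
        ++label_word_counts[label][word];
    }
}
```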
Prediction Algorithm: For each test post, the classifier computes a log-probability score for every possible label by summing the log-prior probability of that label with the log-likelihoods of each unique word in the post given that label. The label with the highest aggregate score becomes the prediction. The algorithm includes smoothing techniques to handle unseen words gracefully—using fallback probability estimates when words appear in the test data but weren't observed with a particular label during training.
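A sketch of that scoring loop, again in terms of the hypothetical members above:

```cpp
#include <cmath>
#include <limits>

// Score every label and return the argmax. log_likelihood (next section)
// supplies log P(word | label), including the fallback cases.
std::string Classifier::predict(const std::string &content) const {
    const std::set<std::string> words = unique_words(content);
    std::string best_label;
    double best_score = -std::numeric_limits<double>::infinity();
    for (const auto &[label, count] : label_counts) {
        // log-prior: fraction of training posts carrying this label
        double score = std::log(static_cast<double>(count) / num_posts);
        for (const std::string &word : words) {
            score += log_likelihood(label, word);  // add log P(word | label)
        }
        if (score > best_score) {
            best_score = score;
            best_label = label;
        }
    }
    return best_label;
}
```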
Probability Estimation with Fallback Smoothing: The system implements a three-tiered probability estimate: the standard conditional probability for word-label pairs seen during training, the word's marginal probability for words seen overall but never with the specific label, and a uniform smoothing value for completely novel words.
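One way to realize the three tiers is sketched below; the tier-2 and tier-3 formulas (marginal count over total posts, and 1/num_posts) are plausible choices for illustration, not necessarily the exact constants the project specifies:

```cpp
// Three-tier log-likelihood estimate; fallback formulas are assumptions.
double Classifier::log_likelihood(const std::string &label,
                                  const std::string &word) const {
    auto by_label = label_word_counts.find(label);
    if (by_label != label_word_counts.end()) {
        auto seen = by_label->second.find(word);
        if (seen != by_label->second.end()) {
            // Tier 1: word was observed with this label during training
            return std::log(static_cast<double>(seen->second) /
                            label_counts.at(label));
        }
    }
    auto anywhere = word_counts.find(word);
    if (anywhere != word_counts.end()) {
        // Tier 2: word was observed in training, but never with this label
        return std::log(static_cast<double>(anywhere->second) / num_posts);
    }
    // Tier 3: word never appeared anywhere in the training data
    return std::log(1.0 / num_posts);
}
```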
Performance & Design Considerations
The implementation prioritizes efficiency through careful data structure selection and algorithmic design. std::map provides O(log n) lookups for word and label statistics via direct access ([] and find()) rather than linear scans, while std::set eliminates duplicate words during extraction. A single-pass architecture analyzes each post exactly once instead of holding the full dataset in memory, keeping the footprint small. The bag-of-words model keeps feature extraction lightweight by ignoring word order and frequency, and working in log space replaces long products of small probabilities with sums, preventing floating-point underflow. Further optimizations include passing strings and containers by reference rather than by value, using reference-based range loops, and avoiding redundant file I/O. Together these choices let the classifier process the largest datasets (3000+ posts) in under a minute while achieving approximately 74% accuracy on cross-term project classification and 87% accuracy on instructor/student identification.
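For example, the read-only lookup idiom the paragraph alludes to looks like this (a generic sketch, not project code):

```cpp
#include <map>
#include <string>

// Read-only lookup idiom: a const reference avoids copying the map, and
// find() avoids operator[]'s side effect of inserting a default value.
int count_for(const std::map<std::string, int> &counts,
              const std::string &key) {
    auto it = counts.find(key);            // O(log n) search, no insertion
    return it != counts.end() ? it->second : 0;
}
```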