
Evaluating handcrafted versus deep features for fine-grained food classification
Deep convolutional descriptors slice through frosting-level nuances; handcrafted stats simply can’t keep up.
We compared handcrafted descriptors with CNN-derived features for classifying 15 cake categories (1,800 images). An MLP fed low-level statistics (color histogram, edge-direction histogram, grey-level co-occurrence) plateaued at 31% accuracy, while the same MLP trained on PVMLNet feature maps (layer −5) reached 90% test accuracy. Transfer learning by fine-tuning PVMLNet achieved 80%, underscoring the value of deep representations even when fine-tuning is not the best option.
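The core of the winning pipeline is simple: take an intermediate activation tensor from the pretrained CNN, flatten it per image, and normalise before the MLP head. A minimal sketch with NumPy, using randomly generated tensors as a stand-in for real PVMLNet layer −5 activations (the shapes here are illustrative assumptions, not PVMLNet's actual dimensions):

```python
import numpy as np

# Hypothetical stand-in: in the real pipeline these would be PVMLNet
# layer -5 activations; here we just simulate a small batch of feature maps.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 7, 7, 512))  # (batch, height, width, channels)

# Flatten each image's feature map into a single feature vector.
features = acts.reshape(acts.shape[0], -1)  # shape (4, 7*7*512)

# Mean-variance normalisation, with statistics computed on the training set.
mu = features.mean(axis=0)
sigma = features.std(axis=0) + 1e-8  # guard against zero variance
features_norm = (features - mu) / sigma

print(features_norm.shape)  # (4, 25088)
```

The normalised vectors then replace the handcrafted descriptors as MLP input; nothing else in the training loop changes.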
The dataset covers 15 cake types (chocolate, tiramisu, cheesecake, etc.), split 100/20 images per class for train/test. Handcrafted descriptors — color histograms, edge-direction histograms, grey-level co-occurrence matrices — are concatenated and normalised (mean-variance, min-max, or max-abs) before feeding an MLP. Despite tuning, performance stagnates around 31%, largely due to high intra-class variability.

Switching to PVMLNet, intermediate activations from layers −1 to −7 are compared; the flattened layer −5 activations deliver 90% accuracy and converge in fewer than 100 epochs. Transfer learning then replaces PVMLNet's final layer with the trained MLP head, but even full fine-tuning settles at 80%, below the feature-extraction approach.

Error analysis via confusion matrices flags persistent confusions (e.g., chocolate-mousse vs ice-cream cake) and suggests directions for future data augmentation.
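For reference, the handcrafted baseline can be sketched compactly. The following is a toy NumPy version of the three descriptors named above, run on a synthetic image; bin counts and quantisation levels are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)  # toy RGB image

# Color histogram: 8 bins per channel, concatenated and normalised to sum to 1.
color_hist = np.concatenate([
    np.histogram(img[..., c], bins=8, range=(0, 256))[0] for c in range(3)
]).astype(np.float64)
color_hist /= color_hist.sum()

# Edge-direction histogram from finite-difference gradients of the grey image.
grey = img.mean(axis=2)
gx = grey[:, 1:] - grey[:, :-1]          # horizontal gradient, shape (64, 63)
gy = grey[1:, :] - grey[:-1, :]          # vertical gradient, shape (63, 64)
angles = np.arctan2(gy[:, :-1], gx[:-1, :])  # crop to common shape (63, 63)
edge_hist, _ = np.histogram(angles, bins=8, range=(-np.pi, np.pi))
edge_hist = edge_hist / edge_hist.sum()

# Grey-level co-occurrence matrix: horizontal neighbour, 8 grey levels.
q = (grey / 32).astype(int).clip(0, 7)   # quantise to 8 levels
glcm = np.zeros((8, 8))
for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
    glcm[a, b] += 1
glcm /= glcm.sum()

# Concatenate everything into one feature vector for the MLP.
feats = np.concatenate([color_hist, edge_hist, glcm.ravel()])
print(feats.shape)  # (96,) = 24 color + 8 edge + 64 co-occurrence entries
```

Descriptors built this way capture global colour and texture statistics but, as the 31% ceiling shows, they cannot separate visually similar cake classes the way learned features do.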