In image classification tasks, the input data is often high-dimensional, containing a large number of features. For example, a 64×64 RGB image consists of 3 color channels, resulting in 64 × 64 × 3 = 12,288 numerical pixel values. When we feed all 12,288 raw pixel values directly into traditional machine learning models, several issues typically arise:
- Slow processing speed
- Difficulty learning the important features
- High sensitivity to noise
- Increased risk of overfitting
Principal Component Analysis (PCA) is a dimensionality-reduction technique that preserves the most important information in the data while reducing noise. To illustrate how PCA works in practice, this article uses a clear and intuitive example: distinguishing between green apples and red apples using Scikit-learn - a simple task that effectively demonstrates the benefits of PCA in image classification.

1. What is Principal Component Analysis (PCA)?
Principal Component Analysis (PCA) is a statistical method that helps to:
- Identify the directions where the data varies the most (the principal components)
- Transform the data into a new coordinate system
- Retain the most important components
- Reduce the dimensionality while preserving the main structure of the information
To understand this concept clearly, we first need to explain what a “dimension” is in image data.
Understanding “Dimension” in Image Data
In machine learning, each feature in the data is considered one dimension.
- A data table with 3 columns corresponds to 3 dimensions.
- A vector with 10 values represents 10 dimensions.
When applied to images:
- A 64×64 RGB image contains 3 color channels, resulting in 12,288 numerical pixel values.
- Each pixel contributes one intensity value per color channel, and each of these values is a feature.
- The entire image can be represented as a 12,288-dimensional vector.
In short, the number of dimensions is the number of independent pieces of information the model must handle. Because an image contains thousands of pixel values, it has thousands of dimensions, meaning the model must process a large amount of highly varied information.
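To make this concrete, here is a minimal sketch (the file name apple.jpg is just a placeholder) showing how one 64×64 RGB image becomes a 12,288-dimensional feature vector:
import numpy as np
from PIL import Image

# "apple.jpg" is a placeholder path for any input image
img = Image.open("apple.jpg").convert("RGB").resize((64, 64))
arr = np.array(img)            # shape: (64, 64, 3)
vector = arr.flatten()         # shape: (12288,)
print(arr.shape, vector.shape)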
How Does PCA Work?
PCA has two main purposes:
- First, it identifies the directions in the data along which the variance is largest; these directions are known as the principal components.
- Second, by keeping only these high-variance components and discarding low-variance information (often noise), the method effectively reduces the dimensionality of the data.
Applying PCA to the green–red apple example:
- High-variance regions such as the apple’s contour, stem area, and strong color differences between green and red are preserved.
- Low-variance regions such as uniform background, shadows, or noise on the table surface are filtered out.
As a result, the original 12,288-dimensional RGB image can be compressed to around 100 dimensions, while still keeping the core information needed for the model to classify the image.
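The number of components to keep is a design choice. One common approach, sketched below on synthetic data (not the apple images), is to pass a fraction to scikit-learn's PCA so that it keeps just enough components to explain that share of the variance:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 12288))    # synthetic stand-in for 200 flattened images

pca = PCA(n_components=0.95, svd_solver="full")   # keep ~95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                            # (200, number_of_kept_components)
print(pca.explained_variance_ratio_.sum())        # close to 0.95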
2. Why Is PCA Important in Image Classification?
2.1. Images Are High-dimensional
In our example, a 64×64 RGB image corresponds to 12,288 numerical pixel values, which means it has 12,288 dimensions. In traditional machine learning models, this leads to several problems:
- Slow processing speed
- Difficult optimization
- Difficulty separating green apples from red apples
⇒ PCA reduces the number of dimensions to around 100, helping the model learn faster and with greater stability.
2.2. Many Pixels Are Irrelevant to Classification
For example:
- Wall background
- Uniform dark regions
- Grass
- Details that are far from the main object
⇒ These regions do not help distinguish between green apples and red apples, so PCA automatically reduces their influence.
2.3. PCA Highlights the Important Features
Thanks to PCA, the model focuses on:
- Overall apple shape
- Strong color differences between green and red regions
- Surface texture patterns
- Brightness and shading variations on the peel
- Stem area and contour lines
- Local color gradients unique to each apple type
These are the features that truly help the model classify the image.
2.4. PCA Improves the Performance of Traditional Models
Models such as Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest struggle to handle 12,288-dimensional raw pixel inputs efficiently.
PCA transforms the images into a new feature space, allowing these models to:
- Separate the classes more easily: PCA creates a new feature space where green and red apple images become “farther apart” in a mathematical sense (distance/variance). When the two classes are farther apart, the classifier can draw the decision boundary much more easily.
- Generalize better: By removing noise and redundant information, the model focuses only on meaningful features.
- Achieve higher accuracy: Since the input becomes cleaner and more compact, the classifier makes more reliable predictions.
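A convenient way to wire these pieces together is scikit-learn's Pipeline. The sketch below (not the exact code used in section 3, but equivalent in spirit) chains centering, PCA, and Logistic Regression so that all steps are fitted together:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# X (shape (N, 12288)) and y (shape (N,)) are assumed to come from a loader
# like the one shown in section 3.2.
clf = make_pipeline(
    StandardScaler(with_std=False),     # center the pixel values
    PCA(n_components=100),              # 12,288 dimensions reduced to 100
    StandardScaler(),                   # standardize the PCA features
    LogisticRegression(max_iter=2000),
)
# Usage: clf.fit(X_train, y_train); clf.score(X_test, y_test)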
3. Using Scikit-learn to Illustrate PCA in Image Classification
3.1. Preparing the environment and libraries
Install the three libraries used in this example (NumPy, Pillow, and scikit-learn):
python -m pip install numpy pillow scikit-learn
3.2. Data normalization
This step is part of the preprocessing stage, where raw images are normalized and converted into a numerical format suitable for PCA and machine learning models.
# dataset_loader.py
import os

import numpy as np            # Array handling and flattening
from PIL import Image         # Open images, convert to RGB, resize


def load_images(folder, label):
    """
    1. Load images from a given folder.
    2. Convert them to RGB (keep color information).
    3. Resize them to 64×64.
    4. Normalize to [0, 1] and flatten to 1D vectors (64×64×3 = 12,288 dimensions).
    5. Return both the processed images and their corresponding labels
       (0 for green_apple, 1 for red_apple).

    Parameters
    ----------
    folder : str
        Path to the folder containing the images.
    label : int
        The numerical label assigned to all images in this folder
        (0 for green_apple, 1 for red_apple).

    Returns
    -------
    images : numpy.ndarray
        Array of shape (N, 12288), where N is the number of images.
        Each row represents one flattened RGB image.
    labels : numpy.ndarray
        Array of shape (N,), containing the label for each image.
    """
    images, labels = [], []
    # Iterate through all files in the folder
    for filename in os.listdir(folder):
        if filename.startswith("."):
            continue
        if not filename.lower().endswith((".jpg", ".jpeg", ".png", ".bmp")):
            continue
        path = os.path.join(folder, filename)
        # Open the image, convert to RGB, and resize to 64×64
        img = Image.open(path).convert("RGB").resize((64, 64))
        # Convert to float32 NumPy array in [0, 1] and flatten
        arr = np.array(img, dtype=np.float32) / 255.0
        images.append(arr.flatten())
        # Append the label for this image
        labels.append(label)
    # Return as NumPy arrays for compatibility with ML models
    return np.array(images), np.array(labels)
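A quick way to sanity-check the loader (assuming the dataset/green_apples and dataset/red_apples folders used in the next step already exist) is:
from dataset_loader import load_images

X_green, y_green = load_images("dataset/green_apples", label=0)
X_red, y_red = load_images("dataset/red_apples", label=1)
print(X_green.shape, y_green.shape)   # e.g. (N_green, 12288) (N_green,)
print(X_red.shape, y_red.shape)       # e.g. (N_red, 12288) (N_red,)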
3.3. Dimensionality Reduction with PCA
# pca_reduction.py
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from dataset_loader import load_images


def run_pca():
    """
    Load green_apple and red_apple images, combine them into a single RGB dataset,
    apply PCA to reduce the dimensionality from 12,288 to a configurable number of
    components, and return the transformed data along with the labels and the
    fitted transformers.

    Returns
    -------
    X_pca_scaled : numpy.ndarray
        PCA-transformed (and standardized) feature matrix.
    y : numpy.ndarray
        Label array for all images (0 = green_apple, 1 = red_apple).
    pca : sklearn.decomposition.PCA
        The fitted PCA transformer.
    scaler : sklearn.preprocessing.StandardScaler
        Centering scaler fitted on the raw pixel vectors before PCA.
    feature_scaler : sklearn.preprocessing.StandardScaler
        Scaler applied after PCA to keep the features well conditioned.
    """
    # Load preprocessed green_apple images (label = 0)
    X_green_apples, y_green_apples = load_images("dataset/green_apples", label=0)
    # Load preprocessed red_apple images (label = 1)
    X_red_apples, y_red_apples = load_images("dataset/red_apples", label=1)
    # Combine green_apple + red_apple feature matrices into one dataset
    X = np.vstack([X_green_apples, X_red_apples])
    y = np.hstack([y_green_apples, y_red_apples])
    # Center the data before PCA (mean removal only, no variance scaling)
    scaler = StandardScaler(with_mean=True, with_std=False)
    X_scaled = scaler.fit_transform(X)
    X_scaled = X_scaled.astype(np.float64, copy=False)
    # Create PCA with a configurable number of components
    pca = PCA(
        n_components=100,
        random_state=42,
        svd_solver="full",
        whiten=False,
    )
    # Fit PCA on the centered data and transform it
    X_pca = pca.fit_transform(X_scaled)
    # Standardize the PCA features to keep them well conditioned for linear models
    feature_scaler = StandardScaler()
    X_pca_scaled = feature_scaler.fit_transform(X_pca)
    print("Original shape:", X.shape)
    print("After PCA:", X_pca.shape)
    return X_pca_scaled, y, pca, scaler, feature_scaler
At this point, the dimensionality reduction step is complete. The original 12,288-dimensional RGB images have been compressed into just 100 dimensions. X_pca now represents the transformed dataset — more compact, cleaner, and easier for the model to learn from.
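It is also worth checking how much variance those 100 components actually retain. The fitted PCA object exposes explained_variance_ratio_, so a short check (a sketch reusing run_pca from above) looks like this:
from pca_reduction import run_pca

X_pca, y, pca, scaler, feature_scaler = run_pca()
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by {pca.n_components_} components: {retained:.2%}")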
3.4. Training a Classifier on PCA Features (Logistic Regression)
# classification.py
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split

from pca_reduction import run_pca


def train_model():
    """
    Train a logistic regression model on the PCA-reduced features.

    Returns
    -------
    model : sklearn.linear_model.LogisticRegressionCV
        Trained classifier.
    pca : sklearn.decomposition.PCA
        Fitted PCA transformer.
    scaler : sklearn.preprocessing.StandardScaler
        Centering scaler fitted before PCA.
    feature_scaler : sklearn.preprocessing.StandardScaler
        Scaler applied after PCA, so inference matches training.
    metrics : dict
        Dictionary containing evaluation metadata.
    """
    X_pca, y, pca, scaler, feature_scaler = run_pca()
    # Split the PCA features and labels into train (80%) and test (20%) sets
    X_train, X_test, y_train, y_test = train_test_split(
        X_pca, y, test_size=0.2, stratify=y, random_state=42
    )
    # Cross-validated logistic regression improves generalization
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    model = LogisticRegressionCV(
        Cs=np.logspace(-2, 2, 9),
        cv=cv,
        max_iter=2000,
        scoring="accuracy",
        n_jobs=1,
        refit=True,
        solver="liblinear",
    )
    # Train the model on the PCA-transformed training data
    model.fit(X_train, y_train)
    # Predict labels for the test set
    y_pred = model.predict(X_test)
    # Compute classification accuracy
    acc = accuracy_score(y_test, y_pred)
    metrics = {
        "test_accuracy": acc,
        "train_size": len(X_train),
        "test_size": len(X_test),
        "best_C": float(model.C_[0]),
    }
    print(
        f"Test accuracy: {acc:.4f} "
        f"(train={len(X_train)} samples, test={len(X_test)} samples, best C={model.C_[0]:.4f})"
    )
    return model, pca, scaler, feature_scaler, metrics
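In practice you would usually persist the fitted objects so that inference does not retrain the model each time. A minimal sketch using joblib (the file name apple_model.joblib is arbitrary and this step is not part of the original scripts):
import joblib
from classification import train_model

model, pca, scaler, feature_scaler, metrics = train_model()
# Save every fitted component needed at inference time
joblib.dump(
    {"model": model, "pca": pca, "scaler": scaler, "feature_scaler": feature_scaler},
    "apple_model.joblib",
)
# Later: artifacts = joblib.load("apple_model.joblib")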
3.5. Predict a new image
# predict.py
import os

import numpy as np
from PIL import Image

from classification import train_model

LABEL_TO_NAME = {0: "Green", 1: "Red"}


def prepare_image(path):
    # Apply the same preprocessing as in training: RGB, 64×64, [0, 1], flatten
    img = Image.open(path).convert("RGB").resize((64, 64))
    arr = np.array(img, dtype=np.float32) / 255.0
    return arr.flatten().reshape(1, -1)


def predict_image(path, pca, scaler, feature_scaler, model):
    img_vec = prepare_image(path)
    # Apply the same centering used during training
    img_scaled = scaler.transform(img_vec).astype(np.float64, copy=False)
    # Apply the PCA transform
    img_pca = pca.transform(img_scaled)
    # Standardize the PCA features, exactly as during training
    img_pca_scaled = feature_scaler.transform(img_pca)
    # Predict the label
    pred = model.predict(img_pca_scaled)[0]
    return LABEL_TO_NAME.get(pred, "Unknown")


def predict_folder(folder_path, pca, scaler, feature_scaler, model):
    print(f"Testing folder: {folder_path}")
    for file in os.listdir(folder_path):
        if file.startswith("."):
            continue
        if not file.lower().endswith((".jpg", ".jpeg", ".png", ".bmp")):
            continue
        path = os.path.join(folder_path, file)
        label_name = predict_image(path, pca, scaler, feature_scaler, model)
        print(f"{file} ➝ {label_name}")


def main():
    model, pca, scaler, feature_scaler, metrics = train_model()
    print(f"Training metrics: {metrics}")
    predict_folder(
        "dataset/test_apples",
        pca,
        scaler,
        feature_scaler,
        model,
    )


if __name__ == "__main__":
    main()
After running the model on several test images, we obtain the predictions shown below:
green_01.jpg ➝ Green
green_02.jpg ➝ Green
green_03.jpg ➝ Green
green_04.jpg ➝ Green
green_05.jpg ➝ Red
red_01.jpg ➝ Green
red_02.jpg ➝ Red
red_03.jpg ➝ Green
red_04.jpg ➝ Red
red_05.jpg ➝ Red
The filenames were assigned according to the true color of each apple (“green” for green apples and “red” for red apples). As shown, most predictions match the ground truth, while a few images were misclassified—likely due to lighting, shadows, or color overlap between the two varieties. This demonstrates both the effectiveness of PCA & Logistic Regression and the natural challenges of real-world image data.
4. Conclusion
Through the example of classifying green apples and red apples, we can clearly see that Principal Component Analysis:
- Significantly reduces the dimensionality of RGB images
- Effectively removes noise and irrelevant information
- Highlights the most important visual features such as color and shape
- Accelerates model training
- Improves the accuracy and stability of traditional machine learning models
PCA remains one of the foundational techniques in image processing and classical machine learning, especially valuable when resources are limited or when the pipeline does not rely on deep learning.









