The BreaKHis dataset is organized by benign and malignant tumors, and then by the specific tumor types. The functions provided here help to quickly (and lazily) process the data for use in training image classification models, while retaining additional metadata that may not be strictly necessary for the task at hand. For example, if training on benign/malignant labels, the information about which specific tumor type is present remains available in the dataset definition. The data is anonymized, so splitting at the patient level is not possible. Instead, we leave dataset splitting up to the user, but provide some utility functions to reproduce the results obtained during initial development.

For reproducibility, the random seeds (for both NumPy and PyTorch) are set to 31.
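A minimal sketch of that seeding (assuming only the NumPy and PyTorch global generators need fixing; CUDA determinism and `DataLoader` worker seeding may require additional steps):

```python
import numpy as np
import torch

SEED = 31
np.random.seed(SEED)     # fixes NumPy's global RNG
torch.manual_seed(SEED)  # fixes PyTorch's CPU (and default CUDA) RNG
```

Re-running this before any random operation reproduces the same draws.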

class BreaKHisDataset[source]

BreaKHisDataset(dataset, transform=None) :: Dataset

PyTorch dataset definition of the BreaKHis dataset.

Construction of the dataset object should be done using this
class's method `initialize`. Simply providing the data directory
where the data was downloaded is sufficient.

initialize_datasets[source]

initialize_datasets(data_dir, label='tumor_class', split={'train': 0.8, 'val': 0.2}, criterion=['tumor_class'], split_transforms={'train': None, 'val': None})

Returns a `BreaKHisDataset` object for the data contained in `data_dir`.

To create the dataset, you only need one function call. Within this function call:

  • You can specify the label type when initializing the dataset via the `label` argument
    • It must be one of 'tumor_class' or 'tumor_type'
  • You can make arbitrary splits of the data (within reason) via the `split` argument
  • You can ensure equal splits within various criteria using `criterion`, which can include tumor class, tumor type, and magnification.
    • You cannot split equally by both tumor class and tumor type (an error will be thrown if attempted).
  • You can use different transforms for different splits using `split_transforms`.
#example
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(90),
    transforms.RandomHorizontalFlip(0.8),
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
])

val_transform = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                        (0.2023, 0.1994, 0.2010)),
])
#example
ds_mapping = initialize_datasets(
    '/share/nikola/export/dt372/BreaKHis_v1/',
    label='tumor_type', criterion=['tumor_type', 'magnification'],
    split_transforms={'train': train_transform, 'val': val_transform}
)
#example
tr_ds, val_ds = ds_mapping['train'], ds_mapping['val']
#example
tr_ds[0]
(tensor([[[-2.4291, -2.4291, -2.4291,  ...,  0.8082,  0.7888,  0.8082],
          [-2.4291, -2.4291, -2.4291,  ...,  0.8082,  0.7888,  0.7888],
          [-2.4291, -2.4291, -2.4291,  ...,  0.8082,  0.8082,  0.7501],
          ...,
          [ 0.6725,  0.6144,  0.2654,  ...,  0.4399,  0.3624,  0.0522],
          [ 0.4981,  0.6338,  0.5174,  ...,  0.3236,  0.3817,  0.1297],
          [ 0.3430,  0.3624,  0.4593,  ...,  0.3624,  0.3042,  0.0522]],
 
         [[-2.4183, -2.4183, -2.4183,  ...,  0.9251,  0.9251,  0.9251],
          [-2.4183, -2.4183, -2.4183,  ...,  0.9251,  0.9054,  0.9251],
          [-2.4183, -2.4183, -2.4183,  ...,  0.9251,  0.9251,  0.9054],
          ...,
          [ 0.2564,  0.2564, -0.0189,  ...,  0.5121,  0.4531,  0.0598],
          [ 0.0598,  0.1778,  0.1581,  ...,  0.3744,  0.4138,  0.1188],
          [-0.1566, -0.0976,  0.0204,  ...,  0.4531,  0.3548,  0.0991]],
 
         [[-2.2214, -2.2214, -2.2214,  ...,  0.8027,  0.8027,  0.8027],
          [-2.2214, -2.2214, -2.2214,  ...,  0.7832,  0.7637,  0.7832],
          [-2.2214, -2.2214, -2.2214,  ...,  0.8027,  0.7832,  0.7832],
          ...,
          [ 1.0368,  0.9978,  0.7052,  ...,  0.9588,  0.9588,  0.6661],
          [ 0.8807,  0.9978,  0.9198,  ...,  0.8807,  0.9783,  0.8027],
          [ 0.7442,  0.8027,  0.8612,  ...,  0.9588,  0.9588,  0.7637]]]),
 tensor(0))

From here, it is very simple to create the dataloaders for use in training.

#example
import torch

tr_dl = torch.utils.data.DataLoader(tr_ds, batch_size=32, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=32)
#example
x, y = next(iter(tr_dl))
x.shape, y.shape
(torch.Size([32, 3, 224, 224]), torch.Size([32]))
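From a dataloader, a standard PyTorch training step looks like the following. This is a minimal sketch: the single-linear-layer model, the class count of 8 (the number of tumor types), and the random stand-in batch are placeholders of ours, not part of this library; in practice you would iterate `for x, y in tr_dl:` with a real classifier.

```python
import torch
import torch.nn as nn

# Placeholder model: one linear layer over flattened pixels.
# Substitute any classifier (e.g. a CNN); 8 = number of tumor types.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 8))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Random tensors stand in for one (x, y) batch from tr_dl,
# keeping the snippet self-contained.
x = torch.randn(32, 3, 224, 224)
y = torch.randint(0, 8, (32,))

optimizer.zero_grad()
logits = model(x)        # shape: (32, 8)
loss = criterion(logits, y)
loss.backward()
optimizer.step()
```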

Appendix

All images in this dataset are captured from an ROI determined by a professional pathologist, so every image is assumed to contain tumor tissue.

Samples

  • Samples are generated from breast tissue biopsy slides, stained with hematoxylin and eosin (HE).
  • Prepared for histological study and labelled by pathologists of the P&D Lab
  • Breast tumor specimens assessed by Immunohistochemistry (IHC)
  • Core Needle Biopsy (CNB) and Surgical Open Biopsy (SOB)
  • Sections of ~3 µm thickness

Image acquisition

  • Olympus BX-50 system microscope with a 3.3× relay lens, coupled to a Samsung SCC-131AN digital color camera
  • Magnification 40×, 100×, 200×, and 400× (objective lens 4×, 10×, 20×, and 40× with ocular lens 10×)
  • Camera pixel size 6.5 µm
  • Raw images, without normalization or color standardization
  • Resulting images saved in 3-channel RGB, 8-bit depth in each channel, PNG format
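As a rough back-of-the-envelope estimate (under an assumption of ours, not stated above: the camera path sees the objective and the 3.3× relay lens, but not the 10× ocular), the specimen-level pixel size follows from the 6.5 µm camera pixel:

```python
# Estimated specimen-level pixel size: camera pixel / (objective * relay).
# Assumes the 10x ocular is not in the camera's optical path.
CAMERA_PIXEL_UM = 6.5
RELAY = 3.3

for objective in (4, 10, 20, 40):
    px = CAMERA_PIXEL_UM / (objective * RELAY)
    print(f"{objective * 10}x magnification: ~{px:.3f} um/pixel")
```

This gives roughly 0.49 µm/pixel at 40× down to about 0.05 µm/pixel at 400×.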