The BreaKHis dataset is organized by benign and malignant tumors, and then by the specific tumor types. The functions provided here help to quickly (and lazily) process the data for use in training image classification models, while retaining additional metadata that may not be strictly necessary for the task at hand. For example, if training on benign/malignant labels, the information about which specific tumor type is present remains available in the dataset definition. The data is anonymized, so splitting at the patient level is not possible. Instead, we leave dataset splitting up to the user, but provide some utility functions to reproduce the results obtained during initial development.

For reproducibility, the random seeds (for both NumPy and PyTorch) are set to 31.
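A minimal sketch of that seeding (assuming only the NumPy and PyTorch global generators need fixing; CUDA determinism and `DataLoader` worker seeding may require additional steps):

```python
import numpy as np
import torch

SEED = 31
np.random.seed(SEED)     # fixes NumPy's global RNG
torch.manual_seed(SEED)  # fixes PyTorch's CPU (and default CUDA) RNG
```

Re-running this before any random operation reproduces the same draws.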

class BreaKHisDataset[source]

BreaKHisDataset(dataset, transform=None) :: Dataset

PyTorch dataset definition of the BreaKHis dataset.

Construction of the dataset object should be done using this
class's method `initialize`. Simply providing the data directory
where the data was downloaded is sufficient.

initialize_datasets[source]

initialize_datasets(data_dir, label='tumor_class', split={'train': 0.8, 'val': 0.2}, criterion=['tumor_class'], split_transforms={'train': None, 'val': None})

Returns a `BreaKHisDataset` object for the data contained in `data_dir`.

To create the dataset, you only need one function call. Within this function call:

  • You can specify the label type when initializing the dataset via the `label` argument
    • It must be one of 'tumor_class' or 'tumor_type'
  • You can make arbitrary splits of the data (within reason) via the `split` argument
  • You can ensure equal splits within various criteria using `criterion`, which can include tumor class, tumor type, and magnification.
    • You cannot split equally by both tumor class and tumor type (an error will be thrown if attempted).
  • You can use different transforms for different splits using `split_transforms`.
#example
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(90),
    transforms.RandomHorizontalFlip(0.8),
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
])

val_transform = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                        (0.2023, 0.1994, 0.2010)),
])
#example
ds_mapping = initialize_datasets(
    '/share/nikola/export/dt372/BreaKHis_v1/',
    label='tumor_type', criterion=['tumor_type', 'magnification'],
    split_transforms={'train': train_transform, 'val': val_transform}
)
#example
tr_ds, val_ds = ds_mapping['train'], ds_mapping['val']
#example
tr_ds[0]
(tensor([[[-2.4291, -2.4291, -2.4291,  ...,  0.8082,  0.7888,  0.8082],
          [-2.4291, -2.4291, -2.4291,  ...,  0.8082,  0.7888,  0.7888],
          [-2.4291, -2.4291, -2.4291,  ...,  0.8082,  0.8082,  0.7501],
          ...,
          [ 0.6725,  0.6144,  0.2654,  ...,  0.4399,  0.3624,  0.0522],
          [ 0.4981,  0.6338,  0.5174,  ...,  0.3236,  0.3817,  0.1297],
          [ 0.3430,  0.3624,  0.4593,  ...,  0.3624,  0.3042,  0.0522]],
 
         [[-2.4183, -2.4183, -2.4183,  ...,  0.9251,  0.9251,  0.9251],
          [-2.4183, -2.4183, -2.4183,  ...,  0.9251,  0.9054,  0.9251],
          [-2.4183, -2.4183, -2.4183,  ...,  0.9251,  0.9251,  0.9054],
          ...,
          [ 0.2564,  0.2564, -0.0189,  ...,  0.5121,  0.4531,  0.0598],
          [ 0.0598,  0.1778,  0.1581,  ...,  0.3744,  0.4138,  0.1188],
          [-0.1566, -0.0976,  0.0204,  ...,  0.4531,  0.3548,  0.0991]],
 
         [[-2.2214, -2.2214, -2.2214,  ...,  0.8027,  0.8027,  0.8027],
          [-2.2214, -2.2214, -2.2214,  ...,  0.7832,  0.7637,  0.7832],
          [-2.2214, -2.2214, -2.2214,  ...,  0.8027,  0.7832,  0.7832],
          ...,
          [ 1.0368,  0.9978,  0.7052,  ...,  0.9588,  0.9588,  0.6661],
          [ 0.8807,  0.9978,  0.9198,  ...,  0.8807,  0.9783,  0.8027],
          [ 0.7442,  0.8027,  0.8612,  ...,  0.9588,  0.9588,  0.7637]]]),
 tensor(0))

From here, it is very simple to create the dataloaders for use in training.

#example
import torch

tr_dl = torch.utils.data.DataLoader(tr_ds, batch_size=32, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=32)
#example
x, y = next(iter(tr_dl))
x.shape, y.shape
(torch.Size([32, 3, 224, 224]), torch.Size([32]))
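From a dataloader, a standard PyTorch training step looks like the following. This is a minimal sketch: the single-linear-layer model, the class count of 8 (the number of tumor types), and the random stand-in batch are placeholders of ours, not part of this library; in practice you would iterate `for x, y in tr_dl:` with a real classifier.

```python
import torch
import torch.nn as nn

# Placeholder model: one linear layer over flattened pixels.
# Substitute any classifier (e.g. a CNN); 8 = number of tumor types.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 8))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Random tensors stand in for one (x, y) batch from tr_dl,
# keeping the snippet self-contained.
x = torch.randn(32, 3, 224, 224)
y = torch.randint(0, 8, (32,))

optimizer.zero_grad()
logits = model(x)        # shape: (32, 8)
loss = criterion(logits, y)
loss.backward()
optimizer.step()
```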

Appendix

All images in this dataset are captured from an ROI determined by a professional pathologist, so every image is assumed to contain tumor tissue.

Samples

  • Samples are generated from breast tissue biopsy slides, stained with hematoxylin and eosin (HE).
  • Prepared for histological study and labelled by pathologists of the P&D Lab
  • Breast tumor specimens assessed by Immunohistochemistry (IHC)
  • Core Needle Biopsy (CNB) and Surgical Open Biopsy (SOB)
  • Sections of ~3 µm thickness

Image acquisition

  • Olympus BX-50 system microscope with a 3.3× relay lens, coupled to a Samsung SCC-131AN digital color camera
  • Magnification 40×, 100×, 200×, and 400× (objective lens 4×, 10×, 20×, and 40× with ocular lens 10×)
  • Camera pixel size 6.5 µm
  • Raw images, without normalization or color standardization
  • Resulting images saved in 3-channel RGB, 8-bit depth in each channel, PNG format
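As a rough back-of-the-envelope estimate (under an assumption of ours, not stated above: the camera path sees the objective and the 3.3× relay lens, but not the 10× ocular), the specimen-level pixel size follows from the 6.5 µm camera pixel:

```python
# Estimated specimen-level pixel size: camera pixel / (objective * relay).
# Assumes the 10x ocular is not in the camera's optical path.
CAMERA_PIXEL_UM = 6.5
RELAY = 3.3

for objective in (4, 10, 20, 40):
    px = CAMERA_PIXEL_UM / (objective * RELAY)
    print(f"{objective * 10}x magnification: ~{px:.3f} um/pixel")
```

This gives roughly 0.49 µm/pixel at 40× down to about 0.05 µm/pixel at 400×.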