The BreaKHis dataset is organized first by tumor class (benign or malignant) and then by specific tumor type. The functions provided here load the data lazily for training image classification models, while keeping metadata that a given task may not need. For example, when training on benign/malignant labels, the specific tumor type of each image is still available in the dataset definition. The data is anonymized, so splitting at the patient level is not possible. Instead, we leave dataset splitting up to the user, but provide some utility functions to reproduce the results obtained in initial development.
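One way to picture "keeping the extra information" is to parse every label once and choose the training target later. The sketch below is illustrative only and is not this package's implementation. It assumes the file naming used in the public BreaKHis release (for example SOB_B_TA-14-4659-40-001.png, read as procedure, tumor class, tumor type, year, slide, magnification, sequence number). The names ImageRecord, parse_filename and TUMOR_CLASSES are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumed mapping from the class code in the file name to a readable label.
TUMOR_CLASSES = {"B": "benign", "M": "malignant"}


@dataclass(frozen=True)
class ImageRecord:
    """Metadata kept for every image, whichever label is trained on."""
    path: str
    tumor_class: str    # 'benign' or 'malignant'
    tumor_type: str     # abbreviation taken from the file name, e.g. 'TA'
    magnification: int  # 40, 100, 200 or 400

    def label(self, kind: str) -> str:
        # Only the two label types named in this document are accepted.
        if kind not in ("tumor_class", "tumor_type"):
            raise ValueError("label must be 'tumor_class' or 'tumor_type'")
        return getattr(self, kind)


def parse_filename(path: str) -> ImageRecord:
    """Parse metadata from a file name, assuming the naming scheme above."""
    stem = Path(path).stem
    _procedure, class_code, rest = stem.split("_", 2)
    tumor_type, _year, _slide, magnification, _seq = rest.split("-")
    return ImageRecord(path, TUMOR_CLASSES[class_code], tumor_type, int(magnification))


record = parse_filename("SOB_B_TA-14-4659-40-001.png")
print(record.label("tumor_class"), record.label("tumor_type"), record.magnification)
```

Because the record stores both labels, switching from tumor class to tumor type only changes the argument passed to label, not how the files are read.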
For reproducibility, the random seed (for both NumPy and PyTorch) is set to 31.
To create the dataset, you only need one function call. Within this function call:
- You can specify the label type when initializing the dataset by specifying label in initialize.
  - It must be one of 'tumor_class' or 'tumor_type'.
- You can make arbitrary splits of the data (within reason) when splitting the dataset via split_dataset.
- You can split equally within various criteria using criterion, which can include tumor class or tumor type, and magnification.
  - You cannot split equally by both tumor class and tumor type (an error is thrown if attempted).
- You can use different transforms for different splits using split_transforms.
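To make "splitting equally within a criterion" concrete, here is a minimal sketch of a stratified split in plain Python. It is not the code behind split_dataset. The function stratified_split, its fractions argument and the toy records are hypothetical, and the only details taken from this document are the seed of 31 and the rule that tumor class and tumor type cannot both be used.

```python
import random
from collections import defaultdict


def stratified_split(records, criterion, fractions, seed=31):
    """Split records (dicts) so every stratum is divided in the same proportions."""
    if "tumor_class" in criterion and "tumor_type" in criterion:
        raise ValueError("criterion cannot contain both 'tumor_class' and 'tumor_type'")

    # Group records by the tuple of values named in criterion.
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[key] for key in criterion)].append(rec)

    rng = random.Random(seed)
    names = list(fractions)
    splits = {name: [] for name in names}
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        start = 0
        for i, name in enumerate(names):
            if i == len(names) - 1:
                # The last split takes the remainder so no record is dropped.
                end = len(group)
            else:
                end = start + round(fractions[name] * len(group))
            splits[name].extend(group[start:end])
            start = end
    return splits


# Toy data: 2 tumor types x 2 magnifications x 10 images each.
records = [
    {"tumor_type": t, "magnification": m, "idx": i}
    for t in ("TA", "DC")
    for m in (40, 100)
    for i in range(10)
]
splits = stratified_split(
    records, ["tumor_type", "magnification"], {"train": 0.8, "val": 0.2}
)
print(len(splits["train"]), len(splits["val"]))  # 32 8
```

Each of the four strata contributes 8 images to train and 2 to val, so every combination of tumor type and magnification is represented in the same proportion in both splits.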
#example
from torchvision import transforms

# Training: random augmentation, then tensor conversion and normalization.
train_transform = transforms.Compose([
    transforms.RandomRotation(90),
    transforms.RandomHorizontalFlip(0.8),
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
])
# Validation: deterministic center crop with the same normalization.
val_transform = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
])
#example
ds_mapping = initialize_datasets(
    '/share/nikola/export/dt372/BreaKHis_v1/',
    label='tumor_type', criterion=['tumor_type', 'magnification'],
    split_transforms={'train': train_transform, 'val': val_transform}
)
#example
tr_ds, val_ds = ds_mapping['train'], ds_mapping['val']
#example
tr_ds[0]
From here, creating the dataloaders for training is straightforward.
#example
import torch

# Shuffle only the training split.
tr_dl = torch.utils.data.DataLoader(tr_ds, batch_size=32, shuffle=True)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=32)
#example
x, y = next(iter(tr_dl))
x.shape, y.shape
Appendix
All images in this dataset are captured from an ROI determined by a professional pathologist, so all images are assumed to have a tumor.
Samples
- Samples are generated from breast tissue biopsy slides, stained with hematoxylin and eosin (HE).
- Prepared for histological study and labelled by pathologists of the P&D Lab
- Breast tumor specimens assessed by Immunohistochemistry (IHC)
- Core Needle Biopsy (CNB) and Surgical Open Biopsy (SOB)
- Sections of ~3 µm thickness
Image acquisition
- Olympus BX-50 system microscope with a relay lens with magnification of 3.3× coupled to a Samsung digital color camera SCC-131AN
- Magnification 40×, 100×, 200×, and 400× (objective lens 4×, 10×, 20×, and 40× with ocular lens 10×)
- Camera pixel size 6.5 µm
- Raw images, without normalization or color standardization
- Resulting images saved in 3-channel RGB, 8-bit depth in each channel, PNG format
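The acquisition figures above can be combined into a rough estimate of how much tissue each pixel covers. The sketch below assumes that the camera sees the specimen through the objective and the relay lens only (not the ocular), so the effective pixel size is the camera pixel size divided by objective × relay magnification. That formula and the helper name effective_pixel_size_um are assumptions for illustration, not something stated in this document.

```python
CAMERA_PIXEL_UM = 6.5  # camera pixel size listed above, in micrometres
RELAY_LENS = 3.3       # relay lens magnification listed above
OCULAR = 10            # ocular lens magnification listed above


def effective_pixel_size_um(objective: float) -> float:
    """Approximate specimen area per pixel (assumed formula, see lead-in)."""
    return CAMERA_PIXEL_UM / (objective * RELAY_LENS)


for objective in (4, 10, 20, 40):
    nominal = objective * OCULAR  # 40, 100, 200, 400
    size = effective_pixel_size_um(objective)
    print(f"{nominal}x: ~{size:.2f} um per pixel")
```

Under this assumption, a 40× image covers roughly ten times more tissue per pixel than a 400× image, which is worth keeping in mind when mixing magnifications in one training set.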