Dataset Handling
Deep.Net provides a generic type for handling datasets used in machine learning. It can handle samples of a user-defined record type containing fields of type ArrayNDT. The following features are provided:
- data storage on host and CUDA GPU
- indexed sample access
- sample range access
- mini-batch sequencing (with optional padding of last batch)
- partitioning into training, validation and test sets
- loading from and saving to disk
We are going to introduce it using a simple, synthetic dataset.
Creating a dataset
In most cases you are going to load a dataset by parsing some text or binary files. However, since this is quite application-specific we do not want to concern ourselves with it here and will create a synthetic dataset using trigonometric functions on the fly.
Defining the sample type
Our sample type consists of two fields: a scalar \(x\) and a vector \(\mathbf{v}\). This corresponds to the following record type
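A minimal sketch, assuming the tensor types come from the ArrayNDNS namespace and that both fields are stored as ArrayNDT<single>:

```fsharp
open ArrayNDNS   // assumed namespace providing ArrayNDT and ArrayNDHost

/// One sample: a scalar X and a vector V, both single-precision tensors.
type MySampleType = {
    X: ArrayNDT<single>
    V: ArrayNDT<single>
}
```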
We use the data type single for fast arithmetic operations on the GPU.
Generating some samples
Next, let us generate some samples. The scalar \(x\) shall be sampled randomly from a uniform distribution on the interval \([-2, 2]\). The values of vector \(\mathbf{v}\) shall be given by the relation
\[\mathbf{v}(x) = \left( \begin{matrix} \mathrm{sinh} \, x \\ \mathrm{cosh} \, x \end{matrix} \right)\]
We can implement that using the following code.
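A possible implementation; ArrayNDHost.scalar and ArrayNDHost.ofList are assumed constructors for host tensors:

```fsharp
/// Generates the given number of samples with x drawn uniformly from [-2, 2]
/// and v = [sinh x; cosh x].
let generateSamples nSamples = seq {
    let rng = System.Random (100)
    for _ in 1 .. nSamples do
        let x = single (rng.NextDouble () * 4.0 - 2.0)
        yield {
            X = ArrayNDHost.scalar x                  // scalar tensor (assumed constructor)
            V = ArrayNDHost.ofList [sinh x; cosh x]   // vector tensor (assumed constructor)
        }
}
```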
The generateSamples function produces the specified number of samples. We can test it as follows.
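Here the sample count and the print format are chosen for illustration:

```fsharp
let smpls = generateSamples 100 |> List.ofSeq

// print the first three generated samples
for idx, smpl in List.indexed smpls |> List.truncate 3 do
    printfn "Sample %d: X=%A V=%A" idx smpl.X smpl.V
```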
This prints the first three samples; the exact values depend on the random draw.
Now that we have some data, we can create a dataset.
Instantiating the dataset type
There are two ways to construct a dataset.
- The Dataset<'S>.FromSamples method takes a sequence of samples (of type 'S) and constructs a dataset from them.
- The Dataset<'S> constructor takes a list of ArrayNDTs corresponding to the fields of the record type 'S. The first dimension of each passed array must correspond to the sample index.
Since we already have a sequence of samples, we use the first method.
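A sketch, assuming the namespace providing Dataset<'S> has been opened:

```fsharp
let ds = Dataset<MySampleType>.FromSamples smpls
```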
Accessing single and multiple elements
The dataset type supports indexing and slicing operations for accessing samples.
When accessing a single sample using the indexing operator, we obtain a record from the sequence of samples we passed into the Dataset.FromSamples method.
For example, to print the third sample we can write something like the following.
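```fsharp
// access the third sample via the indexing operator
let smpl2 = ds.[2]
printfn "%A" smpl2
```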
The output is the corresponding sample record; the exact values depend on the random draw.
When accessing multiple elements using the slicing operator, the returned value is of the same sample record type, but the contained tensors have one additional dimension on the left corresponding to the sample index. For example, we can get a record containing the first three samples using code like the following.
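```fsharp
// slice the first three samples (indices 0 to 2)
let smpl0to2 = ds.[0..2]
printfn "%A" smpl0to2
```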
This prints the resulting record. All tensors in the sample record have risen in rank by one dimension: the scalar X became a vector and the vector V became a matrix with each row corresponding to a sample.
Iterating over the dataset
You can also iterate over the samples of the dataset directly.
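For example:

```fsharp
// iterate over all samples in the dataset
for idx, smpl in Seq.indexed ds do
    printfn "Sample %d: X=%A V=%A" idx smpl.X smpl.V
```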
This prints each sample of the dataset in turn.
Mini-batches
The ds.Batches function returns a sequence of mini-batches from the dataset.
It takes one argument specifying the number of samples in each batch.
If the total number of samples in the dataset is not a multiple of the batch size, the last batch will contain fewer samples.
The following code prints the sizes of the obtained mini-batches.
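In this sketch the number of samples in a batch is read from the first dimension of its X tensor; the Shape property is an assumption about the tensor API:

```fsharp
// iterate over mini-batches of 32 samples each
for b in ds.Batches 32 do
    printfn "batch with %d samples" b.X.Shape.[0]

// with 100 samples this prints batch sizes 32, 32, 32 and 4;
// ds.PaddedBatches 32 would pad the last batch to the full batch size
```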
If you need the last batch to be padded to the specified batch size, use the ds.PaddedBatches method instead.
Partitioning
It is often necessary to split a dataset into partitions.
The ds.Partition method takes a list of ratios and returns a list of new datasets obtained by splitting the dataset according to the specified ratios.
Partitioning is done by sequentially taking samples from the beginning, until the first partition has the requested number of samples.
Then the samples for the second partition are taken and so on.
The following example splits our dataset into three partitions of ratios \(1/2\), \(1/4\) and \(1/4\).
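A sketch (Seq.length works here because a dataset can be enumerated as shown above):

```fsharp
let partitions = ds.Partition [0.5; 0.25; 0.25]
for idx, p in List.indexed partitions do
    printfn "partition %d contains %d samples" idx (Seq.length p)

// with 100 samples the partitions contain 50, 25 and 25 samples
```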
This prints the number of samples in each of the three partitions.
Training, validation and test splits
In machine learning it is common practice to split the dataset into a training, validation and test dataset.
Deep.Net provides the TrnValTst<'S> type for that purpose.
It is a record type with the fields Trn, Val and Tst of type Dataset<'S>.
It can be constructed from an existing dataset using the TrnValTst.Of function.
The following code demonstrates its use with the ratios \(0.7\), \(0.15\) and \(0.15\) for the training, validation and test set respectively. The ratio specification is optional; if it is omitted, ratios of \(0.8\), \(0.1\) and \(0.1\) are used.
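A sketch; the exact way the ratios are passed to TrnValTst.Of is an assumption:

```fsharp
let dsp = TrnValTst.Of (ds, 0.7, 0.15, 0.15)   // assumed calling convention
printfn "training:   %d samples" (Seq.length dsp.Trn)
printfn "validation: %d samples" (Seq.length dsp.Val)
printfn "test:       %d samples" (Seq.length dsp.Tst)

// with 100 samples this prints 70, 15 and 15 samples
```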
This prints the number of samples in the training, validation and test sets.
Data transfer
The ds.ToCuda and ds.ToHost methods copy the dataset to the CUDA GPU or back to the host, respectively.
The TrnValTst type provides the same methods.
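For example (assuming both methods take no arguments):

```fsharp
let dsGpu  = ds.ToCuda ()     // dataset with all tensors stored on the CUDA GPU
let dsHost = dsGpu.ToHost ()  // copied back to host memory
```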
Disk storage
Use the ds.Save method to save a dataset to disk in the HDF5 format.
The Dataset<'S>.Load function loads a saved dataset.
The TrnValTst type provides the same methods.
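For example (assuming both take a file path; the file name is arbitrary):

```fsharp
ds.Save "samples.h5"                                     // store the dataset as HDF5
let dsLoaded = Dataset<MySampleType>.Load "samples.h5"   // load it back
```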
Dataset loaders
Currently Deep.Net provides the following loaders for common datasets.
- MNIST. Use the Mnist.load function. It takes two parameters; the first is the path to the MNIST dataset (the directory containing the files t10k-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz, train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz) and the second is the desired ratio of the validation set to the training set (for example 0.166 if you want 50 000 training samples and 10 000 validation samples). The sample type MnistT contains two fields: Img for the flattened images and Lbl for the labels in one-hot encoding. A usage sketch follows below.
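The sketch uses the documented two-argument form of Mnist.load; the path is hypothetical:

```fsharp
// directory containing the four MNIST .gz files (hypothetical path);
// 0.166 yields roughly 50 000 training and 10 000 validation samples
let mnist = Mnist.load @"C:\Data\MNIST" 0.166
```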
Summary
The Dataset<'S> type provides a convenient way to work with datasets.
Type-safety is provided by preserving the user-specified sample type 'S when accessing individual or multiple samples.
The dataset handler is used by the generic training function.