scparadise.sceve.train


scparadise.sceve.train(mdata, path='', rna_modality_name='rna', second_modality_name='adt', detailed_annotation=None, model_name='model_regression', accelerator='auto', random_state=0, test_size=0.1, n_d=8, n_a=8, n_steps=3, n_shared=2, cat_emb_dim=1, n_independent=2, gamma=1.3, momentum=0.02, lr=0.02, lambda_sparse=0.001, patience=10, max_epochs=200, batch_size=1024, virtual_batch_size=128, mask_type='entmax', eval_metric=['rmse'], optimizer_fn=<class 'torch.optim.adam.Adam'>, scheduler_fn=<class 'torch.optim.lr_scheduler.StepLR'>, loss_fn=MSELoss(), step_size=10, gamma_scheduler=0.95, verbose=True, drop_last=True, return_model=False)

Train a custom scEve model using a MuData object that contains multiple modalities.

Parameters:
  • mdata (MuData) – MuData object.

  • rna_modality_name (str, (default: 'rna')) – Name of RNA (GEX) modality in MuData object.

  • second_modality_name (str, (default: 'adt')) – Name of protein (ADT) or ATAC-seq modality in MuData object.

  • path (str, path object) – Path where a folder is created to store the model, training history, dictionary of cell annotations, and genes used for training.

  • detailed_annotation (str, (default: None)) – The most detailed level of cell annotation. Key in the mdata.obs dataframe. If given, it may improve the model evaluation score.

  • model_name (str, (default: 'model_regression')) – Name of a folder to save model.

  • accelerator (str, (default: 'auto')) – Type of accelerator to use in training model (‘cpu’, ‘cuda’). Set ‘auto’ for automatic selection.

  • random_state (int, (default: 0)) – Controls data shuffling, splitting into folds, and model training. Pass an int for reproducible output across multiple function calls.

  • test_size (float or int, (default: 0.1)) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test cells.

  • n_d (int, (default: 8)) – Width of the decision prediction layer. Larger values give the model more capacity at the risk of overfitting. Values typically range from 8 to 64.

  • n_a (int, (default: 8)) – Width of the attention embedding for each mask. Values typically range from 8 to 64.

  • n_steps (int, (default: 3)) – Number of steps in the architecture. Values typically range from 3 to 10.

  • n_shared (int, (default: 2)) – Number of shared Gated Linear Units at each step. Values typically range from 1 to 5.

  • cat_emb_dim (int, (default: 1)) – Embedding size for each categorical feature. Values typically range from 1 to 5.

  • n_independent (int, (default: 2)) – Number of independent Gated Linear Units layers at each step. Values typically range from 1 to 5.

  • gamma (float, (default: 1.3)) – Coefficient for feature reuse in the masks. A value close to 1 makes mask selection least correlated between layers. Values typically range from 1.0 to 2.0.

  • momentum (float, (default: 0.02)) – Momentum for batch normalization. Values typically range from 0.01 to 0.4.

  • lr (float, (default: 0.02)) – Learning rate. Determines the step size at each iteration while moving toward a minimum of the loss function. A large initial learning rate of 0.02 with decay is a good option.

  • lambda_sparse (float, (default: 0.001)) – This is the extra sparsity loss coefficient. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help.

  • patience (int, (default: 10)) – Number of consecutive epochs without improvement before early stopping is performed. If patience is set to 0, no early stopping is performed. Note that if patience is enabled, the best weights from the best epoch are automatically loaded at the end of training.

  • max_epochs (int, (default: 200)) – Maximum number of epochs for training.

  • batch_size (int, (default: 1024)) – Number of examples per batch. It is highly recommended to tune this parameter.

  • virtual_batch_size (int, (default: 128)) – Size of the mini batches used for “Ghost Batch Normalization”. ‘virtual_batch_size’ should divide ‘batch_size’.

  • mask_type (str, (default: 'entmax')) – Either “sparsemax” or “entmax”. This is the masking function to use for selecting features.

  • optimizer_fn (func, (default: torch.optim.Adam)) – PyTorch optimizer function.

  • scheduler_fn (func, (default: torch.optim.lr_scheduler.StepLR)) – PyTorch scheduler used to change the learning rate during training.

  • loss_fn (torch.loss function (default: torch.nn.MSELoss)) – Loss function for training.

  • step_size (int, (default: 10)) – Period of scheduler learning rate decay, in epochs, for the default StepLR scheduler.

  • gamma_scheduler (float, (default: 0.95)) – Multiplicative factor of scheduler learning rate decay. step_size and gamma_scheduler are collected into the dictionary of parameters that is applied to scheduler_fn.

  • verbose (int (0 or 1), bool (True or False), (default: True)) – Show a progress bar for each epoch during training. Set to 1 or ‘True’ to see the progress of every epoch, or to 0 or ‘False’ to suppress the output.

  • eval_metric (list, (default: ['rmse'])) – List of evaluation metrics (‘mse’, ‘mae’, ‘rmse’, ‘rmsle’). The last metric is used as the target and for early stopping. Root Mean Squared Logarithmic Error (rmsle) cannot be used when targets contain negative values.

  • drop_last (bool, (default: True)) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller.

  • return_model (bool, (default: False)) – Whether to return the model after training.
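
Several of the parameters above interact: lr, step_size and gamma_scheduler together define the learning rate schedule, while batch_size, virtual_batch_size and drop_last determine how the training cells are split into batches. The following is a minimal sketch in plain Python, not part of scparadise, that may help when choosing values. It assumes the default StepLR scheduler is stepped once per epoch, and the helper names (stepped_lr, batches_per_epoch, check_batch_sizes) and the cell count are hypothetical.

```python
import math


def stepped_lr(lr, step_size, gamma_scheduler, epoch):
    """Learning rate at a 0-based epoch under StepLR-style decay.

    Assumption: the scheduler is stepped once per epoch, so the rate is
    multiplied by gamma_scheduler every step_size epochs.
    """
    return lr * gamma_scheduler ** (epoch // step_size)


def batches_per_epoch(n_train_cells, batch_size, drop_last):
    """Number of batches per epoch for a given number of training cells."""
    if drop_last:
        # The last incomplete batch is discarded.
        return n_train_cells // batch_size
    # The last batch is kept and may be smaller than batch_size.
    return math.ceil(n_train_cells / batch_size)


def check_batch_sizes(batch_size, virtual_batch_size):
    """Raise ValueError if virtual_batch_size does not divide batch_size."""
    if batch_size % virtual_batch_size != 0:
        raise ValueError(
            f"virtual_batch_size={virtual_batch_size} should divide "
            f"batch_size={batch_size}"
        )


# Default values taken from the signature above.
train_kwargs = dict(
    lr=0.02,
    step_size=10,
    gamma_scheduler=0.95,
    batch_size=1024,
    virtual_batch_size=128,
    drop_last=True,
)

check_batch_sizes(train_kwargs["batch_size"], train_kwargs["virtual_batch_size"])

# Learning rate at a few epochs within the default max_epochs=200.
for epoch in (0, 10, 50, 199):
    rate = stepped_lr(
        train_kwargs["lr"],
        train_kwargs["step_size"],
        train_kwargs["gamma_scheduler"],
        epoch,
    )
    print(f"epoch {epoch:3d}: lr = {rate:.6f}")

# Hypothetical number of training cells left after the test split.
n_train_cells = 9000
print(batches_per_epoch(n_train_cells, train_kwargs["batch_size"], drop_last=True))
print(batches_per_epoch(n_train_cells, train_kwargs["batch_size"], drop_last=False))
```

After these checks, the same dictionary could be passed on as scparadise.sceve.train(mdata, **train_kwargs), since every key is a keyword argument in the signature above.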