Cocomore: Composer in Drupal 8.8.0 - First impressions
Pierce Lamb: Creating a ML Pipeline on AWS Sagemaker Part Three: Training and Inference
This is the third post in a three part series on creating a reusable ML pipeline that is initiated with a single config file and five user-defined functions. The pipeline is finetuning-based for the purposes of classification, runs on distributed GPUs on AWS Sagemaker and uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.
This post originally appeared on VISO Trust’s Blog
This post will cover the training and testing (inference) steps. These are the core steps in an ML pipeline, where a model is hyper-parameter tuned and the test set is used to measure performance. If you have landed on this post first, check out the first post in the series detailing the pipeline setup and the second post detailing the data steps.
Training and Tuning
The reason I have combined Training and Tuning into one section is that Tuning is just a set of training jobs in which performance is incrementally improved by changing hyperparameters. As such, under the covers, the two types of jobs call the same code. As we have previously, let's first take a look at perform_training() and perform_tuning() to see how the code interacts with Sagemaker.
Zooming into perform_training(), we encounter the first bit of backend code that handles a use case we have not yet discussed: comparing two models. If you recall in part one, one of the motivations for creating this pipeline was to rapidly test multiple Document Understanding models and compare performance between them. As such, the pipeline is built to handle, in a single experiment, multiple models being passed in the settings.ini file the experimenter defines. In fact, the MODEL_NAMES parameter from this file can accept one or many model names, the latter implying that the experimenter wants to run a comparison job. A comparison job has no impact on Data Reconciliation or Data Preparation; we want these steps to be isomorphic to a single model job as the idea is that n models get trained and tested on the exact same snapshot of training data. With that preamble, perform_training() looks like this:
https://medium.com/media/5ed143495634bb0cb3a152411f3dd4f1/href

The loop here is iterating over either a list with n model names or a list with a single model name. For each model name, an Estimator() is constructed and .fit() is called, which kicks off a training job on Sagemaker. get_estimator_kwargs() will look familiar to anyone who has trained on Sagemaker already:

https://medium.com/media/436633131315849fe7ee203221679f0d/href

Settings are extracted from the config we discussed in the first post in the series, the most important of which is config.docker_image_path. As a refresher, this is the ECR URL of the training image the experimenter created in the setup that is used between Sagemaker Processor/Training/Tuning jobs and contains all needed dependencies. Next, perform_training checks a boolean from the settings.ini file, USE_DISTRIBUTED, which defines whether or not the experimenter expects distributed GPU training to occur. If so, it sets some extra Estimator parameters which are largely inspired by the _distribution_configuration function from the sagemaker-sdk.
I will digress for a moment here to talk about one such parameter, namely, an environment variable called USE_SMDEBUG. SMDEBUG refers to a debugging tool called Sagemaker Debugger. For reasons I cannot explain and that have not been answered by AWSlabs, this tool is on by default, and distributed training would not work for some models, producing mysterious exception traces. It only became obvious to me when carefully examining the traces and seeing that it was some code in smdebug that was ultimately throwing. Furthermore, there are a variety of ways to turn off smdebug, for instance passing 'debugger_hook_config': False as done above or environment={'USE_SMDEBUG': 0}. However, these methods only work on Training jobs. Again, for reasons I cannot explain, the only way to turn off SMDEBUG on Tuning jobs is to set the env var inside the docker container being used: ENV USE_SMDEBUG="0"; the other methods explained above somehow never make it to a Tuning job's constituent Training jobs. An unfortunate side effect of this is that it makes it difficult for an experimenter to configure this environment variable. At any rate, hopefully AWSlabs fixes this and/or makes smdebug exceptions more user-friendly.
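For reference, a minimal sketch of how those switches can be passed when constructing a Training-job Estimator; the image URI, role and instance settings are placeholders, not our actual values:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<ECR image URI>",          # config.docker_image_path in the pipeline
    role="<SageMaker execution role ARN>",
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    debugger_hook_config=False,           # disables smdebug for Training jobs...
    environment={"USE_SMDEBUG": "0"},     # ...as does this env var
)
```

For Tuning jobs, as noted above, only the ENV USE_SMDEBUG="0" line baked into the docker image reliably takes effect.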
The call to .fit() makes the actual call to the AWS API. The config.training_data_uri parameter specifies the S3 URI of the encoded training data from the Data Preparation step; the training instance will download this data to local disk before it executes, where it can be easily accessed by multiple GPU processes. How does the job know what code to execute? That is specified in the base docker container which is extended by the experimenter:
https://medium.com/media/3e699b6b220cb149464b463ae71c387d/href

These environment variables are used by the sagemaker-training library to kick off the training script. At this point we would dive into train.py, but since it is also used by a Tuning job, let's take a look at how we kick off a Tuning job. The beginning of a Tuning job is nearly identical to a Training job:

https://medium.com/media/2ef4d1a2e799563d33f201b80ae8a48e/href

But now, instead of calling .fit(), we need to set up a few more parameters a Tuning job requires. A Tuning job requires a set of constant hyperparameters and tunable hyperparameters. As such, here is an example of what an experimenter might write in the settings.ini file to represent this:

https://medium.com/media/26cafe0f890711c5831b38709973d950/href

Here the constants will not change between tuning jobs, but the tunable parameters will start with guesses and those guesses will get better as jobs complete. The -> and , are syntax I've chosen; in this context -> stands for an interval while , stands for categorical options. Having seen this, the next piece of the Tuning job setup should make sense:

https://medium.com/media/5d7cb011649772a20f0d2e99d8e9df22/href

Now we have our dict of tunable parameters we can pass to the HyperparameterTuner object:

https://medium.com/media/ca5b8a696072a1d65658a2a9926904a1/href

This should look somewhat familiar to what we just did for Training with a few extra parameters. So far, the HyperparameterTuner object takes the constructed Estimator() object that will be re-used for each constituent Training job and the tunable hyperparameters we just discussed. A Tuning job needs to measure a metric in order to decide if one set of hyperparameters is better than another. objective_metric_name is the name of that metric. This value is also used in the metric_definitions parameter which explicitly defines how the HyperparameterTuner job can extract the objective metric value from the logs for comparison. To make this more concrete, this is how these values are defined in an example settings.ini file:

https://medium.com/media/9ed31a5a4a0257d910f4eeba248401df/href

Finally, the max_jobs parameter defines how many total Training jobs will constitute the Tuning job and max_parallel_jobs defines how many can run in parallel at a given time. Like the Estimator in the Training job, we call fit() to actually kick off the Tuning job and pass it the training_data_uri like we did previously. With this in place, we can now look at train.py and see what executes when a Training or Tuning job is executed.
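To make that setup concrete, here is a rough sketch of how the pieces described above could fit together; the range-parsing helper, metric name, regex, job counts and S3 path are illustrative assumptions, and `estimator` is the Estimator object constructed earlier:

```python
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

def parse_tunable(value: str):
    """Turn the settings.ini syntax into a SageMaker parameter range:
    'a->b' becomes a continuous interval, 'x,y,z' a categorical choice."""
    if "->" in value:
        low, high = value.split("->")
        return ContinuousParameter(float(low), float(high))
    return CategoricalParameter(value.split(","))

tunable_hyperparameters = {
    "learning_rate": parse_tunable("1e-5->1e-4"),
    "per_device_train_batch_size": parse_tunable("2,4,8"),
}

tuner = HyperparameterTuner(
    estimator=estimator,                    # re-used for each constituent Training job
    objective_metric_name="validation_f1",
    metric_definitions=[
        {"Name": "validation_f1", "Regex": "validation_f1=([0-9\\.]+)"}
    ],
    hyperparameter_ranges=tunable_hyperparameters,
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"training": "s3://<bucket>/EXP-3333-longformer/data/prepared_data/full_dataset"})
```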
The goal of train.py is to fine-tune a loaded model using a set of distributed GPUs, compute a number of metrics, determine which is the best model, extract that model's state_dict, convert that model to torchscript, and save these files along with a number of graphs to S3. Huggingface's Accelerate, Evaluate and Transformers libraries are all used to greatly simplify this process. Before continuing, I have to give a brief shoutout to the Accelerate devs who were extremely responsive while I was building this pipeline.
Note that in a distributed setting, every GPU process is going to execute this same train.py file. While much of this coordination can be passed off to Accelerate, it is helpful to keep that in mind while working inside it. Diving a level deeper, train.py is going to:
- Read hyperparameters and determine if the running job is a tuning job, training job or comparison job
- Determine if gradient accumulation will be utilized
- Construct the `Accelerator()` object which handles distribution
- Initialize wandb trackers
- Load split training data and create `Dataloader()`s for training and validation
- Set up an optimizer with learning rate scheduling
- Execute a training and validation loop, computing metrics and storing metric histories and determining what the best model was
- Plot curves for metrics
- Extract the curves, statistics and best model from the loops
- Write all of this data to S3
We start by reading the passed hyperparameters and setting a few values that can be used throughout the training process:
https://medium.com/media/54ea1add460e31d7464feefdb86e917b/href

_tuning_objective_metric is a hyperparameter set by Sagemaker that allows us to easily differentiate between Training and Tuning jobs. As we've mentioned before, the run_num is an important setting that allows us to organize our results and version our models in production so they easily connect back to training runs. Finally, job_type_str allows us to further organize our runs as training / tuning and comparison jobs.
Next we determine if gradient accumulation is needed. Briefly, gradient accumulation allows us to set batch sizes that are larger than what the GPUs we’re running on can store in memory:
https://medium.com/media/3d17328c124913af0e1b718d2c5c7c19/href

Control now moves to setting up the Accelerator() object which is the tool for managing distributed processing:

https://medium.com/media/d497bafdb2c6ca5353a3ba4f0628b048/href

Here we encounter a core concept in Accelerate, is_main_process. This boolean provides a simple way to execute code on only one of the distributed processes. This is helpful if we want to run code as if we're on a single process; for instance, if we want to store a history of metrics as the training loop executes. We use this boolean to set up wandb so we can easily log metrics to wandb. Additionally, accelerator.print() is similar to `if accelerator.is_main_process: print(...)`; it ensures a given statement is only printed once.
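A minimal sketch of that setup, assuming illustrative values for the accumulation steps and tracker config (the real values come from settings.ini and the passed hyperparameters):

```python
from accelerate import Accelerator

# log_with wires Accelerate's tracking API up to wandb; the gradient accumulation
# steps were computed above from the requested vs. feasible batch size.
accelerator = Accelerator(
    gradient_accumulation_steps=4,   # illustrative value
    log_with="wandb",
)

if accelerator.is_main_process:
    accelerator.init_trackers(
        project_name="EXP-3333-longformer",            # experiment group name
        config={"learning_rate": 3e-5, "epochs": 10},  # hyperparameters to log
    )

accelerator.print(f"Running on {accelerator.num_processes} processes")
```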
Recall that we passed config.training_data_uri to the .fit() call for both Training and Tuning jobs. This downloads all of the training data to the Sagemaker instance's local disk. Thus, we can use the Datasets load_from_disk() function to load this data. Note that in the following code SAGEMAKER_LOCAL_TRAINING_DIR is just the path to the dir that data is downloaded to.
https://medium.com/media/fe813271774aea4bffe09af13d887e60/href

Each process loads the dataset, id2label file, metrics and creates dataloaders. Note the use of Huggingface's evaluate library to load metrics; these can be used in tandem with Accelerate to make metric tracking simple during distributed training. We will see shortly how Accelerator provides one simple function to handle distributed training.

https://medium.com/media/15226e2ad8ecbde3ca2e9ceb6aa5a3f4/href

In this code block, we first call the user-defined function load_model to receive the loaded model defined however the experimenter would like. Thus far, this function has typically looked like a call to a Transformers from_pretrained() function, though this is not enforced.
A common learning rate optimizer is created and used to create a learning rate scheduler. Finally, we encounter another core concept in Accelerate, namely, wait_for_everyone(). This function guarantees that all processes have made it to this point before proceeding to the next line of code. It must be called before the prepare() function, which prepares all of the values we've created thus far for training (in our case, distributed training). wait_for_everyone() is used regularly in Accelerate code; for example, it is nice to have when ensuring that all GPUs have completed the training loop. After the prepare() step, the code enters a function to perform the training and validation loop. Next, we will look at how Accelerate works inside that loop.
https://medium.com/media/2fb146cf2b8cb53978f1aa644459b3db/href

At the start of the loop, we initialize a number of values to track throughout training. Here we use is_main_process again to create a single version of metric histories which we will use to plot graphs. In this example, we are only tracking training loss, validation accuracy and f1, but any number of metrics could be tracked here. Next, we enter the loop, set the model in train() mode and enter the train() function:

https://medium.com/media/9cf3010fa51a1ff611796215c6a18a72/href

As execution enters a batch, it first needs to check if we're running a comparison job. If so, it needs to extract the appropriate parameters for the current model's forward() function. If you recall, for comparison jobs, in the Data Preparation step we combined all inputs in the same pyarrow format, but prepended with the model_name (e.g. longformer_input_ids). get_model_specific_batch() just returns those parameters of the batch that match the current model_name.
Next, we encounter `with accelerator.accumulate(model)`, a context manager that recently came out in Accelerate and manages gradient accumulation. This simple wrapper reduces gradient accumulation to a single line. Underneath that manager, back propagation should look familiar to readers who have written ML code before; the one big difference is calling accelerator.backward(loss) instead of loss.backward().
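As a rough sketch of the shape of that loop (simplified, with the comparison-job handling reduced to a comment):

```python
def train(model, train_dataloader, optimizer, lr_scheduler, accelerator):
    """One training epoch; a simplified sketch of the loop described above."""
    total_loss = 0.0
    for batch in train_dataloader:
        # For comparison jobs, the batch would first be narrowed to the current
        # model's inputs via get_model_specific_batch().
        with accelerator.accumulate(model):
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.detach().float()
            accelerator.backward(loss)   # instead of loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
    return total_loss / len(train_dataloader)
```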
Upon completing a training batch, execution sets the model in .eval() mode and moves into the validation loop:
https://medium.com/media/85721e142429b16bd26ae6ddabfc487d/href

Here we encounter another key Accelerate function, gather_for_metrics(). This recently added function makes it much easier to gather predictions in a distributed setting so they can be used to calculate metrics. We pass the returned values to the f1_metric and acc_metric objects we created earlier using the Evaluate library. The validation loop then computes the scores and returns them.
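A sketch of what such a validation pass can look like, assuming f1_metric and acc_metric are the evaluate.load("f1") and evaluate.load("accuracy") objects mentioned above:

```python
import torch

def validate(model, eval_dataloader, accelerator, f1_metric, acc_metric):
    """Validation pass; gathers predictions across processes before scoring."""
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        # Collect predictions and labels from every GPU process, dropping any
        # duplicated samples used to pad out the last batch.
        predictions, references = accelerator.gather_for_metrics(
            (predictions, batch["labels"])
        )
        f1_metric.add_batch(predictions=predictions, references=references)
        acc_metric.add_batch(predictions=predictions, references=references)
    f1 = f1_metric.compute(average="weighted")["f1"]
    accuracy = acc_metric.compute()["accuracy"]
    return f1, accuracy
```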
After sending the batch through training and validation, we perform tracking on the values we initialized at the beginning:
https://medium.com/media/db048dfedd8e2fe6728622738468691b/href

Since the main process holds the references to our history-tracking data structures, we use is_main_process to append the new values. accelerator.log links up with the init_trackers call we made earlier: .log sends these values to the tracker initialized earlier. In our case, wandb will create graphs out of these values. Finally, we use the F1 score to determine the best model over time.
After the training and validation loop is done, we execute:
https://medium.com/media/dc63e1e5acfc8c004540aa4e3befae6e/href

We start by ensuring that all processes have completed the training/validation loop and then call unwrap_model to extract the model from its distributed containers. Since the main process contains our metric histories, we use it to plot curves for each metric and calculate model statistics; we then return out the best model, curves and statistics.
Now that the training/validation loops are complete and we’ve determined a best model, we need to convert that best model to torchscript and save all the returned files to S3.
https://medium.com/media/52d154dd8d59f61df8dd47845fbbea6b/href

Here we call end_training since we are using wandb, and use is_main_process since we no longer need distribution. accelerator.save() is the correct way to save the model to disk, but we need to convert it to torchscript to mirror production as closely as possible. Briefly, torchscript is a way of converting a Python-based model into a serializable, production-friendly format that need not have a Python dependency. As such, when testing inference on an unseen test set, it is best to test on the model that would be in production. One way to convert a model is to call torch.jit.trace, passing it the model and a sample instance, which is how we've implemented the conversion:

https://medium.com/media/bddbf1d75422ee06914bb98c71014ba9/href

First, we take the best model and put it in CPU and evaluation mode. We then grab a sample instance out of the training data. Next, we encounter another user-defined function, ordered_input_keys(). If you recall, this function returns the parameter names for a model's forward() function in the correct order. It probably didn't make sense earlier why this function was needed, but now it should: the example_inputs parameter of torch.jit.trace takes a tuple of input values which must match the exact parameter ordering of the forward() function.
Now, if we're running a comparison job, then ordered_input_keys() is going to return a dictionary of OrderedDicts with keys based on each model's name. Thus, we test for this scenario and use the same get_model_specific_batch() function we used during training to extract a sample instance for the current model being converted.
Next, we iterate the ordered input keys and call .unsqueeze(0) on each parameter of the sample instance. The reason for this is that the forward() function expects a batch size as the first dimension of the input data; .unsqueeze(0) adds a dimension of 1 onto the tensors representing each parameter's data.
Now we are ready to run the trace, passing the model, the example inputs and setting two parameters to false. The strict parameter controls whether or not you want the tracer to record mutable containers. By turning this off, you can allow, for example, your outputs = model(**batch) to remain a dict instead of a tuple. But you must be sure that the mutable containers used in your model aren’t actually mutated. check_trace checks that the same inputs run through the traced code produce the same outputs; in our case, leaving this True was producing odd errors, likely because of some internal non-deterministic operations, so we set it to False. Again, the ultimate test of the performance of the model is the inference step which we will be discussing next.
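Putting those pieces together, a hedged sketch of the conversion; the function and variable names here are illustrative rather than our exact code:

```python
import torch

def convert_to_torchscript(best_model, sample_instance, ordered_keys):
    """Trace the best model to torchscript, mirroring the steps described above."""
    model = best_model.cpu().eval()
    # forward() expects a batch dimension, so add one to each input tensor
    example_inputs = tuple(
        sample_instance[key].unsqueeze(0) for key in ordered_keys
    )
    traced = torch.jit.trace(
        model,
        example_inputs=example_inputs,
        strict=False,       # allow dict outputs / mutable containers
        check_trace=False,  # skip the re-run equality check (see caveat above)
    )
    return traced
```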
Finally, we save the traced model to local disk so it can be uploaded to s3. The final step of the train.py file is to upload all of these generated files to S3. In the case of a tuning job, we only retain the generated files from the run with the best objective metric score:
https://medium.com/media/ce129c3e8ffc3acb731246e6465051c0/href

And with that, we have completed discussing the training/tuning step of the ML Pipeline. Next, we will look at the inference step where we load the torchscript model, perform inference on the unseen test set and collect statistics.
Inference
In the Training/Tuning step, we converted our best model into torchscript, which means it can easily run in a CPU or multi-CPU environment. This enables us to hijack a Sagemaker Processor instance to perform our inference job. Like the previous sections, we will first look at how an inference job is initiated. Because we can use a Processor instance, it is identical to our Data Preparation step except for pointing it at our /test/ data and our inference.py file.
https://medium.com/media/346d017e6fe8a9a369e18b5ea68715c1/href

Refer to the Data Preparation section of the second post to learn more about Processor/ScriptProcessor jobs. Note the differences: input_source_dir points at /test/ and `code` points at inference.py. Since these are so similar, we will move on to looking at the inference.py file.
We’ve discussed repeatedly the importance of run_num and how it is used to help identify the current experiment not only while training, but also the current model in production (so a production model can be linked to a training experiment). The inference.py will use the experiment parent directory to find the test data and the run_num to find the correct trained model.
The inference.py starts by downloading the id2label file so we can translate between model predictions and human-readable predictions:
https://medium.com/media/21466d30456617190f18debf9462512e/href

Recall from previous sections that the ML pipeline is capable of running comparison jobs (n models trained and tested on the same dataset). Inference is the step where comparison really shines, allowing you to compare performance on identical data. In the next code block, we will load n models to prepare for inference. Recall that if a single model was trained, it is passed as a list with a single value:

https://medium.com/media/54e4395375aa91710a4364c8334e4790/href

This loop iterates the model names, downloads/loads the torchscript-converted model and initializes statistics tracking for each. Let's take a look at each inner function:

https://medium.com/media/d70a0ed298c6054e3277eb0fa5a61762/href

This function constructs the path where the .pt file lives and downloads it. It then calls torch.jit.load and sets the model to eval mode, ready for inference. init_model_stats initializes values we will track per model and per label, which provides the raw counts we can use to build statistics:

https://medium.com/media/e49eea2f4f5f52e4019cb34092fed721/href

And init_metrics() simply loads the metrics we used earlier in the training step:

https://medium.com/media/a0da11f8058d91ce06484850572bb35d/href

Next, we get the test data from the Data Preparation step:

https://medium.com/media/a7654e454bd42419b3fd097f1c6f8210/href

With the models and data loaded, we are now ready to run inference:

https://medium.com/media/25f8fe7bc95c184b990e42ddf11cf36d/href

The inference code will use config.is_comparison repeatedly to execute code specific to comparison jobs. It starts by initializing statistics specifically for comparisons, which we will skip for now. Next, it enters the main loop which iterates through each instance of unseen test data. The ground truth label is extracted and execution enters the inner loop over the model names (in the case of one model this is just a List with a single entry). If is_comparison is set, the data specific to the current model is extracted using the same function used in Training (get_model_specific_batch). The instance is then prepared for the forward() function using the same technique we used in convert_to_torchscript: each value gets .unsqueeze(0) called in order to add a batch size of 1 as the first dimension of the tensor.
We then grab the currently loaded model and pass the instance to it. We extract the most confident prediction from the returned logits by calling argmax(-1). Now let’s look at the remainder of the loop (note this begins inside the inner loop):
https://medium.com/media/bd012571a9a1424df4d153c852f57773/href

We take the prediction produced by the model and pass it and the ground truth to our accuracy and f1 metrics. We then increment the counters we initialized at the beginning:

https://medium.com/media/6268a5bd5ab7ebc0de9476a30effabf8/href

If inference.py is running a comparison job, we then add counts to the structure we initialized earlier; we will skip over these calls and jump to process_statistics which occurs after the inference code has finished looping:

https://medium.com/media/2840a3e82e148554bf7c0c8b82fb8d15/href

This function looks intimidating, but all it is doing is calculating the F1 score and Accuracy per label, sorting the results by F1 score descending, calculating the overall F1 and Accuracy and uploading the results to S3 under the correct parent dir and run_num.
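For illustration, a minimal sketch of that kind of per-label calculation; the counts structure here is an assumption about what init_model_stats tracks, not the pipeline's actual schema:

```python
def per_label_statistics(label_stats):
    """Per-label accuracy and F1 from raw counts, sorted by F1 descending.

    `label_stats` maps label -> {"tp": int, "fp": int, "fn": int, "support": int}.
    """
    results = {}
    for label, c in label_stats.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall)
            else 0.0
        )
        accuracy = c["tp"] / c["support"] if c["support"] else 0.0
        results[label] = {"f1": f1, "accuracy": accuracy}
    return dict(sorted(results.items(), key=lambda kv: kv[1]["f1"], reverse=True))
```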
If you've followed the ML Pipeline blogs up to this point, it is worth revisiting the folder structure we laid out in the first post, which is built up on S3 as the entire pipeline executes:
https://medium.com/media/71816d6b1e8b1a9f17e52b9065cc51f6/href

This folder structure recurs for every machine learning experiment, containing everything one would need to quickly understand the experiment or reproduce it and link an experiment to what is in production.
Prima facie, it seems like a simple part of the overall pipeline, but I believe it is one of the most important: it imbues each experiment with desirable properties like navigability, readability, reproducibility, versioning and more.
If you've been following these blogs up to this point then you've been on quite a journey. I hope they provide some guidance in setting up your own ML Pipeline. As we continue to modify ours, we will post on blog-worthy topics, so stay tuned. You can check out the first two posts in the series here: Part One: Setup, Part Two: Data Steps.
Pierce Lamb: Creating a ML Pipeline on AWS Sagemaker Part Two: Data Steps
This is the second post in a three part series on creating a reusable ML pipeline that is initiated with a single config file and five user-defined functions. The pipeline is finetuning-based for the purposes of classification, runs on distributed GPUs on AWS Sagemaker and uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.
This post originally appeared on VISO Trust’s Blog
This post will cover the two data steps, data reconciliation and data preparation. These are common steps in an ML process where data is collected, cleaned and encoded the way a model will expect. If you have landed on this post first, check out the first post in the series detailing the pipeline setup. You can also jump to the third post in the series detailing training and testing.
Data Reconciliation

Of all the pipeline steps, the Data Reconciliation step is the one most likely to be customized to your specific use case. It represents the taking-off point for collecting, cleaning and filtering the training data that will compose your experiment and getting it on S3. In our case, the raw training data exists in flat files already on S3 while the labels required for supervised training exist in a production database. This is, in fact, why I called it 'Data Reconciliation': the production database labels are being reconciled with the flat files on S3.
As it is unlikely the reader has the exact same setup, I will try to highlight some of the re-usable parts of Data Reconciliation without getting too far into our specific flavor of it. Recall that a major architecture decision in the pipeline is a separate set of training data for every experiment; the goal of this step, then, is to collect the raw data, clean it and copy it to the bucket and folder on S3 where this experiment's storage will reside (e.g. EXP-3333-longformer/data/reconciled_artifacts).
To better understand what follows, I'll draw a distinction between 'artifacts' and 'files.' For every 'artifact' uploaded into our system, tens of 'files' are created that represent data and analysis about the given 'artifact.' As such, our raw data is composed of these sets of files per uniquely identified artifact.
The first step in Data Reconciliation is to collect all of the raw data. In our case, this means authenticating to a read replica of the production database, and running a query that contains artifact identifiers related to their ground truth classification labels. We then collect all of the S3 file paths on the production instance of S3 keyed by the same artifact GUID identifier.
Data Reconciliation knows which S3 file paths to collect via a settings.ini value passed by the experimenter called FILES_FROM_PROD. For example, imagine each artifact has a file called raw_text.json; the experimenter would pass FILES_FROM_PROD=raw_text.json and Data Reconciliation would find the S3 path to every raw_text.json file on the production S3 bucket.
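A hedged sketch of what that collection can look like with boto3, assuming keys are laid out as <prefix>/<artifact-guid>/<file-name> (the real layout may differ):

```python
from collections import defaultdict

import boto3

def collect_prod_file_paths(bucket, prefix, files_from_prod):
    """Map artifact GUID -> S3 keys for the files named in FILES_FROM_PROD."""
    s3 = boto3.client("s3")
    paths = defaultdict(list)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            parts = key.split("/")
            # assumed layout: .../<artifact-guid>/<file-name>
            if len(parts) >= 2 and parts[-1] in files_from_prod:
                paths[parts[-2]].append(key)
    return paths
```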
Using the artifact identifiers (GUIDs), we then filter the production database results such that both datasets contain the exact same artifact identifiers and drop duplicates using the file hash. At this point the labels and S3 paths to the flat files are now reconciled; the actual files and the label just need to be copied to the correct experiment directory.
Before that copying begins, note that we now have unique insight into the training data for this experiment. Using the filtered database results, we can discover exactly the labels that will be trained on, and the instance count per label:
https://medium.com/media/b41974050eba9c6ff1d85e41a2964fe3/href

Where df is a pandas dataframe of the filtered database results. Now every experiment has a unique_labels_and_counts.json in its /data folder the experimenter can interrogate to see which labels and their counts are associated with this training data set.
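Roughly, the logic behind that file can be sketched as follows; the column, bucket and key names are placeholders:

```python
import json

import boto3

def write_label_counts(df, bucket, key):
    """Persist label -> instance count for this experiment's training set."""
    label_counts = df["label"].value_counts().to_dict()   # column name assumed
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,  # e.g. EXP-3333-longformer/data/unique_labels_and_counts.json
        Body=json.dumps(label_counts, indent=2),
    )
    return label_counts
```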
At this point, we encounter our first user-defined function. process_func is an optional function that will run after Data Reconciliation has copied files for every artifact identifier; it gives the experimenter the opportunity to execute some arbitrary code for each artifact identifier. As an example, when we go to train we need access to the ground truth labels extracted from the production database. process_func gives us the ability to create an additional file per artifact, say, ground_truth_label.json, that contains this label. Furthermore, if one's model requires additional files to train on, e.g. an image of a given page, that additional file can be created here, per artifact. Because it's optional, the user may choose not to define it; thus:
https://medium.com/media/b8b540632f4bc8448e0285dfe77e6ac6/href

Now that we have our reconciled data and our process_func, we have to copy data from the production S3 bucket into our experiment S3 directory. This can easily occur in parallel, so we utilize multiprocessing to kick it off as a parallel process:

https://medium.com/media/b06a7f3cf937577b531eeb15b4e217a1/href

This function gets the df we discussed earlier, the experiment bucket, the dict of artifact identifier (GUID) to list of desired file paths (raw_training_data_paths), the parent experiment dir (s3_artifact_path), the number of parallel processes (either a config value or multiprocessing.cpu_count()), the process_func and a boolean that determines whether or not to overwrite.
First, it uses the same function that created raw_training_data_paths except pointed at the experiment bucket and with EXP-3333-longformer/data/reconciled_artifacts/ as a filter. This gives us a dict of what training data already exists for the experiment in case Data Reconciliation failed and had been restarted; we don’t copy the same data again. Next, it splits the reconciled data per process and for each split, creates a process and calls the add_to_research_experiment function. Let’s take a look at that function:
https://medium.com/media/cfb52ba7e7b656b4c7d4d6d8b009bdc8/href

The parameters to this function should be fairly straightforward given our discussion of copy_s3_data_in_parallel. The function iterates the data frame chunk directly, checking for three different copying scenarios. I am aware that iterating a data frame directly is generally frowned upon in favor of a vectorized approach; in our case, these chunks are fairly small so it is not something we worry about. For each artifact, this function checks, first, whether overwriting (reload) was set to true, then whether the artifact already exists in the experiment and has additional files to add to it, and finally whether it does not exist at all. In each case it calls an additional function that will copy the correct set of files. Next, let's take a look at copy_to_s3:

https://medium.com/media/696545017ed9ec343577ff7119410f4d/href

This function is straightforward, and nicely shows what gets passed to process_func if the user has defined it. It gets the row from the df representing the current artifact, the existing files for the artifact _after_ copying, the experiment path and the overwriting boolean. This gives the experimenter a lot of flexibility in what they can do per artifact.
The final step of Data Reconciliation is a validation step where we use the config value FILES_ON_RESEARCH to validate that each artifact has the files it needs for training. The reason we can't just use the earlier FILES_FROM_PROD value is that new files may have been created in process_func. So FILES_ON_RESEARCH may look like raw_text.json, page_01.png for example. This validation step is meant to provide some assurance that when we move on to Data Preparation, each artifact will have every file it needs and we don't need to write code to handle missing files. So after all of our parallel processing completes, validate_data_was_created runs, which we will view in partial stub form:
https://medium.com/media/386054dd5e9cc3e05a39453c5aa64fcf/href

This function takes the full df, the list of desired files defined by FILES_FROM_PROD, the list of desired files that should be in the experiment (FILES_ON_RESEARCH), the experiment directory (EXP-3333-longformer/data/reconciled_artifacts/) and the user-defined process_func. It collects all the existing file paths for the given experiment and iterates them, popping file names off FILES_ON_RESEARCH to check if they exist for each artifact. If files are missing, it then determines whether they are FILES_FROM_PROD files, which it retrieves from the prod S3 bucket, or process_func files, in which case it re-runs process_func to generate them. Once this step is complete, we can have high confidence that all of our raw training data files exist for each artifact. As such, we can move on to Data Preparation.
Data Preparation

The data preparation step is meant to take the raw training files for the experiment and encode them so they are prepared to be input into a model's forward() function. For this task, we will utilize the HuggingFace Datasets library and specifically its powerful map() function. This is also the first task that will utilize Sagemaker, specifically Sagemaker Processor jobs.
Let’s start by taking a look at how the Processor job is constructed and called. First, we utilize the Sagemaker Python SDK’s ScriptProcessor class. This allows us to run an arbitrary script on a Processor instance. Creating the ScriptProcessor object will look like:
https://medium.com/media/949b306885da38dca0d8d9a4e29292c3/href

As you can see, this construction is basically defined by config values. Arguably the most important is config.docker_image_path. This carefully constructed docker image, which we spoke about in the first post in this series, is re-used among all Sagemaker jobs (Processor/Training/Tuning). We spoke in the first post about how an experimenter extends a base image that contains all common dependencies like CUDA-enabled pytorch, transformers, datasets, accelerate, numpy, etc. and adds any of their model-specific dependencies. That base image also contains lines that allow it to run on these different Sagemaker instances; we'll discuss one now and more during our discussion of training:

https://medium.com/media/7de71a8e565dcef441186c07cf87cdef/href

Sagemaker Training/Tuning jobs always look in the /opt/ml/code directory for custom dependencies while Processor jobs look in /opt/ml/processing. These lines copy all of our ML pipeline code into these directories to ensure that all custom dependencies are available in either type of job. Now if we jump back over to where we constructed the ScriptProcessor object, this is how we kick off the job:

https://medium.com/media/a0bc063774e034385ed5a3b59c7b7f18/href

One feature of Processor jobs that is easy to miss is that before the script is executed, Sagemaker copies everything from the S3 URI provided in the source param onto local disk in the destination path. Building your script around this fact will give you huge performance benefits, which we will discuss more later on. Another important point that may not be immediately obvious is that the command param combined with the code param is basically like defining an ENTRYPOINT for the Processor job. While it's not exactly accurate, you can imagine these params creating this command in the container:
ENTRYPOINT ['python3', '/opt/ml/code/src/preprocessing/data_preparation.py']
So the code above is constructing the S3 URI to the reconciled artifacts we created in the Data Reconciliation step and passing it in the source param, and the Processor job copies all of this data to local disk before it kicks off. SAGEMAKER_LOCAL_DATA_DIR defines where that data will be copied and is specified in data_preparation.py so the path can be used there as well. Processor jobs can output data, which is why I've defined outputs, but for now the data_preparation.py script is not utilizing this feature. Now that we've discussed how it is kicked off, we can take a look at encoding data in data_preparation.py.
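For orientation, here is a rough sketch of that construction and run() call end to end; the bucket names, paths, instance type and script location are placeholders rather than our actual values:

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

script_processor = ScriptProcessor(
    role="<SageMaker execution role ARN>",
    image_uri="<ECR image URI>",        # config.docker_image_path
    command=["python3"],
    instance_count=1,
    instance_type="ml.m5.4xlarge",      # plenty of vCPUs for the parallel map()
)

script_processor.run(
    code="src/preprocessing/data_preparation.py",   # script executed in the container
    inputs=[
        ProcessingInput(
            source="s3://<bucket>/EXP-3333-longformer/data/reconciled_artifacts/",
            destination="/opt/ml/processing/input/data",   # SAGEMAKER_LOCAL_DATA_DIR
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://<bucket>/EXP-3333-longformer/data/prepared_data/",
        )
    ],
)
```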
The first step at the beginning of encoding is to define the S3 directory where data will be saved and get the label file we produced during Data Reconciliation. We read a config value to get the encoded data dir, namely, ENCODED_DATA_DIR. The value will typically be full_dataset, but it gives the experimenter the ability to produce smaller test datasets if desired (e.g. partial_dataset). So the full path will look like:
encoded_data_dir = f"{config.s3_parent_dir}/data/prepared_data/{config.encoded_data_dir}"
Or EXP-3333-longformer/data/prepared_data/full_dataset
Next, we get the unique_labels_and_counts.json file we uploaded during Data Reconciliation as our ground truth for supervised learning. We give the experimenter the ability to modify the ground truth here through some basic knobs: IGNORED_LABELS and NUM_LABELS_THRESHOLD; I could imagine a number of other options here. These knobs are self-explanatory:
https://medium.com/media/0511dd70829f672d8bac6009c7d15331/href

After modifying the labels the way the experimenter wants, execution moves on to the get_artifact_paths function. This function gets the paths on local disk that raw training data was copied to and returns them in a format that the Huggingface Datasets library will expect:

https://medium.com/media/92c63f294ea8a9eb73125b5bf4b8f4c2/href

get_artifact_paths is called using the same path we passed to Processor.run() to define where data should be copied, along with the results of the MODEL_INPUT_FILES config param. Following our example, this value would simply be [raw_text.json]. A Huggingface datasets.arrow_dataset.Dataset is eventually going to expect data formatted where each row constitutes an instance of training data, and each column represents the path to the needed input file. In our case it would look like:

https://medium.com/media/9a8906b6cdd5367dfede611796f582bc/href

This would be easy to represent in pandas, but since we'd prefer to not depend on pandas and will utilize Dataset.from_dict(), get_artifact_paths represents this structure using the file names as keys and lists to contain the paths.
Execution then enters the directory defined in SAGEMAKER_LOCAL_DATA_DIR and extracts the list of subdirs which, in our case, are GUIDs for each artifact. It iterates these subdirs collecting the filenames for all files that are children of each subdir. It then uses the passed MODEL_INPUT_FILES to validate that each needed file is there and adds it to the artifact_paths dict. We now have a dict that is ready for Datasets processing.
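A simplified sketch of that traversal, assuming the directory layout described above (one subdirectory per artifact GUID); the real function validates more carefully:

```python
import os

def get_artifact_paths(data_dir, model_input_files):
    """Build the column-oriented dict Dataset.from_dict() expects:
    one key per required input file, each holding a list of local paths."""
    artifact_paths = {file_name: [] for file_name in model_input_files}
    for guid in sorted(os.listdir(data_dir)):
        artifact_dir = os.path.join(data_dir, guid)
        if not os.path.isdir(artifact_dir):
            continue
        files_present = set(os.listdir(artifact_dir))
        # only keep artifacts that have every required file
        if not set(model_input_files).issubset(files_present):
            continue
        for file_name in model_input_files:
            artifact_paths[file_name].append(os.path.join(artifact_dir, file_name))
    return artifact_paths
```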
Control now moves to a get_encoded_data() function that will kick off datasets.arrow_dataset.Dataset.map(), which is a very powerful abstraction for encoding datasets. get_encoded_data is intended to set up the map() function for parallel processing of raw training data encoding and is the main part of the Data Preparation step:
https://medium.com/media/a7d67dbb995e7a70e57c3ba10d1a68f7/href

This function sets up the mapper, executes it, splits the returned encoded data and saves the split, encoded data to S3. The function takes the get_artifact_paths data we just generated (as data), a list of just the labels from unique_labels_and_counts.json, a few directory paths and the number of parallel processes to spin up. It starts by generating two label dicts in handle_labels, label2id.json and id2label.json, which will be used downstream to convert between the integer values predicted by the model and actual string labels.
Next, one of our user-defined functions, get_dataset_features, is called. As you may have noticed from the hints in the Datasets classpaths, Datasets uses PyArrow as the backend for writing and reading data. PyArrow needs to enforce a schema it writes to and reads from; get_dataset_features allows the experimenter to write that schema. This function returns a Datasets Features object which packages up this schema for the backend. Following our Longformer example, this function might look like:
https://medium.com/media/d859c13115e1dd55c19cfb3303578a16/href

The keys here represent the parameters the Longformer forward() function will expect when performing the forward pass. Now that we have these features, we can call Dataset.from_dict() on our get_artifact_paths data and we are fully ready for the mapper. The mapper has a variety of options, but the core concept is applying a function to every instance of training data that encodes and returns it. Let's take a closer look at the call in Data Preparation:

https://medium.com/media/ff9d2c39cd32c64d8cb2feee9d2c40b0/href

Here we pass the function we want to execute per instance, preprocess_data; fn_kwargs allows us to specify additional parameters we want to pass to that function; batched means that preprocess_data will receive batches of data instead of single instances, which allows us to perform additional filtering; features are the features we retrieved from get_dataset_features; we remove the column names so they aren't encoded; and finally we set the number of processes to run in parallel.
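Pulling those pieces together, the call looks roughly like the following sketch; the variable names mirror the surrounding prose and are assumptions rather than the pipeline's exact code:

```python
from datasets import Dataset

dataset = Dataset.from_dict(artifact_paths)       # the get_artifact_paths dict

encoded_dataset = dataset.map(
    preprocess_data,                       # encodes (and optionally filters) a batch
    fn_kwargs={"label2id": label2id},      # extra arguments forwarded to preprocess_data
    batched=True,                          # hand preprocess_data batches, not single rows
    features=features,                     # schema from get_dataset_features()
    remove_columns=dataset.column_names,   # drop the raw file-path columns
    num_proc=num_processes,                # parallel workers
)
```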
With this in place, we can take a look at def preprocess_data which is executed by each process in parallel:
https://medium.com/media/85e711c4d4a5fa05e82657b5539b3ef2/href

The function first validates that each column of data has the exact same length and returns that length so it can be iterated over. It then iterates the batch, constructing a single instance and passing it to another user-defined function, encode_data. encode_data gives the experimenter the ability to define exactly how a single training instance is encoded, with the option of returning None if additional filtering is desired. For instance, say we were using a Huggingface Transformers Tokenizer to encode; a single_instance here represents the file paths to the data we need, so we would get that data, say, in a variable called text_content and call something like this:

https://medium.com/media/0a59724cade845fc38695a8f953746b2/href

Where TOKENIZER is defined as a constant outside the function so it's not re-constructed each time this function is called. If we continue following preprocess_data, we can see that it simply skips single_instances where encode_data returns None. Finally, the encoded input is returned to the mapper in the correct Features format.
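As an illustration only, an encode_data for the Longformer example might look something like the sketch below; the checkpoint, JSON field names and label-file handling are all assumptions:

```python
import json

from transformers import AutoTokenizer

# Constructed once at module load so it isn't re-created on every call
TOKENIZER = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

def encode_data(single_instance, label2id):
    """Encode one training instance; return None to filter it out.

    `single_instance` maps file names to local paths, e.g.
    {"raw_text.json": "/opt/ml/processing/.../raw_text.json", ...}.
    """
    with open(single_instance["raw_text.json"]) as f:
        text_content = json.load(f)["text"]              # field name assumed

    if not text_content.strip():
        return None                                      # filter empty documents

    encoded = TOKENIZER(
        text_content,
        padding="max_length",
        truncation=True,
        max_length=4096,
    )
    # ground_truth_label.json is the file written by process_func earlier
    with open(single_instance["ground_truth_label.json"]) as f:
        encoded["label"] = label2id[json.load(f)["label"]]
    return encoded
```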
I’m going to skip looking at get_train_valid_test_split(), but suffice it to say that it uses Datasets internal function dataset.train_test_split() to split data using percentages and writes a metadata file that shows the counts of the split and associated labels to the experimenter.
And with that, Data Preparation is complete. Recall from the beginning that this will run as a ScriptProcessor job on a Sagemaker Processor instance. These instances tend to have lots of vCPUs and can really take advantage of the parallel processing we're doing in the mapper. The encoded data will end up on S3 ready to be downloaded by a Training or Tuning job, which is discussed in the third post in this series. You can jump to the first and third post via these links: Part One: Setup, Part Three: Training and Inference.
Pierce Lamb: Creating a Machine Learning Pipeline on AWS Sagemaker Part One: Intro & Set Up
Or rather, creating a reusable ML Pipeline initiated by a single config file and five user-defined functions that performs classification, is finetuning-based, is distributed-first, runs on AWS Sagemaker, uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.
This post originally appeared on VISO Trust’s Blog
This is the introductory post in a three part series. To jump to the other posts, check out Creating a ML Pipeline Part 2: The Data Steps or Creating a ML Pipeline Part 3: Training and Inference
Introduction
On the Data & Machine Learning team at VISO Trust, one of our core goals is to provide Document Intelligence to our auditor team. Every document that passes through the system is subject to collection, parsing, reformatting, analysis, reporting and more. Part of that intelligence is automatically determining what type of document has been uploaded into the system. Knowing what type of document has entered the system allows us to perform specialized analysis on that document.
The task of labeling or classifying a thing is a traditional use of machine learning; however, classifying an entire document — which, for us, can be up to 300+ pages — is on the bleeding edge of machine learning research. At the time of this writing, researchers are racing to use the advances in Deep Learning and specifically in Transformers to classify documents. In fact, at the outset of this task, I performed some research on the space with keywords like “Document Classification/Intelligence/Representation” and came across nearly 30 different papers that use Deep Learning and were published between 2020 and 2022. Those familiar with the space will recognize names like LayoutLM/v2/v3, TiLT/LiLT, SelfDoc, StructuralLM, Longformer/Reformer/Performer/Linformer, UDOP and many more.
This result convinced me that trying a multitude of these models would be a better use of our time than trying to decide which was the best among them. As such, I decided to pick one and use the experience of fine-tuning it as a proof-of-concept to build a reusable ML pipeline the rest of my team could use. The goal was to reduce the time to perform an experiment from weeks to a day or two. This would allow us to experiment with many of the models quickly to decide which are the best for our use case.
The result of this work was an interface where an experimenter writes a single config file and five user defined functions that kick off data reconciliation, data preparation, training or tuning and inference testing automatically.
When I set out on that proof-of-concept (pre-ML Pipeline), it took over a month to collect and clean the data, prepare the model, perform inference and get everything working on Sagemaker using distribution. Since building the ML Pipeline, we’ve used it repeatedly to quickly experiment with new models, retrain existing models on new data, and compare the performance of multiple models. The time required to perform a new experiment is about half a day to a day on average. This has enabled us to iterate incredibly fast, getting models in production in our Document Intelligence platform quickly.
What follows is a description of the above Pipeline; I hope that it will save you from some of the multi-day pitfalls I encountered building it.
ML Experiment Setup
An important architectural decision we made at the beginning was to keep experiments isolated and easily reproducible. Every time an experiment is performed, it has its own set of raw data, encoded data, docker files, model files, inference test results etc. This makes it easy to trace a given experiment across repos/S3/metrics tools and back to its origin once it is in production. However, one trade-off worth noting is that training data is copied separately for every experiment; for some orgs this simply may be infeasible and a more centralized solution is necessary. With that said, what follows is the process of creating an experiment.
An experiment is created in an experiments repo and tied to a ticket (e.g. JIRA) like EXP-3333-longformer. This name will follow the experiment across services; for us, all storage occurs on S3, so in the experiment's bucket, objects will be saved under the EXP-3333-longformer parent directory. Furthermore, in wandb (our tracker), the top level group name will be EXP-3333-longformer.
Next, example stubbed files are copied in and modified to the particulars of the experiment. This includes the config file and user defined function stubs mentioned above. Also included are two docker files; one dockerfile represents the dependencies required to run the pipeline, the other represents the dependencies required to run 4 different stages on AWS Sagemaker: data preparation, training or tuning and inference. Both of these docker files are made simple by extending from base docker files maintained in the ML pipeline library; the intent is that they only need to include extra libraries required by the experiment. This follows the convention established by AWS’s Deep Learning Containers (DLCs) and, in fact, our base sagemaker container starts by extending one of these DLCs.
There is an important trade off here: we use one monolithic container to run three different steps on Sagemaker. We preferred a simpler setup for experimenters (one dockerfile) versus having to create a different container per Sagemaker step. The downside is that for a given step, the container will likely contain some unnecessary dependencies which make it larger. Let’s look at an example to solidify this.
In our base Sagemaker container, we extend:
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
This gives us PyTorch 1.10.2 with CUDA 11.3 bindings, transformers 4.17, Python 3.8 and Ubuntu, all ready to run on the GPU. You can see available DLCs here. We then add sagemaker-training, accelerate, evaluate, datasets and wandb. Now when an experimenter goes to extend this image, they only need to worry about any extra dependencies their model might need. For example, a model might depend on detectron2, which is an unlikely dependency among other experiments. So the experimenter would only need to think about extending the base sagemaker container and installing detectron2, and be done worrying about dependencies.
With the base docker containers in place, the files needed for the start of an experiment would look like:
https://medium.com/media/de90d5b8d6601d3975ea80c332e95e7f/href

In brief, these files are:
- settings.ini: A single (gitignored) configuration file that takes all settings for every step of the ML pipeline (copied into the dockerfiles)
- sagemaker.Dockerfile: Extends the base training container discussed above and adds any extra model dependencies. In many cases the base container itself will suffice.
- run.Dockerfile: Extends the base run container discussed above and adds any extra run dependencies the experimenter needs. In many cases the base container itself will suffice.
- run.sh: A shell script that builds and runs run.Dockerfile.
- build_and_push.sh: A shell script that builds and pushes sagemaker.Dockerfile to ECR.
- user_defined_funcs.py: Contains the five user-defined functions that will be called by the ML pipeline at various stages (copied into the dockerfiles). We will discuss these in detail later.
These files represent the necessary and sufficient requirements for an experimenter to run an experiment on the ML pipeline. As we discuss the ML pipeline, we will examine these files in more detail. Before that discussion, however, let’s look at the interface on S3 and wandb. Assume that we’ve set up and run the experiment as shown above. The resulting directories on S3 will look like:
https://medium.com/media/823d0264d1199be7b6d3703cb0325616/href

The run_number will increment with each subsequent run of the experiment. This run number will be replicated in wandb and also prefixed to any deployed endpoint for production so the exact run of the experiment can be traced through training, metrics collection and production. Finally, let's look at the resulting wandb structure:

https://medium.com/media/b6c2f56b011001028fd1e427080db31a/href

I hope that getting a feel for the interface of the experimenter will make it easier to understand the pipeline itself.
The ML pipeline
The ML pipeline will (eventually) expose some generics that specific use cases can extend to modify the pipeline for their purposes. Since it was recently developed in the context of one use case, we will discuss it in that context; however, below I will show what it might look like with multiple:
https://medium.com/media/06e61e98a0e3c0d02df5e515fcbb9c38/href

Let's focus in on ml_pipeline:

https://medium.com/media/6a8f1af5aeb5c98051240440f5e42a92/href

The environment folder will house the files for building the base containers we spoke of earlier, one for running the framework and one for any code that executes on Sagemaker (preprocessing, training/tuning, inference). These are named using the same conventions as AWS DLCs so it is simple to create multiple versions of them with different dependencies. We will ignore the test folder for the remainder of this blog.
The lib directory houses our implementation of the ML pipeline. Let’s zoom in again on just that directory.
https://medium.com/media/78a4e37d0f6ce79cb18d2eea8de325c0/href

Let's start with run_framework.py since that will give us an eagle-eye view of what is going on. The skeleton of run_framework will look like this:

https://medium.com/media/38ea2a2ea16b2a7fd6a0d5fd4405b292/href

The settings.ini file a user defines for an experiment will be copied into the same dir (BASE_PACKAGE_PATH) inside each docker container and parsed into an object called MLPipelineConfig(). In our case, we chose to use Python Decouple to handle config management. In this config file, the initial settings are RUN_RECONCILIATION/PREPARATION/TRAINING/TUNING/INFERENCE, so the pipeline is flexible to exactly what an experimenter is looking for. These values constitute the conditionals above.
Note the importlib line. This line allows us to import use-case specific functions and pass them into the steps (shown here is just data reconciliation) using an experimenter-set config value for use case.
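The pattern is roughly the following sketch; the module path convention is an assumption, but the five function names are the ones used throughout this series:

```python
import importlib

def load_user_defined_funcs(use_case: str):
    """Dynamically import the experiment's user-defined functions
    based on the use case named in settings.ini."""
    module = importlib.import_module(f"use_cases.{use_case}.user_defined_funcs")
    return {
        name: getattr(module, name)
        for name in (
            "process_func",
            "load_model",
            "get_dataset_features",
            "encode_data",
            "ordered_input_keys",
        )
    }
```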
The moment the config file is parsed, we want to run validation to identify misconfigurations now instead of in the middle of training. Without getting into too much detail on the validation step, here is what the function might look like:
https://medium.com/media/281e4b8f338f30922d8311afaddebca9/href

The _validate_funcs function ensures that functions with those definitions exist and that they are not defined as pass (i.e. a user has created them and defined them). The user_defined_funcs.py file above simply defines them as pass, so a user must override these to execute a valid run. _validate_run_num throws an exception if the settings.ini-defined RUN_NUM already exists on S3. This saves us from common pitfalls that could occur an hour into a training run.
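A crude sketch of the first check, assuming stub detection is done by inspecting the function source (the real implementation may differ):

```python
import inspect

def _validate_funcs(user_funcs):
    """Fail fast if any user-defined function is missing or still a `pass` stub."""
    for func in user_funcs:
        if func is None:
            raise ValueError("A required user-defined function is not defined")
        source = inspect.getsource(func).strip()
        # a stub's body is nothing but its signature and `pass`
        if source.endswith("pass"):
            raise ValueError(
                f"{func.__name__} is still the stub from user_defined_funcs.py; "
                "please implement it before running the pipeline"
            )
```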
We’ve gotten to the point now where we can look at each pipeline step in detail. You can jump to the second and third post via these links: Part Two: The Data Steps, Part Three: Training and Inference.
Nonprofit Drupal posts: April Drupal for Nonprofits Chat
Join us TOMORROW, Thursday, April 20 at 1pm ET / 10am PT, for our regularly scheduled call to chat about all things Drupal and nonprofits. (Convert to your local time zone.)
No pre-defined topics on the agenda this month, so join us for an informal chat about anything at the intersection of Drupal and nonprofits. Got something specific on your mind? Feel free to share ahead of time in our collaborative Google doc: https://nten.org/drupal/notes!
All nonprofit Drupal devs and users, regardless of experience level, are always welcome on this call.
This free call is sponsored by NTEN.org and open to everyone.
- Join the call: https://us02web.zoom.us/j/81817469653
- Meeting ID: 818 1746 9653
- Passcode: 551681
- One tap mobile:
  +16699006833,,81817469653# US (San Jose)
  +13462487799,,81817469653# US (Houston)
- Dial by your location:
  +1 669 900 6833 US (San Jose)
  +1 346 248 7799 US (Houston)
  +1 253 215 8782 US (Tacoma)
  +1 929 205 6099 US (New York)
  +1 301 715 8592 US (Washington DC)
  +1 312 626 6799 US (Chicago)
- Find your local number: https://us02web.zoom.us/u/kpV1o65N
- Follow along on Google Docs: https://nten.org/drupal/notes
Security advisories: Drupal core - Moderately critical - Access bypass - SA-CORE-2023-005
The file download facility doesn't sufficiently sanitize file paths in certain situations. This may result in users gaining access to private files that they should not have access to.
Some sites may require configuration changes following this security release. Review the release notes for your Drupal version if you have issues accessing private files after updating.
This advisory is covered by Drupal Steward.
We would normally not apply for a release of this severity. However, in this case we have chosen to apply Drupal Steward security coverage to test our processes.
Drupal 7:
- All Drupal 7 sites on Windows web servers are vulnerable.
- Drupal 7 sites on Linux web servers are vulnerable with certain file directory structures, or if a vulnerable contributed or custom file access module is installed.
Drupal 9 and 10:
- Sites are only vulnerable if certain contributed or custom file access modules are installed.
Solution: Install the latest version:
- If you are using Drupal 10.0, update to Drupal 10.0.8.
- If you are using Drupal 9.5, update to Drupal 9.5.8.
- If you are using Drupal 9.4, update to Drupal 9.4.14.
- If you are using Drupal 7, update to Drupal 7.96.
All versions of Drupal 9 prior to 9.4.x are end-of-life and do not receive security coverage. Note that Drupal 8 has reached its end of life.
Reported by:
- Heine of the Drupal Security Team
- Conrad Lara
- Guy Elsmore-Paddock
Fixed by:
- Michael Hess of the Drupal Security Team
- Heine of the Drupal Security Team
- Lee Rowlands of the Drupal Security Team
- David Rothstein of the Drupal Security Team
- xjm of the Drupal Security Team
- Wim Leers
- Damien McKenna of the Drupal Security Team
- Alex Bronstein of the Drupal Security Team
- Conrad Lara
- Peter Wolanin of the Drupal Security Team
- Drew Webber of the Drupal Security Team
- Benji Fisher of the Drupal Security Team
- Juraj Nemec, provisional member of the Drupal Security Team
- Jen Lampton, provisional member of the Drupal Security Team
- Dave Long of the Drupal Security Team
- Kim Pepper
- Alex Pott of the Drupal Security Team
- Neil Drumm of the Drupal Security Team
LN Webworks: 7 ways to enhance your ecommerce Website and online sales with Drupal
Peoples Blog: Fix Colima connection refused error: failed to get Info from .lima/colima/ha.sock on Mac
PreviousNext: Why a culture of open-source contribution is good for your business
Contributing makes good business sense, especially when open-source technology, such as Drupal, is at the core of everything you do (pun intended!).
by Owen Lansbury / 19 April 2023
Based on a talk given at EverythingOpen 2023. A video of that presentation is also available at the end of this article.
Why do we contribute to the Drupal community?
Adopting a formalised approach to contribution helps our business stay sustainable in the long term. It also has the added benefit of helping everyone else in the open-source community.
Reputation
Over the years at PreviousNext, we’ve honed a deep expertise in Drupal. That’s because we’ve doubled down and avoided diluting our technical offering. We’re all in for Drupal.
This level of knowledge sees us regularly referred to clients looking for hard hitters in the Drupal space. Our expertise is particularly appealing, as it happens, for our Higher Education and Government clients. Being Australia’s only Platinum Certified Drupal Partner can only help in this regard.
Our Drupal Association profile records all our contributions as ‘credits’. These determine our ranking as a certified partner, demonstrating our commitment to Drupal as a technology and a community.
We focus on raising our Drupal profile using means other than traditional marketing methods. Our team attends events, volunteers at DrupalSouth, presents at conferences, sponsors the DrupalSouth CodeSprint, and takes on community leadership roles.
This level of involvement cements our position as a leading Drupal provider. It also gives all members of our team (including those who are non-technical) additional opportunities to be part of the community and raise their profiles.
Professional development
I like to refer to Drupal as a ‘do-ocracy’. Everyone is welcome, and all help is welcome. Open-source and open-handed. It’s the same sense of community that we value at PreviousNext.
When someone first joins our business, we often use open-source contributions as the primary method of onboarding them. This induction method encourages them to develop best practices in their coding and use their involvement in the Drupal project as part of their ongoing professional development.
An offshoot of this is the chance to build relationships and be mentored by people external to our organisation. It’s a unique opportunity to broaden our collective perspectives and work alongside (and become!) some of the brightest minds in open-source tech.
A happier team
Avoiding team member burnout or a lacklustre approach to work is vital for us as a smaller organisation. Instead, we help staff to scratch those different ‘itches’.
Working on contrib helps to maintain interest and passion by giving staff time to work on projects that aren’t run-of-the-mill client engagements. It also exposes our team to larger initiatives than they might otherwise work on.
Staff retention
A happier team, in turn, leads to a more stable team over the long term. Our retention rates have steadied at around three times the industry average.
This tendency towards longevity also facilitated our decision to make PreviousNext employee-owned.
How do we contribute? An established framework
Enshrined in our Staff Handbook is the hope that employees at PreviousNext will use 20% of their time for contrib (the remaining 80% is billable client work). If a team member chooses not to contribute, they work closer to fully billable hours.
We don’t expect staff to contribute outside their employed hours–though many do for their own interest.
With a robust time-tracking and self-management culture, this approach works well and leads to a productive, well-run company.
We’ve also baked open-source contributions into our regular ‘Hackdays’. These are days when our developers get together and innovate. This focused work feeds into our client projects and becomes part of our Drupal contributions.
Other methods for ensuring a regular flow of code include directly sponsoring developers, which helps us maintain our partnership status.
We also use project-based sponsorship to contribute patches and new modules to the Drupal ecosystem. The clients for these projects also receive credits for sponsoring this development.
Being a good Drupal citizen
Open-source contribution isn’t just about altruism. It also shouldn’t be viewed as a drain on a business’s income generation. It’s about recognising that our businesses depend on a technological ecosystem that, in turn, relies on as many of us as possible playing our part to advance it.
When it comes to Drupal, the result of these contributions is a platform that commands a 10% share of the top 10,000 most visited websites globally. Clearly, though, there is more to be done to promote Drupal even further. It’s something we can all get behind, because when our chosen open-source platform thrives, so do our businesses.
Watch the video
Promet Source: Navigating Web Accessibility: Expert Perspectives
kevinquillen.com: Three new Drupal module releases for Site Builders
Golems GABB: Cleaning Up Database to Speed Up Development Cycles
If you develop websites, you want the process to be as pleasant as possible, and the time you spend on that work matters too. In short, every developer wants to build websites as quickly as possible while spending as little effort and energy as possible.
In this article, we will look at Drupal performance optimization and how you can clean up the database tables created by Drupal modules, as well as MySQL tables in general. This will help you speed up development cycles. And, among other things, it will help you feel a little happier after a hard day's work! But first of all, you need to understand what a database table is.
Consensus Enterprises: Aegir5: Front-end UI architecture
LN Webworks: 6 Ways to create a Winning Drupal Digital Commerce Strategy
Specbee: Data Security Matters: Marketers' Guide to Securing Your CMS
Hey, are you doing everything you should to protect your customers’ data?
Picture this: you’ve gathered the information from your audience via your latest webinar registration page. And now, your audience keeps receiving emails they never asked for. Would they trust you, your marketing process, and your services anymore?
Absolutely not.
As the world becomes more digital and interconnected, security is becoming more and more important for businesses of all sizes. Whether you're a marketer, technologist, or just a small business owner, you can't afford to overlook the importance of strong security measures.
In fact, a recent survey found that 80% of content marketers and managers believe that security is absolutely crucial to their success. With so much sensitive data and personal information at risk, it's no wonder that businesses across industries are prioritizing security like never before.
To stay ahead of the game, companies are turning to flexible content management systems that meet their customers' needs without compromising on privacy and security. Whether you're managing customer data, creating marketing materials, or just trying to run your business smoothly, a strong security strategy is key to success in today's fast-paced digital landscape.
True, it’s not realistic to anticipate every cybersecurity issue. However, the best strategy is to take proactive measures to protect your organization's and customers’ data. In other words, secure systems should be enacted to prevent potential threats before they arise.
In this blog, you'll learn the importance of data security as seen through a marketer's lens. Not only will we break down why it matters, but also provide tips and tactics for protecting your information with a CMS.
Why do marketers need to consider data security?
Marketers should be aware of data breaches, as protecting customer information is essential for their success. Here's why:
Inbound marketers need to build trust
Building trust with customers is a key ingredient in any successful marketing strategy. Creating strong relationships through exceptional service can transform leads into loyal customers who feel secure in doing business with you. It's not just about the sale; it's about building a rapport that goes beyond transactions.
Happy customers will not only continue to support your business, but they'll also recommend it to others. However, lose their trust and it'll be downhill from there. So, make sure you're always focused on fostering trust with your customers - it's worth it in the long run.
Clients and customers trust marketers for data security
As a marketer, you hold a special responsibility to protect the trust that your customers bestow upon you. They willingly give you their information through registration forms, webinar sign-ups, and newsletter subscriptions, believing that you and your business will keep it safe and sound. It's a sign of faith, and it's important that you honor it by safeguarding their data at all costs. After all, the trust they put in you is like a delicate flower - it only takes one wrong move to crush it irreparably.
Travel the extra mile to keep your customers’ information safe
Protecting and respecting your customers' privacy is crucial, and it's up to you to be their guardian. You can't just go around using their personal information willy-nilly - that's a big no-no. To avoid any potential kerfuffles, make sure you get their consent before reaching out to them. Being responsible with their data will not only earn their trust, but it will also make you a stand-up business. So, protect and serve - your customers will thank you for it.
5 Security Features You Must Look for in a CMS
Now that we’ve laid out the need for data security for marketers, here are five must-have security features you should look for in a CMS before using it.
Encryption
Ensure that your CMS has data encryption capabilities to secure stored data and content assets exchanged between systems.
Firewall
Whether implemented in software or as a hardware device, your CMS’s firewall is configured to deny, permit, or proxy traffic through your network. It lets data move between different trust levels while reducing the impact of security threats on your network.
Authentication
With the authentication feature, a CMS becomes a high-security fortress, allowing you to control who has access to your content. This feature verifies users' credentials by cross-referencing them with details stored in the system and offers access control options like single sign-on, SAML, and OAuth.
Take it up a notch with extra security measures, provided by OpenID and Default authentication protocols. Implementing authentication in your CMS workflows is like hiring your own personal bouncer for your website.
Managed Cloud Hosting
With managed cloud hosting, your organization’s IT department can remain carefree about its infrastructure management. Moreover, with automatic updates and regular security patches, your cloud host ensures that your security stays in place.
CDN Support
Ensure that your CMS relies on a network of geographically distributed servers that provide advanced security and better performance. A CDN caches your data across these servers, so the primary location of your data stays secure and anyone trying to reach your network only ever hits the nearest server.
How can you secure your CMS?
When it comes to security concerns, it is highly recommended to put hack-tight security measures in place for your CMS. CMS-powered open-source websites are the most popular ones today, with WordPress, Drupal, and Joomla covering 75% of the CMS-powered website market.
Although popular, open-source websites can be vulnerable to cyber threats easily. So, here are a few ways to avoid such vulnerabilities and secure your CMS from hackers.
- Implement strong passwords for the admin backend: Almost anybody can reach your open-source website's backend using /administrator. Make sure to secure your website with strong passwords and change them regularly.
- Install a firewall: Firewalls help you secure your website, track threatening activity, and log the IP addresses of attackers.
- Back up your systems: The importance of data backups is underrated. Update your backups regularly to avoid unforeseen loss of important data.
- Build a strong security framework: A strong security framework allows you to comply with industry best practices, rules and regulations, and certifications such as Safe Harbor, SOX 404, etc. Security is much more than patching servers. It's also about how you control the changes made to your systems, such as personnel training, changes to the hiring process, and media protection.
- Update your CMS: Keep your CMS updated with the latest features, bug fixes, and enhancements so that it stays stable and secure. Drupal users should upgrade to the latest Drupal version (9 or 10) for the latest features, security updates, and better integrations.
- Analyze your systems regularly: To keep your business secure, ensure that no risky programs are installed. Use robust systems for software management, versioning, application and network scanning, and security architecture review. Implement Intrusion Detection System sensors to stay informed about any unauthorized access attempts.
Be an Empathetic Marketer
Building a bond of trust with your audience is the backbone of marketing triumph. By providing top-notch services and earning a well-respected reputation, you can cultivate an authentic connection with potential customers.
Encourage them to express their thoughts freely so they can confidently decide whether or not to do business with you. Trust isn't built in a day, but by consistently demonstrating your expertise and professionalism, you will inevitably win their hearts and their wallets.
Drupal is a CMS that can help you achieve your marketing goals in line with your business goals. Drupal's security team keeps a close watch over potential security threats and keeps your organization's data safe from exploitable vulnerabilities, releasing security patches and fixes regularly.
Offering transparency and assuring customers that their data is safe with you builds trust; a secure CMS combined with effective marketing strategies will help you get there.
Author: Priyanka Phukan
Meet Priyanka, a Junior Content Writer and Marketer at Specbee. Priyanka’s a Grammar-Freak with a knack for creating impactful content with ‘words’ being her weapon of choice. A foodie who likes all things chicken. When not writing, she likes to play the Uke and sing. On blue days, you’ll find her binge-watching Asian dramas.
The Drop Times: Seek, and Ye Shall Find
I am a big fan of Zen stories. Those are little labyrinths of concealed wisdom. Once you enter the maze, you should find the way out. There is a way, and there is no way. You are trapped and free as well. It all depends on how you perceive things.
Let me quote a Zen story. Kindly excuse me if you have already heard this one. A young man visited a Zen master who lived in the middle of a chaotic city. The man wanted to know whether he would be happy if he relocated there as already, the master lived there.
The master asked: "What is your current place like?"
"It is terrible. Everyone is mean. I hate being there, and I need change,"
answered the seeker.
"It will be terrible here. Everyone would be mean. You would hate being here, and you won't experience any change,"
said the master.
A few hours went by. Another young man came to the master with the same question, and the master raised the same query.
"It is good. Wonderful. I have nothing to complain about. Everybody is so friendly and supportive. I just need a change,"
the man answered.
"It will be good here. Wonderful. You will have nothing to complain about. Everybody would be friendly and supportive. But you won't experience any change,"
the master cautioned.
You will find what you seek. But if you desire something that you already have, you won't even realize you have reaped it anew. If you are privileged, you won't appreciate how far the privilege has taken you. It might blind you from the paths tread by the underprivileged too. To develop a balanced view is the Zen.
I am very much elated as the community is with the news that Drupal is now a Digital Public Good. But in hindsight, wasn't Drupal a digital public good from its initiation? Wasn't Drupal inherently a public good? What is there in some agency specifically designating Drupal as a digital public good?
The answers might be in the specifics. Yes, governments were using Drupal to serve their people better. Yes, Drupal was helping achieve the 17 SDGs put forward by the United Nations. Yes, Drupal is non-excludable and non-rivalrous.
But with all that, Drupal grows because of commercial interest, and it serves the hunger for profits of the businesses and agencies that use it. That is no crime; the same is true of every other public good, like rail, roads, waterways, playgrounds, the public education system, and so on. None of those would have come this far without enough commercial interest.
Let me quote from a press release cum article put out by Zyxware Technologies.
"This recognition is a significant milestone for Drupal and the Drupal Community. Being listed as a DPG will make it easier for companies to justify using Drupal and potentially direct funding for specific feature sets. It is a testament to the community's hard work and dedication to creating software that adheres to the highest standards and helps achieve the UN's SDGs."
Let me move toward other news and events.
John Doyle, founder and president of Digital Polygon, speaks to Alethia Braganza about Owning It, Giving Back, and Building It Better. In this interview, he says that he genuinely believes that Drupal has become this successful because of the community members that drive it forward and advocate for its use.
Our second blog post on the 'build in public' initiative discusses How We Built a Newsletter System on Drupal with Mailchimp Integration. Emily Mathew, the lead developer of TDT, writes about the workflow.
Drupal Decoupled Days, scheduled for August 16-17 in Albuquerque, New Mexico, has released a call for papers; the last date to submit a session proposal is May 05, 2023. DrupalJam 2023, set to take place on June 01 at DeFabrique in Utrecht, Netherlands, has announced its sponsor packages. DrupalSouth released the schedule and agenda for the 2023 conference, set for May 17 to 19 in Wellington, New Zealand. MidCamp, scheduled for April 26 to 29 in Chicago, Illinois, is still seeking more sponsors, according to their newsletter. MidCamp attendees can also attend an evening of baseball at Wrigley Field, and regular tickets to the camp are available until April 21. Drupal Developer Days, happening in Vienna, Austria, from July 19 to 22, sold out of early bird tickets quickly, but regular tickets are still available. TDT is a media partner for DrupalCamp Finland, scheduled for April 28.
Drupal Association has announced the onboarding of Fran Garcia-Linares. Stichting Drupal Nederland, or the Dutch Drupal Foundation, has onboarded three new members to its director board.
Those joining this year's DrupalCon Pittsburgh can avail of accommodation at a discounted rate if they choose from hotels within the official hotel block. Also, check out the 7 best places to visit while you are in Pittsburgh for DrupalCon North America. Meanwhile, the deadline to submit sessions for DrupalCon Lille is April 24.
NERD Summit has published session videos of the 2023 camp on its official YouTube channel.
Two international PHP conferences are coming. The first is from May 22 to 26 in Berlin, and the second is from October 23 to 27 in Munich. You may submit session proposals for the October event.
Third and Grove has published a blog post explaining the new features and improvements coming to Drupal 10.1 and beyond. Franz Glauber Vanderlinde has written a blog post in Evolving Web's portal explaining Claro, Drupal's default admin user interface. The University of Nebraska-Lincoln has announced an upgrade of its web platform to Drupal 10 in 18 months. Tomato Elephant Studio (TES) will conduct a Drupal training session on April 24, 2023. The 3-hour event will focus on Building Data Reports in Drupal. Srijan, recently acquired by Material+, has published a blog post about the key factors to improve SEO and performance for a Drupal Website. TDT published a listicle on why Drupal is the best for travel agency websites. Neeraj Kumar, founder and CEO of Valuebound published a blog post on opensource.com about Drupal Modules to Improve Accessibility.
Acquia has opened a new partnership with KPMG to advance DXP roll-out across industries. Acquia certifications in Japan have crossed the 200 mark. Acquia employees can apply to the headless developer advisory board.
There was a critical security alert on the protected pages module.
We have selected more stories from the past week than we usually pack into an issue. Some of the stories linked in the written part may not be repeated in the list below. Before winding up this newsletter issue, let me apologize for postponing publication by almost a day; we will rectify this procedural lapse, which was down to human error. We will return next Monday with a new issue of Editor's Pick.
Sincerely,
Sebin A. Jacob
Editor-in-chief, TheDropTimes
Opensource.com: Use autoloading and namespaces in PHP
PHP autoloading and namespaces provide handy conveniences with huge benefits.
In the PHP language, autoloading is a way to automatically include class files of a project in your code. Say you had a complex object-oriented PHP project with more than a…
Talking Drupal: Talking Drupal #395 - Accessibility from Sales to Delivery
Today we are talking about Accessibility from the sales process to delivery with Kat Shaw.
For show notes visit: www.talkingDrupal.com/395
Topics
- Where does Accessibility (A11y) begin
- What are the A11y levels
- Who should be thinking about A11y
- How do you research a solution for A11y
- What tools do you use
- What are the biggest struggles with selling A11y
- A11y and regulations
- Selling A11y only projects
- Ensuring delivery
- Ensuring support after launch
- Future of A11y 2.2 and 3.0
- GAAD
- Tools
- Dominoes
- Screen readers
- How cool accessibility tools can make your life easier
Nic Laflin - www.nLighteneddevelopment.com @nicxvan
John Picozzi - www.epam.com @johnpicozzi
Kat Shaw - drupal.org/u/katannshaw @katannshaw
MOTW Correspondent
Martin Anderson-Clutz - @mandclu
Gesso: a Sass-based, Webpack-based, and Storybook-integrated accessible starter theme.
LakeDrops Drupal Consulting, Development and Hosting: Now is the right time to update Drupal 7 to 10 thanks to ECA
There are plenty of reasons why updating Drupal sites from version 7 to the modern platform of Drupal 9 or 10 can be a challenge. Limited resources come to mind, but more often it is missing features, due to modules that do not yet work, that block an otherwise sensible update plan. The good news explored in this blog post: thanks to the ECA module, those missing features may no longer be holding you back, and that's why now is the best time to get started with the Drupal 7 update.
Sooper Drupal Themes: DXPR Builder 2.2.3: Undo/Redo, Keyboard Shortcuts, Choose Templates for User Roles, and More
We're excited to announce the latest release of DXPR Builder, version 2.2.3! In this update, we've added several new features and enhancements that will improve your experience as an editor while building web content with DXPR Builder. Let's take a closer look at what's new in DXPR Builder 2.2.3.
Keyboard Shortcuts for Undo and Redo
We've introduced two new keyboard shortcuts to make it easier for you to undo and redo actions. You can now use Ctrl/Cmd+Z to undo an action and Shift+Ctrl/Cmd+Z to redo an action. This will help you save time and work more efficiently.
Upgraded History State Management
We've upgraded our history state management system to allow you to undo and redo actions even after saving your work. This means that you'll have access to all previous states, even if you've already saved your work. This is a great feature that will help you work with greater confidence.
Expanded User-Profile Feature
We've expanded our user-profile feature to include page templates and global user template selection. This means that you can now save user templates globally and apply them to specific user roles, as well as select specific page templates for user roles. This is a powerful feature that will help you streamline the content creation process when you have users in different roles editing different sections of the website.
Enabled Cloning Functionality for Tabs and Collapsible Panels
We've enabled cloning functionality for tabs and collapsible panels (toggles). This means that you'll be able to clone these elements without any issues. This is a small improvement that will save you time and effort.
Enhanced Local Video Element
We've enhanced our local video element with improved file validation and clearer upload error messages. This means that you'll have a better experience when uploading videos to your site. You'll be able to quickly identify and fix any issues that arise during the upload process.
That's it for this release of DXPR Builder! We hope you find these updates helpful and that they make your experience with our web content editing platform even better. As always, we welcome your feedback and suggestions for future updates. Thanks for using DXPR Builder!