Subscribe to Planet Drupal feed
Drupal.org - aggregated feeds in category Planet Drupal

Opensource.com: What you need to know about the Drupal 9 to 10 migration

Thu, 2023/04/20 - 9:00am

Check out these tips for a hassle-free upgrade experience.

Drupal 10 was released in December 2022. If you're a current Drupal 9 user, you may be strategizing your website's Drupal 9 to 10 migration. Luckily, the Drupal 9 to 10…


Evolving Web: Hands-On With Drupal 10: Symfony 6.2, the New Tech Stack

Thu, 2023/04/20 - 8:33am

Symfony is an open-source framework that helps developers build complex PHP web applications. 

Many of Symfony’s reusable components are included in the Drupal core library. They’re integrated into thousands of projects and have been downloaded billions of times.

It’s not hard to see why—Symfony provides access to clean, stable code that saves developers from having to reinvent the wheel. It promotes decoupled code and invites standardization of best practices. Using Symfony in combination with Drupal helps developers to create more maintainable solutions with superior performance.

So it’s no surprise that Drupal has chosen to integrate Symfony even more tightly. Drupal 10 relies on Symfony 6.2 as its underlying technology stack. 

The latest version of Symfony—released in November 2022—brings exciting new features for Drupal 10 developers. We shared our top picks below to help you leverage the benefits of Symfony 6.2. 

Our Pick of Best New Features in Symfony 6.2

1. PSR-4 Route Loader

This is a fantastic addition that provides a faster, more efficient way of finding route attributes defined in PHP classes. As Symfony is used in Drupal 10 and many other CMSs, a lot of projects should get a nice speed boost.

The new PSR-4 route loader replaces an outdated process, whereby AnnotationDirectoryLoader found PHP files recursively and AnnotationFileLoader inspected their contents. This process became unnecessary for modern PHP projects as they all use PSR-4 class autoloading. 

You can implement the new route loader by simply defining the PSR-4 namespace used by your controller classes.

YAML configuration to define the PSR-4 namespace. The namespace option is also supported in XML and PHP configs. Source: Symfony. 

2. More Built-In Attributes

Cache, security, template and Doctrine attributes are now part of Symfony 6.2 core. This is great because it means you no longer have to install SensioFrameworkExtraBundle to use them. In most applications, you’ll simply need to update the imported namespace without having to change anything in your code.

3. New UID Features

Symfony 6.2 has support for UUID Version 7 and UUID Version 8 formats—an addition that should please developers who’ve been concerned about UUID Version 4’s collision chances. 

And there are some fresh features in the UID component including:

  • MaxUUID and MaxULID – two new classes that represent the highest possible value of both UUID and ULID.
  • Time-based UID Interface – makes it easier to get date/time values from UIDs.
  • UID Conversion to Hexadecimal Values – the new toHex() method returns the binary value as a hexadecimal string.

Returned hexadecimal strings can be used in other parts of your application, such as querying UIDs in binary format in the database. Source: Symfony. 

4. Better Debugging Commands

Symfony 6.2 features improved commands for debugging issues while developing your applications. We particularly appreciate the new --resolve-env option that’s been added to the debug:config command. This new feature hides the secret by default when you're trying to debug variables in the console. It may be useful when recording a training session with live data that you don't want to expose, for example. 

5. Redesign of Profiler

Symfony’s Profiler is a powerful development tool that gives you detailed debugging information about the execution of any request. Symfony 6.2 features a redesigned Profiler with a modern look and feel.

Profiler has a fresh coat of paint in Symfony 6.2. Source: Symfony. 

More Improvements in Symfony 6.2

That’s not all! We only touched on a handful of the updates and new features in Symfony 6.2. There are many more to get excited about, including:

  • Security improvements such as an easier process for logging in users programmatically and customization of the impersonating target URL.
  • Better emoji support that lets you slugify and transliterate emojis and their description into any language.
  • New clock component to improve the testability of time-sensitive code.
  • Finder component improvements to make it easier to sort by file extension, size and case-insensitive name.
  • New AST-based translation extractor to find translatable contents in PHP files.
  • File constraint improvements that validate both file extensions and media types (MIME types) in a much simpler way. 
  • PHP Enum support in service parameters, YAML files, and environment variable processors.
  • DX improvements such as a simpler way to get the current route in templates and hide sensitive information.
  • Console improvements such as improved color support and autocompletion for Zsh shells.
More From Our ‘Hands-On With Drupal 10’ Blog Series

Planning to migrate to Drupal 10?

Get insights into a major, real-life Drupal migration that we executed for the University of Waterloo. Our free webinar offers practical tips and best practices to help you plan, resource, and execute your migration project.


MidCamp - Midwest Drupal Camp: 1. Week. ‘till MidCamp!

Thu, 2023/04/20 - 2:57am
1. Week. ‘till MidCamp!

We had our “first summer” last week in Chicago, but spring is back and the weather for next week is looking… particularly Chicago-y. Pack accordingly.

Tonight (Wednesday) is our MidCamp Preview Meetup (don’t forget to sign up)! We’ll have introductions, a Contribution Day overview with AmyJune Hineline, a Session Overview, and other MidCamp fun.

Wednesday, April 26

We’ll kick things off with our opening remarks and then dive right into sessions. After a full day we’ll adjourn a few blocks north for some sports.

  • 8:30 AM: Registration begins on the 3rd floor of the DePaul Student Center. Coffee & tea will be available.
  • 9 AM - noon: Opening remarks and sessions
  • noon - 1:15 PM: Lunch in the 2nd floor cafeteria
  • 1:15 - 3:30 PM: More sessions and BoFs - don’t forget to submit your BoF ideas!
  • 3:45 - 4:45 PM: ⚡⚡ Lightning Talks! ⚡⚡
  • 6 PM - ?: Wednesday Social: Cubs Game 🧤
Thursday, April 27

After we cheer the Cubs to victory, we’ll be back for another round of sessions, BoFs, and more. The contribution room will be open all day.

  • 8:30 AM: Registration & beverages on the 3rd floor
  • 9:15 - 11:30 AM: Sessions
  • 11:30 AM - 12:45 PM: Lunch in the 2nd floor cafeteria
  • 12:45 - 3 PM: Sessions and BoFs - don’t forget to submit your BoF ideas!
  • 3 PM: Wind down and prep for…
  • 5 PM - ?: ♟️ Bring your board games, your decks, your DM kit for our Thursday Social: Game Night & Tacos! 🌮
Friday, April 28

We’ll have a full day of Drupal Contribution starting at 10AM. Coffee and tea will be provided in the morning and we’ll have lunch in the cafeteria. 

We’ll start with a First Time Contributor Workshop, and initiative leads will be present for:

To review

There’s a lot going on, and it’s our first time back in person since 2019. We’re excited, and we’re sure you have questions. Join the MidCamp Slack and feel free to ask anything you need in #general.


Zyxware Technologies: 4 Must-Have Drupal Modules for Public Sector Websites

Thu, 2023/04/20 - 1:30am
This article discusses 4 Drupal modules useful for implementing standard features required by public sector websites.

Cocomore: Innovative e-learning goes Drupal

Wed, 2023/04/19 - 10:14pm
Cocomore's first project on Opigno is soon going live.

Cocomore: Drupal & Magento: commercial & technical power couple for your online business

Wed, 2023/04/19 - 10:14pm
Read here how your online sales can benefit from the post-pandemic social media boost and how Drupal and Magento contribute to a frictionless sales journey from catchy Insta content to hitting the buy button on your online shopping site.

Cocomore: Drupalcamp Zaragoza 2022: the deeply missed event is back!

Wed, 2023/04/19 - 10:14pm
Until 2020 Drupalcamp was held every year in a different location in Spain, but since the advent of Covid, we haven’t had the chance to meet our friends and colleagues from the Spanish Drupal community. Now, circumstances have finally allowed us to hold large conferences again, and we couldn’t miss the chance.

Cocomore: Empowerment comes from opportunity. Meet Cocomore's new Fair Trade talent training program

Wed, 2023/04/19 - 10:14pm
Cocomore is currently exploring ways to recruit in new areas in the ever expanding software development market. Find out why we are looking for a Drupal trainer to help with our new program.

Cocomore: The true-life experience of a Drupal Content Editor

Wed, 2023/04/19 - 10:14pm
We are only as good as the system is! I took on the mission of raising more awareness of how we feel as content editors at the end of a long line of website development, collecting my experiences, and applying for the online DrupalCon Europe 2021.

Cocomore: Drupal 9: How to successfully upgrade

Wed, 2023/04/19 - 10:14pm
The big day has come: the new version of the CMS Drupal was launched on June 3rd!

Cocomore: How to get ready for Drupal 9

Wed, 2023/04/19 - 10:14pm
As Drupal support for the CMS versions D7 and D8 will run out in November 2021, it is advisable to start planning your D9 upgrade as soon as possible. On t3n, our expert Marc Kutschera explains how to do it.

Cocomore: Composer in Drupal 8.8.0 - First impressions

Wed, 2023/04/19 - 10:14pm
With the release of the new Drupal 8.8.0 version, a number of new changes are waiting for the community. From a development point of view, the most interesting one is probably the full support of Composer to build Drupal projects.

Pierce Lamb: Creating a ML Pipeline on AWS Sagemaker Part Three: Training and Inference

Wed, 2023/04/19 - 8:27pm

This is the third post in a three part series on creating a reusable ML pipeline that is initiated with a single config file and five user-defined functions. The pipeline is finetuning-based for the purposes of classification, runs on distributed GPUs on AWS Sagemaker and uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.

This post originally appeared on VISO Trust’s Blog

This post will cover the training and testing (inference) steps. These are the core steps in a ML pipeline where a model is hyper-parameter tuned and the test set is used to measure performance. If you have landed on this post first, check out the first post in the series detailing the pipeline setup and the second post detailing the data steps.

Training and Tuning

The reason I have combined Training and Tuning into one section is that Tuning is just a set of training jobs where performance is incrementally improved through the changing of hyperparameters. As such, underneath the covers, the two types of jobs are calling the same code. As we have previously, let’s first take a look at perform_training() and perform_tuning() to see how the code interacts with Sagemaker.

Zooming into perform_training(), we encounter the first bit of backend code that handles a use case we have not yet discussed: comparing two models. If you recall in part one, one of the motivations for creating this pipeline was to rapidly test multiple Document Understanding models and compare performance between them. As such, the pipeline is built to handle, in a single experiment, multiple models being passed in the settings.ini file the experimenter defines. In fact, the MODEL_NAMES parameter from this file can accept one or many model names, the latter implying that the experimenter wants to run a comparison job. A comparison job has no impact on Data Reconciliation or Data Preparation; we want these steps to be isomorphic to a single model job as the idea is that n models get trained and tested on the exact same snapshot of training data. With that preamble, perform_training() looks like this:

https://medium.com/media/5ed143495634bb0cb3a152411f3dd4f1/href
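
For illustration, a minimal sketch of such a loop might look like the following; the config field names (config.sagemaker_role, config.train_instance_count, config.train_instance_type) and the per-model hyperparameters dict are placeholders rather than the post's actual code:

from sagemaker.estimator import Estimator

def perform_training(config, hyperparameters):
    # One Sagemaker Training job per model name (one or many, for comparison jobs)
    for model_name in config.model_names:
        estimator = Estimator(
            image_uri=config.docker_image_path,          # the experimenter's ECR training image
            role=config.sagemaker_role,                  # placeholder config field
            instance_count=config.train_instance_count,  # placeholder config field
            instance_type=config.train_instance_type,    # placeholder config field
            hyperparameters=hyperparameters[model_name],
            debugger_hook_config=False,                  # disable Sagemaker Debugger (see below)
            environment={"USE_SMDEBUG": "0"},
        )
        # Every model trains on the same encoded snapshot produced by Data Preparation
        estimator.fit({"training": config.training_data_uri})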

The loop here is iterating over either a list with n model names or a list with a single model name. For each model name, an Estimator() is constructed and .fit() is called which kicks off a training job on Sagemaker. get_estimator_kwargs() will look familiar to anyone who has trained on Sagemaker already:

https://medium.com/media/436633131315849fe7ee203221679f0d/href

Settings are extracted from the config we discussed in the first post in the series, the most important of which is config.docker_image_path. As a refresher, this is the ECR URL of the training image the experimenter created in the setup that is used between Sagemaker Processor/Training/Tuning jobs and contains all needed dependencies. Next, perform_training checks a boolean from the settings.ini file, USE_DISTRIBUTED which defines whether or not the experimenter expects distributed GPU training to occur. If so, it sets some extra Estimator parameters which are largely inspired by the _distribution_configuration function from the sagemaker-sdk.

I will digress for a moment here to talk about one such parameter, namely, an environment variable called USE_SMDEBUG. SMDEBUG refers to a debugging tool called Sagemaker Debugger. For reasons I cannot explain and that have not been answered by AWSlabs, this tool is on by default, and distributed training would not work for some models, producing mysterious exception traces. It only became obvious to me when carefully examining the traces and seeing that it was some code in smdebug that was ultimately throwing. Furthermore, there are a variety of ways to turn off smdebug, for instance passing ‘debugger_hook_config’: False as done above or environment={‘USE_SMDEBUG’:0}. However, these methods only work on Training jobs. Again, for reasons I cannot explain, the only way to turn off SMDEBUG on Tuning jobs is to set the env var inside the docker container being used: ENV USE_SMDEBUG="0"; the other methods explained above somehow never make it to a Tuning job’s constituent Training jobs. An unfortunate side effect of this is that it makes it difficult for an experimenter to configure this environment variable. At any rate, hopefully AWSlabs fixes this and/or makes smdebug exceptions more user-friendly.

The call to .fit() makes the actual call to the AWS API. The config.training_data_uri parameter specifies the S3 URI of the encoded training data from the Data Preparation step; the training instance will download this data to local disk before it executes where it can be easily accessed by multiple GPU processes. How does the job know what code to execute? That is specified in the base docker container which is extended by the experimenter:

https://medium.com/media/3e699b6b220cb149464b463ae71c387d/href

These environment variables are used by the sagemaker-training library to kick off the training script. At this point we would dive into train.py, but since it is also used by a Tuning job, let’s take a look at how we kick off a Tuning job. The beginning of a Tuning job is nearly identical to a Training job:

https://medium.com/media/2ef4d1a2e799563d33f201b80ae8a48e/href

But now, instead of calling .fit(), we need to set up a few more parameters that a Tuning job requires: a set of constant hyperparameters and a set of tunable hyperparameters. Here is an example of what an experimenter might write in the settings.ini file to represent this:

https://medium.com/media/26cafe0f890711c5831b38709973d950/href

Here the constants will not change between tuning jobs, but the tunable parameters will start with guesses and those guesses will get better as jobs complete. The -> and , are syntax I’ve chosen; in this context -> stands for an interval while , stands for categorical options. Having seen this, the next piece of the Tuning job setup should make sense:

https://medium.com/media/5d7cb011649772a20f0d2e99d8e9df22/href
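
As a rough sketch of how that syntax could be parsed into Sagemaker parameter ranges (the function name below is illustrative, not the post's actual helper):

from sagemaker.tuner import CategoricalParameter, ContinuousParameter

def parse_tunable_hyperparameters(tunable):
    # '2e-5->5e-5' becomes a continuous interval, '16,32' becomes categorical options
    ranges = {}
    for name, spec in tunable.items():
        if "->" in spec:
            low, high = spec.split("->")
            ranges[name] = ContinuousParameter(float(low), float(high))
        else:
            ranges[name] = CategoricalParameter(spec.split(","))
    return ranges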

Now we have our dict of tunable parameters we can pass to the HyperparameterTuner object:

https://medium.com/media/ca5b8a696072a1d65658a2a9926904a1/href

This should look somewhat familiar to what we just did for Training with a few extra parameters. So far, the HyperparameterTuner object takes the constructed Estimator() object that will be re-used for each constituent Training job and the tunable hyperparameters we just discussed. A Tuning job needs to measure a metric in order to decide if one set of hyperparameters are better than another. objective_metric_name is the name of that metric. This value is also used in the metric_definitions parameter which explicitly defines how the HyperparameterTuner job can extract the objective metric value from the logs for comparison. To make this more concrete, this is how these values are defined in an example settings.ini file:

https://medium.com/media/9ed31a5a4a0257d910f4eeba248401df/href
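
Putting those pieces together, the tuner construction could look roughly like this sketch; the metric name, regex and config fields are illustrative stand-ins for whatever the experimenter puts in settings.ini:

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=estimator,                          # the same Estimator built above
    objective_metric_name="eval_f1",              # example objective name
    hyperparameter_ranges=tunable_ranges,         # dict built from the settings.ini syntax
    metric_definitions=[{"Name": "eval_f1",
                         "Regex": "eval_f1 = ([0-9\\.]+)"}],  # how to scrape it from the logs
    objective_type="Maximize",
    max_jobs=config.max_jobs,                     # placeholder config fields
    max_parallel_jobs=config.max_parallel_jobs,
)
tuner.fit({"training": config.training_data_uri})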

Finally, the max_jobs parameter defines how many total Training jobs will constitute the Tuning job and max_parallel_jobs defines how many can run in parallel at a given time. Like the Estimator in the Training job, we call fit() to actually kick off the Tuning job and pass it the training_data_uri like we did previously. With this in place, we can now look at train.py and see what executes when a Training or Tuning job is executed.

The goal of train.py is to fine tune a loaded model using a set of distributed GPUs, compute a number of metrics, determine which is the best model, extract that model’s state_dict, convert that model to torchscript, and save these files along with a number of graphs to S3. Huggingface’s Accelerate, Evaluate and Transformers libraries are all used to greatly simplify this process. Before continuing, I have to give a brief shoutout to the Accelerate devs who were extremely responsive while I was building this pipeline.

Note that in a distributed setting, every GPU process is going to execute this same train.py file. While much of the coordination can be handed off to Accelerate, it is helpful to keep that in mind while working inside it. Diving a level deeper, train.py is going to:

  • Read hyperparameters and determine if the running job is a tuning job, training job or comparison job
  • Determine if gradient accumulation will be utilized
  • Construct the `Accelerator()` object which handles distribution
  • Initialize wandb trackers
  • Load split training data and create `Dataloader()`s for training and validation
  • Set up an optimizer with learning rate scheduling
  • Execute a training and validation loop, computing metrics and storing metric histories and determining what the best model was
  • Plot curves for metrics
  • Extract the curves, statistics and best model from the loops
  • Write all of this data to S3

We start by reading the passed hyperparameters and setting a few values that can be used throughout the training process:

https://medium.com/media/54ea1add460e31d7464feefdb86e917b/href

_tuning_objective_metric is a hyperparameter set by Sagemaker that allows us to easily differentiate between Training and Tuning jobs. As we’ve mentioned before, the run_num is an important setting that allows us to organize our results and version our models in production so they easily connect back to training runs. Finally, job_type_str allows us to further organize our runs as training/tuning and comparison jobs.

Next we determine if gradient accumulation is needed. Briefly, gradient accumulation allows us to set batch sizes that are larger than what the GPUs we’re running on can store in memory:

https://medium.com/media/3d17328c124913af0e1b718d2c5c7c19/href

Control now moves to setting up the Accelerator() object which is the tool for managing distributed processing:

https://medium.com/media/d497bafdb2c6ca5353a3ba4f0628b048/href
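
A stripped-down sketch of that setup; the variable names effective_batch_size, per_device_batch_size, num_gpus and config.wandb_project are assumptions:

from accelerate import Accelerator

# Gradient accumulation: reach an effective batch size larger than the GPUs can hold at once
grad_accum_steps = max(1, effective_batch_size // (per_device_batch_size * num_gpus))

accelerator = Accelerator(
    gradient_accumulation_steps=grad_accum_steps,
    log_with="wandb",                              # route accelerator.log() calls to wandb
)
if accelerator.is_main_process:
    accelerator.init_trackers(project_name=config.wandb_project, config=hyperparameters)

accelerator.print(f"Running on {accelerator.num_processes} processes")  # printed only once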

Here we encounter a core concept in Accelerate, is_main_process. This boolean provides a simple way to execute code on only one of the distributed processes. This is helpful if we want to run code as if we’re on a single process, for instance to store a history of metrics as the training loop executes. We use this boolean to set up wandb so we can easily log metrics to it. Additionally, accelerator.print() is equivalent to guarding print(...) with if accelerator.is_main_process; it ensures a statement is printed only once.

Recall that we passed config.training_data_uri to the .fit() call for both Training and Tuning jobs. This downloads all of the training data to the Sagemaker instance’s local disk. Thus, we can use Datasets load_from_disk() function to load this data. Note in the following code SAGEMAKER_LOCAL_TRAINING_DIR is just the path to the dir that data is downloaded to.

https://medium.com/media/fe813271774aea4bffe09af13d887e60/href
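
Roughly, that loading step might look like the following sketch; the split names, per_device_batch_size and collate_fn are assumptions rather than the post's exact code:

import evaluate
from datasets import load_from_disk
from torch.utils.data import DataLoader

# Every process loads the same dataset that Sagemaker copied to local disk
dataset = load_from_disk(SAGEMAKER_LOCAL_TRAINING_DIR)
train_dataloader = DataLoader(dataset["train"], shuffle=True,
                              batch_size=per_device_batch_size, collate_fn=collate_fn)
eval_dataloader = DataLoader(dataset["validation"],
                             batch_size=per_device_batch_size, collate_fn=collate_fn)

# Metrics from the Evaluate library, computed later from gathered predictions
acc_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")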

Each process loads the dataset, id2label file, metrics and creates dataloaders. Note the use of Huggingface’s evaluate library to load metrics; these can be used in tandem with Accelerate to make metric tracking simple during distributed training. We will see shortly how Accelerator provides one simple function to handle distributed training.

https://medium.com/media/15226e2ad8ecbde3ca2e9ceb6aa5a3f4/href
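
A hedged sketch of that block, assuming the user-defined load_model takes the config and that learning_rate and num_epochs come from the hyperparameters:

import torch
from transformers import get_scheduler

model = load_model(config)                      # user-defined function; signature assumed
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0,
                             num_training_steps=num_epochs * len(train_dataloader))

accelerator.wait_for_everyone()                 # every process reaches this point first
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler)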

In this code block, we first call the user-defined function load_model to receive the loaded model defined however the experimenter would like. Thus far, this function has typically looked like a call to a Transformers from_pretrained() function, though this is not enforced.

A common learning rate optimizer is created and used to create a learning rate scheduler. Finally, we encounter another core concept in Accelerator, namely, wait_for_everyone(). This function guarantees that all processes have made it to this point before proceeding to the next line of code. It must be called before the prepare() function which prepares all of the values we’ve created thus far for training (in our case, distributed training). wait_for_everyone() is used regularly in Accelerator code; for example, it is nice to have when ensuring that all GPUs have completed the training loop. After the prepare() step, the code enters a function to perform the training and validation loop. Next, we will look at how Accelerator works inside that loop.

https://medium.com/media/2fb146cf2b8cb53978f1aa644459b3db/href

At the start of the loop, we initialize a number of values to track throughout training. Here we use is_main_process again to create a single version of metric histories which we will use to plot graphs. In this example, we are only tracking training loss, validation accuracy and f1, but any number of metrics could be tracked here. Next, we enter the loop, set the model in train() mode and enter the train() function:

https://medium.com/media/9cf3010fa51a1ff611796215c6a18a72/href

As execution enters a batch, it first needs to check if we’re running a comparison job. If so, it needs to extract the appropriate parameters for the current model’s forward() function. If you recall, for comparison jobs, in the Data Preparation step we combined all inputs in the same pyarrow format, but prepended with the model_name (e.g. longformer_input_ids). get_model_specific_batch() just returns those parameters of the batch that match the current model_name.

Next, we encounter accelerator.accumulate(model), a context manager that recently came out in Accelerate and manages gradient accumulation. This simple wrapper reduces gradient accumulation to a single line. Underneath that manager, backpropagation should look familiar to readers who have written ML code before; the one big difference is calling accelerator.backward(loss) instead of loss.backward().
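
In skeletal form, the inner training loop looks something like this (get_model_specific_batch’s signature is assumed):

model.train()
for batch in train_dataloader:
    if config.is_comparison:
        batch = get_model_specific_batch(batch, model_name)  # keep only this model's columns
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)               # instead of loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()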

Upon completing a training batch, execution sets the model in .eval() mode and moves into the validation loop:

https://medium.com/media/85721e142429b16bd26ae6ddabfc487d/href

Here we encounter another key accelerate function, gather_for_metrics(). This recently added function makes it much easier to gather predictions in a distributed setting so they can be used to calculate metrics. We pass the returned values to the f1_metric and acc_metric objects we created earlier using the Evaluate library. The validation loop then computes the scores and returns them.
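
A minimal sketch of that validation loop, assuming the label column is named "labels":

import torch

model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    # Gather predictions and labels from every GPU process before computing metrics
    predictions, references = accelerator.gather_for_metrics((predictions, batch["labels"]))
    acc_metric.add_batch(predictions=predictions, references=references)
    f1_metric.add_batch(predictions=predictions, references=references)

val_accuracy = acc_metric.compute()["accuracy"]
val_f1 = f1_metric.compute(average="weighted")["f1"]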

After sending the batch through training and validation, we perform tracking on the values we initialized at the beginning:

https://medium.com/media/db048dfedd8e2fe6728622738468691b/href

Since the main process holds the references to our history-tracking data structures, we use is_main_process to append our new values. accelerator.log links up with the init_trackers call we made earlier: .log sends these values to the tracker we initialized, and in our case wandb will create graphs out of them. Finally, we use the F1 score to determine the best model over time.

After the training and validation loop is done, we execute:

https://medium.com/media/dc63e1e5acfc8c004540aa4e3befae6e/href

We start by ensuring that all processes have completed the training/validation loop and then call unwrap_model to extract the model from its distributed containers. Since the main process contains our metric histories, we use it to plot curves for each metric and calculate model statistics; we then return out the best model, curves and statistics.

Now that the training/validation loops are complete and we’ve determined a best model, we need to convert that best model to torchscript and save all the returned files to S3.

https://medium.com/media/52d154dd8d59f61df8dd47845fbbea6b/href

Here we call end_training since we are using wandb and use is_main_process since we no longer need distribution. accelerator.save() is the correct way to save the model to disk, but we need to convert it to torchscript to mirror production as closely as possible. Briefly, Torchscript is a way of converting a python-based model into a serializable, production-friendly format that need not have a python dependency. As such, when testing inference on an unseen test set, it is best to test on the model that would be in production. One way to convert a model is to call torch.jit.trace passing it the model and a sample instance which is how we’ve implemented the conversion:

https://medium.com/media/bddbf1d75422ee06914bb98c71014ba9/href

First, we take the best model and put it in CPU and evaluation mode. We then grab a sample instance out of the training data. Next, we encounter another user-defined function ordered_input_keys(). If you recall, this function returns the parameter names for a model’s forward() function in the correct order. It probably didn’t make sense earlier why this function was needed, but now it should: the example_inputs parameter of torch.jit.trace takes a tuple of input values which must match the exact parameter ordering of the forward() function.

Now, if we’re running a comparison job, then ordered_input_keys() is going to return a dictionary of OrderedDict’s with keys based on each model’s name. Thus, we test for this scenario and use the same get_model_specific_batch() function we used during training to extract a sample instance for the current model being converted.

Next, we iterate the ordered input keys and call .unsqueeze(0) on each parameter of the sample instance. The reason for this is because the forward() function expects a batch size as the first dimension of the input data; .unsqueeze(0) adds a dimension of 1 onto the tensors representing each parameter’s data.

Now we are ready to run the trace, passing the model, the example inputs and setting two parameters to false. The strict parameter controls whether or not you want the tracer to record mutable containers. By turning this off, you can allow, for example, your outputs = model(**batch) to remain a dict instead of a tuple. But you must be sure that the mutable containers used in your model aren’t actually mutated. check_trace checks that the same inputs run through the traced code produce the same outputs; in our case, leaving this True was producing odd errors, likely because of some internal non-deterministic operations, so we set it to False. Again, the ultimate test of the performance of the model is the inference step which we will be discussing next.
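
Condensed into a sketch (the sample-instance handling and the ordered_input_keys signature are assumptions):

import torch

best_model = accelerator.unwrap_model(best_model).to("cpu").eval()

sample = train_dataset[0]                        # one encoded instance, assumed torch-formatted
ordered_keys = ordered_input_keys(config)        # user-defined function; signature assumed
# forward() expects a batch dimension, so add one to every input tensor
example_inputs = tuple(sample[key].unsqueeze(0) for key in ordered_keys)

traced_model = torch.jit.trace(best_model, example_inputs=example_inputs,
                               strict=False, check_trace=False)
torch.jit.save(traced_model, "/tmp/model.pt")    # uploaded to S3 in the next step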

Finally, we save the traced model to local disk so it can be uploaded to s3. The final step of the train.py file is to upload all of these generated files to S3. In the case of a tuning job, we only retain the generated files from the run with the best objective metric score:

https://medium.com/media/ce129c3e8ffc3acb731246e6465051c0/href

And with that, we have completed discussing the training/tuning step of the ML Pipeline. Next, we will look at the inference step where we load the torchscript model, perform inference on the unseen test set and collect statistics.

Inference

In the Training/Tuning step, we convert our best model into torchscript which means it can easily run on the CPU or multi-CPU environment. This enables us to hijack a Sagemaker Processor instance to perform our inference job. Like the previous sections, we will first look at how an inference job is initiated. Because we can use a Processor instance, it is identical to our Data Preparation step except for pointing it at our /test/ data and our inference.py file.

https://medium.com/media/346d017e6fe8a9a369e18b5ea68715c1/href

Refer to the Data Preparation section of the second post to learn more about Processor/ScriptProcessor jobs. Note the differences of input_source_dir pointing at /test/ and `code` pointing at inference.py. Since these are so similar, we will move on to looking at the inference.py file.

We’ve discussed repeatedly the importance of run_num and how it is used to help identify the current experiment not only while training, but also the current model in production (so a production model can be linked to a training experiment). The inference.py will use the experiment parent directory to find the test data and the run_num to find the correct trained model.

The inference.py starts by downloading the id2label file so we can translate between model predictions and human-readable predictions:

https://medium.com/media/21466d30456617190f18debf9462512e/href

Recall from previous sections that the ML pipeline is capable of running comparison jobs (n models trained and tested on the same dataset). Inference is the step where comparison really shines, allowing you to compare performance on identical data. In the next code block, we will load n models to prepare for inference. Recall that if a single model was trained, it is passed as a list with a single value:

https://medium.com/media/54e4395375aa91710a4364c8334e4790/href
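
Sketched out, that loop might look like the following; download_torchscript_model and init_model_stats stand in for the post's helpers:

import torch

models, model_stats = {}, {}
for model_name in config.model_names:
    local_path = download_torchscript_model(model_name, config)  # hypothetical S3 download helper
    model = torch.jit.load(local_path, map_location="cpu")       # CPU is fine for torchscript inference
    model.eval()
    models[model_name] = model
    model_stats[model_name] = init_model_stats(labels)           # per-model, per-label counters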

This loop iterates the model names, downloads/loads the torchscript converted model and initializes statistics tracking for each. Let’s take a look at each inner function:

https://medium.com/media/d70a0ed298c6054e3277eb0fa5a61762/href

This function constructs the S3 path where the .pt file lives and downloads it. It then calls torch.jit.load and sets the model to eval mode, ready for inference. init_model_stats initializes the values we will track per model and per label, which gives us the raw counts we can use to build statistics:

https://medium.com/media/e49eea2f4f5f52e4019cb34092fed721/href

And init_metrics() simply loads the metrics we used earlier in the training step:

https://medium.com/media/a0da11f8058d91ce06484850572bb35d/href

Next, we get the test data from the Data Preparation step:

https://medium.com/media/a7654e454bd42419b3fd097f1c6f8210/href

With the models and data loaded, we are now ready to run inference:

https://medium.com/media/25f8fe7bc95c184b990e42ddf11cf36d/href
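
In outline, with assumed names (the "labels" and "logits" keys, and input_keys holding each model's ordered forward() parameter names):

import torch

for instance in test_dataset:
    ground_truth = instance["labels"]                      # assumed label column name
    for model_name in config.model_names:
        single = (get_model_specific_batch(instance, model_name)
                  if config.is_comparison else instance)
        # Add a batch dimension of 1, mirroring the torchscript conversion step
        example = tuple(single[key].unsqueeze(0) for key in input_keys[model_name])
        with torch.no_grad():
            outputs = models[model_name](*example)
        prediction = outputs["logits"].argmax(-1).item()   # most confident class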

The inference code will use config.is_comparison repeatedly to execute code specific to comparison jobs. It starts by initializing statistics specifically for comparisons, which we will skip for now. Next, it enters the main loop which iterates through each instance of unseen test data. The ground truth label is extracted and execution enters the inner loop over the model names (in the case of one model this is just a list with a single entry). If it is a comparison job, the data specific to the current model is extracted using the same function used in Training (get_model_specific_batch). The instance is then prepared for the forward() function using the same technique we used in convert_to_torchscript: each value gets .unsqueeze(0) called on it in order to add a batch size of 1 as the first dimension of the tensor.

We then grab the currently loaded model and pass the instance to it. We extract the most confident prediction from the returned logits by calling argmax(-1). Now let’s look at the remainder of the loop (note this begins inside the inner loop):

https://medium.com/media/bd012571a9a1424df4d153c852f57773/href

We take the prediction produced by the model and pass it and the ground truth to our accuracy and f1 metrics. We then increment the counters we initialized at the beginning:

https://medium.com/media/6268a5bd5ab7ebc0de9476a30effabf8/href

If inference.py is running a comparison job, we then add counts to the structure we initialized earlier; we will skip over these calls and jump to process_statistics which occurs after the inference code has finished looping:

https://medium.com/media/2840a3e82e148554bf7c0c8b82fb8d15/href

This function looks intimidating, but all it is doing is calculating the F1 score and Accuracy per label, sorting the results by F1 score descending, calculating the overall F1 and Accuracy and uploading the results to S3 under the correct parent dir and run_num.

If you’ve followed the ML Pipeline blogs up to this point, it is worth revisiting the folder structure, laid out in the first blog, that is built on S3 while the entire pipeline executes:

https://medium.com/media/71816d6b1e8b1a9f17e52b9065cc51f6/href

This folder structure recurs for every machine learning experiment, containing everything one would need to quickly understand the experiment or reproduce it and link an experiment to what is in production.

Prima facie, it seems like a simple part of the overall pipeline, but I believe it is one of the most important: imbuing each experiment with desirable properties like navigability, readability, reproducibility, versioning and more.

If you’ve been following these blogs up to this point then you’ve been on quite a journey. I hope they provide some guidance in setting up your own ML Pipeline. As we continue to modify ours we will post on blog-worthy topics, so stay tuned. You can check out the first two posts in the series here: Part One: Setup, Part Two: Data Steps.


Pierce Lamb: Creating a ML Pipeline on AWS Sagemaker Part Two: Data Steps

Wed, 2023/04/19 - 8:26pm

This is the second post in a three part series on creating a reusable ML pipeline that is initiated with a single config file and five user-defined functions. The pipeline is finetuning-based for the purposes of classification, runs on distributed GPUs on AWS Sagemaker and uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.

This post originally appeared on VISO Trust’s Blog

This post will cover the two data steps, data reconciliation and data preparation. These are common steps in a ML process where data is collected, cleaned and encoded the way a model will expect. If you have landed on this post first, check out the first post in the series detailing the pipeline setup. You can also jump to the third post in the series detailing training and testing.

Data Reconciliation

Of all the pipeline steps, the Data Reconciliation step is the one most likely to be customized to your specific use case. It represents the jumping-off point for collecting, cleaning and filtering the training data that will compose your experiment and getting it onto S3. In our case, the raw training data already exists in flat files on S3 while the labels required for supervised training exist in a production database. This is, in fact, why I called it ‘Data Reconciliation’: the production database labels are being reconciled with the flat files on S3.

As it is unlikely the reader has the exact same setup, I will try to highlight some of the reusable parts of Data Reconciliation without getting too deep into our specific flavor of it. Recall that a major architecture decision in the pipeline is a separate set of training data for every experiment; the goal of this step, then, is to collect the raw data, clean it and copy it to the bucket and folder on S3 where this experiment’s storage will reside (e.g. EXP-3333-longformer/data/reconciled_artifacts).

I’ll create a distinction here between ‘artifacts’ and ‘files’ to better understand what follows. For every ‘artifact’ uploaded into our system, tens of ‘files’ are created that represent data and analysis about the given ‘artifact.’ As such, our raw data is composed of these sets of files per uniquely identified artifact.

The first step in Data Reconciliation is to collect all of the raw data. In our case, this means authenticating to a read replica of the production database, and running a query that contains artifact identifiers related to their ground truth classification labels. We then collect all of the S3 file paths on the production instance of S3 keyed by the same artifact GUID identifier.

Data Reconciliation knows which S3 file paths to collect via a settings.ini value passed by the experimenter called FILES_FROM_PROD. For example, imagine each artifact has a file called raw_text.json; the experimenter would pass FILES_FROM_PROD=raw_text.json and Data Reconciliation would find the S3 path to every raw_text.json file in the production S3 bucket.

Using the artifact identifiers (GUIDs), we then filter the production database results such that both datasets contain the exact same artifact identifiers and drop duplicates using the file hash. At this point the labels and S3 paths to the flat files are now reconciled; the actual files and the label just need to be copied to the correct experiment directory.

Before that copying begins, note that we now have unique insight into the training data for this experiment. Using the filtered database results, we can discover exactly the labels that will be trained on, and the instance count per label:

https://medium.com/media/b41974050eba9c6ff1d85e41a2964fe3/href

Where df is a pandas dataframe of the filtered database results. Now every experiment has a unique_labels_and_counts.json in its /data folder the experimenter can interrogate to see which labels and their counts are associated with this training data set.
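
A minimal version of that could look like the following, assuming a "label" column in df and the S3 layout described in this series:

import json
import boto3

unique_labels_and_counts = df["label"].value_counts().to_dict()   # assumed column name

s3 = boto3.client("s3")
s3.put_object(
    Bucket=experiment_bucket,
    Key=f"{config.s3_parent_dir}/data/unique_labels_and_counts.json",
    Body=json.dumps(unique_labels_and_counts, indent=2),
)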

At this point, we encounter our first user-defined function. process_func is an optional function that will run after Data Reconciliation has copied files for every artifact identifier; it gives the experimenter the opportunity to execute some arbitrary code for each artifact identifier. As an example, when we go to train we need access to the ground truth labels extracted from the production database. process_func gives us the ability to create an additional file per artifact, say, ground_truth_label.json, that contains this label. Furthermore, if one’s model requires additional files to train on, for example an image of a given page, that additional file can be created here, per artifact. Because it’s optional, the user may choose not to define it; thus:

https://medium.com/media/b8b540632f4bc8448e0285dfe77e6ac6/href

Now that we have our reconciled data and our process_func, we have to copy data from the production S3 bucket into our experiment S3 directory. This can easily occur in parallel, so we utilize multiprocessing to kick it off as a parallel process:

https://medium.com/media/b06a7f3cf937577b531eeb15b4e217a1/href
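
A sketch of how that fan-out might look; get_existing_experiment_files and the exact argument list are assumptions that mirror the description below:

import math
from multiprocessing import Process

def copy_s3_data_in_parallel(df, bucket, raw_training_data_paths,
                             s3_artifact_path, num_procs, process_func, reload=False):
    # Find what already exists on the experiment bucket, so restarts don't re-copy data
    existing = get_existing_experiment_files(bucket, s3_artifact_path)  # hypothetical helper
    chunk_size = math.ceil(len(df) / num_procs)
    processes = []
    for i in range(num_procs):
        chunk = df.iloc[i * chunk_size:(i + 1) * chunk_size]
        p = Process(target=add_to_research_experiment,
                    args=(chunk, bucket, raw_training_data_paths,
                          s3_artifact_path, existing, process_func, reload))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()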

This function gets the df we discussed earlier, the experiment bucket, the dict of artifact identifier (GUID) to list of desired file paths (raw_training_data_paths), the parent experiment dir (s3_artifact_path), the number of parallel processes (either a config value or multiprocessing.cpu_count()) the process_func and a boolean that determines whether or not to overwrite.

First, it uses the same function that created raw_training_data_paths except pointed at the experiment bucket and with EXP-3333-longformer/data/reconciled_artifacts/ as a filter. This gives us a dict of what training data already exists for the experiment in case Data Reconciliation failed and had been restarted; we don’t copy the same data again. Next, it splits the reconciled data per process and for each split, creates a process and calls the add_to_research_experiment function. Let’s take a look at that function:

https://medium.com/media/cfb52ba7e7b656b4c7d4d6d8b009bdc8/href

The parameters to this function should be fairly straightforward given our discussion of copy_s3_data_in_parallel. The function iterates the data frame chunk directly, checking for three different copying scenarios. I am aware that iterating a data frame directly is generally frowned upon in favor of a vectorized approach; in our case, these chunks are fairly small, so it is not something we worry about. For each artifact, this function checks, first, whether overwriting (reload) was set to true; second, whether the artifact already exists in the experiment and whether the proposed artifact has additional files to add to it; and finally, whether it does not exist at all. In each case it calls an additional function that will copy the correct set of files. Next, let’s take a look at copy_to_s3:

https://medium.com/media/696545017ed9ec343577ff7119410f4d/href

This function is straightforward, and nicely shows what gets passed to process_func if the user has defined it. It gets the row from the df representing the current artifact, the existing files for the artifact _after_ copying, the experiment path and the overwriting boolean. This gives the experimenter a lot of flexibility in what he/she can do per artifact.

The final step of Data Reconciliation is a validation step where we use the config value FILES_ON_RESEARCH to validate that each artifact has the files it needs for training. The reason we can’t just use the earlier FILES_FROM_PROD value is that new files may have been created in process_func. So FILES_ON_RESEARCH may look like raw_text.json, page_01.png for example. This validation step is meant to provide some assurance that when we move onto Data Preparation, each artifact will have every file it needs and we don’t need to write code to handle missing files. So after all of our parallel processing completes, validate_data_was_created runs, which we will view in partial stub form:

https://medium.com/media/386054dd5e9cc3e05a39453c5aa64fcf/href

This function takes the full df, the list of desired files defined by FILES_FROM_PROD, the list of desired files that should be in the experiment FILES_ON_RESEARCH, the experiment directory (EXP-3333-longformer/data/reconciled_artifacts/) and the user defined process_func. It collects all the existing file paths for the given experiment and iterates them, popping file names off FILES_ON_RESEARCH to check if they exist for each artifact. If files are missing, it then discovers if they are FILES_FROM_PROD files and retrieves them from the prod S3 bucket or if they are process_func files which it re-runs to generate them. Once this step is complete, we can have high confidence that all of our raw training data files exist for each artifact. As such, we can move on to Data Preparation.

Data Preparation

The data preparation step is meant to take the raw training files for the experiment and encode them so they are prepared to be input into a model’s forward() function. For this task, we will utilize the HuggingFace Datasets library and specifically its powerful map() function. This is also the first task that will utilize Sagemaker, specifically Sagemaker Processor jobs.

Let’s start by taking a look at how the Processor job is constructed and called. First, we utilize the Sagemaker Python SDK’s ScriptProcessor class. This allows us to run an arbitrary script on a Processor instance. Creating the ScriptProcessor object will look like:

https://medium.com/media/949b306885da38dca0d8d9a4e29292c3/href
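
For reference, a bare-bones construction along those lines (the config field names are placeholders):

from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    image_uri=config.docker_image_path,          # the experimenter's monolithic ECR image
    command=["python3"],
    role=config.sagemaker_role,                  # placeholder config fields
    instance_count=config.processing_instance_count,
    instance_type=config.processing_instance_type,
)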

As you can see, this construction is basically defined by config values. Arguably the most important is config.docker_image_path. This carefully constructed docker image, which we spoke about in the first post in this series, is re-used among all Sagemaker jobs (Processor/Training/Tuning). We spoke in the first post about how an experimenter extends a base image that contains all common dependencies like CUDA-enabled PyTorch, transformers, datasets, accelerate, numpy, etc. and adds any of their model-specific dependencies. That base image also contains lines that allow it to run on these different Sagemaker instances; we’ll discuss one now and more during our discussion of training:

https://medium.com/media/7de71a8e565dcef441186c07cf87cdef/href

Sagemaker Training/Tuning jobs always look in the /opt/ml/code directory for custom dependencies while Processor jobs look in /opt/ml/processing. These lines copy all of our ML pipeline code into these directories to ensure that all custom dependencies are available in either type of job. Now if we jump back over to where we constructed the ScriptProcessor object, this is how we kick off the job:

https://medium.com/media/a0bc063774e034385ed5a3b59c7b7f18/href
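
Approximately, that call looks like this sketch; the bucket layout and the SAGEMAKER_LOCAL_DATA_DIR value are assumptions consistent with the description below:

from sagemaker.processing import ProcessingInput, ProcessingOutput

script_processor.run(
    code="src/preprocessing/data_preparation.py",
    inputs=[ProcessingInput(
        source=f"s3://{config.bucket}/{config.s3_parent_dir}/data/reconciled_artifacts/",
        destination=SAGEMAKER_LOCAL_DATA_DIR,    # a path under /opt/ml/processing/
    )],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)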

One feature of Processor jobs that is easy to miss is that before the script is executed, Sagemaker copies everything from the S3 URI provided in the source param onto local disk at the destination path. Building your script around this fact will give you huge performance benefits, which we will discuss more later on. Another important point that may not be immediately obvious is that the command param combined with the code param is basically like defining an ENTRYPOINT for the Processor job. While it’s not exactly accurate, you can imagine these params creating this command in the container:

ENTRYPOINT [‘python3’, ‘/opt/ml/code/src/preprocessing/data_preparation.py’]

So the code above is constructing the S3 URI to the reconciled artifacts we created in the Data Reconciliation step and passing it in the source param, and the Processor job copies all of this data to local disk before it kicks off. SAGEMAKER_LOCAL_DATA_DIR defines where that data will be copied and is specified in data_preparation.py so the path can be used there as well. Processor jobs can output data, which is why I’ve defined outputs, but for now the data_preparation.py script is not utilizing this feature. Now that we’ve discussed how it is kicked off, we can take a look at encoding data in data_preparation.py.

The first step at the beginning of encoding is to define the S3 directory where data will be saved and get the label file we produced during Data Reconciliation. We read a config value to get the encoded data dir, namely, ENCODED_DATA_DIR. The value will typically be full_dataset, but it gives the experimenter the ability to produce smaller test datasets if desired (e.g. partial_dataset). So the full path will look like:

encoded_data_dir = f"{config.s3_parent_dir}/data/prepared_data/{config.encoded_data_dir}"

Or EXP-3333-longformer/data/prepared_data/full_dataset

Next, we get the unique_labels_and_counts.json file we uploaded during Data Reconciliation as our ground truth for supervised learning. We give the experimenter the ability to modify the ground truth here through some basic knobs: IGNORED_LABELS and NUM_LABELS_THRESHOLD; I could imagine a number of other options here. These knobs are self explanatory:

https://medium.com/media/0511dd70829f672d8bac6009c7d15331/href

After modifying the labels the way the experimenter wants, execution moves onto the get_artifact_paths function. This function gets the paths on local disk that raw training data was copied to and returns them in a format that the Huggingface Datasets library will expect:

https://medium.com/media/92c63f294ea8a9eb73125b5bf4b8f4c2/href

get_artifact_paths is called using the same path we passed to Processor.run() to define where data should be copied, along with the results of the MODEL_INPUT_FILES config param. Following our example, this value would simply be [raw_text.json]. A Huggingface datasets.arrow_dataset.Dataset is eventually going to expect data formatted such that each row constitutes an instance of training data, and each column represents the path to a needed input file. In our case it would look like:

https://medium.com/media/9a8906b6cdd5367dfede611796f582bc/href

This would be easy to represent in pandas, but since we’d prefer to not depend on pandas and will utilize Dataset.from_dict(), get_artifact_paths represents this structure using the file names as keys and lists to contain the paths.

Execution then enters the directory defined in SAGEMAKER_LOCAL_DATA_DIR and extracts the list of subdirs which, in our case, are guids for each artifact. It iterates these subdirs collecting the filenames for all files that are children of each subdir. It then uses the passed MODEL_INPUT_FILES to validate that each needed file is there and adds it to the artifact_paths dict. We now have a dict that is ready for Datasets processing.
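
Concretely, for MODEL_INPUT_FILES=raw_text.json the structure handed to Datasets might look like this sketch (the GUID paths are illustrative):

from datasets import Dataset

artifact_paths = {
    "raw_text.json": [
        "/opt/ml/processing/input/data/<guid-1>/raw_text.json",
        "/opt/ml/processing/input/data/<guid-2>/raw_text.json",
    ],
}
dataset = Dataset.from_dict(artifact_paths)   # one row per artifact, one column per input file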

Control now moves to a get_encoded_data() function that will kick off Huggingface’s datasets.arrow_dataset.Dataset.map(), which is a very powerful abstraction for encoding datasets. get_encoded_data is intended to set up the map() function for parallel processing of raw training data encoding and is the main part of the Data Preparation step:

https://medium.com/media/a7d67dbb995e7a70e57c3ba10d1a68f7/href

This function sets up the mapper, executes it, splits the returned encoded data and saves the split, encoded data to S3. The function takes the get_artifact_paths data we just generated (as data), a list of the labels only from unique_labels_and_counts.json, a few directory paths and the number of parallel processes to spin up. It starts by generating two label dicts in handle_labels, label2id.json and id2label.json which will be used downstream to convert between the integer values predicted by the model and actual string labels.

Next, one of our user-defined functions, get_dataset_features, is called. As you may have noticed from the hints in the Datasets class paths, Datasets uses PyArrow as the backend for writing and reading data. PyArrow needs to enforce a schema for what it writes and reads; get_dataset_features allows the experimenter to write that schema. This function returns a Datasets Features object which packages up this schema for the backend. Following our Longformer example, this function might look like:

https://medium.com/media/d859c13115e1dd55c19cfb3303578a16/href
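
A hedged sketch of such a function for an encoder like Longformer; the exact key names depend on the model's forward() signature:

from datasets import ClassLabel, Features, Sequence, Value

def get_dataset_features(label_list):
    return Features({
        "input_ids": Sequence(Value("int64")),
        "attention_mask": Sequence(Value("int64")),
        "labels": ClassLabel(names=label_list),   # integer labels backed by the label names
    })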

The keys here represent the parameters the Longformer forward() function will expect when performing the forward pass. Now that we have these features, we can call Dataset.from_dict() on our get_artifact_paths data and we are fully ready for the mapper. The mapper has a variety of options, but the core concept is applying a function to every instance of training data that encodes and returns it. Let’s take a closer look at the call in Data Preparation:

https://medium.com/media/ff9d2c39cd32c64d8cb2feee9d2c40b0/href

Here we pass the function we want to execute per instance, preprocess_data; fn_kwargs allows us to specify additional parameters we want to pass to that function; batched means that preprocess_data will receive batches of data instead of single instances, which allows us to perform additional filtering; features are the features we retrieved from get_dataset_features; remove_columns drops the original column names so they aren’t encoded; and finally num_proc sets the number of processes to run in parallel.
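
As a sketch, the call is roughly the following (the fn_kwargs contents and config field are assumptions):

encoded_dataset = dataset.map(
    preprocess_data,
    fn_kwargs={"label2id": label2id},           # extra arguments forwarded to preprocess_data
    batched=True,
    features=features,                          # from get_dataset_features()
    remove_columns=dataset.column_names,        # drop the raw path columns after encoding
    num_proc=config.num_processes,              # placeholder config field
)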

With this in place, we can take a look at def preprocess_data which is executed by each process in parallel:

https://medium.com/media/85e711c4d4a5fa05e82657b5539b3ef2/href

The function first validates that each column of data has the exact same length and returns that length so it can be iterated over. It then iterates the batch, constructing a single instance and passing it to another user-defined function, encode_data. encode_data gives the experimenter the ability to define exactly how a single training instance is encoded with the option of returning None if additional filtering is desired. For instance, say we were using a Huggingface Transformers Tokenizer to encode; a single_instance here represents the file paths to the data we need, so we would get that data, say, in a variable called text_content and call something like this:

https://medium.com/media/0a59724cade845fc38695a8f953746b2/href
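
For example, a hedged sketch of such an encode_data, assuming a Longformer checkpoint and hypothetical helpers for reading the text and label:

from transformers import AutoTokenizer

# Constant, so the tokenizer is not rebuilt on every call to encode_data
TOKENIZER = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

def encode_data(single_instance, label2id):
    # single_instance maps file names to local paths, e.g. {"raw_text.json": "/.../raw_text.json"}
    text_content = read_text_file(single_instance["raw_text.json"])   # hypothetical reader
    if not text_content:
        return None                      # preprocess_data skips instances that return None
    encoding = TOKENIZER(text_content, truncation=True,
                         padding="max_length", max_length=4096)
    encoding["labels"] = label2id[read_label(single_instance)]        # hypothetical label lookup
    return encoding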

Where TOKENIZER is defined as a constant outside the function so it’s not re-constructed each time this function is called. If we continue following preprocess_data we can see that it simply skips single_instance’s where encode_data returns None. Finally, the encoded input is returned to the mapper in the correct Features format.

I’m going to skip looking at get_train_valid_test_split(), but suffice it to say that it uses Datasets internal function dataset.train_test_split() to split data using percentages and writes a metadata file that shows the counts of the split and associated labels to the experimenter.

And with that, Data Preparation is complete. Recall from the beginning that this will run as a ScriptProcessor job on a Sagemaker Processor instance. These instances tend to have lots of vCPU’s and can really take advantage of the parallel processing we’re doing in the mapper. The encoded data will end up on S3 ready to be downloaded by a Training or Tuning job which is discussed in the third post in this series. You can jump to the first and third post via these links: Part One: Setup, Part Three: Training and Inference.


Pierce Lamb: Creating a Machine Learning Pipeline on AWS Sagemaker Part One: Intro & Set Up

Wed, 2023/04/19 - 8:24pm

Or rather, creating a reusable ML Pipeline initiated by a single config file and five user-defined functions that performs classification, is finetuning-based, is distributed-first, runs on AWS Sagemaker, uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.

This post originally appeared on VISO Trust’s Blog

This is the introductory post in a three part series. To jump to the other posts, check out Creating a ML Pipeline Part 2: The Data Steps or Creating a ML Pipeline Part 3: Training and Inference

Introduction

On the Data & Machine Learning team at VISO Trust, one of our core goals is to provide Document Intelligence to our auditor team. Every document that passes through the system is subject to collection, parsing, reformatting, analysis, reporting and more. Part of that intelligence is automatically determining what type of document has been uploaded into the system. Knowing what type of document has entered the system allows us to perform specialized analysis on that document.

The task of labeling or classifying a thing is a traditional use of machine learning, however, classifying an entire document — which, for us, can be up to 300+ pages — is on the bleeding edge of machine learning research. At the time of this writing, researchers are racing to use the advances in Deep Learning and specifically in Transformers to classify documents. In fact, at the outset of this task, I performed some research on the space with keywords like “Document Classification/Intelligence/Representation” and came across nearly 30 different papers that use Deep Learning and were published between 2020 and 2022. For those familiar with the space, names like LayoutLM/v2/v3, TiLT/LiLT, SelfDoc, StructuralLM, Longformer/Reformer/Performer/Linformer, UDOP and many more.

This result convinced me that trying a multitude of these models would be a better use of our time than trying to decide which was the best among them. As such, I decided to pick one and use the experience of fine-tuning it as a proof-of-concept to build a reusable ML pipeline the rest of my team could use. The goal was to reduce the time to perform an experiment from weeks to a day or two. This would allow us to experiment with many of the models quickly to decide which are the best for our use case.

The result of this work was an interface where an experimenter writes a single config file and five user-defined functions to kick off data reconciliation, data preparation, training or tuning, and inference testing automatically.

When I set out on that proof-of-concept (pre-ML Pipeline), it took over a month to collect and clean the data, prepare the model, perform inference and get everything working on Sagemaker using distribution. Since building the ML Pipeline, we’ve used it repeatedly to quickly experiment with new models, retrain existing models on new data, and compare the performance of multiple models. The time required to perform a new experiment is about half a day to a day on average. This has enabled us to iterate incredibly fast, getting models in production in our Document Intelligence platform quickly.

What follows is a description of the above Pipeline; I hope that it will save you from some of the multi-day pitfalls I encountered building it.

ML Experiment Setup

An important architectural decision we made at the beginning was to keep experiments isolated and easily reproducible. Every time an experiment is performed, it has its own set of raw data, encoded data, docker files, model files, inference test results, etc. This makes it easy to trace a given experiment across repos/S3/metrics tools and back to where it came from once it is in production. However, one trade-off worth noting is that training data is copied separately for every experiment; for some orgs this may simply be infeasible, and a more centralized solution will be necessary. With that said, what follows is the process of creating an experiment.

An experiment is created in an experiments repo and tied to a ticket (e.g. JIRA) like EXP-3333-longformer. This name will follow the experiment across services; for us, all storage occurs on S3, so in the experiment's bucket, objects will be saved under the EXP-3333-longformer parent directory. Furthermore, in wandb (our tracker), the top level group name will be EXP-3333-longformer.
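To make the naming convention concrete, here is a hedged sketch of how one experiment name might be threaded through S3 and wandb; the bucket, project and file names are placeholders:

    import boto3
    import wandb

    EXPERIMENT_NAME = "EXP-3333-longformer"
    BUCKET = "ml-experiments"  # placeholder: a single shared experiments bucket

    # Everything the experiment produces lands under one S3 parent prefix...
    s3 = boto3.client("s3")
    s3.upload_file("raw_data.tar.gz", BUCKET,
                   f"{EXPERIMENT_NAME}/raw_data/raw_data.tar.gz")

    # ...and the same name is used as the top-level group in wandb.
    run = wandb.init(project="document-intelligence",
                     group=EXPERIMENT_NAME,
                     name=f"{EXPERIMENT_NAME}-run-1")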

Next, example stubbed files are copied in and modified to the particulars of the experiment. This includes the config file and the user-defined function stubs mentioned above. Also included are two docker files: one represents the dependencies required to run the pipeline, the other the dependencies required to run the stages that execute on AWS Sagemaker (data preparation, training or tuning, and inference). Both docker files are kept simple by extending base docker files maintained in the ML pipeline library; the intent is that they only need to include extra libraries required by the experiment. This follows the convention established by AWS’s Deep Learning Containers (DLCs), and in fact our base sagemaker container starts by extending one of these DLCs.

There is an important trade-off here: we use one monolithic container to run three different steps on Sagemaker. We preferred a simpler setup for experimenters (one dockerfile) over having to create a different container per Sagemaker step. The downside is that for a given step, the container will likely contain some unnecessary dependencies that make it larger. Let’s look at an example to solidify this.

In our base Sagemaker container, we extend:

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

This gives us PyTorch 1.10.2 with CUDA 11.3 bindings, transformers 4.17, Python 3.8 and Ubuntu 20.04, all ready to run on the GPU. You can see the available DLCs here. We then add sagemaker-training, accelerate, evaluate, datasets and wandb. Now, when an experimenter goes to extend this image, they only need to worry about any extra dependencies their model might need. For example, a model might depend on detectron2, an unlikely dependency among other experiments. In that case, the experimenter would only need to extend the base sagemaker container, install detectron2, and be done worrying about dependencies.

With the base docker containers in place, the files needed for the start of an experiment would look like:

https://medium.com/media/de90d5b8d6601d3975ea80c332e95e7f/href

In brief, these files are:

  • settings.ini: A single (gitignored) configuration file that takes all settings for every step of the ML pipeline (copied into the dockerfiles)
  • sagemaker.Dockerfile: Extends the base training container discussed above and adds any extra model dependencies. In many cases the base container itself will suffice.
  • run.Dockerfile: Extends the base run container discussed above and adds any extra run dependencies the experimenter needs. In many cases the base container itself will suffice.
  • run.sh: A shell script that builds and runs run.Dockerfile.
  • build_and_push.sh: A shell script that builds and pushes sagemaker.Dockerfile to ECR.
  • user_defined_funcs.py: Contains the five user defined functions that will be called by the ML pipeline at various stages (copied into the dockerfiles). We will discuss these in detail later.

These files represent the necessary and sufficient requirements for an experimenter to run an experiment on the ML pipeline. As we discuss the ML pipeline, we will examine these files in more detail. Before that discussion, however, let’s look at the interface on S3 and wandb. Assume that we’ve set up and run the experiment as shown above. The resulting directories on S3 will look like:

https://medium.com/media/823d0264d1199be7b6d3703cb0325616/href

The run_number will increment with each subsequent run of the experiment. This run number will be replicated in wandb and also prefixed to any deployed endpoint for production so the exact run of the experiment can be traced through training, metrics collection and production. Finally, let’s look at the resulting wandb structure:

https://medium.com/media/b6c2f56b011001028fd1e427080db31a/href

I hope that getting a feel for the interface of the experimenter will make it easier to understand the pipeline itself.

The ML pipeline

The ML pipeline will (eventually) expose some generics that specific use cases can extend to modify the pipeline for their purposes. Since it was recently developed in the context of one use case, we will discuss it in that context; however, below I will show what it might look like with multiple:

https://medium.com/media/06e61e98a0e3c0d02df5e515fcbb9c38/href

Let’s focus in on ml_pipeline:

https://medium.com/media/6a8f1af5aeb5c98051240440f5e42a92/href

The environment folder will house the files for building the base containers we spoke of earlier, one for running the framework and one for any code that executes on Sagemaker (preprocessing, training/tuning, inference). These are named using the same conventions as AWS DLCs so it is simple to create multiple versions of them with different dependencies. We will ignore the test folder for the remainder of this blog.

The lib directory houses our implementation of the ML pipeline. Let’s zoom in again on just that directory.

https://medium.com/media/78a4e37d0f6ce79cb18d2eea8de325c0/href

Let’s start with run_framework.py since that will give us an eagle eye view of what is going on. The skeleton of run_framework will look like this:

https://medium.com/media/38ea2a2ea16b2a7fd6a0d5fd4405b292/href

The settings.ini file a user defines for an experiment will be copied into the same directory (BASE_PACKAGE_PATH) inside each docker container and parsed into an object called MLPipelineConfig(). In our case, we chose to use Python Decouple to handle config management. In this config file, the initial settings are RUN_RECONCILIATION/PREPARATION/TRAINING/TUNING/INFERENCE, so the pipeline is flexible about exactly what an experimenter is looking for. These values drive the conditionals above.
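As a hedged sketch (MLPipelineConfig’s real fields aren’t shown here, so the shape below is an assumption), parsing those flags with Python Decouple could look like this:

    from dataclasses import dataclass
    from decouple import config

    @dataclass
    class MLPipelineConfig:
        run_reconciliation: bool
        run_preparation: bool
        run_training: bool
        run_tuning: bool
        run_inference: bool
        use_case: str

    pipeline_config = MLPipelineConfig(
        run_reconciliation=config("RUN_RECONCILIATION", default=False, cast=bool),
        run_preparation=config("RUN_PREPARATION", default=False, cast=bool),
        run_training=config("RUN_TRAINING", default=False, cast=bool),
        run_tuning=config("RUN_TUNING", default=False, cast=bool),
        run_inference=config("RUN_INFERENCE", default=False, cast=bool),
        use_case=config("USE_CASE", default="document_classification"),  # placeholder name
    )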

Note the importlib line. This line allows us to import use-case specific functions and pass them into the steps (shown here is just data reconciliation) using an experimenter-set config value for use case.
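The dynamic import might look roughly like this; the module path and function name are placeholders for whatever layout the pipeline actually uses:

    import importlib

    def load_use_case_funcs(use_case: str):
        # e.g. "document_classification" maps to
        # lib.use_cases.document_classification.user_defined_funcs
        return importlib.import_module(f"lib.use_cases.{use_case}.user_defined_funcs")

    funcs = load_use_case_funcs("document_classification")
    run_data_reconciliation = getattr(funcs, "run_data_reconciliation")  # hypothetical name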

The moment the config file is parsed, we want to run validation to identify misconfigurations now instead of in the middle of training. Without getting into too much detail on the validation step, here is what the function might look like:

https://medium.com/media/281e4b8f338f30922d8311afaddebca9/href

The _validate_funcs function ensures that functions with those definitions exist and that they are not defined as pass (i.e. a user has created them and defined them). The user_defined_funcs.py file above simply defines them as pass, so a user must overwrite these to execute a valid run. _validate_run_num throws an exception if the settings.ini-defined RUN_NUM already exists on s3. This saves us from common pitfalls that could occur an hour into a training run.
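A hedged sketch of those two checks follows; how the real pipeline detects a pass-only stub and lays out its buckets is not shown here, so both are assumptions:

    import inspect
    import boto3

    def _validate_funcs(funcs: list) -> None:
        for func in funcs:
            source = inspect.getsource(func)
            # A stub whose body is nothing but `pass` has not been implemented yet.
            if source.strip().endswith("pass"):
                raise ValueError(f"{func.__name__} is still a stub; please implement it.")

    def _validate_run_num(bucket: str, experiment: str, run_num: int) -> None:
        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket=bucket,
                                  Prefix=f"{experiment}/{run_num}/",
                                  MaxKeys=1)
        if resp.get("KeyCount", 0) > 0:
            raise ValueError(f"Run {run_num} already exists on S3 for {experiment}; "
                             "bump RUN_NUM in settings.ini.")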

We’ve gotten to the point now where we can look at each pipeline step in detail. You can jump to the second and third post via these links: Part Two: The Data Steps, Part Three: Training and Inference.

Categories:

Nonprofit Drupal posts: April Drupal for Nonprofits Chat

Wed, 2023/04/19 - 8:14pm

Join us TOMORROW, Thursday, April 20 at 1pm ET / 10am PT, for our regularly scheduled call to chat about all things Drupal and nonprofits. (Convert to your local time zone.)

No pre-defined topics on the agenda this month, so join us for an informal chat about anything at the intersection of Drupal and nonprofits.  Got something specific on your mind? Feel free to share ahead of time in our collaborative Google doc: https://nten.org/drupal/notes!

All nonprofit Drupal devs and users, regardless of experience level, are always welcome on this call.

This free call is sponsored by NTEN.org and open to everyone. 

  • Join the call: https://us02web.zoom.us/j/81817469653

    • Meeting ID: 818 1746 9653
      Passcode: 551681

    • One tap mobile:
      +16699006833,,81817469653# US (San Jose)
      +13462487799,,81817469653# US (Houston)

    • Dial by your location:
      +1 669 900 6833 US (San Jose)
      +1 346 248 7799 US (Houston)
      +1 253 215 8782 US (Tacoma)
      +1 929 205 6099 US (New York)
      +1 301 715 8592 US (Washington DC)
      +1 312 626 6799 US (Chicago)

    • Find your local number: https://us02web.zoom.us/u/kpV1o65N

  • Follow along on Google Docs: https://nten.org/drupal/notes

View notes of previous months' calls.

Categories:

Security advisories: Drupal core - Moderately critical - Access bypass - SA-CORE-2023-005

Wed, 2023/04/19 - 7:06pm
Project: Drupal core
Date: 2023-April-19
Security risk: Moderately critical 13/25 AC:Basic/A:None/CI:Some/II:None/E:Theoretical/TD:All
Vulnerability: Access bypass
Description:

The file download facility doesn't sufficiently sanitize file paths in certain situations. This may result in users gaining access to private files that they should not have access to.

Some sites may require configuration changes following this security release. Review the release notes for your Drupal version if you have issues accessing private files after updating.

This advisory is covered by Drupal Steward.

We would normally not apply for a release of this severity. However, in this case we have chosen to apply Drupal Steward security coverage to test our processes.

Drupal 7
  • All Drupal 7 sites on Windows web servers are vulnerable.
  • Drupal 7 sites on Linux web servers are vulnerable with certain file directory structures, or if a vulnerable contributed or custom file access module is installed.
Drupal 9 and 10

Drupal 9 and 10 sites are only vulnerable if certain contributed or custom file access modules are installed.

Solution: 

Install the latest version:

All versions of Drupal 9 prior to 9.4.x are end-of-life and do not receive security coverage. Note that Drupal 8 has reached its end of life.

Reported By: Fixed By: 
Categories:

LN Webworks: 7 ways to enhance your ecommerce Website and online sales with Drupal

Wed, 2023/04/19 - 11:32am
Drupal is open-source content management software that enables companies to create captivating e-commerce websites and online stores. It has given new life to the world of digital commerce. Fortune 500 companies like Tesla and General Electric have unleashed the power of Drupal Commerce to create cutting-edge digital experiences for their customers. This brings one question to mind: “What makes this content management system the popular choice of these eminent companies?” A brief yet all-encompassing answer is that this software complements the ever-evolving digital trends. As consumer behavior changes and evolves with the technological revolution, Drupal helps you match strides with it and consistently deliver the best digital experience. Given that, it wouldn’t be wrong to call it a stepping stone to creating a thriving online store.
Categories:

Peoples Blog: Fix Colima connection refused error: failed to get Info from .lima/colima/ha.sock on Mac

Wed, 2023/04/19 - 4:30am
This article is about fixing a single error you may see with Colima on Mac machines. It may be a simple and specific issue, but anyone facing it will appreciate the solution provided. While running Colima on your Mac, you generally run into this issue when you power off or shut down your machine without stopping the Colima service (and de
Categories:

PreviousNext: Why a culture of open-source contribution is good for your business

Wed, 2023/04/19 - 12:58am

Contributing makes good business sense, especially when open-source technology, such as Drupal, is at the core of everything you do (pun intended!). 

by Owen Lansbury / 19 April 2023

Based on a talk given at EverythingOpen 2023. A video of that presentation is also available at the end of this article.

Why do we contribute to the Drupal community? 

Adopting a formalised approach to contribution helps our business stay sustainable in the long term. It also has the added benefit of helping everyone else in the open-source community.

Reputation

Over the years at PreviousNext, we’ve honed a deep expertise in Drupal. That’s because we’ve doubled down and avoided diluting our technical offering. We’re all in for Drupal. 

This level of knowledge sees us regularly referred to clients looking for hard hitters in the Drupal space. Our expertise is particularly appealing, as it happens, for our Higher Education and Government clients. Being Australia’s only Platinum Certified Drupal Partner can only help in this regard.

Our Drupal Association profile records all our contributions as ‘credits’. These determine our ranking as a certified partner, demonstrating our commitment to Drupal as a technology and a community.

We focus on raising our Drupal profile using means other than traditional marketing methods. Our team attends events, volunteers at DrupalSouth, presents at conferences, sponsors the DrupalSouth CodeSprint, and takes on community leadership roles. 

This level of involvement cements our position as a leading Drupal provider. It also gives all members of our team (including those who are non-technical) additional opportunities to be part of the community and raise their profiles.

Professional development

I like to refer to Drupal as a ‘do-ocracy’. Everyone is welcome, and all help is welcome. Open-source and open-handed. It’s the same sense of community that we value at PreviousNext.

When someone first joins our business, we often use open-source contributions as the primary method of onboarding them. This induction method encourages them to develop best practices in their coding and use their involvement in the Drupal project as part of their ongoing professional development.

An offshoot of this is the chance to build relationships and be mentored by people external to our organisation. It’s a unique opportunity to broaden our collective perspectives and work alongside (and become!) some of the brightest minds in open-source tech.

A happier team

Avoiding team member burnout or a lacklustre approach to work is vital for us as a smaller organisation. Instead, we help staff to scratch those different ‘itches’.

Working on contrib helps to maintain interest and passion by giving staff time to work on projects that aren’t run-of-the-mill client engagements. It also exposes our team to larger initiatives than they might otherwise work on.

Staff retention

A happier team, in turn, leads to a more stable team over the long term. Our retention rates have steadied at around three times the industry average. 

This tendency towards longevity also facilitated our decision to make PreviousNext employee-owned.

How do we contribute? An established framework 

Enshrined in our Staff Handbook is the hope that employees at PreviousNext will use 20% of their time for contrib (the remaining 80% is billable client work). If a team member chooses not to contribute, they work closer to fully billable hours.

We don’t expect staff to contribute outside their employed hours–though many do for their own interest.

With a robust time-tracking and self-management culture, this approach works well and leads to a productive, well-run company.

We’ve also baked open-source contributions into our regular ‘Hackdays’. These are days when our developers get together and innovate. This focused work feeds into our client projects and becomes part of our Drupal contributions. 

Other methods for ensuring a regular flow of code include directly sponsoring developers, which helps us maintain our partnership status. 

We also use project-based sponsorship to contribute patches and new modules to the Drupal ecosystem. The clients for these projects also receive credits for sponsoring this development.

Being a good Drupal citizen

Open-source contribution isn’t just about altruism, and it shouldn’t be viewed as a drain on a business’s income generation. It’s about recognising that our businesses depend on a technological ecosystem that, in turn, relies on as many of us as possible playing our part to advance it.

When it comes to Drupal, the result of these contributions is a platform that commands a 10% share of the top 10,000 most visited websites globally. Clearly, though, there is more to be done to promote Drupal even further. It’s something we can all get behind, because when our chosen open-source platform thrives, so do our businesses.

 Watch the video
Categories: