In this blog post, we will look at some best practices for authoring DAGs. Let’s start.
DAG as configuration file
The Airflow scheduler scans and compiles DAG files at each heartbeat. If DAG files are heavy and contain a lot of top-level code, the scheduler will consume a lot of resources and time processing them at each heartbeat. So it is advised to keep DAGs light, more like a configuration file. As a step forward, it is a good choice to have a YAML/JSON-based definition of the workflow and then generate the DAG from it. This has double…
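One way to sketch this idea (the spec fields and ids below are hypothetical, not from any particular library) is to parse a JSON workflow definition and derive the task dependencies from it; the DAG file then only has to wire those pairs into operators:

```python
import json

# Hypothetical JSON workflow spec; in practice this would live in its own file.
spec = json.loads("""
{
  "dag_id": "etl_daily",
  "tasks": [
    {"id": "extract", "upstream": []},
    {"id": "transform", "upstream": ["extract"]},
    {"id": "load", "upstream": ["transform"]}
  ]
}
""")

def build_edges(spec):
    """Turn the spec into (upstream, downstream) pairs for the DAG file to wire up."""
    return [(up, task["id"]) for task in spec["tasks"] for up in task["upstream"]]

edges = build_edges(spec)
# In a real DAG file, each pair would become an `upstream_op >> downstream_op` call.
```

The DAG file stays a thin shell; changing the pipeline means editing data, not code.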
Regularization is a principle that penalizes complex models so that they generalize better. It prevents overfitting. In this blog we will visit common regularization techniques.
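As a concrete toy illustration of the "penalize complexity" idea, here is the L2 (ridge) penalty term that gets added to the training loss; the weights, loss value, and lambda below are made-up numbers:

```python
# L2 regularization sketch: add lambda * sum(w^2) to the training loss,
# so large weights (a proxy for model complexity) are penalized.
def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

weights = [3.0, -2.0, 0.5]   # hypothetical model weights
loss = 1.25                  # hypothetical unregularized loss
lam = 0.1                    # regularization strength

regularized_loss = loss + l2_penalty(weights, lam)
```

Minimizing the regularized loss trades a little training fit for smaller weights, which is what improves generalization.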
Your neural network is only as good as the data you feed it.
The performance of deep learning neural networks often improves with the amount of data available. But we don’t usually have huge amounts of data. Data augmentation is a technique to artificially create new training data from existing training data. Depending upon when we apply these transformations, we have two types of augmentation:
Offline — perform all the necessary transformations beforehand
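A minimal sketch of the "beforehand" flavor, using a toy 2×3 grid as the image and a horizontal flip as the transformation (all values are made up):

```python
# Offline augmentation sketch: precompute flipped copies of each "image"
# (here a toy 2x3 grid of numbers) and add them to the training set.
def hflip(img):
    """Horizontally flip a 2-D image represented as a list of rows."""
    return [row[::-1] for row in img]

dataset = [[[1, 2, 3],
            [4, 5, 6]]]
augmented = dataset + [hflip(img) for img in dataset]  # doubles the data
```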
Optimizers are methods/algorithms used to modify the attributes like weights & learning rate in order to minimize the loss.
Batch Gradient Descent — Regression & classification
It computes the gradient of the loss function w.r.t. the parameters for the entire training dataset.
for i in range(epochs):
    param_gradient = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * param_gradient
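To make the pseudocode above concrete, here is a runnable sketch of batch gradient descent on a toy one-parameter least-squares problem (fit y = w·x); the data and hyperparameters are made up:

```python
# Toy dataset: (x, y) pairs with y = 2 * x exactly, so the true w is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def evaluate_gradient(w, data):
    # Gradient of the mean of 0.5 * (w*x - y)^2 over the WHOLE batch.
    return sum((w * x - y) * x for x, y in data) / len(data)

w, learning_rate = 0.0, 0.1
for _ in range(100):
    # One parameter update per pass over the entire dataset.
    w -= learning_rate * evaluate_gradient(w, data)
```

After the loop, w has converged to (approximately) 2.0.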
Stochastic Gradient Descent
In SGD, the parameter update happens for each training example and label.
for i in range(epochs):
    for sample in data:
        params_gradient = evaluate_gradient(loss_function, sample, params)
        params = params - learning_rate * params_gradient
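A runnable sketch of the same toy problem (fit y = w·x), now with one update per sample; shuffling each epoch is a common addition, and all numbers are made up:

```python
import random

# Same toy dataset as a batch example: y = 2 * x exactly, true w is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w, learning_rate = 0.0, 0.05
random.seed(0)
for _ in range(200):
    random.shuffle(data)            # visit samples in a different order each epoch
    for x, y in data:
        grad = (w * x - y) * x      # gradient of 0.5*(w*x - y)^2 for ONE sample
        w -= learning_rate * grad   # update immediately, per sample
```

The per-sample updates are noisier than batch updates, but each epoch is cheap and the parameter still converges here.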
Loss functions quantify how well or how poorly the model is performing. In terms of optimization, a loss function is a convergence indicator, so choosing the right one is critical. In this blog we will look at some of the most common loss functions.
from sklearn.metrics import hinge_loss
from sklearn.metrics import log_loss
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from scipy.stats import entropy
Predicting continuous values
Mean Squared Error — Also known as L2 loss. This is used when the target follows a Gaussian distribution. Because errors are squared, this loss penalizes large errors heavily, which makes it sensitive to outliers. It is one of the most common loss functions. …
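A small sketch of the computation, with made-up values; this mirrors what sklearn’s `mean_squared_error` (imported above) returns:

```python
# Mean squared error: average of squared differences between truth and prediction.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, -0.5, 2.0]
y_pred = [2.5, 0.0, 2.0]
error = mse(y_true, y_pred)
```

Note how the single 0.5-sized errors contribute 0.25 each: doubling an error quadruples its contribution.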
Optimization algorithms like Stochastic Gradient Descent (SGD) depend on the initial values of the parameters. Initial values, when chosen wisely, help avoid slow convergence and ensure that we don’t keep oscillating around the minima. In simple terms, weight initialization prevents activation outputs from exploding or vanishing during the forward pass of the neural network. In this blog we will look at some initialization techniques:
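As a sketch of one such technique, here is Xavier/Glorot uniform initialization for a single layer, written with the standard library only; the layer sizes are made-up:

```python
import math
import random

def xavier_uniform(fan_in, fan_out, seed=0):
    """Xavier/Glorot uniform init: sample weights from U(-limit, limit),
    where limit = sqrt(6 / (fan_in + fan_out)). Keeps the variance of
    activations roughly constant across layers."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

weights = xavier_uniform(64, 32)   # a hypothetical 64-in, 32-out layer
```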
A data platform is an integrated technology solution that makes data accessible. A good data platform helps create information and puts that information in the hands of the people who can use it.
Data is a precious thing and will last longer than the systems themselves.
A working data platform will consist of five major services/frameworks.
Data is produced by systems in various places and in various formats. A good data platform should have an ingestion service with plug and…
Activation functions, also known as transfer functions, decide whether the input to a neuron is relevant or not. These functions are applied at the hidden layers to introduce nonlinearity, and this nonlinearity helps the network learn complex relationships. Activation functions are also used at the output layer.
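A few of the most common activation functions, sketched with the standard library (each is applied elementwise to a neuron’s input):

```python
import math

def relu(x):
    """Rectified linear unit: passes positives through, zeroes out negatives."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real input into (0, 1); common at output layers for probabilities."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes any real input into (-1, 1); zero-centered, unlike sigmoid."""
    return math.tanh(x)
```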
In my previous blog we looked at the basics of Airflow. This blog will cover some advanced topics.
Airflow allows missed DAG Runs to be scheduled again so that pipelines catch up on schedules that were missed for some reason. It also allows manually rerunning DAGs for past dates and backfilling those runs. Backfill and Catchup are confusing at first glance. In this blog we will understand the concepts. But before we start, we need a refresher on “start_date” and “execution_date”.
start_date — the date at which the DAG will start being scheduled
schedule_interval — the interval of…
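A minimal DAG sketch showing how these settings fit together (the dag and task ids are made up, and the API shown is the Airflow 2.x style; treat this as configuration-as-code, not a complete pipeline):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="catchup_demo",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # scheduler creates a DAG Run for every missed interval
) as dag:
    EmptyOperator(task_id="noop")
```

With catchup=True, enabling this DAG after its start_date makes the scheduler create one run per missed daily interval; with catchup=False, only the latest interval runs.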
Airflow is an open-source workflow management platform. We define those workflows as DAGs, i.e. “configuration as code”, written in Python.
At a high level, Apache Airflow has the following components talking to each other.