When a minority in a group demonstrates self-confidence, the majority _____

Q: What is the tendency for group members to make more extreme decisions (for example, toward greater danger)?

Group polarization refers to the tendency for a group to make decisions that are more extreme than the initial inclination of its members.

Q: Which of the following is one of the three determinants of minority influence?

The three determinants of minority influence are: consistency, self-confidence, and defection.

Q: What is the loss of self-awareness in groups?

Deindividuation: the loss of self-awareness and self-restraint occurring in group situations that foster arousal and anonymity.

Q: What refers to the tendency of influential groups to suppress dissent in order to maintain group harmony?

Groupthink is a psychological phenomenon that occurs within a group of people in which the desire for harmony or conformity in the group results in an irrational or dysfunctional decision-making outcome.



A

A/B testing

A statistical way of comparing two (or more) techniques—the A and the B. Typically, the A is an existing technique, and the B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.

A/B testing usually compares a single metric on two techniques; for example, how does model accuracy compare for two techniques? However, A/B testing can also compare any finite number of metrics.

accuracy

#fundamentals

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}} {\text{correct predictions + incorrect predictions }}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{40}{40 + 10} = 80\%$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

TP is the number of true positives (correct predictions).
TN is the number of true negatives (correct predictions).
FP is the number of false positives (incorrect predictions).
FN is the number of false negatives (incorrect predictions).

Compare and contrast accuracy with precision and recall.

Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category  Number
TP        0
TN        36500
FP        0
FN        25

The model never predicts snow, so it has no true positives and no false positives; the 25 snowy days it misses are false negatives.

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36500) / (0 + 36500 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.
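The snow example can be checked with a few lines of Python. This is a minimal sketch, not part of the glossary: the helper names `accuracy` and `recall` are illustrative, and the counts assume that a model which always predicts "no snow" produces no false positives, so the 25 missed snowy days are counted as false negatives.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Fraction of actual positives that the model found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# A century of "no snow" predictions: every one of the 25 snowy days is missed.
tp, tn, fp, fn = 0, 36500, 0, 25

snow_accuracy = accuracy(tp, tn, fp, fn)
snow_recall = recall(tp, fn)

print(f"accuracy = {snow_accuracy:.4f}")
print(f"recall   = {snow_recall:.4f}")
```

The accuracy is close to 1.0 while the recall is 0.0, which is why accuracy alone says little about a model trained on class-imbalanced data.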

action

#rl

In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

activation function

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include ReLU and sigmoid.

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines: a flat line at 0 for negative inputs and a line with slope 1 for positive inputs.

A plot of the sigmoid activation function is an S-shaped curve that rises from 0 toward 1.

In a neural network, activation functions manipulate the weighted sum of all the inputs to a neuron. To calculate a weighted sum, the neuron adds up the products of the relevant values and weights. For example, suppose the relevant input to a neuron consists of the following:

input value  input weight
2            -1.3
-1           0.6
3            0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

Suppose the designer of this neural network chooses the sigmoid function to be the activation function. In that case, the neuron calculates the sigmoid of -2.0, which is approximately 0.12. Therefore, the neuron passes 0.12 (rather than -2.0) to the next layer in the neural network.
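The arithmetic in this example can be reproduced with a short sketch. The variable names are illustrative, and `sigmoid` is written out by hand here rather than taken from any ML library.

```python
import math


def sigmoid(x):
    """Standard logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


input_values = [2, -1, 3]
input_weights = [-1.3, 0.6, 0.4]

# Weighted sum: add up the products of each value and its weight.
weighted_sum = sum(v * w for v, w in zip(input_values, input_weights))

# The neuron passes the activation, not the raw weighted sum, to the next layer.
activation = sigmoid(weighted_sum)

print(f"weighted sum = {weighted_sum:.1f}")
print(f"activation   = {activation:.2f}")
```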

active learning

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

AdaGrad

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. For a full explanation, see this paper.

agent

#rl

In reinforcement learning, the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.

agglomerative clustering

#clustering

See hierarchical clustering.

anomaly detection

The process of identifying outliers. For example, if the mean for a certain feature is 100 with a standard deviation of 10, then anomaly detection should flag a value of 200 as suspicious.

AR

Abbreviation for augmented reality.

area under the PR curve

See PR AUC (area under the PR curve).

area under the ROC curve

See AUC (Area under the ROC curve).

artificial general intelligence

A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.

artificial intelligence

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

attention

#language

Any of a wide range of neural network architecture mechanisms that aggregate information from a set of inputs in a data-dependent manner. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.

Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.

attribute

#fairness

Synonym for feature.

In machine learning fairness, attributes often refer to characteristics pertaining to individuals.

attribute sampling

#df

A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.

AUC (Area under the ROC curve)

#fundamentals

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, a classifier model that separates every positive example from every negative example has an AUC of 1.0. Such a model is unrealistically perfect.

Conversely, a classifier model that generates random results has an AUC of 0.5.

Yes, that model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, a model that separates positives from negatives somewhat has an AUC somewhere between 0.5 and 1.0.

AUC ignores any value you set for the classification threshold. Instead, AUC considers all possible classification thresholds.

AUC represents the area under an ROC curve. For example, the ROC curve for a model that perfectly separates positives from negatives rises straight up the left edge of the plot and then runs along the top, so the region under the curve is the full unit square. In this unusual case, the area is simply the length of that region (1.0) multiplied by its width (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classifier that can't separate classes at all is the diagonal from (0, 0) to (1, 1). The area under that diagonal is 0.5.

A more typical ROC curve bows above the diagonal, toward the upper-left corner. It would be painstaking to calculate the area under such a curve manually, which is why a program typically calculates most AUC values.

AUC is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
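The probabilistic definition suggests a direct, if slow, way to compute AUC: compare every positive example with every negative example. The following is a minimal sketch under that definition. The function name `pairwise_auc` is illustrative, and counting a tied pair as half a win is a common convention assumed here, not something the definition above states.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    scores: model outputs, higher means "more likely positive".
    labels: 1 for a positive example, 0 for a negative example.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed tie-breaking convention
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
perfect = pairwise_auc([0.1, 0.2, 0.8, 0.9], labels)   # positives always outrank negatives
partial = pairwise_auc([0.1, 0.6, 0.4, 0.9], labels)   # one pair is ranked the wrong way
constant = pairwise_auc([0.5, 0.5, 0.5, 0.5], labels)  # model cannot separate the classes

print(perfect, partial, constant)
```

Production code would normally use a library routine that sorts the scores instead of comparing all pairs; the quadratic loop is only meant to mirror the definition.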

augmented reality

#image

A technology that superimposes a computer-generated image on a user's view of the real world, thus providing a composite view.

automation bias

#fairness

When a human decision maker favors recommendations made by an automated decision-making system over information made without automation, even when the automated decision-making system makes errors.

average precision

A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).

See also area under the PR curve.
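As a rough illustration of the calculation described in this entry, the sketch below walks down a ranked list and averages the precision measured at each relevant result. The function name and the 0/1 relevance encoding are assumptions made for the example, and it assumes every relevant result appears somewhere in the list.

```python
def average_precision(relevance):
    """Average of the precision values at each relevant result.

    relevance: list of 0/1 flags in ranked order (1 means the result is relevant).
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant results so far / results so far.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevant results at ranks 1 and 3: precision values are 1/1 and 2/3.
ap = average_precision([1, 0, 1, 0, 0])
print(f"average precision = {ap:.4f}")
```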

axis-aligned condition

#df

In a decision tree, a condition that involves only a single feature. For example, if area is a feature, then the following is an axis-aligned condition:

area > 200

Contrast with oblique condition.

B

backpropagation

#fundamentals

The algorithm that implements gradient descent in neural networks.

Training a neural network involves many iterations of the following two-pass cycle:

During the forward pass, the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s).

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter. For more details, see this tutorial in Machine Learning Crash Course.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like TensorFlow now implement backpropagation for you. Phew!
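For readers curious what that hand-written code looked like, here is a deliberately tiny sketch: one sigmoid neuron, one training example, and squared loss. The model, the loss, and all names and numbers are assumptions chosen to keep the chain rule visible; real networks apply the same idea across many neurons and layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, x, label, learning_rate):
    """One forward pass and one backward pass for a single sigmoid neuron."""
    # Forward pass: compute the prediction and its loss.
    z = w * x + b
    prediction = sigmoid(z)
    loss = (prediction - label) ** 2

    # Backward pass: chain rule, dloss/dw = dloss/dpred * dpred/dz * dz/dw.
    dloss_dpred = 2.0 * (prediction - label)
    dpred_dz = prediction * (1.0 - prediction)
    dloss_dw = dloss_dpred * dpred_dz * x
    dloss_db = dloss_dpred * dpred_dz

    # The learning rate scales how far each parameter moves.
    w -= learning_rate * dloss_dw
    b -= learning_rate * dloss_db
    return w, b, loss


w, b = 0.5, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, label=1.0, learning_rate=0.5)
    losses.append(loss)

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```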

bagging

#df

A method to train an ensemble where each constituent model trains on a random subset of training examples sampled with replacement. For example, a random forest is a collection of decision trees trained with bagging.

The term bagging is short for bootstrap aggregating.

bag of words

#language

A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:

the dog jumps
jumps the dog
dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indices corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:

A 1 to indicate the presence of a word.
A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
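The count-based variant can be sketched in a few lines. The eight-word vocabulary and the function name are made up for this example, and a plain list stands in for the sparse vector that a real system would use with a large vocabulary.

```python
from collections import Counter


def bag_of_words(phrase, vocabulary):
    """Return per-word counts in vocabulary order, ignoring word order in the phrase."""
    counts = Counter(phrase.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary: one index per known word.
vocabulary = ["a", "dog", "fur", "is", "jumps", "maroon", "the", "with"]

v1 = bag_of_words("the dog jumps", vocabulary)
v2 = bag_of_words("jumps the dog", vocabulary)
v3 = bag_of_words("dog jumps the", vocabulary)
counted = bag_of_words("the maroon dog is a dog with maroon fur", vocabulary)

print(v1)
print(counted)
```

The three orderings of the dog jumps map to the same vector, and the longer phrase yields a count of 2 for both maroon and dog.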

baseline

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

batch

#fundamentals

The set of examples used in one training iteration. The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

batch normalization

Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:

Make neural networks more stable by protecting against outlier weights.
Enable higher learning rates, which can speed training.
Reduce overfitting.

batch size

#fundamentals

The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.

The following are popular batch size strategies:

Stochastic Gradient Descent (SGD), in which the batch size is 1.
full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
mini-batch, in which the batch size is usually between 10 and 1000. Mini-batch is usually the most efficient strategy.

Bayesian neural network

A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a model predicts a house price of 853,000. In contrast, a Bayesian neural network predicts a distribution of values; for example, a model predicts a house price of 853,000 with a standard deviation of 67,200. A Bayesian neural network relies on Bayes' Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.

Bayesian optimization

A probabilistic regression model technique for optimizing computationally expensive objective functions by instead optimizing a surrogate that quantifies the uncertainty via a Bayesian learning technique. Since Bayesian optimization is itself very expensive, it is usually used to optimize expensive-to-evaluate tasks that have a small number of parameters, such as selecting hyperparameters.

Bellman equation

#rl

In reinforcement learning, the following identity satisfied by the optimal Q-function:

\[Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'|s,a} \max_{a'} Q(s', a')\]

Reinforcement learning algorithms apply this identity to create Q-learning via the following update rule:

\[Q(s, a) \gets Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]\]

Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman Equation.
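The update rule can be written as a small tabular sketch. Everything concrete here is an assumption for illustration: the state and action names, the dictionary used as the Q-table, and the values chosen for alpha (the learning rate) and gamma (the discount factor).

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one Q-learning update to the table `q` and return the new value."""
    # max over a' of Q(s', a')
    best_next = max(q[(next_state, a)] for a in actions)
    # r(s, a) + gamma * max_a' Q(s', a')
    target = reward + gamma * best_next
    # Move Q(s, a) a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)           # unseen (state, action) pairs start at 0.0
actions = ["left", "right"]
q[("s1", "right")] = 10.0        # pretend this value was learned earlier

new_value = q_update(
    q, state="s0", action="right", reward=1.0,
    next_state="s1", actions=actions, alpha=0.5, gamma=0.9,
)
print(new_value)
```

With these numbers the target is 1.0 + 0.9 * 10.0 = 10.0, so the entry for ("s0", "right") moves halfway from 0.0 to 10.0.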

BERT (Bidirectional Encoder Representations from Transformers)

#language

A model architecture for text representation. A trained BERT model can act as part of a larger model for text classification or other ML tasks.

BERT has the following characteristics:

Uses the Transformer architecture, and therefore relies on self-attention.
Uses the encoder part of the Transformer. The encoder's job is to produce good text representations, rather than to perform a specific task like classification.
Is bidirectional.
Uses masking for unsupervised training.

BERT's variants include:

ALBERT, which is an acronym for A Light BERT.
LaBSE.

See Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing for an overview of BERT.

bias (ethics/fairness)

#fairness

#fundamentals

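1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include automation bias, confirmation bias, experimenter's bias, group attribution bias, implicit bias, in-group bias, and out-group homogeneity bias.

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include coverage bias, non-response bias, participation bias, reporting bias, sampling bias, and selection bias.

Not to be confused with the bias term in machine learning models or prediction bias.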

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either b or w_0.

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, a line that crosses the y-axis at 2 has a bias of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.

Bias is not to be confused with bias in ethics and fairness or prediction bias.

bigram

#seq

#language

An N-gram in which N=2.

bidirectional

#language

A term used to describe a system that evaluates the text that both precedes and follows a target section of text. In contrast, a unidirectional system only evaluates the text that precedes a target section of text.

For example, consider a masked language model that must determine probabilities for the word or words representing the underline in the following question:

What is the _____ with you?

A unidirectional language model would have to base its probabilities only on the context provided by the words "What", "is", and "the". In contrast, a bidirectional language model could also gain context from "with" and "you", which might help the model generate better predictions.

bidirectional language model

#language

A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.

binary classification

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes: the positive class or the negative class.

For example, the following two machine learning models each perform binary classification:

A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification.

See also logistic regression and classification threshold.

binary condition

#df

In a decision tree, a condition that has only two possible outcomes, typically yes or no. For example, the following is a binary condition:

temperature >= 100

Contrast with non-binary condition.

binning

Synonym for bucketing.

BLEU (Bilingual Evaluation Understudy)

#language

A score between 0.0 and 1.0, inclusive, indicating the quality of a translation between two human languages (for example, between English and Russian). A BLEU score of 1.0 indicates a perfect translation; a BLEU score of 0.0 indicates a terrible translation.

boosting

A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as "weak" classifiers) into a classifier with high accuracy (a "strong" classifier) by upweighting the examples that the model is currently misclassifying.

bounding box

#image

In an image, the (x, y) coordinates of a rectangle around an area of interest, such as a dog in a photograph.

broadcasting

Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For instance, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m, n) by replicating the same values down each column.

For example, given the following definitions, linear algebra prohibits A+B because A and B have different dimensions:

A = [[7, 10, 4],
     [13, 5, 9]]
B = [2]

However, broadcasting enables the operation A+B by virtually expanding B to:

 [[2, 2, 2],
  [2, 2, 2]]

Thus, A+B is now a valid operation:

[[7, 10, 4],  +  [[2, 2, 2],  =  [[ 9, 12, 6],
 [13, 5, 9]]      [2, 2, 2]]      [15, 7, 11]]

See the following description of broadcasting in NumPy for more details.
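To make the "virtual expansion" concrete, the sketch below imitates it with plain Python lists. The function `broadcast_add` is a hypothetical helper written for this example; array libraries such as NumPy perform the equivalent expansion automatically and cover many more shape combinations than the two handled here.

```python
def broadcast_add(matrix, vector):
    """Add `vector` to every row of `matrix`.

    A length-1 vector is stretched to the row length first, which mimics how
    broadcasting expands B = [2] before the addition.
    """
    n_cols = len(matrix[0])
    if len(vector) == 1:
        vector = vector * n_cols  # [2] -> [2, 2, 2]
    if len(vector) != n_cols:
        raise ValueError("shapes are not compatible for broadcasting")
    # The same (expanded) vector is reused for each row of the matrix.
    return [[a + b for a, b in zip(row, vector)] for row in matrix]


A = [[7, 10, 4],
     [13, 5, 9]]
B = [2]

result = broadcast_add(A, B)
print(result)
```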

bucketing

#fundamentals

Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

<= 10 degrees Celsius would be the "cold" bucket.
11 - 24 degrees Celsius would be the "temperate" bucket.
>= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.

If you represent temperature as a continuous feature, then the model treats temperature as a single feature. If you represent temperature as three buckets, then the model treats each bucket as a separate feature. That is, a model can learn separate relationships of each bucket to the label. For example, a model can learn separate weights for each bucket.

Increasing the number of buckets makes your model more complicated by increasing the number of relationships that your model must learn. For example, the cold, temperate, and warm buckets are essentially three separate features for your model to train on. If you decide to add two more buckets--for example, freezing and hot--your model would now have to train on five separate features.

How do you know how many buckets to create, or what the ranges for each bucket should be? The answers typically require a fair amount of experimentation.
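The three temperature buckets can be expressed as a small function that turns one continuous value into three binary features. This is only a sketch: the function name and the output order are assumptions, and because the ranges above are stated for whole degrees, values between 10 and 11 or between 24 and 25 are assigned here to the temperate and warm buckets respectively.

```python
def temperature_buckets(celsius):
    """Return binary features [cold, temperate, warm] for a temperature in Celsius."""
    if celsius <= 10:
        return [1, 0, 0]  # cold
    if celsius <= 24:
        return [0, 1, 0]  # temperate
    return [0, 0, 1]      # warm


# Two different temperatures in the same bucket produce identical features.
print(temperature_buckets(13))
print(temperature_buckets(22))
print(temperature_buckets(30))
```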

C

calibration layer

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

candidate generation

#recsystems

The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-ranking) reduce those 500 to a much smaller, more useful set of recommendations.

candidate sampling

A training-time optimization that calculates a probability for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For example, given an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all negatives.

categorical data

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:

red
yellow
green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.

Categorical features are sometimes called discrete features.

Contrast with numerical data.

causal language model

#language

Synonym for unidirectional language model.

See bidirectional language model to contrast different directional approaches in language modeling.

centroid

#clustering

The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.

centroid-based clustering

#clustering

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.

Contrast with hierarchical clustering algorithms.

checkpoint

Data that captures the state of a model's parameters at a particular training iteration. Checkpoints enable exporting model weights, or performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption).

class

#fundamentals

A category that a label can belong to. For example:

In a model that detects spam, the two classes might be spam and not spam.
In a model that identifies dog breeds, the classes might be poodle, beagle, pug, and so on.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

classification model

#fundamentals

A model whose prediction is a class. For example, the following are all classification models:

A model that predicts an input sentence's language (French? Spanish? Italian?).
A model that predicts tree species (Maple? Oak? Baobab?).
A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

binary classification
multi-class classification

classification threshold

#fundamentals

In a binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

If this raw value is greater than the classification threshold, then the positive class is predicted.
If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.
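The thresholding rule above can be sketched in a few lines. The `classify` function is a hypothetical helper, and treating a raw value exactly equal to the threshold as negative is an assumption of this sketch, since the entry only defines the strictly greater and strictly less cases.

```python
def classify(raw_value: float, threshold: float) -> str:
    """Map a raw model output in [0, 1] to a class name."""
    # Assumption: a raw value exactly equal to the threshold is negative.
    return "positive" if raw_value > threshold else "negative"


print(classify(0.9, threshold=0.8))  # positive
print(classify(0.7, threshold=0.8))  # negative
```

Calling the same function with a different threshold shows why changing the threshold can flip a prediction even when the raw value stays the same.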

The choice of classification threshold strongly influences the number of false positives and false negatives.

As models or datasets evolve, engineers sometimes also change the classification threshold. When the classification threshold changes, positive class predictions can suddenly become negative classes and vice-versa.

For example, consider a binary classification disease prediction model. Suppose that when the system runs in the first year:

The raw value for a particular patient is 0.95.
The classification threshold is 0.94.

Therefore, the system diagnoses the positive class. (The patient gasps, "Oh no! I'm sick!")

A year later, perhaps the values now look as follows:

The raw value for the same patient remains at 0.95.
The classification threshold changes to 0.97.

Therefore, the system now reclassifies that patient as the negative class. ("Happy day! I'm not sick.") Same patient. Different diagnosis.

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

1,000,000 negative labels
10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is not class-imbalanced because the ratio of negative labels to positive labels is relatively close to 1:

517 negative labels
483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

1,000,000 labels with class "green"
200 labels with class "purple"
350 labels with class "orange"

clipping

#fundamentals

A technique for handling outliers by doing either or both of the following:

Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

Clip all values over 60 (the maximum threshold) to be exactly 60.
Clip all values under 40 (the minimum threshold) to be exactly 40.
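A minimal sketch of the 40 to 60 example, using only the standard library. The `clip` helper and the sample feature values are hypothetical.

```python
def clip(values, minimum, maximum):
    """Force every value into the closed range [minimum, maximum]."""
    return [max(minimum, min(maximum, v)) for v in values]


# Hypothetical feature values; 12 and 93 are the outliers.
print(clip([12, 40, 47.5, 60, 93], minimum=40, maximum=60))
# [40, 40, 47.5, 60, 60]
```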

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.

Cloud TPU

#TensorFlow

#GoogleCloud

A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud Platform.

clustering

#clustering

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid, as in the following diagram:

A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center point, illustrated as follows:

co-adaptation

When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network's behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.

collaborative filtering

#recsystems

Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.

condition

#df

In a decision tree, any node that evaluates an expression. For example, the following portion of a decision tree contains two conditions:

A condition is also called a split or a test.

Contrast condition with leaf.

confirmation bias

#fairness

The tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.

Experimenter's bias is a form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.

confusion matrix

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

| | Tumor (predicted) | Non-Tumor (predicted) |
|---|---|---|
| Tumor (ground truth) | 18 (TP) | 1 (FN) |
| Non-Tumor (ground truth) | 6 (FP) | 452 (TN) |

The preceding confusion matrix shows the following:

Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

| | Setosa (predicted) | Versicolor (predicted) | Virginica (predicted) |
|---|---|---|---|
| Setosa (ground truth) | 88 | 12 | 0 |
| Versicolor (ground truth) | 6 | 141 | 7 |
| Virginica (ground truth) | 2 | 27 | 109 |

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
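As a hedged illustration, the following sketch derives precision, recall, and accuracy from the four cells of the tumor confusion matrix above. The `summarize` helper is hypothetical; the formulas are the standard definitions of these metrics.

```python
def summarize(tp, fn, fp, tn):
    """Compute common metrics from the cells of a 2x2 confusion matrix."""
    return {
        "precision": tp / (tp + fp),  # share of positive predictions that were right
        "recall": tp / (tp + fn),     # share of actual positives that were found
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Cell values taken from the tumor example: 18 TP, 1 FN, 6 FP, 452 TN.
metrics = summarize(tp=18, fn=1, fp=6, tn=452)
print(metrics["precision"])  # 0.75
print(round(metrics["recall"], 3))  # 0.947
print(round(metrics["accuracy"], 3))  # 0.985
```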

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature.

convenience sampling

Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to switch to a scientifically gathered dataset.

convergence

#fundamentals

A state reached when loss values change very little or not at all with each iteration. For example, the following loss curve suggests convergence at around 700 iterations:

A model converges when additional training will not improve the model.

In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping.

convex function

A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. For example, the following are all convex functions:

In contrast, the following function is not convex. Notice how the region above the graph is not a convex set:

A strictly convex function has at most one local minimum point, which, when it exists, is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.

A lot of the common loss functions, including the following, are convex functions:

L2 loss
Log Loss
L1 regularization
L2 regularization

Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly convex function. Similarly, many variations of stochastic gradient descent have a high probability (though not a guarantee) of finding a point close to the minimum of a strictly convex function.

The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.

Deep models are never convex functions. Remarkably, algorithms designed for convex optimization tend to find reasonably good solutions on deep networks anyway, even though those solutions are not guaranteed to be a global minimum.

convex optimization

The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and on solving those problems more efficiently.

For complete details, see Boyd and Vandenberghe, Convex Optimization.

convex set

A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For instance, the following two shapes are convex sets:

In contrast, the following two shapes are not convex sets:

convolution

#image

In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.

The term "convolution" in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.

Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.

convolutional filter

#image

One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.

convolutional layer

#image

A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:

The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:

convolutional neural network

#image

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:

convolutional layers
pooling layers
dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

convolutional operation

#image

The following two-step mathematical operation:

Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
Summation of all the values in the resulting product matrix.

For example, consider the following 5x5 input matrix:

Now imagine the following 2x2 convolutional filter:

Each convolutional operation involves a single 2x2 slice of the input matrix. For instance, suppose we use the 2x2 slice at the top-left of the input matrix. So, the convolution operation on this slice looks as follows:

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
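The two-step operation can be sketched in plain Python. The matrix and filter values below are hypothetical stand-ins (the original figures are not available here), and the sketch assumes a square input, a square filter, a stride of 1, and no padding.

```python
def convolutional_operation(input_slice, conv_filter):
    """Element-wise multiply two equally shaped matrices, then sum the products."""
    return sum(
        a * b
        for slice_row, filter_row in zip(input_slice, conv_filter)
        for a, b in zip(slice_row, filter_row)
    )


def convolve(input_matrix, conv_filter):
    """Run one convolutional operation per slice (stride 1, no padding)."""
    k = len(conv_filter)
    out_size = len(input_matrix) - k + 1
    return [
        [
            convolutional_operation(
                [row[c:c + k] for row in input_matrix[r:r + k]], conv_filter
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


# Hypothetical 5x5 input holding the numbers 1..25, and a 2x2 filter.
input_matrix = [[5 * r + c + 1 for c in range(5)] for r in range(5)]
conv_filter = [[1, 0], [0, 1]]

# Top-left 2x2 slice is [[1, 2], [6, 7]]: 1*1 + 2*0 + 6*0 + 7*1 = 8.
print(convolutional_operation([[1, 2], [6, 7]], conv_filter))  # 8
result = convolve(input_matrix, conv_filter)
print(len(result), len(result[0]))  # 4 4
```

With a 2x2 filter on a 5x5 input, the sketch performs 16 convolutional operations and yields a 4x4 output.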

cost

Synonym for loss.

co-training

A semi-supervised learning approach particularly useful when all of the following conditions are true:

The ratio of unlabeled examples to labeled examples in the dataset is high.
This is a classification problem (binary or multi-class).
The dataset contains two different sets of predictive features that are independent of each other and complementary.

Co-training essentially amplifies independent signals into a stronger signal. For instance, consider a classification model that categorizes individual used cars as either Good or Bad. One set of predictive features might focus on aggregate characteristics such as the year, make, and model of the car; another set of predictive features might focus on the previous owner's driving record and the car's maintenance history.

The seminal paper on co-training is Combining Labeled and Unlabeled Data with Co-Training by Blum and Mitchell.

counterfactual fairness

#fairness

A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See "When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness" for a more detailed discussion of counterfactual fairness.

coverage bias

#fairness

See selection bias.

crash blossom

#language

A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.

Just to clarify that mysterious headline:

Red Tape could refer to either of the following:
- An adhesive
- Excessive bureaucracy
Holds Up could refer to either of the following:
- Structural support
- Delays

critic

#rl

Synonym for Deep Q-Network.

cross-entropy

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
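As a rough sketch, cross-entropy between a true distribution and a predicted distribution over the same classes can be computed as follows. The `cross_entropy` helper and the example distributions are hypothetical, and the use of the natural logarithm is an assumption of this sketch.

```python
import math


def cross_entropy(true_dist, predicted_dist):
    """Return -sum(p * ln(q)) over classes with nonzero true probability."""
    return -sum(
        p * math.log(q)
        for p, q in zip(true_dist, predicted_dist)
        if p > 0  # terms with p == 0 contribute nothing to the sum
    )


# Hypothetical 3-class example whose true class is the second one.
true_dist = [0.0, 1.0, 0.0]
good_prediction = [0.1, 0.8, 0.1]
poor_prediction = [0.6, 0.2, 0.2]

# A prediction closer to the true distribution yields a lower cross-entropy.
print(cross_entropy(true_dist, good_prediction) < cross_entropy(true_dist, poor_prediction))  # True
```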

cross-validation

A mechanism for estimating how well a model would generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.

D

data analysis

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.

data augmentation

#image

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.

DataFrame

#fundamentals

A popular pandas datatype for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

A DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page.

data parallelism

A way of scaling training or inference that replicates an entire model onto multiple devices and then passes a subset of the input data to each device. Data parallelism can enable training and inference on very large batch sizes; however, data parallelism requires that the model be small enough to fit on all devices.

See also model parallelism.

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

a spreadsheet
a file in CSV (comma-separated values) format

Dataset API (tf.data)

#TensorFlow

A high-level API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.

For details about the Dataset API, see tf.data: Build TensorFlow input pipelines in the TensorFlow Programmer's Guide.

decision boundary

The separator between classes learned by a model in a binary or multi-class classification problem. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the orange class and the blue class:

decision forest

#df

A model created from multiple decision trees. A decision forest makes a prediction by aggregating the predictions of its decision trees. Popular types of decision forests include random forests and gradient boosted trees.

decision threshold

Synonym for classification threshold.

decision tree

#df

A supervised learning model composed of a set of conditions and leaves organized hierarchically. For example, the following is a decision tree:

deep model

#fundamentals

A neural network containing more than one hidden layer.

A deep model is also called a deep neural network.

Contrast with wide model.

decoder

#language

In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.

Decoders are often a component of a larger model, where they are frequently paired with an encoder.

In sequence-to-sequence tasks, a decoder starts with the internal state generated by the encoder to predict the next sequence.

Refer to Transformer for the definition of a decoder within the Transformer architecture.

deep neural network

Synonym for deep model.

Deep Q-Network (DQN)

#rl

In Q-learning, a deep neural network that predicts Q-functions.

Critic is a synonym for Deep Q-Network.

demographic parity

#fairness

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.
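A minimal sketch of the admissions example, assuming each group's decisions are available as a list of 1 (admitted) and 0 (rejected). The helper names, the sample decisions, and the optional `tolerance` parameter are hypothetical.

```python
def admission_rate(decisions):
    """Fraction of applicants admitted, given a list of 0/1 decisions."""
    return sum(decisions) / len(decisions)


def satisfies_demographic_parity(group_a, group_b, tolerance=0.0):
    """True when the two groups' admission rates differ by at most `tolerance`."""
    return abs(admission_rate(group_a) - admission_rate(group_b)) <= tolerance


# Hypothetical decisions: both groups are admitted at a rate of 50%.
lilliputians = [1, 0, 1, 0]
brobdingnagians = [1, 1, 0, 0, 1, 0]
print(satisfies_demographic_parity(lilliputians, brobdingnagians))  # True
```

Note that the check looks only at admission rates per group; it does not consider applicant qualifications, which matches the definition above.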

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but do not permit classification results for certain specified ground-truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

denoising

#language

A common approach to self-supervised learning in which:

Noise is artificially added to the dataset.
The model tries to remove the noise.

Denoising enables learning from unlabeled examples. The original dataset serves as the target or label and the noisy data as the input.

Some masked language models use denoising as follows:

Noise is artificially added to an unlabeled sentence by masking some of the tokens.
The model tries to predict the original tokens.

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

8 3 7 5 2 4 0 4 9 6

Contrast with sparse feature.

dense layer

Synonym for fully connected layer.

depth

#fundamentals

The sum of the following in a neural network:

the number of hidden layers
the number of output layers, which is typically 1
the number of any embedding layers

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer does not influence depth.

depthwise separable convolutional neural network (sepCNN)

#image

A convolutional neural network architecture based on Inception, but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).

To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.

derived label

Synonym for proxy label.

device

#TensorFlow

A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs.

dimension reduction

Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding vector.

dimensions

Overloaded term having any of the following definitions:

The number of levels of coordinates in a Tensor. For example:
- A scalar has zero dimensions; for example, "Hello".
- A vector has one dimension; for example, [3, 5, 7, 11].
- A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]].
You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.
The number of entries in a feature vector.
The number of elements in an embedding layer.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

Contrast with continuous feature.

discriminative model

A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is:

p(output | features, weights)

For example, a model that predicts whether an email is spam from features and weights is a discriminative model.

The vast majority of supervised learning models, including classification and regression models, are discriminative models.

Contrast with generative model.

discriminator

A system that determines whether examples are real or fake.

Alternatively, the subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake.

disparate impact

#fairness

Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.

For example, suppose an algorithm that determines a Lilliputian's eligibility for a miniature-home loan is more likely to classify them as “ineligible” if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact.

Contrast with disparate treatment, which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.

disparate treatment

#fairness

Factoring subjects' sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.

For example, consider an algorithm that determines Lilliputians’ eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian’s affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension.

Contrast with disparate impact, which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.

Warning: Because sensitive attributes are almost always correlated with other features the data may have, explicitly removing sensitive attribute information does not guarantee that subgroups will be treated equally. For example, removing sensitive demographic attributes from a training data set that still includes postal code as a feature may address disparate treatment of subgroups, but there still might be disparate impact upon these groups because postal code might serve as a proxy for other demographic information.

divisive clustering

#clustering

See hierarchical clustering.

downsampling

#image

Overloaded term that can mean either of the following:

Reducing the amount of information in a feature in order to train a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
Training on a disproportionately low percentage of over-represented class examples in order to improve model training on under-represented classes. For example, in a class-imbalanced dataset, models tend to learn a lot about the majority class and not enough about the minority class. Downsampling helps balance the amount of training on the majority and minority classes.
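The second meaning can be sketched as random sampling from the majority class. The `downsample_majority` helper, its `ratio` and `seed` parameters, and the example data are all hypothetical; real pipelines often also reweight the downsampled class, which this sketch does not attempt.

```python
import random


def downsample_majority(majority, minority, ratio, seed=0):
    """Keep at most `ratio` majority examples per minority example."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    keep = min(len(majority), ratio * len(minority))
    return rng.sample(majority, keep) + list(minority)


# Hypothetical class-imbalanced dataset: 1,000 negatives and 10 positives.
majority_examples = [("negative", i) for i in range(1000)]
minority_examples = [("positive", i) for i in range(10)]

balanced = downsample_majority(majority_examples, minority_examples, ratio=5)
print(len(balanced))  # 60
```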

DQN

#rl

Abbreviation for Deep Q-Network.

dropout regularization

A form of regularization useful in training neural networks. Dropout regularization removes a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

dynamic

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

A dynamic model (or online model) is a model that is retrained frequently or continuously.
Dynamic training (or online training) is the process of training frequently or continuously.
Dynamic inference (or online inference) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model.

Contrast with static model.

E

eager execution

#TensorFlow

A TensorFlow programming environment in which operations run immediately. In contrast, operations called in graph execution don't run until they are explicitly evaluated. Eager execution is an imperative interface, much like the code in most programming languages. Eager execution programs are generally far easier to debug than graph execution programs.

early stopping

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.

Early stopping may seem counterintuitive. After all, telling a model to halt training while the loss is still decreasing may seem like telling a chef to stop cooking before the dessert has fully baked. However, training a model for too long can lead to overfitting. That is, if you train a model too long, the model may fit the training data so closely that the model doesn't make good predictions on new examples.
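A simplified sketch of the stopping rule, assuming the validation loss is recorded once per epoch. The `early_stopping_epoch` helper and its `patience` parameter (how many non-improving epochs to tolerate) are assumptions of this sketch rather than part of the definition above.

```python
def early_stopping_epoch(validation_losses, patience=1):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # validation loss stopped improving
    return len(validation_losses) - 1  # never triggered; ran to the end


# Hypothetical validation losses: the loss starts rising at epoch 3.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7]))  # 3
```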

earth mover's distance (EMD)

A measure of the relative similarity between two documents. The lower the earth mover's distance, the more similar the documents.

embedding layer

#language

#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. For example, perhaps baobab would be represented by a vector holding a single 1 (in the baobab position) and 72,999 zeros.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.

In certain situations, hashing is a reasonable alternative to an embedding layer.

embedding space

#language

The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Ideally, the embedding space contains a structure that yields meaningful mathematical results; for example, in an ideal embedding space, addition and subtraction of embeddings can solve word analogy tasks.

The dot product of two embeddings is a measure of their similarity.

embedding vector

#language

Broadly speaking, an array of floating-point numbers taken from any hidden layer that describe the inputs to that hidden layer. Often, an embedding vector is the array of floating-point numbers trained in an embedding layer. For example, suppose an embedding layer must learn an embedding vector for each of the 73,000 tree species on Earth. Perhaps the following array is the embedding vector for a baobab tree:

An embedding vector is not a bunch of random numbers. An embedding layer determines these values through training, similar to the way a neural network learns other weights during training. Each element of the array is a rating along some characteristic of a tree species. Which element represents which tree species' characteristic? That's very hard for humans to determine.

The mathematically remarkable part of an embedding vector is that similar items have similar sets of floating-point numbers. For example, similar tree species have a more similar set of floating-point numbers than dissimilar tree species. Redwoods and sequoias are related tree species, so they'll have a more similar set of floating-point numbers than redwoods and coconut palms. The numbers in the embedding vector will change each time you retrain the model, even if you retrain the model with identical input.

empirical risk minimization (ERM)

Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

encoder

#language

In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.

Encoders are often a component of a larger model, where they are frequently paired with a decoder. Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder.

Some systems use the encoder's output as the input to a classification or regression network.

In sequence-to-sequence tasks, an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence.

Refer to Transformer for the definition of an encoder in the Transformer architecture.

ensemble

A collection of models trained independently whose predictions are averaged or aggregated. In many cases, an ensemble produces better predictions than a single model. For example, a random forest is an ensemble built from multiple decision trees. Note that not all decision forests are ensembles.

entropy

#df

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

H = -p log p - q log q = -p log p - (1-p) * log (1-p)

where:

H is the entropy.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = (1 - p)
log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:

100 examples contain the value "1"
300 examples contain the value "0"

Therefore, the entropy value is:

p = 0.25
q = 0.75
H = (-0.25)log2(0.25) - (0.75)log2(0.75) = 0.81

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.
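The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, not a library routine; the function name `binary_entropy` is an assumption made for illustration.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-valued set, where p is the fraction of "1" examples."""
    if p in (0.0, 1.0):
        return 0.0  # by convention, 0 * log2(0) is treated as 0
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)

print(round(binary_entropy(100 / 400), 2))  # 100 "1"s and 300 "0"s -> 0.81
print(binary_entropy(200 / 400))            # perfectly balanced set -> 1.0
```

The early return handles a set containing only one value, where the formula would otherwise evaluate log2(0).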

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with:

gini impurity
cross-entropy loss function

Entropy is often called Shannon's entropy.

environment

#rl

In reinforcement learning, the world that contains the agent and allows the agent to observe that world's state. For example, the represented world can be a game like chess, or a physical world like a maze. When the agent applies an action to the environment, then the environment transitions between states.

episode

#rl

In reinforcement learning, each of the repeated attempts by the agent to learn an environment.

epoch

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N/batch size training iterations, where N is the total number of examples.

For instance, suppose the following:

The dataset consists of 1,000 examples.
The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations
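The same arithmetic as a small helper. This is a sketch only: the name `iterations_per_epoch` is hypothetical, and rounding up for a final partial batch is an assumption that the text above does not address.

```python
import math

def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of iterations needed to process every example once."""
    # Assumption: a final, smaller batch still counts as one iteration.
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(1_000, 50))  # 20
```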

epsilon greedy policy

#rl

In reinforcement learning, a policy that either follows a random policy with epsilon probability or a greedy policy otherwise. For example, if epsilon is 0.9, then the policy follows a random policy 90% of the time and a greedy policy 10% of the time.

Over successive episodes, the algorithm reduces epsilon’s value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.
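A minimal sketch of the idea, assuming the agent keeps a list of estimated values per action (`q_values`, a hypothetical name). The exponential decay schedule in `decayed_epsilon` is also an assumption; the text only says that epsilon is reduced over episodes, not how.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random policy: explore
    # Greedy policy: exploit the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda action: q_values[action])

def decayed_epsilon(initial, decay, episode, minimum=0.0):
    """Hypothetical schedule: shrink epsilon geometrically with each episode."""
    return max(minimum, initial * decay ** episode)
```

With `epsilon=0.0` the function always returns the greedy action, and with `epsilon=1.0` it always explores.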

equality of opportunity

#fairness

A fairness metric that checks whether, for a preferred label (one that confers an advantage or benefit to a person) and a given attribute, a classifier predicts that preferred label equally well for all values of that attribute. In other words, equality of opportunity measures whether the people who should qualify for an opportunity are equally likely to do so regardless of their group membership.

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians’ secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians’ secondary schools don’t offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

For example, let's say 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

            Qualified    Unqualified
Admitted    45           3
Rejected    45           7
Total       90           10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified):

            Qualified    Unqualified
Admitted    5            9
Rejected    5            81
Total       10           90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

Note: While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

Demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
Equalized odds: While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.

See "Equality of Opportunity in Supervised Learning" for a more detailed discussion of equality of opportunity. Also see "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for equality of opportunity.
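The check described in this entry can be sketched in a few lines. The counts come from Tables 1 and 2; the dictionary layout and function names are assumptions made for illustration.

```python
# Counts from Table 1 and Table 2, as (admitted, rejected) pairs.
applicants = {
    "Lilliputian":    {"qualified": (45, 45), "unqualified": (3, 7)},
    "Brobdingnagian": {"qualified": (5, 5),   "unqualified": (9, 81)},
}

def qualified_admission_rate(group):
    """Fraction of qualified applicants who are admitted."""
    admitted, rejected = applicants[group]["qualified"]
    return admitted / (admitted + rejected)

def overall_admission_rate(group):
    """Fraction of all applicants in the group who are admitted."""
    counts = applicants[group]
    admitted = counts["qualified"][0] + counts["unqualified"][0]
    total = sum(counts["qualified"]) + sum(counts["unqualified"])
    return admitted / total

for group in applicants:
    print(group, qualified_admission_rate(group), overall_admission_rate(group))
```

Equal values of `qualified_admission_rate` across groups correspond to equality of opportunity being satisfied, while the differing overall admission rates show why the groups are still admitted at different rates.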

equalized odds

#fairness

A fairness metric that checks if, for any particular label and attribute, a classifier predicts that label equally well for all values of that attribute.

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don’t offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally as likely to get admitted to the program, and if they are not qualified, they are equally as likely to get rejected.

Let’s say 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

            Qualified    Unqualified
Admitted    45           2
Rejected    45           8
Total       90           10

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified):

            Qualified    Unqualified
Admitted    5            18
Rejected    5            72
Total       10           90

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian have an 80% chance of being rejected.

Note: While equalized odds is satisfied here, demographic parity is not satisfied. Lilliputian and Brobdingnagian students are admitted to Glubbdubdrib University at different rates; 47% of Lilliputian students are admitted, and 23% of Brobdingnagian students are admitted.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."
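As a rough illustration of that definition, the sketch below checks both conditional rates for the counts in Tables 3 and 4. The data layout, function names, and the `satisfies_equalized_odds` helper are assumptions, not part of the cited paper.

```python
import math

# Counts from Table 3 and Table 4, as (admitted, rejected) pairs.
tables = {
    "Lilliputian":    {"qualified": (45, 45), "unqualified": (2, 8)},
    "Brobdingnagian": {"qualified": (5, 5),   "unqualified": (18, 72)},
}

def admit_rate_if_qualified(group):
    admitted, rejected = tables[group]["qualified"]
    return admitted / (admitted + rejected)

def reject_rate_if_unqualified(group):
    admitted, rejected = tables[group]["unqualified"]
    return rejected / (admitted + rejected)

def satisfies_equalized_odds(group_a, group_b):
    """True when both conditional rates match across the two groups."""
    return (
        math.isclose(admit_rate_if_qualified(group_a), admit_rate_if_qualified(group_b))
        and math.isclose(reject_rate_if_unqualified(group_a), reject_rate_if_unqualified(group_b))
    )

print(satisfies_equalized_odds("Lilliputian", "Brobdingnagian"))  # True
```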

Note: Contrast equalized odds with the more relaxed equality of opportunity metric.

Estimator

#TensorFlow

A deprecated TensorFlow API. Use instead of Estimators.

example

#fundamentals

The values of one row of features and possibly a label. Examples in supervised learning fall into two general categories:

A labeled example consists of one or more features and a label. Labeled examples are used during training.
An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

Features                                   Label
Temperature    Humidity    Pressure        Test score
15             47          998             Good
19             34          1020            Excellent
18             92          1012            Poor

Here are three unlabeled examples:

Temperature    Humidity    Pressure
12             62          1014
21             47          1017
19             41          1021

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features, such as feature crosses.

experience replay

#rl

In reinforcement learning, a technique used to reduce temporal correlations in training data. The agent stores state transitions in a replay buffer, and then samples transitions from the replay buffer to create training data.
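A minimal sketch of such a buffer, assuming a fixed capacity and uniform random sampling. The class name `ReplayBuffer` and the four-field transition tuple are illustrative choices, not taken from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of state transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self._transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self._transitions.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Sampling at random breaks up the time ordering of stored transitions.
        return rng.sample(list(self._transitions), batch_size)

    def __len__(self):
        return len(self._transitions)
```

Sampling without regard to insertion order is what reduces the temporal correlation between consecutive training examples.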

experimenter's bias

#fairness

See .

exploding gradient problem

#seq

The tendency for gradients in deep neural networks (especially recurrent neural networks) to become surprisingly steep (high). Steep gradients often cause very large updates to the weights of each node in a deep neural network.

Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping can mitigate this problem.

Compare to vanishing gradient problem.

F

fairness constraint

#fairness

Applying a constraint to an algorithm to ensure one or more definitions of fairness are satisfied. Examples of fairness constraints include:

Post-processing your model's output.
Altering the loss function to incorporate a penalty for violating a fairness metric.
Directly adding a mathematical constraint to an optimization problem.

fairness metric

#fairness

A mathematical definition of “fairness” that is measurable. Commonly used fairness metrics include equality of opportunity, equalized odds, and individual fairness.

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.

false negative (FN)

#fundamentals

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false negative rate

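The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} = \frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$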

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of four buckets and wind speed in one of three buckets.

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for instance, one temperature bucket independently of the training on, for instance, one wind speed bucket.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have 12 possible values, one for each pairing of a temperature bucket with a wind speed bucket.

Thanks to feature crosses, the model can learn mood differences between days that fall into different combinations of temperature and wind speed.

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product.

Feature crosses are mostly used with linear models and are rarely used with neural networks.
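A small sketch of a temperature and wind speed cross built as a Cartesian product. The bucket names below are hypothetical placeholders chosen for illustration, as is the helper `cross_index`.

```python
from itertools import product

# Hypothetical bucket names: four for temperature, three for wind speed.
temperature_buckets = ["cold", "cool", "mild", "hot"]
wind_buckets = ["calm", "breezy", "gusty"]

# The feature cross is the Cartesian product of the two bucket sets.
crossed_values = [f"{t}-{w}" for t, w in product(temperature_buckets, wind_buckets)]

def cross_index(temperature, wind):
    """Position of a (temperature, wind) pair within crossed_values."""
    return (temperature_buckets.index(temperature) * len(wind_buckets)
            + wind_buckets.index(wind))

print(len(crossed_values))  # 12
```

Because the number of crossed values is the product of the bucket counts, crossing two features with many buckets each grows the feature space quickly.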

feature engineering

#fundamentals

#TensorFlow

A process that involves the following steps:

Determining which features might be useful in training a model.
Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that a particular raw value in the dataset might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different ranges of that value.

Feature engineering is sometimes called feature extraction.

In TensorFlow, feature engineering often means converting raw log file entries to protocol buffers. See also tf.Transform.

feature extraction

Overloaded term having either of the following definitions:

Retrieving intermediate feature representations calculated by an unsupervised or pretrained model (for example, values in a hidden layer) for use in another model as input.
Synonym for feature engineering.

feature importances

#df

Synonym for .

feature set

#fundamentals

The group of features your machine learning model trains on. For example, postal code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices.

feature spec

#TensorFlow

Describes the information required to extract features data from the tf.Example protocol buffer. Because the tf.Example protocol buffer is just a container for data, you must specify the following:

the data to extract (that is, the keys for the features)
the data type (for example, float or int)
the length (fixed or variable)

feature vector

#fundamentals

The array of feature values comprising an example. The feature vector is input during training and during inference. For example, the feature vector for a model with two discrete features might be:

[0.92, 0.56]

Each example supplies different values for the feature vector, so the feature vector for the next example would hold a different pair of values.

Feature engineering determines how to represent features in the feature vector. For example, a binary categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

a binary categorical feature with five possible values represented with one-hot encoding
another binary categorical feature with three possible values represented with one-hot encoding
a floating-point feature

In this case, the feature vector for each example would be represented by nine values: five for the first feature, three for the second, and one for the third.
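The nine-value layout can be sketched by concatenating two one-hot encodings and one floating-point value. The specific positions and the value 8.3 below are hypothetical, chosen only to make the example concrete.

```python
def one_hot(index, size):
    """Return a list of `size` zeroes with a single 1.0 at position `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

first_feature = one_hot(1, 5)    # hypothetical: second of five possible values
second_feature = one_hot(2, 3)   # hypothetical: third of three possible values
float_feature = [8.3]            # hypothetical floating-point value

feature_vector = first_feature + second_feature + float_feature
print(feature_vector)       # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
print(len(feature_vector))  # 9
```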

federated learning

A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded.

Since the training examples are never uploaded, federated learning follows the privacy principles of focused data collection and data minimization.

For more information about federated learning, see this tutorial.

feedback loop

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

feedforward neural network (FFN)

A neural network without cyclic or recursive connections. For example, traditional deep neural networks are feedforward neural networks. Contrast with recurrent neural networks, which are cyclic.

few-shot learning

A machine learning approach, often used for object classification, designed to train effective classifiers from only a small number of training examples.

See also .

fine tuning

Performing a secondary optimization to adjust the parameters of an already trained model to fit a new problem. Fine tuning often refers to refitting the weights of a trained unsupervised model to a supervised model.

forget gate

#seq

The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget gates maintain context by deciding which information to discard from the cell state.

full softmax

Synonym for .

Contrast with .

fully connected layer

A hidden layer in which each node is connected to every node in the subsequent hidden layer.

A fully connected layer is also known as a dense layer.

G

GAN

Abbreviation for generative adversarial network.

generalization

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting.

You train a model on the examples in the training set. Consequently, the model learns the peculiarities of the data in the training set. Generalization essentially asks whether your model can make good predictions on examples that are not in the training set.

To encourage generalization, regularization helps a model train less exactly to the peculiarities of the data in the training set.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations.

A generalization curve can help you detect possible overfitting. For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

generalized linear model

A generalization of least squares regression models, which are based on Gaussian noise, to other types of models based on other types of noise, such as Poisson noise or categorical noise. Examples of generalized linear models include:

logistic regression
multi-class regression
least squares regression

The parameters of a generalized linear model can be found through convex optimization.

Generalized linear models exhibit the following properties:

The average prediction of the optimal least squares regression model is equal to the average label on the training data.
The average probability predicted by the optimal logistic regression model is equal to the average label on the training data.

The power of a generalized linear model is limited by its features. Unlike a deep model, a generalized linear model cannot "learn new features."

generative adversarial network (GAN)

A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid.

generative model

Practically speaking, a model that does either of the following:

Creates (generates) new examples from the training dataset. For example, a generative model could create poetry after training on a dataset of poems. The generator part of a generative adversarial network falls into this category.
Determines the probability that a new example comes from the training set, or was created from the same mechanism that created the training set. For example, after training on a dataset consisting of English sentences, a generative model could determine the probability that new input is a valid English sentence.

A generative model can theoretically discern the distribution of examples or particular features in a dataset.

Unsupervised learning models are generative.

Contrast with discriminative models.

generator

The subsystem within a generative adversarial network that creates new examples.

Contrast with .

GPT (Generative Pre-trained Transformer)

#language

A family of Transformer-based large language models developed by OpenAI.

GPT variants can apply to multiple modalities, including:

image generation (for example, ImageGPT)
text-to-image generation (for example, DALL-E).

gini impurity

#df

A metric similar to entropy. Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. There is no universally accepted equivalent term for the metric derived from gini impurity; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index, or simply gini.

Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

I = 1 - (p^2 + q^2) = 1 - (p^2 + (1-p)^2)

where:

I is the gini impurity.
p is the fraction of "1" examples.
q is the fraction of "0" examples. Note that q = 1-p

For example, consider the following dataset:

100 labels (0.25 of the dataset) contain the value "1"
300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

p = 0.25
q = 0.75
I = 1 - (0.25^2 + 0.75^2) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
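The calculation above is short enough to check directly. This is a minimal sketch for the two-valued case only; the function name `gini_impurity` is an assumption.

```python
def gini_impurity(p: float) -> float:
    """Gini impurity of a two-valued set, where p is the fraction of "1" examples."""
    q = 1.0 - p
    return 1.0 - (p ** 2 + q ** 2)

print(gini_impurity(100 / 400))  # 100 "1"s and 300 "0"s -> 0.375
print(gini_impurity(200 / 400))  # perfectly balanced -> 0.5
```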

gradient

The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

gradient boosting

#df

A training algorithm where weak models are trained to iteratively improve the quality (reduce the loss) of a strong model. For example, a weak model could be a linear or small decision tree model. The strong model becomes the sum of all the previously trained weak models.

In the simplest form of gradient boosting, at each iteration, a weak model is trained to predict the loss gradient of the strong model. Then, the strong model's output is updated by subtracting the predicted gradient, similar to gradient descent.

$$F_{0} = 0$$ $$F_{i+1} = F_i - \xi f_i $$

where:

$F_{0}$ is the starting strong model.
$F_{i+1}$ is the next strong model.
$F_{i}$ is the current strong model.
$\xi$ is a value between 0.0 and 1.0 called shrinkage, which is analogous to the learning rate in gradient descent.
$f_{i}$ is the weak model trained to predict the loss gradient of $F_{i}$.

Modern variations of gradient boosting also include the second derivative (Hessian) of the loss in their computation.

Decision trees are commonly used as weak models in gradient boosting. See gradient boosted (decision) trees.
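The update rule $F_{i+1} = F_i - \xi f_i$ can be sketched in plain Python. The choices here are assumptions made to keep the example short: the weak model is a least squares line rather than a tree, the loss is the squared loss $0.5 (F - y)^2$ (whose gradient with respect to $F$ is $F - y$), and the function names are hypothetical.

```python
def fit_line(xs, targets):
    """Least squares line through (xs, targets); plays the role of the weak model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(targets) / n
    slope = (sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_t - slope * mean_x
    return lambda x: slope * x + intercept

def gradient_boost(xs, ys, num_iterations=50, shrinkage=0.3):
    """Return a strong model built as a shrunken sum of weak models."""
    weak_models = []
    predictions = [0.0] * len(xs)  # F_0 = 0
    for _ in range(num_iterations):
        # Gradient of the squared loss with respect to the current predictions.
        gradients = [p - y for p, y in zip(predictions, ys)]
        weak = fit_line(xs, gradients)  # f_i predicts the loss gradient of F_i
        weak_models.append(weak)
        # F_{i+1} = F_i - shrinkage * f_i
        predictions = [p - shrinkage * weak(x) for p, x in zip(predictions, xs)]
    return lambda x: -shrinkage * sum(weak(x) for weak in weak_models)
```

For instance, `gradient_boost([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])` returns a model whose prediction at 2.0 is close to 5.0, since each iteration shrinks the remaining residual by a constant factor.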

gradient boosted (decision) trees (GBT)

#df

A type of decision forest in which:

Training relies on gradient boosting.
The weak model is a decision tree.

gradient clipping

#seq

A commonly used mechanism to mitigate the exploding gradient problem by artificially limiting (clipping) the maximum value of gradients when using gradient descent to train a model.
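One common variant rescales the whole gradient vector when its norm exceeds a threshold. The sketch below assumes that variant (clipping by L2 norm); clipping each component to a fixed range is another option. The name `clip_by_norm` is illustrative.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale gradients so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # already small enough; leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]

print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # unchanged: [0.3, 0.4]
```

Rescaling preserves the direction of the gradient and only limits the size of the update.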

gradient descent

#fundamentals

A mathematical technique to minimize loss. Gradient descent iteratively adjusts weights and biases, gradually finding the best combination to minimize loss.

Gradient descent is older—much, much older—than machine learning.
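A minimal sketch of gradient descent fitting one weight and one bias to a handful of points. The mean squared error loss, the learning rate, and the step count are assumptions picked so the example converges on this small dataset.

```python
def gradient_descent(xs, ys, learning_rate=0.1, steps=500):
    """Fit y = weight * x + bias by minimizing mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_bias = 2 * sum(errors) / n
        # Step against the gradient, which points toward steepest ascent.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias

weight, bias = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
print(round(weight, 3), round(bias, 3))  # 2.0 1.0
```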

graph

#TensorFlow

In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor) as an operand to another operation. Use TensorBoard to visualize a graph.

graph execution

#TensorFlow

A TensorFlow programming environment in which the program first constructs a graph and then executes all or part of that graph. Graph execution is the default execution mode in TensorFlow 1.x.

Contrast with eager execution.

greedy policy

#rl

In reinforcement learning, a policy that always chooses the action with the highest expected return.

ground truth

#fundamentals

Reality.

The thing that actually happened.

For example, consider a model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.

We assess model quality against ground truth. However, ground truth is not always completely, well, truthful. For example, consider the following examples of potential imperfections in ground truth:

In the graduation example, are we certain that the graduation records for each student are always correct? Is the university's record-keeping flawless?
Suppose the label is a floating-point value measured by instruments (for instance, barometers). How can we be sure that each instrument is calibrated identically or that each reading was taken under the same circumstances?
If the label is a matter of human opinion, how can we be sure that each human is evaluating events in the same way? To improve consistency, expert human raters sometimes intervene.

group attribution bias

#fairness

Assuming that what is true for an individual is also true for everyone in that group. The effects of group attribution bias can be exacerbated if a is used for data collection. In a non-representative sample, attributions may be made that do not reflect reality.

See also and .

H

hallucination

The production of plausible-seeming but factually incorrect output by a generative model that purports to be making an assertion about the real world. For example, if a dialog agent claims that Barack Obama died in 1865, the agent is hallucinating.

hashing

In machine learning, a mechanism for bucketing categorical data, particularly when the number of categories is large, but the number of categories actually appearing in the dataset is comparatively small.

For example, Earth is home to about 73,000 tree species. You could represent each of the 73,000 tree species in 73,000 separate categorical buckets. Alternatively, if only 200 of those tree species actually appear in a dataset, you could use hashing to divide tree species into perhaps 500 buckets.

A single bucket could contain multiple tree species. For example, hashing could place baobab and red maple—two genetically dissimilar species—into the same bucket. Regardless, hashing is still a good way to map large categorical sets into the desired number of buckets. Hashing turns a categorical feature having a large number of possible values into a much smaller number of values by grouping values in a deterministic way.
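A minimal sketch of deterministic bucketing, assuming a cryptographic digest from the standard library as the hash function. Any stable hash would serve; the function name `hash_bucket` and the choice of SHA-256 are illustrative.

```python
import hashlib

def hash_bucket(category: str, num_buckets: int = 500) -> int:
    """Map a category name to one of num_buckets buckets, deterministically."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# The same species always lands in the same bucket; different species may collide.
print(hash_bucket("baobab"), hash_bucket("red maple"))
```

Python's built-in `hash()` is avoided here because string hashes can differ between interpreter runs, which would break the deterministic grouping.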

heuristic

A simple and quickly implemented solution to a problem. For example, "With a heuristic, we achieved 86% accuracy. When we switched to a deep neural network, accuracy went up to 98%."

hidden layer

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons. For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

hierarchical clustering

#clustering

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering.

hinge loss

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classifier model:

$$y' = b + w_1x_1 + w_2x_2 + \dots + w_nx_n$$

Consequently, a plot of hinge loss vs. (y * y') looks as follows:
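The shape of that curve can be checked numerically. The sketch below simply evaluates the formula above at a few points; the function name `hinge_loss` is an assumption.

```python
def hinge_loss(y: int, y_prime: float) -> float:
    """Hinge loss for a true label y in {-1, +1} and raw classifier output y_prime."""
    return max(0.0, 1.0 - y * y_prime)

# Loss is 0 once y * y' reaches 1, and grows linearly below that point.
for y, y_prime in [(+1, 2.0), (+1, 1.0), (+1, 0.5), (-1, 0.5)]:
    print(y * y_prime, hinge_loss(y, y_prime))
```

Correctly classified examples with a margin of at least 1 contribute no loss, which is why only examples near or beyond the boundary influence training.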

holdout data

Examples intentionally not used ("held out") during training. The validation dataset and test dataset are examples of holdout data. Holdout data helps evaluate your model's ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does the loss on the training set.

hyperparameter

#fundamentals

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and bias that the model learns during training.

hyperplane

A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space.

I

i.i.d.

Abbreviation for independently and identically distributed.

image recognition

#image

A process that classifies object(s), pattern(s), or concept(s) in an image. Image recognition is also known as image classification.

For more information, see ML Practicum: Image Classification.

imbalanced dataset

Synonym for .

implicit bias

#fairness

Automatically making an association or assumption based on one’s mental models and memories. Implicit bias can affect the following:

How data is collected and classified.
How machine learning systems are designed and developed.

For example, when building a classifier to identify wedding photos, an engineer may use the presence of a white dress in a photo as a feature. However, white dresses have been customary only during certain eras and in certain cultures.

See also .

incompatibility of fairness metrics

#fairness

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn’t imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

See "On the (im)possibility of fairness" for a more detailed discussion of this topic.

independently and identically distributed (i.i.d)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. An i.i.d. is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be i.i.d. over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

See also .

individual fairness

#fairness

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define "similarity" (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student’s curriculum).

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

inference

#fundamentals

In machine learning, the process of making predictions by applying a trained model to unlabeled examples.

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

inference path

#df

In a decision tree, during inference, the route a particular example takes from the root to other conditions, terminating with a leaf. For instance, in the following decision tree, the thicker arrows show the inference path for an example with the following feature values:

x = 7
y = 12
z = -3

The inference path in the following illustration travels through three conditions before reaching the leaf.

The three thick arrows show the inference path.

information gain

#df

In decision forests, the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

entropy of parent node = 0.6
entropy of one child node with 16 relevant examples = 0.2
entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

information gain = entropy of parent node - weighted entropy sum of child nodes
information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
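The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `information_gain` and the `(entropy, example_count)` pair layout are assumptions made for this example.

```python
def information_gain(parent_entropy, children):
    """Return parent entropy minus the weighted sum of child entropies.

    children is a list of (entropy, example_count) pairs, one per child node.
    (Hypothetical helper written for this glossary example.)
    """
    total = sum(count for _, count in children)
    # Weight each child's entropy by its share of the examples.
    weighted = sum(entropy * count / total for entropy, count in children)
    return parent_entropy - weighted


# Values from the example above: 16 and 24 examples give weights 0.4 and 0.6.
gain = information_gain(0.6, [(0.2, 16), (0.1, 24)])
print(round(gain, 2))
```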

in-group bias

#fairness

Showing partiality to one's own group or own characteristics. If testers or raters consist of the machine learning developer's friends, family, or colleagues, then in-group bias may invalidate product testing or the dataset.

In-group bias is a form of . See also .

input layer

#fundamentals

The layer of a neural network that holds the feature vector. That is, the input layer provides examples for training or inference. For example, the input layer in the following neural network consists of two features:

in-set condition

#df

In a decision tree, a condition that tests for the presence of one item in a set of items. For example, a condition that checks whether the value of the house-style feature is one of a fixed set of house styles is an in-set condition.

During inference, if the value of the house-style feature is one of the items in the set, then this condition evaluates to Yes. If the value of the house-style feature is something else, then this condition evaluates to No.

In-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features.
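As a rough sketch, an in-set condition amounts to a set-membership test on a single categorical feature. The style names below are hypothetical values chosen for illustration, and `in_set_condition` is not part of any decision forest library.

```python
# Hypothetical set of house styles accepted by the condition.
ALLOWED_STYLES = frozenset({"tudor", "colonial", "cape"})


def in_set_condition(house_style, allowed=ALLOWED_STYLES):
    """Return True (Yes) if the feature value is in the set, else False (No)."""
    return house_style in allowed


print(in_set_condition("colonial"))  # value is in the set
print(in_set_condition("ranch"))     # value is not in the set
```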

instance

Synonym for .

interpretability

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

inter-rater agreement

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen's kappa, which is one of the most popular inter-rater agreement measurements.

intersection over union (IoU)

#image

The intersection of two sets divided by their union. In machine-learning image-detection tasks, IoU is used to measure the accuracy of the model’s predicted bounding box with respect to the ground-truth bounding box. In this case, the IoU for the two boxes is the ratio between the overlapping area and the total area, and its value ranges from 0 (no overlap of predicted bounding box and ground-truth bounding box) to 1 (predicted bounding box and ground-truth bounding box have the exact same coordinates).

For example, in the image below:

The predicted bounding box (the coordinates delimiting where the model predicts the night table in the painting is located) is outlined in purple.
The ground-truth bounding box (the coordinates delimiting where the night table in the painting is actually located) is outlined in green.

Here, the intersection of the bounding boxes for prediction and ground truth (below left) is 1, and the union of the bounding boxes for prediction and ground truth (below right) is 7, so the IoU is $\frac{1}{7}$.
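A minimal sketch of the IoU calculation for axis-aligned boxes follows. The `(x_min, y_min, x_max, y_max)` box layout and the coordinates are assumptions chosen so that the intersection is 1 and the union is 7, matching the ratio in the example above; they are not taken from the illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)     # hypothetical predicted bounding box
ground_truth = (1.0, 1.0, 3.0, 3.0)  # hypothetical ground-truth bounding box
print(iou(predicted, ground_truth))  # intersection 1, union 7
```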

IoU

Abbreviation for intersection over union.

item matrix

#recsystems

In recommendation systems, a matrix of embedding vectors generated by matrix factorization that holds latent signals about each item. Each row of the item matrix holds the value of a single latent feature for all items. For example, consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among genre, stars, movie age, or other factors.

The item matrix has the same number of columns as the target matrix that is being factorized. For example, given a movie recommendation system that evaluates 10,000 movie titles, the item matrix will have 10,000 columns.

items

#recsystems

In a recommendation system, the entities that a system recommends. For example, videos are the items that a video store recommends, while books are the items that a bookstore recommends.

iteration

#fundamentals

A single update of a model's parameters (the model's weights and biases) during training. The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network, a single iteration involves the following two passes:

A forward pass to evaluate loss on a single batch.
A backward pass (backpropagation) to adjust the model's parameters based on the loss and the learning rate.

K

Keras

A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras.

keypoints

#image

The coordinates of particular features in an image. For example, for an image recognition model that distinguishes flower species, keypoints might be the center of each petal, the stem, the stamen, and so on.

Kernel Support Vector Machines (KSVMs)

A classification algorithm that seeks to maximize the margin between positive and negative classes by mapping input data vectors to a higher dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between positive and negative classes, a KSVM could internally map those features into a million-dimension space. KSVMs use a loss function called hinge loss.

k-means

#clustering

A popular algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

Iteratively determines the best k center points (known as centroids).
Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The three centroids identify the mean height and mean width of each dog in that cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.

The preceding illustrations show k-means for examples with only two features (height and width). Note that k-means can group examples across many features.

k-median

#clustering

A clustering algorithm closely related to k-means. The practical difference between the two is as follows:

In k-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

Note that the definitions of distance are also different:

k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:

$$ {\text{Euclidean distance}} = {\sqrt {(2-5)^2 + (2-(-2))^2}} = 5 $$

k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:

$$ {\text{Manhattan distance}} = \lvert 2-5 \rvert + \lvert 2-(-2) \rvert = 7 $$
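The two distance definitions can be checked with a short sketch. The helper names `euclidean` and `manhattan` are illustrative, not taken from any clustering library.

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared deltas."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute deltas in each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


# The two points from the examples above.
p, q = (2, 2), (5, -2)
print(euclidean(p, q))  # sqrt(9 + 16)
print(manhattan(p, q))  # 3 + 4
```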

L

L1 regularization

A type of regularization that penalizes weights in proportion to the sum of the absolute value of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

Contrast with L2 regularization.

L2 loss

#fundamentals

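A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

| Actual value of example | Model's predicted value | Square of delta |
|---|---|---|
| 7 | 6 | 1 |
| 5 | 4 | 1 |
| 8 | 11 | 9 |
| 4 | 6 | 4 |
| 9 | 8 | 1 |
| | | 16 = L2 loss |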

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.

$$ L_2\text{ loss} = \sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.
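As a quick check of the totals quoted in this entry, the following sketch computes both L1 loss and L2 loss for the five-example batch. The function names are illustrative only.

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences between labels and predictions."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))


def l2_loss(actual, predicted):
    """Sum of squared differences between labels and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


# The batch of five examples from this entry.
actual = [7, 5, 8, 4, 9]
predicted = [6, 4, 11, 6, 8]
print(l1_loss(actual, predicted))  # 1 + 1 + 3 + 2 + 1
print(l2_loss(actual, predicted))  # 1 + 1 + 9 + 4 + 1
```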

L2 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features with values very close to 0 remain in the model but don't influence the model's prediction very much.

L2 regularization always improves generalization in linear models.

Contrast with L1 regularization.

label

#fundamentals

In supervised machine learning, the "answer" or "result" portion of an example.

Each labeled example consists of one or more features and a label. For instance, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

labeled example

#fundamentals

An example that contains one or more features and a label. For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

| Number of bedrooms | Number of bathrooms | House age | House price (label) |
|---|---|---|---|
| 3 | 2 | 15 | $345,000 |
| 2 | 1 | 72 | $179,000 |
| 4 | 2 | 34 | $392,000 |

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

Contrast labeled example with unlabeled examples.

LaMDA (Language Model for Dialogue Applications)

#language

A Transformer-based large language model developed by Google trained on a large dialogue dataset that can generate realistic conversational responses.

LaMDA: our breakthrough conversation technology provides an overview.

lambda

#fundamentals

Synonym for regularization rate.

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization.

landmarks

#image

Synonym for keypoints.

language model

#language

A model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens.

Though counterintuitive, many models that evaluate text are not language models. For example, text classification models and sentiment analysis models are not language models.

large language model

#language

An informal term with no strict definition that usually means a language model that has a high number of parameters. Some large language models contain over 100 billion parameters.

You might be wondering when a language model becomes large enough to be termed a large language model. Currently, there is no agreed-upon defining line for the number of parameters.

Most current large language models are based on the Transformer architecture.

layer

#fundamentals

A set of neurons in a neural network. Three common types of layers are as follows:

The input layer, which provides values for all the features.
One or more hidden layers, which find nonlinear relationships between the features and the label.
The output layer, which provides the prediction.

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

In TensorFlow, layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

Layers API (tf.layers)

#TensorFlow

A TensorFlow API for constructing a neural network as a composition of layers. The Layers API enables you to build different types of layers, such as:

tf.layers.Dense for a fully-connected layer.
tf.layers.Conv2D for a convolutional layer.

The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all functions in the Layers API have the same names and signatures as their counterparts in the Keras layers API.

leaf

#df

Any endpoint in a decision tree. Unlike a condition, a leaf does not perform a test. Rather, a leaf is a possible prediction. A leaf is also the terminal node of an inference path.

For example, the following decision tree contains three leaves:

learning rate

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration. For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter. If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence.

During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.
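The gradient step can be illustrated with a toy one-weight problem. Everything in this sketch is an assumption made for illustration: the loss (w - 3)^2, the starting weight, and the helper names are not part of the glossary.

```python
def gradient(w):
    """Derivative of the toy loss (w - 3)**2 with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_step(w, learning_rate):
    """The gradient step: learning rate multiplied by the gradient."""
    return learning_rate * gradient(w)


def update(w, learning_rate):
    """One iteration: move the weight against the gradient step."""
    return w - gradient_step(w, learning_rate)


w = 0.0  # hypothetical starting weight
small_step = gradient_step(w, 0.1)
large_step = gradient_step(w, 0.3)
# A learning rate of 0.3 adjusts the weight about three times as strongly as 0.1.
print(large_step / small_step)
print(update(w, 0.1), update(w, 0.3))
```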

least squares regression

A linear regression model trained by minimizing L2 Loss.

linear model

#fundamentals

A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) In contrast, the relationship of features to predictions in deep models is generally nonlinear.

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.

A linear model follows this formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

where:

y' is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression.)
b is the bias.
w is a weight, so w1 is the weight of the first feature, w2 is the weight of the second feature, and so on.
x is a feature, so x1 is the value of the first feature, x2 is the value of the second feature, and so on.

For example, suppose a linear model for three features learns the following bias and weights:

b = 7
w1 = -2.5
w2 = -1.2
w3 = 1.4

Therefore, given three features (x1, x2, and x3), the linear model uses the following equation to generate each prediction:

y' = 7 + (-2.5)(x1) + (-1.2)(x2) + (1.4)(x3)

Suppose a particular example contains the following values:

x1 = 4
x2 = -10
x3 = 5

Plugging those values into the formula yields a prediction for this example:

y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)
y' = 16

Linear models include not only models that use only a linear equation to make predictions but also a broader set of models that use a linear equation as just one component of the formula that makes predictions. For example, logistic regression post-processes the raw prediction (y') to produce a final prediction value between 0 and 1, exclusive.
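The formula can be sketched directly in Python, using the bias, weights, and feature values listed in this entry. The function name `linear_predict` is an illustrative assumption.

```python
def linear_predict(bias, weights, features):
    """Raw prediction y' = b + w1*x1 + w2*x2 + ... + wn*xn."""
    return bias + sum(w * x for w, x in zip(weights, features))


bias = 7
weights = [-2.5, -1.2, 1.4]  # w1, w2, w3
features = [4, -10, 5]       # x1, x2, x3
y_prime = linear_predict(bias, weights, features)
print(y_prime)  # 7 - 10 + 12 + 7
```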

linear

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear.

linear regression

#fundamentals

A type of machine learning model in which both of the following are true:

The model is a linear model.
The prediction is a floating-point value. (This is the regression part of linear regression.)

Contrast linear regression with logistic regression. Also, contrast regression with classification.

logistic regression

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

The label is categorical. The term logistic regression usually refers to binary logistic regression, that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression, calculates probabilities for labels with more than two possible values.
The loss function during training is Log Loss. (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

A 72% chance of the email being spam.
A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

The model generates a raw prediction (y') by applying a linear function of input features.
The model uses that raw prediction as input to a sigmoid function, which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

If the predicted number is greater than the classification threshold, the binary classification model predicts the positive class.
If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.
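The two-step architecture and the thresholding step can be sketched as follows. The weights, bias, feature values, and the 0.5 threshold are hypothetical. The definition above leaves the case of a prediction exactly equal to the threshold open, so treating it as the negative class here is an assumption.

```python
import math


def sigmoid(z):
    """Convert a raw prediction to a value between 0 and 1, exclusive."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(bias, weights, features):
    """Step 1: linear function of the features. Step 2: sigmoid."""
    raw = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(raw)


def classify(probability, threshold=0.5):
    """Assumption: a probability equal to the threshold maps to the negative class."""
    return "positive" if probability > threshold else "negative"


# Hypothetical spam model with two features.
probability = predict_probability(-1.0, [0.8, 1.5], [2.0, 0.5])
print(probability, classify(probability))
```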

logits

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.

See also tf.nn.sigmoid_cross_entropy_with_logits.

Log Loss

#fundamentals

The loss function used in binary logistic regression.

The following formula calculates Log Loss:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

where:

$(x,y)\in D$ is the data set containing many labeled examples, which are $(x,y)$ pairs.
$y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must either be 0 or 1.
$y'$ is the predicted value (somewhere between 0 and 1, exclusive), given the set of features in $x$.
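A minimal sketch of the Log Loss formula follows. The three labels and predictions are made-up values for illustration, and `log_loss` here is a hand-written helper rather than a library function.

```python
import math


def log_loss(labels, predictions):
    """Sum of -y*log(y') - (1 - y)*log(1 - y') over all examples.

    Each label must be 0 or 1; each prediction must be strictly between 0 and 1.
    """
    return sum(
        -y * math.log(p) - (1 - y) * math.log(1 - p)
        for y, p in zip(labels, predictions)
    )


labels = [1, 0, 1]                   # hypothetical labels
good_predictions = [0.9, 0.2, 0.6]   # mostly close to the labels
bad_predictions = [0.2, 0.9, 0.3]    # mostly far from the labels
print(log_loss(labels, good_predictions))
print(log_loss(labels, bad_predictions))
```

Predictions closer to the labels should produce the smaller of the two printed values.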

log-odds

#fundamentals

The logarithm of the odds of some event.


If the event is a binary probability, then odds refers to the ratio of the probability of success (p) to the probability of failure (1-p). For example, suppose that a given event has a 90% probability of success and a 10% probability of failure. In this case, odds is calculated as follows:

$$ {\text{odds}} = \frac{\text{p}} {\text{(1-p)}} = \frac{.9} {.1} = {\text{9}} $$

The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to natural logarithm, but logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is therefore:

$$ {\text{log-odds}} = \ln(9) \approx 2.2 $$

The log-odds function is the inverse of the sigmoid function.
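The following sketch reproduces the 90% example and checks the inverse relationship numerically. The helper names are illustrative.

```python
import math


def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


value = log_odds(0.9)
print(round(value, 1))  # ln(9), rounded to one decimal place
print(sigmoid(value))   # recovers approximately 0.9
```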

Long Short-Term Memory (LSTM)

#seq

A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing gradient problem that occurs when training RNNs due to long data sequences by maintaining history in an internal memory state based on new input and context from previous cells in the RNN.

loss

#fundamentals

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations. The following plot shows a typical loss curve:

Loss curves can help you determine when your model is converging or overfitting.

Loss curves can plot all of the following types of loss:

Training loss
Validation loss
Test loss

See also .

loss function

#fundamentals

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss function returns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

L2 loss (or Mean Squared Error) is the loss function for linear regression.
Log Loss is the loss function for logistic regression.

loss surface

A graph of weight(s) vs. loss. Gradient descent aims to find the weight(s) for which the loss surface is at a local minimum.

LSTM

#seq

Abbreviation for Long Short-Term Memory.

M

machine learning

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

majority class

#fundamentals

The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class.

Markov decision process (MDP)

#rl

A graph representing the decision-making model where decisions (or actions) are taken to navigate a sequence of states under the assumption that the Markov property holds. In reinforcement learning, these transitions between states return a numerical reward.

Markov property

#rl

A property of certain environments, where state transitions are entirely determined by information implicit in the current state and the agent’s action.

masked language model

#language

A language model that predicts the probability of candidate tokens to fill in blanks in a sequence. For instance, a masked language model can calculate probabilities for candidate word(s) to replace the underline in the following sentence:

The ____ in the hat came back.

The literature typically uses the string "MASK" instead of an underline. For example:

The "MASK" in the hat came back.

Most modern masked language models are bidirectional.

matplotlib

An open-source Python 2D plotting library. matplotlib helps you visualize different aspects of machine learning.

matrix factorization

#recsystems

In math, a mechanism for finding the matrices whose dot product approximates a target matrix.

In recommendation systems, the target matrix often holds users' ratings on items. For example, the target matrix for a movie recommendation system might look something like the following, where the positive integers are user ratings and 0 means that the user didn't rate the movie:

| | Casablanca | The Philadelphia Story | Black Panther | Wonder Woman | Pulp Fiction |
|---|---|---|---|---|---|
| User 1 | 5.0 | 3.0 | 0.0 | 2.0 | 0.0 |
| User 2 | 4.0 | 0.0 | 0.0 | 1.0 | 5.0 |
| User 3 | 3.0 | 1.0 | 4.0 | 5.0 | 0.0 |

The movie recommendation system aims to predict user ratings for unrated movies. For example, will User 1 like Black Panther?

One approach for recommendation systems is to use matrix factorization to generate the following two matrices:

A user matrix, shaped as the number of users X the number of embedding dimensions.
An item matrix, shaped as the number of embedding dimensions X the number of items.

For example, using matrix factorization on our three users and five items could yield the following user matrix and item matrix:

User matrix (3 users X 2 embedding dimensions):
1.1   2.3
0.6   2.0
2.5   0.5

Item matrix (2 embedding dimensions X 5 items):
0.9   0.2   1.4    2.0   1.2
1.7   1.2   1.2   -0.1   2.1

The dot product of the user matrix and item matrix yields a recommendation matrix that contains not only the original user ratings but also predictions for the movies that each user hasn't seen. For example, consider User 1's rating of Casablanca, which was 5.0. The dot product corresponding to that cell in the recommendation matrix should hopefully be around 5.0, and it is:

(1.1 * 0.9) + (2.3 * 1.7) = 4.9

More importantly, will User 1 like Black Panther? Taking the dot product corresponding to the first row and the third column yields a predicted rating of 4.3:

(1.1 * 1.4) + (2.3 * 1.2) = 4.3

Matrix factorization typically yields a user matrix and item matrix that, together, are significantly more compact than the target matrix.
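The prediction step can be sketched with plain Python lists. The matrix values below are illustrative two-dimensional embeddings assumed for this example (chosen to be consistent with the 4.3 prediction for Black Panther), and `predicted_rating` is a hypothetical helper. Learning such matrices from the target matrix is a separate training step not shown here.

```python
# Assumed user matrix: one row per user, one column per embedding dimension.
user_matrix = [
    [1.1, 2.3],  # User 1
    [0.6, 2.0],  # User 2
    [2.5, 0.5],  # User 3
]

# Assumed item matrix: one row per embedding dimension, one column per movie.
item_matrix = [
    [0.9, 0.2, 1.4, 2.0, 1.2],
    [1.7, 1.2, 1.2, -0.1, 2.1],
]


def predicted_rating(user, item):
    """Dot product of a user's row with an item's column (0-based indices)."""
    return sum(
        user_matrix[user][d] * item_matrix[d][item]
        for d in range(len(item_matrix))
    )


print(round(predicted_rating(0, 0), 1))  # User 1, Casablanca
print(round(predicted_rating(0, 2), 1))  # User 1, Black Panther
```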

Mean Absolute Error (MAE)

The average loss per example when L1 loss is used. Calculate Mean Absolute Error as follows:

Calculate the L1 loss for a batch.
Divide the L1 loss by the number of examples in the batch.

$$\text{Mean Absolute Error} = \frac{1}{n}\sum_{i=0}^n | y_i - \hat{y}_i |$$ where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the value that the model predicts for $y$.

For example, consider the calculation of L1 loss on the following batch of five examples:

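| Actual value of example | Model's predicted value | Loss (difference between actual and predicted) |
|---|---|---|
| 7 | 6 | 1 |
| 5 | 4 | 1 |
| 8 | 11 | 3 |
| 4 | 6 | 2 |
| 9 | 8 | 1 |
| | | 8 = L1 loss |

So, L1 loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

Mean Absolute Error = L1 loss / Number of examples
Mean Absolute Error = 8 / 5 = 1.6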

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error.

Mean Squared Error (MSE)

The average loss per example when L2 loss is used. Calculate Mean Squared Error as follows:

Calculate the L2 loss for a batch.
Divide the L2 loss by the number of examples in the batch.

$$\text{Mean Squared Error} = \frac{1}{n}\sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$ where:

$n$ is the number of examples.
$y$ is the actual value of the label.
$\hat{y}$ is the model's prediction for $y$.

For example, consider the loss on the following batch of five examples:

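| Actual value | Model's prediction | Loss | Squared loss |
|---|---|---|---|
| 7 | 6 | 1 | 1 |
| 5 | 4 | 1 | 1 |
| 8 | 11 | 3 | 9 |
| 4 | 6 | 2 | 4 |
| 9 | 8 | 1 | 1 |
| | | | 16 = L2 loss |

Therefore, the Mean Squared Error is:

Mean Squared Error = L2 loss / Number of examples
Mean Squared Error = 16 / 5 = 3.2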

Mean Squared Error is a popular training optimizer, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error.

TensorFlow Playground uses Mean Squared Error to calculate loss values.

Outliers strongly influence Mean Squared Error. For example, a loss of 1 is a squared loss of 1, but a loss of 3 is a squared loss of 9. In the preceding table, the example with a loss of 3 accounts for ~56% of the Mean Squared Error, while each of the examples with a loss of 1 accounts for only ~6% of the Mean Squared Error.

Outliers do not influence Mean Absolute Error as strongly as Mean Squared Error. For example, a loss of 3 accounts for only ~38% of the Mean Absolute Error.

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.
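The outlier percentages quoted above can be reproduced with a short sketch that works from the per-example losses in the preceding table. The variable names are illustrative.

```python
# Per-example absolute losses from the table: |actual - predicted|.
losses = [1, 1, 3, 2, 1]
squared_losses = [loss ** 2 for loss in losses]

mean_absolute_error = sum(losses) / len(losses)
mean_squared_error = sum(squared_losses) / len(squared_losses)

# Share of each total contributed by the example with a loss of 3.
outlier_share_of_mse = 3 ** 2 / sum(squared_losses)
outlier_share_of_mae = 3 / sum(losses)

print(mean_absolute_error, mean_squared_error)
print(outlier_share_of_mse, outlier_share_of_mae)
```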

metric

#TensorFlow

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

meta-learning

#language

A subset of machine learning that discovers or improves a learning algorithm. A meta-learning system can also aim to train a model to quickly learn a new task from a small amount of data or from experience gained in previous tasks. Meta-learning algorithms generally try to achieve the following:

Improve/learn hand-engineered features (such as an initializer or an optimizer).
Be more data-efficient and compute-efficient.
Improve generalization.

Meta-learning is related to few-shot learning.

Metrics API (tf.metrics)

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration. The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

mini-batch stochastic gradient descent

A gradient descent algorithm that uses mini-batches. In other words, mini-batch stochastic gradient descent estimates the gradient based on a small subset of the training data. Regular stochastic gradient descent uses a mini-batch of size 1.

minimax loss

A loss function for generative adversarial networks, based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

minority class

#fundamentals

The less common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class.

A training set with a million examples sounds impressive. However, if the minority class is poorly represented, then even a very large training set might be insufficient. Focus less on the total number of examples in the dataset and more on the number of examples in the minority class.

If your dataset doesn't contain enough minority class examples, consider using downsampling (the definition in the second bullet) to supplement the minority class.

ML

Abbreviation for machine learning.

MNIST

#image

A public-domain dataset compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.

MNIST is a canonical dataset for machine learning, often used to test new machine learning approaches. For details, see The MNIST Database of Handwritten Digits.

modality

#language

A high-level data category. For example, numbers, text, images, video, and audio are five different modalities.

model

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning, a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. For example:

A linear regression model consists of a set of weights and a bias.
A neural network model consists of:
- A set of hidden layers, each containing one or more neurons.
- The weights and bias associated with each neuron.
A decision tree model consists of:
- The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
- The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster.

An algebraic function that maps input values (such as x and y) to an output is a model.

Similarly, a programming function is also a model: a caller passes arguments to the function, and the function generates output (via the return statement).

Although a deep neural network has a very different mathematical structure than an algebraic or programming function, a deep neural network still takes input (an example) and returns output (a prediction).

A human programmer codes a programming function manually. In contrast, a machine learning model gradually learns the optimal parameters during automated training.

model capacity

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model’s capacity. A model’s capacity typically increases with the number of model parameters. For a formal definition of classifier capacity, see VC dimension.

model parallelism

#language

A way of scaling training or inference that puts different parts of one model on different devices. Model parallelism enables models that are too big to fit on a single device.

See also data parallelism.

model training

The process of determining the best model.

Momentum

A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima.
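One common way to write the momentum update is sketched below. This is a minimal sketch under stated assumptions: the function name `momentum_step`, the hyperparameters `learning_rate` and `beta`, and the toy loss f(w) = w^2 are all illustrative choices, and other formulations of momentum exist.

```python
def momentum_step(weight, velocity, gradient, learning_rate=0.1, beta=0.9):
    """One update: velocity is an exponentially weighted average of gradients."""
    velocity = beta * velocity + (1 - beta) * gradient
    weight = weight - learning_rate * velocity
    return weight, velocity


def gradient_of_square(w):
    """Derivative of the toy loss f(w) = w^2."""
    return 2 * w


w, v = 5.0, 0.0
for _ in range(400):
    w, v = momentum_step(w, v, gradient_of_square(w))

print(abs(w) < 1e-3)  # the weight has moved close to the minimum at 0
```

Because `velocity` blends the current gradient with earlier ones, each step depends on the derivatives of preceding steps as well as the current one.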

multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

Iris setosa
Iris virginica
Iris versicolor

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification problems. For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

multi-class logistic regression

Using logistic regression in multi-class classification problems.

multi-head self-attention

#language

An extension of self-attention that applies the self-attention mechanism multiple times for each position in the input sequence.

Transformers introduced multi-head self-attention.

multimodal model

#language

A model whose inputs and/or outputs include more than one modality. For example, consider a model that takes both an image and a text caption (two modalities) as features, and outputs a score indicating how appropriate the text caption is for the image. This model's inputs are multimodal and its output is unimodal.

multinomial classification

Synonym for multi-class classification.

multinomial regression

Synonym for multi-class logistic regression.

N

NaN trap

When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN.

NaN is an abbreviation for Not a Number.

natural language understanding

#language

Determining a user's intentions based on what the user typed or said. For example, a search engine uses natural language understanding to determine what the user is searching for based on what the user typed or said.

negative class

#fundamentals

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

The negative class in a medical test might be "not tumor."
The negative class in an email classifier might be "not spam."

Contrast with positive class.

neural network

#fundamentals

A model containing at least one hidden layer. A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connects to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See also convolutional neural network and recurrent neural network.

neuron

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network. Each neuron performs the following two-step action:

Calculates the weighted sum of input values multiplied by their corresponding weights.
Passes the weighted sum as input to an activation function.

A neuron in the first hidden layer accepts inputs from the feature values in the input layer. A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

N-gram

#seq

#language

An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a different 2-gram than truly madly.

| N | Name(s) for this kind of N-gram | Examples |
|---|---|---|
| 2 | bigram or 2-gram | to go, go to, eat lunch, eat dinner |
| 3 | trigram or 3-gram | ate too much, three blind mice, the bell tolls |
| 4 | 4-gram | walk in the park, dust in the wind, the boy ate lentils |

Many models rely on N-grams to predict the next word that the user will type or say. For example, suppose a user typed three blind. An NLU model based on trigrams would likely predict that the user will next type mice.

Contrast N-grams with bag of words, which are unordered sets of words.
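The following minimal sketch shows one way to extract N-grams from a list of words. The helper name `ngrams` is an assumption made for this illustration; it uses tuples so that word order is preserved.

```python
def ngrams(words, n):
    """Return every ordered run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


tokens = "three blind mice".split()
print(ngrams(tokens, 2))  # [('three', 'blind'), ('blind', 'mice')]
print(ngrams(tokens, 3))  # [('three', 'blind', 'mice')]

# Order matters: these two 2-grams are different.
print(ngrams(["truly", "madly"], 2) == ngrams(["madly", "truly"], 2))  # False
```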

NLU

#language

Abbreviation for natural language understanding.

node (neural network)

#fundamentals

A neuron in a hidden layer.

node (TensorFlow graph)

#TensorFlow

An operation in a TensorFlow graph.

node (decision tree)

#df

In a decision tree, any condition or leaf.

noise

Broadly speaking, anything that obscures the signal in a dataset. Noise can be introduced into data in a variety of ways. For example:

Human raters make mistakes in labeling.
Humans and instruments mis-record or omit feature values.

non-binary condition

#df

A condition containing more than two possible outcomes. For example, the following non-binary condition contains three possible outcomes:

nonlinear

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

non-response bias

#fairness

See selection bias.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

The number of swimsuits sold at a particular store varies with the season.
The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity.

normalization

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

-1 to +1
0 to 1
the normal distribution

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering, you could normalize the actual values down to a standard range, such as -1 to +1.

Normalization is a common task in feature engineering. Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.
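The range example above can be sketched as a simple linear rescaling. The function name `scale_to_range` and its parameters are assumptions for this illustration; linear scaling is only one of several normalization techniques.

```python
def scale_to_range(value, actual_min, actual_max, new_min=-1.0, new_max=1.0):
    """Linearly map value from [actual_min, actual_max] to [new_min, new_max]."""
    fraction = (value - actual_min) / (actual_max - actual_min)
    return new_min + fraction * (new_max - new_min)


print(scale_to_range(800, 800, 2400))   # -1.0
print(scale_to_range(1600, 800, 2400))  # 0.0
print(scale_to_range(2400, 800, 2400))  # 1.0
```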

novelty detection

The process of determining whether a new (novel) example comes from the same distribution as the training set. In other words, after training on the training set, novelty detection determines whether a new example (during inference or during additional training) is an outlier.

Contrast with outlier detection.

numerical data

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes should not be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features.

NumPy

An open-source math library that provides efficient array operations in Python. pandas is built on NumPy.

O

objective

A metric that your algorithm is trying to optimize.

objective function

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually Mean Squared Loss. Therefore, when training a linear regression model, training aims to minimize Mean Squared Loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

See also loss function.

oblique condition

#df

In a decision tree, a condition that involves more than one feature. For example, if height and width are both features, then the following is an oblique condition:

height > width

Contrast with axis-aligned condition.

offline

#fundamentals

Synonym for static.

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the desired prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.

Offline inference is also called static inference.

Contrast with online inference.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

One element is set to 1.
All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature has five possible values:

"Denmark"
"Sweden"
"Norway"
"Finland"
"Iceland"

One-hot encoding could represent each of the five values as follows:

| country | Vector |
|---|---|
| "Denmark" | 1 0 0 0 0 |
| "Sweden" | 0 1 0 0 0 |
| "Norway" | 0 0 1 0 0 |
| "Finland" | 0 0 0 1 0 |
| "Iceland" | 0 0 0 0 1 |

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

"Denmark" is 0
"Sweden" is 1
"Norway" is 2
"Finland" is 3
"Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.
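A minimal sketch of the five-country encoding described above follows. The helper name `one_hot` and the `COUNTRIES` list are assumptions made for illustration; ML frameworks typically provide their own encoders.

```python
COUNTRIES = ["Denmark", "Sweden", "Norway", "Finland", "Iceland"]


def one_hot(value, vocabulary):
    """Return a vector with a single 1 at the position of value."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]


print(one_hot("Denmark", COUNTRIES))  # [1, 0, 0, 0, 0]
print(one_hot("Norway", COUNTRIES))   # [0, 0, 1, 0, 0]
```

Unlike the numeric encoding, no vector here is "twice" another, so the model is not invited to read an ordering into the categories.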

one-shot learning

A machine learning approach, often used for object classification, designed to learn effective classifiers from a single training example.

See also few-shot learning.

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classifiers, one for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

animal vs. not animal
vegetable vs. not vegetable
mineral vs. not mineral
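The relabeling step behind a one-vs.-all solution can be sketched as follows. This covers only how the N binary label sets might be derived, not the training of the classifiers themselves; `one_vs_all_labels` and the sample `labels` list are assumptions for illustration.

```python
CLASSES = ["animal", "vegetable", "mineral"]


def one_vs_all_labels(labels, positive_class):
    """Relabel a multi-class dataset as 1 (positive_class) vs. 0 (everything else)."""
    return [1 if label == positive_class else 0 for label in labels]


labels = ["animal", "mineral", "vegetable", "animal"]
binary_problems = {c: one_vs_all_labels(labels, c) for c in CLASSES}

print(binary_problems["animal"])     # [1, 0, 0, 1]
print(binary_problems["vegetable"])  # [0, 0, 1, 0]
print(binary_problems["mineral"])    # [0, 1, 0, 0]
```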

online

#fundamentals

Synonym for dynamic.

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference.

operation (op)

#TensorFlow

In TensorFlow, any procedure that creates, manipulates, or destroys a Tensor. For example, a matrix multiply is an operation that takes two Tensors as input and generates one Tensor as output.

out-of-bag evaluation (OOB evaluation)

#df

A mechanism for evaluating the quality of a decision forest by testing each decision tree against the examples not used during training of that decision tree. For example, in the following diagram, notice that the system trains each decision tree on about two-thirds of the examples and then evaluates against the remaining one-third of the examples.

Out-of-bag evaluation is a computationally efficient and conservative approximation of the cross-validation mechanism. In cross-validation, one model is trained for each cross-validation round (for example, 10 models are trained in a 10-fold cross-validation). With OOB evaluation, a single model is trained. Because bagging withholds some data from each tree during training, OOB evaluation can use that data to approximate cross-validation.

optimizer

A specific implementation of the gradient descent algorithm. Popular optimizers include:

AdaGrad, which stands for ADAptive GRADient descent.
Adam, which stands for ADAptive with Momentum.

out-group homogeneity bias

#fairness

The tendency to see out-group members as more alike than in-group members when comparing attitudes, values, personality traits, and other characteristics. In-group refers to people you interact with regularly; out-group refers to people you do not interact with regularly. If you create a dataset by asking people to provide attributes about out-groups, those attributes may be less nuanced and more stereotyped than attributes that participants list for people in their in-group.

For example, Lilliputians might describe the houses of other Lilliputians in great detail, citing small differences in architectural styles, windows, doors, and sizes. However, the same Lilliputians might simply declare that Brobdingnagians all live in identical houses.

Out-group homogeneity bias is a form of group attribution bias.

See also in-group bias.

outlier detection

The process of identifying outliers in a training set.

Contrast with novelty detection.

outliers

Values distant from most other values. In machine learning, any of the following are outliers:

Input data whose values are more than roughly 3 standard deviations from the mean.
Weights with high absolute values.
Predicted values relatively far away from the actual values.

For example, suppose that price is a feature of a certain model. Assume that the mean price is 7 Euros with a standard deviation of 1 Euro. Examples containing a price of 12 Euros or 2 Euros would therefore be considered outliers because each of those prices is five standard deviations from the mean.
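The arithmetic in this example can be checked with a short sketch. The helper names `z_score` and `is_outlier`, and the default threshold of 3 standard deviations (taken from the rough rule stated earlier in this entry), are assumptions for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations between value and the mean."""
    return (value - mean) / std_dev


def is_outlier(value, mean, std_dev, threshold=3.0):
    """Flag values more than threshold standard deviations from the mean."""
    return abs(z_score(value, mean, std_dev)) > threshold


print(z_score(12, 7, 1))    # 5.0
print(z_score(2, 7, 1))     # -5.0
print(is_outlier(12, 7, 1)) # True
print(is_outlier(8, 7, 1))  # False
```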

Outliers are often caused by typos or other input mistakes. In other cases, outliers aren't mistakes; after all, values five standard deviations away from the mean are rare but hardly impossible.

Outliers often cause problems in model training. Clipping is one way of managing outliers.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

overfitting

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.


Overfitting is like strictly following advice from only your favorite teacher. You'll probably be successful in that teacher's class, but you might "overfit" to that teacher's ideas and be unsuccessful in other classes. Following advice from a mixture of teachers will enable you to adapt better to new situations.

oversampling

Reusing the examples of a minority class in a class-imbalanced dataset in order to create a more balanced training set.

For example, consider a binary classification problem in which the ratio of the majority class to the minority class is 5,000:1. If the dataset contains a million examples, then the dataset contains only about 200 examples of the minority class, which might be too few examples for effective training. To overcome this deficiency, you might oversample (reuse) those 200 examples multiple times, possibly yielding sufficient examples for useful training.

You need to be careful about overfitting when oversampling.

Contrast with undersampling.
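A minimal sketch of oversampling by random reuse follows. The function name `oversample`, the `target_count` parameter, and the fixed seed are assumptions for illustration; sampling with replacement is one simple strategy among several.

```python
import random


def oversample(minority_examples, target_count, seed=0):
    """Reuse minority examples (sampling with replacement) up to target_count."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    shortfall = max(0, target_count - len(minority_examples))
    extra = [rng.choice(minority_examples) for _ in range(shortfall)]
    return list(minority_examples) + extra


minority = [f"example_{i}" for i in range(200)]
balanced_minority = oversample(minority, target_count=1000)

print(len(balanced_minority))                   # 1000
print(set(balanced_minority) == set(minority))  # True: no new examples, only repeats
```

Because the added examples are exact repeats, a model can memorize them, which is why overfitting is a concern.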

P

pandas

#fundamentals

A column-oriented data analysis API built on top of NumPy. Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

parameter

#fundamentals

The weights and biases that a model learns during training. For example, in a linear regression model, the parameters consist of the bias (b) and all the weights (w1, w2, and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.
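The formula above can be evaluated with a few lines of code. In this sketch the function name `predict` and the sample bias, weights, and feature values are hypothetical; in practice the parameters would be learned during training rather than written by hand.

```python
def predict(bias, weights, features):
    """Compute y' = b + w_1*x_1 + ... + w_n*x_n."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical parameters: b = 1.0, w = (2.0, -1.0); features x = (3.0, 4.0).
print(predict(1.0, [2.0, -1.0], [3.0, 4.0]))  # 1 + 6 - 4 = 3.0
```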

Parameter Server (PS)

#TensorFlow

A job that keeps track of a model's parameters in a distributed setting.

parameter update

The operation of adjusting a model's parameters during training, typically within a single iteration of gradient descent.

partial derivative

A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The partial derivative of f with respect to x focuses only on how x is changing and ignores all other variables in the equation.
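As a hedged numerical illustration, the sketch below approximates a partial derivative by nudging x while holding y constant. The example function f(x, y) = x^2 * y and the helper `partial_x` are assumptions chosen for this sketch; analytically, the partial derivative of this f with respect to x is 2xy.

```python
def f(x, y):
    """Example function: f(x, y) = x^2 * y."""
    return x**2 * y


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of d(func)/dx with y held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


estimate = partial_x(f, 3.0, 2.0)
exact = 2 * 3.0 * 2.0  # 2xy at (3, 2) is 12
print(abs(estimate - exact) < 1e-4)  # True
```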

participation bias

#fairness

Synonym for non-response bias. See selection bias.

partitioning strategy

The algorithm by which variables are divided across parameter servers.

perceptron

A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as ReLU, sigmoid, or tanh. For example, the following perceptron relies on the sigmoid function to process three input values:

$$f(x_1, x_2, x_3) = \text{sigmoid}(w_1 x_1 + w_2 x_2 + w_3 x_3)$$
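The formula above translates directly into code. In this minimal sketch the names `sigmoid` and `perceptron` and the sample weights are assumptions for illustration; like the formula, the sketch omits a bias term.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(inputs, weights):
    """Apply sigmoid to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)


# With all-zero weights the weighted sum is 0, and sigmoid(0) is 0.5.
print(perceptron([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))  # 0.5
```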

In the following illustration, the perceptron takes three inputs, each of which is itself modified by a weight before entering the perceptron:

Perceptrons are the neurons in neural networks.

performance

Overloaded term with the following meanings:

The traditional meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
The meaning within machine learning. Here, performance answers the following question: How correct is this model? That is, how good are the model's predictions?

permutation variable importances

#df

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-agnostic metric.

perplexity

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a smartphone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy (measured in bits) as follows:

$$P = 2^{\text{cross-entropy}}$$
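As a hedged worked example, the sketch below uses the standard convention that perplexity equals 2 raised to the cross-entropy measured in bits. The helper names and the sample probabilities are assumptions for illustration; each probability is the one a hypothetical model assigned to the word the user actually typed.

```python
import math


def cross_entropy_bits(probs_of_actual_words):
    """Average negative log2 probability assigned to the actual words."""
    total = sum(math.log2(p) for p in probs_of_actual_words)
    return -total / len(probs_of_actual_words)


def perplexity(probs_of_actual_words):
    """Perplexity = 2 ** cross-entropy (in bits)."""
    return 2 ** cross_entropy_bits(probs_of_actual_words)


# A model that always spreads its belief evenly over 4 candidate words
# assigns probability 0.25 to the actual word every time.
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
```

A perplexity of 4 matches the intuition in the definition: roughly four guesses are needed before the list contains the intended word.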

pipeline

The infrastructure surrounding a machine learning algorithm. A pipeline includes gathering the data, putting the data into training data files, training one or more models, and exporting the models to production.

pipelining

#language

A form of model parallelism in which a model's processing is divided into consecutive stages and each stage is executed on a different device. While a stage is processing one batch, the preceding stage can work on the next batch.

See also staged training.

policy

#rl

In reinforcement learning, an agent's probabilistic mapping from states to actions.

pooling

#image

Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area. For example, suppose we have the following 3x3 matrix:

A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides that convolutional operation by strides. For example, suppose the pooling operation divides the convolutional matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine that each pooling operation picks the maximum value of the four in that slice:
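The 2x2 max pooling with a 1x1 stride described above can be sketched in plain Python. The function name `max_pool` and the sample 3x3 matrix are assumptions for illustration; they are not the matrix from the original diagram.

```python
def max_pool(matrix, size=2, stride=1):
    """Slide a size x size window over matrix and keep each window's maximum."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [matrix[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


sample = [[5, 3, 1],
          [8, 2, 5],
          [9, 4, 3]]

# Four 2x2 slices fit in a 3x3 matrix, so the result is 2x2.
print(max_pool(sample))  # [[8, 5], [9, 5]]
```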

Pooling helps enforce translational invariance in the input matrix.

Pooling for vision applications is known more formally as spatial pooling. Time-series applications usually refer to pooling as temporal pooling. Less formally, pooling is often called subsampling or downsampling.

positive class

#fundamentals

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classifier might be "spam."

Contrast with negative class.


The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negative classes.

post-processing

#fairness

#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

PR AUC (area under the PR curve)

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it's calculated, PR AUC may be equivalent to the average precision of the model.

precision

A metric for classification models that answers the following question:

When the model predicted the positive class, what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} = \frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

true positive means the model correctly predicted the positive class.
false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

150 were true positives.
50 were false positives.

In this case:

$$\text{Precision} = \frac{\text{150}} {\text{150} + \text{50}} = 0.75$$
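The same calculation can be sketched in code. The function name `precision` and its argument names are assumptions for this illustration.

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)


print(precision(150, 50))  # 0.75
```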

Contrast with accuracy and recall.

precision-recall curve

A curve of precision vs. recall at different classification thresholds.

prediction

#fundamentals

A model's output. For example:

The prediction of a binary classification model is either the positive class or the negative class.
The prediction of a multi-class classification model is one class.
The prediction of a linear regression model is a number.

prior belief

What you believe about the data before you begin training on it. For example, L2 regularization relies on a prior belief that weights should be small and normally distributed around zero.

probabilistic regression model

A regression model that uses not only the weights for each feature, but also the uncertainty of those weights. A probabilistic regression model generates a prediction and the uncertainty of that prediction. For example, a probabilistic regression model might yield a prediction of 325 with a standard deviation of 12. For more information about probabilistic regression models, see this Colab on tensorflow.org.

proxy (sensitive attributes)

#fairness

An attribute used as a stand-in for a sensitive attribute. For example, an individual's postal code might be used as a proxy for their income, race, or ethnicity.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against sun than the rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

Q

Q-function

#rl

In reinforcement learning, the function that predicts the expected return from taking an action in a state and then following a given policy.

Q-function is also known as state-action value function.

Q-learning

#rl

In reinforcement learning, an algorithm that allows an agent to learn the optimal Q-function of a Markov decision process by applying the Bellman equation. The Markov decision process models an environment.

quantile

Each bucket in quantile bucketing.

quantile bucketing

Distributing a feature's values into buckets so that each bucket contains the same (or almost the same) number of examples. For example, the following figure divides 44 points into 4 buckets, each of which contains 11 points. In order for each bucket in the figure to contain the same number of points, some buckets span a different width of x-values.
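The 44-points-into-4-buckets example can be sketched by assigning each value to a bucket according to its rank. The function name `quantile_buckets` and the sample values are assumptions for illustration; library implementations usually compute bucket boundaries instead.

```python
def quantile_buckets(values, num_buckets):
    """Split values into num_buckets groups of (almost) equal size by rank."""
    ordered = sorted(values)
    count = len(ordered)
    buckets = [[] for _ in range(num_buckets)]
    for rank, value in enumerate(ordered):
        buckets[rank * num_buckets // count].append(value)
    return buckets


# 44 unevenly spaced points: the squares 0, 1, 4, ..., 1849.
points = [x * x for x in range(44)]
buckets = quantile_buckets(points, 4)

print([len(b) for b in buckets])                # [11, 11, 11, 11]
print([(b[0], b[-1]) for b in buckets])         # each bucket spans a different width
```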

quantization

An algorithm that implements quantile bucketing on a particular feature in a dataset.

queue

#TensorFlow

A TensorFlow Operation that implements a queue data structure. Typically used in I/O.

R

random forest

#df

An ensemble of decision trees in which each decision tree is trained with a specific random noise, such as bagging.

Random forests are a type of decision forest.

random policy

#rl

In reinforcement learning, a policy that chooses an action at random.

ranking

A type of supervised learning whose objective is to order a list of items.

rank (ordinality)

The ordinal position of a class in a machine learning problem that categorizes classes from highest to lowest. For example, a behavior ranking system could rank a dog's rewards from highest (a steak) to lowest (wilted kale).

rank (Tensor)

#TensorFlow

The number of dimensions in a Tensor. For instance, a scalar has rank 0, a vector has rank 1, and a matrix has rank 2.

Not to be confused with rank (ordinality).

rater

#fundamentals

A human who provides labels for examples. "Annotator" is another name for rater.

recall

A metric for classification models that answers the following question:

When ground truth was the positive class, what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

\[\text{Recall} = \frac{\text{true positives}} {\text{true positives} + \text{false negatives}} \]

where:

true positive means the model correctly predicted the positive class.
false negative means that the model mistakenly predicted the negative class.

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

180 were true positives.
20 were false negatives.

In this case:

\[\text{Recall} = \frac{\text{180}} {\text{180} + \text{20}} = 0.9 \]


Recall is particularly useful for determining the predictive power of classification models in which the positive class is rare. For example, consider a class-imbalanced dataset in which the positive class for a certain disease occurs in only 10 patients out of a million. Suppose your model makes five million predictions that yield the following outcomes:

30 True Positives
20 False Negatives
4,999,000 True Negatives
950 False Positives

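The recall of this model is therefore:

\[\text{Recall} = \frac{\text{TP}} {\text{TP} + \text{FN}} = \frac{30} {30 + 20} = 0.6 = 60\% \]

By contrast, the accuracy of this model is:

\[\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{30 + 4{,}999{,}000} {30 + 4{,}999{,}000 + 950 + 20} \approx 99.98\% \]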

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.
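The contrast between recall and accuracy on these counts can be checked with a short sketch. The function names `recall` and `accuracy` and their argument names are assumptions for illustration; the counts are the ones listed in the example above.

```python
def recall(true_positives, false_negatives):
    """Fraction of actual positives that the model identified."""
    return true_positives / (true_positives + false_negatives)


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


tp, fn, tn, fp = 30, 20, 4_999_000, 950

print(recall(tp, fn))                      # 0.6
print(round(accuracy(tp, tn, fp, fn), 6))  # 0.999806
```

The model misses 40% of the actual positives even though nearly all of its predictions are correct, because the true negatives dominate the total.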

recommendation system

#recsystems

A system that selects for each user a relatively small set of desirable items from a large corpus. For example, a video recommendation system might recommend two videos from a corpus of 100,000 videos, selecting Casablanca and The Philadelphia Story for one user, and Wonder Woman and Black Panther for another. A video recommendation system might base its recommendations on factors such as:

Movies that similar users have rated or watched.
Genre, directors, actors, target demographic...

Rectified Linear Unit (ReLU)

#fundamentals

An activation function with the following behavior:

If input is negative or zero, then the output is 0.
If input is positive, then the output is equal to the input.

For example:

If the input is -3, then the output is 0.
If the input is +3, then the output is 3.0.
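The behavior described above fits in a one-line function. The name `relu` is an assumption for this sketch; ML frameworks provide their own implementations.

```python
def relu(x):
    """Return 0.0 for negative or zero input, otherwise the input itself."""
    return float(x) if x > 0 else 0.0


print(relu(-3))  # 0.0
print(relu(3))   # 3.0
```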

Here is a plot of ReLU:

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label.

recurrent neural network

#seq

A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.

For example, the following figure shows a recurrent neural network that runs four times. Notice that the values learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run. Similarly, the values learned in the hidden layer on the second run become part of the input to the same hidden layer in the third run. In this way, the recurrent neural network gradually trains and predicts the meaning of the entire sequence rather than just the meaning of individual words.

regression model

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

A model that predicts a certain house's value, such as 423,000 Euros.
A model that predicts a certain tree's life expectancy, such as 23.2 years.
A model that predicts the amount of rain that will fall in a certain city over the next six hours, such as 0.18 inches.

Two common types of regression models are:

Linear regression, which finds the line that best fits label values to features.
Logistic regression, which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

regularization

#fundamentals

Any mechanism that reduces overfitting. Popular types of regularization include:

L1 regularization
L2 regularization
dropout regularization
early stopping (this is not a formal regularization method, but can effectively limit overfitting)

Regularization can also be defined as the penalty on a model's complexity.


Regularization is counterintuitive. Increasing regularization usually increases training loss, which is confusing because, well, isn't the goal to minimize training loss?

Actually, no. The goal isn't to minimize training loss. The goal is to make excellent predictions on real-world examples. Remarkably, even though increasing regularization increases training loss, it usually helps models make better predictions on real-world examples.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.

The regularization rate is usually represented as the Greek letter lambda. The following simplified equation shows lambda's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization))}$$

where regularization is any regularization mechanism, such as L1 regularization or L2 regularization.
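As a minimal sketch of the simplified equation, the following Python snippet adds a lambda-weighted penalty to a data loss. The choice of an L2-style penalty (sum of squared weights) and the names `l2_penalty` and `regularized_objective` are assumptions made for illustration only.

```python
def l2_penalty(weights):
    """Sum of squared weights: one common regularization term (assumed here)."""
    return sum(w * w for w in weights)


def regularized_objective(loss, weights, lam):
    """Return loss + lambda * regularization, the quantity being minimized."""
    return loss + lam * l2_penalty(weights)


weights = [0.5, -1.0, 2.0]   # hypothetical model weights
data_loss = 1.25             # hypothetical loss on the training data

no_reg = regularized_objective(data_loss, weights, lam=0.0)
with_reg = regularized_objective(data_loss, weights, lam=0.1)
print(no_reg, with_reg)
```

With lambda set to 0 the penalty vanishes; a larger lambda makes large weights count more heavily against the model.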

reinforcement learning (RL)

#rl

A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit.

replay buffer

#rl

In DQN-like algorithms, the memory used by the agent to store state transitions for use in experience replay.

reporting bias

#fairness

The fact that the frequency with which people write about actions, outcomes, or properties is not a reflection of their real-world frequencies or the degree to which a property is characteristic of a class of individuals. Reporting bias can influence the composition of data that machine learning systems learn from.

For example, in books, the word laughed is more prevalent than breathed. A machine learning model that estimates the relative frequency of laughing and breathing from a book corpus would probably determine that laughing is more common than breathing.

representation

The process of mapping data to useful features.

re-ranking

#recsystems

The final stage of a recommendation system, during which scored items may be re-graded according to some other (typically, non-ML) algorithm. Re-ranking evaluates the list of items generated by the scoring phase, taking actions such as:

Eliminating items that the user has already purchased.
Boosting the score of fresher items.

return

#rl

In reinforcement learning, given a certain policy and a certain state, the return is the sum of all rewards that the agent expects to receive when following the policy from the state to the end of the episode. The agent accounts for the delayed nature of expected rewards by discounting rewards according to the state transitions required to obtain the reward.

Therefore, if the discount factor is $\gamma$, and $r_0, \ldots, r_{N}$ denote the rewards until the end of the episode, then the return calculation is as follows:

$$\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^{N} r_{N}$$
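The discounted sum can be sketched in a few lines of Python. The function name `discounted_return` and the sample rewards are hypothetical; the code simply weights the reward at step t by gamma raised to the power t.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over every reward in the episode."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


rewards = [1.0, 0.0, 2.0]          # hypothetical rewards until the episode ends
g = discounted_return(rewards, gamma=0.5)
print(g)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```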

reward

#rl

In reinforcement learning, the numerical result of taking an action in a state, as defined by the environment.

ridge regularization

Synonym for L2 regularization. The term ridge regularization is more frequently used in pure statistics contexts, whereas L2 regularization is used more often in machine learning.

RNN

#seq

Abbreviation for recurrent neural networks.

ROC (receiver operating characteristic) Curve

#fundamentals

A graph of true positive rate vs. false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

The ROC curve for the preceding model looks as follows:

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

The ROC curve for this model looks as follows:

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.
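To make the construction of an ROC curve concrete, here is a small sketch that computes one (false positive rate, true positive rate) point per classification threshold. The labels, scores, and thresholds are made-up illustration data, and the helper `roc_point` is a hypothetical name; the sketch assumes both classes appear in the labels.

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    # Assumes at least one positive and one negative example.
    return fp / (fp + tn), tp / (tp + fn)


labels = [0, 0, 1, 1]              # hypothetical ground truth
scores = [0.1, 0.4, 0.35, 0.8]     # hypothetical model outputs
curve = [roc_point(labels, scores, t) for t in (0.0, 0.3, 0.5, 1.0)]
print(curve)
```

Lowering the threshold moves the point toward (1.0, 1.0); raising it moves the point toward (0.0, 0.0).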

root

#df

The starting node (the first condition) in a decision tree. By convention, diagrams put the root at the top of the decision tree. For example:

root directory

#TensorFlow

The directory you specify for hosting subdirectories of the TensorFlow checkpoint and events files of multiple models.

Root Mean Squared Error (RMSE)

#fundamentals

The square root of the Mean Squared Error.
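A minimal sketch of this calculation, assuming the Mean Squared Error is the average of the squared differences between labels and predictions (the function name `rmse` and the sample values are hypothetical):

```python
import math


def rmse(labels, predictions):
    """Square root of the mean of squared prediction errors."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse)


value = rmse([3.0, 5.0], [1.0, 5.0])   # errors: 2 and 0, so MSE = 2
print(value)
```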

rotational invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 should not be classified as a 9.

See also translational invariance and size invariance.

S

sampling bias

#fairness

See selection bias.

sampling with replacement

#df

A method of picking items from a set of candidate items in which the same item can be picked multiple times. The phrase "with replacement" means that after each selection, the selected item is returned to the pool of candidate items. The inverse method, sampling without replacement, means that a candidate item can only be picked once.

For example, consider the following fruit set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Suppose that the system randomly picks fig as the first item. If using sampling with replacement, then the system picks the second item from the following set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Yes, that's the same set as before, so the system could potentially pick fig again.

If using sampling without replacement, once picked, a sample can't be picked again. For example, if the system randomly picks fig as the first sample, then fig can't be picked again. Therefore, the system picks the second sample from the following (reduced) set:

fruit = {kiwi, apple, pear, cherry, lime, mango}
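Both methods can be tried with Python's standard `random` module: `random.choices` samples with replacement and `random.sample` samples without replacement. The fruit list and the fixed seed below are illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so repeated runs match (an arbitrary choice)
fruit = ["kiwi", "apple", "pear", "fig", "cherry", "lime", "mango"]

# With replacement: the same item may be picked more than once.
with_replacement = rng.choices(fruit, k=3)

# Without replacement: each item can be picked at most once.
without_replacement = rng.sample(fruit, k=3)

print(with_replacement)
print(without_replacement)
```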

The word replacement in sampling with replacement confuses many people. In English, replacement means "substitution." However, sampling with replacement actually uses the French definition for replacement, which means "putting something back." The English word replacement is translated as the French word remplacement.

SavedModel

#TensorFlow

The recommended format for saving and recovering TensorFlow models. SavedModel is a language-neutral, recoverable serialization format, which enables higher-level systems and tools to produce, consume, and transform TensorFlow models.

See the Saving and Restoring chapter in the TensorFlow Programmer's Guide for complete details.

Saver

#TensorFlow

A TensorFlow object responsible for saving model checkpoints.

scalar

A single number or a single string that can be represented as a tensor of rank 0. For example, a single integer such as 27, a single floating-point value such as 0.98, or a single string such as "poodle" is a scalar.

scaling

Any mathematical transform or technique that shifts the range of a label and/or feature value. Some forms of scaling are very useful for transformations like normalization.

Common forms of scaling useful in Machine Learning include:

linear scaling, which typically uses a combination of subtraction and division to replace the original value with a number between -1 and +1 or between 0 and 1.
logarithmic scaling, which replaces the original value with its logarithm.
Z-score normalization, which replaces the original value with a floating-point value representing the number of standard deviations from that feature's mean.
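The three forms of scaling listed above can be sketched as follows. The function names are hypothetical, linear scaling is shown in its 0-to-1 (min-max) variant, the population standard deviation is used, and the sketch assumes the values are positive and not all identical.

```python
import math
import statistics


def linear_scale(values):
    """Map values onto the range 0 to 1 using subtraction and division."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_scale(values):
    """Replace each (positive) value with its natural logarithm."""
    return [math.log(v) for v in values]


def z_score(values):
    """Replace each value with its distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


values = [1.0, 10.0, 100.0]   # hypothetical feature values
print(linear_scale(values))
print(log_scale(values))
print(z_score(values))
```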

scikit-learn

A popular open-source machine learning platform. See scikit-learn.org.

scoring

#recsystems

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

selection bias

#fairness

Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:

coverage bias: The population represented in the dataset does not match the population that the machine learning model is making predictions about.
sampling bias: Data is not collected randomly from the target group.
non-response bias (also called participation bias): Users from certain groups opt out of surveys at different rates than users from other groups.

For example, suppose you are creating a machine learning model that predicts people's enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:

coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
non-response bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.

self-attention (also called self-attention layer)

#language

A neural network layer that transforms a sequence of embeddings (for instance, token embeddings) into another sequence of embeddings. Each embedding in the output sequence is constructed by integrating information from the elements of the input sequence through an attention mechanism.

The self part of self-attention refers to the sequence attending to itself rather than to some other context. Self-attention is one of the main building blocks for Transformers and uses dictionary lookup terminology, such as "query", "key", and "value".

A self-attention layer starts with a sequence of input representations, one for each word. The input representation for a word can be a simple embedding. For each word in an input sequence, the network scores the relevance of the word to every element in the whole sequence of words. The relevance scores determine how much the word's final representation incorporates the representations of other words.

For example, consider the following sentence:

The animal didn't cross the street because it was too tired.

The following illustration (from Transformer: A Novel Neural Network Architecture for Language Understanding) shows a self-attention layer's attention pattern for the pronoun it, with the darkness of each line indicating how much each word contributes to the representation:

The self-attention layer highlights words that are relevant to "it". In this case, the attention layer has learned to highlight words that it might refer to, assigning the highest weight to animal.

For a sequence of n tokens, self-attention transforms a sequence of embeddings n separate times, once at each position in the sequence.

Refer also to attention and multi-head self-attention.
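The relevance-scoring idea can be illustrated with a deliberately simplified sketch. It assumes scaled dot-product scoring and uses the raw input embeddings as queries, keys, and values, with no learned projections, so it is not a full Transformer layer. The function names and the tiny embeddings are hypothetical.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """For each position, mix all embeddings according to relevance scores."""
    dim = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Score the relevance of this position to every position in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        # The output embedding is the relevance-weighted sum of the inputs.
        output = [
            sum(w * value[d] for w, value in zip(weights, embeddings))
            for d in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three hypothetical tokens
outputs, attention = self_attention(embeddings)
print(attention[0])
```

The outer loop runs once per position, which mirrors the statement that a sequence of n tokens is transformed n separate times.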

self-supervised learning

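A family of techniques for converting an unsupervised machine learning problem into a supervised machine learning problem by creating surrogate labels from unlabeled examples.

Some Transformer-based models such as BERT use self-supervised learning.

Self-supervised training is a semi-supervised learning approach.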

self-training

A variant of self-supervised learning that is particularly useful when all of the following conditions are true:

The ratio of unlabeled examples to labeled examples in the dataset is high.
This is a classification problem.

Self-training works by iterating over the following two steps until the model stops improving:

Use supervised machine learning to train a model on the labeled examples.
Use the model created in Step 1 to generate predictions (labels) on the unlabeled examples, moving those in which there is high confidence into the labeled examples with the predicted label.

Notice that each iteration of Step 2 adds more labeled examples for Step 1 to train on.
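The two-step loop can be sketched with a toy model. Everything below is an assumption for illustration: the "model" is a one-dimensional nearest-centroid classifier, confidence is derived from the relative distance to the two nearest centroids, and the loop stops when no unlabeled example reaches the confidence cutoff (a stand-in for "the model stops improving").

```python
def train_centroids(labeled):
    """Step 1: supervised training; here, the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, confidence); assumes at least two classes."""
    (d0, y0), (d1, _) = sorted((abs(x - c), y) for y, c in centroids.items())[:2]
    confidence = d1 / (d0 + d1) if (d0 + d1) > 0 else 0.5
    return y0, confidence


def self_train(labeled, unlabeled, min_confidence=0.8, max_rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = train_centroids(labeled)           # Step 1
        confident, remaining = [], []
        for x in unlabeled:                             # Step 2
            label, confidence = predict(centroids, x)
            if confidence >= min_confidence:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break
        labeled.extend(confident)   # more labeled examples for the next Step 1
        unlabeled = remaining
    return train_centroids(labeled), labeled, unlabeled


labeled = [(0.0, 0), (10.0, 1)]            # hypothetical labeled examples
unlabeled = [1.0, 1.5, 9.0, 8.5, 5.0]      # hypothetical unlabeled examples
centroids, labeled_out, leftover = self_train(labeled, unlabeled)
print(len(labeled_out), leftover)
```

In this toy run, the ambiguous example halfway between the two classes never reaches the cutoff and stays unlabeled.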

semi-supervised learning

Training a model on data where some of the training examples have labels but others don't. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.

Self-training is one technique for semi-supervised learning.

sensitive attribute

#fairness

A human attribute that may be given special consideration for legal, ethical, social, or personal reasons.

sentiment analysis

#language

Using statistical or machine learning algorithms to determine a group's overall attitude (positive or negative) toward a service, product, organization, or topic. For example, using natural language understanding, an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the degree to which students generally liked or disliked the course.

sequence model

#seq

A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.

sequence-to-sequence task

#language

A task that converts an input sequence of tokens to an output sequence of tokens. For example, two popular kinds of sequence-to-sequence tasks are:

Translators:
- Sample input sequence: "I love you."
- Sample output sequence: "Je t'aime."
Question answering:
- Sample input sequence: "Do I need my car in New York City?"
- Sample output sequence: "No. Please keep your car at home."

serving

A synonym for inferring.

shape (Tensor)

The number of elements in each dimension of a tensor. The shape is represented as a list of integers. For example, the following two-dimensional tensor has a shape of [3,4]:

[[5, 7, 6, 4], [2, 9, 4, 8], [3, 6, 5, 1]]

TensorFlow uses row-major (C-style) format to represent the order of dimensions, which is why the shape in TensorFlow is [3,4] rather than [4,3]. In other words, in a two-dimensional TensorFlow Tensor, the shape is [number of rows, number of columns].
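The [number of rows, number of columns] convention can be checked with a plain Python nested list standing in for a tensor. The helper `shape` is hypothetical, the values are illustrative, and the sketch assumes the nesting is rectangular.

```python
def shape(nested):
    """Return the length of each dimension, outermost (rows) first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return dims


tensor = [[5, 7, 6, 4], [2, 9, 4, 8], [3, 6, 5, 1]]   # 3 rows, 4 columns
print(shape(tensor))   # [3, 4]
print(shape(7))        # [] (a scalar has no dimensions)
```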

shrinkage

#df

A hyperparameter in gradient boosting that controls overfitting. Shrinkage in gradient boosting is analogous to learning rate in gradient descent. Shrinkage is a decimal value between 0.0 and 1.0. A lower shrinkage value reduces overfitting more than a larger shrinkage value.

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

The sigmoid function has several uses in machine learning, including:

Converting the raw output of a logistic regression or multinomial regression model to a probability.
Acting as an activation function in some neural networks.

The sigmoid function over an input number x has the following formula:

$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

In machine learning, x is generally a weighted sum.
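A short sketch of the formula in Python follows. The branch for negative inputs is an implementation choice (not part of the formula) that avoids overflow in `math.exp` for very negative values; the two branches are algebraically equivalent.

```python
import math


def sigmoid(x):
    """Compute 1 / (1 + e^(-x)) without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # safe: x is negative, so z is between 0 and 1
    return z / (1.0 + z)


for value in (2, 1_000_000, -1_000_000_000, 0):
    print(value, sigmoid(value))
```

Every output stays within the range 0 to 1, however extreme the input.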

similarity measure

#clustering

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

size invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.

See also rotational invariance and translational invariance.

sketching

#clustering

In unsupervised machine learning, a category of algorithms that perform a preliminary similarity analysis on examples. Sketching algorithms use a locality-sensitive hash function to identify points that are likely to be similar, and then group them into buckets.

Sketching decreases the computation required for similarity calculations on large datasets. Instead of calculating similarity for every single pair of examples in the dataset, we calculate similarity only for each pair of points within each bucket.

softmax

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

| Image is a... | Probability |
| --- | --- |
| dog | .85 |
| cat | .13 |
| horse | .02 |

Softmax is also called full softmax.

Contrast with candidate sampling.

The softmax equation is as follows:

$$\sigma_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where:

$\sigma$ is the output vector, and $\sigma_i$ is its $i$-th element. Each element of the output vector specifies the probability of the corresponding class. The sum of all the elements in the output vector is 1.0. The output vector contains the same number of elements as the input vector, $z$.
$z$ is the input vector. Each element of the input vector contains a floating-point value.
$K$ is the number of elements in the input vector (and the output vector).

For example, suppose the input vector is:

[1.2, 2.5, 1.8]

Therefore, softmax calculates the denominator as follows:

$$\text{denominator} = e^{1.2} + e^{2.5} + e^{1.8} = 21.552$$

The softmax probability of each element is therefore:

$$\sigma_1 = \frac{e^{1.2}}{21.552} = 0.154$$
$$\sigma_2 = \frac{e^{2.5}}{21.552} = 0.565$$
$$\sigma_3 = \frac{e^{1.8}}{21.552} = 0.281$$

So, the output vector is therefore:

$$\sigma = [0.154, 0.565, 0.281]$$

The sum of the three elements in $\sigma$ is 1.0. Phew!
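The arithmetic above can be reproduced with a few lines of Python. This is a plain sketch of the softmax equation using only the standard library; the function name `softmax` is a hypothetical helper, not a particular library's API.

```python
import math


def softmax(z):
    """Exponentiate each element and divide by the sum of the exponentials."""
    exps = [math.exp(v) for v in z]
    denominator = sum(exps)
    return [e / denominator for e in exps]


probs = softmax([1.2, 2.5, 1.8])
rounded = [round(p, 3) for p in probs]
print(rounded)      # [0.154, 0.565, 0.281]
print(sum(probs))   # approximately 1.0
```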

sparse feature

#language

#fundamentals

A feature whose values are predominantly zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree. Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding. If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#language

#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0s (to represent the 35 tree species not in that example). So, the one-hot representation of maple would be a 36-element vector containing a single 1 and 35 0s.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

24

Notice that the sparse representation is much more compact than the one-hot representation.

Note: You shouldn't pass a sparse representation as a direct feature input to a model. Instead, you should convert the sparse representation into a one-hot representation before training on it.
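The relationship between the two representations can be sketched as a pair of conversions. The helper names are hypothetical; the numbers reuse the 36-species forest with the species of interest at position 24.

```python
def to_one_hot(position, size):
    """Build a vector of zeros with a single 1 at the given position."""
    vector = [0] * size
    vector[position] = 1
    return vector


def to_sparse(vector):
    """Keep only the positions of the nonzero elements."""
    return [i for i, value in enumerate(vector) if value != 0]


one_hot = to_one_hot(24, 36)   # 36 cells: a single 1 and 35 zeros
sparse = to_sparse(one_hot)    # just the position of the 1
print(len(one_hot), sparse)    # 36 [24]
```

Converting the sparse form back with `to_one_hot` yields the one-hot vector again.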

Suppose each example in your model must represent the words—but not the order of those words—in an English sentence. English consists of about 170,000 words, so English is a categorical feature with about 170,000 elements. Most English sentences use an extremely tiny fraction of those 170,000 words, so the set of words in a single example is almost certainly going to be sparse data.

Consider the following sentence:

My dog is a great dog

You could use a variant of one-hot vector to represent the words in this sentence. In this variant, multiple cells in the vector can contain a nonzero value. Furthermore, in this variant, a cell can contain an integer other than one. Although the words "my", "is", "a", and "great" appear only once in the sentence, the word "dog" appears twice. Using this variant of one-hot vectors to represent the words in this sentence yields the following 170,000-element vector:

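A sparse representation of the same sentence would simply list the position of each word that appears, paired with its count: one entry each for "my", "is", "a", and "great" (each with a count of 1) and one entry for "dog" (with a count of 2).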
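A word-count version of the idea can be sketched with `collections.Counter`. The five-word vocabulary and its index positions are a toy assumption standing in for a vocabulary of roughly 170,000 English words.

```python
from collections import Counter

# Hypothetical tiny vocabulary; a real one would hold many thousands of words.
vocabulary = sorted({"a", "dog", "great", "is", "my"})
word_index = {word: i for i, word in enumerate(vocabulary)}

sentence = "My dog is a great dog"
counts = Counter(word.lower() for word in sentence.split())

# Dense variant: one cell per vocabulary word, most of them zero in practice.
dense = [counts.get(word, 0) for word in vocabulary]

# Sparse representation: only the positions of nonzero cells, with their counts.
sparse = {word_index[word]: count for word, count in counts.items()}

print(dense)
print(sparse)
```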

The term "sparse representation" confuses a lot of people because sparse representation is itself not a sparse vector. Rather, sparse representation is actually a dense representation of a sparse vector. The synonym index representation is a little clearer than "sparse representation."

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity.

sparsity

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 100-element matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$$ {\text{sparsity}} = \frac{\text{98}} {\text{100}} = {\text{0.98}} $$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.
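The calculation can be sketched for a matrix stored as a list of rows. The helper name `sparsity` and the positions of the two nonzero cells are hypothetical.

```python
def sparsity(matrix):
    """Fraction of cells that are zero (or None) in a list-of-rows matrix."""
    cells = [value for row in matrix for value in row]
    empty = sum(1 for value in cells if value is None or value == 0)
    return empty / len(cells)


# A 100-element matrix in which 98 cells contain zero.
matrix = [[0] * 10 for _ in range(10)]
matrix[0][0] = 3
matrix[5][7] = 1

print(sparsity(matrix))   # 0.98
```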

spatial pooling

#image

See pooling.

split

#df

In a decision tree, another name for a condition.

splitter

#df

While training a decision tree, the routine (and algorithm) responsible for finding the best condition at each node.

squared hinge loss

The square of the hinge loss. Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals

Synonym for L2 loss.

staged training

#language

A tactic of training a model in a sequence of discrete stages. The goal can be either to speed up the training process, or to achieve better model quality.

An illustration of the progressive stacking approach is shown below:

Stage 1 contains 3 hidden layers, stage 2 contains 6 hidden layers, and stage 3 contains 12 hidden layers.
Stage 2 begins training with the weights learned in the 3 hidden layers of Stage 1. Stage 3 begins training with the weights learned in the 6 hidden layers of Stage 2.

See also pipelining.

state

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set.

stride

#image

In a convolutional operation or pooling, the delta in each dimension of the next series of input slices. For example, the following animation demonstrates a (1,1) stride during a convolutional operation. Therefore, the next input slice starts one position to the right of the previous input slice. When the operation reaches the right edge, the next slice is all the way over to the left but one position down.

The preceding example demonstrates a two-dimensional stride. If the input matrix is three-dimensional, the stride would also be three-dimensional.

structural risk minimization (SRM)

An algorithm that balances two goals:

The desire to build the most predictive model (for example, lowest loss).
The desire to keep the model as simple as possible (for example, strong regularization).

For example, a function that minimizes loss+regularization on the training set is a structural risk minimization algorithm.

Contrast with empirical risk minimization.

subsampling

#image

See pooling.

summary

#TensorFlow

In TensorFlow, a value or set of values calculated at a particular step, usually used for tracking model metrics during training.

supervised machine learning

#fundamentals

Training a model from features and their corresponding labels. Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

Bucketing a continuous feature into range bins.
Creating a feature cross.
Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
- ab
- a²
Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
- sin(c)
- ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.
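A few of these methods can be sketched in Python. The feature values, the dictionary keys, the bucket boundaries, and the helper `bucketize` are all hypothetical choices made for illustration.

```python
import bisect
import math

# Hypothetical input features for a single example.
a, b, c = 3.0, 4.0, 2.0

synthetic = {
    "ab": a * b,              # product of two features
    "a_squared": a ** 2,      # a feature multiplied by itself
    "sin_c": math.sin(c),     # transcendental functions of a feature
    "ln_c": math.log(c),
}


def bucketize(value, boundaries):
    """Return the index of the range bin that a continuous value falls into."""
    return bisect.bisect_right(boundaries, value)


age_bucket = bucketize(25, [18, 30, 65])   # bins: <18, 18-29, 30-64, 65+
print(synthetic, age_bucket)
```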

T

tabular Q-learning

#rl

In reinforcement learning, implementing Q-learning by using a table to store the Q-functions for every combination of state and action.

target

Synonym for label.

target network

#rl

test loss

#fundamentals

A metric representing a model's loss against the test set. When building a model, you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss.

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate.

test set

A subset of the dataset reserved for testing a trained model.

Traditionally, you divide examples in the dataset into the following three distinct subsets:

a training set
a validation set
a test set

Each example in a dataset should belong to only one of the preceding subsets. For instance, a single example should not belong to both the training set and the test set.
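One way to obtain three non-overlapping subsets is to shuffle once and slice. The 80/10/10 proportions, the fixed seed, and the function name `split_dataset` are assumptions for this sketch, not a prescribed recipe.

```python
import random


def split_dataset(examples, train_frac=0.8, validation_frac=0.1, seed=0):
    """Shuffle once, then slice into training, validation, and test subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]   # everything left over
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Because the slices never overlap, each example lands in exactly one subset.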

The training set and validation set are both closely tied to training a model. Because the test set is only indirectly associated with training, test loss is a less biased, higher quality metric than training loss or validation loss.

tf.Example

#TensorFlow

A standard protocol buffer for describing input data for machine learning model training or inference.

tf.keras

#TensorFlow

An implementation of Keras integrated into TensorFlow.

threshold (for decision trees)

#df

In an axis-aligned condition, the value that a feature is being compared against. For example, 75 is the threshold value in the following condition:

grade >= 75

This form of the term threshold is different from classification threshold.

time series analysis

#clustering

A subfield of machine learning and statistics that analyzes temporal data. Many types of machine learning problems require time series analysis, including classification, clustering, forecasting, and anomaly detection. For example, you could use time series analysis to forecast the future sales of winter coats by month based on historical sales data.

timestep

#seq

One "unrolled" cell within a recurrent neural network. For example, the following figure shows three timesteps (labeled with the subscripts t-1, t, and t+1):

token

#language

In a language model, the atomic unit that the model is training on and making predictions on. A token is typically one of the following:

a word—for example, the phrase "dogs like cats" consists of three word tokens: "dogs", "like", and "cats".
a character—for example, the phrase "bike fish" consists of nine character tokens. (Note that the blank space counts as one of the tokens.)
subwords—in which a single word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix. For example, a language model that uses subwords as tokens might view the word "dogs" as two tokens (the root word "dog" and the plural suffix "s"). That same language model might view the single word "taller" as two subwords (the root word "tall" and the suffix "er").
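The three kinds of tokens can be illustrated with plain string operations. The subword splitter below is a deliberately naive, hypothetical suffix rule; real subword tokenizers learn their vocabulary from data.

```python
# Word tokens: split on whitespace.
word_tokens = "dogs like cats".split()

# Character tokens: every character, including the blank space.
char_tokens = list("bike fish")


def naive_subwords(word, suffixes=("er", "s")):
    """Split off one known suffix, if present (illustrative rule only)."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], suffix]
    return [word]


print(word_tokens)               # ['dogs', 'like', 'cats']
print(len(char_tokens))          # 9
print(naive_subwords("dogs"))    # ['dog', 's']
print(naive_subwords("taller"))  # ['tall', 'er']
```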

training

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model. During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

training loss

#fundamentals

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error. Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss vs. the number of iterations. A loss curve provides the following hints about training:

A downward slope implies that the model is improving.
An upward slope implies that the model is getting worse.
A flat slope implies that the model has reached convergence.

For example, the following somewhat idealized loss curve shows:

A steep downward slope during the initial iterations, which implies rapid model improvement.
A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
A flat slope towards the end of training, which suggests convergence.

Although training loss is important, see also generalization.

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving.

training set

#fundamentals

The subset of the dataset used to train a model.

Traditionally, examples in the dataset are divided into the following three distinct subsets:

a training set
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example should not belong to both the training set and the validation set.

trajectory

#rl

In reinforcement learning, a sequence of tuples that represent a sequence of state transitions of the agent, where each tuple corresponds to the state, action, reward, and next state for a given state transition.

transfer learning

Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data.

Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.

Transformer

#language

A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. A Transformer can be viewed as a stack of self-attention layers.

A Transformer can include any of the following:

an encoder
a decoder
both an encoder and decoder

An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.

A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies the self-attention mechanism to gather information from it.

The blog post Transformer: A Novel Neural Network Architecture for Language Understanding provides a good introduction to Transformers.

translational invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.

See also size invariance and rotational invariance.

trigram

#seq

An N-gram in which N=3.

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

Training on the wrong set of features.
Training for too few epochs or at too low a learning rate.
Training with too high a regularization rate.
Providing too few hidden layers in a deep neural network.

undersampling

Removing examples from the majority class in a class-imbalanced dataset in order to create a more balanced training set.

For example, consider a dataset in which the ratio of the majority class to the minority class is 20:1. To overcome this class imbalance, you could create a training set consisting of all of the minority class examples but only a tenth of the majority class examples, which would create a training-set class ratio of 2:1. Thanks to undersampling, this more balanced training set might produce a better model. Alternatively, this more balanced training set might contain too few examples to train an effective model.
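The 20:1 to 2:1 example above can be sketched in a few lines of Python. This is only an illustration: the function name `undersample`, the dictionary-based example format, and the 2,000/100 class counts are assumptions made for the sketch.

```python
import random

def undersample(examples, majority_label, keep_fraction, seed=0):
    """Keep every minority example but only a random fraction of the majority class."""
    rng = random.Random(seed)
    majority = [ex for ex in examples if ex["label"] == majority_label]
    minority = [ex for ex in examples if ex["label"] != majority_label]
    kept = rng.sample(majority, round(len(majority) * keep_fraction))
    return kept + minority

# Hypothetical dataset with a 20:1 class ratio (2,000 majority, 100 minority).
dataset = ([{"label": "majority"} for _ in range(2000)]
           + [{"label": "minority"} for _ in range(100)])

# Keep a tenth of the majority class, giving a 2:1 ratio in the training set.
training_set = undersample(dataset, "majority", keep_fraction=0.1)
majority_count = sum(ex["label"] == "majority" for ex in training_set)
minority_count = len(training_set) - majority_count
```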

Contrast with oversampling.

unidirectional

#language

A system that only evaluates the text that precedes a target section of text. In contrast, a bidirectional system evaluates both the text that precedes and follows a target section of text. See bidirectional language model for more details.

unidirectional language model

#language

A language model that bases its probabilities only on the tokens appearing before, not after, the target token(s). Contrast with bidirectional language model.

unlabeled example

#fundamentals

An example that contains features but no label. For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

| Number of bedrooms | Number of bathrooms | House age |
|---|---|---|
| 3 | 2 | 15 |
| 2 | 1 | 72 |
| 4 | 2 | 34 |

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example.

unsupervised machine learning

#clustering

#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning.


Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

uplift modeling

A modeling technique, commonly used in marketing, that models the "causal effect" (also known as the "incremental impact") of a "treatment" on an "individual." Here are two examples:

Doctors might use uplift modeling to predict the mortality decrease (causal effect) of a medical procedure (treatment) depending on the age and medical history of a patient (individual).
Marketers might use uplift modeling to predict the increase in probability of a purchase (causal effect) due to an advertisement (treatment) on a person (individual).

Uplift modeling differs from classification or regression in that some labels (for example, half of the labels in binary treatments) are always missing in uplift modeling. For example, a patient can either receive or not receive a treatment; therefore, we can observe whether the patient heals in only one of these two situations (never both). The main advantage of an uplift model is that it can generate predictions for the unobserved situation (the counterfactual) and use them to compute the causal effect.
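To make the idea of a missing counterfactual concrete, here is a minimal sketch that estimates uplift per segment as the difference between the outcome rate of treated individuals and that of untreated individuals. This difference-in-means estimate is one simple approach chosen for illustration, not the method of any particular uplift library; it assumes treatment was assigned at random, and the function name, record format, and all numbers are made up.

```python
from collections import defaultdict

def estimate_uplift(records):
    """records: iterable of (segment, treated, outcome) tuples, outcome in {0, 1}.
    Returns the treated-minus-control outcome rate for each segment."""
    # For each segment, track [positive outcomes, individuals] per treatment group.
    totals = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, outcome in records:
        bucket = totals[segment][treated]
        bucket[0] += outcome
        bucket[1] += 1
    uplift = {}
    for segment, groups in totals.items():
        # Assumes every segment has at least one treated and one control individual.
        treated_rate = groups[True][0] / groups[True][1]
        control_rate = groups[False][0] / groups[False][1]
        uplift[segment] = treated_rate - control_rate
    return uplift

# Made-up experiment: each person is observed either with or without the ad, never both.
records = (
    [("young", True, 1)] * 30 + [("young", True, 0)] * 70
    + [("young", False, 1)] * 10 + [("young", False, 0)] * 90
    + [("older", True, 1)] * 20 + [("older", True, 0)] * 80
    + [("older", False, 1)] * 20 + [("older", False, 0)] * 80
)
uplift_by_segment = estimate_uplift(records)
```

In this made-up data the advertisement raises the purchase rate for the "young" segment and has no measurable effect on the "older" segment.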

upweighting

Applying a weight to the downsampled class equal to the factor by which you downsampled.
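A minimal sketch of the idea, assuming a hypothetical dataset of 2,000 negative and 100 positive examples and a downsampling factor of 10 (the function name and example format are also assumptions): each kept example of the downsampled class receives a weight of 10, so the weighted size of that class matches its original size.

```python
import random

def downsample_and_upweight(examples, majority_label, factor, seed=0):
    """Keep 1/factor of the majority class and give each kept example weight=factor."""
    rng = random.Random(seed)
    majority = [ex for ex in examples if ex["label"] == majority_label]
    minority = [ex for ex in examples if ex["label"] != majority_label]
    kept = rng.sample(majority, len(majority) // factor)
    weighted = [{**ex, "weight": float(factor)} for ex in kept]
    weighted += [{**ex, "weight": 1.0} for ex in minority]
    return weighted

# Hypothetical class-imbalanced dataset.
dataset = ([{"label": "negative"} for _ in range(2000)]
           + [{"label": "positive"} for _ in range(100)])
weighted_set = downsample_and_upweight(dataset, "negative", factor=10)

# The weights restore the original weighted size of the downsampled class.
effective_negatives = sum(ex["weight"] for ex in weighted_set
                          if ex["label"] == "negative")
```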

user matrix

#recsystems

In recommendation systems, an embedding generated by matrix factorization that holds latent signals about user preferences. Each row of the user matrix holds information about the relative strength of various latent signals for a single user. For example, consider a movie recommendation system. In this system, the latent signals in the user matrix might represent each user's interest in particular genres, or might be harder-to-interpret signals that involve complex interactions across multiple factors.

The user matrix has a column for each latent feature and a row for each user. That is, the user matrix has the same number of rows as the target matrix that is being factorized. For example, given a movie recommendation system for 1,000,000 users, the user matrix will have 1,000,000 rows.
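The shapes involved can be sketched with a tiny, made-up example: a user matrix with one row per user and one column per latent feature, multiplied by a hypothetical item matrix to reconstruct a users-by-movies matrix. All values and the name `predict_ratings` are assumptions for illustration.

```python
# 3 users x 2 latent features (one row per user).
user_matrix = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
# 2 latent features x 3 movies (hypothetical item matrix).
item_matrix = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]

def predict_ratings(user_matrix, item_matrix):
    """Multiply the user matrix by the item matrix to approximate the target matrix."""
    n_items = len(item_matrix[0])
    return [[sum(u * item_matrix[k][j] for k, u in enumerate(user_row))
             for j in range(n_items)]
            for user_row in user_matrix]

predicted = predict_ratings(user_matrix, item_matrix)
```

The reconstructed matrix has one row per user, matching the number of rows in the user matrix.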

V

validation

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set.

Because the validation set differs from the training set, validation helps guard against overfitting.

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve.

validation set

#fundamentals

The subset of the dataset that performs initial evaluation against a trained model. Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set.

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

a training set
a validation set
a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example should not belong to both the training set and the validation set.
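One simple way to obtain three non-overlapping subsets is to shuffle the examples once and slice the shuffled list. The sketch below assumes a 70/15/15 split and a hypothetical helper named `split_dataset`; neither the proportions nor the name comes from the definition above.

```python
import random

def split_dataset(examples, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then slice so that every example lands in exactly one subset."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    n_test = round(len(shuffled) * test_fraction)
    validation_set = shuffled[:n_validation]
    test_set = shuffled[n_validation:n_validation + n_test]
    training_set = shuffled[n_validation + n_test:]
    return training_set, validation_set, test_set

# Example IDs stand in for real examples.
training_set, validation_set, test_set = split_dataset(range(100))
```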

vanishing gradient problem

#seq

The tendency for the gradients of early hidden layers of some deep neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly smaller changes to the weights on nodes in a deep neural network, leading to little or no learning. Models suffering from the vanishing gradient problem become difficult or impossible to train. Long Short-Term Memory cells address this issue.

Compare to exploding gradient problem.
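A toy calculation shows why gradients can shrink with depth. The sketch below assumes a chain of one-neuron sigmoid layers that all share the same weight (a hypothetical network chosen for simplicity) and applies the chain rule: the gradient reaching the first layer is a product of per-layer factors, and because the sigmoid's derivative never exceeds 0.25, that product shrinks quickly as layers are added.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_chain(depth, weight=1.0, x=0.5):
    """Gradient of the final activation with respect to the input of a chain of
    `depth` one-neuron sigmoid layers (toy network assumed for illustration)."""
    activation = x
    grad = 1.0
    for _ in range(depth):
        z = weight * activation
        activation = sigmoid(z)
        # Chain rule: each layer contributes weight * sigmoid'(z).
        grad *= weight * activation * (1.0 - activation)
    return grad

shallow_gradient = gradient_through_chain(depth=1)
deep_gradient = gradient_through_chain(depth=20)
```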

variable importances

#df

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree that estimates house prices. Suppose this decision tree uses three features: size, age, and style. If the variable importances for the three features are calculated to be {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

W

Wasserstein loss

One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.

weight

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.


Imagine a linear model with two features. Suppose that training determines the following weights (and bias):

The bias, b, has a value of 2.2.
The weight w1, associated with one feature, is 1.5.
The weight w2, associated with the other feature, is 0.4.

Now imagine an example with the following feature values:

The value of one feature, x1, is 6.
The value of the other feature, x2, is 10.

This linear model uses the following formula to generate a prediction, y':

$$y' = b + w_1x_1 + w_2x_2$$

Therefore, the prediction is:

$$y' = 2.2 + (1.5)(6) + (0.4)(10) = 15.2$$

If a weight is 0, then the corresponding feature does not contribute to the model. For example, if w1 is 0, then the value of x1 is irrelevant.

Weighted Alternating Least Squares (WALS)

#recsystems

An algorithm for minimizing the objective function during matrix factorization in recommendation systems, which allows a downweighting of the missing examples. WALS minimizes the weighted squared error between the original matrix and the reconstruction by alternating between fixing the row factorization and column factorization. Each of these optimizations can be solved by least squares convex optimization. For details, see the Recommendation Systems course.
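The alternating structure can be sketched for the simplest possible case, a rank-1 factorization, where each least-squares subproblem has a closed-form solution. This is an illustrative toy, not the implementation from the Recommendation Systems course: the function names, the L2 regularization term, the choice to treat missing entries as 0 with weight `missing_weight`, and the example matrix are all assumptions.

```python
def wals_rank1(matrix, missing_weight=0.1, regularization=0.01, iterations=20):
    """Rank-1 weighted alternating least squares. `None` marks a missing entry,
    which is treated as 0 but downweighted to `missing_weight`."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    values = [[0.0 if x is None else float(x) for x in row] for row in matrix]
    weights = [[missing_weight if x is None else 1.0 for x in row] for row in matrix]
    row_factors = [1.0] * n_rows
    col_factors = [1.0] * n_cols
    for _ in range(iterations):
        # Fix the column factors; solve a weighted least-squares problem per row.
        for i in range(n_rows):
            num = sum(weights[i][j] * values[i][j] * col_factors[j] for j in range(n_cols))
            den = sum(weights[i][j] * col_factors[j] ** 2 for j in range(n_cols))
            row_factors[i] = num / (den + regularization)
        # Fix the row factors; solve a weighted least-squares problem per column.
        for j in range(n_cols):
            num = sum(weights[i][j] * values[i][j] * row_factors[i] for i in range(n_rows))
            den = sum(weights[i][j] * row_factors[i] ** 2 for i in range(n_rows))
            col_factors[j] = num / (den + regularization)
    return row_factors, col_factors

def objective(matrix, row_factors, col_factors, missing_weight=0.1, regularization=0.01):
    """Weighted squared reconstruction error plus the L2 penalty being minimized."""
    error = 0.0
    for i, row in enumerate(matrix):
        for j, x in enumerate(row):
            weight = missing_weight if x is None else 1.0
            value = 0.0 if x is None else float(x)
            error += weight * (value - row_factors[i] * col_factors[j]) ** 2
    penalty = regularization * (sum(u * u for u in row_factors)
                                + sum(v * v for v in col_factors))
    return error + penalty

# Made-up feedback matrix with two missing entries.
ratings = [[2, 4, None], [1, 2, 3], [None, 6, 9]]
row_factors, col_factors = wals_rank1(ratings)
```

Because each half-step exactly minimizes the objective over one set of factors while the other is held fixed, the objective never increases from one iteration to the next.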

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

| input value | input weight |
|---|---|
| 2 | -1.3 |
| -1 | 0.6 |
| 3 | 0.4 |

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function.
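The calculation above, followed by an activation function, can be sketched as follows. ReLU is used here only as an example activation function; the definition does not prescribe one.

```python
def weighted_sum(inputs, weights):
    """Sum of each input value multiplied by its corresponding weight."""
    return sum(x * w for x, w in zip(inputs, weights))

def relu(z):
    """Example activation function (an assumption for this sketch)."""
    return max(0.0, z)

total = weighted_sum([2, -1, 3], [-1.3, 0.6, 0.4])  # approximately -2.0
activation = relu(total)  # ReLU maps negative inputs to 0.0
```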

wide model

A linear model that typically has many sparse input features. We refer to it as "wide" since such a model is a special type of neural network with a large number of inputs that connect directly to the output node. Wide models are often easier to debug and inspect than deep models. Although wide models cannot express nonlinearities through hidden layers, wide models can use transformations such as feature crossing and bucketization to model nonlinearities in different ways.

Contrast with deep model.

width

The number of neurons in a particular layer of a neural network.

wisdom of the crowd

#df

The idea that averaging the opinions or estimates of a large group of people ("the crowd") often produces surprisingly good results. For example, consider a game in which people guess the number of jelly beans packed into a large jar. Although most individual guesses will be inaccurate, the average of all the guesses has been empirically shown to be surprisingly close to the actual number of jelly beans in the jar.

Ensembles are a software analog of wisdom of the crowd. Even if individual models make wildly inaccurate predictions, averaging the predictions of many models often generates surprisingly good predictions. For example, although an individual decision tree might make poor predictions, a decision forest often makes very good predictions.
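The averaging effect can be illustrated with a handful of made-up jelly bean guesses. The numbers below are invented for the sketch and are not empirical data; the point is only that the error of the averaged guess cannot exceed the average error of the individual guesses.

```python
def mean(values):
    return sum(values) / len(values)

actual = 1000  # hypothetical true number of jelly beans
guesses = [620, 1450, 880, 1210, 760, 1330, 940, 1090]  # made-up individual guesses

crowd_estimate = mean(guesses)
crowd_error = abs(crowd_estimate - actual)

# Average error of the individual guesses, for comparison.
typical_individual_error = mean([abs(g - actual) for g in guesses])
```

Averaging the predictions of several models in an ensemble follows the same pattern, with model predictions in place of guesses.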

word embedding

#language

Representing each word in a word set within an embedding vector; that is, representing each word as a vector of floating-point values between 0.0 and 1.0. Words with similar meanings have more-similar representations than words with different meanings. For example, carrots, celery, and cucumbers would all have relatively similar representations, which would be very different from the representations of airplane, sunglasses, and toothpaste.
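One common way to compare such representations is cosine similarity. In the sketch below, the three-dimensional vectors are hand-picked for illustration rather than learned by any model, and cosine similarity is just one possible similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings with values between 0.0 and 1.0.
embeddings = {
    "carrots": [0.9, 0.8, 0.1],
    "celery": [0.8, 0.9, 0.2],
    "airplane": [0.1, 0.2, 0.9],
}

vegetable_similarity = cosine_similarity(embeddings["carrots"], embeddings["celery"])
unrelated_similarity = cosine_similarity(embeddings["carrots"], embeddings["airplane"])
```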

Z

Z-score normalization

#fundamentals

A normalization technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

| Raw value | Z-score |
|---|---|
| 800 | 0 |
| 950 | +1.5 |
| 575 | -2.25 |

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
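The mapping in the table can be reproduced with a one-line formula, sketched below. The function name `z_score` is an assumption; the mean of 800 and the standard deviation of 100 come from the example above.

```python
def z_score(raw_value, mean, std_dev):
    """Number of standard deviations that raw_value lies from the mean."""
    return (raw_value - mean) / std_dev

z_scores = [z_score(value, mean=800, std_dev=100) for value in (800, 950, 575)]
# z_scores == [0.0, 1.5, -2.25]
```

In practice, the mean and standard deviation would be computed from the feature's values in the training set.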

