Machine learning is a complex and rapidly growing field, and it can be challenging to keep up with all the terminology. This glossary explains essential industry buzzwords and their definitions to help you get up to speed:
A/B testing is a type of experiment used to compare two variants of a product, feature, or change to see which is more effective. It is a common technique in machine learning, used to confirm that a new algorithm performs better than the old one or to determine which feature is more important for a model.
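As a simple illustration, here is a sketch of comparing two variants with a two-proportion z-test; the visitor and conversion counts are made up for the example.

```python
import math

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test: does variant B convert differently from A?"""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    return p_a, p_b, (p_b - p_a) / se

p_a, p_b, z = ab_test(conversions_a=120, visitors_a=2400,
                      conversions_b=156, visitors_b=2400)
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z-score: {z:.2f}")
# |z| > 1.96 roughly corresponds to significance at the 5% level
```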
Airflow is a workflow management system used to create and manage data pipelines. It helps automate data engineering tasks such as data pre-processing, cleaning, transformation, and loading into a data warehouse or an analytics platform. For example, you can use Airflow to schedule a job that cleans up your data every night or automatically loads new data into your analytics platform every day.
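To make that concrete, here is a minimal sketch of an Airflow DAG that runs a nightly clean-then-load job; the DAG name, task functions, and their contents are placeholders for your own logic.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def clean_data():
    print("Cleaning yesterday's data...")       # stand-in for real cleaning logic

def load_data():
    print("Loading cleaned data into the warehouse...")

with DAG(
    dag_id="nightly_data_cleanup",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",                  # run once per day
    catchup=False,
) as dag:
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    load = PythonOperator(task_id="load_data", python_callable=load_data)
    clean >> load                                # load only after cleaning succeeds
```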
In the context of machine learning, a batch is a collection of data processed as a single unit. For example, when you train a machine learning model, you typically feed it one batch of training data at a time. This contrasts with "online learning," which refers to processing data as it is received.
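A minimal sketch of splitting a training set into fixed-size batches; the data here is just a list of integers standing in for real training examples.

```python
def batches(data, batch_size):
    """Yield successive fixed-size batches from a list of training examples."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

training_data = list(range(10))          # stand-in for real training examples
for batch in batches(training_data, batch_size=4):
    print(batch)                         # [0, 1, 2, 3], [4, 5, 6, 7], [8, 9]
```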
CDC (change data capture) is a process that captures the changes made to data in a database. It can be used for replication, audit tracking, and data synchronization. For example, you could use CDC to keep a copy of your production data in a separate database for analysis or backup.
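Production CDC tools typically read the database's transaction log; as a toy sketch of the idea, the example below simply polls for rows whose `updated_at` timestamp is newer than the last sync. The table, columns, and data are all made up.

```python
import sqlite3
import time

def poll_changes(conn, last_seen):
    """Toy CDC: fetch rows modified since the last sync (real CDC tools
    read the transaction log instead of polling a timestamp column)."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_seen,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', ?)", (time.time(),))

changes = poll_changes(conn, last_seen=0)
print(changes)   # these rows would be replicated to the analytics copy
```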
dbt (data build tool) is used to build data models. It can be used for data transformation, cleansing, and augmentation. For example, you might use dbt to turn raw CSV exports into a clean table that can feed a machine learning model. This is often necessary because most machine learning algorithms expect data in a particular format.
Error and retry logic is a technique used to handle failures when processing data. It involves checking for errors, attempting to recover from them, and trying again if necessary. This is helpful for tasks such as data loading and pre-processing, where transient errors are common.
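Here is a small sketch of the pattern: a generic retry wrapper around a flaky data-loading step. The `flaky_load` function and its failure rate are invented purely for illustration.

```python
import random
import time

def with_retries(task, max_attempts=3, delay_seconds=1):
    """Run a task, retrying after a short pause if it raises an error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise                                   # give up after the final attempt
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(delay_seconds)

def flaky_load():
    # Stand-in for a data-loading step that fails intermittently
    if random.random() < 0.5:
        raise ConnectionError("warehouse temporarily unavailable")
    return "loaded 10,000 rows"

print(with_retries(flaky_load))
```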
Edge ML deployment is the process of deploying machine learning models to devices such as routers, switches, and gateways. It can be used for network monitoring, traffic analysis, and security. For example, you might use edge ML deployment to detect cyber-attacks or optimize network performance. This can be helpful because it allows you to process data closer to the source, improving performance and reducing latency.
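One common step in edge deployment is shrinking a trained model into a format that small devices can run, for example with TensorFlow Lite. The sketch below assumes a trained model has already been exported to a hypothetical `traffic_model/` SavedModel directory.

```python
import tensorflow as tf

# Convert a trained SavedModel into a compact TFLite model that can run
# on an edge device such as a gateway or router.
converter = tf.lite.TFLiteConverter.from_saved_model("traffic_model/")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]    # shrink the model for small devices
tflite_model = converter.convert()

with open("traffic_model.tflite", "wb") as f:
    f.write(tflite_model)                               # copy this file to the device
```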
Hashing is the process of mapping data into a fixed set of buckets. It is often used alongside machine learning algorithms, such as k-means clustering, to group similar data together. For example, you could use hashing to group customer records into buckets based on their age or location.
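A small sketch of bucketing values with a hash function; the customer locations and number of buckets are made up for the example.

```python
import hashlib

def bucket_for(value, num_buckets=4):
    """Map a value to a stable bucket number by hashing it."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_buckets

locations = ["Toronto", "Berlin", "Tokyo", "Toronto"]
for city in locations:
    print(city, "-> bucket", bucket_for(city))
# Identical values always land in the same bucket
```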
A hyperparameter is a parameter that is not learned by the model itself but that affects how the model is trained. For example, a hyperparameter might control the number of training iterations or the amount of data used for training. Hyperparameters are often tuned by hand to get the best results.
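Tuning can also be automated; here is a small sketch using scikit-learn's grid search to try a few hyperparameter values for a random forest. The particular grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameters are fixed before training starts; grid search tries each
# combination and keeps the one with the best cross-validated score.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```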
IoT Core refers to the core technologies used in the Internet of Things. These include devices such as sensors, actuators, and controllers that connect physical objects to the internet. IoT Core is often used in conjunction with machine learning algorithms to enable predictive maintenance and real-time analytics.
Machine learning operations (MLOps) is the practice of managing and optimizing the machine learning process. It includes tasks such as data preparation, model selection, tuning, and deployment. MLOps is important because it helps ensure that the machine learning process is efficient and effective. Read more about MLOps strategy here.
Model drift is the tendency of a machine learning model to become less accurate over time. This can happen for various reasons, such as changes in the data relationships, changes in the environment, or simply overfitting the training data. It’s important to monitor for model drift and take corrective action if necessary.
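One simple way to monitor for drift is to compare the distribution a feature had at training time with what the model is seeing in production. The sketch below uses a Kolmogorov-Smirnov test from SciPy on synthetic data where the live distribution has deliberately shifted.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # what the model saw
live_feature = rng.normal(loc=0.4, scale=1.0, size=5000)       # what it sees now

# The KS test flags when the live distribution has shifted away from the
# training distribution -- one simple drift signal among many.
stat, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic={stat:.3f}); consider retraining.")
```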
A pipeline is an infrastructure that surrounds a machine learning algorithm. Pipelines can be used for data pre-processing, feature extraction, model training, and prediction tasks. Pipelines are often implemented using a tool called a pipeline manager. This is helpful because it allows you to chain together multiple steps and easily repeat them.
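Many libraries ship their own pipeline manager; as one concrete sketch, scikit-learn's `Pipeline` chains a pre-processing step and a model into a single repeatable object. The dataset and steps here are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pre-processing and model training chained into one repeatable pipeline
pipeline = Pipeline([
    ("scale", StandardScaler()),                   # pre-processing step
    ("model", LogisticRegression(max_iter=200)),   # training / prediction step
])
pipeline.fit(X, y)
print(pipeline.predict(X[:3]))
```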
Pooling is a technique used to reduce the size of data. It involves combining multiple instances of a particular feature into a single instance. In neural networks, for example, a pooling layer downsamples a feature map by keeping the maximum or average of each small region. This can be helpful for tasks such as data pre-processing, where it reduces the amount of memory needed to store the data.
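A small NumPy sketch of 2x2 max pooling, which collapses each 2x2 block of a feature map into its largest value and halves the height and width; the feature map here is just a toy 4x4 array.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Collapse each 2x2 block of a feature map into its maximum value."""
    h, w = feature_map.shape
    return (feature_map[:h - h % 2, :w - w % 2]
            .reshape(h // 2, 2, w // 2, 2)
            .max(axis=(1, 3)))

feature_map = np.arange(16).reshape(4, 4)
print(max_pool_2x2(feature_map))   # 4x4 input -> 2x2 output
```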
Vertex AI is Google Cloud's machine learning platform for training and deploying machine learning models, covering the experimentation, deployment, management, and monitoring stages. It includes a variety of features, such as pre-built models, a drag-and-drop interface, and support for multiple programming languages. Vertex AI can be used for data pre-processing, model training, and predictions.
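As a rough sketch of what deployment looks like with the Vertex AI Python SDK, the example below uploads a trained model and deploys it to an endpoint for online predictions. The project ID, bucket path, display name, and prediction inputs are placeholders, and the serving container URI is only an example of the pre-built containers Google provides.

```python
from google.cloud import aiplatform

# Illustrative sketch only -- project, region, and model artifact path are placeholders.
aiplatform.init(project="my-gcp-project", location="us-central1")

# Upload a trained model and deploy it to an endpoint for online predictions
model = aiplatform.Model.upload(
    display_name="demand-forecast",
    artifact_uri="gs://my-bucket/model/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
endpoint = model.deploy(machine_type="n1-standard-2")
print(endpoint.predict(instances=[[1.2, 3.4, 5.6]]))
```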
While there is still much more to learn in the rapidly growing field of machine learning, these terms are an excellent start for anyone looking to wrap their head around the basics. As the industry continues to grow, so too will the vocabulary. To learn more about machine learning from the experts at Bitstrapped, check out our latest blog posts, which are regularly updated to help you stay ahead of the curve. Or, if you have a machine learning project you would like to explore, contact us.