This recipe-ingredient dataset contains about 1,000,000 carefully cleaned and preprocessed recipes. The underlying data comes from five different base datasets (see sources below) which were merged in order to create a more complete recipe collection. The main contribution here is that all recipes have been meticulously cleaned and standardized (see preprocessing section).
|# of recipes
|Average # of ingredients per recipe
|Most common ingredients
|Least common ingredients
crystal hot sauce,
Using the dataset
The dataset is available as a single .npz (compressed numpy binary) file that contains both the recipes and the ingredient index.
import numpy as np

# Read both arrays while the archive is open
with np.load('simplified-recipes-1M.npz') as data:
    recipes = data['recipes']
    ingredients = data['ingredients']
ingredients is just a numpy string array that contains all used cooking ingredients:
array(['salt', 'pepper', 'butter', ..., 'watercress leaves',
       'emerils essence', 'corn flakes cereal'])
Each recipe in recipes is a numpy array of ingredient indices. To get the string representation of a recipe, use numpy's array indexing notation:
In : recipes
Out : array([233, 2754, 42, 120, 560, 345, 150, 2081, 12, 21])
In : ingredients[recipes]
Out : array(['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto',
'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes'])
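As a minimal sketch of working with this layout, the following uses small stand-in arrays (the values are made up for illustration and are not taken from the dataset) to decode every recipe and count in how many recipes each ingredient occurs. It assumes recipes is an array of integer index arrays, as described above; decode_recipe and ingredient_counts are hypothetical helper names.

```python
from collections import Counter

import numpy as np

# Stand-in data with the same layout as the dataset (values are illustrative only).
ingredients = np.array(['salt', 'pepper', 'butter', 'milk', 'onion'])
recipes = np.empty(3, dtype=object)
recipes[0] = np.array([0, 2, 3])
recipes[1] = np.array([0, 1, 4])
recipes[2] = np.array([2, 4])


def decode_recipe(recipe, ingredients):
    """Map an array of ingredient indices to a list of ingredient names."""
    return ingredients[recipe].tolist()


def ingredient_counts(recipes, ingredients):
    """Count in how many recipes each ingredient name appears."""
    counts = Counter()
    for recipe in recipes:
        # np.unique guards against an index repeated within one recipe
        counts.update(ingredients[np.unique(recipe)].tolist())
    return counts


decoded = [decode_recipe(r, ingredients) for r in recipes]
counts = ingredient_counts(recipes, ingredients)
print(decoded[0])
print(counts.most_common(2))
```

With the real file, the same two helpers should work on the recipes and ingredients arrays loaded earlier, provided the arrays have the layout assumed here.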
Preprocessing & Cleaning
As mentioned above, the main contribution of this dataset is the work that went into cleaning the ingredient strings to make them usable for machine learning tasks.
The raw messy recipes from the base datasets mostly look like this:
['1 fennel bulb (sometimes called anise), stalks discarded, bulb cut\xa0into '
'1/2-inch dice, and feathery leaves reserved for garnish',
'1 onion, diced',
'2 tablespoons unsalted butter',
'2 medium russet (baking) potatoes',
'2 cups chicken broth',
'1 1/2 cups milk']
Since each ingredient string also contains sizes/units, special characters, instructions and other descriptions, the total number of unique ingredients was more than 1 million. This makes the dataset practically unusable for machine learning tasks due to data sparsity. Unless the neural network reads the actual string associated with each ingredient, it won't know that organic sweetened soy sauce is actually very similar to glutenfree soy sauce (since the network only sees the differing ingredient indices).
Even after removing non-alpha characters and units, the dataset contained tens of thousands of unique ingredient strings. Therefore I used two cleaned ingredient-list datasets and a number of other simplification approaches to further reduce the number of unique ingredients to about 16,000. Then I discarded all but the most common 3,500. The recipe above, now cleaned, looks like this:
['baking potatoes' 'butter' 'chicken' 'chicken broth' 'fennel'
'fennel bulb' 'garnish' 'leaves' 'milk' 'onion' 'potatoes'
As you can see, the strings are much simpler, only containing the ingredient names themselves. Furthermore, some ingredients like baking potatoes or unsalted butter have been split up into [ potatoes] and [ butter], eliminating the problem of data sparsity.
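The exact cleaning pipeline is not spelled out here, so the following is only a rough sketch of the first steps described above: dropping non-alphabetic characters and units, then keeping only the most common ingredient strings. The unit list, the clean_ingredient and build_vocabulary helpers, and the top_k value are all assumptions made for illustration. The actual dataset additionally relies on two cleaned ingredient-list datasets and further simplification steps that this sketch does not attempt to reproduce.

```python
import re
from collections import Counter

# Hypothetical, deliberately short unit/size list; a real pipeline would need many more.
UNITS = {'cup', 'cups', 'tablespoon', 'tablespoons', 'teaspoon', 'teaspoons',
         'inch', 'ounce', 'ounces', 'pound', 'pounds', 'medium', 'large', 'small'}


def clean_ingredient(raw):
    """Lowercase, then drop parentheticals, trailing instructions, digits and units."""
    text = raw.lower().replace('\xa0', ' ')
    text = re.sub(r'\([^)]*\)', ' ', text)   # remove "(baking)" style remarks
    text = text.split(',')[0]                # keep the part before the first comma
    text = re.sub(r'[^a-z ]', ' ', text)     # remove digits and special characters
    words = [w for w in text.split() if w not in UNITS]
    return ' '.join(words)


def build_vocabulary(recipes, top_k):
    """Return the top_k most frequent cleaned ingredient strings."""
    counts = Counter(clean_ingredient(raw) for recipe in recipes for raw in recipe)
    counts.pop('', None)  # ignore strings that were cleaned away entirely
    return [name for name, _ in counts.most_common(top_k)]


# Made-up raw recipes in the same style as the example above.
raw_recipes = [
    ['1 onion, diced', '2 tablespoons unsalted butter', '1 1/2 cups milk'],
    ['2 onions', '3 cups milk', '1 onion, sliced'],
]
vocabulary = build_vocabulary(raw_recipes, top_k=2)
cleaned = [[clean_ingredient(raw) for raw in recipe] for recipe in raw_recipes]
print(cleaned)
print(vocabulary)
```

Note that this sketch leaves onion and onions as separate strings and does not split unsalted butter any further, which is the kind of remaining sparsity the additional simplification steps are meant to address.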
The dataset includes a few errors caused by the simplification algorithms. For example, olive oil is sometimes incorrectly mapped to [ olive oil]. Additionally, some ingredients may be wrongly split up:
other, these can just be ignored.
You can download the dataset here:
Marin, Javier, et al. “Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, pp. 1–1. DOI.org (Crossref), doi:10.1109/TPAMI.2019.2927476.
Epicurious - Recipes with Rating and Nutrition
Yummly - Recipe Ingredients Dataset
Datafiniti - Food Ingredient Lists
Eight Portions - Recipe Box
Most of the original data comes from Yummly, Datafiniti, Epicurious and Allrecipes.