This recipe-ingredient dataset contains about 1,000,000 carefully cleaned and preprocessed recipes. The underlying data comes from five different base datasets (see sources below), which were merged to create a more complete recipe collection. The main contribution here is that all recipes have been meticulously cleaned and standardized (see the preprocessing section). I also trained a neural network on this dataset; see the results here.
# of recipes: 1,067,557
Total size: 62.2 MB
Average # of ingredients per recipe: 16.5
Most common ingredients: salt, pepper, butter, garlic, sugar, flour, onion
Least common ingredients: crystal hot sauce, watercress leaves, emerils essence
Example recipes:
['butter', 'cocoa', 'eggs', 'flour', 'sugar', 'white sugar']
['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto', 'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes']
['baby spinach', 'black pepper', 'cheese', 'eggs', 'feta cheese', 'grape tomatoes', 'kosher', 'kosher salt', 'large eggs', 'olive', 'olive oil', 'pepper', 'scallions', 'spinach', 'tomatoes']
Using the dataset
The dataset is available as a single .npz file (a compressed numpy binary) that contains both the recipes and the ingredient index.
import numpy as np

# note: if numpy refuses to load object arrays here, add allow_pickle=True to np.load
with np.load('simplified-recipes-1M.npz') as data:
    recipes = data['recipes']
    ingredients = data['ingredients']
ingredients is a plain numpy string array that contains every cooking ingredient used in the dataset:
array(['salt', 'pepper', 'butter', ..., 'watercress leaves',
'emerils essence', 'corn flakes cereal'])
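If you need to go the other way, from an ingredient name back to its index, one straightforward option is to build a reverse lookup dictionary (this helper is not part of the dataset itself, just a convenience):

ingredient_to_index = {name: i for i, name in enumerate(ingredients)}
ingredient_to_index['salt']  # 0 in this ordering, since 'salt' comes first in the array above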
Each recipe in recipes is a numpy array of ingredient indices:
In [2]: recipes[0]
Out[2]: array([233, 2754, 42, 120, 560, 345, 150, 2081, 12, 21])
To get the string representation of a recipe, use numpy's array indexing notation:
In [3]: ingredients[recipes[0]]
Out[3]: array(['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto',
       'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes'])
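Since each recipe is just a set of indices, a common way to feed the data to a model is to turn every recipe into a fixed-length multi-hot vector over the ingredient vocabulary. The snippet below is one possible way to do this; the function and variable names are illustrative, not part of the dataset:

import numpy as np

def to_multi_hot(recipe, n_ingredients):
    # binary vector with a 1 at every ingredient index contained in the recipe
    vec = np.zeros(n_ingredients, dtype=np.float32)
    vec[recipe] = 1.0
    return vec

# encode the first 1,000 recipes into a (1000, n_ingredients) matrix
X = np.stack([to_multi_hot(r, len(ingredients)) for r in recipes[:1000]])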
Preprocessing & Cleaning
As mentioned above, the main contribution of this dataset is the work that went into cleaning the ingredient strings to make them usable for machine learning tasks.
The raw messy recipes from the base datasets mostly look like this:
['1 fennel bulb (sometimes called anise), stalks discarded, bulb cut\xa0into '
'1/2-inch dice, and feathery leaves reserved for garnish',
'1 onion, diced',
'2 tablespoons unsalted butter',
'2 medium russet (baking) potatoes',
'2 cups chicken broth',
'1 1/2 cups milk']
Since each ingredient string also contains sizes/units, special characters, instructions, and other descriptions, the total number of unique ingredient strings was more than 1 million. This makes the dataset practically unusable for machine learning tasks due to data sparsity: unless the neural network reads the actual string associated with each ingredient, it cannot know that organic sweetened soy sauce is very similar to glutenfree soy sauce, because it only sees the two differing ingredient indices.
Even after removing non-alpha characters and units (a step sketched below), the dataset still contained tens of thousands of unique ingredient strings. I therefore used two cleaned ingredient-list datasets and a number of other simplification approaches to reduce the number of unique ingredients to about 16,000, and then discarded all but the 3,500 most common ones. The recipe above, now cleaned, looks like this:
['baking potatoes' 'butter' 'chicken' 'chicken broth' 'fennel'
'fennel bulb' 'garnish' 'leaves' 'milk' 'onion' 'potatoes'
'unsalted butter']
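To give a rough idea of what such a normalization step can look like, here is a minimal sketch; it is not the exact pipeline used to build the dataset, and the stop-word list is made up for illustration:

import re

STOP_WORDS = {'cup', 'cups', 'tablespoon', 'tablespoons', 'teaspoon', 'teaspoons',
              'medium', 'large', 'small', 'diced', 'chopped', 'discarded'}

def simplify(raw_ingredient):
    # keep only lowercase letters and spaces
    text = re.sub(r'[^a-z ]', ' ', raw_ingredient.lower())
    # drop unit and instruction words, collapse whitespace
    words = [w for w in text.split() if w not in STOP_WORDS]
    return ' '.join(words)

simplify('2 tablespoons unsalted butter')  # 'unsalted butter'
simplify('1 onion, diced')                 # 'onion'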
As you can see, the strings are much simpler and contain only the ingredient names themselves. Furthermore, some ingredients such as baking potatoes or unsalted butter have been split up into [baking potatoes, potatoes] and [unsalted butter, butter], which further reduces data sparsity.
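One way to picture this splitting step (again only a hypothetical sketch against a tiny vocabulary, not the actual algorithm used):

vocab = {'butter', 'unsalted butter', 'potatoes', 'baking potatoes'}

def expand(ingredient):
    # keep the full name plus every shorter vocabulary entry contained in it
    matches = {entry for entry in vocab if entry in ingredient}
    return sorted(matches)

expand('unsalted butter')   # ['butter', 'unsalted butter']
expand('baking potatoes')   # ['baking potatoes', 'potatoes']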
Dataset errors
The dataset includes a few errors caused by the simplification algorithms. For example, olive oil is sometimes incorrectly mapped to [olive, oil, olive oil]. Additionally, some ingredients may be wrongly split into fragments such as yellow, dry, prepared, or other; these can simply be ignored or filtered out, as sketched below.
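If these fragments bother you, they are easy to drop after loading the dataset; the fragment set below contains only the examples mentioned above and is not an official blacklist:

import numpy as np

NOISE = {'yellow', 'dry', 'prepared', 'other'}
noise_indices = {i for i, name in enumerate(ingredients) if name in NOISE}

# remove the noise indices from every recipe
filtered_recipes = [np.array([i for i in recipe if i not in noise_indices])
                    for recipe in recipes]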
Download
You can download the dataset here:
Dataset Sources
Recipe1M+
Marin, Javier, et al. “Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, pp. 1–1. doi:10.1109/TPAMI.2019.2927476.
Epicurious - Recipes with Rating and Nutrition
Yummly - Recipe Ingredients Dataset
Datafiniti - Food Ingredient Lists
Eight Portions - Recipe Box
Most of the original data comes from Yummly, Datafiniti, Epicurious, and Allrecipes.