This recipe-ingredient dataset contains about 1,000,000 carefully cleaned and preprocessed recipes. The underlying data comes from five different base datasets (see sources below) which were merged in order to create a more complete recipe collection. The main contribution here is that all recipes have been meticulously cleaned and standardized (see preprocessing section).

Number of recipes: 1,067,557
Total size: 62.2 MB
Average number of ingredients per recipe: 16.5
Most common ingredients: salt, pepper, butter, garlic, sugar, flour, onion
Least common ingredients: crystal hot sauce, watercress leaves, emerils essence

Example recipes

['butter','cocoa','eggs','flour','sugar','whitesugar']
['basilleaves','focaccia','leaves','mozzarella','pesto','plumtomatoes','rosemary','sandwiches','sliced','tomatoes']
['babyspinach','blackpepper','cheese','eggs','fetacheese','grapetomatoes','kosher','koshersalt','largeeggs','olive','oliveoil','pepper','scallions','spinach','tomatoes']

Using the dataset

The dataset is available as a single .npz file (a compressed NumPy archive) that contains both the recipes and the ingredient index:
import numpy as np

with np.load('simplified-recipes-1M.npz') as data:
    recipes = data['recipes']
    ingredients = data['ingredients']
ingredients is a NumPy string array that contains every cooking ingredient used in the dataset:
array(['salt', 'pepper', 'butter', ..., 'watercress leaves',
       'emerils essence', 'corn flakes cereal'])
Each recipe in recipes is a NumPy array of ingredient indices:
In [2]: recipes[0]
Out [2]: array([233, 2754, 42, 120, 560, 345, 150, 2081, 12, 21])
To get the string representation of a recipe, use NumPy's array indexing notation:
In [3]: ingredients[recipes[0]]
Out [3]: array(['basil leaves', 'focaccia', 'leaves', 'mozzarella', 'pesto',
                'plum tomatoes', 'rosemary', 'sandwiches', 'sliced', 'tomatoes'])
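For machine learning it is often handy to turn the index arrays into fixed-size vectors. The following is a minimal sketch and not part of the dataset's own tooling: `to_multi_hot` and `ingredient_counts` are hypothetical helper names, and the tiny `ingredients` and `recipes` arrays only stand in for the ones loaded from the .npz file. It assumes that `recipes` is an object array holding one integer index array per recipe, as described above.

```python
import numpy as np

# Toy stand-ins for the arrays loaded from the .npz file (assumption:
# `recipes` is an object array holding one integer index array per recipe).
ingredients = np.array(['salt', 'pepper', 'butter', 'garlic', 'sugar'])
recipes = np.empty(3, dtype=object)
recipes[0] = np.array([0, 2, 4])
recipes[1] = np.array([0, 1, 3])
recipes[2] = np.array([2, 4])


def to_multi_hot(recipes, n_ingredients):
    """Return an (n_recipes, n_ingredients) 0/1 matrix, one row per recipe."""
    matrix = np.zeros((len(recipes), n_ingredients), dtype=np.uint8)
    for row, recipe in enumerate(recipes):
        matrix[row, recipe] = 1  # set all ingredient columns of this recipe
    return matrix


def ingredient_counts(recipes, n_ingredients):
    """Count how often each ingredient index occurs across all recipes."""
    return np.bincount(np.concatenate(list(recipes)), minlength=n_ingredients)


multi_hot = to_multi_hot(recipes, len(ingredients))
counts = ingredient_counts(recipes, len(ingredients))
mean_length = np.mean([len(r) for r in recipes])
```

A dense matrix like this grows quickly with the full dataset, so a sparse representation is probably the better choice there. If `np.load` complains about pickled object arrays, passing `allow_pickle=True` may be needed; whether that applies depends on how the file was saved.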

Preprocessing & Cleaning

As mentioned above, the main contribution of this dataset is the work that went into cleaning the ingredient strings to make them usable for machine learning tasks. The raw messy recipes from the base datasets mostly look like this:
['1 fennel bulb (sometimes called anise), stalks discarded, bulb cut\xa0into '
 '1/2-inch dice, and feathery leaves reserved for garnish',
 '1 onion, diced',
 '2 tablespoons unsalted butter',
 '2 medium russet (baking) potatoes',
 '2 cups chicken broth',
 '1 1/2 cups milk']
Since each ingredient string also contains sizes and units, special characters, instructions and other descriptions, the total number of unique ingredients was more than 1 million. This makes the raw data practically unusable for machine learning tasks due to data sparsity. Unless a neural network reads the actual string associated with each ingredient, it won't know that organic sweetened soy sauce is actually very similar to gluten-free soy sauce, since the network only sees the differing ingredient indices.
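To make the sparsity problem concrete, here is a rough sketch of a first-pass normalisation that strips the sizes, units, special characters and trailing instructions mentioned above, then compares the number of distinct strings before and after. The `UNITS` set and the `normalize` function are illustrative assumptions, not the rules that were actually used to build this dataset.

```python
import re

# Illustrative unit list (assumption); a real pipeline would need far more.
UNITS = {'cup', 'cups', 'tablespoon', 'tablespoons', 'teaspoon', 'teaspoons',
         'inch', 'ounce', 'ounces', 'pound', 'pounds'}


def normalize(raw):
    """Crudely strip quantities, units and special characters from one string."""
    text = raw.lower().replace('\xa0', ' ')
    text = text.split(',')[0]               # drop trailing instructions
    text = re.sub(r'[^a-z ]', ' ', text)    # keep letters and spaces only
    words = [w for w in text.split() if w not in UNITS]
    return ' '.join(words)


raw_ingredients = [
    '1 onion, diced',
    '2 onions, finely chopped',
    '2 tablespoons unsalted butter',
    '1/2 cup unsalted butter, melted',
    '2 cups chicken broth',
]

cleaned = [normalize(s) for s in raw_ingredients]
print(len(set(raw_ingredients)), 'unique raw strings')
print(len(set(cleaned)), 'unique cleaned strings')
```

On this toy input the five raw strings collapse to four, because both butter lines become `unsalted butter`, while `onion` and `onions` stay distinct. Simple character and unit stripping therefore only goes part of the way.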

Even after removing non-alpha characters and units, the dataset still contained tens of thousands of unique ingredient strings. Therefore I used two cleaned ingredient-list datasets and a number of other simplification approaches to further reduce the number of unique ingredients to about 16,000. Then I discarded all but the 3,500 most common ones. The recipe above, now cleaned, looks like this:
['baking potatoes' 'butter' 'chicken' 'chicken broth' 'fennel'
  'fennel bulb' 'garnish' 'leaves' 'milk' 'onion' 'potatoes'
  'unsalted butter']
As you can see, the strings are much simpler and contain only the ingredient names themselves. Furthermore, some ingredients like baking potatoes or unsalted butter have been expanded into [baking potatoes, potatoes] and [unsalted butter, butter], so that recipes using either variant share the index of the base ingredient. This greatly reduces the problem of data sparsity.
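One way to obtain this kind of expansion is to match every word n-gram of a cleaned string against a fixed vocabulary of known ingredient names and keep all hits, then restrict the result to the most frequent entries. The sketch below is an illustration only: the `VOCAB` set and the functions `expand` and `keep_most_common` are hypothetical, and the simplification steps actually used for this dataset may differ.

```python
from collections import Counter

# Hypothetical vocabulary of known ingredient names (assumption).
VOCAB = {'potatoes', 'baking potatoes', 'butter', 'unsalted butter',
         'chicken', 'chicken broth', 'milk', 'onion'}


def expand(cleaned, vocab, max_words=3):
    """Return every vocabulary entry that occurs as a word n-gram in `cleaned`."""
    words = cleaned.split()
    hits = set()
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            phrase = ' '.join(words[start:start + size])
            if phrase in vocab:
                hits.add(phrase)
    return sorted(hits)


def keep_most_common(recipes, top_k):
    """Drop every ingredient that is not among the `top_k` most frequent ones."""
    counts = Counter(name for recipe in recipes for name in recipe)
    kept = {name for name, _ in counts.most_common(top_k)}
    return [[name for name in recipe if name in kept] for recipe in recipes]


lines = ['medium russet baking potatoes', 'unsalted butter',
         'chicken broth', 'milk']
recipe = sorted({name for line in lines for name in expand(line, VOCAB)})
print(recipe)
```

Matching all n-grams is what makes `unsalted butter` also yield `butter`. The same mechanism yields `chicken` from `chicken broth`, which hints at how over-eager splits can arise with this kind of approach.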

Dataset errors

The dataset includes a few errors caused by the simplification algorithms. For example, olive oil is sometimes incorrectly mapped to [olive, oil, olive oil]. Additionally, some descriptive words were wrongly split off as separate ingredients, for example yellow, dry, prepared and other; these can simply be ignored.
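If such stray tokens matter for your task, they can be filtered out after loading. This is a minimal sketch, assuming the tokens appear verbatim in `ingredients`; `NOISE` and `drop_noise` are hypothetical names, and the small arrays only stand in for the real data.

```python
import numpy as np

# Tokens the text lists as wrongly split; extend as needed (assumption).
NOISE = ['yellow', 'dry', 'prepared', 'other']

# Toy stand-ins for the arrays loaded from the .npz file.
ingredients = np.array(['salt', 'yellow', 'onion', 'dry', 'mustard'])
recipes = np.empty(2, dtype=object)
recipes[0] = np.array([0, 1, 2])
recipes[1] = np.array([3, 4])


def drop_noise(recipes, ingredients, noise):
    """Remove indices whose ingredient string is in `noise` from each recipe."""
    noise_idx = np.flatnonzero(np.isin(ingredients, noise))
    cleaned = np.empty(len(recipes), dtype=object)
    for i, recipe in enumerate(recipes):
        cleaned[i] = recipe[~np.isin(recipe, noise_idx)]
    return cleaned


filtered = drop_noise(recipes, ingredients, NOISE)
print([ingredients[r].tolist() for r in filtered])
```

The `ingredients` array itself is left unchanged, so the remaining indices stay valid.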

Download

You can download the dataset here:

Dataset Sources

Recipe1M+
Marin, Javier, et al. “Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, pp. 1–1. DOI.org (Crossref), doi:10.1109/TPAMI.2019.2927476.

Epicurious - Recipes with Rating and Nutrition

Yummly - Recipe Ingredients Dataset

Datafiniti - Food Ingredient Lists

Eight Portions - Recipe Box

Most of the original data comes from Yummly, Datafiniti, Epicurious and Allrecipes.