In 2011, James Dixon, then CTO of the business intelligence company Pentaho, coined the term "data lake." He described the data lake in contrast to the information silos typical of data marts, which were popular at the time:
If you think of a data mart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
Data lakes have evolved since then, and now compete with data warehouses for a share of big data storage and analytics. Various tools and products support faster SQL querying in data lakes, and all three major cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud) offer data lake storage and analytics. There's even the newer data lakehouse concept, which combines the governance, security, and analytics of a data warehouse with the affordable storage of a data lake. This article is a high dive into data lakes, including what they are, how they're used, and how to ensure your data lake does not become a data swamp.