For when your data set is dozens of .csv or .rda files
Python comes with the sqlite3 module out of the box, which provides a fast and efficient way to store medium to large data sets. If you're juggling dozens of CSVs or large JSON files, a SQL database is a great solution: everything lives in a single file, and you can query just the rows you need instead of loading everything into memory. I'll run through the basics needed to get started, using the iris data set.
First, import the libraries and load the iris data set...
import sqlite3
import pandas as pd

# Load the iris data set from GitHub into a pandas DataFrame
file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
iris_df = pd.read_csv(file_name)
iris_df.head()
Creating and connecting to a database use the same function: if the database already exists, connect() will connect to it, and if it does not, it will create it.
When you're first setting up your database, it's often useful to create an in-memory database rather than a physical file on your computer. This allows you to make quick and easy changes without having to constantly delete and re-create the database file. To do so, simply replace the file name in connect() with ":memory:".
# Create or connect to a database named 'Iris.db'
conn = sqlite3.connect('Iris.db')
# in memory database:
# conn = sqlite3.connect(':memory:')
Once connected, you need to create a cursor object, which is what you use to execute SQL statements and fetch results from the database.
c = conn.cursor()
In order to store data in the database we have to first create a table. Within the execute() method of our cursor object we pass a string of SQL code. Here, I create a table called iris_dimensions. In the parentheses following the CREATE TABLE command, I define the columns in my table as well as the data type of each column. Data can be stored in one of the following formats: NULL, INTEGER, REAL, TEXT, or BLOB.
In this instance I create real (floating point) columns for the dimension measurements, and a text column for the species of the flower.
After you make any change to the database, those changes need to be committed using the commit() method on your connection object.
c.execute("""CREATE TABLE iris_dimensions (
sepal_length real,
sepal_width real,
petal_length real,
petal_width real,
species text
)""")
conn.commit()
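One thing to watch out for: if you run this script again against an existing Iris.db, the CREATE TABLE statement will fail because the table already exists. SQLite's IF NOT EXISTS clause makes the statement safe to repeat, and the connection object can also serve as a context manager so the commit happens automatically. A minimal sketch combining both:
# Only creates the table if it's missing;
# the with block commits on success and rolls back on an exception
with conn:
    conn.execute("""CREATE TABLE IF NOT EXISTS iris_dimensions (
        sepal_length real,
        sepal_width real,
        petal_length real,
        petal_width real,
        species text
    )""")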
If your data is already in a pandas DataFrame, the to_sql() method writes it straight to your connected database. Here, if_exists='append' adds the rows to the existing iris_dimensions table rather than raising an error.
iris_df.to_sql('iris_dimensions', conn, index=False, if_exists='append')
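To confirm the write worked, a quick row count is a handy sanity check (using the same cursor as above; the full iris data set has 150 rows):
# Count how many rows landed in the table
c.execute("SELECT COUNT(*) FROM iris_dimensions")
print(c.fetchone()[0])  # 150 after the initial append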
Plain Python lists are also a convenient structure for writing data to a database, and they're the kind of sequence sqlite3's insert methods expect when you insert rows yourself. Let's say your data is structured as a list of lists, like so:
iris_list = iris_df.values.tolist()
iris_list[0:5]
To insert rows one at a time you can use a simple for loop. The question marks in the parentheses are placeholders that sqlite3 safely substitutes with the values from each row. Each inner list holds five values separated by commas, so we use five question marks separated by commas.
for row in iris_list:
    c.execute("INSERT INTO iris_dimensions VALUES (?,?,?,?,?)", row)
conn.commit()
Alternatively, if you're inserting a lot of data at once, it's much faster to use executemany() and pass the entire list as an argument. (Keep in mind that each of these insertion examples appends another full copy of the 150 rows to the table.)
c.executemany("INSERT INTO iris_dimensions VALUES(?,?,?,?,?)", iris_list)
conn.commit()
To pull data from the database, use the SELECT statement to specify which columns to pull, and the FROM clause to specify the table you want to pull from. Then use the fetchall() method on your cursor object to assign the results to an object.
c.execute("""SELECT sepal_length, sepal_width, species
FROM iris_dimensions
""")
iris = c.fetchall()
pd.DataFrame(iris).head()
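fetchall() returns plain tuples, so the DataFrame above ends up with numeric column headers. If you'd rather keep the column names, pandas can run the query for you; here's a sketch using read_sql_query() on the same connection:
# Let pandas run the query and return a labeled DataFrame
query_df = pd.read_sql_query(
    "SELECT sepal_length, sepal_width, species FROM iris_dimensions",
    conn
)
query_df.head()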
When pulling data you can set conditional parameters as well. The code below returns all rows where the species is virginica and the sepal length is greater than 5.
c.execute("""SELECT *
FROM iris_dimensions
WHERE species = 'virginica'
AND sepal_length > 5
""")
virginica = c.fetchall()
pd.DataFrame(virginica).head()
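When you're finished working with the database, close the connection so the file is released cleanly:
conn.commit()  # save any pending changes
conn.close()   # release the database file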
This guide covers the very basics for getting started with SQLite, but SQL is capable of doing much more! You can join or manipulate tables, run advanced text searches, or deploy it as the back end to a website. Read the documentation (https://www.sqlite.org/index.html) to get more familiar with the SQL language and capabilities.