Inserting Data into HBase with Python
For this little project, we are going to use the Happybase Python package. Happybase uses HBase’s Thrift API.
For our test, we are going to create a namespace and a table in HBase. We will do this in the HBase shell. To make things simple, our table is going to have only one column family - data
, and we are going to accept all defaults.
% hbase shell
hbase> create_namespace "sample_data"
hbase> create "sample_data:rfic", "data"
The data for this project was downloaded from data.indy.gov on 10 February 2016.
Batching
You can insert directly into a table with the Table#put()
function. However, I recommend using Batch#put()
instead. When the number of records reaches the batch_size
, Batch#send()
will be called. See the section on benchmarks for timing data.
When you’re done, be sure to call Batch#send()
manually, to flush any remaining records to the database.
Benchmarks
I ran the program for several batch sizes and averaged the results. As you can see from the time results, we can only go so fast - a little above 2 seconds - before we cannot insert any faster.
Batch (n) | Time (s) |
---|---|
1 | 10.227 |
100 | 3.533 |
1,000 | 2.091 |
2,000 | 2.141 |
5,000 | 2.081 |
10,000 | 2.190 |