Dealing with large data sets in Couchbase Views

The other day I ran into an issue while trying to extract data from Couchbase 2.0 into CSV files. The problem stemmed from our original document design: we never created a known, unique key schema for our documents, and instead let Couchbase assign a unique key to each one. In a future post I will expand on key schema designs for Couchbase and why I believe you should create a known, unique key value for all of your documents. Back to the issue at hand: I had no way to extract the documents I wanted other than by creating a view. While that isn't a big deal in itself, it cost me a couple of hours of troubleshooting when I went to extract the documents using the view, due to the size of our data. I wrote a Ruby script using the latest Couchbase gem (1.2.2 as of this writing) but ran into an out-of-memory error. Below is a screenshot of the error I was getting:

[Screenshot: Couchbase Ruby Client View Error]
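To make the key schema point concrete before we get to the script: with a known key format you can fetch documents directly by key, no view required. Here is a minimal sketch (the key format and values are hypothetical, and it assumes a connected Couchbase client object like the one in the script below):

# Hypothetical known-key schema: build the key from an attribute you already track
order_id = 12345
key = "order::#{order_id}"
client.set(key, { 'type' => 'order', 'total' => 99.95 })

# Later, no view needed -- fetch the document directly by its known key
order = client.get(key)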

Here is a sample of the code I was using that caused the problem. Bonus points if you can spot the error.

#!/usr/bin/env ruby
 
require 'rubygems'
require 'couchbase'
require 'json'
require 'csv'
 
# Couchbase Server IP
ip = 'localhost'
# Bucket name
bucket_name = 'Your_Bucket_Name'
# Design Doc Name
design_doc_name = 'Your_Doc_Name'
# Output Filename
output_filename = 'extracted_data_file.csv'
 
client = Couchbase.connect("http://#{ip}:8091/pools/default/buckets/#{bucket_name}")
 
document_count = 0
 
# Get all the existing items
design_doc = client.design_docs[design_doc_name]
# Replace "by_id" with the name of your view
view = design_doc.by_id 
 
view.each do |doc|
  doc_key = doc.key
 
  # if the document can't be retrieved or it fails to 
  # be converted to JSON just continue to the next document
  begin
    document = JSON.parse(client.get(doc_key))
  rescue
    next
  end
 
  # open the file for appending; it will be created if it doesn't exist
  CSV.open(output_filename, 'ab', {:col_sep => '|'}) do |csv|
    csv << document.keys if document_count == 0 # write the header row first
    csv << document.values # write document values to the csv file
  end
  
  document_count += 1
end
 
client.disconnect
 
puts "Total Documents Extracted = #{document_count}"

Hopefully you were able to spot the issue. It took me a little while to find the problem, but once I did everything worked great. The main issue is that the view I created was returning over 90MM document keys, which was causing the memory error. The Ruby client is supposed to stream the results, but it doesn't appear to do that reliably; otherwise the memory error shouldn't have been hit. To resolve the issue, change the view definition (line 24 of the script above, the view = design_doc.by_id line) to the following:

view = design_doc.by_id(:limit => 10000000, :stale => false)

One thing to note is that in my case I did not want stale results, so I added :stale => false; that property is optional. If stale results are acceptable in your output, just remove it. For those of you who don't know what the stale option does: :stale => false updates the view's index before the query is executed, thus including any newly inserted documents in the results. The other :stale options are ok and update_after. You can read more about these options in the Couchbase 2.0 documentation.
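For reference, here is how each option would look on the same query (a sketch; in the Ruby gem the accepted stale values are false, :ok, and :update_after):

# Update the index before running the query -- freshest results, slowest query
view = design_doc.by_id(:limit => 10000000, :stale => false)

# Query the index as-is without updating it -- fastest, may miss recent writes
view = design_doc.by_id(:limit => 10000000, :stale => :ok)

# Query the index as-is, then trigger an index update afterwards (the Couchbase 2.0 default)
view = design_doc.by_id(:limit => 10000000, :stale => :update_after)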

The main thing to focus on is the :limit property. You can experiment to find the maximum limit that works for you. In my case, 10MM turned out to be the magic value: any :limit higher than that brought back the out-of-memory error I was seeing before. Your results may differ, so you will have to determine the right limit for your data set.
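One caveat to be aware of: a single query with :limit only returns the first batch of rows, so with 90MM+ keys you still need to page through the view in chunks to extract everything. Here is a rough sketch of one way to do that using the :skip view parameter (the batch size is illustrative; :skip gets progressively slower at large offsets, so key-range paging via :startkey is worth investigating for views this size):

batch_size = 10000000
offset = 0

loop do
  rows_seen = 0
  view = design_doc.by_id(:limit => batch_size, :skip => offset, :stale => false)

  view.each do |doc|
    rows_seen += 1
    # ... fetch and write each document as in the script above ...
  end

  break if rows_seen == 0 # an empty batch means we've paged through the whole view
  offset += batch_size
end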

Hope this helps someone in the future; it took me a few hours to figure out how to solve this problem, and Googling didn't turn up much on the issue.