Overview

Big Data – What to do with it? (Part 2 of 2)

No Comments

Analyze it, of course!

Yes, but we’ll do it nice and easy – without getting our hands dirty!

Ok, maybe we get out hands just a little dirty 😉

This is the second part of my two-part blog series where I examine the possibilities for Big Data analytics offered by JasperReports. In the first part I described the general problems in analyzing Big Data and extracting useful information out of it and provided, among others, JasperReports as one tool suite to help in the tasks required.

JasperReports Ad Hoc View

Since we all know, how helpful those ‘Hello World’-type tutorials are, I will try to provide a more interesting one – although, it is difficult to make it difficult because working with the ad hoc editor is easy.

We are going to provide access to the data of LendingClub, encompassing data collected since 2007. LendingClub is a US-based peer-to-peer lending site.

As you will see, the data is unstructured and and quite big – in other words: ideal to be used in a Big Data tutorial.

The actual report could look like the following ones – since you are going to “play around” with the data, yours might look very different and allow for other interpretations:

Screen 78
The chart above shows the number of credits loaned by LendingClub for purpose of vacation since 2007, differenced by the home ownership state of the credit user. A little more than 800 credits (816, to be exact) were granted for people indicating that they would use the money to go on vacation; the vast majority of whom already had regular obligations to fulfill – rent or mortgage payments. Interesting…

Screen 79
This chart shows the number of credits loaned by LendingClub for purpose of vacation since 2007, differenced by address state of the credit user. The most credit users originate from California, which doesn’t come as a surprise: it is the most populated state of the U.S. However, it’s interesting to see that from New York, even though it’s ranked as the 27th most populated state, originate the second-most credit users asking LendingClub for a credit to go on vacation; seems like people from the state of New York are more desparate than people from other states to go on a vacation.

“Ingredients”

You’ll need the following software components installed and running in order to follow the tutorial:

  • MongoDB.
    We want to explore JasperReports connectivity to Big Data, after all. MongoDB is, in my opinion, one of those Big Data storages that are easier to get started with than the others available.
    You can download a version for your operating system of choice directly from their web site http://www.mongodb.org and follow these instructstions to install http://docs.mongodb.org/manual/installation/ (don’t worry, it’s easy).
    I employed version 2.4.3 while writing this tutorial.
  • JasperReports Server and iReport Designer
    We want to use the ad hoc report viewer offered by the professional edition of JasperReports Server. The 30-day trial edition of which can be downloaded from the following site:
    http://community.jaspersoft.com/download.
    The iReport Designer is essential in efficiently creating reports – even though we are not going to create a reportm we are going to make use of it to provide the initial query to the ad hoc report viewer (remember the caveat I mentioned in part one).
    You can access all required packages from the community download page. I employed version 5.1.0 of each one.

After you have installed (I chose to use the defaults wherever possible, keeping the installation as simple as possible) and started the software, we will continue by importing the data into the MongoDB instance.

“HowTo”

Step 1: The data

The raw data of lending club can be accessed for download on the pages of LendingClub Corporation:

https://www.lendingclub.com/info/download-data.action

The data is available in CSV format, which you could import into MongoDB using the command mongoimport.

However, four your convenience, I have provided a dump of the data I used which you can download from here (in 7-Zip-format):

https://public.centerdevice.de/f8a9c5a7-4fff-42d0-864b-ce3508cb39ea

After you have extraced it (will create directories dump/test) you can import this into your MongoDB instance using the following command
mongorestore <path to dump-directory>
which will restore a collection called ‘lendingclub ’ in the db ‘test’ of the MongoDB instance accessible on localhost using the default port 27017. If you do not want that to happen and, for example, use another db or collection, you’ll have to provide the corresponding parameters.

Please see http://docs.mongodb.org/manual/reference/program/mongorestore/ for further details.

In total, this collection has more than 120k documents. Let’s take a look at the data:
Open a shell to your MongoDB with the command mongo and issue the command db.omnibus.find().limit(3). The resulting output should look like this:

> db.lendingclub.find().limit(3)
{ "_id" : ObjectId("5195a7799b75445f56dee9c7"), "id" : 54734, "member_id" : 80364, "loan_amnt" : 25000, "funded_amnt" : 25000, "funded_amnt_inv" : 19419.94, "term" : " 36 months", "apr" : "13.62%", "int_rate" : " 11.89%", "installment" : 829.1, "grade" : "B", "sub_grade" : "B4", "emp_length" : "< 1 year", "home_ownership" : "RENT", "annual_inc" : 85000, "is_inc_v" : "TRUE", "accept_d" : "2009-07-26", "exp_d" : "2009-08-09", "list_d" : "2009-07-26", "issue_d" : "2009-08-05", "loan_status" : "Fully Paid", "pymnt_plan" : "n", "url" : "https://www.lendingclub.com/browse/loanDetail.action?loan_id=54734", "desc" : "Due to a lack of personal finance education and exposure to poor financing skills growing up, I was easy prey for credit predators. I am devoted to becoming debt-free and can assure my lenders that I will pay on-time every time. I have never missed a payment during the last 16 years that I have had credit. ", "purpose" : "debt_consolidation", "title" : "Debt consolidation for on-time payer", "addr_city" : "San Francisco", "addr_state" : "CA", "acc_now_delinq" : 0, "dti" : 19.48, "delinq_2yrs" : 0, "delinq_amnt" : 0, "earliest_cr_line" : "1994-02-15 10:39", "fico_range_low" : 735, "fico_range_high" : 739, "inq_last_6mths" : 0, "open_acc" : 10, "pub_rec" : 0, "revol_bal" : 28854, "revol_util" : "52.1%", "total_acc" : 42, "initial_list_status" : "f", "out_prncp" : 0, "out_prncp_inv" : 0, "total_pymnt" : 29324.32, "total_pymnt_inv" : 23345.15, "total_rec_prncp" : 25000, "total_rec_int" : 4324.32, "total_rec_late_fee" : 0, "last_pymnt_d" : "2011-10-14", "last_pymnt_amnt" :7392.08, "next_pymnt_d" : "null", "last_credit_pull_d" : "2012-08-28", "last_fico_range_high" : 789, "last_fico_range_low" : 785, "pub_rec_bankruptcies" : 0, "chargeoff_within_12_mths" : 0, "collections_12_mths_ex_med" : 0, "tax_liens" : 0 }

{ "_id" : ObjectId("5195a7799b75445f56dee9c8"), "id" : 55742, "member_id" : 114426, "loan_amnt" : 7000, "funded_amnt" : 7000, "funded_amnt_inv" : 6549.4, "term" : " 36 months", "apr" : "11.40%", "int_rate" : " 10.71%", "installment" : 228.22, "grade" : "B", "sub_grade" : "B5", "emp_name" : "CNN", "emp_length" : "< 1 year", "home_ownership" : "RENT", "annual_inc" : 65000, "is_inc_v" : "FALSE", "accept_d" : "2008-05-12", "exp_d" : "2008-05-26", "list_d" : "2008-05-12", "issue_d" : "2008-05-27", "loan_status" : "Fully Paid", "pymnt_plan" : "n", "url" : "https://www.lendingclub.com/browse/loanDetail.action?loan_id=55742", "desc" : "Just want to pay off the last bit of credit card debt at a better rate.", "purpose" : "credit_card", "title" : "Credit Card payoff", "addr_city" : "Brooklyn", "addr_state" : "NY", "acc_now_delinq" : 0, "dti" : 14.29, "delinq_2yrs" : 0, "delinq_amnt" : 0, "earliest_cr_line" : "2000-10-21 12:30", "fico_range_low" : 705, "fico_range_high" : 709, "inq_last_6mths" :0, "open_acc" : 7, "pub_rec" : 0, "revol_bal" : 33623, "revol_util" : "76.7%", "total_acc" : 7, "initial_list_status" : "f", "out_prncp" : 0, "out_prncp_inv" : 0, "total_pymnt" : 8215.45, "total_pymnt_inv" : 7758.8, "total_rec_prncp" : 7000, "total_rec_int" : 1215.45, "total_rec_late_fee" : 0, "last_pymnt_d" : "2011-06-02", "last_pymnt_amnt" : 228.48, "next_pymnt_d" : "null", "last_credit_pull_d": "2012-08-06", "last_fico_range_high" : 679, "last_fico_range_low" : 675, "pub_rec_bankruptcies" : 0, "chargeoff_within_12_mths" : 0, "collections_12_mths_ex_med" : 0, "tax_liens" : 0 } 

{ "_id" : ObjectId("5195a7799b75445f56dee9c9"), "id" : 57416, "member_id" : 139635, "loan_amnt" : 10800, "funded_amnt" : 10800, "funded_amnt_inv" : 10637.65, "term" : " 36 months", "apr" : "16.17%", "int_rate" : " 13.57%", "installment" : 366.86, "grade" : "C", "sub_grade" : "C3", "emp_name" : "State Farm Insurance", "emp_length" : "6 years", "home_ownership" : "RENT", "annual_inc" : 32000, "is_inc_v" : "FALSE", "accept_d" : "2009-11-04", "exp_d" : "2009-11-18", "list_d" : "2009-11-04", "issue_d" : "2009-11-12", "loan_status" : "Fully Paid", "pymnt_plan" : "n", "url" : "https://www.lendingclub.com/browse/loanDetail.action?loan_id=57416", "desc" : "I currently have a personal loan with Citifinancial that I have a high interest rate on I need 7000 to pay this off.  I also have 3 other creidit cards I would like to pay off with this loan to get this into one easy payment.  139635 added on 11/04/09 > Having one monthly payment will be a lot easier instead of making multiple payments to different companies.  I have paid all of my bills on time", "purpose" : "debt_consolidation", "title" : "Nicolechr1978", "addr_city" : "Waterbury", "addr_state" : "CT", "acc_now_delinq" : 0, "dti" : 11.63, "delinq_2yrs" : 0, "delinq_amnt" : 0, "earliest_cr_line" : "1996-12-12 07:47", "fico_range_low" : 680, "fico_range_high" : 684, "inq_last_6mths" : 1, "mths_since_last_delinq" : 58, "open_acc": 14, "pub_rec" : 0, "revol_bal" : 3511, "revol_util" : "25.6%", "total_acc" : 40, "initial_list_status" : "f", "out_prncp" : 0, "out_prncp_inv" : 0, "total_pymnt" : 13195.27, "total_pymnt_inv" : 13009.3, "total_rec_prncp" : 10800.01, "total_rec_int" : 2395.26, "total_rec_late_fee" : 0, "last_pymnt_d" : "2012-11-23", "last_pymnt_amnt" : 398, "next_pymnt_d" : "null", "last_credit_pull_d" : "2012-12-21", "last_fico_range_high" : 674, "last_fico_range_low" : 670, "pub_rec_bankruptcies" : 0, "chargeoff_within_12_mths" : 0, "collections_12_mths_ex_med" : 0, "tax_liens" : 0 }

As you can see, lending club stores a lot of information on its members – I hope that the data has been thoroughly anonymized.

You should, however, get the impression that the documents may vary in structure by looking at the example – and indeed, they do. In the collection, we have documents with only 2 fields stored as well as documents containing up to 102 fields.

On a side note, LendingClub provides graphical views on some statistics generated out of the data which are accessible here:

https://www.lendingclub.com/info/statistics.action

Of course, renderings like those can be generated with JasperReports – however, I don’t know if LendingClub is actually using JasperReports.

Step 2: The Topic

We will dynamically explore the data made available to create a report showing whatever we come up with. To do that, we need to provide the MongoDB-datasource and -query with which to obtain the data to JasperReports Server.

If you already followed my JasperReports tutorial, then you can skip ahead to Step 2.4: The query.

Let’s continue by starting iReport Designer (MongoDB should be kept running).

Step 2.1: The datasource

In general, the first step in creating a JasperReport with iReport is creating and/or selecting a datasource to use.
Click on the “Report datasource”-button (marked by red borders in the screenshot):

Screen 4

This opens a dialogue -window in which you can modify and select one of the available datasources, delete one or create a new one.

Click on the “New” button to do the latter.

Select “MongoDB Connection” from the list of available connection types and click on “Next”.

In the following dialogue you have to provide a name for the datasource as well as the actual connection data. If, during the installation of MongoDB, you chose to use the defaults (like me), all you have to enter for connection data is host (i.e. ‘localhost’) and db-name (i.e. ‘test’):

Screen 5

Hit “Test” to check if the connection works and then “Save” to store the datasource. Return to the main screen and make sure that the datasource you just created is the active one:
Screen 6

Step 2.2: iReport to JasperServer

Now, let’s connect iReport to the JasperServer, so that working with it becomes possible.

In iReport, click on the Repository Navigator (if it’s not available select “Window” -> “JasperReports Server Repository”) followed by the “Add new server”-button:

Screen 38

In the dialogue that opens, we’ll need to provide the data required by iReport to successfully connect to the server:

  • ID: an arbitrary name to identify the connection. This will be used in the Repository Navigator.
  • URL: the URL to reach the REST-API interface of the server. For you, the defaults should be fine. I had to install the server on another computer because my evaluation license expired while writing this tutorial, so I could no longer use localhost 😉
  • Organization: enter ‘organization_1’.
  • Username/Password: enter ‘jasperadmin’ in both fields.

Screen 39

Click “Save” and then check if the connection works by expanding the newly created server-node in the Repository Navigator:

Screen 40

Step 2.3: The datasource (again?)

Yes, again: since we want to use the ad hoc view to explore the data, the server needs to get access to our MongoDB-instance.

In the Repository Navigator, right-click on the “Data Sources” node , select “Add” followed by “Data Source”. This will open the following dialogue:

Screen 42

You’ll have to provide an ID and a name. The ID should not contain whitespaces or special characters because it is used in the URI for the REST-API and if you’d wanted to access it, you’d have to URL-encode those characters.

The name is used in the Repository Navigator, so can contain any character you like.

The second tab called “Data Source Details” contains the actual connection information. Just enter the same as you did when configuring the datasource for iReport:

Screen 43

Click “Save” to store the datasource.

Step 2.4: The query

As for the query, in order to provide it to the ad-hoc report creator, we need to configure a so-called Topic. Topics are reports (that is: jrxml-files) that provide the query necessary to access the data to work with; any band-configurations defined in them are ignored.

In iReport, select “File” -> “New” and click “Launch Report Wizard”:

In the opened dialogue, first provide a name for the report (i.e. ‘lending_club’) and check if the storage location is as desired. Click “Next” to continue.

In this next screen, provide the query to use:

Screen 59

We are going to work with the whole collection. Click “Next” to let iReport explore the available fields (make sure MongoDB is running):

Screen 60

Because this collection offers so many fields, we are not going to select all of them, but limit our selection to the more interesting ones – it is always possible to change the selection in the report later by opening the query editor.

For a (more or less) complete listing of the fields available and their purpose, please refer to the data glossary provided by LendingClub on their web site:

http://www.lendingclub.com/kb/index.php?View=entry&EntryID=253

The meaning of the ones I selected is:

  • addr_state: The address state provided by the borrower during loan application.
  • annual_inc: The annual income provided by the borrower during registration.
  • apr: The APR (annual percentage rate) to be paid by the borrower.
  • grade: LendingClub assigned loan grade.
  • home_ownership: The home ownership status provided by the borrower during registration. Possible values are: RENT, OWN, MORTGAGE, OTHER.
  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
  • loan_status: Current status of the loan.
  • int_rate: nominal interest rate on the loan.
  • purpose: A category provided by the borrower for the loan request. Values are: DEBT_CONSOLIDATION, MEDICAL, HOME_IMPROVEMENT, RENEWABLE_ENERGY, SMALL_BUSINESS, WEDDING, VACATION, MOVING, HOUSE, CAR, MAJOR_PURCHASE, OTHER.

Click “Next” to continue and skip the “Group by…”-screen. Hit “Finish” to close the wizard.

That’s it. We are ready to store this report as a Topic in JasperReports Server:

Open the Repository Navigator. Expand the node “Ad Hoc Components”, Right-click “Topics”, select “Add” and choose “JasperServer Report”.

This opens a wizard dialogue:

  • On the first screen, provide an ID and a name – I used ‘lendingclub’ and ‘LendingClub’ respectively.
  • Next, provide the report file to use. Click “Get source from current opened report” and continue.
  • Finally, provide the datasource to use. Select “From the repository” and choose our previously defined MongoDB datasource.

Click “Finish” to store the Topic.

Step 3: The report

Creating the Topic was comparatively easy, wasn’t it?

Well, creating the report is not getting difficult, either.

Log in to JasperReports Server web frontend with user “jasperadmin”.

On the welcome screen, select “Create Ad-Hoc Report” (“Ad-hoc-Ansicht erstellen” in german):

Screen 61

In the following dialogue, select the Topic we created and click “Chart” (“Diagramm in german):

Screen 62

Now, JasperReports server will pre-load the data returned by the query into its in-memory cache. By default, 200k documents are loaded in this manner so this may take a bit.

You will be presented with the following view:

Screen 63

The panels are:

  • 1. The selectable standard fields
  • 2. The selectable measures. Measures contain summarized values. They are typically numeric fields that determine the length of bars, size of pie slices, location of points (in line charts), and height of areas.
  • 3. Display of the currently selected fields functioning as row and column.
  • 4. The main area showing the chart.
  • 5. Filter and data selection pane.

Next, I’m going to show you how to work with these panels and hopefully, you’ll understand what their respective purpose is.

Step 3.1: The measure

Without a measure on the report, JasperReports server will not display anything. Therefore, select the field “loan_amnt” and drag it unto the canvas. This will result in the rendering of one bar of a bar chart showing the total sum of all loans taken since 2007 – personally, I find the amount of more than 1.6 billion us dollar quite impressive considering that we are talking about peer-to-peer lending:

Screen 64

When you hover with your mouse pointer over a bar, you’ll be shown the exact value in a tooltip.

Ok, before we add another field, let’s first have a look at what you can do with just this one measure alone.

Changing the summary function

By default, a measure is always shown as the total sum of its values. Of course, you are not limited to that!

Right-click on “loan_amnt” in the display of the currently selected fields (area 3) and select “change summary function”:

Screen 65

In the sub-menu that opens you’ll get to select from different functions that are self-explanatory. I invite you to play around with them, before we continue.

Filtering

By default, a report will take all values of the selected fields into account. It is possible to define a filter for a field (any field!) to change that.

Right-click on “loan_amnt” in the selectable measures area (area 2) and select “create filter”:

Screen 66

This will create an input control in the filter pane (area 5) allowing you to select the range of values to display – that is: the loan amount needs to be between the selected values to be taken into account.

Change filter operator

The fun does not stop there!

You can change the operator the filter is using.

When you click the following symbol, you’ll open a menu, allowing you to remove it or to display the operator selection:

Screen 67
Screen 68

Play around with those to test their effect on the chart.

Change filter expression

When you create more than one filter, they are by default taken all into account (AND-operator).

Let’s check that out first!

Create a filter for “grade” and select “A” in its input control:

Screen 69

As you can see, the value of the bar changes even though “grade” has not been selected as a field to display, yet.

Now, change the upper bound of the “loan_amnt”-filter to 10,000. This will result in another change to the total value of the bar – the filters are both taking effect.

If you do not want the filters to be concatenated using AND you can change that by clicking the following symbol and selecting “Show Expression”:

Screen 70

When you click the “Edit” button the following dialogue appears:

Screen 71

In the field, you can define your expression using AND, OR, NOT and parenthesis – watch out: even if you are using a non-english JasperReports Server (like me) you need to use the English keywords in this expression.

Custom fields

Last, but not least, it is possible to define custom fields by applying mathematical formulas to a view’s existing numeric fields. A custom field can be created out of one measure or multiple measures. Select a measure (or use CTRL-click to select multiple measures) then right-click and choose “Create Custom Field” in the context menu:

Screen 73

In the then opening menu, you’ll be able to select from basic functions (single or multiple fields) or advanced functions (single fields only) that should be self-explanatory.
After you’ve defined a custom field you can use it like any other measure — you can even define a new custom field using it. This allows to create complex calculations!

Step 3.2: The fields

As useful as the measures are in themselves, combining them with the fields thereby grouping the corresponding values is where it gets interesting.

Let’s check it out!
Select the measure “loan_amnt” as column and the field “purpose” as row (happens automatically when you double-click on them). And….

Screen 74

…nothing happened!
What’s that supposed to mean?

Well, since calculating groupings can become very complex and time-consuming, these calculations are only done when you specifically tell JasperReports Server to – simply selecting fields is not enough; this safeguards against time-consuming mistakes (e.g. accidently selecting the wrong field for grouping).

To actually enable the grouping you’ll need to drag the corresponding slider in the filter and data selection pane (area 5) to the right:

Screen 75

This will result in the following report:

Screen 76

Zooming

How much money has been lent on purpose of vacation?!
Let’s take a closer look…

You can zoom into areas of the chart by “opening” a rectangle of the desired region – click and hold the left mouse button, then drag to open the rectangle:

Screen 761

when you release the mouse-button, the chart will zoom in on this regon:

Screen 77

Nearly 5 million dollar lent for vacation purposes – I don’t know what fascinates me more: that there are people willing to indebt themselves to go on vacation OR that there are people willing to lend them the money to do so!

Click “Reset Zoom” to return the view to normal.

Step 3.3: Store it

That’s it, we’re done!
That is: you now know the most important features available to you to explore the data and create useful reports.

In order to let others make use of your work you can either export your view directly into a format others can use (e.g. PDF) or you can store the view as is into the repository or you can create a “normal” JasperReport out of it and store it in the repository as well.

In general, you would most likely store your ad-hoc view in the repository (so that your work is safe) and then create an export for your target audience. Creating and storing a normal JasperReport built from an ad-hoc view is intended for those cases where, for example, reports are needed on a regular basis or when you are working on huge amounts of data and creation of the view would take too long (you can schedule jobs to execute reports in JasperReports Server).

In any case, you can store the ad hoc view and/or the report by selecting one of the corresponding options available by opening the following menu:
Screen 80

Likewise, exports are created by selecting one of the following options:
Screen 81

Summary

Even though working with the ad hoc report creator is easy, you have to be careful so as not to slow working speed down to an unbearable crawl. In my experience, I have found it best practice to first limit the amount of data returned by the query and only, when I have a view created, increase the amount to include all data (if possible) or store a normal report and schedule it for execution overnight.

Of course, other strategies to limit the amount of data you’d have to work with are utilization of a data warehouse, or regular map-reduce-jobs that pre-filter the data into smaller collections, or a combination thereof.

That’s it from me. I hope, you’ve found my posts worthwhile and if you want to dig deeper into the ins and outs of JasperReports, I suggest reading the various ultimate guides available on the following JasperSoft site:
http://community.jaspersoft.com/documentation.

Thank’s for your time!

Comment

Your email address will not be published. Required fields are marked *