XWiki Big Data with Elastic Search

May 09 2016

Update Nov 2016: we published an extension to integrate Elasticsearch and Kibana4 into XWiki.

Almost two years ago, the XWiki open source community needed to learn how many XWiki users it actually had and to understand its user base a bit better. To do this, we decided to implement a big data system that collects information about XWiki open source installs, all anonymously of course.

We decided to use Elastic Search, which is a great technology for storing a lot of information and building reports from that data. But we decided to push this a little further and integrate it with the XWiki software itself. The reason is that while it's important to be able to build complex reports, it is even more important to give context to these reports and make them available to our team. This is what the integration with XWiki allowed us to do.

In this article we'll open the box to show how this works, the advantages of using XWiki to display the data, and some improvements XWiki SAS has made to query the data.

Architecture

This is the architecture of the system:

activeinstall-architecture.png

Installed XWiki instances send a ping every day to the xwiki.org Elastic Search instance. We can then query the data using Kibana, or using the XWiki integration on xwiki.org or on the XWiki Intranet, which itself runs XWiki. Each ping stored in Elastic Search provides the following information:

  • An anonymous unique ID for the instance which will stay the same after upgrades
  • The date of the first install
  • The current date
  • The Database System
  • The Operating System
  • The version of XWiki
  • The Java version
  • The Servlet Container (name and version)
  • Information about which extensions are installed

For more information you can check the XWiki extension entry on xwiki.org
You can also look at the Elastic Search Macro.
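
To make the list of fields above concrete, here is a minimal sketch of what one daily ping document could look like, written in Python purely for illustration. Only `instanceId`, `distributionId` and `distributionVersion` appear in the query shown later in this article; every other field name, all the sample values, and the `build_ping` helper itself are hypothetical and are not the actual index schema.

```python
import json
from datetime import date


def build_ping(instance_id, first_install, today):
    """Build one illustrative ping document (most field names are assumptions)."""
    return {
        # Field names that appear in the Elastic Search query in this article.
        "instanceId": instance_id,            # anonymous ID, stable across upgrades
        "distributionId": "example-distribution",
        "distributionVersion": "7.4",
        # Hypothetical field names for the remaining items of the list.
        "firstInstallDate": first_install.isoformat(),
        "pingDate": today.isoformat(),
        "databaseName": "MySQL",
        "osName": "Linux",
        "javaVersion": "1.8",
        "servletContainerName": "Tomcat",
        "servletContainerVersion": "8.0",
        "extensions": ["example:pdf-viewer", "example:calendar"],
    }


if __name__ == "__main__":
    ping = build_ping("anonymous-id-1", date(2016, 4, 1), date(2016, 4, 11))
    print(json.dumps(ping, indent=2))
```

Because the instance ID stays the same after upgrades, counting distinct IDs over a time window counts installations rather than pings.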

Reports

With this data we are able to run reports. These reports are available live on the XWiki.org web site.

The first reports we built count the actual installations:

activeinstalls-counts.png

Counting only started with XWiki 5.x, so we expect the real number of XWiki installations to be higher; installations behind a firewall that cannot contact the outside are of course not counted either. Also, as you can see, there are two counts: starting with XWiki 6.x we deployed a new Elastic Search index that gives us more details beyond the count.

With this we can calculate the Active Installs growth over time:

activeinstalls-history.png

We can also see which operating system, database and Java version are used, and which XWiki version is actually deployed:

activeinstalls-byversionandtype.png

Explanations

Here is how this works:

  • An API in an XWiki Java module lets us send a query to the Elastic Search instance
  • Once Elastic Search has computed the data, an XWiki table is generated from the result
  • The XWiki table is then passed to the XWiki chart macro, which uses JFreeChart to draw the graphs
  • For the counter we use a JS library to display it. A little bonus is that the counter is updated using Ajax in near real time

Here is an example of a script that is used to generate the table:

{{include reference="ActiveInstalls.QueryVariables"/}}
{{velocity}}
#set ($query = "
{
  ""query"": {
      ""query_string"": {
          ""lowercase_expanded_terms"": false,
          ""query"": ""_timestamp:[now-1d TO *] $snapshotQueryFragment AND distributionId:\""$distributionId\""""
       }
   },
  ""aggs"": {
      ""versions"": {
          ""terms"": {
              ""field"": ""distributionVersion"",
              ""size"": 100
            },
           ""aggs"": {
               ""instanceId_count"": {
                   ""cardinality"": {
                       ""field"": ""instanceId""
                    }
                }
            }
       }
   }
}
")
#set ($result = $services.activeinstalls.searchInstalls("install2", $query, {"search_type": "count"}))
#set ($headingLine = "|=XWiki Version|=Active Installs Count")
#if (!$result)
 #set ($errorText = $!services.activeinstalls.lastError)
  {{error}}{{{$errorText}}}{{/error}}

 ## Still generate an empty table so that the chart macro using it doesn't fail!
 $headingLine
#else
 ## Generate table with data
 #set ($jsonItems = $result.getAsJsonObject('aggregations').getAsJsonObject('versions').getAsJsonArray('buckets'))
 $headingLine
 #set ($count = 0)
 #foreach ($jsonItem in $jsonItems)
   #set ($value = $jsonItem.getAsJsonObject('instanceId_count').getAsJsonPrimitive('value').getAsLong())
   #if ($velocityCount <= 10)
     #set ($version = $jsonItem.getAsJsonPrimitive('key').getAsString())
     |$version|$value
   #else
     #set ($count = $mathtool.add($count, $value))
   #end
 #end
 #if ($count > 0)
   |Other|$count
 #end
#end
{{/velocity}}
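
The table-building part of the script above keeps the first ten version buckets and folds the remaining ones into a single "Other" row. The same logic can be sketched outside XWiki in Python. This is only an illustration: the function names are hypothetical, and the input is assumed to be the `buckets` array of the `versions` aggregation, already parsed from JSON.

```python
def top_versions_with_other(buckets, top_n=10):
    """Keep the first top_n buckets and sum the rest into one 'Other' row."""
    rows = []
    other = 0
    # Start at 1 to mirror the 1-based $velocityCount used in the script.
    for index, bucket in enumerate(buckets, start=1):
        value = int(bucket["instanceId_count"]["value"])
        if index <= top_n:
            rows.append((bucket["key"], value))
        else:
            other += value
    if other > 0:
        rows.append(("Other", other))
    return rows


def to_xwiki_table(rows):
    """Render the rows as XWiki table syntax, heading line first."""
    lines = ["|=XWiki Version|=Active Installs Count"]
    lines.extend("|%s|%d" % (label, count) for label, count in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up response fragment with the same shape as the aggregation result.
    response = {"aggregations": {"versions": {"buckets": [
        {"key": "7.4.2", "instanceId_count": {"value": 120}},
        {"key": "7.1.1", "instanceId_count": {"value": 80}},
    ]}}}
    buckets = response["aggregations"]["versions"]["buckets"]
    print(to_xwiki_table(top_versions_with_other(buckets)))
```

Note that a terms aggregation orders its buckets by document count by default, so the rows that are kept are the versions with the most ping documents in the queried window.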

Then here is the script used to chart the table:

{{velocity}}
{{chart type="pie" source="xdom" params="document:ActiveInstalls.XWikiVersionsData;range:B2-B.;series:columns;pie_label_format:{0} - {2} ({1})"/}}
{{/velocity}}

Combining queries to measure installation retention

We wanted to go further and try to understand how many XWiki testers actually convert to users, so that we can improve our software and/or our documentation to increase the number of users. This work was done by XWiki SAS on our Intranet. Of course, measuring is only the first step: changes will have to be implemented, and additional tools (like surveys) will be needed to improve the conversion rate.

For this we built some custom Elastic Search queries that measure, by day, week or month, the number of installs that initially happened and then the number of those installs that are still active after a set number of days (10, 30 or 90, for example). The query can also be restricted to a specific version of XWiki.

Of course the data is not fully precise: a user who installs XWiki could be a developer who only plans to test a specific version before upgrading a production install, or, even better, a user who has tested in one place might decide to deploy to a production server. But as with all statistics, the value lies less in the absolute numbers than in the changes we can see in the numbers we measure.

Here is a sample report from our intranet that measures the number of installs still active after 10 days:

activeinstall-retention.png

In order to give some context to this data we have added on top of the report the dates of the XWiki releases, so that we can try to see a change at a specific date of release:

  • XWiki 8.0: March 15th 2016 
  • XWiki 7.4: December 28th 2015
  • XWiki 7.3: November 11th 2015
  • XWiki 7.2: September 24th 2015
  • XWiki 7.1: June 15th 2015
  • XWiki 7.0: March 30th 2015
  • XWiki 6.4: January 21st 2015
  • XWiki 6.3: November 12th 2014
  • XWiki 6.2: September 18th 2014

One of the things we see is that in January 2016, just after the 7.4 release, which is our end-of-cycle Long Term Stable release, the rate jumped to 36%, the highest we have ever had. We can also see that from February 2016 our rate started to grow and stayed higher. This matches the 7.4 Long Term release, as well as the fact that the 7.x cycle brought usability improvements and Nested Pages, which make XWiki unique compared to its competitors Confluence and MediaWiki. The active install numbers confirm that these improvements have been welcomed by our users. To confirm this further, we can restrict the reports to a specific major version and measure the average rate of the 6.x, 7.x and 8.x cycles:

  • 6.x: 24.55%
  • 7.x: 29.41%
  • 8.x: 28.47% (only starting)

We can see a significant increase in our retention rate, which is confirmed when we extend the delay to 30, 90 or 180 days. We can also see that it's in the first days that most users decide whether or not to continue using XWiki: the higher the 10-day rate, the higher the long-term rate at 180 or 360 days. This gives us a tool to directly measure future improvements to the product and their effect on our users' decisions. We look forward to making more improvements to XWiki that will be even more useful to our users.

This report could not be done using a single query in Kibana, as it combines data from two queries. We used XWiki Groovy scripting to merge the data from the two queries and present them in one report. We also calculated the average rate, filtering out incomplete data (periods for which we have the install numbers but not the installs after the set number of days, because the delay has not yet passed).
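
The merge described above can be sketched as follows. This is a Python illustration under stated assumptions, not the Groovy script used on the intranet: the function names, the input shapes (one dictionary per query, keyed by a period label), and the sample numbers are all hypothetical. It also assumes that the average is the unweighted mean of the per-period rates, which the report may compute differently.

```python
from datetime import date, timedelta


def merge_retention(installs, still_active, period_end, delay_days, today):
    """Merge two per-period query results into (label, installs, active, rate) rows.

    A period is skipped while incomplete, that is, when its last day plus
    the retention delay has not yet passed.
    """
    rows = []
    for label in sorted(installs):
        if period_end[label] + timedelta(days=delay_days) > today:
            continue  # the delay has not yet passed for this period
        count = installs[label]
        if count == 0:
            continue  # avoid dividing by zero for empty periods
        active = still_active.get(label, 0)
        rows.append((label, count, active, 100.0 * active / count))
    return rows


def average_rate(rows):
    """Unweighted mean of the per-period rates, or None if there are no rows."""
    if not rows:
        return None
    return sum(row[3] for row in rows) / len(rows)


if __name__ == "__main__":
    # Made-up numbers standing in for the results of the two queries.
    installs = {"2016-01": 200, "2016-02": 300, "2016-04": 100}
    active = {"2016-01": 50, "2016-02": 90, "2016-04": 40}
    ends = {"2016-01": date(2016, 1, 31), "2016-02": date(2016, 2, 29),
            "2016-04": date(2016, 4, 30)}
    rows = merge_retention(installs, active, ends, 10, date(2016, 4, 20))
    for row in rows:
        print("%s: %d installs, %d still active, %.2f%%" % row)
    print("average: %.2f%%" % average_rate(rows))
```

In this sample the April period is left out of the table and of the average, because its 10-day delay has not yet passed on the chosen date.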

There is one more thing we learned from this report. In April we will beat our record for the absolute number of installs still active after 10 days: the number is already 211 with 10 days still to go, while our highest number so far is 245. We see XWiki growing as a contender to Confluence and MediaWiki in the wiki space, and to SharePoint or Drupal for building complex collaborative applications. If you don't know XWiki, visit http://www.xwiki.org to try it out or take a look at http://www.xwiki.com to see what you can build with XWiki.

Measuring extensions usage

The final report we created measures the installs of extensions from our extension repository:

activeinstall-extensions.png

This allows us to see that our users are very interested in the PDF Viewer, the LDAP module (which is not in this report as platform modules are filtered), Calendar and File Manager modules. We plan to use this data in the future in the Extension Manager to show the popular extensions.
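
As a rough illustration of how such a popularity ranking could be computed, the Python sketch below counts how many instances report each extension and filters out platform modules. Everything here is an assumption made for the example: the `extensions` field name, the extension identifiers, the idea that platform modules can be recognised by an `org.xwiki.platform:` prefix, and the function name. The input is assumed to hold one ping per instance.

```python
from collections import Counter


def count_extension_installs(pings, platform_prefixes=("org.xwiki.platform:",)):
    """Count instances per extension, most installed first, skipping platform modules."""
    counts = Counter()
    for ping in pings:
        # A set guards against one ping listing the same extension twice.
        for extension_id in set(ping.get("extensions", [])):
            if extension_id.startswith(platform_prefixes):
                continue  # platform modules are filtered out of the report
            counts[extension_id] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Made-up pings, one per instance.
    pings = [
        {"instanceId": "a", "extensions": ["example:pdf-viewer", "org.xwiki.platform:ldap"]},
        {"instanceId": "b", "extensions": ["example:pdf-viewer", "example:calendar"]},
        {"instanceId": "c", "extensions": ["example:pdf-viewer", "example:calendar",
                                           "example:file-manager"]},
    ]
    for extension_id, installs in count_extension_installs(pings):
        print("%s: %d" % (extension_id, installs))
```

The prefix filter could be made a user choice, so that a reader can decide whether platform extensions appear in the report.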

Advantages of the XWiki integration with Elastic Search

What is great about using XWiki to display the reports is that we have complete freedom to mix the reports with text, which lets us give context to the data. Having worked in statistics in the past, I know that numbers always need explanations and that you need to be able to show these explanations next to the data. A report with just a graph or data will be misunderstood, and bad decisions can result from this.

Using the XWiki integration you can have a tree of pages displaying important data that's needed to make decisions. It is also possible to mix data from other tools when the data is not available in the main tool where you are storing the data. For example we have been measuring some items using Apache logs and we have been able to show them in the XWiki pages next to the Elastic Search data.

Finally, as we have seen, the retention report was complex enough to require programming and could not be done with the query tools alone. The flexibility of Groovy scripting combined with XWiki's presentation capabilities translates into a great tool for displaying the data to our team, which helps them make better decisions.

Thanks to the XWiki APIs, we can build things like the realtime counter and we can even let the user make choices when displaying the data. She can filter the platform extensions from the extension report or choose the number of days for the retention report and we can play with the UI with the full capabilities of HTML.

If your company has a big data project, XWiki SAS can help you, not only to display your data in XWiki but also to build the right model for retrieving the data that is meaningful for your company.

Ludovic Dubost
XWiki Founder and XWiki SAS CEO