Bringing features and labels views to Shogun

2018/07/28

Tags: GSoC Shogun

This week I’m happy to introduce a new addition to Shogun: features and labels views. In machine learning, operating on a subset of features or labels is common, especially in recursive tree algorithms. A view is a new instance of CFeatures or CLabels, or of one of their subclasses, that exposes only a subset of the data while sharing the underlying data with the original instance. It is implemented as a global template function that accepts a pointer or a smart pointer to a viewable instance, together with an index vector.

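template <typename T>
T* view(T* viewable, const SGVector<index_t>& index)
{
    // Shallow copy: the new instance shares its data with the original.
    auto result = viewable->duplicate();
    // Restrict the copy, not the original, to the given indices.
    result->add_subset(index);
    return result;
}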

template <typename T>
Some<T> view(Some<T> viewable, const SGVector<index_t>& index);

Using templates makes it possible to create views for any viewable class, as long as it supports the duplicate and subset functionality. duplicate is a virtual function that creates a shallow copy of the instance by calling the constructor of its class. Note that it must be virtual so that we can create an instance of the right subclass when given only a pointer to the base class.
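To make the interplay of duplicate and the subset concrete, here is a minimal, self-contained sketch. It does not use Shogun: ToyFeatures, ToyDenseFeatures, shares_data_with and the std::vector-based index are hypothetical stand-ins, the toy keeps a single composed subset instead of a real subset stack, and ownership is handled with raw owning pointers for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a viewable class; this is not Shogun's CFeatures.
class ToyFeatures
{
public:
    explicit ToyFeatures(std::vector<double> values)
        : m_data(std::make_shared<std::vector<double>>(std::move(values)))
    {
    }
    ToyFeatures(const ToyFeatures&) = default;
    virtual ~ToyFeatures() = default;

    // Shallow copy: the copy constructor copies the shared_ptr, not the data.
    virtual ToyFeatures* duplicate() const { return new ToyFeatures(*this); }

    // Restrict the visible elements; indices refer to what is visible now.
    void add_subset(const std::vector<std::size_t>& index)
    {
        std::vector<std::size_t> mapped;
        mapped.reserve(index.size());
        for (std::size_t i : index)
            mapped.push_back(map_index(i));
        m_subset = std::move(mapped);
        m_has_subset = true;
    }

    std::size_t size() const
    {
        return m_has_subset ? m_subset.size() : m_data->size();
    }

    double get(std::size_t i) const { return (*m_data)[map_index(i)]; }

    bool shares_data_with(const ToyFeatures& other) const
    {
        return m_data == other.m_data;
    }

private:
    // Translate a visible index into an index into the shared buffer.
    std::size_t map_index(std::size_t i) const
    {
        assert(i < size());
        return m_has_subset ? m_subset[i] : i;
    }

    std::shared_ptr<const std::vector<double>> m_data;
    std::vector<std::size_t> m_subset;
    bool m_has_subset = false;
};

class ToyDenseFeatures : public ToyFeatures
{
public:
    using ToyFeatures::ToyFeatures;

    // Covariant return type: duplicating always yields this subclass.
    ToyDenseFeatures* duplicate() const override
    {
        return new ToyDenseFeatures(*this);
    }
};

// Same shape as the view function above, with std::vector as the index type.
template <typename T>
T* view(T* viewable, const std::vector<std::size_t>& index)
{
    auto result = viewable->duplicate();
    result->add_subset(index);
    return result;
}
```

In this sketch, calling view on a ToyDenseFeatures* gives back a ToyDenseFeatures* that sees only the selected elements, while the original keeps its full size and both share one buffer. Calling it through a ToyFeatures* that points at a ToyDenseFeatures still creates a ToyDenseFeatures, because duplicate is virtual.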

Motivation
Why do we need views? Before views, we used subsets directly: given an instance of features or labels, we called add_subset to push an index vector onto the subset stack. After that, data outside the subset is hidden from the original instance as if it had never been there. Then, after some processing, we called remove_subset to restore the previous state.
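The mutate-and-restore pattern described above can be sketched roughly as follows. ToyLabels and sum_of_subset are hypothetical names made up for this illustration, and the stack handling is a simplified assumption rather than Shogun's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in with a subset stack; this is not Shogun's CLabels.
class ToyLabels
{
public:
    explicit ToyLabels(std::vector<int> values) : m_data(std::move(values)) {}

    // Push an index vector; indices refer to the currently visible elements.
    void add_subset(const std::vector<std::size_t>& index)
    {
        std::vector<std::size_t> mapped;
        mapped.reserve(index.size());
        for (std::size_t i : index)
            mapped.push_back(map_index(i));
        m_stack.push_back(std::move(mapped));
    }

    // Pop the most recent subset, restoring the previous state.
    void remove_subset()
    {
        assert(!m_stack.empty());
        m_stack.pop_back();
    }

    std::size_t size() const
    {
        return m_stack.empty() ? m_data.size() : m_stack.back().size();
    }

    int get(std::size_t i) const { return m_data[map_index(i)]; }

private:
    // Translate a visible index into an index into the full data.
    std::size_t map_index(std::size_t i) const
    {
        assert(i < size());
        return m_stack.empty() ? i : m_stack.back()[i];
    }

    std::vector<int> m_data;
    std::vector<std::vector<std::size_t>> m_stack;
};

// The caller mutates the instance, processes it, and must restore it afterwards.
inline int sum_of_subset(ToyLabels& labels, const std::vector<std::size_t>& index)
{
    labels.add_subset(index);
    int total = 0;
    for (std::size_t i = 0; i < labels.size(); ++i)
        total += labels.get(i);
    labels.remove_subset(); // forgetting this leaves the instance restricted
    return total;
}
```

Note that sum_of_subset has to take a non-const reference even though it only reads values, which is the kind of hidden mutation that gets in the way of immutable features and labels.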

Since we are moving towards immutable features and labels, mutating the instance directly like this is undesirable. Instead, we would like to create a new instance that shares the underlying data with the original, which avoids the overhead of copying. Views serve exactly this purpose.

Design Choices
Why don’t we make it a method of CFeatures and CLabels? We actually tried this. If we put it in the base class and let the subclasses inherit it, we have to use the base class pointer, e.g. CFeatures* or CLabels*, as the return type. Then, when we call it on a subclass instance, we get a base class pointer back and have to cast it manually to the subtype. This would lead to a lot of repetition in our code.
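The difference in return types can be illustrated with a small sketch. The classes and the names view_as_member, num_dims, dims_via_member and dims_via_template are hypothetical, and the index argument is left out to keep the focus on the types; this is not the code that was actually tried in Shogun.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

struct ToyFeatures
{
    virtual ~ToyFeatures() = default;
    virtual ToyFeatures* duplicate() const { return new ToyFeatures(*this); }

    // Member-function design: the return type is fixed to the base class.
    ToyFeatures* view_as_member() const { return duplicate(); }
};

struct ToyDenseFeatures : ToyFeatures
{
    ToyDenseFeatures* duplicate() const override
    {
        return new ToyDenseFeatures(*this);
    }
    int num_dims() const { return 3; } // hypothetical subclass-only method
};

// Free-template design: the return type follows the argument type.
template <typename T>
T* view(T* viewable)
{
    return viewable->duplicate();
}

static_assert(
    std::is_same<
        decltype(std::declval<ToyDenseFeatures&>().view_as_member()),
        ToyFeatures*>::value,
    "member design returns the base pointer type");
static_assert(
    std::is_same<
        decltype(view(std::declval<ToyDenseFeatures*>())),
        ToyDenseFeatures*>::value,
    "template design keeps the subclass type");

inline int dims_via_member(const ToyDenseFeatures& dense)
{
    std::unique_ptr<ToyFeatures> v(dense.view_as_member());
    // The caller has to cast back before using subclass-only methods.
    assert(dynamic_cast<ToyDenseFeatures*>(v.get()) != nullptr);
    return static_cast<ToyDenseFeatures*>(v.get())->num_dims();
}

inline int dims_via_template(ToyDenseFeatures& dense)
{
    std::unique_ptr<ToyDenseFeatures> v(view(&dense));
    return v->num_dims(); // no cast needed
}
```

With the member function, every call site that needs the subtype repeats the cast shown in dims_via_member, whereas the template version keeps the static type of whatever pointer is passed in.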

So in the end I decided to make it a global template function, which provides much more flexibility. This also avoids duplication when implementing views for both features and labels.