Sunday 16 March 2014

Reflection in lame engine Part 3 - Methods and Lua binding.

It's been a while! I am going back to the series of articles on the topic of reflection. Just as I planned... four and a half months ago. So without further ado, today I will go into how I implemented reflection of methods. This article will explain how I store methods, invoke them and retrieve results from the calls. Last but not least, I will cover the most interesting part - how I developed automatic script binding.

The ideas presented here, although described in the context of methods, can be applied to other aspects like properties, which are essentially just methods with a twist, but also to fields, events and whatnot.

To be perfectly clear - the code presented here is not yet finished, so if some snippets look really bad feel free to complain in the comments, but there is a big chance I am already on it! I mentioned automatic script binding - in this case I decided to go with Lua, as it is pretty much the only widely used scripting language in the game industry that I have some knowledge about.

To recap: last time I talked about design, goals, type registration etc. The goals for the reflection system were, in short:
- Portable and intrusive - gives control over what, when and how
- Efficient operations (constant-time class relationship look-up, casting etc.)
- No dynamic allocations whatsoever (a goal I have already violated and given up on; now it is: dynamic allocations limited to a minimum)

Measured against these goals, the method invocation implementation I came up with was really bugging me, as it is definitely not the most efficient solution compatible with what I tried to achieve - but hey, it is working! If you have any ideas on how things might be improved, feel free to leave a comment or drop me an e-mail. So, down to business...



THE NATIVE C++ PART
  // base class for all the methods
  class Method : public Annotatable // more on annotations in next article
  {
  public:
    Method(const char* name, const u32 hash,
      const Type* classType, const Type* returnType)
      : m_name(name, hash)
      , m_class(classType)
      , m_return(returnType)
    {
    }
    // non-essential parts like getters/setters/utils stripped away 
    ...

    // pure virtual function Invoke, implemented by derived classes
    // first argument is the instance on which we call this method
    // result is a pointer to where the returned value will be stored (or nullptr for void)
    // args is an array of void pointers to the arguments for the call
    virtual void Invoke(void* instance, void* result, void* args[]) = 0;

  protected:
    HashedString                      m_name;
    const Type*                       m_class;
    const Type*                       m_return;
    Vector<const Type*>               m_arguments;
  };

So far nothing fancy - probably the only thing worth noticing here is the pure virtual function Invoke. Yes, this is the 'inefficient' part I was worried about, but so far I have no better ideas for the design. It introduces double indirection: first I have to call a virtual function, and then, after some voodoo magic, call the method via pointer with each argument cast to a pointer of the proper type and then dereferenced - I will come back to this. As you will see in a moment, each wrapped method is actually an instance of a templated class. It has to be templated on the types of the method's owner, return value and arguments to properly wrap a pointer to member function. The problem is that I still need a way to uniformly store pointers to methods in a container, so a base class was a must, and to be able to call a function from this base class a pure virtual function was my bet. So what does the template look like?

  // Wrapper for const method.
  template<bool TConst, typename TReturn, typename TClass, typename... TArgs>
  class TMethod : public Method
  {
  private:
    friend class Method; // I don't even remember why it is a friend, maybe I need to refactor this ;-)

    typedef TReturn(TClass::*MethodPointerType)(TArgs...) const;

    typedef TMethod<true, TReturn, TClass, TArgs...> MethodType;

  public:
    TMethod(const char* name, MethodPointerType method)
      : Method(name, Djb(name, strlen(name)), GetType<TClass>(), GetType<TReturn>())
    {
      m_method.ptr = method;
      details::_get_types<Vector<const Type*>, TArgs...>(m_arguments);
    }

    virtual void Invoke(void* instance, void* result, void* args[]) {
      dispatch((const TClass*)instance, (TReturn*)result, args);
    }

  private:
    struct returns_void : std::is_void<TReturn> {};

    TW_INLINE void dispatch(const TClass* instance, TReturn* result, void* args[])
    {
      _dispatch(instance, result, args, returns_void(),
        typename make_index_tuple<sizeof...(TArgs)>::type());
    }

    template<unsigned... TArgIndices>
    TW_INLINE void _dispatch(const TClass* instance, TReturn* result, void* args[],
      std::true_type, index_tuple<TArgIndices...>)
    {
      (instance->*(m_method.ptr))(
        *(static_cast<typename std::remove_reference<TArgs>::type*>
        (args[TArgIndices]))...);
    }

    template<unsigned... TArgIndices>
    TW_INLINE void _dispatch(const TClass* instance, TReturn* result, void* args[],
      std::false_type, index_tuple<TArgIndices...>)
    {
      *result = (instance->*(m_method.ptr))( 
        *(static_cast<typename std::remove_reference<TArgs>::type*>
        (args[TArgIndices]))...);
    }

    struct {
      union {
        MethodPointerType ptr;
        u8 _align[20];
      };
    } m_method;
  };


  // Wrapper for non-const methods. (note: this is pretty much the same as above, just without the consts)
  template<typename TReturn, typename TClass, typename... TArgs>
  class TMethod<false, TReturn, TClass, TArgs...> : public Method
  {
  public:
    ...
    virtual void Invoke(void* instance, void* result, void* args[])
    {
      dispatch((TClass*)instance, (TReturn*)result, args);
    }
    ...
  };

Unfortunately, I am aware that such a template can potentially generate lots of instantiations, which is why I am investigating ways of bringing the numbers down. Anyway, there are more interesting pieces of the puzzle here. First there is this union - more on that later. Next we have a proper typedef for the member function. Why is this important? Well, apparently pointers to member functions are another beast altogether. They are absolutely nothing like regular function or data pointers. In fact, their size varies so much depending on platform, compiler, internal implementation and inheritance (single/multiple/virtual) that it is a huge mess to grasp. If you are interested, I encourage you to read the article on the FastDelegate library by its author, Don Clugston. This is one of the reasons why I needed TMethod in the first place. Next we have this mysterious details::_get_types<Vector<const Type*>, TArgs...>(m_arguments); call and finally some dispatch craziness. I will try to break down each part, one by one, in a manageable fashion.

The nice thing about this class and its implementation is the syntax required to actually create an instance and register it in a reflected class definition:

    // Script1 is a function with an arbitrary signature
     ->AddMethod(TW_METHOD(ReflectedClass, Script1))

This translates to:

   #define TW_METHOD(Clazz, Func) \
    (Method*) New<typename tw::details::_method_wrapper< \
    decltype(&Clazz::Func)>::TMethod>(tw::details::g_methodPool, #Func, &Clazz::Func)

Oh, seriously? Another evil macro with more ugly template stuff? And I ranted about code bloat last time! But seriously, it is not how it looks, totally... New is just new for memory allocation, in a slightly different flavour. I use it widely in my engine and it is part of the whole memory management system. New is actually syntactic sugar for the use of the allocator passed as the first argument plus placement new for the given type (the template parameter). The decltype part is a great new tool in C++11 that lets you deduce the type of a given expression - in this case an actual member function pointer. Let's leave the details of _method_wrapper for now, as I will come back to it later. Just know that TMethod is a typedef inside the _method_wrapper struct for the particular TMethod we need, with all the template arguments nicely deduced for us (TClass, TReturn, the TArgs pack etc.). As we can see from the second code snippet, the constructor of TMethod requires two parameters - the name of the function and the pointer itself - and these are the second and third parameters to New. Ufff...

Now, back to the union. All of this should start to make sense - New takes something called g_methodPool as an argument. The thing is that objects in a pool tend to be of the same size; without that there is little sense in a pool after all. So there is a problem when you have instances of a templated class encapsulating a pointer to member function, which in turn can be anywhere from 4 to 20 bytes... Hell yeah... To overcome this issue the union has been introduced. It ensures that all instances of TMethod are the same size (20 bytes for the pointer plus some additional stuff from the base class), so they are suitable for the pool and play nicely, at the cost of some wasted memory.

The _get_types stuff from the constructor is just another template black box that essentially extracts the types of the arguments from the variadic parameter pack. It expands the pack, calling the GetType<> function for each type passed to the template, and stores the results in the given container.

The last thing is the Invoke method calling dispatch, which in turn calls the real implementation in the form of _dispatch. The evil in its purest form:

    _dispatch(instance, result, args, returns_void(),
        typename make_index_tuple<sizeof...(TArgs)>::type());

    template<unsigned... TArgIndices>
    TW_INLINE void _dispatch(TClass* instance, TReturn* result, void* args[],
      std::false_type, index_tuple<TArgIndices...>)
    {
      *result = (instance->*(m_method.ptr))(
        (*static_cast<typename std::remove_reference<TArgs>::type*>
        (args[TArgIndices]))...);
    }

Things start to get a little bit complicated... At least they were for me when I tried to figure all of this out. The problem here is that we have a really generic pure virtual function that takes the instance on which it will invoke the call, a void* pointer to where the return value will be stored (if needed) and finally the array of void* pointers to the arguments. How do we actually cast the parameters and invoke a function hidden behind a pointer with parameters packed in an array? The answer is of course the power of variadic templates, parameter packs and, to be more precise, parameter pack expansion.

The idea here is to generate a tuple of indices with the same number of elements as the number of arguments (TArgs...). The indices start from 0 and go up to n-1, where n is the number of arguments. Then we use this tuple as a template argument to the _dispatch method, and from there we can easily unpack the parameter pack, using TArgs to cast each argument to the proper type and TArgIndices to index the array of arguments. The last thing to do is to actually dereference the argument. One step I omitted is the use of std::remove_reference - pointers to references are illegal in C++, and this is another hidden cost of such a generic call.

The index_tuple<> and make_index_tuple<> implementation was heavily influenced by Jonathan Wakely's answer to this StackOverflow question. You can also find an implementation and some other C++11 utilities he authored in the link, so I won't duplicate the information here.

So now we know how the methods are called, but how do we efficiently push those parameters? This I will present while talking about the Lua binding.


THE LUA BINDING PART

The scripting integration in lame engine is a topic for an entirely new blog post, so without going too deep let me just say that I have a basic IScriptingContext interface that all concrete scripting language bindings need to implement. The most important function from our perspective at the moment is:
virtual void Initialize(ReflectionSystem* reflection, Allocator* allocator) = 0;

The Lua binding - and for that matter any scripting integration - is transparent to the other parts of the engine. The bridge that ties everything together is the reflection system, therefore the ScriptingContext implementation is provided with pointers to the reflection system and the allocator that should be used for all internal memory allocations of the scripting language. The initialization, in the case of the LuaContext, goes through each class that should get reflected, creates appropriate tables and pushes method pointers as lightuserdata. The following snippet is executed for each applicable type:


            // create table
            lua_createtable(L, 0, 4 + nMethods + nProperties + nFields);
            // push the typename key
            lua_pushstring(L, "_typeName");
            // push the typename value
            lua_pushstring(L, types[i].GetName());
            // insert key-pair into the table
            lua_settable(L, -3);

            // add the new function as new_lua_class (defined in lua)
            lua_pushstring(L, "new");
            lua_getglobal(L, "new_lua_class");
            lua_settable(L, -3);
            
            // Create table for methods
            lua_pushstring(L, "_nativeMethods");
            lua_createtable(L, 0, nMethods);
            
            for (u32 j = 0; j < nMethods; ++j) {
              // push name of the method
              lua_pushstring(L, (*m)[j]->GetName());
              // push pointer to the Method object
              lua_pushlightuserdata(L, (void*)(*m)[j]);
              // set the entry in _nativeMethods
              lua_settable(L, -3);
            }
            lua_settable(L, -3);

The Lua bridge also contains three important functions: twlua_invoke, twlua_new_cpp_instance and twlua_delete_cpp_instance. The second and third create and destroy instances of a named type: the appropriate constructor is used and a pointer to the instance is returned back to Lua, where it is stored as lightuserdata in the field _nativeThis. I will omit the details and go straight to the twlua_invoke implementation with only the essential stuff kept (well, this is my first go at the implementation of this function and it is still lacking some checks and probably some reorganization to make things even more efficient):

 static s32 twlua_invoke(LUA L)
  {
    s32 nArgs = lua_gettop(L);
    if (!lua_isuserdata(L, 1)) {
      Log.e(TAG, "Wrong argument to lua_invoke, first argument should be "
        "pointer to an instance of a compatible class to make a call");
    }
    Object* instance = (Object*)lua_touserdata(L, 1);
    Method* method = (Method*)lua_touserdata(L, 2);

    // Create call-stack for nParams.
    u32 nParams = nArgs - 2;
    void** stack = nullptr;
    if (nParams > 0) {
      stack = (void**)alloca(sizeof(void*)* nParams);
    }
    // details for return value support are similar to passing args so cut them out of the source.
    ...

    const Vector<const Type*>& argTypes = method->GetArgumentTypes();
    for (u32 p = 0; p < nParams; ++p) {
      const u32 i = p + 3;
      switch (lua_type(L, i)) {
        // if our type is a number
        case LUA_TNUMBER: {
          // if the argument our function expects is not a float, convert it to an integer
          if (argTypes[p] != GetType<f32>()) {
            s32* n = (s32*)alloca(sizeof(s32));
            *n = lua_tointeger(L, i);
            stack[p] = n;
          }
          else
          {
            f32* n = (f32*)alloca(sizeof(f32));
            *n = lua_tonumber(L, i);
            stack[p] = n;
          }
        }
        break;

        case LUA_TBOOLEAN: {
          bool* boolean = (bool*)alloca(sizeof(bool));
          *boolean = lua_toboolean(L, i);
          stack[p] = boolean;
        }
        break;

        case LUA_TNIL: {
          // note: the slot must point at valid storage that holds the
          // nullptr, since Invoke dereferences every args[] entry
          void** pointer = (void**)alloca(sizeof(void*));
          *pointer = nullptr;
          stack[p] = pointer;
        }
        break;

        case LUA_TSTRING: {
          const char** string = (const char**)alloca(sizeof(const char*));
          *string = lua_tostring(L, i);
          stack[p] = string;
        }
        break;

       ... // other lua types
      }
    }
    // Call the function - yay!
    method->Invoke(instance, returnedValue, stack);
    
    if (returnedValue) {
      const Type* rtype = method->GetReturnType();
      if (rtype == GetType<f32>()) {
        lua_pushnumber(L, *((f32*)returnedValue));
        return 1;
      }
      else if (rtype == GetType<s32>()) {
        ...
      }
      ...
    }

The last and most important piece of the puzzle to note here is the use of alloca instead of dynamic memory allocation. If you have never heard about alloca, I encourage you to read about it, as it is a powerful tool when used wisely. In this case we allocate memory 'dynamically', but from the stack, which is extremely fast. The problem with alloca is that errors are not signaled with exceptions but with a stack overflow (when we run out of stack space), but since we only allocate the arguments and the table for the arguments on the stack, I am pretty sure this will work like a charm. Anyway, if you tried to allocate a 5 MB table on the stack, the program would probably crash regardless, so how would that be any worse with alloca? Assuming we have as many as 12 arguments, and some of them are Vectors or even Matrices, we end up with: 12 * 4 (sizeof(void*)) + 2 * 16 * 4 (two 4x4 matrices) + 2 * 8 * 4 (two Vector4Ds) + 8 * 4 (eight integers) = 48 + 128 + 64 + 32 = 272 bytes, which is well below the advised 1 KB limit (according to Microsoft's documentation; it may vary from platform to platform).

And this is pretty much the end of this post. It was a lengthy one; I hope it was an interesting read, and I encourage everyone to ask questions or comment on anything you feel like. Thanks for reading and till the next time!

For those of you who are interested - the next article I plan on writing will be Efficient Text Rendering Part 2 - caching, batching and drawing.
