Wednesday, March 31, 2010

Utils3D.projectVectors Slow?

If you have tried to implement 3D stuff in flash without using some already made library out there then its likely that you are using Utils3D.projectVectors somewhere in your code. This function takes 3d vertexes packed in a single Vector<Float> and outputs them on a second Vector<Float> instance applying a projection and transformation defined by a 4x4 Matrix

It is logical to assume that this function is faster than a manual implementation of a 3D projection, but this doesn't seem to be the case in the current version of Flash Player. 

I implemented the following perspective projection class:

class Projector 
{
public var m : Matrix4A;
public var focusDistance : Float;

public function new()
{
focusDistance = 1000;
}

public inline function projectVecList(
src : flash.Vector < Float > ,
dstVerts : flash.Vector < Float > ) : Void
{
var sindex = 0;
var dindex = 0;
var n = src.length;

while ( sindex < n ) {
var srcx = src[sindex++];
var srcy = src[sindex++];
var srcz = src[sindex++];

var zInv = focusDistance /
(srcx*m.m13 + srcy*m.m23 + srcz*m.m33 + m.m43);

dstVerts[dindex++] =
(srcx*m.m11 + srcy*m.m21 + srcz*m.m31 + m.m41) * zInv;

dstVerts[dindex++] =
(srcx*m.m12 + srcy*m.m22 + srcz*m.m32 + m.m42) * zInv;
}
}
}

Matrix4A represents an affine 4x4 matrix ( which means the bottom row is always 0,0,0,1 ), multiplication between affine matrices can be optimized to skip a lot of multiplications without mentioning that you also save the space of 4 Floats members.

So granted, this isn't exactly what Utils3D projection function is doing.

Utils3D.projection uses the Flash 10 Matrix3D class, which is a full 4x4 matrix ( without the constraint of being affine ). This makes it slower but also gives it a bit of extra power, since it can express projection information inside. I'm not completely in touch with the mathemagics of homogenous coordinates, but I think there are some other useful tricks in there that might make 4x4 matrices worth it in some cases.

What my projector class is doing is first transforming the vertexes using its matrix, and then divide the resulting x and y coords by the resulting z multiplied by focusDistance.

I timed how long it takes to project a vector filled with 100k vertexes with both Utils3D.projectVectors and how long it takes using the Projector class:

Utils3D.projectVectors: 1031ms

Projector.projectVecList: 755ms

Projector is faster than the native function! You might say that this shows nothing since it is not doing exactly the same stuff as Utils3D, but the point is that the same projection algorithm implemented in C/C++ should beat my projector by far.

Also, as Andre Michelle has pointed out in this blog entry, the Utils3D.projectVector api could use a few changes. If the api isnt really faster than writing it yourself then that means that there's no reason to wait for adobe to implement such changes.

Anyway, I never really liked having the vertices packed in a Float vector like required by the Utils3D.projectVectors function. Since I was going to be using my own projection now, I decided to see how performance would be affected if I used Vector<Vec3D> instead of Vector<Float> ( packed vertexes ) as the source vertexes of the projection.

So instead of: [ x1, y1, z1, x2, y2, z2, ... ] now I use: [ v1, v2, ... ].

Old times were:
Utils3D.projectVectors: 1031ms
Projector.projectVecList Vector<Float>: 755ms

New time is:
Projector.projectVecList Vector<vec3f>: 631ms

Its even faster! Takes almost half the time it takes Utils3D to finish.

So in conclusion:

From now own I'll manually project the vertices, and I'll do it by receiving a Vector of Vertices and not a Vector of Floats with vertices packed in them.

Run the test here.


Update:

Made some tests using HaXes flash.Memory api. I pack source vertexes in a ByteArray in a similar way to how they are packed in the Vector<Float> version.

I tested storing the vertex components as Double (8 bytes) and as Float ( 4 Bytes ). The result with the Double storage was a bit faster than the Vector<Float> implementation but slower than the Vector<Vec3D> one. 

Storing the components as Float turned out to be faster than any of the other methods, although the difference between it and the Vector<Vec3D> was very small and a lot of precision is lost by this conversion.


Utils3D.projectVectors: 1071
Projector Vector<Float>: 766
Projector Vector<Vec3D>: 642
ByteArray Double: 733
ByteArray Float: 599

I think I'll stay with the Vector<Vec3D> implementation for now, since I'm not very keen on sacrificing numeric precision or usability for such a small increase in performance.

Sunday, March 28, 2010

Flash Quake 3 BSP Loader + Renderer

Inspired by Alternativa3Ds great performance, I am writing this quake 3 bsp loader and viewer in HaXe. And I'm pretty happy with the results so far:

 

Try it here

I might release the source code later after some polishing and optimizations.

Saturday, March 27, 2010

HaXe iterators vs whiles

I have run some test about haXe iterators. Iterators provide a very nice abstraction but their speed when used in for loops was a big disappointment.

I tested it by creating many different ways to calculate the fibonacci series

Infinite Fibonacci Iterator:

class FibIter1 {
var val : UInt;
var nextVal : UInt;

public function new() {
val = 0;
nextVal = 1;
}

public inline function hasNext() : Bool {
return true;
}

public inline function next() : UInt {
var tmpVal = val;
val = nextVal;
nextVal = tmpVal + val;
return tmpVal;
}
}

Bounded Fibonacci Iterator:

class FibIter2 {
var val : UInt;
var nextVal : UInt;
var count : UInt;

public function new( count : UInt ) {
val = 0;
nextVal = 1;
this.count = count;
}

public inline function hasNext() : Bool {
return count != 0;
}

public inline function next() : UInt {
var tmpVal = val;
val = nextVal;
nextVal += tmpVal;
count--;
return tmpVal;
}
}

Test code:

package ;

import flash.Lib;
import org.flashdevelop.utils.FlashConnect;

class Main
{

inline static var Test = 5;
static function main() {


var startTime = Lib.getTimer();
var acc = 0;
for ( i in 0...99999 ) {
if ( Test == 1 ) {

// ********************************
// Test 1
// Infinite Iterator + Manual Break
// ********************************

var count : UInt = 99;
for ( a in new FibIter1() ) {
if ( count-- == 0 ) break;
acc += a;
}

}else if ( Test == 2 ) {

// ********************************
// Test 2
// Bounded iterator
// ********************************

for ( a in new FibIter2(99) ) {
acc += a;
}

}else if ( Test == 3 ) {

// ********************************
// Test 3
// Manual algorithm using while
// ********************************

var val : UInt = 0;
var nextVal : UInt = 1;
var count : UInt = 99;
while ( count-- != 0 ) {
acc += val;
var tmpVal = val;
val = nextVal;
nextVal += tmpVal;
}

}else if ( Test == 4 ) {

// ********************************
// Test 4
// Manual algorithm using for
// ********************************

var val : UInt = 0;
var nextVal : UInt = 1;
for ( i in 0...99 ) {
acc += val;
var tmpVal = val;
val = nextVal;
nextVal += tmpVal;
}

}else if ( Test == 5 ) {

// ********************************
// Test 5
// Bounded iterator using while
// ********************************

var iter = new FibIter2(99);
while ( iter.hasNext() ) {
acc += iter.next();
}

}

}

FlashConnect.trace(acc);
var endTime = Lib.getTimer();
FlashConnect.trace( "Test " + Test + ": " + (endTime - startTime) );

}
}

Results (In haXe 2.05):

Test 1: 5893ms
Test 2: 5692ms
Test 3: 125ms
Test 4: 131ms
Test 5: 136ms

Conclusions:

After being disappointed by the results of the for expression I decided to check the generated bytecode: apparently the haXe compiler wont inline iterator methods in a for loop. The test 5 shows that the iterator class can be pretty fast if it is correctly inlined.

I can't think of a good reason why the compiler shouldn't be able to inline the iterator methods, which can be a good thing, because if there is no good reason then maybe in future versions of haxe for will inline too.


HaXe Vs Flex: Constant If Removal

When the condition of an if is constant the compiler can decide to remove the condition and body from the generated bytecode. This reduces space and (possibly) improves execution performance.

I set out to test the haXe and flex compilers, to see if any of them would apply such optimizations. To do so I used hxswfml, a tool which lets you convert swf files to xml or hx files that represent the bytecode in a very readable manner.

Flex SDK 4:

package 
{
import org.flashdevelop.utils.FlashConnect;
import flash.display.Sprite;
import flash.events.Event;

public class Main extends Sprite
{
private static const a : uint = 1;
public function Main():void
{
if ( a == 0 ) {
FlashConnect.trace( "Test A" );
}

const b : uint = 1;
if ( b == 0 ) {
FlashConnect.trace( "Test B" );
}

var c : uint = 1;
if ( c == 0 ) {
FlashConnect.trace( "Test C" );
}

if ( false ) {
FlashConnect.trace( "Test D" );
}
}
}
}

The result was that only Test D was optimized away from the bytecode.

HaXe 2.05 Compiler:

package ;

import flash.Lib;
import org.flashdevelop.utils.FlashConnect;

class Main
{
static inline var b = 1;
static inline var c = 0;

static function main()
{
var a = 1;
if ( a == 0 ) {
FlashConnect.trace("Test A");
}

if ( b == 0 ) {
FlashConnect.trace("Test B");
}

if ( b == c ) {
FlashConnect.trace("Test C");
}

if ( false ) {
FlashConnect.trace("Test D");
}

}
}

Tests B to D were removed, Test A remained.


Both as3 and haxe provide means for conditional compilation, however they both require you to define the preprocessor variables outside of the sourcecode. HaXe's inline variables allow configuration of the source compilation in a less powerful but less tedious way.

I was disappointed that haXe didn't optimize away all cases though.


Friday, March 26, 2010

Adventures with Haxe!

Today this project called Haxe has come to my attention. Its a language very similar to Actionscript3 which can compile to flash bytecode ( among many other possible targets ).

The project claims its compiler produces better performing swf files than the as3 flex compiler and the language adds a features which are just lovely ( Type inference, more detailed function types, c-like local variable scopes, iterators, etc. ) so trying it out was irresistible.

I decided to test the inline function feature by implementing a 2d vector class:

class Vec2f
{
public var x:Float;
public var y:Float;

public function new( x : Float, y : Float ) {
this.x = x;
this.y = y;
}

public inline function inlineAdd( v : Vec2f ) : Vec2f {
x += v.x;
y += v.y;
return this;
}

public function add( v : Vec2f ) : Vec2f {
x += v.x;
y += v.y;
return this;
}
}

And made a few simple benchmarks:

package ;

import flash.geom.Point;
import flash.Lib;
import org.flashdevelop.utils.FlashConnect;

class Main
{
static inline var TestIterations = 99999999;

static inline var Inline = 0;
static inline var NonInline = 1;
static inline var ManualInline = 2;
static inline var ManualVariableInline = 4;

static inline var TestKind = ManualVariableInline;

static function main()
{
if ( TestKind == NonInline ) {

// ***************************************
// Non inline test:
// ***************************************
var v1 = new Vec2f(0, 0);
var v2 = new Vec2f(1, 1);
var startTime : Int = Lib.getTimer();
for ( i in 0...TestIterations ) v1.add( v2 );
var endTime : Int = Lib.getTimer();
FlashConnect.trace( "Non Inline Method: " + (endTime - startTime) );

}else if ( TestKind == Inline ) {

// ***************************************
// Inline test:
// ***************************************
var v1 = new Vec2f(0, 0);
var v2 = new Vec2f(1, 1);
var startTime : Int = Lib.getTimer();
for ( i in 0...TestIterations ) v1.inlineAdd( v2 );
var endTime : Int = Lib.getTimer();
FlashConnect.trace( "Inline Method: " + (endTime - startTime) );

} else if ( TestKind == ManualInline ) {

// ***************************************
// Manually inlined test:
// ***************************************
var v1 = new Vec2f(0, 0);
var v2 = new Vec2f(1, 1);
var startTime : Int = Lib.getTimer();
for ( i in 0...TestIterations ) {
v1.x += v2.x;
v1.y += v2.y;
}
var endTime : Int = Lib.getTimer();
FlashConnect.trace( "Manual Inline: " + (endTime - startTime) );

} else if ( TestKind == ManualVariableInline ) {
// ***************************************
// Manual inline of vector variables:
// ***************************************
var x1 : Float = 0;
var y1 : Float = 0;
var x2 : Float = 1;
var y2 : Float = 1;
var startTime : Int = Lib.getTimer();
for ( i in 0...TestIterations ) {
x1 += x2;
y1 += y2;
}
var endTime : Int = Lib.getTimer();
FlashConnect.trace( "Manual inline of variables: " + (endTime - startTime) );
}
}
}

The results:

Non Inline Method: 9743
Inline Method: 708
Manual Inline: 709
Manual Variables Inline: 963

Method inlining beats the non inline methods by a factor of 14! Unsurprizingly manual inlining performed exactly the same. This is very good news since it will allow Haxe users to abstract things like vectors operations without worrying about the performance cost.

What was very surprizing however, was that inlining the variables and not using the Vec2f class at all was considerably slower. I was expecting it to be faster since the VM stores the variables in local registers instead of having to access heap memory through the Vec2f instance.