Forums

Full Version: NEON
NEON is ARM's SIMD (single instruction, multiple data) extension: it lets the CPU load multiple chunks of data at a time and perform staggeringly fast maths on them by applying the same instruction to multiple values in parallel. Not as fast as a GPU, but still quite a bit faster than ordinary scalar code.

NEON is the ARM flavour of SIMD, and it's available on almost all modern ARM SBCs; the Intel equivalents (SSE/AVX) are what you'll find on the select few x86 SBCs.

Now, I don't really get too involved at the lower levels of a system, as I have too many to keep track of, but it should be possible to get the compiler to optimise some of our code, especially the maths libraries (though they may already do so), by asking it to use NEON where viable.

This is nowhere near as good as writing your code with NEON in mind and using some of the specific NEON instructions, but it might provide a nice little boost if your project is maths/physics heavy and the performance is just missing the frame rate you need.

You need a post-2015 version of GCC on your system, and all you have to do is add -mfpu=neon -ftree-vectorize to your GCC compile options. (You can leave out -mfpu=neon, but I found the compiler then reported far fewer places where it could optimise.) You also need to set the optimisation level to 3 with -O3.
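Putting those flags together, a compile line for a single file might look like this (the file names are placeholders, and this assumes a GCC that targets ARM, since x86 compilers reject -mfpu=neon):

```shell
g++ -ggdb -ffunction-sections -O3 -std=c++14 \
    -mfpu=neon -ftree-vectorize -fopt-info-vec-optimized \
    -c physics.cpp -o physics.o
```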
My compiler CFLAGS currently look like this.

-ggdb -ffunction-sections -O3 -std=c++14 -mfpu=neon -ftree-vectorize -fopt-info-vec-optimized


I have NO IDEA at the moment what performance gain you will get, if any, but I hope that Bullet, maybe PhysX, and GLM will get some benefit from it. I will try to pop back to this sometime and give better guidance on how effective it turns out to be!

You can see whether any optimisation is taking place by adding -fopt-info-vec-optimized to your compiler commands (it does slow compilation down though) and reading the output.

This is quite advanced stuff and you really need to understand a little about registers and parallel processing; there's some decent info on the web, and I found this quite helpful.
http://www.add.ece.ufl.edu/4924/docs/arm...opment.pdf
Nice!
I wonder if this is similar to NUMA on Intel-based systems, where you can pin a process to a CPU core based on which PCI lane it is mapped to?
umm..... watermelon.

i.e. I have no idea Big Grin