Spreadsheets on the GPU
Michael Meeks <michael.meeks@collabora.com>
mmeeks, #libreoffice-dev, irc.freenode.net
Stand at the crossroads and look; ask for the
ancient paths, ask where the good way is,
and walk in it, and you will find rest for your
souls... - Jeremiah 6:16
Overview
LibreOffice?
A bit about:
GPUs
Spreadsheets
Internal re-factoring
OpenCL optimisation
new calc features
XML / load performance
Calc / GPU questions?
Questions?
Advisory Board Members
This slide's layout is a victim of our success here ...
Why use the GPU?
APUs: GPU faster than CPU
Tons of unused Compute Units across your APU
Double precision is unreasonably slower
And precision is non-negotiable for spreadsheets: IEEE 754 required.
Better power usage per flop.
[Chart: CPU flops vs. GPU flops for fp32 and fp64, log scale from 1 to 10000. Series: CPU flops, GPU flops, FirePro 7990. Numbers based on a Kaveri 7850K APU and a top-end discrete graphics card.]
Developers behind the calc re-work:
Kohei Yoshida: MDDS maintainer, heroic calc core re-factorer, code ninja etc.
Markus Mohrhard: Calc maintainer, Chart2 wrestler, unit tester par excellence etc.
Jagan Lokanatha
Kismat Singh
Matus Kukan: Data Streamer, G-builder, size optimizer ..
Spreadsheet Geometry
An early spreadsheet, c. 3000 BC
Aspect ratio: 8:1
Contents: "Victory against every land ... who giveth all life forever"
Excel 2003: 64k x 256, aspect 256:1
Excel 2010: 10^6 x 16k, aspect 64:1
50% of spreadsheets used to make business decisions.
Columnar data structures
The 'Broom Handle' aspect ratio.
Spreadsheet Core Data Storage
The joy of Object Orientation
ScDocument -> ScTable -> ScColumn -> ScBaseCell
ScBaseCell fields: Broadcaster (8 bytes), Text width (2 bytes), Cell type (1 byte), Script type (1 byte)
ScBaseCell subclasses: ScValueCell, ScFormulaCell, ScStringCell, ScEditCell, ScNoteCell*
ScDocument
[Diagram: code that touches ScDocument directly: Undo / Redo, RTF Filter, Change Tracking, Content Rendering, HTML Filter, External Reference, Document Iterators, CSV Filter, DIF Filter, Conditional Formatting, SYLK Filter, DBF Filter, ODF Filter, Cell Validation, CppUnit Test.]
[Diagram: ScDocument reached through Document Iterators.]
Before (ScBaseCell)
[Diagram: the same ScDocument, ScTable, ScColumn, ScBaseCell class layout as on the earlier data storage slide.]
Scattered pointer chasing walking cells down a column ...
After (mdds::multi_type_vector)
ScDocument -> ScTable -> ScColumn
ScColumn holds parallel arrays: Cell values, Broadcasters, Cell notes, Text widths, Script types
Cell values are stored in typed blocks: svl::SharedString block, double block, EditTextObject block, ScFormulaCell block
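The typed-block idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `Block`, `BlockType`, `Column`, `appendDouble`, `appendString` and `cellCount` are hypothetical names invented here, and the sketch does not use or reproduce the mdds::multi_type_vector API. It shows adjacent cells of the same type being merged into one contiguous run instead of one heap object per cell.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical block kinds; the slide lists more (formula, edit text).
enum class BlockType { Double, String };

// One contiguous run of same-typed cells. Only the vector matching
// `type` is ever filled.
struct Block {
    BlockType type;
    std::vector<double> doubles;
    std::vector<std::string> strings;
};

// A column is a short list of typed blocks, not an array of cell pointers.
struct Column {
    std::vector<Block> blocks;

    void appendDouble(double value) {
        // Start a new block only when the type changes.
        if (blocks.empty() || blocks.back().type != BlockType::Double)
            blocks.push_back(Block{BlockType::Double, {}, {}});
        blocks.back().doubles.push_back(value);
    }

    void appendString(std::string value) {
        if (blocks.empty() || blocks.back().type != BlockType::String)
            blocks.push_back(Block{BlockType::String, {}, {}});
        blocks.back().strings.push_back(std::move(value));
    }

    std::size_t cellCount() const {
        std::size_t n = 0;
        for (const Block& b : blocks) {
            // Invariant: a block never mixes types.
            assert(b.doubles.empty() || b.strings.empty());
            n += b.doubles.size() + b.strings.size();
        }
        return n;
    }
};
```

Appending 1.0, 2.0, "x", 3.0 to such a column yields three blocks (two doubles, one string, one double), so a long numeric column collapses into a single array of doubles.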
Iterating over cells (old way)
Loop down a column, and the inner loop:
    double nSum = 0.0;
    ScBaseCell* pCell = pCol->maItems[nColRow].pCell;
    ++nColRow;
    switch (pCell->GetCellType())
    {
        case CELLTYPE_VALUE:
            nSum += ((ScValueCell*)pCell)->GetValue();
            break;
        case CELLTYPE_FORMULA:
            // something worse ...
        case CELLTYPE_STRING:
        case CELLTYPE_EDIT:
        case CELLTYPE_NOTE:
            ...
    }
Iterating over cells (new way)
    double nSum = 0.0;
    for (size_t i = 0; i < nChunkLength; i++)
        nSum += pDoubleChunk[i];
ONO. from a vectoriser ...
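The two loops can be put side by side in a self-contained form. The following is a rough sketch, assuming simplified stand-in classes (`Cell`, `ValueCell`, `StringCell`) and hypothetical function names `sumOldWay` and `sumNewWay`; none of these are the real LibreOffice classes. It contrasts a per-cell type dispatch through pointers with a plain loop over a contiguous array of doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Simplified, hypothetical stand-ins for the old per-cell objects.
enum CellType { CELLTYPE_VALUE, CELLTYPE_STRING };

struct Cell {
    virtual ~Cell() = default;
    virtual CellType GetCellType() const = 0;
};

struct ValueCell : Cell {
    explicit ValueCell(double v) : value(v) {}
    CellType GetCellType() const override { return CELLTYPE_VALUE; }
    double GetValue() const { return value; }
    double value;
};

struct StringCell : Cell {
    CellType GetCellType() const override { return CELLTYPE_STRING; }
};

// Old way: follow one pointer per cell and branch on its type.
double sumOldWay(const std::vector<std::unique_ptr<Cell>>& column) {
    double nSum = 0.0;
    for (const auto& pCell : column) {
        switch (pCell->GetCellType()) {
            case CELLTYPE_VALUE:
                nSum += static_cast<const ValueCell*>(pCell.get())->GetValue();
                break;
            case CELLTYPE_STRING:
                break;  // strings do not contribute to the sum
        }
    }
    return nSum;
}

// New way: the values are already a contiguous chunk of doubles.
double sumNewWay(const double* pDoubleChunk, std::size_t nChunkLength) {
    assert(pDoubleChunk != nullptr || nChunkLength == 0);
    double nSum = 0.0;
    for (std::size_t i = 0; i < nChunkLength; ++i)
        nSum += pDoubleChunk[i];
    return nSum;
}
```

The second loop has no branches and a fixed stride, which is the shape a compiler's auto-vectoriser and an OpenCL kernel both want.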
Shared Formula
Before: every ScFormulaCell owns its own ScTokenArray (Tokens, RPN).
After: the ScFormulaCells in a run point to one ScFormulaCellGroup, which holds a single shared ScTokenArray (Tokens, RPN).
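A minimal sketch of the sharing idea, assuming hypothetical types `TokenArray`, `FormulaCellGroup` and `FormulaCell` and a hypothetical helper `makeGroupedCells`; these only mirror the names on the slide and are not the real ScFormulaCell, ScFormulaCellGroup or ScTokenArray interfaces. Each cell keeps a reference to one shared group plus its offset inside the run, so the tokens and RPN exist once per run rather than once per cell.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a token array: tokens plus their RPN form.
struct TokenArray {
    std::vector<std::string> tokens;
    std::vector<std::string> rpn;
};

// Hypothetical stand-in for a formula group: one compiled formula
// shared by a run of adjacent cells.
struct FormulaCellGroup {
    TokenArray code;
    std::size_t length;  // number of cells in the run
};

// A cell stores only a reference to its group and its position in it.
struct FormulaCell {
    std::shared_ptr<const FormulaCellGroup> group;
    std::size_t offset;
};

// Build `count` cells that all share a single copy of `code`.
std::vector<FormulaCell> makeGroupedCells(TokenArray code, std::size_t count) {
    assert(count > 0);
    std::shared_ptr<const FormulaCellGroup> group =
        std::make_shared<FormulaCellGroup>(
            FormulaCellGroup{std::move(code), count});
    std::vector<FormulaCell> cells;
    cells.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        cells.push_back(FormulaCell{group, i});
    return cells;
}
```

With this layout the per-cell cost is a pointer and an offset, which is the kind of saving the memory usage figures on the next slide measure.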
Memory usage
Heap memory size (MB): empty document 27, shared formula on 259, shared formula off 372.
Shared string re-work
String comparisons were slow
Also not tractable for a GPU
Case-insensitive equality is a hard problem: ICU & heavy lifting.
String comparisons are used a lot in functions and Pivot Tables.
Shared string storage is useful.
So fix it ...
Concept
[Diagram: svl::SharedStringPool holds an original string pool and an upcased string pool; each svl::SharedString refers into both.]
String comparison (old way)
String comparison (new way)
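The pooling concept can be illustrated with a small C++ sketch. Everything named here (`StringPool`, `Shared`, `intern`, `equalsCaseSensitive`, `equalsIgnoreCase`) is hypothetical and is not the svl::SharedStringPool API; the upper-casing is plain ASCII `std::toupper`, a loud simplification of the ICU work the slides mention. The point is that once strings are interned, both kinds of equality reduce to a pointer comparison.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical pool that interns each string twice: as written, and
// upper-cased (ASCII only here; real case folding needs ICU).
class StringPool {
public:
    struct Shared {
        const std::string* original;
        const std::string* upper;
    };

    Shared intern(const std::string& s) {
        // Elements of std::unordered_set keep their address across
        // rehashing, so the returned pointers stay valid.
        const std::string* orig = &*original_.insert(s).first;
        std::string up = s;
        for (char& c : up)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        const std::string* upp = &*upper_.insert(up).first;
        return Shared{orig, upp};
    }

private:
    std::unordered_set<std::string> original_;
    std::unordered_set<std::string> upper_;
};

// Both comparisons are pointer compares, valid for strings interned
// in the same pool.
inline bool equalsCaseSensitive(const StringPool::Shared& a,
                                const StringPool::Shared& b) {
    assert(a.original && b.original);
    return a.original == b.original;
}

inline bool equalsIgnoreCase(const StringPool::Shared& a,
                             const StringPool::Shared& b) {
    assert(a.upper && b.upper);
    return a.upper == b.upper;
}
```

Interning "Hello" and "HELLO" gives two different `original` pointers but the same `upper` pointer, so a case-insensitive match costs one integer comparison, a form that also maps onto a GPU.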
OpenCL / calculation ...
Why OpenCL & HSA ...
GPU and CPU optimisation
Why write custom SSE2/SSE3 etc. assembly, detect arch, and select a backend across platforms?
Instead get OpenCL (from the APU vendor) to generate the best code ...
Heterogeneous System Architecture rocks:
An AMD64-like innovation:
shared Virtual Memory Address space & pointers: GPU <-> CPU.
Avoid wasteful copies, fast dispatch
Great OpenCL 2.0 support.
Use the right Compute Unit for the job.
Compiled from standard formula syntax

    double tmp0_0_fsum(__global double *tmp0_0_0) {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = ((tmp0_0_0[gid0])+(tmp));
        return tmp;
    }
    double tmp0_nop(__global double *tmp0_0_0) {
        double tmp = 0;
        int gid0 = get_global_id(0);
        tmp = tmp0_0_fsum(tmp0_0_0);
        return tmp;
    }
    __kernel void
    DynamicKernel_nop_fsum(__global double *result,
                           __global double *tmp0_0_0)
    {
        int gid0 = get_global_id(0);
        result[gid0] = tmp0_nop(tmp0_0_0);
    }

The same formula for a longer sum

    __kernel void
    tmp0_0_0_reduction(__global double* A,
                       __global double *result,
                       int arrayLength, int windowSize)
    {
        double tmp, current_result =0;
        int writePos = get_group_id(1);
        int lidx = get_local_id(0);
        __local double shm_buf[256];
        int offset = 0;
        int end = windowSize;
        end = min(end, arrayLength);
        barrier(CLK_LOCAL_MEM_FENCE);
        int loop = arrayLength/512 + 1;
        for (int l=0; l<loop; l++) {
            tmp = 0;
            int loopOffset = l*512;
            if((loopOffset + lidx + offset + 256) < end) {
                tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
                tmp = legalize(((A[loopOffset + lidx + offset + 256])+(tmp)), tmp);
            } else if ((loopOffset + lidx + offset) < end)
                tmp = legalize(((A[loopOffset + lidx + offset])+(tmp)), tmp);
            shm_buf[lidx] = tmp;
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int i = 128; i >0; i/=2) {
                if (lidx < i)
                    shm_buf[lidx] = ((shm_buf[lidx])+(shm_buf[lidx + i]));
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if (lidx == 0)
                current_result =((current_result)+(shm_buf[0]));
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lidx == 0)
            result[writePos] = current_result;
    }
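The reduction pattern in the longer kernel can be followed more easily in a serial CPU emulation. The C++ below is a sketch under assumptions: `reduceLikeWorkGroup` is a hypothetical name, the 256 "lanes" are simulated with ordinary loops in place of work-items and barriers, the whole array is treated as one window, and the generated code's `legalize` helper is left out because the slides do not show its definition. Each pass over a 512-element chunk has every lane load up to two elements 256 apart, then halves the active lanes from 128 down to 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial emulation of one 256-lane work-group summing an array.
double reduceLikeWorkGroup(const std::vector<double>& A) {
    const std::size_t kLanes = 256;
    double shm[kLanes];  // plays the role of the __local buffer
    double current = 0.0;
    const std::size_t end = A.size();
    const std::size_t loops = end / 512 + 1;
    assert(loops >= 1);

    for (std::size_t l = 0; l < loops; ++l) {
        const std::size_t base = l * 512;

        // Phase 1: every lane loads up to two elements, 256 apart.
        for (std::size_t lidx = 0; lidx < kLanes; ++lidx) {
            double tmp = 0.0;
            if (base + lidx < end)
                tmp += A[base + lidx];
            if (base + lidx + 256 < end)
                tmp += A[base + lidx + 256];
            shm[lidx] = tmp;
        }

        // Phase 2: tree reduction; each step halves the active lanes.
        // Lanes below i only read slots i..2i-1, which this step does
        // not write, so the serial order matches the barrier version.
        for (std::size_t i = 128; i > 0; i /= 2)
            for (std::size_t lidx = 0; lidx < i; ++lidx)
                shm[lidx] += shm[lidx + i];

        // Lane 0 accumulates the chunk total.
        current += shm[0];
    }
    return current;
}
```

On a GPU the inner `lidx` loops run concurrently, so a 512-element chunk is summed in eight tree steps after the initial loads instead of 511 sequential additions.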
[Chart: calculation time per sample, log scale from 10 to 100,000, shorter is better. Samples: min_max_avg_r, destination-workbook, dates-worked, stock-history, ground-water.]
30x to 500x faster for these samples vs. the legacy software calculation on Kaveri.
In more detail ...
This is a spreadsheet
Highly spreadsheet geometry dependent
What do you mean 'what is the X factor'?
Don't like your X factor? Add more rows, or complexity.
Representative sheets important: some based on real world madness
Functions:
Research shows the vast majority of distinct formulae have very simple functions: SUM, AVERAGE, SUMIF, VLOOKUP, etc.
We optimise those
We don't do e.g. Text functions like UPPER
How that works in practice:
Big data needs Document Load optimization
Parallelized Loading ...
Desktop CPU cores are often idle.
XML parsing:
The ideal application of parallelism
SAX parsers:
Sucking icAche eXperience parsers
read, parse a tiny piece of XML & emit an event
punch that deep into the core of the APP logic, and return ..
Parse another tiny piece of XML.
Better APIs and impl's needed: Tokenizing, Namespace handling etc.
Luckily easy to retrofit threading ...
Dozens of performance wins in XFastParser.
Utilising your 32 core CPU ... (boxes are threads).
Split XML Parse & Sheet populate:
Thread 1: Unzip, XML Parse, Tokenize
Thread 2: Populate Sheet Data Structures.
Parallelised Sheet Loading: a second pair of threads (Unzip, XML Parse, Tokenize; Populate Sheet Data Structures) for another sheet.
Progress bar thread
Parallel to GPU compilation etc.: =COVAR(A1:A300,B1:B300) -> OpenCL code -> Ready to execute kernels
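The split between a parsing thread and a populating thread is a producer/consumer pipeline, which can be sketched in standard C++. This is an illustration only: `CellEvent`, `EventQueue` and `loadColumn` are hypothetical names, "parsing" is reduced to `std::stod` on each raw string, and nothing here reflects the actual XFastParser code. One thread tokenizes and pushes events into a queue while the other drains the queue into the column storage.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical event emitted by the parsing thread for each cell.
struct CellEvent {
    std::size_t row;
    double value;
};

// Minimal blocking queue connecting the two threads.
class EventQueue {
public:
    void push(CellEvent e) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(e);
        }
        ready_.notify_one();
    }

    // Signal that no more events will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an event is available; returns false once the
    // queue is closed and fully drained.
    bool pop(CellEvent& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty())
            return false;
        out = queue_.front();
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<CellEvent> queue_;
    bool closed_ = false;
};

std::vector<double> loadColumn(const std::vector<std::string>& rawCells) {
    EventQueue events;
    std::vector<double> column(rawCells.size(), 0.0);

    // Thread 1: "parse and tokenize" (here: text to double).
    std::thread parser([&] {
        for (std::size_t row = 0; row < rawCells.size(); ++row)
            events.push(CellEvent{row, std::stod(rawCells[row])});
        events.close();
    });

    // Thread 2: populate the sheet data structure.
    std::thread populator([&] {
        CellEvent e;
        while (events.pop(e)) {
            assert(e.row < column.size());
            column[e.row] = e.value;
        }
    });

    parser.join();
    populator.join();
    return column;
}
```

Because the parser never calls into the document model, its working set stays small, and running one such pair per sheet gives the sheet-level parallelism shown in the diagram.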
Does it work? (with GPU enabled)
Wall-clock time to load a set of large XLSX spreadsheets: 8 thread Intel machine
[Chart: load time per file, log scale from 0.1 to 100, shorter is better. Series: Calc 4.1.3, Calc, Reference. Files: num-formula-2-sheets-1m.xlsx, numbers-formula-8-sheets-100k.xlsx, numbers-formula-100k.xlsx, numbers-100k.xlsx, sumifs-testsheet.xlsx, stock-history.xlsm, matrix-inverse.xlsx, mandy.xlsm, mandy-no-macro.xlsx, groundwater-daily.xlsm, dates-worked.xlsx.]
How does that pan out?
Problems^W Opportunities ...
Picking a good OpenCL driver
White / Black / Any-listing of known good / bad / mixed Hardware / Driver / OS
Which core to pick?
fp64 perf etc. Time vs. Power
Currently micro-benchmark time.
HSA rocks
CL_MEM_USE_HOST_PTR is a royal pain:
Alignment issues currently cause lots of copying in several cases.
OpenCL 2.0's Shared Virtual Memory is awesome
Compiler Performance:
Excel -> RPN -> C string -> IR -> GPU
SPIR sounds great if it can be stable.
Future OpenCL work ...
Volunteers / funders welcome
Kill per-cell dependency graphing
Badly needs to be per column:
Shrink memory usage, improve load time
Detect independent column calculations
SPIR integration
Enabling parallel execution, wider CSE etc.
Avoid 'NaN' foo by adapting to data shape faster.
Calc as a flow process, 'construct your pipeline in a sheet'
Crazy awesome demos: Mobile vs. PC ...
ZIP LZ77 / OpenCL acceleration or similar
LibreOffice Conclusions
LibreOffice is innovating:
Going interesting places no-one has gone before:
OpenCL in a generic spreadsheet: a first
Why write 5x hand-coded assembler versions and select per platform?
Run your workload on the right Compute Unit to save time & battery.
Re-factoring for OpenCL improves performance for all
Faster for CPU and GPU
PCMark 8.2 includes LibreOffice benchmarking.
LibreOffice loves new contributors & features
there is already a tool for that.
Talk to me about getting involved ...
Thanks for all of your help and support!
Oh, that my words were recorded, that they were written on a scroll, that they were
inscribed with an iron tool on lead, or engraved in rock for ever! I know that my Redeemer
lives, and that in the end he will stand upon the earth. And though this body has been
destroyed yet in my flesh I will see God, I myself will see him, with my own eyes - I and not
another. How my heart yearns within me. - Job 19: 23-27