[Closed] RESEARCH on bpe compression

Status
Not open for further replies.

dadydodo

Main Eventer
Joined
Feb 2, 2012
Messages
719
Reaction score
0
tekken57 said:
Having taken a look at the differences, there are a number of values where the byte swap occurs. To further complicate things, there are certain values where the byte swap occurs only as a one-off, e.g. the swap after 41 occurs only once; everywhere else 41 appears, the byte swap does not occur.

There are also some instances where there are trailing zeros after the values and the byte swap only appears after the trailing zeros. The number of trailing zeros also varies.

What this means is that you cannot write code to swap the bytes, as the rules are constantly changing. The actual compression script is the issue.

Here are the values which the byte swap precedes:

82
87
84
81
80
86
80
FF
83
86
87
84
8b
89
83
8A
85
82
90
FF
94
88
8D
3f
96
41
OK, first of all, tekken57, it's good to have you with us, my friend.
I managed to compress a yobj file to BPE by fixing the inverted numbers manually:
http://z4.ifrm.com/30080/166/0/p1159276/CH.PAC.70.zip
I already told HARDX36 about the tool we could make to fix the twisted numbers.
About the list of twisted numbers I posted: I don't think you understood it.
These patterns are nearly fixed. Every time you search for
00 00 80 01 in the new compressed file and in the original one, you will find that the 01 is swapped with the next number.
The same goes for 00 00 80 02, 00 00 80 03 and 00 00 83 02. The list works like this:
First search the whole file for 00 00 80 01 (g), where (g) can be any number from 00 to FF,
then replace it with 00 00 80 (g) 01.
Then search the file for 00 00 80 02 (g) and replace it with 00 00 80 (g) 02.
Then search the file for 00 00 81 01 (g) and replace it with 00 00 81 (g) 01.
The most important condition: do not replace the ones that were already replaced.
For example,
if you find 00 00 80 01 02 (here g = 02), make it 00 00 80 02 01.
Later, when you search for 00 00 80 02 (g), you will hit 00 00 80 02 01 again. Do not reverse it back;
anything that was already replaced must not be replaced a second time.
That is how the list I wrote works.
When you finish the 80 and 90 numbers,
fix these ones as well:
00 00 01 (g), make it 00 00 (g) 01.
If we make this tool, it should fix nearly 99% of the errors.
The remaining errors, the ones we don't know about yet, we will fix in a new version.
A tool that works this way will make the fixing research easier, and we will definitely have the correct tool by the end of the research.
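
The search-and-swap procedure above could be sketched roughly like this. It is only a sketch under assumptions: the names fixSwaps and fixAll are made up for illustration, each rule is taken to be a byte prefix whose last byte gets exchanged with the byte that follows it, and a per-byte flag stands in for the "do not replace it again" condition. It is not the THQ algorithm, just the manual fix automated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: for every occurrence of `prefix` followed by one more
// byte g, swap the last prefix byte with g (e.g. 00 00 80 01 g -> 00 00 80 g 01).
// fixed[i] marks bytes that an earlier swap already moved, so that a later
// rule never reverses them.
std::size_t fixSwaps(Bytes& data, std::vector<bool>& fixed, const Bytes& prefix) {
  assert(!prefix.empty() && fixed.size() == data.size());
  const std::size_t n = prefix.size();
  std::size_t count = 0;
  for (std::size_t pos = 0; pos + n < data.size(); ++pos) {
    bool match = true;
    for (std::size_t k = 0; k < n && match; ++k) {
      match = (data[pos + k] == prefix[k]);
    }
    const std::size_t a = pos + n - 1;  // last byte of the prefix
    const std::size_t b = pos + n;      // the byte (g) that follows it
    if (!match || fixed[a] || fixed[b]) continue;
    std::swap(data[a], data[b]);
    fixed[a] = true;
    fixed[b] = true;
    ++count;
    pos = b;  // the loop's ++pos then continues after the swapped pair
  }
  return count;
}

// Applies the rules in the given order and returns the number of swaps made.
std::size_t fixAll(Bytes& data, const std::vector<Bytes>& prefixes) {
  std::vector<bool> fixed(data.size(), false);
  std::size_t total = 0;
  for (const Bytes& p : prefixes) {
    total += fixSwaps(data, fixed, p);
  }
  return total;
}
```

With the rules 00 00 80 01 and 00 00 80 02 applied in that order, the input 00 00 80 01 02 becomes 00 00 80 02 01 in the first pass, and the second pass leaves it alone because both bytes are already flagged.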
 

tekken57

Young Lion
Joined
Mar 19, 2013
Messages
8
Reaction score
0
I understood you correctly the first time. I use a tool called vbindiff, which compares the hex values of two files and shows the differences.

Using this tool, I determined that the values I listed are also followed by a byte swap. For example, 82 AA BB should be 82 BB AA. This is what I mean by byte swapping.

The byte swap occurs many times in the file. Take a look at the program I mentioned and compare the values to see what I mean.
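
For anyone who wants to list the swaps instead of paging through vbindiff, a rough sketch of the same comparison follows. The function name findSwappedPairs is made up, and it assumes both files have already been read into memory and have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns every offset i where `ours` differs from `reference` only in that
// bytes i and i+1 are exchanged (AA BB in one file, BB AA in the other).
std::vector<std::size_t> findSwappedPairs(const std::vector<std::uint8_t>& reference,
                                          const std::vector<std::uint8_t>& ours) {
  assert(reference.size() == ours.size());  // assumption: equal-length files
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i + 1 < reference.size(); ++i) {
    const bool differs = reference[i] != ours[i];
    const bool swapped =
        reference[i] == ours[i + 1] && reference[i + 1] == ours[i];
    if (differs && swapped) {
      offsets.push_back(i);
      ++i;  // skip the second byte of the pair
    }
  }
  return offsets;
}
```

The byte just before each reported offset is the value the swap follows, which is how a list like the one quoted above (82, 87, 84, ...) could be collected automatically.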
 

ERI619

Legend
Joined
Sep 25, 2011
Messages
2,442
Reaction score
0
Website
www.razorinstinctmods.wordpress.com
Here are some conclusions about BPE from the Monash website. Read the 2nd point.

Overall BPE is not a very fast, nor effective compressor. It does have a good decompression rate though.
The BPE algorithm currently has several unaddressed faults with certain parameter combinations.
BPE did lend itself to parametrisation of its HASHSIZE, BLOCKSIZE and MAXCHAR values.
The smallest compressed result, which was job number 80, had a HASHSIZE of 16384 bytes, BLOCKSIZE of 19660 (20% greater than 16384) and a MAXCHAR value of 145. This combination resulted in a file of 1564726 bytes, from an original file of 3255838 bytes uncompressed. This is a compression ratio of 48%.
None of the results were greater than the original size, so the size limit is guaranteed.

So we need to find the correct parameter combination in order to prevent the reversed-numbers problem. It's not the fault of the program; it's the fault of the compression itself. I think we are very close to solving this issue, we just need to input the right parameters.
 

NMCM

Upper Midcard
Joined
Dec 6, 2010
Messages
624
Reaction score
0
Hope so, that would change modding once again, in a major way.
 

dadydodo

Main Eventer
Joined
Feb 2, 2012
Messages
719
Reaction score
0
I will see what I can do. Also, guys, I think we are getting closer.
 

HARDX36

Main Eventer
Joined
Feb 17, 2011
Messages
1,538
Reaction score
0
Have you tried running the .c source file step by step, using for example the Dev-C++ compiler? That way you can monitor the process of how the file is created.
 

dadydodo

Main Eventer
Joined
Feb 2, 2012
Messages
719
Reaction score
0
HARDX36, I don't know anything about this. All I did was compile the C script using Dev-Cpp 5.6.1 TDM-GCC x64 4.8.1.
I just clicked compile; I don't know anything beyond that.
 

ERI619

Legend
Joined
Sep 25, 2011
Messages
2,442
Reaction score
0
Website
www.razorinstinctmods.wordpress.com
Something I found on the net:
The hash table size HASHSIZE must be a power of two, and should not be too much smaller than the buffer size BLOCKSIZE or overflow may occur. Programmers can adjust the value of BLOCKSIZE for optimum performance, up to a maximum of 32767 bytes. The parameter THRESHOLD, which specifies the minimum occurrence count of pairs to be compressed, can also be adjusted.
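
A small sketch of checking those constraints before recompiling with new values. The function names are made up, and the bit-mask reading of the power-of-two rule is an assumption about how the hash index is reduced, not something confirmed from the source in this thread. The "not too much smaller than BLOCKSIZE" rule is left out because the quote gives no number for it.

```cpp
#include <cassert>

// True when v is a power of two (1, 2, 4, 8, ...).
bool isPowerOfTwo(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

// Assumed reason for the power-of-two rule: a hash value can then be reduced
// to a table index with a single AND against HASHSIZE - 1.
unsigned hashIndexMask(unsigned hashSize) {
  assert(isPowerOfTwo(hashSize));
  return hashSize - 1;
}

// Checks only the two hard limits stated in the quote.
bool paramsLookSane(unsigned hashSize, unsigned blockSize) {
  const unsigned kMaxBlockSize = 32767;  // maximum BLOCKSIZE from the quote
  if (!isPowerOfTwo(hashSize)) return false;
  if (blockSize == 0 || blockSize > kMaxBlockSize) return false;
  return true;
}
```

For example, the Monash combination quoted earlier (HASHSIZE 16384, BLOCKSIZE 19660) passes this check, while a HASHSIZE of 5000 does not.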
 

HARDX36

Main Eventer
Joined
Feb 17, 2011
Messages
1,538
Reaction score
0
Yes, it can be adjusted, but that doesn't change the fact that the output file has switched bytes. The problem is the script: THQ used a modified version of the BPE compression, while the script we use is the original one published in the 90's. One solution is to figure out the decompression script in the quickbms source folder. The only problem is that the source file is not commented, so you basically need to figure out everything yourself.
 

ERI619

Legend
Joined
Sep 25, 2011
Messages
2,442
Reaction score
0
Website
www.razorinstinctmods.wordpress.com
Damn!!! Symbian had the correct BPE script. Since Microsoft took over, they removed the Symbian site which had the script. Very unfortunate.

Could someone compile this?
// -*- mode:c++; tab-width:2; indent-tabs-mode:nil; c-basic-offset:2 -*-

/*
 * Copyright (C) 2010-2011 ZXing authors
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <zxing/common/StringUtils.h>
#include <zxing/DecodeHints.h>

using namespace std;
using namespace zxing;
using namespace zxing::common;

// N.B.: these are the iconv strings for at least some versions of iconv

char const* const StringUtils::PLATFORM_DEFAULT_ENCODING = "UTF-8";
char const* const StringUtils::ASCII = "ASCII";
char const* const StringUtils::SHIFT_JIS = "SHIFT_JIS";
char const* const StringUtils::GB2312 = "GBK";
char const* const StringUtils::EUC_JP = "EUC-JP";
char const* const StringUtils::UTF8 = "UTF-8";
char const* const StringUtils::ISO88591 = "ISO8859-1";
const bool StringUtils::ASSUME_SHIFT_JIS = false;

string
StringUtils::guessEncoding(unsigned char* bytes, int length, Hashtable const& hints) {
  Hashtable::const_iterator i = hints.find(DecodeHints::CHARACTER_SET);
  if (i != hints.end()) {
    return i->second;
  }
  // Does it start with the UTF-8 byte order mark? then guess it's UTF-8
  if (length > 3 &&
      bytes[0] == (unsigned char) 0xEF &&
      bytes[1] == (unsigned char) 0xBB &&
      bytes[2] == (unsigned char) 0xBF) {
    return UTF8;
  }
  // For now, merely tries to distinguish ISO-8859-1, UTF-8 and Shift_JIS,
  // which should be by far the most common encodings. ISO-8859-1
  // should not have bytes in the 0x80 - 0x9F range, while Shift_JIS
  // uses this as a first byte of a two-byte character. If we see this
  // followed by a valid second byte in Shift_JIS, assume it is Shift_JIS.
  // If we see something else in that second byte, we'll make the risky guess
  // that it's UTF-8.
  bool canBeISO88591 = true;
  bool canBeShiftJIS = true;
  bool canBeUTF8 = true;
  int utf8BytesLeft = 0;
  int maybeDoubleByteCount = 0;
  int maybeSingleByteKatakanaCount = 0;
  bool sawLatin1Supplement = false;
  bool sawUTF8Start = false;
  bool lastWasPossibleDoubleByteStart = false;

  for (int i = 0;
       i < length && (canBeISO88591 || canBeShiftJIS || canBeUTF8);
       i++) {

    // Current byte as an unsigned value in [0, 255]
    int value = bytes[i] & 0xFF;

    // UTF-8 stuff
    if (value >= 0x80 && value <= 0xBF) {
      if (utf8BytesLeft > 0) {
        utf8BytesLeft--;
      }
    } else {
      if (utf8BytesLeft > 0) {
        canBeUTF8 = false;
      }
      if (value >= 0xC0 && value <= 0xFD) {
        sawUTF8Start = true;
        int valueCopy = value;
        while ((valueCopy & 0x40) != 0) {
          utf8BytesLeft++;
          valueCopy <<= 1;
        }
      }
    }

    // ISO-8859-1 stuff

    if ((value == 0xC2 || value == 0xC3) && i < length - 1) {
      // This is really a poor hack. The slightly more exotic characters people might want to put in
      // a QR Code, by which I mean the Latin-1 supplement characters (e.g. u-umlaut) have encodings
      // that start with 0xC2 followed by [0xA0,0xBF], or start with 0xC3 followed by [0x80,0xBF].
      int nextValue = bytes[i + 1] & 0xFF;
      if (nextValue <= 0xBF &&
          ((value == 0xC2 && nextValue >= 0xA0) || (value == 0xC3 && nextValue >= 0x80))) {
        sawLatin1Supplement = true;
      }
    }
    if (value >= 0x7F && value <= 0x9F) {
      canBeISO88591 = false;
    }

    // Shift_JIS stuff

    if (value >= 0xA1 && value <= 0xDF) {
      // count the number of characters that might be a Shift_JIS single-byte Katakana character
      if (!lastWasPossibleDoubleByteStart) {
        maybeSingleByteKatakanaCount++;
      }
    }
    if (!lastWasPossibleDoubleByteStart &&
        ((value >= 0xF0 && value <= 0xFF) || value == 0x80 || value == 0xA0)) {
      canBeShiftJIS = false;
    }
    if ((value >= 0x81 && value <= 0x9F) || (value >= 0xE0 && value <= 0xEF)) {
      // These start double-byte characters in Shift_JIS. Let's see if it's followed by a valid
      // second byte.
      if (lastWasPossibleDoubleByteStart) {
        // If we just checked this and the last byte for being a valid double-byte
        // char, don't check starting on this byte. If this and the last byte
        // formed a valid pair, then this shouldn't be checked to see if it starts
        // a double byte pair of course.
        lastWasPossibleDoubleByteStart = false;
      } else {
        // ... otherwise do check to see if this plus the next byte form a valid
        // double byte pair encoding a character.
        lastWasPossibleDoubleByteStart = true;
        if (i >= length - 1) {
          canBeShiftJIS = false;
        } else {
          int nextValue = bytes[i + 1] & 0xFF;
          if (nextValue < 0x40 || nextValue > 0xFC) {
            canBeShiftJIS = false;
          } else {
            maybeDoubleByteCount++;
          }
          // There is some conflicting information out there about which bytes can follow which in
          // double-byte Shift_JIS characters. The rule above seems to be the one that matches practice.
        }
      }
    } else {
      lastWasPossibleDoubleByteStart = false;
    }
  }
  if (utf8BytesLeft > 0) {
    canBeUTF8 = false;
  }

  // Easy -- if assuming Shift_JIS and no evidence it can't be, done
  if (canBeShiftJIS && ASSUME_SHIFT_JIS) {
    return SHIFT_JIS;
  }
  if (canBeUTF8 && sawUTF8Start) {
    return UTF8;
  }
  // Distinguishing Shift_JIS and ISO-8859-1 can be a little tough. The crude heuristic is:
  // - If we saw
  //   - at least 3 bytes that starts a double-byte value (bytes that are rare in ISO-8859-1), or
  //   - over 5% of bytes could be single-byte Katakana (also rare in ISO-8859-1),
  // - and, saw no sequences that are invalid in Shift_JIS, then we conclude Shift_JIS
  if (canBeShiftJIS && (maybeDoubleByteCount >= 3 || 20 * maybeSingleByteKatakanaCount > length)) {
    return SHIFT_JIS;
  }
  // Otherwise, we default to ISO-8859-1 unless we know it can't be
  if (!sawLatin1Supplement && canBeISO88591) {
    return ISO88591;
  }
  // Otherwise, we take a wild guess with platform encoding
  return PLATFORM_DEFAULT_ENCODING;
}
 

The Best

Legend
Joined
Feb 5, 2014
Messages
2,769
Reaction score
0
ERI619 said:
Damn!!! Symbian had the correct BPE script. Since Microsoft took over, they removed the Symbian site which had the script. Very unfortunate.
Could someone compile this?
... How?
 