Today's challenge left me with three options. To thread, the regex path, the bit manipulation path, or parse the input and make a direct comparison. Bit manipulation seemed to be the most appropriate for this problem.
Why did I decide to thread the bit manipulation path?
A bitwise operation operates on a bit string at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operations, and directly supported by the processor. Most bitwise operations are presented as two-operand instructions where the result replaces one of the input operands.
In a nutshell, bitwise operations are much lower level than other operations listed above, and these translate to better space and time complexities since we do not need to parse(convert to binary digits) the individual data element and store them in extra variables.
Problem Statement: UTF-8 Validation
Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
For a 1-byte character, the first bit is a 0, followed by its Unicode code. For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10. This is how the UTF-8 encoding would work:
Number of Bytes | UTF-8 Octet Sequence
| (binary)
--------------------+-----------------------------------------
1 | 0xxxxxxx
2 | 110xxxxx 10xxxxxx
3 | 1110xxxx 10xxxxxx 10xxxxxx
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
x denotes a bit in the binary form of a byte that may be either 0 or 1.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Example 1:
Input: data = [197,130,1] Output: true Explanation: data represents the octet sequence: 11000101 10000010 00000001. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character. Example 2:
Input: data = [235,140,4] Output: false Explanation: data represented the octet sequence: 11101011 10001100 00000100. The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that's correct. But the second continuation byte does not start with 10, so it is invalid.
Constraints:
1 <= data.length <= 2 * 10^4 0 <= data[i] <= 255
My approach:
First I listed what my program requirements were
Firstly, I needed to make sure that the first element, data[0], followed the pattern of the UTF-8 Octet Sequence, namely
0xxxxxxx
110xxxxx
1110xxxx
11110xxx
To do this, I needed to compare the result of making various right-shifting operations on the patterns during the first loop. This is achieved with the following code snippet:
if (datum >> 5) == 0b110:
byte = 1
elif (datum >> 4) == 0b1110:
byte = 2
elif (datum >> 3) == 0b11110:
byte = 3
elif (datum >> 7):
return False
Then depending on which pattern the first element evaluates to, I update the byte variable to either 1, 2, or 3. This byte variable refers to the n - 1 byte with the most significant 2 bits being 10 from the problem.
For the data to pass the UTF-8 Validation, it still needs to loop n - 1 time with the most significant 2 bits being 10 after 6 right shifting operations have been performed on it.
If all these conditions are met, the program should return True, else it should return False.
Take a look at my implementation in this code snippet below for a better understanding
class Solution:
def validUtf8(self, data: List[int]) -> bool:
byte = 0
for datum in data:
if byte == 0:
if (datum >> 5) == 0b110:
byte = 1
elif (datum >> 4) == 0b1110:
byte = 2
elif (datum >> 3) == 0b11110:
byte = 3
elif (datum >> 7):
return False
else:
if (datum >> 6) != 0b10:
return False
byte -= 1
return byte == 0
`