WO2015140562A1

WO2015140562A1 - Steganographic document alteration

Info

Publication number: WO2015140562A1
Application number: PCT/GB2015/050813
Authority: WO
Inventors: Ralph Mahmoud Omar
Original assignee: Omarco Network Solutions Limited
Priority date: 2014-03-19
Filing date: 2015-03-19
Publication date: 2015-09-24
Also published as: GB2524724A; GB2524724B; GB201404959D0

Abstract

Methods and systems relating to encoding data steganographically within a document, and decoding such steganographically encoded data are disclosed. Data to be steganographically encoded is mapped to a set of document alteration instructions which, when carried out, alter geometrical features of the document. Data is decoded by analysing characteristics of such geometrical features. Computer programs and documents associated with these methods and systems are also disclosed.

Description

Steganographic Document Alteration

FIELD OF THE INVENTION

The present invention relates to methods and systems for encoding data steganographically within documents. Naturally, the invention also extends to methods and systems for decoding data from such documents, and also the documents themselves that contain steganographically encoded data therein.

BACKGROUND OF THE INVENTION

Steganography is the practice of concealing data within other data. The data to be hidden, such as a message, is concealed within a generally open document so that the existence of the hidden data is not suggested by or apparent in the open document. Steganography can be applied to both physical documents and electronic documents.

One of the better-known applications of steganography is watermarking of digital images. For example, a photographer can steganographically embed authorship data within photographs by subtly altering predetermined components of the digital image. For example, a 24-bit bitmap image file encodes the colour of each pixel using 8 bits for each colour component (red, green and blue). The least significant bit of each colour component of each pixel can thereby be altered to encode hidden data at a data density of three bits per pixel without the image change being perceptible to the human eye. If different identifiers are steganographically embedded within different versions of the same photograph, it is also possible for the photographer to track or verify the source of each distributed photograph. This requires analysis of each photograph to extract the data from the least significant bit of each colour component of each pixel.
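By way of illustration only, the least-significant-bit technique described above can be sketched as follows. This is a minimal sketch, not part of the disclosure: the function names `embed_lsb` and `extract_lsb` are hypothetical, and the image is assumed to be held as a plain list of (red, green, blue) tuples with 8 bits per colour component.

```python
def embed_lsb(pixels, bits):
    """Hide a bit string in the least significant bit of each colour component."""
    flat = [component for pixel in pixels for component in pixel]
    if len(bits) > len(flat):
        raise ValueError("payload exceeds capacity of three bits per pixel")
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the payload bit.
        flat[i] = (flat[i] & ~1) | int(bit)
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def extract_lsb(pixels, n_bits):
    """Read back the first n_bits least significant bits."""
    flat = [component for pixel in pixels for component in pixel]
    return "".join(str(component & 1) for component in flat[:n_bits])


# Hypothetical two-pixel image carrying a six-bit payload.
stego = embed_lsb([(255, 128, 0), (10, 11, 12)], "010110")
recovered = extract_lsb(stego, 6)
```

Each colour component changes by at most one level out of 256, which is why the alteration is not perceptible, and also why the hidden data is easily lost when the image is reproduced unfaithfully.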

Similar steganography techniques can be used to watermark or fingerprint other documents, effectively providing them with identification marks which cannot be easily detected or noticed by the human eye. So long as the digital file is copied, the identification marks are also copied.

One of the problems in the art is that unfaithful copies of a document do not reliably retain the steganographically hidden data, especially if a poor-quality reproduction technique is employed.

Thus, there is an inherent problem with providing confidential information to uncontrolled parties, in that they may copy it and provide it to third parties in such a way that the source of a leak cannot be proved. Whilst watermarking and fingerprinting techniques are known for providing identification marks in documents which cannot be detected by the human eye, reproduction techniques (such as photocopying) can degrade content and these security features can be lost. This is particularly the case with documents containing text but also applies to image documents. There is a need therefore to encode data within a document, for example with a reference to the recipient of that document, without enabling the recipient to identify those references and circumvent them - for example by simply covering up the reference in any given photocopy. There is also a need to avoid degradation of any security features with low-quality mass reproduction techniques such as photocopying.

It is an object of the present invention to ameliorate the above-mentioned problems, at least in part.

SUMMARY OF THE INVENTION

According to a first aspect of the present invention, there is provided a method of altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans.

Ideally, the method comprises at least one of the steps of:

mapping the data to be steganographically encoded within the document to a set of document alteration instructions; and

carrying out the document alteration instructions to alter geometrical features of the document to steganographically encode data therein.

The method may also comprise identifying at least one group of geometrical features of the rendered form of the document. Ideally, the method comprises registering parameter values that are associated with and define geometrical characteristics of the geometrical features. Thus, carrying out the document alteration instructions may involve altering the parameter values associated with and defining the geometrical characteristics of the geometrical features.

Ideally, the method further comprises selecting a reference. Ideally, a reference set of parameter values is selected using a predetermined reference selection process.

Ideally, the method further comprises mapping the data to be steganographically encoded within the document to a set of relative differences by using a predetermined mapping process. Furthermore it is preferred that the method comprises encoding the data within the document by modifying the identified geometrical features of the document using a predetermined document alteration process. Ideally this is so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values, and ideally by an amount dependent on the set of relative differences.

Ideally, the geometrical features are typographical features, such as those associated with text. For example, the geometrical feature may comprise at least one of: baselines, mean lines, cap height, descender height, ascender height, character size, character position, character kerning, letter spacing, sentence spacing, size of dots and/or diacritics, size and position of superscript and/or subscript text, paragraph position and margins size.

Ideally, alterations to the geometrical features include at least one of: translation, scaling and rotation of those geometrical features. Ideally, said alterations are carried out with respect to a predetermined reference. Ideally, said alterations are made to otherwise regularly repeated geometrical features.

Ideally, the method further comprises receiving at least one user input, wherein the data to be encoded is derived, at least in part, from the at least one user input. Ideally, receiving at least one user input comprises receiving at least one of: biometric data and an alphanumeric code.

Ideally, the method further comprises determining at least one metric representative of at least a portion of the content of the document. Ideally, the data to be encoded is derived, at least in part, from the at least one metric.

Ideally, the method further comprises treating at least part of the data to be encoded prior to steganographically encoding the data within the document. Said treating may comprise encrypting it using an encryption process. Ideally, the encryption process comprises receiving as an input at least one of: a user input and a metric representing the content of the document. Said treating may comprise appending a verifier to the data, such as a checksum.
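A verifier of this kind might be appended and later checked as sketched below. This is an illustrative sketch only: the text above specifies merely "a checksum", so the choice of CRC-32 (via Python's standard `zlib` module), the four-byte big-endian layout, the function names and the sample payload are all assumptions.

```python
import zlib


def append_checksum(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 verifier to the payload (assumed layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_and_strip(data: bytes) -> bytes:
    """Return the payload if its trailing CRC-32 matches, else raise ValueError."""
    payload, tag = data[:-4], data[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != tag:
        raise ValueError("checksum mismatch: extracted data is corrupt")
    return payload


# Hypothetical recipient identifier used as the data to be encoded.
framed = append_checksum(b"ID-0042")
```

A decoder can then reject data that was corrupted during reproduction of the document rather than output an incorrect message.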

Ideally, the method further comprises choosing an encoding strategy. Ideally, this is so as to determine how to map the data to be steganographically encoded within the document to a set of document alteration instructions.

Ideally, the method further comprises checking the altered document to determine that data has been successfully encoded therein. This checking step may comprise comparing the data extracted by a decoding method with the original data to be encoded. Ideally, the altered document is first treated with a degradation process prior to the checking step so as to test the resistance of the encoded data to corruption.

According to a second aspect of the present invention, there is provided a method of processing a document to decode data steganographically encoded within it. Ideally, the data is encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans. Ideally, the method comprises at least one of the steps of:

identifying at least one group of geometrical features of the rendered form of the document;

registering parameter values that are associated with and define geometrical characteristics of the geometrical features;

selecting a reference set of parameter values using a predetermined reference selection process;

analysing the geometrical features to determine a set of relative differences between the parameter values associated with the geometrical features and the reference set of parameter values; and

decoding data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process.

According to a third aspect of the present invention there is provided a computer program arranged to carry out the method of the first or second aspects of the present invention.

According to a fourth aspect of the present invention there is provided a computer program product comprising at least one non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising at least one executable portion which, when executed, carries out a method according to the first or second aspect of the present invention.

According to a fifth aspect of the present invention there is provided a system for altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans. Ideally, the system comprises at least one of:

a reader ideally arranged to read in the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified and altered;

a processor ideally arranged to:

analyse the at least one group of geometrical features to derive therefrom parameter values that are associated with and define geometrical characteristics of the geometrical features;

select a reference set of parameter values;

map the data to be steganographically encoded within the document to a set of relative differences; and/or

encode the data within the document by modifying the identified geometrical features of the document so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values by an amount dependent on the set of relative differences; and

a writer to write out the modified document with the data steganographically encoded therein.

According to a sixth aspect of the present invention, there is provided a system for processing a document to decode data steganographically encoded within it. Ideally, the data is encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans. Ideally, the system comprises at least one of:

a reader ideally arranged to read the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified;

a processor ideally arranged to:

analyse the at least one group of geometrical features to derive therefrom parameter values that are associated with and define geometrical characteristics of the geometrical features;

select a reference set of parameter values;

calculate a set of relative differences between the parameter values associated with the geometrical features and the reference set of parameter values; and/or

decode data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process;

and

an output module to output the decoded data.

Ideally, the system is in the form of a document replication apparatus, such as a photocopier, wherein the reader is a scanner, and the output module is a printer. Ideally, the system is arranged to control replication of a document in dependence on scanning and decoding steganographically embedded data therein. For example, if a certain item of steganographic data indicates that a document is a protected document, then the system can cease replication of that document.

The system may be arranged to issue an authorisation signal in dependence on decoding steganographically embedded data within a document.

According to a seventh aspect of the present invention, there is provided a document within which data is steganographically encoded. Ideally, the document comprises a medium (such as paper) supporting geometrical features thereon. For example, the geometrical features may be printed on the medium. Ideally, the geometrical features are arranged to encode data steganographically in a way that is machine-readable yet is substantially imperceptible to humans. Ideally, the document comprises at least one group of geometrical features having associated parameter values that define geometrical characteristics of the geometrical features. Ideally, the document comprises a reference - for example, via features supporting a reference set of parameter values. Ideally, a set of relative differences between the parameter values of the at least one group of geometrical features and the reference set of parameter values steganographically encodes data within the document.

Ideally, the geometrical features are typographical features. Ideally, the medium supports geometrical features in a way that is machine readable via visible light scanning of the document. Ideally, the document is in the form of at least one of: a bank note, a cheque, an agreement document, a deed document, an assignment document, a power of attorney document, a book, an identity document, a legal document, a business document and a government document.

Further aspects of the present invention may reside in features of the various aspects of the present invention. Furthermore, it will be understood that features and/or advantages of the different aspects of the present invention may be combined and/or substituted where context allows, with the necessary changes being applied.

For example, steps of the encoding method according to the first aspect of the present invention can be applied to the decoding method according to the second aspect of the present invention, optionally with a complementary or inverse operation of that step being performed. By way of further example, a decoding process of the first aspect used to check the integrity of data steganographically encoded within a document may utilise the decoding method of the second aspect of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In order that the invention may be more readily understood, reference will now be made, by way of example, to the accompanying drawings in which:

Figure 1 is a schematic diagram of a system according to various embodiments of the present invention, the system being arranged to process documents so as to steganographically encode data therein, and/or to decode data already steganographically encoded therein;

Figure 2 is an encoding flow diagram according to various embodiments of the present invention;

Figure 3 is a decoding flow diagram according to various embodiments of the present invention; and

Figure 4 is a sample from a rendered form of a source document of Figure 1, the sample including a number of different geometrical features associated with text of that source document.

DESCRIPTION OF PREFERRED EMBODIMENTS

Figure 1 is a schematic diagram of a system 1 according to various embodiments of the present invention, the system 1 being arranged to process source documents 5, 5a by modifying them to steganographically encode data 7 therein to form modified documents 6, 6a. To do this, the system 1 is arranged to carry out the encoding workflow as summarised in the flow diagram of Figure 2. The system 1 is also capable of decoding and extracting data 7 steganographically encoded within modified documents 6, 6a. To this end, the system 1 is arranged to carry out the decoding workflow as summarised in the flow diagram of Figure 3.

The system 1 comprises a computer 2, an image capture device in the form of a flat-bed scanner 3, and a printer 4. The scanner 3 and printer 4 are communicatively linked to the computer 2, for example via a wired or wireless connection. The scanner 3 enables a physical document to be converted into an electronic form which can then be processed electronically by the computer 2. Conversely, the printer 4 enables an electronic form of a document to be converted into a physical, printed document. The computer 2 may comprise other features such as a display that provides a user with information (e.g. an output of an encoding or decoding process and/or the presentation of electronic documents), and input means, for example a mouse and keyboard (e.g. to receive inputs to an encoding or decoding process - for example, a password or a fingerprint). The computer 2 may also comprise features such as a processor and/or processing modules for carrying out the processing steps (such as steps to encode or decode data). The computer 2 may comprise memory and/or memory modules for storing and/or registering information such as data to be encoded, values and parameters.

It will be understood by a person skilled in the art that functional alternatives to the system 1 are possible. For example, the system may comprise a portable electronic computing device, such as a smart-phone (or a tablet). In such a case, the image capture device may be in the form of a camera of the smart-phone. In other alternatives, the computer may be a photocopier.

The following description focusses primarily on the processes carried out by the computer on the electronic forms of documents 5, 6. However, it will be understood that similar features, functions and advantages may be applied to the physical form of the documents 5a, 6a, where context allows.

Encoding data overview

An overview of the data encoding process 20 is summarised in Figure 2.

A first step 21 of the encoding process 20 is to obtain a source document 5 (sometimes referred to as the "carrier") and ensure that it is in a suitable form. This requires the rendered form of the document (i.e. one including geometrical features) to be available, so that image analysis and modification of the subsequent steps of the process can be carried out effectively. In the system 1 of Figure 1, this may include converting a physical, printed form of the document 5a into an electronic form by scanning it in using the scanner 3. However, other conversions may be carried out, possibly entirely in the electronic domain. For example, an electronic file, such as an HTML file, or a Microsoft Word® file may first need to be rendered so that the geometrical features of that document (e.g. the shape and arrangement of text and images) are available for processing. The rendered form of a document 5 may be a bitmapped version of the document. It will also be understood that a rendered document 5 may comprise more than one page.

A second step 22 of the process is analysis of the document. This is primarily to determine whether and how data 7 may be steganographically encoded within the document. The second step 22 may include an assessment of the quantity and/or nature of content within a document, and a determination of how much data can be steganographically encoded within that content. This assessment or determination can be fed back to a user so that if the source document is unsuitable - for example, it does not have a sufficient quantity or type of content - the user can be provided with guidance on how to improve the source document 5. This is so that more data can be steganographically encoded within a processed, modified document 6, and/or data can be encoded in a way that makes the presence of the encoded data substantially imperceptible to humans. One of the assessments that may be carried out is determining whether the document has content in the form of text. Another assessment could be determining the amount of text and/or image within the document, by determining the proportion of whitespace in the image. This can be done using a relatively low resolution version of the document, and so is relatively less processor-intensive than operations such as OCR that involve identifying the formation of particular characters of text.
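The whitespace assessment might be sketched as follows. This is a minimal sketch under stated assumptions: the rendered page is assumed to have already been reduced to a low-resolution, thresholded bitmap held as a list of rows in which 1 denotes ink and 0 denotes whitespace, and the function name `whitespace_fraction` is hypothetical.

```python
def whitespace_fraction(bitmap):
    """Return the proportion of pixels in a 0/1 bitmap that are whitespace."""
    total = sum(len(row) for row in bitmap)
    if total == 0:
        raise ValueError("empty bitmap")
    ink = sum(sum(row) for row in bitmap)
    return 1 - ink / total


# Hypothetical 2 x 4 sample containing two ink pixels.
sample = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
fraction = whitespace_fraction(sample)
```

A page with a very high whitespace fraction offers few geometrical features to alter, which is one condition under which a user could be prompted to improve the source document.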

The second step 22 includes an identification of predetermined types of geometrical features of the rendered form of the document. For example, geometrical features may include the arrangement and formation of text and/or images. In addition to this, the second step 22 includes registering how those geometrical features are defined and characterised. Specifically, each geometrical feature may be at least partly defined by a number of parameters, and values of those parameters can be generated from analysis of the geometrical features. Thus, registered parameter values associated with an identified geometrical feature can define the geometrical characteristics of that geometrical feature. Parameter values also quantify the geometrical characteristics of geometrical features in a way that allows alteration of those geometrical features by shifting the associated parameter values.

The second step 22 of the encoding process also involves selecting a reference using a predetermined reference selection process. This may include selecting one or more of the identified geometric features as a reference, and setting a reference set of parameter values associated with that geometric feature as the reference. Ideally, the second step 22 of the encoding process follows a series of rules that determine which geometrical features, parameter values and references are to be used to encode data, and in which order they are to be used. Where a document comprises multiple pages, different geometrical features, parameter values and references may be used for each page.

The third step 23 of the encoding process 20 is obtaining the data 7 to be encoded. This may involve generating or deriving the data 7 from the content of the original document 5, from another predetermined source and/or from user-provided information. The data may be encrypted and/or include a checksum.

The fourth step 24 of the encoding process involves choosing an encoding strategy, ideally in response to the result of carrying out the second step 22. Thus, the encoding strategy is dependent on the analysed characteristics of the document of the second step - specifically, the determined characteristics of the geometrical features of the document 5 as represented by the parameter values, and as measured relative to a selected reference. Each encoding strategy is effectively a predetermined document alteration process that maps the data to be steganographically encoded to alterations to the source document 5 so as to arrive at the modified document 6. As will be described in greater detail below, this involves shifting the parameter values relative to the selected reference in dependence on the data to be steganographically encoded. This allows that data to be hidden within the formatting of the document. This makes that data machine-readable, yet the presence of that data is substantially imperceptible to humans.

Moreover, as data is encoded in the formatting of a document, it is particularly resistant to corruption as a result of unfaithful or otherwise poor quality reproduction of that document, especially if a threshold operation is applied during a copying process. It should be noted that the document alteration is carried out with respect to the reference selected during the second step 22 of the encoding process. Thus, a reference set of parameter values will not be shifted, but rather the other non-reference parameter values will be shifted relative to the reference set of parameter values.

The fifth step 25 of the encoding process is carrying out the chosen encoding strategy to alter the document. This involves applying the changes to the parameter values so that the geometrical features of the original document 5 are altered, thereby encoding the data in a modified document 6. The modified document 6 can then be re-rendered and outputted in a form which ensures that the formatting cannot be altered, for example as a non-editable PDF document, or simply printed.

The sixth step 26 of the encoding process is checking that the data has been properly encoded. This can simply be done by running a decoding process, as will be described below, and comparing the data extracted by the decoding process with the original data to be encoded (i.e. of the third step 23). To test the resistance of the encoded data to corruption, this sixth step may first include applying a degradation process to the modified document 6 - for example, reducing the colour-space by applying a thresholding algorithm, reducing the resolution of the document and/or transforming the image content of the document by rotation, scaling or x-y translation.

Decoding data overview

An overview of the data decoding process 30 is summarised in Figure 3. Effectively, the decoding process is an inverse of the encoding process 20 to enable data 7 steganographically encoded within a modified document 6 to be extracted. The first step 31 and second step 32 of the decoding process 30 are similar to the first and second steps 21, 22 of the encoding process 20, but applied to a modified document 6 (sometimes referred to as the "package").

In particular, the second step 32 of the decoding process 30 is analysis of the modified document 6 to determine how data 7 may have been steganographically encoded within the document 6. Again, this includes an identification of predetermined types of geometrical features, ideally with each geometrical feature being at least partly defined by a number of parameters, and the values of those parameters being measured relative to a reference selected by a predetermined reference selection process. The second step 32 of the decoding process 30 follows the same series of rules as the encoding process 20 so that the order of the geometrical features, parameter values and references used to encode data can be reliably determined. In any case, a set of relative differences between the non-reference parameter values and the reference set of parameter values can be determined.

The third step 33 of the decoding process 30 is extraction of the data from the modified document 6. This is achieved by mapping the determined relative differences to the steganographically encoded data via a predetermined mapping process. The mapping is dependent on, and effectively the inverse of the encoding strategy employed.

The fourth step 34 of the decoding process 30 is outputting of the extracted data. This may involve verifying the data via a checksum component of the extracted data, and/or receiving a user input to decrypt the extracted data.

Document analysis examples

There are many different ways that a document may be analysed to identify geometrical features therein, especially as there can be many different types of geometrical features within a document. Aspects of the present embodiments are particularly advantageous and applicable to geometrical features in the form of typographical features - i.e. those that are associated with text. Accordingly, the following description will focus on the processing of text documents. However, it will be appreciated that the same principles and advantages can be extended to include non-textual features of a document.

Figure 4 shows a sample from a rendered form of a source document 5, the sample including a number of different geometrical features associated with text 40 of that source document. These geometrical features are generally typographical features having definitions that are well-known in the art of typography.

For example, one group of geometrical features are baselines 41. Baselines 41 are generally defined as the lines upon which most letters of a standard body of text sit. From the sample of Figure 4, in the word "Steganography", nine of its letters "Ste-ano-ra-h-" sit on a respective baseline 41a and four of its letters "---g---g--p-y" also sit on the same baseline 41a but have descenders extending below the baseline.

Image analysis can be carried out on the rendered form of the document 5 to identify these baselines 41. Moreover, image analysis can also be used to determine the position, arrangement and extent of these baselines 41, and to populate the appropriate parameter values associated with these characteristics.

By way of a trivial example, a "bx" position of a baseline in the sample may be registered. In the present case, the "bx" position is a notional distance in pixels from the top edge 42 of the sample (which is at position bx=0). Accordingly, the first baseline 41a of the first line of text of the sample may be registered as having the parameter/value: bx=1080, the second 41b as bx=1480, the third 41c as bx=1880, the fourth 41d as bx=2280. This simple notation and example relies on the assumption that all the baselines are parallel to one another and the top edge of the sample.

The exact "bx" positions of these baselines can then be varied slightly relative to one another so that data can be encoded within the document in a way that is not easily perceptible. In the present example, this requires a reference to be set against which the parameter values associated with baseline position can be varied. For example, the relative spacing between the first and second baselines 41a, 41b can be selected as a reference. In the present example, this reference value is calculated as 1480 - 1080 = 400 pixels. It will be noted that the spacing between the second and third baselines 41b, 41c is 400 pixels, as is the spacing between the third and fourth baselines 41c, 41d.

Accordingly, data can be encoded in the document by slightly varying the non-reference "bx" parameter values of the third and fourth baselines 41c, 41d. This can be achieved using a predetermined mapping process which specifies how the "bx" values are to be altered to encode different data. For example, the mapping process may specify that the first four bits of a payload message can be mapped to the "bx" spacing between the second and third baselines 41b, 41c, and the second four bits of a payload message can be mapped to the "bx" spacing between the third and fourth baselines 41c, 41d. These "bx" spacings can be subsequently compared against the reference spacing (of 400 pixels) to obtain the payload message based on the variance from the reference spacing. For example, a "bx" relative spacing of 400 pixels (zero variance) can represent the four bit message "0000" whereas a relative spacing of 415 pixels (variance of 15 pixels) can represent the four bit message "1111" and so forth, with intermediate pixel variances representing corresponding intermediate values between "0000" and "1111". Thus, using this convention, to encode the message "00110010", the "bx" positions of the baselines would be modified by the mapping process as follows:

Baseline 41a: bx=1080;

Baseline 41b: bx=1480;

Baseline 41c: bx=1883;

Baseline 41d: bx=2285.

Accordingly, a modified document 6 can be created, with the lines of text 40 being shifted to ensure that the baselines are positioned according to the above-specified parameter values. Thus the eight-bit payload message can be steganographically encoded within the modified document 6.
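By way of illustration only, the mapping of the worked example can be sketched in Python. The function name `encode_baselines` and the constant `BITS_PER_GAP` are hypothetical and are not taken from the specification. The sketch assumes that the first two baselines define the reference, that every other spacing was originally equal to that reference, and that one pixel of variance corresponds to one unit of payload value.

```python
BITS_PER_GAP = 4  # assumed: four payload bits per non-reference spacing


def encode_baselines(bx, payload_bits):
    """Return new "bx" positions that encode payload_bits in the spacings.

    bx[0] and bx[1] are left untouched and define the reference spacing.
    The original positions of the later baselines are not used, because
    the sketch assumes they were regularly spaced to begin with.
    """
    n_gaps = len(bx) - 2
    if len(payload_bits) != n_gaps * BITS_PER_GAP:
        raise ValueError("payload does not fit the available spacings")
    reference = bx[1] - bx[0]
    out = list(bx[:2])
    for i in range(n_gaps):
        chunk = payload_bits[i * BITS_PER_GAP:(i + 1) * BITS_PER_GAP]
        variance = int(chunk, 2)  # e.g. "0011" -> 3 pixels
        out.append(out[-1] + reference + variance)
    return out


print(encode_baselines([1080, 1480, 1880, 2280], "00110010"))
# -> [1080, 1480, 1883, 2285]
```

Each altered baseline is positioned relative to the previous altered baseline, so a shift of one line carries through to all subsequent lines.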

To decode this data from the modified document 6, a similar process can be carried out: firstly to determine the reference spacing of 400 pixels (between the first and second baselines); secondly to determine the non-reference spacings of: 403 pixels (between the second and third baselines) and 402 pixels (between the third and fourth baselines); and then thirdly to map the differences in spacing "3" and "2" to two four-bit messages "0011" and "0010". These can then be concatenated to form the eight-bit message "00110010".
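The three decoding steps described above can be sketched as follows. This is a minimal illustration under the same convention (one pixel of variance per unit of value, four bits per spacing); the function name `decode_baselines` is hypothetical.

```python
def decode_baselines(bx, bits_per_gap=4):
    """Recover the payload bit string from measured "bx" positions."""
    # Step 1: the first two baselines give the reference spacing.
    reference = bx[1] - bx[0]
    chunks = []
    # Step 2: measure every non-reference spacing.
    for prev, cur in zip(bx[1:], bx[2:]):
        variance = (cur - prev) - reference
        if not 0 <= variance < 2 ** bits_per_gap:
            raise ValueError("spacing outside the encodable range")
        # Step 3: map the variance to a fixed-width bit string.
        chunks.append(format(variance, "0{}b".format(bits_per_gap)))
    return "".join(chunks)


print(decode_baselines([1080, 1480, 1883, 2285]))
# -> 00110010
```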

Using this convention, the maximum baseline shift from the standard reference spacing is 15 pixels out of 400. This is less than a 4% difference, and so is very unlikely to be perceived by a user. Advantageously, the use of a relative difference (which employs a comparison to a reference derived from the document itself) ensures that any document alteration that is carried out is relatively difficult to perceive. It will be understood that data could, in principle, be encoded in a document using absolute values. However, employing this approach does not make the method flexible enough to account for documents having significantly different geometrical features - for example, characters of different sizes, types and arrangements.

It should also be noted that the reference employed in the present example is to a geometric feature that repeats regularly (namely the regular spacing of 400 pixels between adjacent baselines). So that a reference can be reliably employed, the selection of a reference is dependent on that reference being applicable to such regularly repeating geometrical features. This is so that data can be encoded in the small variations deviating from the reference. If there were large spatial variations between consecutive baselines 1 to 3, then the spacing between baselines 1 and 2 could not be used as a reference for modulating the spacing between baselines 2 and 3 to encode data. In view of this, methods and systems according to various embodiments of the present invention may employ different references for significantly varying geometrical features, even if they are of the same type (e.g. baselines). For example, if there were a significant difference in the spacing between the baselines of the end and start of adjacent paragraphs (e.g. 700 pixels vs. the 400 pixel spacing between baselines of adjacent lines within the same paragraph), then a different reference would be selected, applicable only to adjacent baselines of different paragraphs.
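One conceivable way of keeping separate references for significantly different spacings is sketched below. The specification does not prescribe how such references are selected, so the grouping rule here (the first spacing of each kind becomes the reference for that kind) and the name `split_by_reference` are assumptions made purely for illustration.

```python
def split_by_reference(spacings, max_variance=15):
    """Pair each spacing with the reference it is measured against.

    A spacing is attributed to an existing reference when it exceeds that
    reference by no more than max_variance pixels. Otherwise it is treated
    as the first instance of a new kind of spacing and becomes a reference.
    """
    references = []  # reference spacings, in order of first appearance
    result = []      # one (reference, variance) pair per spacing
    for s in spacings:
        for ref in references:
            if 0 <= s - ref <= max_variance:
                result.append((ref, s - ref))
                break
        else:
            references.append(s)
            result.append((s, 0))
    return result


# 400-pixel line spacings and 700-pixel paragraph spacings, interleaved
print(split_by_reference([400, 403, 402, 700, 405, 705]))
# -> [(400, 0), (400, 3), (400, 2), (700, 0), (400, 5), (700, 5)]
```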

It should also be noted that whilst the above example uses a single one-dimensional parameter ("bx") to define the position of baselines, a more extensive set of parameters and values may be used to more accurately define the formation of geometrical features. For example, baselines may be specified more accurately by including the two-dimensional coordinates of the start point and end point of a baseline. With this information, the length of each baseline can be determined along with the orientation of the baseline. Accordingly, these geometrical characteristics can be modulated relative to a reference to encode data.
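As a brief illustration, the length and orientation of a baseline can be derived from its two end points as follows. The function name `baseline_geometry` and the coordinate convention (pixel coordinates as `(x, y)` tuples) are assumptions for the purpose of the sketch.

```python
import math


def baseline_geometry(start, end):
    """Return (length_in_pixels, angle_in_degrees) of a baseline.

    start and end are (x, y) pixel coordinates. The angle is measured from
    the x axis, so a baseline parallel to the top edge has an angle of 0.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


print(baseline_geometry((100, 1080), (2100, 1080)))
# -> (2000.0, 0.0)
```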

The perceptibility of changes made to geometrical features of a document will vary in dependence on factors such as the type of geometrical feature and the characteristic of it that is being modulated to encode data. However, the amount of data that can be encoded by a variation increases with the number of different (machine-distinguishable) variations. Furthermore, this consideration also is related to the resistance to corruption of data encoded as geometric feature variations. In the above example where a variation of one pixel equates to a difference of one bit of information, a down-sampling of the document would cause corruption of the data encoded in the spacing of the baselines. Thus, there is a trade-off to be made between perceptibility, data density and resistance to data corruption in view of unfaithful reproduction. For example, in the above case relating to baselines, a coarser mapping may be used - e.g. a shift of 3 pixels representing a difference of 1 bit of information.
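The effect of a coarser mapping can be sketched as follows, assuming a step of 3 pixels per unit of payload value. The names `STEP`, `variance_for` and `value_from` are hypothetical. With this step, a measurement error of one pixel in either direction still decodes to the intended value.

```python
STEP = 3  # assumed: pixels of variance per unit of payload value


def variance_for(value, step=STEP):
    """Pixel variance used to represent a payload value."""
    return value * step


def value_from(variance, step=STEP):
    """Nearest payload value for a measured pixel variance."""
    return (variance + step // 2) // step


# A four-bit value of 3 is encoded as a 9-pixel variance; measurements of
# 8, 9 or 10 pixels all decode back to 3.
print([value_from(v) for v in (8, 9, 10)])
# -> [3, 3, 3]
```

The cost of this robustness is perceptibility: a four-bit value now requires a variance of up to 45 pixels rather than 15.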

In view of this, it is useful to use many different geometrical features (and characteristics of those features) to encode data, so as to increase data density, whilst minimising human perceptibility and data corruption. The different types of text-based geometrical features are listed below by way of example.

As mentioned previously, aspects of the present embodiments are particularly advantageous and applicable to geometrical features in the form of typographical features - i.e. those that are associated with text. However, it will be appreciated that the same principles and advantages can be extended to non-typographical geometrical features, for example, predetermined shapes.

Example typographical features and parameters

a. Baselines - as set out above.

b. Mean line - generally a line above and parallel to the baseline which forms the upper boundary of most of the lowercase letters in a body of text.

Parameters which may be varied include the spacing between the mean line and a corresponding baseline.

c. Cap height - the vertical height of a capital letter relative to the baseline.

The relative height between a capital letter and a consecutive lower case letter (as defined by the cap height and mean line) can be used to encode data.

d. Descender height - the vertical height of parts of a lower case letter which extend below the baseline. For example, parts of the letters "g", "p" and "y" of the word "Steganography" have a descender that extends below the baseline to a beard line that is parallel to and below the baseline. Parameters which may be varied include the spacing between the beard line and a corresponding baseline. Alternatively, the height of consecutive letters having a descender can be used to encode a payload message.

e. Ascender height - the vertical height of parts of a lower case letter which extend above the mean line. For example, parts of the letters "t" and "h" of the word "Steganography" extend above the mean line. Parameters similar to those in respect of the descender height can be used to encode data.

f. Character size and position - the vertical height, horizontal width and/or orientation between consecutive letters of the same type may be modulated to encode data.

g. Kerning - the degree of overlap between adjacent characters.

h. Letter spacing - overall spacing of a word.

i. Sentence spacing - i.e. the space size after a sentence.

j. Size of dots or diacritics - e.g. different size period marks can encode different data.

k. Size and position of superscript and subscript - relative to normal text.

l. Paragraph adjustments - including alignment and justification. For example, if text is not fully justified (e.g. left-aligned only), then it is possible to use letter spacing to ensure that sequential lines of text can be used to encode different values.

m. Margins size - these are less helpful as they are likely to significantly vary, especially with physical documents which are often shifted in registration, especially when copying between media of different proportions (e.g. A4 to "US letter").

As can be seen, generally, parameters of these and other geometrical features which may be modulated to encode data relate to the positioning, size and/or orientation ideally relative to a reference, or another geometrical feature. Advantageously, this ensures that the data being encoded within a document is resistant to corruption, especially when the document's colour-space is significantly reduced (e.g. reduced to monotone via a thresholding operation as typically carried out on a black-and-white photocopier).

As mentioned, the encoding process follows a series of rules that determine which geometrical features, parameter values and references are to be used to encode data, and in which order they are to be used. Thus, the rules effectively lead to the predetermined selection of geometrical features, parameter values and references.

For example, the rules may determine that the first two baselines encountered in a document are to be used as a reference. Extending further, the rules may determine that baselines be used to encode a first portion of data 7, followed by letter spacing to encode a following second portion of that data, followed by sentence spacing for a third portion, and so forth. This ensures the different types of geometrical features can be utilised to maximise the data to be steganographically encoded within the document. Moreover, these rules may codify the manner in which geometrical features can be used to encode data. For example, if fonts are too small, or the spacing between sequential lines of text is too compact, then baseline modulation may be used to encode a smaller range of data (e.g. 2 bits instead of 4 bits), or be completely disregarded as a means of encoding data altogether so as to avoid data corruption or likely human perception of encoded data.
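The ordering rule described above can be sketched as a simple allocation of a payload across an ordered plan of feature types. The function name `allocate`, the feature labels and the capacities are hypothetical; a capacity of zero stands for a feature type that the rules have disregarded for the document in question.

```python
def allocate(payload_bits, plan):
    """Split a payload bit string across feature types in a fixed order.

    plan is an ordered list of (feature_name, capacity_in_bits) pairs.
    Returns a list of (feature_name, bits_for_that_feature) pairs.
    """
    portions = []
    pos = 0
    for name, capacity in plan:
        if pos >= len(payload_bits):
            break  # the whole payload has been placed
        if capacity == 0:
            continue  # feature type disregarded by the rules
        portions.append((name, payload_bits[pos:pos + capacity]))
        pos += capacity
    if pos < len(payload_bits):
        raise ValueError("payload exceeds the total capacity of the plan")
    return portions


plan = [("baselines", 8), ("letter spacing", 4), ("sentence spacing", 4)]
print(allocate("0011001011", plan))
# -> [('baselines', '00110010'), ('letter spacing', '11')]
```

A decoder applying the same plan in the same order would read the portions back from the same features.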

Examples of data to be encoded

As mentioned, the third step 23 of the encoding process 20 involves obtaining the data 7 to be encoded. Once the data to be encoded has been obtained, its quantity can be enumerated for the purpose of determining whether that data can be encoded within the source document using a primary encoding strategy.

As alluded to earlier, the data to be encoded can be generated or derived from a number of different sources:

Firstly, the data to be encoded may be based on the content of the original document. This serves as an indicator of the integrity of the content of the document. For example, assuming there is text within the document, a series of metrics can be generated which relate to the content. Thus, if the content of the document were to be slightly changed (but retaining the formatting encoding the data), then this can be flagged by the metrics which are part of the data encoded within the document. These metrics may include at least one of: the number of pages in the document; the number of words, sentences, paragraphs and/or letters per document and/or page; and the instances of certain characters. For example, the letter "e" is one of the most frequently occurring letters in the English language. Accordingly, the number of letters "e" in an English language document can be enumerated as a metric which represents the content of the document. A different character (or series of characters) may be used for different languages and/or topics. Effectively, these metrics act as a checksum for the content, or content portions of the document. These metrics can be beneficial, especially when they are spread across the entire document and/or include redundant data relating to the content of the document. For example, each page of a document may include metrics representing the content of the whole document (thereby enabling the checking of the integrity of a document as a whole, based on the data within any one page). Similarly, metrics embedded within one page can include a reference to the content on other pages, allowing a cross-check of the integrity of various pages of the document to be performed.
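A minimal sketch of how such content metrics might be computed for the text of one page is given below. The function name `content_metrics`, the tokenisation rules and the dictionary keys are assumptions made for illustration; the specification does not define them.

```python
import re


def content_metrics(page_text, tracked_char="e"):
    """Compute simple content metrics for the text of one page."""
    words = re.findall(r"[A-Za-z0-9']+", page_text)
    sentences = [s for s in re.split(r"[.!?]+", page_text) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "letters": sum(ch.isalpha() for ch in page_text),
        "tracked": page_text.lower().count(tracked_char),
    }


original = content_metrics("The sea. We see the eel!")
tampered = content_metrics("The sea. We saw the eel!")
print(original["tracked"], tampered["tracked"])
# -> 8 6
```

In this example a one-word change leaves the word count unchanged but alters the count of the tracked character, so a comparison against the embedded metrics would flag the change.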

Secondly, the data to be encoded may be based on user-provided information. For example, the data may include alphanumeric data entered by the user. Such data may include an identifier, a PIN, a password or pass-phrase. Similarly, the data to be encoded may be based on information derived from another user input - for example, from biometric data generated from scanning a user's fingerprint or iris.

Thirdly, the data to be encoded may include information not dependent on a user or the document. For example, this may include a time stamp, a random number, a unique identifier and/or a version number of a program used to encode the data.

Regardless of the source used, the data can then be further treated prior to steganographically encoding it (or part of that data) within the document.

For example, portions of the data may be encrypted using one of a number of different techniques known in the art with the resulting cipher-text being passed to the encoding process that steganographically encodes it within the document. One portion of the data (e.g. the user-dependent data) may be used to control the encryption of the other (e.g. the data based on the content of the document). For example, a user-provided password can be used as an encryption key to a cryptographic process. In a similar fashion, the metrics based on the content of the document may be used as a cryptographic salt. As mentioned, many different types of cryptographic process may be employed, such as that disclosed by the Applicant in document PCT/IB2011/052799, the content of which is hereby incorporated by reference to the extent permitted by applicable law.
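The combination of a user-provided password with content metrics used as a salt can be illustrated with a standard key derivation function. The specification does not name any particular algorithm: PBKDF2 with SHA-256 is substituted here purely as an assumption, the function name `derive_key` is hypothetical, and the cipher that would consume the resulting key is deliberately omitted.

```python
import hashlib


def derive_key(password, metrics, iterations=100_000):
    """Derive a 16-byte key from a password, salted with content metrics.

    metrics is a dict of metric name to integer value. It is serialised in
    sorted key order so that the same content always gives the same salt.
    """
    salt = ",".join(
        "{}={}".format(name, metrics[name]) for name in sorted(metrics)
    ).encode("utf-8")
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=16
    )


key_a = derive_key("correct horse", {"words": 6, "tracked": 8})
key_b = derive_key("correct horse", {"words": 6, "tracked": 6})
print(key_a != key_b)
# -> True
```

Because the salt depends on the content, the same password yields a different key once the content metrics change.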

An additional treatment of the data to be encoded could be to append a verifier (such as a checksum) which can be used to verify subsequent successful extraction of steganographically encoded data, for example via a decoding process such as the decoding method of the present embodiment.
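By way of example, a verifier in the form of a CRC-32 checksum could be appended and checked as sketched below. The choice of CRC-32 and the names `append_verifier` and `check_verifier` are assumptions; the specification only requires that some verifier be appended.

```python
import zlib


def append_verifier(data: bytes) -> bytes:
    """Return data followed by its 4-byte big-endian CRC-32."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def check_verifier(blob: bytes) -> bool:
    """Return True if the trailing 4 bytes match the CRC-32 of the rest."""
    if len(blob) < 4:
        return False
    data, tag = blob[:-4], blob[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == tag


blob = append_verifier(b"\x32")               # the payload 00110010
print(check_verifier(blob))                   # -> True
print(check_verifier(b"\x33" + blob[1:]))     # -> False
```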

Encoding strategies

The fourth step 24 of the encoding process involves choosing an encoding strategy. This may be dependent on second step 22 of document analysis, and/or the third step 23 of obtaining the data to be encoded. As mentioned, once the data to be encoded has been obtained, its quantity can be enumerated for the purpose of determining whether that data can be encoded within the source document using a primary encoding strategy. If the quantity of data and/or encoding strategy means that the data cannot be encoded, a sequence of auxiliary encoding strategies can be employed instead to encode the data. The user may be provided feedback about which encoding strategy is being used, and whether or not data can be encoded using that strategy. A user can be provided with a choice of encoding strategies, and provided with a means to choose one. This choice can be made automatically for the user in response to a password or other user-provided input. Advantageously, this means that both the data and the encoding strategy used to encode that data can be kept a secret.
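The fallback from a primary strategy to a sequence of auxiliary strategies can be sketched as follows. The function name `choose_strategy`, the strategy labels and the capacities are hypothetical, and capacity in bits is assumed to be the only criterion.

```python
def choose_strategy(payload_bits, strategies):
    """Return the name of the first strategy able to hold the payload.

    strategies is an ordered list of (name, capacity_in_bits) pairs, with
    the primary strategy first and the auxiliary strategies after it.
    Returns None when no strategy has sufficient capacity.
    """
    for name, capacity in strategies:
        if payload_bits <= capacity:
            return name
    return None


strategies = [("primary", 8), ("auxiliary-1", 24), ("auxiliary-2", 64)]
print(choose_strategy(8, strategies))    # -> primary
print(choose_strategy(20, strategies))   # -> auxiliary-1
print(choose_strategy(100, strategies))  # -> None
```

The returned value (or its absence) is the kind of information that could be fed back to the user.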

Once an appropriate encoding strategy is chosen, it can be carried out according to the fifth step 25 of the encoding process, and then checked according to the sixth step 26 of the encoding process.

As mentioned, an encoding strategy generally involves the alteration of a source document 5 in a way that is dependent on the second step 22 of document analysis, and/or the third step 23 of obtaining the data to be encoded. In general, this involves changing the parameter values so that the geometrical features of the original document 5 are altered. However, the encoding strategy may also alter a document in another way that assists subsequent decoding and extraction of the data 7 from a modified document 6. For example, the encoding strategy may alter a document to provide it with markings that indicate which encoding strategy has been used. Ideally, these markings are not associated with any of the geometrical features of the document, and so are independent of the content of a document 5. These markings thus define a clue which can facilitate a subsequent decoding process by specifying the encoding strategy used. Specifically, the second step 32 of the decoding process 30 can be driven following an identification of such markings. These markings can be provided at a predetermined location within the document to facilitate detection of those markings, thereby speeding up the decoding process. Effectively, this means that the decoding process 30 can bypass the step of guessing which encoding process has been used.

Similarly, portions of the data to be encoded could also be rendered within a modified document without being associated with or affecting the geometrical features. For example, a timestamp can be rendered in plaintext at a predetermined location of the document. Again, this can aid a subsequent decoding process and/or act as a verifier of data integrity and/or successful data extraction.

The above-described embodiment is flexible enough to be applied to a variety of different scenarios as will be described below.

Further examples and scenarios

A technique may be provided whereby characteristics of the document can be determined and used to change the format of the document in a manner which is not readily perceptible to the human eye but which is detectable by an imaging and decoding technique. The measured degree of format change acts as a unique identifier of the document and can be used to identify the original recipient of that copy.

Format change can take many forms. For example, in text, word spacing, font adjustments, line spacing, border size, indentation etc. can all be used either individually or in combination to slightly modify the document in a manner which is representative of the data to be encoded (e.g. a unique identifier). The modification is so slight that it is not readily perceivable to the human eye. However, such format changes are faithfully reproduced in any low-quality reproduction of that document. For example, no matter how many times the document is photocopied the format information is always reproduced faithfully. As mentioned, there are many different ways in which a conversion can occur to represent payload data.

One set of scenarios involves encoding a unique identifier. The conversion is carried out by a conversion algorithm which analyses the document. The conversion algorithm derives several non-format parameters (typically parameters based on content). Once the non-format parameters have been established by this process, they can be used as inputs into another algorithm (a formatting algorithm) for adjusting the formatting parameters of the document. The document is then recreated either by printing or in electronic format (such as a PDF format) using the new formatting parameters as determined by the formatting algorithm.

For example, in one scenario, the conversion algorithm can sum the words per page and count the letters in the page to arrive at two content parameters for the document. Also the letters can be summed according to a predetermined value given to each letter and an algorithmic rendition of any of these sums can be used to create a new summed parameter. These parameters can individually be provided to the formatting algorithm or alternatively the new summed parameter can be provided. The formatting algorithm then spaces the words and letters within a given page so as not to change the layout (format) as perceived by the human eye, but with sufficient change to create an encoding to re-render into a value once the page formatting is analysed from an image of the newly-created document.
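The letter-value sum mentioned in this scenario can be sketched as follows. The specification leaves the predetermined value of each letter open, so the assignment used here (a=1 to z=26) is an assumption, as is the function name `content_parameters`.

```python
def content_parameters(page_text):
    """Return (word_count, letter_count, weighted_letter_sum) for a page.

    Each letter is given an assumed predetermined value: a=1, b=2, ... z=26.
    Characters outside a-z are ignored for the letter count and the sum.
    """
    words = page_text.split()
    letters = [ch for ch in page_text.lower() if "a" <= ch <= "z"]
    letter_sum = sum(ord(ch) - ord("a") + 1 for ch in letters)
    return len(words), len(letters), letter_sum


print(content_parameters("Ab c"))
# -> (2, 3, 6)
```

Any of the three values, or a combination of them, could then be passed to the formatting algorithm.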

The document content can be summed automatically by normal search engine parameters to create a digitised result, which is then submitted to a randomised time-based algorithm (with the time base element further concealed according to the base units time setting, as is described in our co-pending international patent application no PCT/IB2011/052799). This not only allows for secure encryption but also post facto detection of the authoring machine. These results would be expressed in the spacing of the document.

A variant of this method is to scan a fingerprint on a portable device e.g. a mobile phone, render a digital value that is also encrypted not only through normal methods but via time-based encryption and sent to a central database (akin to that described in our co-pending international patent application no PCT/IB2011/052799). In this case the rate of decay in the time base renders even the mobile phone incapable of decrypting the fingerprint moments after it has sent it. This digitised value can be used simultaneously to vary an electronic document that had been received on the mobile device so that each page of that document can be "signed" individually by the user in such a way that it is personal only to the user. The method can be used to sign documents which require proof that the user has agreed to every page and that no page has been "slipped" in intentionally or accidentally after he/she has agreed to the whole document. This is the electronic equivalent of "initialling" every page, but can now be done totally electronically using portable devices.

As a further security measure (in a further scenario), the unique digital reference of the authoring device, e.g. laptop or mobile phone, is added into the digital signature now expressed in the document in a concealed way. As a further security measure (in a further scenario), a time reference is printed in the open on the page and a concealed time reference is printed using the above method, with the central database alone able to decipher, from the unique identifier of the authoring machine, the appropriate time decay differential between the two (as per our co-pending international patent application number PCT/IB2011/052799). In this way, the document is further authorised as coming from a valid authoring machine.

This method can be used with all forms of value documents (including banknotes) to authorise that they have been printed by an authorised machine. For example, a time reference can be printed in the open on the document and used as an algorithm reference in a serial number as well as a set of micro text which would be printed at the same time as the serial number and spaced etc. as above with a document linking all three items but with the appropriate time decay known only to the central database. This would allow a check to be undertaken at ATMs or other forms of electronic value document sorters for physical documents etc., not only of whether the document has been printed by an authorised machine but also of whether there has been an "overprint", i.e. an unauthorised print run using valid machines.

In addition, in any document, ordinary or otherwise, the exact position of the side margin and the extent of the header and footer margins represent additional features which can be used in the determination of the unique identifier of the document. Besides the spacing of letters within words and words themselves, minute changes to punctuation mark spacings as well as line heights can be used. The conversion algorithm re-renders the same document into a digital print version, such as in an Adobe PDF format, without any readily perceivable visual changes to the format created for the standard (original) document. However, these subtle format changes are present, but not perceivable by the human eye.

An unauthorised copier will not know whether their document has been sanitised and labelled for their use only or is a standard document which hasn't been treated by this aspect of the present invention. The document when photocopied will faithfully reproduce the document format, such as the concealed spacing differences, so that when the illegally obtained document is scanned and the image processed by the formatting algorithm (operating in reverse) a set of parameters or a summed parameter can be created representing a unique identifier of the document. These reconstituted parameters (which can be termed as a unique identifier of the document or its intended recipient) can be compared to the original parameters and if they are equal, this indicates that the scanned document is from the original which is linked to an intended recipient. Accordingly, this process enables linking of any processed document back to its original intended recipient.

Other format changes include character size alteration in such a way as to be imperceptible to the human eye but nonetheless containing information by reference to the variation to the standard size in the document. In all these incarnations, the document would be produced in the digital word processing form so that the character style and character size could be assessed in order that a reference point be established prior to it being rendered in a print style as in Adobe® software.

Conceptually, the conversion algorithm, the formatting algorithm and the inverse of the formatting algorithm can be incorporated into modern photocopiers such that certain documents with standardised algorithmic values will be refused for reproduction or scanning if they do not match a required intended recipient or document identity. This would represent an extra element to the above scenario where part of the document setting would be used to produce the standardised algorithm to trigger the software enabled scanner/photocopier and would no doubt be used for banknotes or government documents etc.

In a further scenario, a digital signature of the document can also be created and stored in addition to, or separately from, the converted document creation. This would be carried out by simply scanning the entire multiple page document, creating a signature code via a central database with date and other descriptive information about the document being used and storing this unique signature code at the central database. If it was an electronic document, the unique signature code would be created directly from the electronic version of the document. The signature code could be integrated into the document thus making the whole document "safe", not just its signature page. This embodiment can be used with mobile devices to type and sign electronic documents using the mobile device.

In a further scenario, an app is provided which scans the user's fingerprint using the camera function of the mobile phone, renders this into a digital encoding and then, using the document altering embodiment described above, imperceptibly alters every page of a document sent by word processing attachment or e-mail to the mobile phone, which is then sent back as a printed or PDF (or some other form of unalterable visual representation) file. This would be used for very sensitive documents where the customer has to signify that they have agreed to each and every page.

Also in another scenario, when conducting page by page verification of a document to a mobile device, the app can also have a vendor-specific code and a customer-specific code that is combined with the fingerprint code plus also using relativity co-ordinates (as described in our co-pending international patent application no PCT/IB2011/052799) to link them so that the vendor knows the verification code he gets from the app is personal to him and his customer.

Claims

1. A method of altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans, the method comprising:

2. The method of claim 1, further comprising:

identifying at least one group of geometrical features of the rendered form of the document; and

registering parameter values that are associated with and define geometrical characteristics of the geometrical features; wherein

carrying out the document alteration instructions involves altering the parameter values associated with and defining the geometrical characteristics of the geometrical features.

3. The method of claim 1 or claim 2, further comprising selecting a reference set of parameter values using a predetermined reference selection process.

4. The method of any one of claims 1 to 3, further comprising:

mapping the data to be steganographically encoded within the document to a set of relative differences by using a predetermined mapping process; and

encoding the data within the document by modifying the identified geometrical features of the document using a predetermined document alteration process so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values by an amount dependent on the set of relative differences.

5. The method of any preceding claim, wherein the geometrical features are typographical features associated with text.

6. The method of claim 5, wherein the geometrical features comprise at least one of: baselines, mean lines, cap height, descender height, ascender height, character size, character position, character kerning, letter spacing, sentence spacing, size of dots and/or diacritics, size and position of superscript and/or subscript text, paragraph position and margins size.

7. The method of any preceding claim, wherein alterations to the geometrical features include at least one of: translation, scaling and rotation of those geometrical features.

8. The method of claim 7, wherein said alterations are carried out with respect to a predetermined reference.

9. The method of claim 7 or claim 8, wherein said alterations are made to otherwise regularly repeated geometrical features.

10. The method of any preceding claim, further comprising receiving at least one user input; and wherein the data to be encoded is derived, at least in part, from the at least one user input.

11. The method of claim 10, wherein receiving at least one user input comprises receiving at least one of: biometric data and an alphanumeric code.

12. The method of any preceding claim, further comprising determining at least one metric representative of at least a portion of the content of the document; and wherein the data to be encoded is derived, at least in part, from the at least one metric.

13. The method of any preceding claim, further comprising treating at least part of the data to be encoded prior to steganographically encoding the data within the document.

14. The method of claim 13, wherein treating at least part of the data to be encoded comprises encrypting it using an encryption process.

15. The method of claim 14, wherein the encryption process comprises receiving as an input at least one of: a user input and a metric representing the content of the document.

16. The method of any one of claims 13 to 15, wherein treating at least part of the data to be encoded comprises appending a verifier to the data.

17. The method of any preceding claim, further comprising choosing an encoding strategy so as to determine how to map the data to be steganographically encoded within the document to a set of document alteration instructions.

18. The method of any preceding claim, further comprising checking the altered document to determine that data has been successfully encoded therein.

19. The method of claim 18, wherein said checking step comprises comparing the data extracted by a decoding method with the original data to be encoded.

20. The method of claim 18 or 19, wherein the altered document is first treated with a degradation process prior to the checking step so as to test the resistance of the encoded data to corruption.

21. A method of processing a document to decode data steganographically encoded within it, the data being encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans, the method comprising: identifying at least one group of geometrical features of the rendered form of the document;

22. A computer program arranged to carry out the method of any preceding claim.

23. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising at least one executable portion which, when executed, carries out a process according to any one of claims 1 to 21.

24. A system for altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans, the system comprising:

a reader arranged to read in the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified and altered;

a processor arranged to:

select a reference set of parameter values;

map the data to be steganographically encoded within the document to a set of relative differences; and

a writer to write out the modified document with the data steganographically encoded therein.

25. A system for processing a document to decode data steganographically encoded within it, the data being encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans, the system comprising:

a reader arranged to read the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified;

a processor arranged to:

select a reference set of parameter values;

calculate a set of relative differences between the parameter values associated with the geometrical features and the reference set of parameter values; and

an output module to output the decoded data.

26. The system of claim 25 in the form of a document replication apparatus, wherein the reader is a scanner, and the output module is a printer.

27. The system of claim 26 arranged to control replication of a document in dependence on scanning and decoding steganographically embedded data therein.

28. The system of any one of claims 25 to 27, arranged to issue an authorisation signal in dependence on decoding steganographically embedded data within a document.

29. A document comprising a medium supporting geometrical features thereon, the geometrical features arranged to encode data steganographically in a way that is machine-readable yet is substantially imperceptible to humans, wherein the document comprises:

at least one group of geometrical features having associated parameter values that define geometrical characteristics of the geometrical features; and

features supporting a reference set of parameter values;

wherein a set of relative differences between the parameter values of the at least one group of geometrical features and the reference set of parameter values steganographically encodes data within the document.

30. The document of claim 29, wherein the geometrical features are typographical features.

31. The document of claim 29 or claim 30, wherein the medium supports geometrical features in a way that is machine readable via visible light scanning of the document.

32. The document of any one of claims 29 to 31, in the form of at least one of: a bank note, a cheque, an agreement document, a deed document, an assignment document, a power of attorney document, a book, an identity document, a legal document, a business document and a government document.
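
By way of non-limiting illustration only, the relationship between the reference set of parameter values, the set of relative differences and the encoded data recited in the claims above might be sketched as follows. The sketch is not the claimed implementation. It assumes that each geometrical feature carries one bit, that a bit is represented by the sign of a small offset from the reference value, and that the verifier of claim 16 is a CRC-32 checksum. The names `STEP`, `encode`, `decode`, `degrade`, `append_verifier` and `split_verifier` are hypothetical and do not appear in the specification.

```python
import random
import zlib

# Hypothetical offset magnitude (e.g. a shift in points); an assumption.
STEP = 0.25


def append_verifier(data: bytes) -> bytes:
    """Append a CRC-32 checksum as one possible verifier (assumption)."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def split_verifier(payload: bytes) -> tuple[bytes, bool]:
    """Separate data from its checksum and report whether they agree."""
    data, crc = payload[:-4], payload[-4:]
    return data, zlib.crc32(data).to_bytes(4, "big") == crc


def to_bits(payload: bytes) -> list[int]:
    """Expand bytes into bits, most significant bit first."""
    return [(b >> s) & 1 for b in payload for s in range(7, -1, -1)]


def from_bits(bits: list[int]) -> bytes:
    """Pack bits (most significant first) back into bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def encode(reference: list[float], data: bytes) -> list[float]:
    """Map data to relative differences and apply them to the reference values."""
    bits = to_bits(append_verifier(data))
    if len(bits) > len(reference):
        raise ValueError("not enough geometrical features for the payload")
    altered = list(reference)
    for i, bit in enumerate(bits):
        # Bit 1 -> positive offset, bit 0 -> negative offset (assumed mapping).
        altered[i] = reference[i] + (STEP / 2 if bit else -STEP / 2)
    return altered


def decode(reference: list[float], measured: list[float],
           n_bytes: int) -> tuple[bytes, bool]:
    """Recover data from the sign of each relative difference."""
    n_bits = (n_bytes + 4) * 8  # data plus the 4-byte verifier
    bits = [1 if measured[i] - reference[i] > 0 else 0 for i in range(n_bits)]
    return split_verifier(from_bits(bits))


def degrade(values: list[float], seed: int = 0) -> list[float]:
    """Simulate print/scan noise smaller than the offset (cf. claim 20)."""
    rng = random.Random(seed)
    return [v + rng.uniform(-STEP / 4, STEP / 4) for v in values]


reference = [12.0] * 64            # e.g. nominal parameter values of 64 features
altered = encode(reference, b"ID7")
decoded, ok = decode(reference, degrade(altered), n_bytes=3)
```

In this sketch the checking step of claims 18 to 20 corresponds to comparing `decoded` with the original data after `degrade` has been applied, and the verifier lets a decoder reject a payload whose relative differences have been corrupted.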